Tetherless World Constellation: Linked Open Government Data (http://logd.tw.rpi.edu). Jim Hendler, Tetherless World Professor of Computer and Cognitive Science, Assistant Dean of Information Technology and Web Science, Rensselaer Polytechnic Institute. http://www.cs.rpi.edu/~hendler, @jahendler (Twitter)

RPI Research in Linked Open Government Systems

DESCRIPTION

Open Government systems are changing the way that many governments around the world interact with their citizenry. Transparency and innovation are both enhanced by government openness and especially by government data sharing. Further, the linked data approach, using maturing semantic web technologies, has been shown to be very valuable in creating "mashups" in which government datasets can be combined in new and innovative ways, and turned into "live" infographics. In this talk, presented to the computer science department at PUC-RIO, we describe the research aspects of RPI's ongoing work in this area.

Page 1: RPI Research in Linked Open Government Systems

Tetherless World Constellation

Linked Open Government Data

http://logd.tw.rpi.edu

Jim Hendler, Tetherless World Professor of Computer and Cognitive Science

Assistant Dean of Information Technology and Web Science

Rensselaer Polytechnic Institute

http://www.cs.rpi.edu/~hendler

@jahendler (twitter)

Page 2: RPI Research in Linked Open Government Systems

Demo of our site

http://logd.tw.rpi.edu

Page 3: RPI Research in Linked Open Government Systems

Government Data on the Web

Page 4: RPI Research in Linked Open Government Systems

Data.gov community: International

Page 5: RPI Research in Linked Open Government Systems

Government Data Sharing

Timeline (2009 to 2010):

• January 21, 2009: “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” (President Obama)
• May 21, 2009: data.gov online
• June 30, 2009: “Putting Govt Data online”; Data.gov.uk beta
• December 8, 2009: “Open Government Directive” released
• January 19, 2010: data.gov.uk online
• May 21, 2010: data.gov relaunch with semantic web featured

Data set counts shown along the timeline: 57; ~6,000; ~2,000; >305,000

Page 6: RPI Research in Linked Open Government Systems

David McCandless

New ways to see data sets

Page 7: RPI Research in Linked Open Government Systems

Important to citizens, e.g. education

Page 8: RPI Research in Linked Open Government Systems

What’s promising

• Linked open government data (data.gov, data.gov.uk)
– Of many kinds

• Markup languages and semantics and tools to enable transparency

• Lower barriers to internet visualization, e.g. Google vis, MIT simile, many more…

• Web 2.0 to put people in the loop, so they can use and contribute annotations

Page 9: RPI Research in Linked Open Government Systems

Moving data.gov to linked data (UK)

• Built around “linked data” from the start

• Authorization for this from the Prime Minister

Page 10: RPI Research in Linked Open Government Systems

Moving data.gov to linked data (US)

• Third parties (like RPI) translate the government datasets into linked data formats

• US Data.gov hosts 6.4B RDF triples (5/21/2010)
• Acknowledges Semantic Web as a key technology for open government data

Page 11: RPI Research in Linked Open Government Systems

Linked Open Data goes beyond govt

http://linkeddata.org/

Government data currently makes up over half of the cloud by size (~17B triples), with tens of thousands of links to other data (both within and outside government data)

Page 12: RPI Research in Linked Open Government Systems

Create Mashups

More than 50 of these at http://logd.tw.rpi.edu

Page 13: RPI Research in Linked Open Government Systems

Data.gov + epa.gov

Page 14: RPI Research in Linked Open Government Systems

Page 15: RPI Research in Linked Open Government Systems

Adding some Web magic

Web Analytics

Social Data Networks

External Links

Page 16: RPI Research in Linked Open Government Systems

Linking GDP of the US and China

GDP of China (billion Chinese yuan)

GDP of the US (billion US dollars)

[Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn

Page 17: RPI Research in Linked Open Government Systems

Linking GDP of the US and China

GDP of China (billion Chinese yuan)

GDP of the US (billion US dollars)

[Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn

This mashup was built in less than 4 hours – including conversion of data, web interface, and visualization!
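
A temporal mashup of this kind amounts to aligning several year-indexed series on a shared time axis. The following is a minimal Python sketch, not the code behind the demo. The function name `join_by_year`, the assumption that an exchange-rate series is used to express both GDP figures in one currency, and every number in the example are placeholders for illustration, not real statistics.

```python
def join_by_year(us_gdp_usd, cn_gdp_cny, cny_per_usd):
    """Align three year-indexed series and express China's GDP in US dollars.

    Each argument is a dict mapping a year to a number. Only years present
    in all three series are kept, so gaps in any source drop that year.
    """
    shared_years = sorted(set(us_gdp_usd) & set(cn_gdp_cny) & set(cny_per_usd))
    return [
        {
            "year": year,
            "us_gdp_usd": us_gdp_usd[year],
            "cn_gdp_usd": cn_gdp_cny[year] / cny_per_usd[year],
        }
        for year in shared_years
    ]


# Placeholder values only (NOT real GDP or exchange-rate figures).
us_series = {2000: 100.0, 2001: 110.0}
cn_series = {2001: 80.0, 2002: 90.0}
rate_series = {2001: 8.0, 2002: 8.0}

rows = join_by_year(us_series, cn_series, rate_series)
for row in rows:
    print(row)
```

Joining on the intersection of years is one possible design choice; a mashup that must show every year would instead keep the union and mark missing values.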

Page 18: RPI Research in Linked Open Government Systems

Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)

Mashups allow comparisons that single data sets cannot

Extensible Mashups via Linked Data
– Diverse datasets from NIH
– Potentially linking to “unemployment rate”

Accountable Mashups via Provenance
– Annotate datasets used in demos
– Feed users’ comments back to the gov contact (e.g. %)

Page 19: RPI Research in Linked Open Government Systems

Integrate with Social media

Page 20: RPI Research in Linked Open Government Systems

Our process

[Diagram. Stages: Convert, Access, Enhance, Version, SemDiff. Edge labels: create, derive, revision.]

Page 21: RPI Research in Linked Open Government Systems

Conversion of data sets

Page 22: RPI Research in Linked Open Government Systems

csv2rdf4lod (from logd.tw.rpi.edu)

Install csv2rdf4lod
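
To make the conversion step concrete, here is a minimal Python sketch of a "raw" CSV-to-RDF conversion that turns each cell into one N-Triples statement. It is not the implementation or interface of the converter named on this slide. The `http://example.org/dataset/` namespace, the function names, and the sample row (including the facility ID `F1`) are hypothetical.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/dataset/"  # hypothetical namespace for this sketch


def escape_literal(text):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text, base=BASE):
    """Emit one triple per non-empty cell: row URI, column URI, plain literal."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base}row/{row_number}>"
        for column, value in row.items():
            # Skip surplus cells (column is None) and empty or missing values.
            if column is None or not value:
                continue
            local_name = quote(column.strip().lower().replace(" ", "_"))
            predicate = f"<{base}vocab/{local_name}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


sample = "Facility ID,Latitude,Longitude\nF1,40.416944,-75.935\n"
triples = csv_to_ntriples(sample)
for line in triples:
    print(line)
```

Every value is emitted as an untyped string literal here; assigning datatypes, linking values to existing URIs, and recording provenance would be the job of a later "enhanced" conversion.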

Page 23: RPI Research in Linked Open Government Systems

Metadata is critical

What kinds of metadata are simple to create, powerful enough for search, and internationalizable (esp. beyond English)?

Page 24: RPI Research in Linked Open Government Systems

Work in Progress

• Automated linking
– Can we discover link points in the data given the standard URI and metadata collections we have?
– Approach
  • High quality experimentation on small dataset (gold standard)
    – MS by Johanna Flores, Web Science poster, 2011
  • Take best heuristics to large-scale data
    – Ongoing
  • Evaluate
    – Ongoing
  • Repeat as needed…

Page 25: RPI Research in Linked Open Government Systems

Datasets are incomplete

Page 26: RPI Research in Linked Open Government Systems

RDF encodings from our metadata collection

Page 27: RPI Research in Linked Open Government Systems

Process

Page 28: RPI Research in Linked Open Government Systems

• Tried three heuristic approaches
– Bag of words
– LED on strings
– String match

• Various weighted combinations
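
The three heuristics and a weighted combination can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes "LED" denotes Levenshtein edit distance, uses Jaccard overlap as the bag-of-words measure, and the weights `(0.3, 0.3, 0.4)` and all function names are hypothetical.

```python
def bag_of_words_score(a, b):
    """Jaccard overlap of the lowercase word sets of two labels."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def edit_distance_score(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def string_match_score(a, b):
    """1.0 for a case-insensitive exact match after trimming, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def combined_score(a, b, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of the three heuristic scores (weights are assumptions)."""
    scores = (
        bag_of_words_score(a, b),
        edit_distance_score(a, b),
        string_match_score(a, b),
    )
    return sum(weight * score for weight, score in zip(weights, scores))


print(combined_score("State", "state"))
print(combined_score("state name", "name of state"))
```

Suitable weights would have to be tuned against a hand-labeled gold standard, which is the role the small-dataset experiments play in the approach outlined earlier.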

Page 29: RPI Research in Linked Open Government Systems

Simple Example

Facility ID | … | Latitude  | Longitude  | ST:val
…           | … | 40.416944 | -75.935    | 42
…           | … | 42.955383 | -85.480074 | 26
…           | … | 43.1698   | -88.01829  | 55
…           | … | 38.87025  | -77.00905  | 14
…           | … | …         | …          | …

EPA Toxic Release Data

The ST:val column looks like it could hold state identifiers.

Look for possible state identifiers:
- Names: “Pennsylvania”, “Michigan”, “Wisconsin”
- Abbr: “PA”, “MI”, “WI”
- FIPS: “42”, “26”, “55”

75% match state identifiers.

If this meets our threshold, then recommend interpreting the column as a state and integrating it with linked data on the web.

Federal Information Processing Standards (FIPS) code 14 is “Guam”, which is not a US state.
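
The check described on this slide can be sketched in a few lines of Python. This is a simplified illustration rather than the system's actual heuristic: it only tests FIPS codes (not names or abbreviations), the function names are hypothetical, and the threshold of 0.7 is an assumed value, since the slide does not state one.

```python
# FIPS state codes run from 01 to 56; the codes 03, 07, 14, 43 and 52 are not
# assigned to any state. Code 11 (District of Columbia) is kept in the set,
# which is a choice made for this sketch.
STATE_FIPS = {f"{n:02d}" for n in range(1, 57)} - {"03", "07", "14", "43", "52"}


def state_match_fraction(values):
    """Fraction of non-empty cells that are valid FIPS state codes."""
    cells = [value.strip().zfill(2) for value in values if value.strip()]
    if not cells:
        return 0.0
    hits = sum(1 for cell in cells if cell in STATE_FIPS)
    return hits / len(cells)


def recommend_state_column(values, threshold=0.7):
    """Recommend treating the column as a state if enough cells match."""
    return state_match_fraction(values) >= threshold


# The four ST:val cells shown in the table on this slide.
st_val_column = ["42", "26", "55", "14"]
fraction = state_match_fraction(st_val_column)
print(fraction)
print(recommend_state_column(st_val_column))
```

With the four cells from the table, three of four values match, which reproduces the 75% figure; whether that is enough depends entirely on the chosen threshold.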

Page 30: RPI Research in Linked Open Government Systems

Results

• Analyzed 1,396 “raw” Data.gov datasets
– About 1.66B triples of converted CSV to RDF
– Did not include metadata, provenance, linking or other products of the “enhanced” conversion

• Simple heuristics were able to identify 3,432 meaningful database labels, yielding 1.2M US state identifiers and 3.8M geo-coordinates
– Parallelized enhancement system is able to process 65k triples/second/process
– Analyzed 1,396 “raw” Data.gov datasets in 3.1 min
  • on 256 processors of the CCNI Opteron cluster
  • Currently porting to an IBM BlueGene

• Analysis (ongoing) found no errors in links produced (but many errors of omission)
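
As a rough consistency check on these figures, the throughput numbers can be related with simple arithmetic. The sketch below assumes that 65k triples/second/process is a peak rate, that one process ran per processor, and that the 3.1-minute run covered all 1.66B triples; none of these assumptions is stated on the slide.

```python
triples = 1.66e9           # about 1.66B triples (from the slide)
rate_per_process = 65_000  # triples per second per process (from the slide)
processes = 256            # assumes one process per processor

# Lower bound on wall-clock time if every process ran at the stated rate.
peak_seconds = triples / (rate_per_process * processes)

# Average per-process rate implied by the reported 3.1-minute run.
reported_seconds = 3.1 * 60
implied_rate = triples / reported_seconds / processes

print(f"time at stated rate: {peak_seconds:.0f} s")
print(f"implied average rate: {implied_rate:.0f} triples/s/process")
```

Under these assumptions the run would take roughly 100 seconds at the stated rate, so the reported 3.1 minutes corresponds to an average of about 35k triples/second/process, which is plausible once start-up, I/O, and load imbalance are taken into account.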

Page 31: RPI Research in Linked Open Government Systems

Next Steps

• Explore
– Use of mapping heuristics
  • Talking w/PUC-RIO about
    – “real” LED and machine-learning approaches
– Metadata analysis
  • Clustering & ML (336,000 labeled examples)
  • Metadata linking (esp. re: languages)
    – Govt terms provide a good start
– Try other mapping tools (cf. SERIMI)

Page 32: RPI Research in Linked Open Government Systems

Challenge

• Ontology and vocabulary issues
– How do we compare across heterogeneous and unreconciled data?
• Good news and bad news

Page 33: RPI Research in Linked Open Government Systems

Good news – easy to do comparisons

Page 34: RPI Research in Linked Open Government Systems

Good news - Even if not “rationalized” together

Page 35: RPI Research in Linked Open Government Systems

Bad news – real comparisons are hard across govts

Page 36: RPI Research in Linked Open Government Systems

Presents a challenge

Same or different?

Page 37: RPI Research in Linked Open Government Systems

Different “ontologies” ?

Definitely not the expected result!!

Page 38: RPI Research in Linked Open Government Systems

And many other interesting issues

• Trust
– Government data is controversial, and potentially biased
  • How do we confirm or dispute?

• Combination
– When we combine data we need to keep the provenance of information (see trust)
  • How can we show and use?

• Scaling
– LOGD has already converted 8,678,741,017 triples
– ~500 of 390,000 reported US datasets

• Versioning and updating
• Archiving
• Searching in the data
• …

Page 39: RPI Research in Linked Open Government Systems

Summary

• The Open Govt data is a critical resource
– Government data released as RDF (UK)
– Government data converted to RDF (US)
– Government data that can be found in many forms and used or converted (WWW)

• Government transparency comes through in the “mashing up” of data from many sites
– Key to linked data

• But many challenges remain
– Scaling, Trust, Provenance, Archiving, Curation, …

• The Research agenda for linked government data is an important area for a Web-Science based approach

Page 40: RPI Research in Linked Open Government Systems

Questions?

http://logd.tw.rpi.edu

Page 41: RPI Research in Linked Open Government Systems

Govt systems can use linked data web for context

Correlates fires, acres burned, and agency budgets

Page 42: RPI Research in Linked Open Government Systems

Visualization can help identify data errors

Were there really no fires in 1985?
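
The kind of anomaly a chart makes visible can also be flagged programmatically. The following Python sketch lists years that are missing from a yearly series or that report a count of zero, so a person can check them against the source. The function name and the sample counts are hypothetical placeholders, not the fire data shown on the slide.

```python
def suspicious_years(counts_by_year):
    """Return (missing, zeros): years absent from the series and years with count 0.

    The series is expected to cover every year between its first and last entry.
    """
    if not counts_by_year:
        return [], []
    full_range = range(min(counts_by_year), max(counts_by_year) + 1)
    missing = [year for year in full_range if year not in counts_by_year]
    zeros = [year for year in full_range if counts_by_year.get(year) == 0]
    return missing, zeros


# Placeholder counts for illustration only (NOT real fire statistics).
sample_counts = {1984: 10, 1985: 0, 1987: 12}
missing_years, zero_years = suspicious_years(sample_counts)
print("missing:", missing_years)
print("zero:", zero_years)
```

A zero or a gap is only a prompt for review: it may be a genuine value, a data-entry error, or a year in which the agency did not report.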