Open Government systems are changing the way that many governments around the world interact with their citizenry. Transparency and innovation are both enhanced by government openness and especially by government data sharing. Further, the linked data approach, using maturing semantic web technologies, has been shown to be very valuable in creating "mashups" in which government datasets can be combined in new and innovative ways, and turned into "live" infographics. In this talk, presented to the computer science department at PUC-RIO, we describe the research aspects of RPI's ongoing work in this area.
Tetherless World Constellation
Linked Open Government Data
http://logd.tw.rpi.edu
Jim Hendler
Tetherless World Professor of Computer and Cognitive Science
Assistant Dean of Information Technology and Web Science
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler (twitter)
Government Data on the Web
Data.gov community: International
Government Data Sharing
• January 1, 2009: “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” --- President Obama
• May 21, 2009: data.gov online
• June 30, 2009: “Putting Govt Data online”; Data.gov.uk beta
• December 8, 2009: “Open Government Directive” released
• January 19, 2010: data.gov.uk online
• May 21, 2010: data.gov relaunch with semantic web featured
Data set counts shown along the timeline (2009, 2010, …): 57 Data Sets; ~6000 Data Sets; ~2000 Data Sets; >305,000 Data Sets
David McCandless
New ways to see data sets
Important to the citizens: e.g., education
What’s promising
• Linked open government data (data.gov, data.gov.uk)
  – Of many kinds
• Markup languages and semantics and tools to enable transparency
• Lower barriers to internet visualization, e.g. Google vis, MIT simile, many more…
• Web 2.0 to put people in the loop and use and contribute to annotations
Moving data.gov to linked data (UK)
• Built around “linked data” from the start
• Authorization for this from the Prime Minister
Moving data.gov to linked data (US)
• Third parties (like RPI) translate the government datasets into linked data formats
• US Data.gov hosts 6.4B RDF triples (5/21/2010)
• Acknowledges Semantic Web as a key technology for open government data
Linked Open Data goes beyond govt
http://linkeddata.org/
Government Data is currently over ½ the cloud in size (~17B triples), 10s of thousands of links to other data (within and without)
Create Mashups
More than 50 of these at http://logd.tw.rpi.edu
Data.gov + epa.gov
Adding some Web magic
Web Analytics
Social Data Networks
External Links
Linking GDP of the US and China
GDP of China (billion Chinese yuan)
GDP of the US (billion dollars)
[Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn
This mashup was built in less than 4 hours – including conversion of data, web interface, and visualization!
Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)
Mashups allow comparisons that single data sets cannot
Extensible Mashups via Linked Data
• Diverse datasets from NIH
• Potentially linking to “unemployment rate”
Accountable Mashups via Provenance
• Annotate datasets used in demos
• Feed back users’ comments to gov contact (e.g. %)
Integrate with Social media
Our process
[Diagram. Stages: Convert, Access, Enhance, Version, SemDiff. Arrow labels: create, derive, revision.]
Conversion of data sets
Csv2rdflod (from logd.tw.rpi.edu)
Install csv2rdflod
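As a rough illustration of what converting a tabular dataset to RDF involves, the sketch below turns each CSV row into N-Triples statements. It is not the csv2rdflod tool named above and makes no claim about how that tool works; the base URI `http://example.org/dataset/demo/`, the function name `csv_to_ntriples`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/dataset/demo/"  # hypothetical base URI


def csv_to_ntriples(text, base=BASE):
    """Naively map each CSV row to one subject and each non-empty cell to one triple."""
    triples = []
    rows = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(rows, start=1):
        subject = f"<{base}row/{row_number}>"
        for column, value in row.items():
            if column is None or not value:
                continue  # skip overflow cells and empty or missing values
            predicate = f"<{base}vocab/{quote(column.strip().replace(' ', '_'))}>"
            literal = (
                value.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
            )
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


sample = "Facility ID,ST\nF-1,42\nF-2,26\n"  # made-up rows, not real EPA data
for line in csv_to_ntriples(sample):
    print(line)
```

Every value is emitted as a plain string literal here. Typing values, minting stable subject URIs, and linking to other datasets are the kinds of steps an "enhanced" conversion would add on top of such a raw pass.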
Metadata is critical
What kinds of metadata are simple to create, powerful enough for search, and internationalizable (esp. beyond English)?
Work in Progress
• Automated linking
  – Can we discover link points in the data given the standard URI and metadata collections we have?
  – Approach
    • High quality experimentation on small dataset (gold standard)
      – MS by Johanna Flores, Web Science poster, 2011
    • Take best heuristics to large-scale data
      – Ongoing
    • Evaluate
      – Ongoing
    • Repeat as needed…
Datasets are incomplete
RDF encodings from our metadata collection
Process
• Tried three heuristic approaches
  – Bag of words
  – LED on strings
  – String match
• Various weighted combinations
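One way to picture the three heuristics and a weighted combination is sketched below. The slides do not give the actual scoring functions, so everything here is an assumption: "LED" is read as Levenshtein edit distance, bag of words is scored as Jaccard overlap of word sets, and the weights `(0.5, 0.25, 0.25)` are arbitrary placeholders.

```python
def string_match(a, b):
    """1.0 when the two labels are identical ignoring case and outer spaces."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def bag_of_words(a, b):
    """Jaccard overlap between the sets of words in each label."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def led_similarity(a, b):
    """Edit distance rescaled to a 0..1 similarity."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def combined_score(a, b, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of the three heuristic scores (weights are placeholders)."""
    w_match, w_bag, w_led = weights
    return (w_match * string_match(a, b)
            + w_bag * bag_of_words(a, b)
            + w_led * led_similarity(a, b))


print(combined_score("State", "state"))
print(combined_score("state name", "state code"))
```

Keeping each heuristic as a separate function returning a score in the 0..1 range makes it easy to try different weightings against a small gold-standard dataset before scaling up.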
Simple Example
Facility ID   …   Latitude    Longitude    ST:val
…             …   40.416944   -75.935      42
…             …   42.955383   -85.480074   26
…             …   43.1698     -88.01829    55
…             …   38.87025    -77.00905    14
…             …   …           …            …
EPA Toxic Release Data
This looks like it could be state identifiers.
Look for possible state identifiers:
– Names: “Pennsylvania”, “Michigan”, “Wisconsin”
– Abbr: “PA”, “MI”, “WI”
– FIPS: “42”, “26”, “55”
75% match state identifiers.
If this meets our threshold, then recommend interpreting as state and integrating with linked data on the web.
Federal Information Processing Standards (FIPS) code 14 is “Guam”, which is not a US state.
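The threshold test in this example can be sketched as follows. This is a minimal illustration, not the project's code: it checks only the FIPS form of the identifiers (names and abbreviations are left out), and the function names and the `0.7` threshold are assumptions, since the slides do not state the threshold used.

```python
# Two-digit FIPS codes from 01 to 56, minus the numbers in that range
# that are not assigned to one of the 50 states (11 is the District of Columbia).
NON_STATE_CODES = {"03", "07", "11", "14", "43", "52"}
STATE_FIPS = {f"{n:02d}" for n in range(1, 57)} - NON_STATE_CODES


def state_match_fraction(values):
    """Fraction of non-empty column values that are state FIPS codes."""
    cleaned = [str(v).strip().zfill(2) for v in values if str(v).strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if v in STATE_FIPS)
    return hits / len(cleaned)


def looks_like_state_column(values, threshold=0.7):
    """Recommend a state interpretation when enough values match (threshold is assumed)."""
    return state_match_fraction(values) >= threshold


st_column = ["42", "26", "55", "14"]  # the ST:val values from the example table
print(state_match_fraction(st_column))     # 0.75
print(looks_like_state_column(st_column))  # True
```

A fractional threshold rather than an all-or-nothing test lets a column with a few non-state codes, such as the 14 here, still be recognized, at the cost of having to handle the unmatched values separately.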
Results
• Analyzed 1,396 “raw” Data.gov datasets
  – About 1.66B triples of converted CSV to RDF
  – Did not include metadata, provenance, linking or other products of the “enhanced” conversion
• Simple heuristics were able to identify 3,432 meaningful database labels, yielding 1.2M US state identifiers and 3.8M geo-coordinates
  – Parallelized enhancement system is able to process 65k triples/second/process
  – Analyzed 1,396 “raw” Data.gov datasets in 3.1 min
    • on 256 processors of the CCNI Opteron cluster
    • Currently porting to an IBM BlueGene
• Analysis (ongoing) found no errors in links produced (but many errors of omission)
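The throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes that the 1.66B triples are the whole workload of the 3.1-minute run and that 65k triples/second/process is a sustained per-process rate; neither assumption is stated in the slides, so the result is only a rough consistency check.

```python
TRIPLES = 1.66e9           # about 1.66B triples, from the slide
RATE_PER_PROCESS = 65_000  # triples per second per process, from the slide
PROCESSES = 256            # processors used, from the slide
REPORTED_MINUTES = 3.1     # reported wall-clock time, from the slide

# Time the run would take if every process sustained the quoted rate.
ideal_seconds = TRIPLES / (RATE_PER_PROCESS * PROCESSES)

# Per-process rate implied by the reported wall-clock time.
reported_seconds = REPORTED_MINUTES * 60
effective_rate = TRIPLES / (reported_seconds * PROCESSES)

print(f"ideal time: {ideal_seconds:.0f} s")
print(f"effective rate: {effective_rate:.0f} triples/s/process")
```

Under these assumptions the ideal time comes out at roughly 100 seconds against 186 seconds reported, which would be consistent with the quoted rate being a peak figure and the wall-clock time including overheads such as I/O and startup.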
Next Steps
• Explore
  – Use of mapping heuristics
    • Talking w/PUC-RIO about “real” LED and machine-learning approaches
  – Metadata analysis
    • Clustering & ML (336,000 labeled examples)
    • Metadata linking (esp. re: languages)
      – Govt terms provide a good start
  – Try other mapping tools (cf. SERIMI)
Challenge
• Ontology and vocabulary issues
  – How do we compare across heterogeneous and unreconciled data?
• Good news and bad news
Good news – easy to do comparisons
Good news - Even if not “rationalized” together
Bad news – real comparisons are hard across govts
Presents a challenge
Same or different?
Different “ontologies” ?
Definitely not the expected result!!
And many other interesting issues
• Trust
  – Government data is controversial, and potentially biased
    • How do we confirm or dispute?
• Combination
  – When we combine data we need to keep the provenance of information (see trust)
    • How can we show and use?
• Scaling
  – LOGD has already converted 8,678,741,017 triples
  – ~500 of 390,000 reported US datasets
• Versioning and updating
• Archiving
• Searching in the data
• …
Summary
• Open Govt data is a critical resource
  – Government data released as RDF (UK)
  – Government data converted to RDF (US)
  – Government data that can be found in many forms and used or converted (WWW)
• Government transparency comes through in the “mashing up” of data from many sites
  – Key to linked data
• But many challenges remain
  – Scaling, Trust, Provenance, Archiving, Curation, …
• The research agenda for linked government data is an important area for a Web-Science based approach
Questions?
http://logd.tw.rpi.edu
Govt systems can use linked data web for context
Correlates fires, acres burned, and agency budgets
Visualization can help identify data errors
Were there really no fires in 1985?