
Beautifying Data in the Real World

Group 5: Toan Do - An Du

Vinh Nguyen - Tan Tran

Instructor: Professor Lothar Piepmeyer

How big is the data on the Internet?

2004: The Internet exceeded 1 EB for the first time

2005: Eric Schmidt estimated it was 5 million terabytes (~5 EB)

Cisco forecasts that in 2015 the size of the Internet will reach nearly 1,000 EB

How big is it?

Source: http://www.wisegeek.com/how-big-is-the-internet.htm, http://techland.time.com/


If 1 byte = 0.5mm

Source: http://blog.fliptop.com/how-much-data-is-on-the-internet/
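The "1 byte = 0.5 mm" analogy can be turned into a quick calculation. The sketch below assumes a decimal exabyte (10^18 bytes), which the slides do not specify; the function name is illustrative only.

```python
# Assumption: 1 EB = 10**18 bytes (decimal exabyte); the slides do not say which convention is meant.
BYTES_PER_EB = 10**18
MM_PER_BYTE = 0.5          # from the slide: 1 byte = 0.5 mm
MM_PER_KM = 1_000_000


def length_km(exabytes):
    """Length in kilometres of `exabytes` of data laid end to end at 0.5 mm per byte."""
    return exabytes * BYTES_PER_EB * MM_PER_BYTE / MM_PER_KM


if __name__ == "__main__":
    for size_eb in (1, 5, 1000):   # sizes quoted on the previous slide
        print(f"{size_eb} EB -> {length_km(size_eb):.3e} km")
```

Under this assumption, each exabyte corresponds to 5 × 10^11 km of bytes.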

Content

Introduction
The Open Notebook Science approach
Curating and presenting the data
Beautifying the data
Data Visualization & Building a portal from open data and free services
Demonstration

Data on the internet

Source: http://news.bbc.co.uk/2/hi/technology/8562801.stm

Problems of data in the real world (scientific)

Noisy sources of data
The barrier of data presentation

OCR version, text version, human-readable, machine-readable, …

How to verify the data?

Open Notebook Science

Purpose: record the full raw data of scientific research and make it available online

Benefits:
obtain detailed descriptions of procedures
improve the communication of science
speed up progress
reduce time lost due to the repetition of failed experiments
…

Apply ONS on free services

Crowdsourcing

a distributed problem-solving and production model


Source: http://r18ultrachair.com/

Validating crowdsourced data

Following ONS, all detailed data are recorded

Doubtful data are also kept and marked for later consideration

Unique Identifiers for Chemical Entities

Standardize data

Facilitate integration with other data sets

Consider 3 possibilities:
CAS Registry Number
InChI
SMILES

CAS Registry Number

Proprietary

Cannot be converted to a chemical structure

Depends on an external organization to issue numbers

For example, the CAS number of water is 7732-18-5: the check digit 5 is calculated as (8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6) = 105; 105 mod 10 = 5

http://en.wikipedia.org/wiki/CAS_registry_number
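The check-digit rule from the water example can be sketched in a few lines. The function names are illustrative, and the sketch assumes the input is already in the usual hyphenated form (digits, hyphens, one final check digit).

```python
def cas_check_digit(cas_number):
    """Compute the expected check digit for a CAS number such as '7732-18-5'.

    The digits before the check digit are weighted 1, 2, 3, ... from right
    to left; the weighted sum modulo 10 is the check digit.
    """
    body = cas_number.replace("-", "")[:-1]   # drop hyphens and the final check digit
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10


def is_valid_cas(cas_number):
    """True if the last digit matches the computed check digit."""
    return cas_check_digit(cas_number) == int(cas_number[-1])


if __name__ == "__main__":
    print(cas_check_digit("7732-18-5"))   # water, as in the example above
    print(is_valid_cas("7732-18-5"))
```

This only validates the checksum; it cannot tell whether the number has actually been issued.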

InChI

IUPAC International Chemical Identifier

Freely usable and non-proprietary

Does not have to be assigned by an organization

Can be computed from structural information

Human readable (with practice)

http://en.wikipedia.org/wiki/Inchi

SMILES

Simplified molecular-input line-entry system

More human-readable than InChI

Can be converted to InChI

http://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system


Analysis Options

Access to live data
Get summary
Complex statistical representations of models
Mark the doubtful data for later consideration

Google Docs API

Allows developers to create, retrieve, update, and delete Google Docs files and collections

Also provides some advanced features like resource archives, Optical Character Recognition, translation, and revision history.

Useful for storing data in the cloud, managing resources, and converting document formats

https://developers.google.com/google-apps/documents-list/

Google Visualization API

Chart Library: JavaScript classes

Data Table: JavaScript DataTable class

Data Source: Chart Tools Datasource protocol

https://developers.google.com/chart/interactive/docs/index
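A data source ultimately hands the chart library a table of columns and rows. The sketch below builds that table as JSON on the server side, using the `cols` / `rows` / `c` / `v` layout of the DataTable literal notation as I understand it from the documentation. The helper name and column ids are hypothetical, and the figures are the ones quoted earlier in these slides.

```python
import json


def to_datatable(columns, rows):
    """Build a DataTable-style dict.

    columns: list of (id, label, type) tuples, e.g. ("year", "Year", "string")
    rows:    list of tuples with one value per column
    """
    return {
        "cols": [{"id": cid, "label": label, "type": ctype}
                 for cid, label, ctype in columns],
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


# Hypothetical column ids; values taken from the "How big is the data" slide.
table = to_datatable(
    [("year", "Year", "string"), ("size_eb", "Internet size (EB)", "number")],
    [("2004", 1), ("2005", 5), ("2015", 1000)],
)

if __name__ == "__main__":
    print(json.dumps(table, indent=2))
```

On the page, the JSON would be passed to the JavaScript DataTable class before a chart is drawn.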


https://google-developers.appspot.com/chart/interactive/docs/gallery

RESTful Web Service

Representational State Transfer (REST): a simpler alternative to SOAP- and Web Services Description Language (WSDL)-based Web services

Principles:
Use HTTP methods explicitly.
Be stateless.
Expose directory structure-like URIs.
Transfer XML, JavaScript Object Notation (JSON), or both.

http://www.ibm.com/developerworks/webservices/library/ws-restful/
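The four principles can be illustrated without any web framework. This is a minimal sketch, not a real server: the `/compounds/...` URI scheme, the in-memory store, and the `handle` function are all hypothetical, and a production service would sit behind an actual HTTP server.

```python
import json

# Hypothetical resource store keyed by directory-like URIs.
COMPOUNDS = {"/compounds/7732-18-5": {"name": "water", "smiles": "O"}}


def handle(method, uri, body=None):
    """Handle one request and return (status, body).

    Each call carries everything needed to process it; no per-client
    session state is kept between calls (statelessness).
    """
    if method == "GET":                       # read a resource
        if uri in COMPOUNDS:
            return 200, json.dumps(COMPOUNDS[uri])
        return 404, json.dumps({"error": "not found"})
    if method == "PUT":                       # create or replace a resource
        COMPOUNDS[uri] = json.loads(body)
        return 200, json.dumps(COMPOUNDS[uri])
    if method == "DELETE":                    # remove a resource
        if COMPOUNDS.pop(uri, None) is None:
            return 404, json.dumps({"error": "not found"})
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})


if __name__ == "__main__":
    print(handle("GET", "/compounds/7732-18-5"))
```

The HTTP method alone says what is being done to the resource, and the URI alone says which resource is meant.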

Compare REST and SOAP

Who's using REST?

All of Yahoo's web services use REST, including Flickr. The del.icio.us API uses it, as do pubsub, bloglines, and technorati. Both eBay and Amazon have web services for both REST and SOAP.

Who's using SOAP?

Google seems to be consistent in implementing their web services to use SOAP, with the exception of Blogger, which uses XML-RPC. You will find SOAP web services in lots of enterprise software as well.

Source: http://www.petefreitag.com/item/431.cfm

Compare REST and SOAP

REST
Lightweight: not a lot of extra XML markup
Human-readable results
Easy to build: no toolkits required

SOAP
Easy to consume (sometimes)
Rigid: type checking, adheres to a contract
Development tools

An Effort to Aggregate Data from Multiple Sources

Introducing ChemSpider

An online lookup engine for chemists
http://www.chemspider.com
40 million substances
Multiple data sources
A "link farm" to other sources

What is "wrong" with wikipedia.com?

Wikipedia.com

Not “wrong”:

Very informative for human beings

Wikipedia.com

This little guy is left behind

Not machine-readable

Semantic Web

Describing things in a way that computer applications can understand

Example: “The Beatles was a band from Liverpool”

Describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, and price)

“... will make all the data in the world look like one huge database” – Tim Berners-Lee

http://www.w3schools.com/web/web_semantic.asp

Resource Description Framework

A language for describing resources on the web

Component of the Semantic Web

Data is self-describing

Triples: "subject", "predicate" and "object" (value)

URIs are used to denote resources

RDF

Graph database: nodes and edges

Well suited for knowledge representation

Beautified Data => Knowledge

RDF Example

<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:cd="http://www.recshop.fake/cd#">
<rdf:Description
 rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
  <cd:artist>Bob Dylan</cd:artist>
  <cd:country>USA</cd:country>
  <cd:company>Columbia</cd:company>
  <cd:price>10.90</cd:price>
  <cd:year>1985</cd:year>
</rdf:Description>
</rdf:RDF>
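To see the triples hidden in the RDF/XML above, the document can be flattened into (subject, predicate, value) tuples. This is a simplified sketch using only the standard library: it handles flat `rdf:Description` elements with literal-valued properties, as in this example, and is not a general RDF/XML parser. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:cd="http://www.recshop.fake/cd#">
<rdf:Description
 rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
  <cd:artist>Bob Dylan</cd:artist>
  <cd:country>USA</cd:country>
  <cd:company>Columbia</cd:company>
  <cd:price>10.90</cd:price>
  <cd:year>1985</cd:year>
</rdf:Description>
</rdf:RDF>"""


def rdf_xml_to_triples(xml_text):
    """Return (subject, predicate, value) tuples for flat rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores tags as "{namespace}local"; glue them back
            # together to obtain the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, prop.text))
    return triples


if __name__ == "__main__":
    for triple in rdf_xml_to_triples(RDF_XML):
        print(triple)
```

Each printed tuple is one edge of the graph: the CD is the subject node, the property URI is the edge label, and the literal is the value.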

Semantic Web Example: DBpedia

“Old School” Wikipedia: http://en.wikipedia.org/wiki/Porsche_Panamera

DBpedia entries:

http://dbpedia.org/page/Porsche_Panamera
http://dbpedia.org/page/Chromium_carbide

Query Language: SPARQL (pronounced "sparkle")

Query language for RDF
Graph traversal
Matching the triples

Example

Data:
<http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> "SPARQL Tutorial" .

Query:
SELECT ?title
WHERE { <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title . }

Query result:
title
"SPARQL Tutorial"
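"Matching the triples" can be illustrated with a toy matcher for a single triple pattern. This is only a sketch of the idea, not a SPARQL engine: terms starting with `?` are treated as variables, everything else must match exactly, and the function name is hypothetical.

```python
def match_pattern(triples, pattern):
    """Return one {variable: value} dict for each triple that matches `pattern`."""
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; a repeated variable must agree.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break                      # constant term does not match
        else:
            results.append(bindings)       # no break: all three positions matched
    return results


DATA = [
    ("http://example.org/book/book1",
     "http://purl.org/dc/elements/1.1/title",
     "SPARQL Tutorial"),
]

PATTERN = ("http://example.org/book/book1",
           "http://purl.org/dc/elements/1.1/title",
           "?title")

if __name__ == "__main__":
    print(match_pattern(DATA, PATTERN))
```

A real query combines several such patterns and joins their bindings, which is where the graph traversal comes from.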


To Infinity and Beyond

• DB2 and Oracle are ready for this train

• Object databases: Versant OODBMS, anybody?

• Machine-readable data: will it become self-aware?

“Data Finds Data” and Semantic Data Model – A Hypothesis

Non-Obvious Relationship Awareness

[Diagram, built up step by step over several slides]
LÂM has LÂM’s iPhone
BẢO has BẢO’s SS Galaxy
TheGioiDiDong.com sold LÂM’s iPhone (LÂM’s iPhone was sold by TheGioiDiDong.com)
TheGioiDiDong.com sold BẢO’s SS Galaxy (BẢO’s SS Galaxy was sold by TheGioiDiDong.com)
Direct relationship between LÂM and BẢO: does not exist

Connection detected!
- Bao could have met Lam at Thegioididong?
- They could have discussed their world domination scheme during the meeting there?
- ???
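The "connection detected" step amounts to finding a path between two nodes that share no direct edge. A minimal sketch, assuming the relationships in the diagram are stored as triples and treated as undirected links; the breadth-first search and the function name are my illustration, not something the slides prescribe.

```python
from collections import deque

# Relationships transcribed from the diagram above.
EDGES = [
    ("LÂM", "has", "LÂM's iPhone"),
    ("BẢO", "has", "BẢO's SS Galaxy"),
    ("TheGioiDiDong.com", "sold", "LÂM's iPhone"),
    ("TheGioiDiDong.com", "sold", "BẢO's SS Galaxy"),
]


def find_connection(edges, start, goal):
    """Breadth-first search for the shortest chain of entities linking start to goal."""
    neighbors = {}
    for subject, _predicate, obj in edges:
        # Treat every relationship as traversable in both directions.
        neighbors.setdefault(subject, []).append(obj)
        neighbors.setdefault(obj, []).append(subject)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None   # no chain of relationships connects the two


if __name__ == "__main__":
    print(" -> ".join(find_connection(EDGES, "LÂM", "BẢO")))
```

The path found runs through both phones and the shop, which is exactly the non-obvious link the slides point at.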

Data Visualization

Building a portal from open data and free services

Visualization of Data

Source: http://nmap.org/favicon/

Favicons of the top million web sites (per Alexa traffic data); the scan was performed in early 2010

Visualization of Data

Second Life

Second Life is a 3D world where everyone you see is a real person and every place you visit is built by people just like you.

3D Visualization in SL

SL- The Opportunity for "Edutainment"

Drexel Island on Second Life

iSchool Teaching: Quizzes and Lectures

Classrooms with PowerPoint

Research Center

3-D Environments

http://3rdrockgrid.com/

http://www.osgrid.org/

http://www.craft-world.org

http://www.secondlife.com/

http://youralternativelife.com//

Visualization To Suggest New Experiments

Building A Portal From Open Data And Free Services

Freely hosted wiki service
Google Spreadsheet
Google Docs API / JavaScript
Visualization services / analysis services (2D, 3D)
RDF / Semantic Web / Web services
Cost: free or fit to the purpose

Key To Success

+ Transparency

Model

Information

Data

Records

Demonstration

Google Docs

Second Life

References

O'Reilly, Beautiful Data, Chapter 16: Beautifying Data in the Real World

http://techland.time.com/2011/06/01/how-big-is-the-internet-spoiler-not-as-big-as-itll-be-in-2015/

http://drexelisland.wikispaces.com/

SMILE to 3D - Second Life: http://www.youtube.com/watch?v=tOfhuoRbnCg&feature=player_embedded