
Today

•  Finish Section on Linked Data
•  Begin data cleaning and pre-processing topic

Graphs: Social networks

https://www.flickr.com/photos/marc_smith/5592302165

Protein-Protein Interactions

http://www.nature.com/nrg/journal/v5/n2/fig_tab/nrg1272_F2.html

The Internet Graph (https://en.wikipedia.org/wiki/Opte_Project)

Linked Data

•  We need to connect data together, forming links
   –  A key part of the Semantic Web
   –  Also important for the Internet of Things (a predicted 26 billion things by 2020, each continuously producing data)

•  Principles of links from Tim Berners-Lee:
   1.  All kinds of conceptual things now have names that start with HTTP.
   2.  If I take one of these HTTP names and look it up, I will get back some data in a standard format: useful data that somebody might like to know about that thing, about that event.
   3.  When I get back that information, it's not just got somebody's height and weight and when they were born; it's got relationships. And whenever it expresses a relationship, the other thing it's related to is given one of those names that starts with HTTP.

Linked Data Examples

•  DBPedia
   –  ~5 million "things" from Wikipedia
   –  Can be linked to external datasets such as the CIA World Factbook and US Census data
   –  "Give me all cities in New Jersey with more than 10,000 people"
•  Freebase
•  FOAF (friend of a friend)
•  Google Knowledge Graph
   –  https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html

Standards for Linked Data

•  Widely used standards (W3C Recommendations)
   –  JSON-LD (JSON Linked Data)
   –  RDF (Resource Description Framework)

JSON-LD (example from json-ld.org)

•  Provides mechanisms for specifying unambiguous meaning in JSON data
•  Provides extra keys marked with an "@" sign
   –  "@context" (defines the meanings of terms, mapping them to identifiers)
   –  "@type"
   –  "@id"
•  Use cases
   –  Google Knowledge Graph

JSON-LD Example (from https://en.wikipedia.org/wiki/JSON-LD)

{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
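Because JSON-LD is ordinary JSON, it can be loaded with Python's standard json module. The sketch below is illustrative only (full term expansion needs a dedicated JSON-LD processor); it resolves top-level keys against "@context" by hand:

```python
import json

# JSON-LD is plain JSON, so the standard json module can load it.
# This is only a sketch: it maps top-level terms to the IRIs declared
# in "@context", which is a simplified version of JSON-LD "expansion".
doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith"
}
""")

context = doc["@context"]
# Replace each non-keyword key with its unambiguous IRI from the context.
expanded = {context.get(k, k): v for k, v in doc.items() if not k.startswith("@")}
print(expanded)  # {'http://xmlns.com/foaf/0.1/name': 'John Smith'}
```

The point of the "@context" is visible here: after expansion, "name" is no longer an ambiguous local key but the globally unique FOAF IRI.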

Graphs – RDF (Resource Description Framework) [materials from w3.org]

Serialisation of RDF Example Graph

This graph can be serialised as XML (don’t worry about syntax!)

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>

RDF – Triple Store

•  An alternative format for storing RDF-type data: the triple store

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> <mailto:em@w3.org> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
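Each line above is one complete subject-predicate-object statement, which makes the format easy to process line by line. A naive Python sketch follows (real data should go through a proper RDF parser; this only handles the simple <iri> and "literal" forms shown above):

```python
# A minimal, naive N-Triples reader: one "subject predicate object ."
# statement per line. Only a sketch -- it handles just the <iri> and
# "literal" forms from the slide, not escapes, blank nodes, etc.
triples_text = '''
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
'''

triples = []
for line in triples_text.strip().splitlines():
    # Drop the trailing " .", then split into exactly three fields.
    s, p, o = line.rstrip(" .").split(None, 2)
    triples.append((s.strip("<>"), p.strip("<>"), o.strip('"')))

# Query: collect all facts about one subject, keyed by local name.
subject = "http://www.w3.org/People/EM/contact#me"
facts = {p.rsplit("#", 1)[-1]: o for s, p, o in triples if s == subject}
print(facts)  # {'fullName': 'Eric Miller', 'personalTitle': 'Dr.'}
```

This triple-at-a-time view is exactly what makes a triple store queryable: every fact is a row, and queries are pattern matches over subjects, predicates and objects.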

Freebase

•  A large database that connects entities (facts, people, places, organizations, …) together as a graph
   –  www.freebase.com
   –  Freebase is the basis of the Google Knowledge Graph that is used to improve search
      •  https://developers.google.com/knowledge-graph/
•  Retrieving data from the Google Knowledge Graph
   –  Example adapted from http://www.nolan-nichols.com/knowledge-graph-via-sparql.html

Other formats for Graphs: Matrix Representation

Example graph with nodes A, B, C, D and edges A→C, C→B, D→B:

    A   B   C   D
A   0   0   1   0
B   0   0   0   0
C   0   1   0   0
D   0   1   0   0

A '1' in the matrix iff there is an edge from node X to node Y.

Or use a relational table:

Source   Destination
A        C
C        B
D        B
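The two representations hold the same information, so we can convert between them. A short Python sketch, using the node and edge names from the example above:

```python
# The example graph in both forms: an edge list ("relational table")
# and an adjacency matrix, converting one into the other and back.
nodes = ["A", "B", "C", "D"]
edges = [("A", "C"), ("C", "B"), ("D", "B")]  # (source, destination)

index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for src, dst in edges:
    matrix[index[src]][index[dst]] = 1  # 1 iff there is an edge src -> dst

print(matrix[index["A"]][index["C"]])  # 1: edge A -> C exists
print(matrix[index["A"]][index["B"]])  # 0: no edge A -> B

# Recover the relational table from the matrix:
recovered = [(nodes[i], nodes[j])
             for i in range(len(nodes))
             for j in range(len(nodes)) if matrix[i][j]]
print(recovered)  # [('A', 'C'), ('C', 'B'), ('D', 'B')]
```

The matrix makes edge lookups O(1) but costs space for every absent edge; the edge list stores only the edges that exist, which matters for sparse graphs.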

What you should know about data formats

•  Why do we have different data formats, and why do we wish to transform between them?
•  Motivation for using relational databases to manage information
•  Difference between a (standard) relational database and a NoSQL database
•  What is a CSV, what is a spreadsheet, and what is the difference?
•  Be able to write regular expressions in Python format (operators .^$*+|[])
•  Difference between HTML and XML, and when to use each
•  Motivation behind using XML and XML namespaces
•  Be able to read and write data in XML (elements, attributes, namespaces)
•  Be able to read and write data in JSON
•  Difference between XML and JSON; applications where each can be used
•  The purpose of using schemas for XML and JSON data
•  The motivation behind Linked Data and the purpose of using JSON-LD or RDF to represent it

Further reading

•  Relational databases
   –  Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
•  XML
   –  http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
•  JSON and JSON-LD
   –  http://json.org
   –  http://crypt.codemancers.com/posts/2014-02-11-An-introduction-to-json-schema/
   –  https://cloudant.com/blog/webizing-your-database-with-linked-data-in-json-ld/#.Vtp_UMfB_Gw
•  RDF
   –  https://www.w3.org/DesignIssues/LinkedData.html
   –  http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf
   –  http://www.dlib.org/dlib/may98/miller/05miller.html

COMP20008 Elements of Data Processing

Data Pre-Processing and Cleaning

Why is pre-processing needed?

Name           Age         Date of Birth
"Henry"        20.2        20 years ago
Katherine      Forty-one   20/11/66
Michelle       37          5/20/79
Oscar@!!       "5"         13th Feb. 2011
-              42          -
Mike___Moore   669         -
巴拉克奥巴马   52          1961年8月4日

Why is pre-processing needed?

•  Measuring data quality
   –  Accuracy: correct or wrong, accurate or not
   –  Completeness: not recorded, unavailable
   –  Consistency: e.g. discrepancies in representation
   –  Timeliness: updated in a timely way
   –  Believability: do I trust that the data is correct?
   –  Interpretability: how easily can I understand the data?

Major data preprocessing activities

Data mining concepts and techniques, Han et al 2012

Terminology

Height   Weight   Age   Gender
1.8      80       22    Male
1.53     82       23    Male
1.6      62       18    Female

•  The 4 columns (Height, Weight, Age, Gender) are features or attributes
•  The data items (3 rows) are called instances or objects
•  Height, Weight and Age are continuous features
•  Gender is a categorical or discrete feature

Data integration

•  Bringing data from multiple sources together
   –  Resolve conflicts
   –  Detect duplicates
•  Will cover in depth in weeks 8 and 9

[Figure: two separate data sources combined into one integrated data source]

Data reduction

•  Decrease the number of features (columns) or instances (rows)
   –  Sampling strategies
   –  Remove irrelevant features and reduce noise
   –  Easier to visualise, faster to analyse
•  Will cover during the section on visualisation (weeks 5 and 6) and feature analysis (weeks 9 and 10)

http://bigdataexaminer.com/data-science/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/

Data cleaning

•  Incomplete (missing) data
•  Noisy data
•  Inconsistent data
•  Intentionally disguised data

Data cleaning – The Process

•  Many tools exist (Google Refine, Kettle, Talend, …)
   –  Data scrubbing
   –  Data discrepancy detection
   –  Data auditing
   –  ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
•  Our emphasis will be on understanding some of the methods employed by these tools

Missing or incomplete data

•  Lacking feature values
   –  Name=""
   –  Age=null
•  Types of missing data (Rubin 1976)
   –  Missing completely at random: data are missing independently of observed and unobserved data
      •  E.g. flipping a coin to decide whether or not to answer an exam question
   –  Missing not completely at random
      •  I create a dataset by surveying the class about how healthy they feel. What is the meaning of missing values for those who don't respond?
      •  I set an exam and ask a question in hard-to-understand language. What is the meaning of missing values for those who don't answer the question?

Example: USA Salary survey data

•  Is Person B's salary missing at random?
•  Very difficult to determine reasons for missingness
   –  In practice, report assumptions about missingness

Name       Salary
Person C   $59k
Person D   $63k
Person H   $99k
Person E   $102k
Person G   $140k
Person F   $150k
Person A   $180k
Person B   -

Causes of missing data

•  Why does it occur?
   –  Malfunction of equipment (e.g. sensors)
   –  Not recorded due to misunderstanding
   –  May not be considered important at time of entry
   –  Deliberate
•  How to handle it?
   –  We will look at a number of strategies

Extreme Missing data

•  Movie recommender systems

Person   Star Wars   Batman   Jurassic World   The Martian   The Revenant   Lego Movie   Selma   …
James    3           2        -                -             -              1            -
John     -           -        1                2             -              -            -
Jill     1           -        -                3             2              1            -

Users and movies: each user rates only a few movies (say 1%). Netflix wants to predict the missing ratings for each user.
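When ~99% of entries are missing, a dense table is mostly gaps. A common alternative is a sparse dict-of-dicts holding only the observed ratings; this is an illustrative sketch (the helper get_rating is made up for the example, and the data comes from the table above):

```python
# Sparse representation of the ratings table: store only observed
# ratings, so the ~99% missing entries cost nothing.
ratings = {
    "James": {"Star Wars": 3, "Batman": 2, "Lego Movie": 1},
    "John":  {"Jurassic World": 1, "The Martian": 2},
    "Jill":  {"Star Wars": 1, "The Martian": 3, "The Revenant": 2, "Lego Movie": 1},
}

def get_rating(user, movie):
    """Return the observed rating, or None if it is missing."""
    return ratings.get(user, {}).get(movie)

print(get_rating("James", "Star Wars"))  # 3
print(get_rating("James", "Selma"))      # None -- missing, to be predicted
```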

Noisy data

•  Truncated fields (e.g. exceeded an 80-character limit)
•  Text incorrectly split across cells (e.g. separator issues)
•  Salary="-5"
•  Some causes
   –  Imprecise instruments
   –  Data entry issues
   –  Data transmission issues

Inconsistent data

•  Different naming representations ("Melbourne University" versus "University of Melbourne", or "three" versus "3")
•  Different date formats ("3/4/2016" versus "3rd April 2016")
•  Age=20, Birthdate="1/1/2002"
•  Two students with the same student id
•  Outliers
   –  E.g. 62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999
      •  No good if it is a list of ages of hospital patients
      •  Might be OK for a listing of people's number of contacts on LinkedIn
   –  Can use automated techniques, but also need domain knowledge

Disguised data

•  Everyone's birthday is January 1st?
•  Email address is xx@xx.com
•  Adriaans and Zantinge:
   –  "Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead."
•  How to handle
   –  Look for "unusual" or suspicious values in the dataset, using knowledge about the domain

Dealing with missing data

•  What are the consequences of missing data?
   –  May break application programs not expecting it
   –  Less power for later analysis
   –  May bias later analysis
•  So, how to handle it?

Strategy 1: Delete all instances with a missing value

•  Sometimes called case deletion
•  Effects
   –  Easy to analyse the new (complete) data
   –  May bias the analysis if the new sample size is small or structure exists in the missing data
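A minimal Python sketch of case deletion (instances are dicts, None marks a missing value, and the data is a made-up subset of the movie example):

```python
# Case deletion: keep only the instances (rows) with no missing values.
instances = [
    {"Name": "Mandy", "Star Wars": 1, "Batman": 2},
    {"Name": "James", "Star Wars": 3, "Batman": None},
    {"Name": "John", "Star Wars": None, "Batman": None},
]

# An instance survives only if every one of its values is present.
complete = [row for row in instances
            if all(v is not None for v in row.values())]
print([row["Name"] for row in complete])  # ['Mandy']
```

Note how aggressive this is: two of three instances are discarded because of a single missing value each, which is exactly the bias risk mentioned above.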

Case deletion

Before deletion:

Person   Star Wars   Batman   Jurassic World   The Martian   The Revenant   Lego Movie   Selma
Mandy    1           2        1                3             3              2            3
James    3           2        -                -             -              1            -
John     -           -        1                2             -              -            -
Jill     1           -        -                3             2              1            -

After deletion, only the complete instance remains:

Person   Star Wars   Batman   Jurassic World   The Martian   The Revenant   Lego Movie   Selma
Mandy    1           2        1                3             3              2            3

Strategy 2: Manually correct

•  A human eyeballs the missing value and fills it in using their expert knowledge


Strategy 3: Imputation

•  Impute a value (replace the missing value with a substitute one)
•  After imputing all missing values, standard analysis techniques for complete datasets can be used

Original ratings (with missing values):

Person   Star Wars   Batman   Jurassic World   The Martian   The Revenant   Lego Movie   Selma   …
James    3           2        -                -             -              1            -
John     -           -        1                2             -              -            -
Jill     1           -        -                3             2              1            -

After imputation:

Person   Star Wars   Batman   Jurassic World   The Martian   The Revenant   Lego Movie   Selma   …
James    3           2        2                2             1              1            1
John     3           2        1                2             2              1            1
Jill     1           1        1                3             2              1            1

Imputation: Fill in with zeros (or similar)

Person   Star Wars   Batman   Jurassic World   The Martian   The Revenant   Lego Movie   Selma   …
James    3           2        0                0             0              1            0
John     0           0        1                2             0              0            0
Jill     1           0        0                3             2              1            0

•  Simple
•  Won't break application programs
•  Limited utility for analysis
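A sketch of zero imputation in plain Python; the rows mirror the ratings table, with None marking a missing rating:

```python
# Zero imputation: replace every missing rating (None) with 0.
rows = [
    ["James", 3, 2, None, None, None, 1, None],
    ["John", None, None, 1, 2, None, None, None],
    ["Jill", 1, None, None, 3, 2, 1, None],
]

filled = [[0 if v is None else v for v in row] for row in rows]
print(filled[0])  # ['James', 3, 2, 0, 0, 0, 1, 0]
```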

Imputation: Fill in with mean value

•  Popular method
   –  Can be good for supervised classification
   –  Apply separately to each attribute

Name     Age
Daisy    10
Maisy    15
Harry    2
Jackie   -

Jackie's age is imputed to be (10+15+2)/3 = 9
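The same computation as a short Python sketch, with None marking the missing age:

```python
# Mean imputation for one attribute: fill each missing age with the
# mean of the observed ages.
ages = {"Daisy": 10, "Maisy": 15, "Harry": 2, "Jackie": None}

observed = [a for a in ages.values() if a is not None]
mean_age = sum(observed) / len(observed)  # (10 + 15 + 2) / 3 = 9.0

imputed = {name: (mean_age if a is None else a) for name, a in ages.items()}
print(imputed["Jackie"])  # 9.0
```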

Imputation: Fill in with mean value cont

•  Drawbacks
   –  Reduces the variance of the feature
   –  Gives an incorrect view of the distribution of that attribute
   –  Changes relationships to other features
•  Can also use the median instead of the mean (if the distribution is skewed)
•  Use mode (most frequent value) imputation for categorical features

Fill in with category mean

•  Take categories/clusters and compute the mean within each

Name     Age   Gender
Daisy    10    Female
Maisy    15    Female
Harry    2     Male
Jackie   -     Female

Jackie's age is imputed to be (10+15)/2 = 12.5 (considering the category "Female")
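A Python sketch of category-mean imputation, grouping the observed ages by Gender before filling:

```python
from collections import defaultdict

# Category-mean imputation: a missing age is filled with the mean age
# of that instance's category (here, Gender).
people = [
    {"Name": "Daisy", "Age": 10, "Gender": "Female"},
    {"Name": "Maisy", "Age": 15, "Gender": "Female"},
    {"Name": "Harry", "Age": 2, "Gender": "Male"},
    {"Name": "Jackie", "Age": None, "Gender": "Female"},
]

# Mean age per category, computed from observed values only.
by_category = defaultdict(list)
for p in people:
    if p["Age"] is not None:
        by_category[p["Gender"]].append(p["Age"])
means = {g: sum(v) / len(v) for g, v in by_category.items()}

for p in people:
    if p["Age"] is None:
        p["Age"] = means[p["Gender"]]

print(people[3]["Age"])  # 12.5 -- mean of the Female ages, (10 + 15) / 2
```

Using the category mean rather than the global mean preserves more of the between-group structure, at the cost of assuming the category is informative about the missing value.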

Time series: Last value carried forward

Day     Kilometres Walked
Day 1   8.9
Day 2   8.2
Day 3   9.6
Day 4   -
Day 5   11.6
Day 6   12.0

Kilometres walked on Day 4 = ?
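Last value carried forward as a Python sketch, with None marking the missing Day 4 entry:

```python
# Last value carried forward: a missing time-series entry takes the
# most recently observed value.
km_walked = [8.9, 8.2, 9.6, None, 11.6, 12.0]  # Days 1..6

filled = []
last = None
for value in km_walked:
    if value is None:
        value = last  # carry the previous observation forward
    filled.append(value)
    last = value

print(filled[3])  # 9.6 -- Day 4 takes Day 3's value
```

This only makes sense for ordered (time-series) data, and it silently assumes nothing changed during the gap.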

Acknowledgements

–  Data Mining: Concepts and Techniques. Han, Kamber and Pei. 3rd edition (chapter 3). Available through the library as an ebook.
–  Data Analysis Using Regression and Multilevel/Hierarchical Models. Gelman and Hill (chapter 25), 2006.

Next Week

•  Second workshop is available on the LMS
   –  Practice with JSON, XML and web scraping
•  Project will be released
•  Continue data pre-processing and cleaning
   –  Look at more complex techniques for value imputation (e.g. for the movie recommender system example)

Recommended