R and Collecting Internet Data

Luis F. Campos
Department of Statistics, University of California, Berkeley
February 11, 2011 - 4-5 PM - 1011 Evans Hall

Outline

1 Motivation
    Internet Movie Database (IMDb)
    Music Databases
    Yahoo! Sports
    Main example: Superbowl - Play by Play Data

2 Tools we need to learn
    XML/HTML
    XML Path Query Language (XPath QL)
    R Programming Language
    R Package: XML
    Regular Expressions (regex)
    Demonstration/Resources

Internet Movie Database (IMDb)


Music Databases


Yahoo! Sports


Main example: Superbowl - Play by Play Data


What is actually going on in your browser?


XML/HTML


What is XML?

XML stands for eXtensible Markup Language.

    markup languages: provide annotations for structure (marks) that are distinguishable from text (examples: LaTeX, HTML, PostScript)
    extensible: you can create your own marks

In XML, the structure markers are angle brackets: < >
HTML is a close relative of XML: both descend from SGML, and XHTML is HTML rewritten to follow XML's rules.
HTML (HyperText Markup Language) is currently the predominant markup language on the web.
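As a small illustration of "you can create your own marks", the sketch below parses a made-up XML snippet in R. The tag names game and play, the quarter attribute, and the text are all hypothetical, and the XML package (introduced later in the talk) is assumed to be installed.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical markup: none of these tag names is predefined anywhere.
snippet <- '<game><play quarter="1">Kickoff</play></game>'
doc <- xmlParse(snippet, asText = TRUE)

play <- getNodeSet(doc, "//play")[[1]]
play_name    <- xmlName(play)                # the mark we invented
play_quarter <- xmlGetAttr(play, "quarter")  # its attribute, as a string
play_text    <- xmlValue(play)               # the text it annotates
```

Because XML only fixes the angle-bracket syntax, the parser accepts these invented marks without any prior declaration.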


What is HTML? Simply, a set of predetermined structural markers.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
      <head>
        <title>The document title</title>
      </head>
      <body>
        <h1>Main heading</h1>
        <p>A paragraph.</p>
        <a href = "www.stat.berkeley.edu">Statistics Website</a>
      </body>
    </html>


Another useful way to view an HTML document is as a tree.


We call the boxes nodes and the arrows edges.

An edge goes from a to b if b is nested directly inside a:

    <a>
      <b></b>
    </a>

Note: there is a unique path from the root node to any given node.
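The edge from a to b can be checked in R with a minimal sketch like the one below. It assumes the XML package is installed and uses the two-node toy document from the slide; the variable names are only illustrative.

```r
library(XML)  # assumes the XML package is installed

# Toy document from the slide: node b is nested directly inside node a.
doc <- xmlParse("<a><b></b></a>", asText = TRUE)
b <- getNodeSet(doc, "//b")[[1]]

# The edge a -> b means that a is the parent of b.
parent_name <- xmlName(xmlParent(b))

# The path from the root down to b, written out step by step,
# matches exactly one node.
n_matches <- length(getNodeSet(doc, "/a/b"))
```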


Each node can have the following elements associated with it:

    a required name for the node, not necessarily unique
    any number of attributes
    optional text

    <a href = "www.stat.berkeley.edu">Statistics Website</a>

    node name: a
    attribute: href with value "www.stat.berkeley.edu"
    text: "Statistics Website"
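The three pieces of a node can be pulled apart in R roughly as follows. This is a sketch that assumes the XML package is installed; it parses the anchor from the slide as a text fragment instead of downloading a page.

```r
library(XML)  # assumes the XML package is installed

# The anchor node from the slide, parsed from a string.
html <- '<a href = "www.stat.berkeley.edu">Statistics Website</a>'
doc <- htmlParse(html, asText = TRUE)
a <- getNodeSet(doc, "//a")[[1]]

node_name <- xmlName(a)             # name of the node
href      <- xmlGetAttr(a, "href")  # value of the href attribute
node_text <- xmlValue(a)            # text inside the node
```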


XML Path Query Language (XPath QL)


XPath is a query language that allows you to retrieve specific nodes in an XML/HTML tree. Some of the symbols we can use to represent a path on the tree:

    /          at the start of a path, begins at the root of the tree; elsewhere it separates steps
    //         selects from anywhere in the tree
    .          selects the current node
    ..         selects the parent of the current node
    @          selects attributes
    nodename   finds the location(s) of the named node
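A short sketch of these symbols in use, assuming the XML package is installed. The page below is a hypothetical toy document, not one from the talk.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical toy page with a single anchor inside a div.
page <- '<html><body><div><a href="stat.berkeley.edu">stats</a></div></body></html>'
doc <- htmlParse(page, asText = TRUE)

root    <- getNodeSet(doc, "/html")   # /  : start at the root
anchors <- getNodeSet(doc, "//a")     # // : search anywhere in the tree
a       <- anchors[[1]]

# A relative path is evaluated from the node it is given.
parent_name <- xmlName(getNodeSet(a, "..")[[1]])  # .. : parent of a

# @ selects attributes; the result is converted to a plain character vector.
hrefs <- as.character(xpathSApply(doc, "//a/@href"))
```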


To traverse to the only anchor node in a tree of the form html > body > div > a, we can use:

    "/html/body/div/a" or "//a"

If we had <a href = "stat.berkeley.edu">stats</a>:

    "//a[@href = 'stat.berkeley.edu']" or "//a[text() = 'stats']"
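These queries can be tried in R on a toy page. The sketch below assumes the XML package is installed and that the tree has the shape html > body > div > a; the second anchor is a hypothetical addition so that the predicates have something to filter out.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical page: two anchors, only one of which we want.
page <- paste0(
  '<html><body>',
  '<div><a href="stat.berkeley.edu">stats</a></div>',
  '<p><a href="example.org">other</a></p>',
  '</body></html>'
)
doc <- htmlParse(page, asText = TRUE)

by_path <- getNodeSet(doc, "/html/body/div/a")                  # full path from the root
by_href <- getNodeSet(doc, "//a[@href = 'stat.berkeley.edu']")  # filter on an attribute
by_text <- getNodeSet(doc, "//a[text() = 'stats']")             # filter on the text

found_text <- xmlValue(by_href[[1]])
```

Note that the quotes inside the XPath string are plain single quotes; typographic quotes copied from a slide will not parse.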


R Programming Language


Why R, in general?

    Plenty of packages! So you can extend R to do whatever you want.
    It’s free! So you can spend money on other things.

Why R, for collecting internet data?

    Collecting data for statistical purposes <-> R is made for statistics
    Implementations of XPath and regular expressions are fairly intuitive
    There are other ways, but they follow similar procedures


R: briefly

R is a command-line interpreted language (it interprets the commands you type).

We’ll be using functions provided by packages (XML) and the base package. Example of calling a function:

    x <- fun(unnamed, arg = named)

It’s usually a good idea to name all named arguments when calling a function (otherwise R will match them by position).

Notice that if fun returns a value, it will be stored in x.
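To make the point about named arguments concrete, here is a small sketch using the base function seq, chosen only as a familiar example; it is not part of the talk.

```r
# Arguments given by name can appear in any order.
by_name     <- seq(from = 1, to = 10, by = 3)
reordered   <- seq(by = 3, to = 10, from = 1)

# Unnamed arguments are matched by position: from, to, by.
by_position <- seq(1, 10, 3)

# All three calls produce the same vector, and each result is stored
# in the variable on the left of <-.
same <- identical(by_name, reordered) && identical(by_name, by_position)
```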


R Package: XML


Created by Prof. Duncan Temple Lang. This package is a collection of functions that will enable us to work with XML/HTML tree structures. Some of the functions we will use are:

htmlTreeParse: gets the specified HTML page and creates an internal tree structure. We can then use XPath to traverse the tree.

    > url = "http://en.wikipedia.org/wiki/Main_Page"
    > doc = htmlTreeParse(url, useInternalNodes = TRUE)

getNodeSet: takes the parsed document and an XPath expression describing the nodes we’re looking for:

    > x1 = getNodeSet(doc, "//div[@class = 'bd']")
    > x2 = getNodeSet(doc, "//a")
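The same two calls can be tried without a network connection by parsing a string. The sketch below assumes the XML package is installed; the page content is a hypothetical stand-in for a real site.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical page: two div nodes, one of them with class "bd".
page <- paste0(
  '<html><body>',
  '<div class="bd"><a href="a.html">first</a></div>',
  '<div class="other"><a href="b.html">second</a></div>',
  '</body></html>'
)

# asText = TRUE tells the parser that the input is HTML text, not a file or URL.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

x1 <- getNodeSet(doc, "//div[@class = 'bd']")  # only the div with class "bd"
x2 <- getNodeSet(doc, "//a")                   # every anchor in the page
```

Parsing from a string is also a convenient way to test an XPath expression before pointing it at a live page.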


x1 will contain the locations of all the nodes named "div" that have a class attribute with value "bd".

x2 will contain the locations of ALL the anchor nodes (which contain links)! Which may or may not be useful (unless you’re Google).

xmlParent: gets the location of the parent of the specified node:

    > xmlParent(x1[[1]])

xmlChildren, xmlAncestors: get all children or ancestors of a given node.

xmlValue: gets the text of a given node.
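A minimal sketch of these navigation functions on a toy page, assuming the XML package is installed. The page content and variable names are hypothetical.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical page: one div with class "bd" holding a paragraph and a link.
page <- '<html><body><div class="bd"><p>Hello</p><a href="x.html">link</a></div></body></html>'
doc <- htmlParse(page, asText = TRUE)
x1 <- getNodeSet(doc, "//div[@class = 'bd']")

parent_name <- xmlName(xmlParent(x1[[1]]))       # node directly above the div
kids        <- xmlChildren(x1[[1]])              # nodes directly below the div
kid_names   <- unname(sapply(kids, xmlName))
kid_text    <- unname(sapply(kids, xmlValue))    # text of each child

# Names of the nodes on the chain between the root and the div.
ancestor_names <- xmlAncestors(x1[[1]], xmlName)
```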


One shortcut:

If you know you want to get a table element from an HTML file (a table is a very specific HTML element!), readHTMLTable will retrieve all table nodes and transform them into a workable form for you!

So, we’re done, right?!
Correct! No, not everything is stored in a table!
This is why we learned all this!
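The shortcut might look like the sketch below, which assumes the XML package is installed. The table contents are made-up toy values, and header = TRUE is passed explicitly so that the first row is treated as column names.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical page containing one small table (toy values only).
page <- paste0(
  '<html><body><table>',
  '<tr><th>team</th><th>points</th></tr>',
  '<tr><td>A</td><td>10</td></tr>',
  '<tr><td>B</td><td>7</td></tr>',
  '</table></body></html>'
)
doc <- htmlParse(page, asText = TRUE)

# readHTMLTable returns a list with one data frame per table node.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
first_table <- tables[[1]]
```

Cell values come back as text by default, so numeric columns usually need an explicit conversion such as as.numeric before analysis.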


Regular Expressions (regex)


Provide a framework for finding specific text, or any of a list of possible texts, in a body of text.

We can specify the string we are looking for with the following operators:

    OR operator - | - ex: "gray|grey" finds all instances of either "grey" or "gray"
    Grouping (parentheses) - ( ) - ex: "gr(a|e)y". Same statement as above
    Quantifiers (number of), given after the character or group:
        question mark - ? - zero or one - ex: "colou?r" finds "color" or "colour"
        asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc", "abbc", ...
        plus sign - + - one or more - ex: "ab+c" finds all from above except "ac"

Luis F. Campos UC, Berkeley

R and Collecting Internet Data

Motivation Tools we need to learn

Regular Expressions (regex)

Provide a framework for finding specific text in a body oftext for a list of possible texts.We can specify the string we are looking for with thefollowing quantifiers:

OR operator - | - ex: "gray|grey" finds all instances of either"grey" or "gray"Grouping (parenthesis)- ( ) - ex: "gr(a|e)y". Same statementas aboveQuantifiers (number of), given after the character or group:

question mark - ? - zero or one - ex: "colou?r" finds "color"or "colour"

asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc","abbc", ...plus sign - + - one or more - ex: "ab+c" finds all from aboveexcept "ac"

Luis F. Campos UC, Berkeley

R and Collecting Internet Data

Motivation Tools we need to learn

Regular Expressions (regex)

Provide a framework for finding specific text in a body oftext for a list of possible texts.We can specify the string we are looking for with thefollowing quantifiers:

OR operator - | - ex: "gray|grey" finds all instances of either"grey" or "gray"Grouping (parenthesis)- ( ) - ex: "gr(a|e)y". Same statementas aboveQuantifiers (number of), given after the character or group:

question mark - ? - zero or one - ex: "colou?r" finds "color"or "colour"asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc","abbc", ...

plus sign - + - one or more - ex: "ab+c" finds all from aboveexcept "ac"

Luis F. Campos UC, Berkeley

R and Collecting Internet Data

Motivation Tools we need to learn

Regular Expressions (regex)

Provide a framework for finding specific text in a body oftext for a list of possible texts.We can specify the string we are looking for with thefollowing quantifiers:

OR operator - | - ex: "gray|grey" finds all instances of either"grey" or "gray"Grouping (parenthesis)- ( ) - ex: "gr(a|e)y". Same statementas aboveQuantifiers (number of), given after the character or group:

question mark - ? - zero or one - ex: "colou?r" finds "color"or "colour"asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc","abbc", ...plus sign - + - one or more - ex: "ab+c" finds all from aboveexcept "ac"

Luis F. Campos UC, Berkeley

R and Collecting Internet Data

Motivation Tools we need to learn

Regular Expressions (regex)

POSIX Expressions:

. - matches ANY single character - ex: "a.c" finds "aac", "abc", "acc", ...
[ ] - matches a single character from the brackets - ex: "gr[ea]y" finds "grey" or "gray"
These are different from parentheses since ranges can be specified in square brackets: a-z means any lowercase letter.
There are a ton more. See the RegEx reference under Resources.
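The dot and bracket expressions can be tried out in R in the same way. This is only a sketch: the `candidates` vector and the `^`/`$` anchors are assumptions added for illustration. Note that R's documentation warns that ranges such as a-z can depend on the locale; the character class `[[:lower:]]` is the portable alternative.

```r
# Hypothetical candidates, including strings that should NOT match.
candidates <- c("aac", "abc", "acc", "ac", "grey", "gray", "groy", "gr8y")

# "." stands for exactly one character, so "ac" (too short) is excluded.
dot_hits <- candidates[grepl("^a.c$", candidates)]

# "[ea]" stands for exactly one character taken from the listed set.
set_hits <- candidates[grepl("^gr[ea]y$", candidates)]

# "[a-z]" is a range: any one lowercase letter, so "gr8y" is excluded.
range_hits <- candidates[grepl("^gr[a-z]y$", candidates)]

print(dot_hits)
print(set_hits)
print(range_hits)
```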


Regular Expressions in R are used in the following functions:

grep finds a pattern in a vector of candidates and returns the indices of the matching elements:
> grep("abc", c("abcdsf", "fabcasda", "cba"))
[1] 1 2

gsub replaces every occurrence of a pattern with another string in a vector of candidates:
> gsub("abc", "CANDY", c("abcdsf", "fabca", "cba"))
[1] "CANDYdsf" "fCANDYa" "cba"
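A few close relatives of these calls are often useful as well. The sketch below reuses the candidate vector from the grep example; `grepl`, `sub`, and the `value = TRUE` argument are standard base R but are not mentioned on the slides, and the variable names are chosen here for illustration only.

```r
x <- c("abcdsf", "fabcasda", "cba")

# value = TRUE returns the matching strings instead of their indices.
matched <- grep("abc", x, value = TRUE)

# grepl returns one TRUE/FALSE per element, handy for subsetting.
is_match <- grepl("abc", x)

# sub replaces only the first match in each string; gsub replaces all.
first_only <- sub("a", "A", x)
all_hits   <- gsub("a", "A", x)

# Patterns can use the bracket syntax: delete every "a", "b", or "c".
stripped <- gsub("[abc]", "", x)

print(matched)
print(is_match)
print(first_only)
print(all_hits)
print(stripped)
```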


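strsplit splits each string in a vector at every match of a pattern and returns a list with one element per input string:
> strsplit(c("abcdsf", "fabcasda", "cba"), "abc")
[[1]]
[1] ""    "dsf"

[[2]]
[1] "f"    "asda"

[[3]]
[1] "cba"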
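Because the split pattern is itself a regular expression, strsplit can cope with inconsistent delimiters. The sketch below is only an illustration: the `play` string is a made-up, play-by-play style line (not real data), and the semicolon delimiter is an assumption about how such text might be laid out.

```r
# Hypothetical play-by-play style line; not taken from any real source.
play <- "Q1 ; 1st and 10;  pass complete for 12 yards"

# Split on a semicolon together with any spaces around it.
# strsplit returns a list, so [[1]] extracts the pieces for this one string.
fields <- strsplit(play, " *; *")[[1]]

# Pull the number out of the last field by deleting every non-digit.
yards <- as.numeric(gsub("[^0-9]", "", fields[3]))

print(fields)
print(yards)
```

Combining strsplit with gsub in this way is one possible route from a raw line of text to a numeric value ready for analysis.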


Demonstration/Resources


We’ll go through a quick Demo! (time permitting)

Resources:

R: http://cran.r-project.org/
R::XML: http://cran.r-project.org/web/packages/XML/index.html
Duncan Temple Lang: http://www.stat.ucdavis.edu/~duncan/
XPath Tutorial: http://www.w3schools.com/xpath/default.asp
RegEx: http://www.regular-expressions.info/reference.html, Wikipedia
This Presentation: http://www.stat.berkeley.edu/~luis/
