11
WEB SCRAPING BY EMIL KJER RMIT UNIVERSITY [email protected] MARCH 2012 March 28, 2012

webscraping.pdf

Embed Size (px)

Citation preview

Page 1: webscraping.pdf

WEB SCRAPING BY EMIL KJER RMIT UNIVERSITY

[email protected] MARCH 2012

March 28, 2012

Page 2: webscraping.pdf

WHAT IS SCRAPING?

From http://youtu.be/--THSIi85I8 March 28, 2012 1 / 1

0

Page 3: webscraping.pdf

WHAT IS SCRAPING?

March 28, 2012 2 / 1

0

From http://youtu.be/--THSIi85I8

Page 4: webscraping.pdf

WHAT IS SCRAPING?

March 28, 2012 3 / 1

0

From http://youtu.be/--THSIi85I8

Page 5: webscraping.pdf

WHAT SCRAPING IS Web scraping / web harvesting / web data extraction 1.  Collect content from a source

•  Firefox, Chrome, Safari •  wget, curl, urllib2[python], URL[java], Net::HTTP[ruby]

2.  Do something to the content

•  Parser with regular expression •  BeautifulSoup [python] •  Jsoup [java]

March 28, 2012 4 / 1

0

Page 6: webscraping.pdf

WHAT TO USE IT FOR? •  Research •  News mashup •  Weather app

•  Movie comparison

•  Webcrawler

March 28, 2012 5 / 1

0

Page 7: webscraping.pdf

REMARKS Politeness

•  Heavy load •  Amount of data •  Private sources •  Bad code

Robustness

•  Data changes form •  Broken links

March 28, 2012 6 / 1

0

Page 8: webscraping.pdf

TIMETABLES In a browser go to: http://www.metlinkmelbourne.com.au/timetables Select transport mode http://www.metlinkmelbourne.com.au/timetables/metropolitan-trains Select a route e.g. Alamain Line http://www.metlinkmelbourne.com.au/route/view/1 Select a direction e.g. to city http://tt.metlinkmelbourne.com.au/tt/XSLT_TTB_REQUEST?command=direct&language=en&outputFormat=0&net=vic&line=02ALM&project=ttb&itdLPxx_selLineDir=H&sup=B Download the page via. the browser e.g. in Safari File > Save As… > Format "Page Source".

March 28, 2012 7 / 1

0

Page 9: webscraping.pdf

GET AND EXTRACT 1 Simple and “solution” using Python and BeautifulSoup

•  From file •  From url •  HTML parsed to objects

BeautifulSoup

http://www.crummy.com/software/BeautifulSoup/

March 28, 2012 8 / 1

0

Page 10: webscraping.pdf

GET AND EXTRACT 2 Using Java and Jsoup

•  From file •  From url •  HTML parsed to objects

Jsoup

http://jsoup.org Alternative from javax for parsing html as documents javax.swing.text.html.HTMLEditorKit

March 28, 2012 9 / 1

0

Page 11: webscraping.pdf

SUMMARY Many applications 1.  Collect data

2.  Do something with the data

Politeness and robustness

Source code at https://github.com/emilkjer/Web-Scraping

March 28, 2012 10 /

10