MACHINE LEARNING web crawlers

Web crawlers part-1-20161102



Page 1: Web crawlers part-1-20161102

MACHINE LEARNING web crawlers

Page 2: Web crawlers part-1-20161102

BREAKING NEWS

Page 5: Web crawlers part-1-20161102

WRITING WEB CRAWLERS

Page 7: Web crawlers part-1-20161102

• screen scraping, data mining, web harvesting, web scraping

• web scraping, data mining

Page 8: Web crawlers part-1-20161102

“In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.”
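The process the quote describes — query a web server, fetch the HTML, then parse it for the pieces you need — can be sketched for the parse step using only the standard library. The HTML snippet below is a stand-in for a real response body, not content from any actual site:

```python
from html.parser import HTMLParser

# Minimal illustration of the "parse that data" step: collect the text of
# every <h1> tag in an HTML document using only the standard library.
class H1Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data)

# A stand-in for HTML fetched from a web server.
page = "<html><body><h1>Happy Ninja</h1><p>A sample product.</p></body></html>"
parser = H1Extractor()
parser.feed(page)
print(parser.headings)  # → ['Happy Ninja']
```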

Page 9: Web crawlers part-1-20161102

There are several reasons why an API might not exist:

• You are gathering data across a collection of sites that do not have a cohesive API.

• The data you want is a fairly small, finite set that the webmaster did not think warranted an API.

• The source does not have the infrastructure or technical ability to create an API.

Page 11: Web crawlers part-1-20161102

PIWIK

Page 12: Web crawlers part-1-20161102

GOOGLE ANALYTICS

Page 13: Web crawlers part-1-20161102

URLOPEN

from urllib.request import urlopen

html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")

print(html.read())
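Note that urlopen(...).read() returns raw bytes, so print(html.read()) shows a b"..." literal rather than readable markup. Decoding the bytes first is usually what you want; a bytes literal stands in here for a real HTTP response body:

```python
# urlopen(...).read() yields bytes; decode before treating the page as text.
# The bytes literal below is a stand-in for an actual response body.
raw = b"<html><head><title>Happy Ninja 2</title></head></html>"
text = raw.decode("utf-8")  # most pages are UTF-8; check the charset if unsure
print(text)
```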

Page 14: Web crawlers part-1-20161102

BEAUTIFUL SOUP

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://sampleshop.pl")
bsObj = BeautifulSoup(html.read(), "html.parser")

print(bsObj.h1)
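Beyond bsObj.h1, BeautifulSoup can search the whole parse tree. A small sketch on an inline HTML snippet (a stand-in for html.read(); the tag names, classes, and product paths here are illustrative, not taken from sampleshop.pl):

```python
from bs4 import BeautifulSoup

# Illustrative HTML snippet standing in for a fetched shop page.
snippet = """
<html><body>
  <h1>Sample Shop</h1>
  <ul>
    <li class="product"><a href="/product/happy-ninja-2/">Happy Ninja 2</a></li>
    <li class="product"><a href="/product/flying-ninja/">Flying Ninja</a></li>
  </ul>
</body></html>
"""
soup = BeautifulSoup(snippet, "html.parser")
print(soup.h1.get_text())                       # first <h1> in the document
links = [a["href"] for a in soup.find_all("a")]  # every link on the page
print(links)
```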