Web Crawling- Scraping Ajax Sites

DESCRIPTION

Challenges with crawling AJAX pages on the web and their solutions.

Page 1: Web Crawling- Scraping Ajax Sites

Scraping AJAX Pages

Big Data made small

Page 2: Web Crawling- Scraping Ajax Sites

What’s AJAX on a web page?

1. Filters

2. Load more results

3. Forms

and others...

Page 3: Web Crawling- Scraping Ajax Sites

GET vs. POST

GET: Client -> Server

http://example.com?date=20140410

POST: Client -> Server

http://example.com + Payload

Payload: Form Data, JSON Strings, Query Parameters, View States, etc.
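
The GET and POST requests contrasted above can be sketched with Python's standard library. This is a minimal illustration, not a full crawler: the helper names `build_get` and `build_post_form` are hypothetical, and the requests are only constructed, never sent. The point is where the `date` parameter travels: in the URL for GET, in the body (payload) for POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.com"  # URL taken from the slide


def build_get(params):
    """GET: parameters are appended to the URL as a query string."""
    return Request(f"{BASE_URL}?{urlencode(params)}", method="GET")


def build_post_form(fields):
    """POST: the URL stays clean and the parameters travel in the body."""
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(BASE_URL, data=body, headers=headers, method="POST")


get_req = build_get({"date": "20140410"})
post_req = build_post_form({"date": "20140410"})

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

A crawler that only records URLs sees both POST results pages as the same address, which is why the payload has to be captured and replayed as well.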

Page 4: Web Crawling- Scraping Ajax Sites

What makes crawling AJAX difficult?

Page 5: Web Crawling- Scraping Ajax Sites

Challenge 1- JavaScript Calls

Solution- Emulate JavaScript calls using headless browsers

Data is loaded by JavaScript code rather than delivered in the initial HTML
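
One way to emulate those calls is to let a headless browser run the page's scripts and then read the rendered DOM. The sketch below assumes Selenium 4 with a local Chrome install; the slides do not name a tool, so this choice, the function name `fetch_rendered_html`, and the CSS selector argument are all assumptions.

```python
def fetch_rendered_html(url, wait_css, timeout_s=10):
    """Load `url` in headless Chrome and return the HTML after scripts have run.

    `wait_css` is a CSS selector for an element that only exists once the
    AJAX content has arrived (hypothetical; it depends on the target site).
    """
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the AJAX-loaded element appears, or raise on timeout.
        WebDriverWait(driver, timeout_s).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Usage would look like `fetch_rendered_html("http://example.com", "ul.results li")`, where the selector is a placeholder for whatever element the site injects.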

Page 6: Web Crawling- Scraping Ajax Sites

Challenge 2- Fetch Bandwidths

Solution- Optimize fetch limits

Incomplete page fetched because of low fetch age
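
The slide does not say how the limits are tuned. One possible reading, sketched here under that assumption, is to retry with a progressively larger per-request budget until the page looks complete. The budget is modelled as a byte cap, and `fetch_with_growing_limit`, `fake_fetch`, and `looks_complete` are hypothetical names; a real crawler would replace `fake_fetch` with an HTTP call and might cap wait time instead of bytes.

```python
def fetch_with_growing_limit(fetch, limits, is_complete):
    """Try each limit in order; return (body, limit) for the first complete fetch.

    Returns (last_body, None) if no limit produced a complete page.
    """
    body = b""
    for limit in limits:
        body = fetch(limit)
        if is_complete(body):
            return body, limit
    return body, None


# Stand-in for a long results page; a real fetch would hit the network.
FULL_PAGE = b"<html><body>" + b"<li>event</li>" * 50 + b"</body></html>"


def fake_fetch(max_bytes):
    """Simulate a fetcher that truncates the response at `max_bytes`."""
    return FULL_PAGE[:max_bytes]


def looks_complete(body):
    """Cheap completeness check: the document ends with its closing tag."""
    return body.rstrip().endswith(b"</html>")


body, used_limit = fetch_with_growing_limit(
    fake_fetch, [256, 512, 1024], looks_complete
)
print(used_limit, len(body))
```

Starting small and growing the limit keeps the common case cheap while still recovering the long pages that a fixed low limit would cut off.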

Image Credit: ticketmaster.com

Page 7: Web Crawling- Scraping Ajax Sites

Challenge 3- .NET Architectures

Solution- Track states, pass event validations, restore states for mitigation

ViewState
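
Tracking state on an ASP.NET WebForms page usually means reading the hidden `__VIEWSTATE` and `__EVENTVALIDATION` fields from the last response and sending them back with the next POST. The sketch below shows that step with the standard library only. The sample HTML, the field values, the control name `ctl00$btnNext`, and the helper `build_postback` are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, extra_fields=None):
    """Build the POST payload for the next request from the current page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)            # carries __VIEWSTATE etc. forward
    payload["__EVENTTARGET"] = event_target  # control that "fired" the postback
    payload.setdefault("__EVENTARGUMENT", "")
    payload.update(extra_fields or {})       # visible form inputs, if any
    return payload


# Made-up page fragment standing in for a real server response.
SAMPLE_PAGE = """
<form method="post" action="./Results.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf" />
  <input type="text" name="city" value="" />
</form>
"""

payload = build_postback(SAMPLE_PAGE, "ctl00$btnNext", {"city": "Boston"})
print(sorted(payload))
```

Because the server issues fresh state values with each response, the payload has to be rebuilt from the latest page before every request; keeping the previous payload around gives a state to fall back to if a postback is rejected.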

Page 8: Web Crawling- Scraping Ajax Sites

Challenge 4- Page Encoding

Solution- Send requests with the content type, media type, and Accept header parameters the server expects, and parse responses using the format and encoding the server declares
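
A minimal sketch of both halves, using only the standard library: declare the request's format explicitly, and decode the response with the charset named in its `Content-Type` header. The helper names `build_json_request` and `decode_body`, the JSON payload, and the sample bytes are assumptions for illustration; the request is built but not sent.

```python
import json
from email.message import Message
from urllib.request import Request


def build_json_request(url, payload):
    """Build a POST that states its own format and the format it accepts."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset from a Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return raw.decode(charset, errors="replace")


req = build_json_request("http://example.com", {"date": "20140410"})

# Simulated response: Latin-1 bytes plus the header a server might send.
raw_bytes = "caf\u00e9".encode("iso-8859-1")
text = decode_body(raw_bytes, "text/html; charset=ISO-8859-1")
print(req.get_header("Content-type"), text)
```

Decoding the same bytes as UTF-8 would produce a replacement character instead of the accented letter, which is the kind of silent corruption this challenge refers to.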

Page 9: Web Crawling- Scraping Ajax Sites

Use Case- Crawl Ticketing Sites

Page 10: Web Crawling- Scraping Ajax Sites

Thank You!

Have specific queries on AJAX crawling?

Reach out to [email protected].