promptcloud
Challenges with crawling AJAX pages on the web and their solutions.
Scraping AJAX Pages
Big Data made small
What’s AJAX on a web page?
1. Filters
2. Load more results
3. Forms
and others...
GET vs. POST
GET: Client → Server
http://example.com?date=20140410
POST: Client → Server
http://example.com
Payload: Form Data, JSON Strings, Query Parameters, View States, etc.
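The difference between the two request types can be sketched with Python's standard library. This is a minimal illustration, not the crawler's actual code: the helper names `build_get` and `build_post` are hypothetical, and the form-encoded payload is only one of the payload types listed above. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(base_url, params):
    """GET: the parameters travel in the URL's query string."""
    return Request(f"{base_url}?{urlencode(params)}", method="GET")


def build_post(base_url, params):
    """POST: the URL stays clean and the parameters travel in the body."""
    body = urlencode(params).encode("ascii")
    return Request(
        base_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


get_req = build_get("http://example.com", {"date": "20140410"})
post_req = build_post("http://example.com", {"date": "20140410"})

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

A crawler that only follows URLs sees the GET variant in full, but has to reconstruct the POST body itself, which is why the payload matters for AJAX pages.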
What makes crawling AJAX difficult?
Challenge 1- JavaScript Calls
Solution- Emulate JavaScript calls using headless browsers
Data fetched from within JavaScript code
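One way to emulate the JavaScript calls is to let a headless browser run them and read the page only once the AJAX content has appeared. The sketch below assumes Selenium 4 with Chrome installed; the slides do not name a specific tool, and `make_headless_chrome`, `fetch_rendered_html`, and the CSS selector are hypothetical names chosen for illustration. Selenium is imported lazily so the polling logic can be exercised without it.

```python
import time


def make_headless_chrome():
    """Create a headless Chrome driver (assumes Selenium 4 and Chrome)."""
    from selenium import webdriver  # lazy import: only needed for a real browser

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, css_selector, driver_factory=make_headless_chrome,
                        timeout=10.0, poll=0.25):
    """Load url, wait until css_selector matches, return the rendered HTML."""
    driver = driver_factory()
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
            if driver.find_elements("css selector", css_selector):
                return driver.page_source
            time.sleep(poll)
        raise TimeoutError(f"{css_selector!r} did not appear within {timeout}s")
    finally:
        driver.quit()  # always release the browser process
```

A call such as `fetch_rendered_html("http://example.com", ".result")` would return the HTML after the scripts have populated the page, at the cost of running a full browser per fetch.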
Challenge 2- Fetch Bandwidths
Solution- Optimize fetch limits
Incomplete page fetched because of low fetch age
Image Credit: ticketmaster.com
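The slides do not define "fetch limits" precisely. Under the assumption that a limit is a cap on how many bytes the crawler reads per page, the sketch below shows how a fetch can detect that it stopped short, so the limit can be raised for that site instead of silently storing an incomplete page. The function name `read_with_limit` is hypothetical.

```python
import io


def read_with_limit(stream, max_bytes, chunk_size=8192):
    """Read at most max_bytes from stream; return (data, truncated)."""
    chunks = []
    total = 0
    while total < max_bytes:
        chunk = stream.read(min(chunk_size, max_bytes - total))
        if not chunk:
            # The stream ended before the limit: the page is complete.
            return b"".join(chunks), False
        chunks.append(chunk)
        total += len(chunk)
    # The limit was reached: one more byte tells us whether data was cut off.
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated


page = io.BytesIO(b"x" * 100)  # stand-in for an HTTP response body
data, truncated = read_with_limit(page, max_bytes=50)
print(len(data), truncated)
```

Any object with a `read(n)` method works here, including the response returned by `urllib.request.urlopen`.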
Challenge 3- .NET Architectures
Solution- Track states, pass event validations, restore states for mitigation
ViewState
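Tracking state on an ASP.NET Web Forms page usually means reading the hidden fields from the last response and sending them back with the next postback. The sketch below collects every hidden input (which covers `__VIEWSTATE` and `__EVENTVALIDATION`) and adds the `__EVENTTARGET` and `__EVENTARGUMENT` fields that Web Forms postbacks use. The class and function names are hypothetical, and the sample control ID is invented for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Return the form payload for the next postback, carrying state forward."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target      # control that "fired" the event
    payload["__EVENTARGUMENT"] = event_argument
    return payload


sample = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgKx" />
  <input type="text" name="query" value="ignored" />
</form>
"""
print(build_postback(sample, "ctl00$Main$NextPage"))  # hypothetical control ID
```

Keeping the previous payload around also gives a way to restore state: if a postback is rejected, the crawler can resend the last payload that the server accepted.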
Challenge 4- Page Encoding
Solution- Send requests with the content type, media type, and Accept header parameters the server expects, and parse responses in the same format the server returns
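As a rough sketch of this idea, the request below declares what it sends (`Content-Type`) and what it wants back (`Accept`), and the response is decoded using the charset and media type the server reports. The helper names are hypothetical, JSON is assumed as the exchange format purely for illustration, and the `X-Requested-With` header is a common convention for AJAX calls rather than something the slides specify.

```python
import json
from email.message import Message
from urllib.request import Request


def build_json_post(url, payload):
    """Build (but do not send) a POST whose headers match its JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, method="POST", headers={
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # convention, an assumption here
    })


def decode_body(raw, content_type_header, default_charset="utf-8"):
    """Decode raw bytes according to the response's Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_content_charset() or default_charset
    text = raw.decode(charset, errors="replace")
    if msg.get_content_type() == "application/json":
        return json.loads(text)
    return text


req = build_json_post("http://example.com", {"date": "20140410"})
print(req.header_items())
print(decode_body("café".encode("iso-8859-1"), "text/html; charset=ISO-8859-1"))
```

Decoding with the declared charset rather than a fixed one avoids garbled text when a site serves something other than UTF-8.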
Use Case- Crawl Ticketing Sites