Web Crawling: Scraping AJAX Sites

DESCRIPTION

Challenges with crawling AJAX pages on the web and their solutions.

Scraping AJAX Pages

Big Data made small

What’s AJAX on a web page?

1. Filters
2. Load more results
3. Forms

and others...

GET vs. POST

GET (Client to Server): parameters travel in the URL

http://example.com?date=20140410

POST (Client to Server): parameters travel in the payload

http://example.com

Payload: Form Data, JSON Strings, Query Parameters, View States, etc.
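The GET/POST difference can be sketched with Python's standard library. This is a minimal illustration, not the crawler's actual code: the helper names (`build_get`, `build_post_form`, `build_post_json`) are hypothetical, and only the URL and the `date` parameter come from the slide. No request is sent; the objects are only constructed so the two shapes can be compared.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.com"  # URL shown on the slide


def build_get(params):
    """GET: parameters are visible in the URL's query string."""
    return Request(f"{BASE_URL}?{urlencode(params)}", method="GET")


def build_post_form(fields):
    """POST: the URL stays bare and the parameters go in the body as form data."""
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(BASE_URL, data=body, headers=headers, method="POST")


def build_post_json(payload):
    """POST: same idea, with the payload serialized as a JSON string."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(BASE_URL, data=body, headers=headers, method="POST")


get_req = build_get({"date": "20140410"})
post_req = build_post_form({"date": "20140410"})
json_req = build_post_json({"date": "20140410"})
```

For a crawler, the practical consequence is that a GET result can be reproduced from its URL alone, while a POST result also needs the payload the page's script would have sent.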

What makes crawling AJAX difficult?

Challenge 1: JavaScript Calls

Solution: Emulate JavaScript calls using headless browsers

The data is fetched by JavaScript code, so it is missing from the page's initial HTML
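One way to emulate the JavaScript calls is to let a headless browser run them and read the resulting page. The sketch below assumes Selenium 4 and a local Chrome install; the slides do not name a tool, so this choice, the function name `render_page`, and the `wait_css` parameter are all assumptions.

```python
def render_page(url, wait_css, timeout=15):
    """Return the HTML of `url` after its JavaScript has run.

    Assumes Selenium 4 and a local Chrome install (not named in the slides).
    `wait_css` is a CSS selector for an element that only exists once the
    AJAX data has arrived.
    """
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the AJAX-loaded element appears, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A call might look like `render_page("http://example.com", "div.results")`, where the selector is a placeholder for whatever element the target site fills in after loading.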

Challenge 2: Fetch Bandwidths

Solution: Optimize fetch limits

Incomplete page fetched because of low fetch age

Image Credit: ticketmaster.com
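A possible reading of "optimize fetch limits" is to raise the budget given to a fetch until the page comes back complete. The sketch below assumes that reading. `fetch` and `is_complete` are hypothetical callables supplied by the caller, and the limit values are arbitrary examples.

```python
def fetch_complete(fetch, url, is_complete, limits=(2, 5, 10)):
    """Retry a fetch with successively larger limits until the page looks complete.

    fetch(url, limit) -> str   hypothetical fetcher that honors a budget (e.g. seconds)
    is_complete(html) -> bool  hypothetical check, e.g. an expected marker is present
    Returns (html, limit_used); limit_used is None if no limit was enough.
    """
    html = ""
    for limit in limits:
        html = fetch(url, limit)
        if is_complete(html):
            return html, limit
    # Best effort: every limit produced an incomplete page.
    return html, None
```

Starting small and escalating keeps the common case cheap, and only slow pages pay for the larger limit.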

Challenge 3: .NET Architectures

Solution: Track states, pass event validations, restore states for mitigation

ViewState
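Tracking state on an ASP.NET Web Forms page usually means echoing the hidden state fields back with each postback. The sketch below is one minimal way to do that with the standard library. The function and class names are hypothetical, and the control name used in the usage note is a made-up placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hidden fields that ASP.NET Web Forms expects to be sent back unchanged.
STATE_FIELDS = {"__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"}


class StateParser(HTMLParser):
    """Collect the hidden state fields from a fetched page."""

    def __init__(self):
        super().__init__()
        self.state = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and attr.get("name") in STATE_FIELDS:
            self.state[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Build the form-encoded body for the next postback.

    The state fields are copied from the previous response; __EVENTTARGET
    names the control whose event is being simulated.
    """
    parser = StateParser()
    parser.feed(html)
    fields = dict(parser.state)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("utf-8")
```

Each response carries fresh state values, so a crawler paging through results would call something like `build_postback(previous_html, "ctl00$btnNext")` again after every request instead of reusing an old body.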

Challenge 4: Page Encoding

Solution: Send requests with the content type, media type, and Accept header parameters the server expects, and parse responses in the format the server returns
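The two halves of this solution, declaring the request format and decoding the response as the server declares it, can be sketched as follows. The helper names are hypothetical, and JSON over UTF-8 is only an assumed example format. A real site dictates its own.

```python
import json
from urllib.request import Request


def build_json_request(url, payload, charset="utf-8"):
    """Declare the body format (Content-Type) and the wanted reply format (Accept)."""
    body = json.dumps(payload).encode(charset)
    headers = {
        "Content-Type": f"application/json; charset={charset}",
        "Accept": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset declared in the Content-Type header."""
    charset = default
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            charset = value.strip('"')
    # Replace undecodable bytes instead of failing the whole page.
    return raw.decode(charset, errors="replace")
```

Decoding with the declared charset rather than a fixed one avoids garbled text on pages that are not served as UTF-8.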

Use Case: Crawl Ticketing Sites

Thank You!

Have specific queries on AJAX crawling?

Reach out to info@promptcloud.com.
