
Page 1: Spider duck

SpiderDuck: Twitter's Real-time URL Fetcher

Murzabayev Askhat aka @murzabayev

Presentation may contain adult language or discuss topics that are above a PG-13 level

Page 2: Spider duck

Introduction

• Tweets often contain URLs or links to a variety of content on the web.

• SpiderDuck is a service at Twitter that fetches all URLs shared in Tweets in real-time, parses the downloaded content to extract metadata of interest and makes that metadata available for other Twitter services to consume within seconds.
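To make the "parses the downloaded content to extract metadata" step concrete, here is a minimal sketch of pulling one piece of metadata (the page title) out of downloaded HTML. It uses Python's standard html.parser for brevity and is not SpiderDuck's actual code.

    from html.parser import HTMLParser

    class TitleExtractor(HTMLParser):
        """Collect the text inside the first <title> element."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title" and not self.title:
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    def extract_title(html_text):
        parser = TitleExtractor()
        parser.feed(html_text)
        return parser.title.strip()

    print(extract_title("<html><head><title>Hello</title></head></html>"))  # prints: Hello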

Page 3: Spider duck

So, what does it mean?

Yes, it means that we know everything about you and your tweets.

Page 4: Spider duck

Introduction (cont.)

• Several teams at Twitter need to access the linked content, typically in real-time, to improve Twitter products. For example:
– Search
– Clients
– Tweet Button
– Trust & Safety
– Analytics

Page 5: Spider duck

Background (Before Hijri)

• Prior to SpiderDuck, Twitter had a service that resolved all URLs shared in Tweets by issuing HEAD requests and following redirects.
– It resolved the URLs but did not actually download the content.
– It did not implement politeness rules typical of modern bots (e.g. rate limiting and following robots.txt directives).
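For illustration, here is a minimal Python sketch of that pre-SpiderDuck style of resolution, assuming nothing beyond what the slide states: HEAD requests, redirect following, and no body download. The function name and hop limit are invented for the example.

    import http.client
    from urllib.parse import urlparse, urljoin

    def resolve_url(url, max_hops=10):
        """Follow redirects with HEAD requests only; never download the body."""
        for _ in range(max_hops):
            parts = urlparse(url)
            Conn = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
            conn = Conn(parts.netloc, timeout=5)
            path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
            conn.request("HEAD", path)
            resp = conn.getresponse()
            location = resp.getheader("Location")
            conn.close()
            if resp.status in (301, 302, 303, 307, 308) and location:
                url = urljoin(url, location)   # handle relative Location headers
                continue
            return url                         # final URL: no more redirects
        raise RuntimeError("too many redirect hops")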

Page 6: Spider duck

Background (The Great Migration)

• We considered open source URL crawlers, but realized that almost all of the available crawlers have two properties we didn't need:
– They are recursive crawlers.
– They are optimized for large batch crawls.
What we needed was a fast, real-time URL fetcher.

Page 7: Spider duck

Background (After Hijri)

Page 8: Spider duck

System Overview

Kestrel: This is the message queuing system widely used at Twitter for queuing incoming Tweets.

Schedulers: These jobs determine whether to fetch a URL, schedule the fetch, and follow redirect hops, if any. Each scheduler performs its work independently of the others; that is, any number of schedulers can be added to horizontally scale the system as Tweet and URL volume grows.
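A hedged sketch of that scheduling loop, with in-process queues standing in for Kestrel and the Fetchers' Thrift endpoints, and a set standing in for the fetch-status check against the Metadata Store (all names here are assumptions, not SpiderDuck's code):

    import queue

    tweet_queue = queue.Queue()   # stand-in for a Kestrel queue of incoming Tweets
    fetch_queue = queue.Queue()   # stand-in for a Fetcher's Thrift endpoint
    recently_seen = set()         # stand-in for recent fetch status in the Metadata Store

    def schedule(urls):
        """Decide, per URL, whether a fetch is needed, and hand it to a Fetcher."""
        for url in urls:
            if url in recently_seen:
                continue              # fetched recently: clients reuse stored metadata
            recently_seen.add(url)
            fetch_queue.put(url)      # a real Scheduler would also drive redirect hops

    def run():
        while True:
            tweet = tweet_queue.get()           # blocking pop, like a Kestrel consumer
            schedule(tweet.get("urls", []))     # assumes Tweets arrive here as dicts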

Page 9: Spider duck

System Overview

Fetchers: These are Thrift servers that maintain short-term fetch queues of URLs, issue the actual HTTP fetch requests, and implement rate limiting and robots.txt processing. Like the Schedulers, Fetchers scale horizontally with fetch rate.

Memcached: This is a distributed cache used by the fetchers to temporarily store robots.txt files.
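The robots.txt handling and rate limiting might look roughly like the following sketch, where a plain dict stands in for Memcached and the TTL and per-host delay are invented values:

    import time
    import urllib.request
    from urllib import robotparser
    from urllib.parse import urlparse

    robots_cache = {}      # stand-in for Memcached: host -> (parser, fetched_at)
    last_fetch = {}        # per-host timestamp of the most recent fetch
    ROBOTS_TTL = 3600.0    # assumed TTL: cache a host's robots.txt for an hour
    MIN_INTERVAL = 1.0     # assumed politeness delay between hits to one host

    def allowed(url, agent="SpiderDuck"):
        """Check robots.txt, fetching and caching it once per host per TTL."""
        host = urlparse(url).netloc
        entry = robots_cache.get(host)
        if entry is None or time.time() - entry[1] > ROBOTS_TTL:
            rp = robotparser.RobotFileParser("http://%s/robots.txt" % host)
            rp.read()                          # the only robots.txt fetch per TTL window
            entry = (rp, time.time())
            robots_cache[host] = entry
        return entry[0].can_fetch(agent, url)

    def polite_fetch(url):
        """Rate-limit per host, honor robots.txt, then fetch the content."""
        host = urlparse(url).netloc
        wait = MIN_INTERVAL - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)                   # at most one request per host per interval
        if not allowed(url):
            return None                        # robots.txt disallows this URL
        last_fetch[host] = time.time()
        return urllib.request.urlopen(url, timeout=5).read()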

Page 10: Spider duck

System Overview

Metadata Store: This is a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as fetch status for every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata.

Content Store: This is an HDFS (Hadoop) cluster for archiving downloaded content and all fetch information.
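To make the Metadata Store's role concrete, here is a sketch of the kind of URL-keyed record it could hold; the field names are assumptions derived from the description above, with a dict standing in for the Cassandra-based hash table:

    from dataclasses import dataclass, field

    @dataclass
    class UrlMetadata:
        url: str                    # the key: the URL as shared in the Tweet
        canonical_url: str          # resolution info: final URL after all redirects
        fetch_status: int           # e.g. HTTP status of the most recent fetch
        title: str = ""             # one example of extracted page metadata
        redirect_chain: list = field(default_factory=list)

    metadata_store = {}             # dict standing in for the Cassandra-based DHT

    def put(meta):
        metadata_store[meta.url] = meta

    def get(url):
        """The real-time lookup path used by client services across Twitter."""
        return metadata_store.get(url)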

Page 11: Spider duck

URL Scheduler

Page 12: Spider duck

URL Fetcher

Page 13: Spider duck

How Twitter uses SpiderDuck

• To retrieve URL metadata (for example, page title) and resolution information (that is, the canonical URL after redirects).

• Other services periodically process SpiderDuck logs in HDFS to generate aggregate stats for Twitter's internal metrics dashboards or conduct other types of batch analyses. ("How many images are shared on Twitter each day?", "What news sites do Twitter users most often link to?", and "How many URLs did we fetch yesterday from this specific website?")
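As a toy version of that last question, here is a sketch that counts fetches per website from one day of logs; the real jobs run over SpiderDuck logs in HDFS, and this log format is invented:

    from collections import Counter
    from urllib.parse import urlparse

    def urls_per_site(fetch_log_lines):
        """Count fetches per website from one day of (invented-format) fetch logs."""
        counts = Counter()
        for line in fetch_log_lines:
            url = line.split()[-1]            # assumed format: "<timestamp> <status> <url>"
            counts[urlparse(url).netloc] += 1
        return counts

    log = [
        "2011-11-14T00:01:02 200 http://example.com/a",
        "2011-11-14T00:01:03 200 http://example.com/b",
        "2011-11-14T00:02:10 301 http://news.example.org/story",
    ]
    print(urls_per_site(log).most_common())   # [('example.com', 2), ('news.example.org', 1)]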

Page 14: Spider duck

Performance numbers

• For URLs that get fetched, SpiderDuck's median processing latency is under 2 seconds, and the 99th percentile is under 5 seconds. (Latency is measured from the moment the user clicks "Tweet": the URL in that Tweet is extracted, prepared for fetch, all redirect hops are retrieved, the content is downloaded and parsed, and the metadata is extracted and made available to clients via the Metadata Store.)

Page 15: Spider duck

Performance numbers (cont.)

• Most of that time is spent either in the Fetcher Request Queues (due to rate limiting) or in actually fetching from the external web server. SpiderDuck itself adds no more than a few hundred milliseconds of processing overhead, most of which is spent in HTML parsing.

• The Cassandra-based Metadata Store handles about 10,000 requests per second. Its median read latency is 4-5 ms, and its 99th percentile is 50-60 ms.

Page 16: Spider duck

That’s All

• Read the acknowledgements part; thanks to everyone in it.
• Links to resources (open-source libs) will be given to Mr. Saparkhojayev; ask him to share them.

Page 17: Spider duck

Thank you!

Page 18: Spider duck

@murzabayev