6

Click here to load reader

Webscale for a small web

Embed Size (px)

DESCRIPTION

Lightning Talk slides from the ELAG 2014 conference on the web archive indexing- and search-project at Statsbiblioteket. Hardware data and performance graphs. Slide notes contains additional details. A better write-up can be found in blog-form on https://sbdevel.wordpress.com/2014/06/17/terabyte-index-search-and-faceting-with-solr/

Citation preview

Page 1: Webscale for a small web

Webscale for a small web

● Since 2005● All *.dk websites● 4 times/year● ~100 special sites harvested daily● Explicitely stated by law that everything public can and must

be harvested

ELAG 2014 quickly hacked lightning talk, Toke Eskildsen, [email protected], State and University Library, Denmark

Page 2: Webscale for a small web

Numbers

● Currently 370TB+ / 8-10B web resources● Estimated final index: 20-24TB● Indexing: 24 core / 256GB RAM / 5TB SSD● Searching: 16 core / 256GB RAM / 24TB SSD

– Cost: ~£12,000

● 1 optimized shard / SSD (900GB, 300M docs)– Build time / shard: ~8 days

● 1 solr / shard, connected with Solrcloud

https://github.com/netarchivesuite/netsearch (based on https://github.com/ukwa/webarchive-discovery)

Page 3: Webscale for a small web

20 threads, 3.6TB, simple searches

Page 4: Webscale for a small web

1 thread, faceting, 900GB, URL-field

Page 5: Webscale for a small web

Remember the milk!

Page 6: Webscale for a small web

Remember the milk!