tl;dr: Solr

Dumbledore: "I use the Pensieve. One simply siphons the excess thoughts from one's mind, pours them into the basin, and examines them at one's leisure. It becomes easier to spot patterns and links, you understand, when they are in this form."
Harry: "You mean... that stuff's your thoughts?"
Dumbledore: "Certainly."

Solr is Lucene-based
  Lucene = text search engine library written in Java
  All kinds of crazy goodies:
    Ranked search
    Multiple indexing
    Simultaneous read & write
    Date-range search
    ...the list goes on
  Platform-independent (thanks, Java!)
  Fast & efficient:
    Index size ~= 20-30% of the size of the indexed data
    Very high-throughput indexing (95 GB/hour)

Solr is NoSQL
  NoSQL == non-relational database
  RDBMS metaphor:
    One database
    One table
    Denormalized data
    Query parameters instead of SQL
    “Documents” instead of rows

Bottom line: it's a persistent datastore, and we use it to store data persistently.
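To make "query parameters instead of SQL" concrete, here is a minimal sketch that builds a Solr select URL without sending it. The host, the core name (collection1), and the field names are assumptions made up for illustration; q, fq, fl, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; substitute your own host and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_url(base_url, query, filters=(), fields=(), rows=10):
    """Build a Solr select URL from plain query parameters.

    q  -> the main query (roughly the role of a WHERE clause)
    fq -> filter queries that narrow the result set
    fl -> the list of fields to return (roughly the SELECT column list)
    """
    params = [("q", query), ("rows", rows), ("wt", "json")]
    params += [("fq", f) for f in filters]
    if fields:
        params.append(("fl", ",".join(fields)))
    return base_url + "?" + urlencode(params)


# Field names (title, country, id) are invented for this example.
url = build_select_url(
    SOLR_SELECT_URL,
    "title:engineer",
    filters=["country:US"],
    fields=["id", "title"],
    rows=5,
)
print(url)
```

There is no join and no second table here: everything the query needs has to live on the document already.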

Vocabulary
  Master
  Slave
  Replication
  Document
  API

Master
  There can be only one
  Read & write operations
  Must be secure
  Younger, stronger brother of production DB
  Home base for Solr slaves

Slave
  There are many copies
  They have a plan: replication
  Read-only
  Gets a copy of the index from the Solr master every k minutes
  Responds to queries

Replication
  Slaves --HTTP GET--> Master
  Replication is differential: only index files that changed are transferred
  Configuration is set in solrconfig.xml
  http://tinyurl.com/DESolrRepl
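A conceptual sketch of what "differential" means in practice: the slave checks whether the master's index version has moved, then fetches only the files it does not already hold. This is a toy model rather than Solr's actual replication code, and the function names, file names, sizes, and version numbers are all invented for illustration.

```python
def needs_sync(master_version, slave_version):
    """The slave only does work when the master's index version differs."""
    return master_version != slave_version


def plan_fetch(master_files, slave_files):
    """Return the names of files the slave must download.

    Both arguments map file name -> size in bytes. A file is fetched when
    the slave lacks it or holds a copy of a different size.
    """
    return sorted(
        name
        for name, size in master_files.items()
        if slave_files.get(name) != size
    )


# Invented file listings: the slave already has _1.fdt, so it is skipped.
master = {"_1.fdt": 1000, "_2.fdt": 400, "segments_3": 60}
slave = {"_1.fdt": 1000, "segments_2": 58}

if needs_sync(master_version=3, slave_version=2):
    print(plan_fetch(master, slave))
```

The big, unchanged files stay put, which is why polling every k minutes is cheap.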

Document
  RDBMS = row; Solr = document
  Denormalized relational data

Flatten a bunch of related RDBMS rows into a single Solr document
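A minimal sketch of that flattening, assuming three hypothetical tables (job, company, location) represented here as plain dicts. The table and field names are invented for illustration; the one-to-many relationship becomes a multi-valued field on the document.

```python
def flatten_job(job, company, locations):
    """Collapse one job row, its company row, and its location rows
    into a single flat dict shaped like a Solr document."""
    return {
        "id": job["id"],
        "title": job["title"],
        # Copied from the related company row: no join at query time.
        "company_name": company["name"],
        # One-to-many rows become a multi-valued field.
        "city": [loc["city"] for loc in locations],
    }


# Invented rows standing in for the result of a relational query.
job_row = {"id": 42, "title": "Search Engineer", "company_id": 7}
company_row = {"id": 7, "name": "Acme"}
location_rows = [
    {"job_id": 42, "city": "Indianapolis"},
    {"job_id": 42, "city": "Chicago"},
]

doc = flatten_job(job_row, company_row, location_rows)
print(doc)
```

The trade-off is duplication: the company name is copied onto every job document, so a rename means reindexing those documents.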

API
  Application programming interface
  The primary means of communicating with Solr is an HTTP API
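Since everything goes over HTTP, adding a document is just a POST. The sketch below builds the request without sending it. The host, core name, and document fields are assumptions for illustration, and it assumes a Solr version (4.0 or later) whose /update handler accepts a JSON array of documents; older releases exposed JSON updates at /update/json instead.

```python
import json
from urllib.request import Request

# Hypothetical core URL; substitute your own host and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def build_update_request(core_url, docs):
    """Build (but do not send) a POST that adds documents and commits."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        core_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Invented document for the example.
request = build_update_request(
    SOLR_CORE_URL, [{"id": 42, "title": "Search Engineer"}]
)
print(request.get_method(), request.full_url)
# To actually send it against the master: urllib.request.urlopen(request)
```

Writes like this belong on the master only; the slaves are read-only and pick up the change on their next poll.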

The Good Stuff: Unix & Diagnostics

“This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.”

- Doug McIlroy

Examples of things beyond the scope of this talk:
  cat, awk, grep, sed, cut, wc, sort, tail, head

Great read: http://matt.might.net/articles/sql-in-the-shell/

The Good Stuff: Unix & Diagnostics

You cannot effectively troubleshoot without parsing logs.
You cannot effectively parse logs without good text-parsing tools:
  cat, awk, grep, sed, cut, wc, sort, tail, head

No *nix OS? PowerShell!

The Good Stuff: Unix & Diagnostics

Example commands:

tail -f /var/log/celery/project.log
  Output the Celery log to stdout, in real time.

cat /ebs2/log/celery/project.log|grep -oE 'BUID:([0-9]{0,5})'|grep -oE '[0-9]{0,5}'|sort --unique
  Parse the Celery log, printing a list of unique BUIDs.

cat /ebs2/log/celery/project.log|grep -B 15 "DocumentInvalid"|grep -E 'Download complete for BUID ([0-9]{1,5})'|awk '{sub(/\[/, "");print $1 " " $2 " " $7 ":" $8}'
  Parse the Celery log, printing the BUIDs whose feed files failed for some reason.
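For anyone stuck without the Unix toolbox, the same two log-parsing ideas can be sketched in Python. The sample log lines below are invented (the real Celery log format is not shown in this deck), and the second function only roughly mirrors grep -B 15: it looks at the 15 lines before each DocumentInvalid line and reports the BUIDs downloaded there.

```python
import re
from collections import deque

BUID_TAG = re.compile(r"BUID:([0-9]{1,5})")
DOWNLOAD_DONE = re.compile(r"Download complete for BUID ([0-9]{1,5})")


def unique_buids(lines):
    """Like grep -oE 'BUID:...' | sort --unique: sorted unique BUID strings."""
    found = set()
    for line in lines:
        found.update(BUID_TAG.findall(line))
    return sorted(found)


def failed_buids(lines, context=15):
    """BUIDs whose download line appears within `context` lines before
    a DocumentInvalid line. Each download line is reported once."""
    recent = deque(maxlen=context)  # holds (line_number, line) pairs
    seen = set()
    failed = []
    for number, line in enumerate(lines):
        if "DocumentInvalid" in line:
            for prev_number, prev_line in recent:
                match = DOWNLOAD_DONE.search(prev_line)
                if match and prev_number not in seen:
                    seen.add(prev_number)
                    failed.append(match.group(1))
        recent.append((number, line))
    return failed


# Invented log lines; the real format will differ.
sample = [
    "[2013-01-01 10:00:00] INFO Download complete for BUID 123",
    "[2013-01-01 10:00:01] INFO Parsing feed BUID:123",
    "[2013-01-01 10:00:02] ERROR DocumentInvalid: bad feed file",
    "[2013-01-01 10:00:03] INFO Download complete for BUID 45",
    "[2013-01-01 10:00:04] INFO Parsing feed BUID:45",
]
print(unique_buids(sample))
print(failed_buids(sample))
```

To run it against a real file, pass an open file handle (for example open("/ebs2/log/celery/project.log")) in place of the sample list, since both functions only iterate over lines.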

Conclusion
  RTFreakingM:
    http://wiki.apache.org/solr/SolrQuerySyntax
    http://wiki.apache.org/solr/SolrCaching
    http://wiki.apache.org/solr/SchemaXml
    http://django-haystack.readthedocs.org/en/latest/
  Experiment & tinker & reinvent the wheel
  Get comfortable with the command line: you can't effectively administer Solr (or any sufficiently complex system) with a web GUI
  Read the logs
  Connect Solr behavior to application operations
