Tesserae: addressing scalability & flexibility concerns
Chris Eberle
Background
Tesserae: a linguistics project to compare intertextual similarities
A collaboration between the University at Buffalo and UCCS
Live version at http://tesserae.caset.buffalo.edu/
Source code at https://github.com/tesserae/tesserae
Tesserae
Background
The good:
- Well-designed, proven, robust algorithm
  - See "Intertextuality in the Digital Age" by Neil Coffee, J.-P. Koenig, Shakthi Poornima, Roelant Ossewaarde, Christopher Forstall, and Sarah Jacobson
  - See "The Tesserae Project: intertextual analysis of Latin poetry" by Neil Coffee, Jean-Pierre Koenig, Shakthi Poornima, Christopher W. Forstall, Roelant Ossewaarde, and Sarah L. Jacobson
- Simple website, intuitive operations, meaningful scores (user friendly)
- Multi-language support
- Large corpus (especially Latin)
Background
The bad:
- Perl emits PHP, which emits HTML
- Error-prone deployments (hand-edited Perl scripts)
The ugly:
- Mixed data and display layers
- Custom file formats: Perl nested dictionaries serialized to external text files (slow)
- Results must be partially pre-computed
  - Statistics are pre-computed at ingest time
  - Text-vs-text comparisons are done all at once, in memory; results are written to disk and paginated by another script. Searches represent a "snapshot in time", not a live search.
- No online ingest: everything is offline, involving multiple scripts to massage incoming data
- Can only compare one text to another; no per-section, per-paragraph, per-line, or per-author comparisons
Goals
Tesserae-NG: the next generation of Tesserae
Performance
- Use live caches and lazy computation where appropriate; no more bulk computation
- Make certain operations threaded / parallel
Scalability
- Use a proven storage backend (Solr) rather than custom binary formats
- Use industry-standard practices to separate data and display, allowing for clustering, load-balancing, caching, and horizontal scaling as necessary
- Make all operations as parallel as possible
Flexibility
- Use Solr's extensible configuration to support more advanced, flexible searches (more than simple "Text A" vs. "Text B" searches)
Ease of deployment
- Create a virtual environment that anyone can easily use to stand up their own instance
User interface
- Create a modern, user-friendly interface that both improves on the original design and gives administrators web-based tools to manage their data
Goals
In short: rewrite Tesserae to address scalability and flexibility concerns
(with a secondary focus on ease of development and a nicer UI)
Architecture
Frontend: Django-powered website with an online uploader
Middleware: asynchronous ingest engine to keep the frontend responsive
Backend: Solr-powered database for data storage and search
Architecture: Frontend
Powered by Django, jQuery, Twitter Bootstrap, and Haystack
- Simple MVC paradigm with separation of concerns (no more data logic in the frontend)
- Nice template engine, free admin interface, free input filtering / forgery protection
- Responsive, modern HTML5 UI thanks to jQuery and Twitter Bootstrap
- Python-based, modular, well-documented
- Solr searches are very easy thanks to Haystack
Scalability provided by uWSGI and Nginx
- The interpreter is only run once; bytecode is cached and kept alive
- Automatic scaling (multiple cores / multiple machines)
- Static content is never handled by Python, making it very cheap to serve
Architecture: Middleware
Celery
- Accepts texts to ingest
- Each text is split into 100-line chunks and distributed amongst workers
- Each worker translates the text into something Solr can ingest and makes the required ingest call to Solr
- Highly parallel and fairly robust; interrupted jobs are automatically re-run
- Ensures that large texts ingested from the frontend can't degrade the frontend experience
- Uses RabbitMQ to queue up any unprocessed texts
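The chunk-and-distribute step can be sketched in plain Python. The 100-line chunk size comes from the slides; the Celery task decorator and the actual Solr ingest call are omitted, so this is an illustrative sketch rather than the project's real middleware code:

```python
def chunk_text(lines, chunk_size=100):
    """Split a text (a list of lines) into fixed-size chunks.

    In Tesserae-NG each chunk would be handed to a Celery worker, so a
    large upload is processed in parallel instead of blocking the web
    frontend. The 100-line size matches the presentation; everything
    else here is an illustrative stand-in.
    """
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

# Example: a 250-line text becomes three worker jobs (100 + 100 + 50 lines).
chunks = chunk_text([f"line {n}" for n in range(250)])
```

Because each chunk is independent, an interrupted job can simply be re-queued on RabbitMQ and re-run without touching the other chunks.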
Architecture: Backend
Apache Solr for storage and search
- Proven search engine: fast and efficient
- Perfectly suited to large quantities of text
- Efficient, well-tested storage; easily cacheable; scales well
- Flexible schema configuration supports any kind of query we wish to perform on the data
- Does not have a text-vs-text comparison tool built in; a custom Solr plugin was written to accommodate this, based on the original Tesserae algorithm
Tomcat as the application container
- Can quickly create a load-balanced cluster if the need arises
Architecture: Other concerns
Web-based ingest is tedious for batch jobs
- Provide command-line tools to ingest large quantities of texts for the initial setup (use of these tools is optional)
Solr's storage engine can't / won't handle some of the metadata that the current Tesserae format expects (e.g. per-text frequency data)
- Use a secondary key-value database on the side (LevelDB, with very fast lookups) to store this extra information
Tesserae's CSV-based lexicon database is too slow and won't fit into memory
- Create an offline, one-time transformer to ingest the CSV file into a LevelDB database that will be quicker to read
Metrics: where are the slow points?
- Use Carbon / Graphite to collect metrics (both stack-wide and in-code)
May want to access texts directly (view-only mode, no search)
- PostgreSQL for simple storage
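The one-time CSV-to-key-value transformer might look roughly like this. The lexicon's column layout and the helper names are assumptions, and SQLite stands in for LevelDB purely so the sketch is self-contained; the point is the offline conversion into a store with fast point lookups:

```python
import csv
import io
import sqlite3

def build_lexicon_db(csv_text, conn):
    """One-time transform: lexicon CSV -> key-value store.

    Assumed CSV layout: word,stem (the real Tesserae lexicon format may
    differ). LevelDB is replaced by an in-memory SQLite table so the
    sketch runs anywhere without third-party bindings.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon (word TEXT PRIMARY KEY, stem TEXT)"
    )
    rows = csv.reader(io.StringIO(csv_text))
    conn.executemany("INSERT OR REPLACE INTO lexicon VALUES (?, ?)", rows)
    conn.commit()

def stem_of(conn, word):
    """Fast point lookup: surface form -> stem, or None if unknown."""
    row = conn.execute(
        "SELECT stem FROM lexicon WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
build_lexicon_db("bella,bellum\nbellorum,bellum\n", conn)
```

Once built, lookups no longer require the whole CSV in memory, which is the property the slides are after.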
Architecture
Solr Plugin
- Solr has no built-in capability to compare one document to another
- Solr is a simple web wrapper with configuration files; it uses Lucene under the covers for all of the heavy lifting
- There's no built-in support for comparisons in Lucene either, but writing a Solr wrapper to do this is possible
Solr Plugin: Design decisions
What will be searched?
- Simple one document vs. another? Portions of a document vs. another? Actual text within a document?
- What is a "document"? A text? A volume of texts?
General approach
- Treat each line in the original text as its own document
  - This "minimal unit" is configurable at install time
- Dynamically assemble two "texts" at runtime based on whatever parameters the user wishes
  - Can compare two texts, two volumes, two authors, a single line vs. a whole text, a portion of a text vs. an entire author, etc.
  - Only limited by the expressive power of Solr's search syntax and the schema
Solr Plugin: Schema example

Author | Title         | Volume | Line # | Text
-------+---------------+--------+--------+------------------------------------------------
Lucan  | Bellum Civile | 1      | 1      | Bella per Emathios plus quam civilia campos
Lucan  | Bellum Civile | 1      | 2      | Iusque datum sceleri canimus, populumque potentem
Lucan  | Bellum Civile | 1      | 3      | In sua victrici conversum viscera dextra,
...    | ...           | ...    | ...    | ...
Vergil | Aeneid        | 12     | 950    | hoc dicens ferrum adverso sub pectore condit
Vergil | Aeneid        | 12     | 951    | fervidus. Ast illi solvuntur frigore membra
Vergil | Aeneid        | 12     | 952    | vitaque cum gemitu fugit indignata sub umbras.

Each row, in Solr parlance, is called a "document". To be sure, these are actually document fragments from the user's perspective. Each "document" has a unique ID and can be addressed individually. We can combine them at runtime into two "pools" of documents, which will be compared to one another for similarity.
Solr Plugin: Ingest logic
Receive a batch of lines + metadata. For each line, do the following:
- Split the line into words (done automatically with Solr's tokenizer)
- Take each word, normalize it, and look up the stem word in a Latin lexicon DB
- Look up all forms of the stem word in the DB
- Place the original word and all other forms of the word in the Solr index
  - Encode the form into the word so we can determine at search time which form it is
  - This allows the line to match no matter which form of a word is used
- Update a global (language-wide) frequency database with the original word and all other forms of the word
- Metadata is automatically associated; no intervention required
- The final "document" is stored and indexed by Solr. Term vectors are calculated automatically.
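The stem lookup and form-encoding steps can be sketched as follows. The toy lexicon, the `form|original` packing convention, and the function name are all illustrative assumptions; the real plugin implements this inside Solr's analysis chain, not in Python:

```python
# Toy lexicon: surface form -> stem (illustrative only; the real data
# comes from the Latin lexicon DB described above).
LEXICON = {"bella": "bellum", "bellorum": "bellum", "belli": "bellum"}

# Reverse map: stem -> all known surface forms.
FORMS = {}
for form, stem in LEXICON.items():
    FORMS.setdefault(stem, set()).add(form)

def index_terms(line):
    """For each word, emit the original form plus every other form of its
    stem, with the surface form packed into the term (here as
    'form|original') so search time can tell which form actually appeared.
    Unknown words are indexed as themselves."""
    terms = []
    for word in line.lower().split():
        stem = LEXICON.get(word)
        for f in FORMS.get(stem, {word}):
            terms.append(f"{f}|{word}")
    return terms

terms = index_terms("Bella bellorum")
```

Indexing every form of the stem is what lets a query for "belli" hit a line that actually contains "bella", at the cost of a larger index.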
Solr Plugin: Search logic
- Take in two queries from the user: a source query and a target query
- Gather the Solr documents that match each query, collecting the result sets in parallel as the "source set" and the "target set"
- Treat the two result sets as two large meta-documents
  - Dynamically build frequency statistics for each meta-document
  - Dynamically construct a stop list based on global statistics
    - Global statistics must live from one run to the next, so they're kept in an external DB
    - Global statistics don't change from one search to the next, so they're cached
- Run the core Tesserae algorithm on the two meta-documents
  - Compare all-vs-all, keeping only line pairs that share 2 or more terms
  - Words found in the stop list are ignored
  - Calculate distances for each pair, throwing away distances above some threshold
  - Calculate a score based on distance and frequency statistics
  - Order results by this final score (high to low)
- Format results, trying to determine which words need highlighting
- Stream results to the caller (pagination is automatic thanks to Solr)
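The all-vs-all core can be sketched like this. The shared-term threshold of 2 comes from the slides; the distance measure (positional span of the matched terms) and the score formula (inverse frequency over distance) are simplified stand-ins for the real Tesserae scoring function, not a faithful reimplementation:

```python
def span(words, shared):
    """Positional spread of the matched terms within one line."""
    pos = [i for i, w in enumerate(words) if w in shared]
    return pos[-1] - pos[0] + 1

def compare(source, target, freq, stoplist, max_dist=10):
    """All-vs-all comparison of two 'meta-documents' (dicts mapping a
    line ID to its list of words). Keeps only pairs sharing >= 2
    non-stoplist terms, drops pairs whose combined matched-term span
    exceeds max_dist, and scores the rest (illustrative formula)."""
    results = []
    for sid, s_words in source.items():          # the O(m*n) all-vs-all loop
        for tid, t_words in target.items():
            shared = (set(s_words) & set(t_words)) - stoplist
            if len(shared) < 2:
                continue
            dist = span(s_words, shared) + span(t_words, shared)
            if dist > max_dist:
                continue
            # Rarer shared words and tighter spans score higher.
            score = sum(1.0 / freq.get(w, 1) for w in shared) / dist
            results.append((score, sid, tid))
    return sorted(results, reverse=True)         # highest score first

hits = compare(
    {"s1": ["arma", "virumque", "cano"]},
    {"t1": ["arma", "cano", "troiae"], "t2": ["italiam", "fato"]},
    freq={"arma": 2, "cano": 1},
    stoplist=set(),
)
```

The nested loop makes the O(m*n) cost mentioned later in the deck visible: the only free parallelism is across iterations of the outer loop.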
Solr Plugin: Flexible query language
Compare "Bellum Civile" with "Aeneid" (all volumes):
  http://solrhost:8080/solr/latin?tess.sq=title:Bellum%20Civile&tess.tq=title:Aeneid
Compare line 6 of "Bellum Civile" with all of Vergil's works:
  http://solrhost:8080/solr/latin?tess.sq=title:Bellum%20Civile%20AND%20line:6&tess.tq=author:Vergil
Compare line 3 of Aeneid part 1 with line 10 of Aeneid part 1:
  http://solrhost:8080/solr/latin?tess.sq=title:Aeneid%20AND%20volume:1%20AND%20line:3&tess.tq=title:Aeneid%20AND%20volume:1%20AND%20line:10
Rich query language provided by Solr; most queries are easily supported: https://wiki.apache.org/solr/SolrQuerySyntax
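Because the comparison parameters are ordinary query-string arguments, a client can build these requests with nothing but the standard library. The host and handler path below simply mirror the example URLs above; adjust them for a real deployment:

```python
from urllib.parse import urlencode

def tess_url(source_query, target_query,
             base="http://solrhost:8080/solr/latin"):
    """Build a Tesserae-NG comparison URL from two Solr queries.

    The base URL mirrors the slide examples (an assumption, not a fixed
    endpoint); urlencode handles all the percent-escaping."""
    return base + "?" + urlencode({"tess.sq": source_query,
                                   "tess.tq": target_query})

# Line 6 of "Bellum Civile" vs. all of Vergil's works:
url = tess_url("title:Bellum Civile AND line:6", "author:Vergil")
```

Note that `urlencode` escapes spaces as `+` and colons as `%3A` rather than the `%20` shown in the slides; Solr accepts both encodings.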
Solr Plugin: Difficulties
Solr is optimized for text search, not text comparison
- Bulk reads of too many documents can be very slow because the index isn't used
- Rather than loading the actual documents, use an experimental feature called "term vectors", which stores frequency information for each row directly in the index; use the term vectors exclusively until the actual document is needed
The meta-document approach makes it impossible to pre-compute statistics
- Calculating them at runtime is somewhat costly; a cache partially mitigates this for related searches
The original Tesserae has a multi-layered index: actual word + location -> stemmed word + all other forms
- This allows the engine to decide which word form to use at each stage of the search
- Solr's index is flat: word + location
- Had to "fake" the hierarchy by packing extra information into each word
  - This means each word must still be split apart and parsed, which can be slow for large document collections
  - Fixing this would require a custom Solr storage engine (yes, this is possible; Solr is very pluggable) and a custom term-vector implementation (also possible)
Easy deployment: Vagrant
Many components, a complicated build process, multiple languages, and dozens of configuration files
- Need to make this easy to deploy, or no one will use it
Solution: Vagrant
- Create a Linux image by hand with some pre-installed software (Java, Tomcat, Postgres, Maven, Ant, Sbt, Nginx, Python, Django, RabbitMQ, etc.)
- Store all code, setup scripts, and configuration in git
- Automatically download the Linux image, provision it, and lay down the custom software and configuration
- Automatically start all services and ingest the base corpora
- The entire deployment boils down to one command: vagrant up
- Average deployment time: 10 minutes
- Encourages more participation (lower barrier to entry)
The final product
Step 1: Clone the project
Step 2: vagrant up (automatic provisioning, install, config, & ingest)
Step 3: Search
Live Demo
Results
- Results are generated within a similar time frame to the original (a couple of seconds on average for one core)
- Scores are nearly identical (many thanks to Walter Scheirer and his team for their help translating and explaining the original algorithm, as well as testing the implementation)
- Results are truly dynamic: no need to pre-compute or pre-sort, and no temporary or session files are used
- Related accesses are very fast (tens of milliseconds), faster than the original site, thanks to Solr's ability to cache search results
- Scales very well
  - Numbers are relatively constant regardless of how many other documents occupy the database (storage volume doesn't impede speed)
  - Can be made noticeably faster by deploying on a multi-core machine
- The biggest factor determining speed is the size of the two "meta-documents"
  - Can't be made truly parallel; each phase relies on the previous one being done
- Only data that will be displayed is actually transmitted; no wasted bandwidth per search
Analysis
Success! Both primary and secondary goals were met.
- While single searches on single-core setups won't see any improvement, using multiple cores definitely improves speed
- All original simple-search functionality is intact
- New functionality added:
  - Sub/super-document comparisons via the custom plugin
  - Single-document text search is a given with Solr
  - Solr multi-core support: multiple instances of Solr can run at the same time, which means not only multiple languages but also multiple arbitrary configurations
  - Online asynchronous ingest
  - Search and storage caching
  - Web-based administration
- Because Solr uses the JVM, there's no need to run a costly interpreter for each and every search; the JVM compiles the most-used pieces of code to near-native speeds
- The original scoring algorithm is O(m*n) (a result of the all-vs-all comparison), so parallelism only helps so much
Conclusion
- The results speak for themselves
- It's unfortunate that Solr doesn't have a built-in comparison endpoint
  - Writing our own turned out to be necessary anyway; it's doubtful a built-in one would have used a scoring scheme based on the original Tesserae algorithm
  - The Lucene API provided everything needed to do this comparison; very few "hacks" were necessary
- Should provide the Tesserae team with a nice framework moving forward:
  - Easy to deploy
  - Separation of concerns
  - Nice UI
  - Simple, scriptable MVC frontend written against a well-documented set of APIs
  - Robust backend that scales better than the Perl version, with a formal, type-checked, thread-safe, compiled language for the core algorithm
  - Rich batch tools
Future work
UI frontend
- Add more advanced search types to the frontend
- Full UI management of ingested texts (view, update, delete)
- Free-text search of available texts
Solr backend
- Word highlighting (expensive right now)
- Core algorithm: address the O(n*m) implementation
- Refactor code (a tad jumbled right now)
- Address slow ingest speed
- Add support for index rebuilds
Vagrant / installer
- Flesh out "automatic" corpora selection
- Multi-VM installer (automatic load balancing)
Further information
- Source code at https://github.com/eberle1080/tesserae-ng
- Documentation at https://github.com/eberle1080/tesserae-ng/wiki
- Live version at http://tesserae-ng.chriseberle.net/
SLOC statistics:
- 3205 lines of Python
- 3119 lines of Scala
- 2034 lines of XML
- 719 lines of Bash
- 548 lines of HTML
- 237 lines of Java
Questions?