
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P

Search Analytics Component Steven Bower ©2014 Bloomberg L.P.

Bloomberg

•  Largest provider of financial news and information
•  Our strength is quickly and accurately delivering data, news and analytics
•  Creating high performance and accurate information retrieval systems is core to our strength

Bloomberg Search Team

•  Search infrastructure
    •  Develop and support search as a service platform
    •  Support for other search applications within the company
•  Consultancy
    •  Provide design consultancy/support to application teams
    •  Promote search best practices/standardization throughout the company
•  Machine learning
    •  Develop machine learning techniques to improve relevancy
    •  Create natural language processors to answer questions
•  Unified search
    •  Create information retrieval tools to organize and connect the vast and varied datasets provided to our clients

Our Challenge

Our Approach

•  Use Search/Solr as it provides flexible search/filtering over large, fast moving result sets
•  Initially used StatsComponent, but quickly ran into limitations
•  Wanted to push the bounds of analytics capabilities in Solr/Lucene
•  Needed a pluggable framework to perform complex calculations/aggregations on numerical time-series data
•  DocValues provided high performance columnar access to fields in the index (without un-inversion cost)

DocValues

•  DocValues provide high performance columnar access to fields in the index
•  No un-inversion cost
•  Increased storage footprint
•  Helps achieve NRT (near-real-time) search
•  Values live off-heap in a memory map
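To make "un-inversion cost" concrete, here is a minimal Python sketch of the idea. It is a toy model, not Lucene code: the `docs` data, `build_inverted`, and `uninvert` are hypothetical names made up for illustration. An inverted index maps each value to the documents containing it, so getting a per-document column back means walking every posting list. A columnar structure written at index time avoids that step.

```python
from collections import defaultdict

# Hypothetical toy index: three documents with a single-valued "price" field.
docs = {0: 10.0, 1: 12.5, 2: 10.0}

def build_inverted(docs):
    """Inverted view: value (term) -> list of doc ids that contain it."""
    inverted = defaultdict(list)
    for doc_id, value in sorted(docs.items()):
        inverted[value].append(doc_id)
    return dict(inverted)

def uninvert(inverted, num_docs):
    """Rebuild a doc-id-indexed column by walking every posting list.

    This pass is the work that a column stored at index time makes unnecessary.
    """
    column = [None] * num_docs
    for value, doc_ids in inverted.items():
        for doc_id in doc_ids:
            column[doc_id] = value
    return column

inverted = build_inverted(docs)
column = uninvert(inverted, len(docs))
```

With a column available, a statistic over a result set becomes a direct lookup by document id, such as `sum(column[d] for d in matching_doc_ids)`.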

Analytics Component

•  New component from the ground up
•  Designed/Implemented by the Bloomberg Search Team over the summer of 2013
•  Initial implementation was built using the DocValues API directly, but moved to FieldCache
•  Refactored existing faceting implementation to support analytics
•  Created simple prefix notation for statistical expressions
•  Available as a Solr Contrib module in Solr 5.x or as patches for 4.8+ on SOLR-5302

Features

•  Flexible/Extendable framework for adding additional statistics/faceting
•  Supports Multiple Analytics Requests per query execution
    •  Multiple statistic calculations per request
    •  Multiple facets per request
    •  Each request can facet statistics over different fields and ranges

Features - Faceting

•  Field Faceting
    •  Support for int, long, float, double, date, string fields
    •  Support for multi-value fields
    •  Support for limit, offset and mincount
    •  Support for sorting of stats-facets by any statistic (e.g. sort by mean)
•  Range faceting
    •  Numeric types and dates
    •  Dynamically calculate range/gap based on calculated statistics
•  Support for query faceting of stats
    •  Use calculated statistics to generate facet queries

Features – Map Operators

•  Basic Math
    •  neg(<expr>)
    •  add(<expr>,...)
    •  mult(<expr>,...)
    •  div(<expr>,<expr>)
    •  pow(<expr>,<expr>)
    •  log(<expr>,<expr>)
•  Constants
    •  const_num(<number>)
    •  const_date(<date>)
    •  const_str(<string>)
•  Date Math
    •  date_math(<date expr>,<date op>,...)
•  String operations
    •  rev(<expr>)
    •  concat(<expr>,...)
•  Field
    •  <field>
•  Missing Values
    •  miss(<expr>,<value>)

Features – Reduction Operators

•  Statistical
    •  min(<expr>)
    •  max(<expr>)
    •  sum(<expr>)
    •  count(<expr>)
    •  miss(<expr>)
    •  unique(<expr>)
•  Complex
    •  sumofsquares(<expr>)
    •  mean(<expr>)
    •  stddev(<expr>)
    •  median(<expr>)
    •  percentile(<expr>)
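The split between map and reduction operators can be illustrated with a small Python sketch. This is not the component's implementation: `parse`, `eval_map`, `evaluate`, and the operator tables are hypothetical names, and only a handful of the operators listed in the slides are covered. The sketch assumes map operators run once per document and reduction operators collapse the per-document values into one number, after which map operators can combine the reduced values.

```python
import math
import re
import statistics

# Map operators: applied to the values of one document (or to reduced values).
MAP_OPS = {
    "neg": lambda a: -a,
    "add": lambda *xs: sum(xs),
    "mult": lambda *xs: math.prod(xs),
    "div": lambda a, b: a / b,
    "pow": lambda a, b: a ** b,
}

# Reduction operators: collapse one value per document into a single number.
REDUCE_OPS = {
    "min": min,
    "max": max,
    "sum": sum,
    "count": len,
    "mean": statistics.fmean,
    "median": statistics.median,
}

def parse(text):
    """Parse prefix notation into nested tuples, e.g. ('sum', ('mult', 'a', 'b'))."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|-?(?:\d+\.?\d*|\.\d+)|[(),]", text)
    pos = 0

    def expr():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # skip '('
            args = [expr()]
            while tokens[pos] == ",":
                pos += 1                  # skip ','
                args.append(expr())
            pos += 1                      # skip ')'
            return (tok, *args)
        return tok                        # bare field name or number literal

    return expr()

def eval_map(node, doc):
    """Evaluate a map expression against one document (a dict of field values)."""
    if isinstance(node, str):
        return doc[node]                  # a bare name is a field reference
    op, *args = node
    if op == "const_num":
        return float(args[0])
    return MAP_OPS[op](*(eval_map(a, doc) for a in args))

def evaluate(node, docs):
    """Evaluate a full expression whose leaves are reductions or constants."""
    op, *args = node
    if op in REDUCE_OPS:
        return REDUCE_OPS[op]([eval_map(args[0], d) for d in docs])
    if op == "const_num":
        return float(args[0])
    return MAP_OPS[op](*(evaluate(a, docs) for a in args))

docs = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]   # hypothetical documents
total = evaluate(parse("sum(mult(a, b))"), docs)       # 1*2 + 3*4
ratio = evaluate(parse("div(sum(mult(a, b)), sum(b))"), docs)
```

Here `mult(a, b)` is evaluated per document, `sum(...)` reduces across the two documents, and the outer `div` combines two already-reduced values.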

Examples

•  Weighted Average
    •  Calculate weighted average of field_a with field_b as the weight

div( sum( mult(field_a, field_b) ), sum(field_b) )

•  Variance
    •  Calculate the variance of field_a

pow( stddev(field_a), const_num(2) )
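As a cross-check on what these two statistics compute, here is a minimal Python sketch over made-up values. The data is hypothetical. A weighted average is the sum of value times weight divided by the sum of the weights. The slides do not say whether stddev() is the population or the sample standard deviation, so the population form is assumed here.

```python
import statistics

field_a = [10.0, 20.0, 30.0]   # hypothetical values
field_b = [1.0, 1.0, 2.0]      # hypothetical weights

# Weighted average: sum of value * weight, divided by the sum of the weights.
weighted_avg = sum(a * b for a, b in zip(field_a, field_b)) / sum(field_b)

# Variance as the square of the standard deviation.
# Assumption: population standard deviation (statistics.pstdev).
variance = statistics.pstdev(field_a) ** 2
```

With these inputs the weighted average is (10 + 20 + 60) / 4 = 22.5, which is pulled toward 30 because that value carries twice the weight.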

Examples

•  T-Score
    •  Calculate a t-score where ## is the value and all values in your sample are stored in field_a.

div( add( const_num(##), neg( mean(field_a) ) ), div( stddev(field_a), pow( count(field_a), const_num(.5) ) ) )
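The expression reads as (value - mean) / (stddev / sqrt(count)). A minimal Python sketch of the same arithmetic follows. The `t_score` function and the sample data are hypothetical, and the population standard deviation is assumed because the slides do not specify which form stddev() uses.

```python
import math
import statistics

def t_score(value, sample):
    """(value - mean) / (stddev / sqrt(n)), mirroring the prefix expression."""
    n = len(sample)
    mean = statistics.fmean(sample)
    stddev = statistics.pstdev(sample)   # assumption: population form
    return (value - mean) / (stddev / math.sqrt(n))

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, stddev 2, n 8
score = t_score(6.0, sample)
```

For this sample the denominator is 2 / sqrt(8), so a value of 6 scores sqrt(8) / 2, about 1.414.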

How We Use It

•  Segment, aggregate and analyze financial data quickly
•  Aggregate time series data across multiple fields to render charts
•  Created flexible diagnostic tools/visualizations to analyze Solr performance

Future Plans

•  Multi-shard support
•  Pivot Facet Support
•  Statistics on Multi-value fields
    •  To support unique()
•  Filter result set based upon calculated statistics
•  Generalize facet implementation

Links and Questions?

Analytics Component: https://issues.apache.org/jira/browse/SOLR-5302

More About Bloomberg: http://www.bloomberglabs.com/