OCTOBER 11-14, 2016 BOSTON, MA

Creating New Streams: Presented by Dennis Gove, Bloomberg LP


Page 2: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

Creating New Streams Dennis Gove

Bloomberg LP

Copyright 2016 Bloomberg Finance L.P. All rights reserved.

Page 3: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 Bloomberg

●  Largest provider of financial news and information
●  Our strength is quickly and accurately delivering data, news and analytics
●  Creating high performance and accurate information retrieval systems is core to our strength

Page 4: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 What’s Our Goal?

We’re going to explore what Solr Streams are and how you can extend their functionality to solve problems at your organization.

Page 5: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

02 Agenda

●  Solr Streams and Expressions
●  Expression Structure
●  Core Pieces of Every Stream Class
●  The read() function – 3 examples
●  Exposing New Streams in Solr

Page 6: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

"Streaming Expressions provide a simple yet powerful stream processing language for SolrCloud. They are a suite of functions that can be combined to perform many different parallel computing tasks." [Solr Reference Guide v6.0]

Page 7: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

A Solr Stream is a pipeline of actions performed over a set of tuples (documents)

Data Flow

Page 8: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

A Solr Streaming Expression is a way to describe that pipeline

    rollup(
      search(pets, q="type:dog", fl="ownerId,age", sort="ownerId ASC"),
      over="ownerId",
      min(age), max(age), count(*)
    )

Page 9: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

{ "ownerId" : 12345, "age" : 13 }

A Solr Streaming Expression is a way to describe that pipeline

    rollup(
      search(pets, q="type:dog", fl="ownerId,age", sort="ownerId ASC"),
      over="ownerId",
      min(age), max(age), count(*)
    )

Page 10: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

{ "ownerId" : 12345, "min(age)" : 3, "max(age)" : 14, "count(*)" : 3 }

A Solr Streaming Expression is a way to describe that pipeline

    rollup(
      search(pets, q="type:dog", fl="ownerId,age", sort="ownerId ASC"),
      over="ownerId",
      min(age), max(age), count(*)
    )
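To make the rollup step concrete, here is a minimal sketch of the idea in plain Java. It is not the SolrJ RollupStream; the class `RollupSketch` and its helper methods are hypothetical, and tuples are modeled as simple maps. Like rollup(), it assumes the incoming tuples are already sorted by the "over" field, so each group can be summarized as soon as the key changes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the SolrJ RollupStream.
class RollupSketch {

    // Emits one summary tuple per run of consecutive tuples sharing the same
    // value in the "over" field. The input must already be sorted by that field.
    static List<Map<String, Object>> rollup(List<Map<String, Object>> sorted,
                                            String over, String metricField) {
        List<Map<String, Object>> out = new ArrayList<>();
        Object currentKey = null;
        long min = 0, max = 0, count = 0;

        for (Map<String, Object> tuple : sorted) {
            Object key = tuple.get(over);
            long value = ((Number) tuple.get(metricField)).longValue();

            // Key changed: flush the finished group and start a new one.
            if (count > 0 && !key.equals(currentKey)) {
                out.add(summary(over, currentKey, metricField, min, max, count));
                count = 0;
            }
            if (count == 0) {
                currentKey = key;
                min = value;
                max = value;
            } else {
                min = Math.min(min, value);
                max = Math.max(max, value);
            }
            count++;
        }
        // Flush the last group, if any tuples were read.
        if (count > 0) {
            out.add(summary(over, currentKey, metricField, min, max, count));
        }
        return out;
    }

    private static Map<String, Object> summary(String over, Object key, String field,
                                               long min, long max, long count) {
        Map<String, Object> t = new LinkedHashMap<>();
        t.put(over, key);
        t.put("min(" + field + ")", min);
        t.put("max(" + field + ")", max);
        t.put("count(*)", count);
        return t;
    }

    // Convenience builder for a pet tuple with the two fields used above.
    static Map<String, Object> pet(long ownerId, long age) {
        Map<String, Object> t = new LinkedHashMap<>();
        t.put("ownerId", ownerId);
        t.put("age", age);
        return t;
    }
}
```

Feeding it three dogs owned by 12345 with ages 3, 13 and 14 yields a single tuple with min(age) 3, max(age) 14 and count(*) 3, which matches the shape of the output tuple on the slide.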

Page 11: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

    innerJoin(
      on="ownerId=personId",
      rollup(
        search(pets, q="type:dog", fl="ownerId,age", sort="ownerId ASC"),
        over="ownerId",
        min(age), max(age), count(*)
      ),
      search(people, q="age:[25 TO 40] AND state:(MA RI NH)", fl="personId,name", sort="personId ASC")
    )
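Both inputs to the join are sorted on their join keys (ownerId on one side, personId on the other), which is what lets a join walk the two streams in step. The sketch below illustrates that merge-join idea in plain Java. It is a simplification rather than the SolrJ implementation: `MergeJoinSketch` and `tuple` are hypothetical names, tuples are maps, keys are assumed numeric, and each key is assumed to appear at most once per side.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the SolrJ InnerJoinStream.
class MergeJoinSketch {

    // Joins two lists that are each sorted ascending by their join key.
    // Simplifying assumption: a key appears at most once on each side.
    static List<Map<String, Object>> innerJoin(List<Map<String, Object>> left, String leftKey,
                                               List<Map<String, Object>> right, String rightKey) {
        List<Map<String, Object>> out = new ArrayList<>();
        int l = 0, r = 0;
        while (l < left.size() && r < right.size()) {
            long lk = ((Number) left.get(l).get(leftKey)).longValue();
            long rk = ((Number) right.get(r).get(rightKey)).longValue();
            if (lk < rk) {
                l++;                      // left is behind, advance it
            } else if (lk > rk) {
                r++;                      // right is behind, advance it
            } else {
                // Keys match: emit one tuple carrying the fields of both sides.
                Map<String, Object> joined = new LinkedHashMap<>(left.get(l));
                joined.putAll(right.get(r));
                out.add(joined);
                l++;
                r++;
            }
        }
        return out;
    }

    // Builds a tuple from alternating key/value arguments.
    static Map<String, Object> tuple(Object... keyValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            t.put((String) keyValues[i], keyValues[i + 1]);
        }
        return t;
    }
}
```

Because neither side is ever rewound, the join needs only one pass over each input, which is why the sort order on both searches matters.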

Page 14: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

All dog owners from MA, RI, or NH, aged 25 thru 40, who have voted in at least 1 presidential primary since 2008, excluding those who have already donated > $50

{ "personId" : 12345, "name" : "Jane Doe" }

Page 15: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Solr Streams and Expressions

All dog owners from MA, RI, or NH, aged 25 thru 40, who have voted in at least 1 presidential primary since 2008, excluding those who have already donated > $50

{ "personId" : 12345, "name" : "Jane Doe", "ownerId" : 12345, "min(age)" : 2, "max(age)" : 14, "count(*)" : 3 }

Page 18: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Expression Structure

    functionName( positionalParameters, named="parameters", typed(parameters) )

Page 19: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Expression Structure

    functionName( positionalParameters, named="parameters", typed(parameters) )

    update(
      petOwners,
      batchSize=5,
      rollup(
        over="ownerId",
        search(pets, q="*:*", fl="ownerId,age", sort="ownerId ASC"),
        min(age), max(age), count(*)
      )
    )

The function name is mapped to the class implementing the logic.

    update -> UpdateStream
    rollup -> RollupStream
    search -> CloudSolrStream
    min    -> MinMetric
    max    -> MaxMetric
    count  -> CountMetric
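The name-to-class mapping can be pictured as a small two-way registry: names resolve to classes when an expression is parsed, and classes resolve back to names when a stream is turned back into an expression. The sketch below is a self-contained illustration of that idea only. `FunctionRegistrySketch`, its methods `register`, `lookup` and `nameOf`, and the `*Stub` classes are all hypothetical and are not part of SolrJ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical placeholder classes; the real implementations live in SolrJ.
class UpdateStreamStub {}
class RollupStreamStub {}
class MinMetricStub {}

// Hypothetical two-way registry between function names and implementing classes.
class FunctionRegistrySketch {
    private final Map<String, Class<?>> functions = new LinkedHashMap<>();

    // Registers a function name; returns this so calls can be chained.
    FunctionRegistrySketch register(String name, Class<?> implementation) {
        functions.put(name, implementation);
        return this;
    }

    // Name -> class, as needed when parsing an expression.
    Class<?> lookup(String name) {
        Class<?> implementation = functions.get(name);
        if (implementation == null) {
            throw new IllegalArgumentException("Unknown function: " + name);
        }
        return implementation;
    }

    // Class -> name, as needed when writing a stream back out as an expression.
    String nameOf(Class<?> implementation) {
        for (Map.Entry<String, Class<?>> entry : functions.entrySet()) {
            if (entry.getValue().equals(implementation)) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException("Unregistered class: " + implementation.getName());
    }
}
```

A registry built with `register("update", UpdateStreamStub.class)` resolves "update" to that class and that class back to "update"; an unregistered name fails fast instead of returning null.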

Page 20: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Expression Structure

    functionName( positionalParameters, named="parameters", typed(parameters) )

Positional parameters are those expected to be in a particular position. Most often used to reference collection or field names.

    update(
      petOwners,
      batchSize=5,
      rollup(
        over="ownerId",
        search(pets, q="*:*", fl="ownerId,age", sort="ownerId ASC"),
        min(age), max(age), count(*)
      )
    )

Page 21: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Expression Structure

    functionName( positionalParameters, named="parameters", typed(parameters) )

Named parameters can exist in any non-positional location and in any order. Quotes are only required if the value contains a non-alphanumeric character.

    update(
      petOwners,
      batchSize=5,
      rollup(
        over="ownerId",
        search(pets, q="*:*", fl="ownerId,age", sort="ownerId ASC"),
        min(age), max(age), count(*)
      )
    )

Page 22: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Expression Structure

    functionName( positionalParameters, named="parameters", typed(parameters) )

Typed parameters are useful in situations where you can accept a parameter representing some other thing, such as streams or metrics, but you don’t care exactly what is provided.

    update(
      petOwners,
      batchSize=5,
      rollup(
        over="ownerId",
        search(pets, q="*:*", fl="ownerId,age", sort="ownerId ASC"),
        min(age), max(age), count(*)
      )
    )
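The three parameter kinds can be told apart from the text of an operand alone. The sketch below splits an expression into its top-level operands and classifies each one. It is a rough illustration, not Solr's expression parser: `OperandSketch`, `topLevelOperands` and `classify` are hypothetical names, and the rules (a leading `name=` means named, a trailing `name(...)` means typed, anything else is positional) are a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Solr ships its own expression parser.
class OperandSketch {
    enum Kind { POSITIONAL, NAMED, TYPED }

    // Returns the comma-separated operands of the outermost function call,
    // ignoring commas nested inside parentheses or double quotes.
    static List<String> topLevelOperands(String expression) {
        String body = expression.substring(expression.indexOf('(') + 1,
                                           expression.lastIndexOf(')'));
        List<String> operands = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean inQuotes = false;
        for (char c : body.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (!inQuotes) {
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    operands.add(current.toString().trim());
                    current.setLength(0);
                    continue;   // do not keep the separating comma
                }
            }
            current.append(c);
        }
        String last = current.toString().trim();
        if (!last.isEmpty()) {
            operands.add(last);
        }
        return operands;
    }

    // Classifies a single operand by its shape.
    static Kind classify(String operand) {
        int eq = operand.indexOf('=');
        int paren = operand.indexOf('(');
        int quote = operand.indexOf('"');
        // Named: an '=' that comes before any '(' or '"'.
        if (eq >= 0 && (paren < 0 || eq < paren) && (quote < 0 || eq < quote)) {
            return Kind.NAMED;
        }
        // Typed: looks like a nested function call.
        if (paren > 0 && operand.endsWith(")")) {
            return Kind.TYPED;
        }
        return Kind.POSITIONAL;
    }
}
```

Applied to the update expression above, the three top-level operands come out as positional (petOwners), named (batchSize=5) and typed (the nested rollup call).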

Page 23: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {
      public RollupStream(StreamExpression expression, StreamFactory factory) throws IOException;

      public void open() throws IOException;
      public void close() throws IOException;
      public void setStreamContext(StreamContext context);
      public List<TupleStream> children();
      public int getCost();
      public StreamComparator getStreamSort();

      public StreamExpression toExpression(StreamFactory factory) throws IOException;
      public Explanation toExplanation(StreamFactory factory) throws IOException;

      public Tuple read() throws IOException;
    }

Page 24: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {

      public RollupStream(StreamExpression expression, StreamFactory factory) throws IOException {
        // grab all parameters out
        TupleStream incomingStream = extractStream(expression, factory);
        List<Metric> metrics = extractMetrics(expression, factory);
        String over = extractOver(expression, factory);

        // ......<validate input>......
      }
    }

Page 25: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {

      private TupleStream extractStream(StreamExpression expression, StreamFactory factory) throws IOException {
        // Get the operands that represent streams
        List<StreamExpression> streamExpressions = factory.getExpressionOperandsRepresentingTypes(
          expression, Expressible.class, TupleStream.class
        );

        // ......<validate there was exactly 1 stream found>......

        return factory.constructStream(streamExpressions.get(0));
      }
    }

Page 26: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {

      private List<Metric> extractMetrics(StreamExpression expression, StreamFactory factory) throws IOException {
        // Get the metric parameters
        List<StreamExpression> metricExpressions = factory.getExpressionOperandsRepresentingTypes(
          expression, Expressible.class, Metric.class
        );

        // Construct the metrics
        List<Metric> metrics = new ArrayList<Metric>();
        for(StreamExpression metricExpr : metricExpressions){
          metrics.add(factory.constructMetric(metricExpr));
        }

        return metrics;
      }
    }

Page 27: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {

      private String extractOver(StreamExpression expression, StreamFactory factory) throws IOException {
        // Get the over parameter
        StreamExpressionNamedParameter overExpression = factory.getNamedOperand(
          expression, "over"
        );

        // return the over value
        return ((StreamExpressionValue)overExpression.getParameter()).getValue();
      }
    }

Page 28: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {
      private TupleStream incomingStream;

      public void open() throws IOException {
        incomingStream.open();
      }

      public void close() throws IOException {
        incomingStream.close();
      }
    }

Page 29: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {
      private TupleStream incomingStream;

      public void setStreamContext(StreamContext context) {
        incomingStream.setStreamContext(context);
      }

      public List<TupleStream> children() {
        return Lists.newArrayList(incomingStream);
      }

      public StreamComparator getStreamSort() {
        return incomingStream.getStreamSort();
      }
    }

Page 30: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {
      private TupleStream incomingStream;
      private List<Metric> metrics;
      private String over;

      public StreamExpression toExpression(StreamFactory factory) throws IOException {
        StreamExpression expression = new StreamExpression(factory.getFunctionName(getClass()));

        // stream
        expression.addParameter(incomingStream.toExpression(factory));

        // over
        expression.addParameter(new StreamExpressionNamedParameter("over", over));

        // metrics
        for(Metric metric : metrics){
          expression.addParameter(metric.toExpression(factory));
        }

        return expression;
      }
    }

Page 31: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Core Pieces of Every Stream

    public class RollupStream extends TupleStream implements Expressible {
      private Metric[] metrics;

      public Explanation toExplanation(StreamFactory factory) throws IOException {
        Explanation explanation = new StreamExplanation(getStreamNodeId().toString())
          .withChildren(new Explanation[]{ incomingStream.toExplanation(factory) })
          .withFunctionName(factory.getFunctionName(getClass()))
          .withImplementingClass(getClass().getName())
          .withExpressionType(ExpressionType.STREAM_DECORATOR)
          .withExpression(toExpression(factory).toString());

        for(Metric metric : metrics){
          explanation.withHelper(metric.toExplanation(factory));
        }

        return explanation;
      }
    }

Page 32: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 TupleNumberStream

Add a field containing which number in the stream this tuple is

    tupleNumber(
      search(pets, q="type:dog", fl="age, name, owner", sort="owner asc")
    )

Page 33: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 TupleNumberStream – read()

    public class TupleNumberStream extends TupleStream implements Expressible {
      private TupleStream incomingStream; // read off expression in constructor
      private long tupleNumber = 0;

      /**
       * Read and return the next tuple.
       * For each tuple we will add a field 'tupleNumber' containing the number of this tuple
       * in the stream. Numbers start at 1. The EOF tuple is passed through unnumbered.
       */
      public Tuple read() throws IOException {
        Tuple nextTuple = incomingStream.read();

        if(!nextTuple.EOF){
          tupleNumber += 1;
          nextTuple.fields.put("tupleNumber", tupleNumber);
        }

        return nextTuple;
      }
    }
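The decorator pattern behind this read() can be tried out without a Solr cluster. The sketch below uses deliberately simplified stand-ins: `SketchTuple`, `SketchStream`, `ListSource`, `TupleNumberSketch` and `StreamDrain` are hypothetical types that only mimic the `fields` and `EOF` members used on the slide. It also shows the loop a caller would run, reading until the EOF tuple arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a tuple with fields and an EOF flag.
class SketchTuple {
    final Map<String, Object> fields = new HashMap<>();
    boolean EOF = false;
}

// Hypothetical, simplified stand-in for a stream: one read() call per tuple.
interface SketchStream {
    SketchTuple read();
}

// Source stream: emits the given tuples, then an EOF tuple on every later read.
class ListSource implements SketchStream {
    private final Iterator<SketchTuple> iterator;

    ListSource(List<SketchTuple> tuples) {
        this.iterator = tuples.iterator();
    }

    public SketchTuple read() {
        if (iterator.hasNext()) {
            return iterator.next();
        }
        SketchTuple eof = new SketchTuple();
        eof.EOF = true;
        return eof;
    }
}

// Decorator stream: numbers each real tuple, starting at 1.
class TupleNumberSketch implements SketchStream {
    private final SketchStream incoming;
    private long tupleNumber = 0;

    TupleNumberSketch(SketchStream incoming) {
        this.incoming = incoming;
    }

    public SketchTuple read() {
        SketchTuple next = incoming.read();
        if (!next.EOF) {              // leave the EOF marker untouched
            tupleNumber += 1;
            next.fields.put("tupleNumber", tupleNumber);
        }
        return next;
    }
}

// Caller side: keep reading until the EOF tuple shows up.
class StreamDrain {
    static List<SketchTuple> drain(SketchStream stream) {
        List<SketchTuple> out = new ArrayList<>();
        SketchTuple tuple = stream.read();
        while (!tuple.EOF) {
            out.add(tuple);
            tuple = stream.read();
        }
        return out;
    }
}
```

Because the decorator only depends on the `SketchStream` interface, it can wrap a source or another decorator, which is the same property that lets real streams nest inside one another in an expression.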

Page 34: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 RandomDropStream

Based on a drop rate provided by the user, randomly drop tuples from the stream

    randomDrop(
      search(pets, q="type:dog", fl="age, name, owner", sort="owner asc"),
      dropRate=.4
    )

Page 35: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 RandomDropStream – read()

    public class RandomDropStream extends TupleStream implements Expressible {
      private TupleStream incomingStream; // read off expression in constructor
      private double dropRate; // read off expression in constructor
      private Random randomizer = new Random();

      /**
       * For each tuple we decide if it should be dropped based on a random value vs the dropRate.
       * We will continue to read from the incoming stream until we either find a tuple that
       * we decide to not drop OR we find the EOF tuple (the end of the stream)
       */
      public Tuple read() throws IOException {
        Tuple nextTuple = incomingStream.read();

        while(!nextTuple.EOF && randomizer.nextDouble() < dropRate){
          nextTuple = incomingStream.read();
        }

        return nextTuple;
      }
    }
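The drop loop is easy to check at its two extremes: a drop rate of 0 should keep every tuple and a drop rate of 1 should keep none. The sketch below does that in plain Java under some assumptions of its own. A stream is modeled as a `Supplier` of maps, a map containing the key "EOF" marks the end of the stream, the `Random` is injected so a seed can be fixed, and the range check on `dropRate` is an addition not shown on the slide. `RandomDropSketch`, `source` and `countKept` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical stand-in for illustration only; not a SolrJ class.
class RandomDropSketch implements Supplier<Map<String, Object>> {
    private final Supplier<Map<String, Object>> incoming;
    private final double dropRate;
    private final Random randomizer;

    RandomDropSketch(Supplier<Map<String, Object>> incoming, double dropRate, Random randomizer) {
        // Assumption added for this sketch: reject rates outside [0, 1].
        if (dropRate < 0.0 || dropRate > 1.0) {
            throw new IllegalArgumentException("dropRate must be in [0, 1]");
        }
        this.incoming = incoming;
        this.dropRate = dropRate;
        this.randomizer = randomizer;
    }

    // In this sketch a tuple containing the key "EOF" ends the stream.
    static boolean isEof(Map<String, Object> tuple) {
        return tuple.containsKey("EOF");
    }

    // Keep reading until a tuple survives the random check or the stream ends.
    @Override
    public Map<String, Object> get() {
        Map<String, Object> next = incoming.get();
        while (!isEof(next) && randomizer.nextDouble() < dropRate) {
            next = incoming.get();
        }
        return next;
    }

    // Test source: emits n tuples with an "id" field, then EOF tuples forever.
    static Supplier<Map<String, Object>> source(int n) {
        int[] emitted = {0};
        return () -> {
            Map<String, Object> tuple = new HashMap<>();
            if (emitted[0] < n) {
                tuple.put("id", emitted[0]++);
            } else {
                tuple.put("EOF", true);
            }
            return tuple;
        };
    }

    // Counts the tuples a stream yields before its EOF tuple.
    static int countKept(Supplier<Map<String, Object>> stream) {
        int kept = 0;
        while (!isEof(stream.get())) {
            kept++;
        }
        return kept;
    }
}
```

Since `Random.nextDouble()` returns a value in [0, 1), a rate of 0 never drops and a rate of 1 always drops; rates in between keep a random subset whose size depends on the seed.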

Page 36: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 ConcatenateStream

Add a field containing the concatenation of two other fields to a tuple

    concatenate(
      search(pets, q="type:dog", fl="age, name, owner", sort="owner asc"),
      left="name",
      right="age"
    )

Page 37: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 ConcatenateStream – read()

    public class ConcatenateStream extends TupleStream implements Expressible {
      private String leftField;  // read off expression in constructor
      private String rightField; // read off expression in constructor

      /**
       * Read and return the next tuple.
       * For each tuple, add a field made up of the concatenation of two fields
       */
      public Tuple read() throws IOException {
        Tuple tuple = incomingStream.read();

        if(!tuple.EOF){
          if(tuple.fields.containsKey(leftField) && tuple.fields.containsKey(rightField)){
            tuple.fields.put(
              "newField",
              tuple.get(leftField).toString() + tuple.get(rightField).toString()
            );
          }
        }

        return tuple;
      }
    }

Page 38: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 Integrate with Solr – Option 1

When committing back to Solr, add each new stream to the default list in o.a.s.handler.StreamHandler.java

    public void inform(SolrCore core){
      …
      streamFactory.withFunctionName("tupleNumber", TupleNumberStream.class);
      streamFactory.withFunctionName("randomDrop", RandomDropStream.class);
      streamFactory.withFunctionName("concatenate", ConcatenateStream.class);
      …
    }

Page 39: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

03 Streams Available by Default

Source Streams: facet, features, gatherNodes, jdbc, model (Solr 6.3), random, search, shortestPath, stats, train, topic

Decorator Streams: classify (Solr 6.3), commit, complement, daemon, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, rollup, scoreNodes, select, sort, top, unique, update

Page 40: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 Integrate with Solr – Option 2

When keeping internal to your organization, add each new stream to the solrconfig.xml of each collection you want to use it in (Solr 6.3)

    <config>
      ...
      <expressible name="tupleNumber" class="your.name.space.TupleNumberStream"/>
      <expressible name="randomDrop" class="your.name.space.RandomDropStream"/>
      <expressible name="concatenate" class="your.name.space.ConcatenateStream"/>
      ...
    </config>

Page 41: Creating New Streams: Presented by Dennis Gove, Bloomberg LP

01 Questions

Reference Guide: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
Sample Code: https://github.com/dennisgove/solr-rev-2016
Rollup Stream: https://github.com/apache/lucene-solr/.../solrj/io/stream/RollupStream.java
Contact: [email protected]