Creating New Streams: Presented by Dennis Gove, Bloomberg LP


OCTOBER 11-14, 2016 • BOSTON, MA


Copyright 2016 Bloomberg Finance L.P. All rights reserved.

01 Bloomberg

●  Largest provider of financial news and information
●  Our strength is quickly and accurately delivering data, news and analytics
●  Creating high performance and accurate information retrieval systems is core to our strength

01 What’s Our Goal?

We’re going to explore what Solr Streams are and how you can extend the functionality to solve problems at your organizations.

02 Agenda

●  Solr Streams and Expressions
●  Expression Structure
●  Core Pieces of Every Stream Class
●  The read() function – 3 examples
●  Exposing New Streams in Solr

03 Solr Streams and Expressions

"Streaming Expressions provide a simple yet powerful stream processing language for SolrCloud. They are a suite of functions that can be combined to perform many different parallel computing tasks." [Solr Reference Guide v6.0]


A Solr Stream is a pipeline of actions performed over a set of tuples (documents)


A Solr Streaming Expression is a way to describe that pipeline:

  rollup(
    search(pets, q="type:dog", fl="ownerId,age", sort="ownerId ASC"),
    over="ownerId",
    min(age), max(age), count(*)
  )


{ "ownerId" : 12345, "age" : 13 }


{ "ownerId" : 12345, "min(age)" : 3, "max(age)" : 14, "count(*)" : 3 }
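To make the step from individual tuples to a rolled-up tuple concrete, here is a minimal sketch in plain Java. It does not use any Solr classes: RollupSketch, pet and rollupByOwner are hypothetical names invented for illustration, and tuples are modeled as simple maps. It assumes the input is already sorted by ownerId, which is what the sort parameter of the search provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for rollup(): plain Java, no Solr classes involved.
class RollupSketch {

  // Builds a tuple with the two fields used in the example expression.
  static Map<String, Object> pet(long ownerId, long age) {
    Map<String, Object> tuple = new LinkedHashMap<>();
    tuple.put("ownerId", ownerId);
    tuple.put("age", age);
    return tuple;
  }

  // Collapses consecutive tuples that share an ownerId into one summary tuple.
  // Relies on the input being sorted by ownerId.
  static List<Map<String, Object>> rollupByOwner(List<Map<String, Object>> sorted) {
    List<Map<String, Object>> out = new ArrayList<>();
    Map<String, Object> current = null;
    for (Map<String, Object> tuple : sorted) {
      long owner = (Long) tuple.get("ownerId");
      long age = (Long) tuple.get("age");
      if (current == null || ((Long) current.get("ownerId")).longValue() != owner) {
        // A new owner starts a new group, seeded with the first age seen.
        current = new LinkedHashMap<>();
        current.put("ownerId", owner);
        current.put("min(age)", age);
        current.put("max(age)", age);
        current.put("count(*)", 1L);
        out.add(current);
      } else {
        // Same owner as the previous tuple: update the running metrics.
        current.put("min(age)", Math.min((Long) current.get("min(age)"), age));
        current.put("max(age)", Math.max((Long) current.get("max(age)"), age));
        current.put("count(*)", (Long) current.get("count(*)") + 1);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> dogs =
        Arrays.asList(pet(12345, 13), pet(12345, 3), pet(12345, 14), pet(67890, 5));
    System.out.println(rollupByOwner(dogs));
  }
}
```

With the three sample dogs for owner 12345 (ages 13, 3 and 14), the first summary tuple carries min(age) 3, max(age) 14 and count(*) 3. Because the input is sorted, only one group has to be held in memory at a time.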


03 Solr Streams and Expressions

innerJoin(
  on="ownerId=personId",
  rollup(
    search(pets, q="type:dog", fl="ownerId,age", sort="ownerId ASC"),
    over="ownerId",
    min(age), max(age), count(*)
  ),
  search(people, q="age:[25 TO 40] AND state:(MA RI NH)", fl="personId,name", sort="personId ASC")
)


03 Solr Streams and Expressions

All dog owners from MA, RI, or NH, aged 25 through 40, who have voted in at least 1 presidential primary since 2008, excluding those who have already donated > $50

{ "personId" : 12345, "name" : "Jane Doe" }

{ "personId" : 12345, "name" : "Jane Doe", "ownerId" : 12345, "min(age)" : 2, "max(age)" : 14, "count(*)" : 3 }
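The joined tuple above combines fields from both sides of the join. As a rough illustration of how two sorted inputs can be joined in one pass, here is a sketch in plain Java. InnerJoinSketch and its methods are hypothetical names, tuples are modeled as maps, and the sketch assumes that keys are Long values, that each input is sorted ascending by its key, and that keys are unique within each input. It is not Solr's join implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for innerJoin(); plain Java, no Solr classes involved.
class InnerJoinSketch {

  // Builds a tuple from alternating field names and values.
  static Map<String, Object> tuple(Object... namesAndValues) {
    Map<String, Object> t = new LinkedHashMap<>();
    for (int i = 0; i < namesAndValues.length; i += 2) {
      t.put((String) namesAndValues[i], namesAndValues[i + 1]);
    }
    return t;
  }

  // Merge join over two lists that are each sorted ascending by their key.
  // Assumes keys are Long values and unique within each list.
  static List<Map<String, Object>> innerJoin(List<Map<String, Object>> left, String leftKey,
                                             List<Map<String, Object>> right, String rightKey) {
    List<Map<String, Object>> out = new ArrayList<>();
    int l = 0;
    int r = 0;
    while (l < left.size() && r < right.size()) {
      long lk = (Long) left.get(l).get(leftKey);
      long rk = (Long) right.get(r).get(rightKey);
      if (lk < rk) {
        l++; // left key has no partner yet, advance left
      } else if (lk > rk) {
        r++; // right key has no partner yet, advance right
      } else {
        // Keys match: emit one tuple holding the fields of both sides.
        Map<String, Object> joined = new LinkedHashMap<>(right.get(r));
        joined.putAll(left.get(l));
        out.add(joined);
        l++;
        r++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> owners = new ArrayList<>();
    owners.add(tuple("ownerId", 12345L, "min(age)", 2L, "max(age)", 14L, "count(*)", 3L));
    owners.add(tuple("ownerId", 99999L, "min(age)", 1L, "max(age)", 1L, "count(*)", 1L));
    List<Map<String, Object>> people = new ArrayList<>();
    people.add(tuple("personId", 12345L, "name", "Jane Doe"));
    System.out.println(innerJoin(owners, "ownerId", people, "personId"));
  }
}
```

Both inputs are sorted by the join key, so each side is read once and neither has to be buffered in full. This is one reason the example expression sorts both searches by the field used in on.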


03 Expression Structure

functionName( positionalParameters, named="parameters", typed(parameters) )


update( petOwners,
  batchSize=5,
  rollup(
    over="ownerId",
    search(pets, q=*:*, fl="ownerId,age", sort="ownerId ASC"),
    min(age), max(age), count(*)
  )
)

The function name is mapped to the class implementing the logic.

  update -> UpdateStream
  rollup -> RollupStream
  search -> CloudSolrStream
  min    -> MinMetric
  max    -> MaxMetric
  count  -> CountMetric
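The name-to-class mapping can be pictured as a small registry. The sketch below is a hypothetical miniature written for illustration, not Solr's StreamFactory: FunctionRegistrySketch, classFor, functionNameFor and the empty stand-in classes are all invented names. Only the method name withFunction is borrowed from the StreamHandler snippet shown later in this deck.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of a function registry; not Solr's StreamFactory.
class FunctionRegistrySketch {

  // Empty stand-ins for implementing classes, used only as map values.
  static class RollupStandIn {}
  static class MinStandIn {}

  private final Map<String, Class<?>> functions = new HashMap<>();

  // Registers a function name and returns this so that calls can be chained.
  FunctionRegistrySketch withFunction(String name, Class<?> implementation) {
    functions.put(name, implementation);
    return this;
  }

  // Looks up the implementing class for a function name.
  Class<?> classFor(String name) {
    Class<?> implementation = functions.get(name);
    if (implementation == null) {
      throw new IllegalArgumentException("Unknown function: " + name);
    }
    return implementation;
  }

  // Reverse lookup: the function name registered for a class, or null if none.
  String functionNameFor(Class<?> implementation) {
    for (Map.Entry<String, Class<?>> entry : functions.entrySet()) {
      if (entry.getValue() == implementation) {
        return entry.getKey();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    FunctionRegistrySketch registry = new FunctionRegistrySketch()
        .withFunction("rollup", RollupStandIn.class)
        .withFunction("min", MinStandIn.class);
    System.out.println(registry.classFor("rollup").getSimpleName());
    System.out.println(registry.functionNameFor(MinStandIn.class));
  }
}
```

The reverse lookup matters when a stream has to be turned back into an expression string, since the class then needs to know the name it was registered under.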


Positional parameters are those expected to be in a particular position. Most often used to reference collection or field names.


Named parameters can exist in any non-positional location and in any order. Quotes are only required if the value contains a non-alphanumeric character.


Typed parameters are useful in situations where you can accept a parameter representing some other thing, such as streams or metrics, but you don’t care exactly what is provided.


03 Core Pieces of Every Stream

public class RollupStream extends TupleStream implements Expressible {
  // Method signatures only; bodies omitted
  public RollupStream(StreamExpression expression, StreamFactory factory) throws IOException;

  public void open() throws IOException;
  public void close() throws IOException;

  public void setStreamContext(StreamContext context);
  public List<TupleStream> children();
  public int getCost();
  public StreamComparator getStreamSort();

  public StreamExpression toExpression(StreamFactory factory) throws IOException;
  public Explanation toExplanation(StreamFactory factory) throws IOException;

  public Tuple read() throws IOException;
}

public class RollupStream extends TupleStream implements Expressible {
  public RollupStream(StreamExpression expression, StreamFactory factory) throws IOException {
    // grab all parameters out
    TupleStream incomingStream = extractStream(expression, factory);
    List<Metric> metrics = extractMetrics(expression, factory);
    String over = extractOver(expression, factory);

    // ......<validate input>......
  }
}

public class RollupStream extends TupleStream implements Expressible {
  private TupleStream extractStream(StreamExpression expression, StreamFactory factory) throws IOException {
    // Get the parameters that represent streams
    List<StreamExpression> streamExpressions = factory.getExpressionOperandsRepresentingTypes(
        expression, Expressible.class, TupleStream.class);

    // ......<validate there was exactly 1 stream found>......

    return factory.constructStream(streamExpressions.get(0));
  }
}

public class RollupStream extends TupleStream implements Expressible {
  private List<Metric> extractMetrics(StreamExpression expression, StreamFactory factory) throws IOException {
    // Get the metric parameters
    List<StreamExpression> metricExpressions = factory.getExpressionOperandsRepresentingTypes(
        expression, Expressible.class, Metric.class);

    // Construct the metrics
    List<Metric> metrics = new ArrayList<Metric>();
    for(StreamExpression metricExpr : metricExpressions){
      metrics.add(factory.constructMetric(metricExpr));
    }

    return metrics;
  }
}

public class RollupStream extends TupleStream implements Expressible {
  private String extractOver(StreamExpression expression, StreamFactory factory) throws IOException {
    // Get the over parameter
    StreamExpressionNamedParameter overExpression = factory.getNamedOperand(expression, "over");

    // return the over value
    return ((StreamExpressionValue)overExpression.getParameter()).getValue();
  }
}

public class RollupStream extends TupleStream implements Expressible {
  private TupleStream incomingStream;

  public void open() throws IOException {
    incomingStream.open();
  }

  public void close() throws IOException {
    incomingStream.close();
  }
}
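The open() and close() methods above simply forward to the wrapped stream. To show how a caller might drive that lifecycle, here is a self-contained sketch. MiniTuple, MiniStream, ListStream, PassThroughStream and drain are hypothetical stand-ins written for this example rather than SolrJ types, and they omit the IOException declarations to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class StreamLifecycleSketch {

  // Hypothetical stand-in for a tuple: a value plus an end-of-stream flag.
  static class MiniTuple {
    final String value;
    final boolean EOF;
    MiniTuple(String value, boolean eof) { this.value = value; this.EOF = eof; }
  }

  // Hypothetical stand-in for the open/read/close part of a stream.
  interface MiniStream {
    void open();
    MiniTuple read();
    void close();
  }

  // Source stream over a fixed list; emits an EOF tuple once the list is exhausted.
  static class ListStream implements MiniStream {
    private final List<String> values;
    private Iterator<String> iterator;
    boolean closed = false;
    ListStream(List<String> values) { this.values = values; }
    public void open() { iterator = values.iterator(); }
    public MiniTuple read() {
      return iterator.hasNext() ? new MiniTuple(iterator.next(), false) : new MiniTuple(null, true);
    }
    public void close() { closed = true; }
  }

  // Decorator that forwards every lifecycle call to the stream it wraps.
  static class PassThroughStream implements MiniStream {
    private final MiniStream incomingStream;
    PassThroughStream(MiniStream incomingStream) { this.incomingStream = incomingStream; }
    public void open() { incomingStream.open(); }
    public MiniTuple read() { return incomingStream.read(); }
    public void close() { incomingStream.close(); }
  }

  // Drains a stream: open once, read until the EOF tuple, always close.
  static List<String> drain(MiniStream stream) {
    List<String> seen = new ArrayList<>();
    stream.open();
    try {
      for (MiniTuple t = stream.read(); !t.EOF; t = stream.read()) {
        seen.add(t.value);
      }
    } finally {
      stream.close();
    }
    return seen;
  }

  public static void main(String[] args) {
    ListStream source = new ListStream(Arrays.asList("a", "b"));
    System.out.println(drain(new PassThroughStream(source)));
    System.out.println("source closed: " + source.closed);
  }
}
```

Calling close() in a finally block means the wrapped source is released even if a read fails partway through. The caller only ever touches the outermost stream, which is why each decorator has to forward these calls.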

public class RollupStream extends TupleStream implements Expressible {
  private TupleStream incomingStream;

  public void setStreamContext(StreamContext context) {
    incomingStream.setStreamContext(context);
  }

  public List<TupleStream> children() {
    return Lists.newArrayList(incomingStream);
  }

  public StreamComparator getStreamSort() {
    return incomingStream.getStreamSort();
  }
}

public class RollupStream extends TupleStream implements Expressible {
  private TupleStream incomingStream;
  private List<Metric> metrics;
  private String over;

  public StreamExpression toExpression(StreamFactory factory) throws IOException {
    StreamExpression expression = new StreamExpression(factory.getFunctionName(getClass()));

    // stream (cast because toExpression is declared on Expressible)
    expression.addParameter(((Expressible)incomingStream).toExpression(factory));

    // over
    expression.addParameter(new StreamExpressionNamedParameter("over", over));

    // metrics
    for(Metric metric : metrics){
      expression.addParameter(metric.toExpression(factory));
    }

    return expression;
  }
}

public class RollupStream extends TupleStream implements Expressible {
  private Metric[] metrics;

  public Explanation toExplanation(StreamFactory factory) throws IOException {
    Explanation explanation = new StreamExplanation(getStreamNodeId().toString())
      .withChildren(new Explanation[]{ incomingStream.toExplanation(factory) })
      .withFunctionName(factory.getFunctionName(getClass()))
      .withImplementingClass(getClass().getName())
      .withExpressionType(ExpressionType.STREAM_DECORATOR)
      .withExpression(toExpression(factory).toString());

    for(Metric metric : metrics){
      explanation.withHelper(metric.toExplanation(factory));
    }

    return explanation;
  }
}

01 TupleNumberStream

Add a field containing which number in the stream this tuple is

  tupleNumber(
    search(pets, q="type:dog", fl="age, name, owner", sort="owner asc")
  )

01 TupleNumberStream – read()

public class TupleNumberStream extends TupleStream implements Expressible {
  private long tupleNumber = 0;

  /**
   * Read and return the next tuple.
   * For each tuple we will add a field 'tupleNumber' containing the number of this tuple
   * in the stream. Numbers start at 1.
   */
  public Tuple read() throws IOException {
    Tuple nextTuple = incomingStream.read();

    // Leave the EOF tuple untouched so it is neither counted nor given a number
    if(!nextTuple.EOF){
      tupleNumber += 1;
      nextTuple.fields.put("tupleNumber", tupleNumber);
    }

    return nextTuple;
  }
}
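The numbering behavior can be tried without a Solr cluster. The following sketch models a stream as a java.util.function.Supplier of map-based tuples, with a tuple containing an "EOF" key standing in for the end-of-stream marker. TupleNumberSketch, source, isEof and tupleNumber are hypothetical names chosen for this example, not SolrJ API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for TupleNumberStream; each get() plays the role of read().
class TupleNumberSketch {

  // Source: hands out the given tuples in order, then an EOF marker tuple.
  static Supplier<Map<String, Object>> source(List<Map<String, Object>> tuples) {
    Iterator<Map<String, Object>> it = tuples.iterator();
    return () -> {
      if (it.hasNext()) {
        return it.next();
      }
      Map<String, Object> eof = new HashMap<>();
      eof.put("EOF", true);
      return eof;
    };
  }

  static boolean isEof(Map<String, Object> tuple) {
    return tuple.containsKey("EOF");
  }

  // Decorator: adds a 1-based 'tupleNumber' field to every non-EOF tuple.
  static Supplier<Map<String, Object>> tupleNumber(Supplier<Map<String, Object>> incoming) {
    long[] counter = {0}; // single-element array so the lambda can update it
    return () -> {
      Map<String, Object> next = incoming.get();
      if (!isEof(next)) {
        counter[0] += 1;
        next.put("tupleNumber", counter[0]);
      }
      return next;
    };
  }

  public static void main(String[] args) {
    Map<String, Object> rex = new HashMap<>();
    rex.put("name", "Rex");
    Map<String, Object> fido = new HashMap<>();
    fido.put("name", "Fido");
    Supplier<Map<String, Object>> numbered = tupleNumber(source(Arrays.asList(rex, fido)));
    for (Map<String, Object> t = numbered.get(); !isEof(t); t = numbered.get()) {
      System.out.println(t);
    }
  }
}
```

The counter lives in the decorator instance, so numbering restarts for every new stream and counts only tuples that actually pass through this decorator.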

01 RandomDropStream

Based on a drop rate provided by the user, randomly drop tuples from the stream

  randomDrop(
    search(pets, q="type:dog", fl="age, name, owner", sort="owner asc"),
    dropRate=.4
  )

01 RandomDropStream – read()

public class RandomDropStream extends TupleStream implements Expressible {
  private double dropRate; // read off expression in constructor
  private Random randomizer = new Random(); // java.util.Random

  /**
   * For each tuple we decide if it should be dropped based on a random value vs the dropRate.
   * We will continue to read from the incoming stream until we either find a tuple that
   * we decide to not drop OR we find the EOF tuple (the end of the stream)
   */
  public Tuple read() throws IOException {
    Tuple nextTuple = incomingStream.read();

    while(!nextTuple.EOF && randomizer.nextDouble() < dropRate){
      nextTuple = incomingStream.read();
    }

    return nextTuple;
  }
}
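The drop loop is easy to check at its two extremes. In this sketch the incoming stream is a plain Iterator of strings and null stands in for the EOF tuple. RandomDropSketch, nextOrNull and drain are hypothetical names, and the Random instance is passed in so that a seed can be fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for RandomDropStream; null plays the role of the EOF tuple.
class RandomDropSketch {
  private final Iterator<String> incoming;
  private final double dropRate;
  private final Random randomizer;

  RandomDropSketch(Iterator<String> incoming, double dropRate, Random randomizer) {
    this.incoming = incoming;
    this.dropRate = dropRate;
    this.randomizer = randomizer;
  }

  private String nextOrNull() {
    return incoming.hasNext() ? incoming.next() : null;
  }

  // Keeps reading until a value survives the random check or the input ends.
  String read() {
    String next = nextOrNull();
    while (next != null && randomizer.nextDouble() < dropRate) {
      next = nextOrNull();
    }
    return next;
  }

  // Collects everything the stream emits before its end marker.
  static List<String> drain(RandomDropSketch stream) {
    List<String> kept = new ArrayList<>();
    for (String value = stream.read(); value != null; value = stream.read()) {
      kept.add(value);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("Rex", "Fido", "Spot", "Bella", "Max");
    System.out.println(drain(new RandomDropSketch(names.iterator(), 0.4, new Random(42))));
  }
}
```

Since nextDouble() returns a value in [0, 1), a drop rate of 0 keeps every value and a drop rate of 1 drops every value. The end marker is checked before the random draw, so it is never dropped and the caller always sees the end of the stream.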

01 ConcatenateStream

Add a field containing the concatenation of two other fields to a tuple

  concatenate(
    search(pets, q="type:dog", fl="age, name, owner", sort="owner asc"),
    left="name",
    right="age"
  )

01 ConcatenateStream – read()

public class ConcatenateStream extends TupleStream implements Expressible {
  private String leftField;  // read off expression in constructor
  private String rightField; // read off expression in constructor

  /**
   * Read and return the next tuple.
   * For each tuple, add a field made up of the concatenation of two fields
   */
  public Tuple read() throws IOException {
    Tuple tuple = incomingStream.read();

    if(!tuple.EOF){
      if(tuple.fields.containsKey(leftField) && tuple.fields.containsKey(rightField)){
        tuple.fields.put(
          "newField",
          tuple.get(leftField).toString() + tuple.get(rightField).toString()
        );
      }
    }

    return tuple;
  }
}
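The per-tuple part of this read() can be exercised on its own. The sketch below applies the same idea to a map-based tuple. ConcatenateSketch and concatenate are hypothetical names, and the sketch checks for null values rather than key presence, which is a small deviation from the slide code that also covers fields stored with a null value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-tuple work in ConcatenateStream.
class ConcatenateSketch {

  // Adds 'newField' when both source fields hold a value; otherwise leaves the tuple as is.
  static Map<String, Object> concatenate(Map<String, Object> tuple, String leftField, String rightField) {
    Object left = tuple.get(leftField);
    Object right = tuple.get(rightField);
    if (left != null && right != null) {
      tuple.put("newField", left.toString() + right.toString());
    }
    return tuple;
  }

  public static void main(String[] args) {
    Map<String, Object> dog = new HashMap<>();
    dog.put("name", "Rex");
    dog.put("age", 7L);
    System.out.println(concatenate(dog, "name", "age").get("newField"));

    Map<String, Object> noAge = new HashMap<>();
    noAge.put("name", "Fido");
    System.out.println(concatenate(noAge, "name", "age").containsKey("newField"));
  }
}
```

A tuple that lacks either field passes through unchanged, so consumers further along the pipeline should not assume that newField is always present.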

01 Integrate with Solr – Option 1

When committing back to Solr, add each new stream to the default list in o.a.s.handler.StreamHandler.java

public void inform(SolrCore core){
  …
  streamFactory.withFunction("tupleNumber", TupleNumberStream.class);
  streamFactory.withFunction("randomDrop", RandomDropStream.class);
  streamFactory.withFunction("concatenate", ConcatenateStream.class);
  …
}

03 Streams Available by Default

Source Streams: facet, features, gatherNodes, jdbc, model (Solr 6.3), random, search, shortestPath, stats, train, topic

Decorator Streams: classify (Solr 6.3), commit, complement, daemon, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, rollup, scoreNodes, select, sort, top, unique, update

01 Integrate with Solr – Option 2

When keeping internal to your organization, add each new stream to the solrconfig.xml of each collection you want to use it in (Solr 6.3)

<config>
  ...
  <expressible name="tupleNumber" class="your.name.space.TupleNumberStream"/>
  <expressible name="randomDrop" class="your.name.space.RandomDropStream"/>
  <expressible name="concatenate" class="your.name.space.ConcatenateStream"/>
  ...
</config>

01 Questions

Reference Guide: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
Sample Code: https://github.com/dennisgove/solr-rev-2016
Rollup Stream: https://github.com/apache/lucene-solr/.../solrj/io/stream/RollupStream.java
Contact: dpgove@gmail.com
