Solr JDBC - Lucene/Solr Revolution 2016

Preview:

Citation preview

O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A

Solr JDBCKevin Risden

Apache Lucene/Solr Committer; Avalon Consulting, LLC

3

03About me

• Consultant with Avalon Consulting, LLC• ~4 years working with Hadoop and Search• Contributed patches to Ambari, HBase, Knox, Solr, Storm• Installation, security, performance tuning, development,

administration

• Kevin Risden• Apache Lucene/Solr Committer• YCSB Contributor

4

03Overview• Background• Use Case• Solr JDBC• Demo• Future Development/Improvements

5

01Background - What is JDBC?

The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database.

Source: https://docs.oracle.com/javase/tutorial/jdbc/overview/

JDBC drivers convert SQL into a backend query.

6

01Background - Why should you care about Solr JDBC?

• SQL skills are prolific.• JDBC drivers exist for most relational databases.• Existing reporting tools work with JDBC/ODBC

drivers.

Solr 6 works with SQL and existing JDBC tools!

7

01Use Case – Analytics – Utility RatesData set: 2011 Utility RatesQuestions:• How many utility companies serve the state of Maryland?• Which Maryland utility has the cheapest residential rates?• What are the minimum and maximum residential power rates excluding missing data

elements?• What is the state and zip code with the highest residential rate?

How could you answer those questions with Solr?

Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/

• Facets• Filter Queries• Filters• Grouping

• Sorting• Stats• String queries

together

8

01Use Case – Analytics – Utility Rates

Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/

Method: Lucene syntaxQuestions:• How many utility companies serve the state of Maryland?

http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1

• Which Maryland utility has the cheapest residential rates?http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc

• What are the minimum and maximum residential power rates excluding missing data elements?http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate

• What is the state and zip code with the highest residential rate?http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1Is there a better way?

9

01Solr JDBCHighlights• JDBC Driver for Solr• Powered by Streaming Expressions and Parallel SQL

• Thursday - Parallel SQL and Analytics with Solr – Yonik Seeley• Thursday - Creating New Streaming Expressions – Dennis

Gove• Integrates with any* JDBC client * tested with the JDBC clients in this presentation

Usagejdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME

Apache Solr Reference Guide - Parallel SQL Interface

10

01Solr JDBC - Architecture

11

01DemoProgramming Languages• Java• Python/Jython• R• Apache Spark

Web• Apache Zeppelin• RStudio

GUI – JDBC• DbVisualizer• SQuirreL SQL

GUI – ODBC• Microsoft Excel• Tableau*

https://github.com/risdenk/solrj-jdbc-testing

12

01Demo – Javaimport org.slf4j.Logger;import org.slf4j.LoggerFactory;import java.sql.*;public class SolrJJDBCTestingJava { private static final Logger LOGGER = LoggerFactory.getLogger(SolrJJDBCTestingJava.class); public static void main(String[] args) throws Exception { String sql = args[0]; try (Connection con = DriverManager.getConnection("jdbc:solr://solr:9983?collection=test")) { try (Statement stmt = con.createStatement()) { try (ResultSet rs = stmt.executeQuery(sql)) { ResultSetMetaData rsMetaData = rs.getMetaData(); int columns = rsMetaData.getColumnCount(); StringBuilder header = new StringBuilder(); for(int i = 1; i < columns + 1; i++) { header.append(rsMetaData.getColumnLabel(i)).append(","); } LOGGER.info(header.toString()); while (rs.next()) { StringBuilder row = new StringBuilder(); for(int i = 1; i < columns + 1; i++) { row.append(rs.getObject(i)).append(","); } LOGGER.info(row.toString()); } } } } }} Apache Solr Reference Guide - Generic

13

01Demo – Python#!/usr/bin/env python# https://pypi.python.org/pypi/JayDeBeApi/import jaydebeapiimport sysif __name__ == '__main__': jdbc_url = "jdbc:solr://solr:9983?collection=test” driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl” statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10” conn = jaydebeapi.connect(driverName, jdbc_url) curs = conn.cursor() curs.execute(statement) print(curs.fetchall()) conn.close()

Apache Solr Reference Guide - Python/Jython

14

01Demo – Jython#!/usr/bin/env jython# http://www.jython.org/jythonbook/en/1.0/DatabasesAndJython.html# https://wiki.python.org/jython/DatabaseExamples#SQLite_using_JDBCimport sys from java.langimport Class from java.sqlimport DriverManager, SQLExceptionif __name__ == '__main__': jdbc_url = "jdbc:solr://solr:9983?collection=test” driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl” statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10” dbConn = DriverManager.getConnection(jdbc_url) stmt = dbConn.createStatement() resultSet = stmt.executeQuery(statement) while resultSet.next(): print(resultSet.getString("fielda")) resultSet.close() stmt.close() dbConn.close() Apache Solr Reference Guide - Python/Jython

15

01Demo – R# https://www.rforge.net/RJDBC/library("RJDBC")solrCP <- c(list.files('/opt/solr/dist/solrj-lib', full.names=TRUE), list.files('/opt/solr/dist', pattern='solrj', full.names=TRUE, recursive = TRUE))drv <- JDBC("org.apache.solr.client.solrj.io.sql.DriverImpl", solrCP, identifier.quote="`")conn <- dbConnect(drv, "jdbc:solr://solr:9983?collection=test", "user", "pwd")dbGetQuery(conn, "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10")dbDisconnect(conn)

Apache Solr Reference Guide - R

16

01Demo – Apache Zeppelin

Apache Solr Reference Guide - Apache Zeppelin

17

01Demo – RStudio

18

01Demo – DbVisualizer

Apache Solr Reference Guide - DbVisualizer

19

01Demo – SQuirreL SQL

Apache Solr Reference Guide - SQuirreL SQL

20

01Demo – Microsoft Excel

21

01Use Case – Analytics – Utility Rates

Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/

Method: Lucene syntaxQuestions:• How many utility companies serve the state of Maryland?

http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1

• Which Maryland utility has the cheapest residential rates?http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc

• What are the minimum and maximum residential power rates excluding missing data elements?http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate

• What is the state and zip code with the highest residential rate?http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1Is there a better way?

22

01Use Case – Analytics – Utility RatesMethod: SQLQuestions:• How many utility companies serve the state of Maryland?

select distinct utility_name from rates where state='MD';• Which Maryland utility has the cheapest residential rates?

select utility_name,min(res_rate) from rates where state='MD' group by utility_name order by min(res_rate) asc limit 1;

• What are the minimum and maximum residential power rates excluding missing data elements?

select min(res_rate),max(res_rate) from rates where not res_rate = 0;• What is the state and zip code with the highest residential rate?

select state,zip,max(res_rate) from rates group by state,zip order by max(res_rate) desc limit 1;

How should you answer those questions with Solr? – Using SQL!Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/

23

01Use Case – Analytics – Utility Rates

How should you answer those questions with Solr? – Using SQL!Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/

24

01Future Development/Improvements• Replace Presto with Apache Calcite - SOLR-8593• Improve SQL compatibility• Ability to specify optimization rules (push downs, joins,

etc)• Potentially use Avatica JDBC/ODBC drivers

• Streaming Expressions/Parallel SQL improvements - SOLR-8125

• JDBC driver improvements - SOLR-8659Info on how to get involved

25

01Future Development/ImprovementsSQL Join

Info on how to get involved

SELECT movie_title,character_name,lineFROM movie_dialogs_movie_titles_metadata aJOIN movie_dialogs_movie_lines bONa.movieID=b.movieID;

select( innerJoin( search(movie_dialogs_movie_titles_metadata, q=*:*, fl="movieID,movie_title", sort="movieID asc"), search(movie_dialogs_movie_lines, q=*:*, fl="movieID,character_name,line", sort="movieID asc"), on="movieID” ), movie_title,character_name,line)

Streaming Expression Join

26

01Questions?