
Schemaless Solr and the Solr Schema REST API


DESCRIPTION

Steve will show how and why to use Solr’s new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically adds previously unseen fields to the schema.


SCHEMALESS SOLR AND THE SOLR SCHEMA REST API
Steve Rowe
Senior Software Engineer, LucidWorks
Twitter: @steven_a_rowe


Who am I?

•  LucidWorks employee
•  Lucene/Solr committer since 2010
•  JFlex committer since 2008
•  Previously at the Center for Natural Language Processing at Syracuse University’s iSchool (School of Information)
•  Twitter: @steven_a_rowe


Schemaless Solr

•  As of version 4.4, Solr can operate in schemaless mode:
    –  No need to pre-configure fields in the schema
    –  As documents are indexed, previously unknown fields are automatically added to the schema
    –  Field types are auto-detected from a limited set of basic types:
        •  Long, Double, Boolean, Date, Text (default)
        •  All are multi-valued
    –  Works in standalone Solr and SolrCloud

•  Solr features used to implement schemaless mode:
    –  Managed schema
        •  Required for runtime schema modification
    –  Field value class guessing
        •  Parsers attempt to detect the Java class of String-valued field content
    –  Automatic schema field addition
        •  Java class(es) mapped to schema field type


The slide about the nature and utility of schemalessness

•  “Schemaless” does not mean that there is no schema
•  Search applications need schemas to support non-trivial document models
    –  No schema needed when there is only one field, or only one field type, i.e. all fields share:
        •  Document & query processing, including analysis
        •  Index features & format
        •  Similarity implementation
        •  (etc.)
    –  Otherwise, search apps need to manage per-field processing configuration (i.e. a schema) to consistently index documents and effectively serve queries
•  So what does “schemaless” mean for Solr?
    –  No up-front schema configuration required
    –  Schema discovery: document structure is either not fixed or not fully known


Dynamic fields

•  Convention over configuration
•  Glob-like patterns match field names with field types

    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
    <fieldType name="int" class="solr.TrieIntField"
               precisionStep="0" positionIncrementGap="0"/>

•  Dynamic fields solve the problem of assigning field types to unknown fields by inferring a field’s type from its name
•  By contrast, Solr’s schemaless mode infers an unknown field’s type from its value or values
•  These two approaches are complementary
•  The Solr schemaless example defines a number of dynamic fields, including the *_i → int mapping above
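The name-based inference described for dynamic fields can be sketched in a few lines of Python. This is only an illustration of glob-style matching, not Solr's matching code: the `*_s` and `*_t` entries and the first-match-wins ordering are assumptions, and only the `*_i` to `int` pair comes from the slide.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern table. Only "*_i" -> "int" appears in the slide;
# the other two entries are assumed for illustration.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_t", "text_general"),
]


def type_for(field_name, patterns=DYNAMIC_FIELDS):
    """Return the type of the first glob pattern matching field_name, or None."""
    for pattern, field_type in patterns:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # no pattern matched: the name alone says nothing about the type


print(type_for("sequence_i"))
print(type_for("price"))
```

A name such as `price` matches no pattern here, which is the gap that value-based guessing in schemaless mode is meant to fill.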


Schemaless mode example

From example/example-schemaless/solr/collection1/conf/schema.xml:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="long" indexed="true" stored="true"/>

From example/exampledocs/books.csv:

    id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
    0441385532,book,Jhereg,7.95,false,Steven Brust,Vlad Taltos,1,fantasy
    ...

    $ cd example && java -Dsolr.solr.home=example-schemaless/solr -jar start.jar

    $ cd exampledocs && java -Dtype=text/csv -jar post.jar books.csv

    SimplePostTool version 1.5
    Posting files to base url http://localhost:8983/solr/update using content-type text/csv..
    POSTing file books.csv
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/update..
    Time spent: 0:00:00.147


Schemaless mode example

    $ curl http://localhost:8983/solr/schema/fields

    { "fields":[{ "name":"_version_", "type":"long", "indexed":true, "stored":true },
                { "name":"author", "type":"text_general" },
                { "name":"cat", "type":"text_general" },
                { "name":"id", "type":"string", "multiValued":false, "indexed":true,
                  "required":true, "stored":true,
                  "uniqueKey":true },
                { "name":"inStock", "type":"booleans" },
                { "name":"name", "type":"text_general" },
                { "name":"price", "type":"tdoubles" }]}

From example/example-schemaless/solr/collection1/conf/schema.xml:

    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <fieldType name="tdoubles" class="solr.TrieDoubleField" precisionStep="8"
               positionIncrementGap="0" multiValued="true"/>

The indexed document:

    id          cat   name    price  inStock  author        series_t     sequence_i  genre_s
    0441385532  book  Jhereg  7.95   false    Steven Brust  Vlad Taltos  1           fantasy
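To see which fields were added automatically, the `/schema/fields` response can be read with a few lines of Python. The sketch below works on the JSON shown on the slide, pasted in as a string so that it runs without a Solr server. The helper names `field_types` and `auto_added` are made up for this example, and treating `id` and `_version_` as the only predefined fields is an assumption based on the schema.xml excerpt.

```python
import json

# The response body from the slide, inlined so the example is self-contained.
RESPONSE = """
{ "fields":[{ "name":"_version_", "type":"long", "indexed":true, "stored":true },
            { "name":"author", "type":"text_general" },
            { "name":"cat", "type":"text_general" },
            { "name":"id", "type":"string", "multiValued":false, "indexed":true,
              "required":true, "stored":true, "uniqueKey":true },
            { "name":"inStock", "type":"booleans" },
            { "name":"name", "type":"text_general" },
            { "name":"price", "type":"tdoubles" }]}
"""

# Assumption: these two fields were declared up front in schema.xml.
PREDEFINED = {"id", "_version_"}


def field_types(response_text):
    """Map each field name in a /schema/fields response to its field type."""
    return {f["name"]: f["type"] for f in json.loads(response_text)["fields"]}


def auto_added(types):
    """Names of fields that were not declared up front, sorted alphabetically."""
    return sorted(name for name in types if name not in PREDEFINED)


types = field_types(RESPONSE)
for name in auto_added(types):
    print(name, "->", types[name])
```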


Managed schema

•  The schema resource is managed by Solr, rather than hand edited
•  On first startup, Solr auto-converts schema.xml to managed-schema
•  Managed schema format is currently XML, but may change in the future
•  XML comments don’t survive the conversion.
•  mutable=true enables runtime schema modification
    –  Automatic schema field addition
    –  Schema REST API

From example/example-schemaless/solr/collection1/conf/solrconfig.xml:

    <schemaFactory class="ManagedIndexSchemaFactory">
      <bool name="mutable">true</bool>
      <str name="managedSchemaResourceName">managed-schema</str>
    </schemaFactory>

conf/ before startup

    currency.xml
    elevate.xml
    lang/
    protwords.txt
    schema.xml
    solrconfig.xml
    stopwords.txt
    synonyms.txt

conf/ after startup

    currency.xml
    elevate.xml
    lang/
    managed-schema
    protwords.txt
    schema.xml.bak
    solrconfig.xml
    stopwords.txt
    synonyms.txt


Field value class guessing

•  Unknown fields’ String-typed values are speculatively parsed
    –  Cascading parsers attempt to recognize field values
    –  On failure, the next one is tried
    –  First successful parse wins
•  Reconfigurable
    –  Integer parser could be swapped in for the Long parser, etc.
    –  Numeric parsers can take a locale for java.text.NumberFormat
    –  Date parser, implemented using Joda-Time, can be configured with other patterns, a locale, and/or a default time zone

    <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
      <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseDateFieldUpdateProcessorFactory">
        <arr name="format">
          <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
          <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
          <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss</str>
          <str>yyyy-MM-dd'T'HH:mmZ</str>
          <str>yyyy-MM-dd'T'HH:mm</str>
          <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
          <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
          <str>yyyy-MM-dd HH:mm:ss.SSS</str>
          <str>yyyy-MM-dd HH:mm:ss,SSS</str>
          <str>yyyy-MM-dd HH:mm:ssZ</str>
          <str>yyyy-MM-dd HH:mm:ss</str>
          <str>yyyy-MM-dd HH:mmZ</str>
          <str>yyyy-MM-dd HH:mm</str>
          <str>yyyy-MM-dd</str>
        </arr>
      </processor>
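The cascade configured above (boolean, then long, then double, then date, first successful parse wins) can be imitated in Python to make the control flow concrete. This is a rough sketch and not Solr's processors: Python's `int`, `float`, and `strptime` accept slightly different inputs than the Java parsers, and the three date formats are an assumed subset of the patterns listed in the chain.

```python
from datetime import datetime


def parse_boolean(text):
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(text)


def parse_long(text):
    return int(text.strip())  # raises ValueError for "7.95" or "Jhereg"


def parse_double(text):
    return float(text.strip())  # raises ValueError for non-numeric text


def parse_date(text, formats=("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")):
    # Assumed subset of the configured patterns, in strptime notation.
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(text)


# Order matters: it mirrors the order of the processors in the chain.
PARSERS = (parse_boolean, parse_long, parse_double, parse_date)


def guess_value(text):
    """Return the result of the first parser that accepts text, else text itself."""
    for parser in PARSERS:
        try:
            return parser(text)
        except ValueError:
            continue  # on failure, try the next parser
    return text  # nothing matched: the value stays a string


for raw in ("false", "1", "7.95", "2014-03-08", "Jhereg"):
    print(repr(raw), "->", type(guess_value(raw)).__name__)
```

Swapping the order of `PARSERS`, or replacing one entry, corresponds to the "Reconfigurable" bullet: the cascade itself does not care which parsers it holds.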


Automatic schema field addition

•  Field value classes are mapped to field types
•  First match wins
•  If none of the typeMapping-s match, the default field type is assigned
•  If a multi-valued field contains a mix of value classes, the first mapping that matches all values’ classes wins
•  The new field is added to the schema with the mapped field type
•  Reconfigurable

    <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">text_general</str>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Boolean</str>
        <str name="fieldType">booleans</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.util.Date</str>
        <str name="fieldType">tdates</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Long</str>
        <str name="valueClass">java.lang.Integer</str>
        <str name="fieldType">tlongs</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Number</str>
        <str name="fieldType">tdoubles</str>
      </lst>
    </processor>
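The mapping rule stated above (the first typeMapping that covers the classes of all values wins, otherwise the default type) can be sketched in Python. Python types stand in for the Java value classes here, which is an assumption of the sketch: `int` plays the role of Long/Integer and `numbers.Number` the role of `java.lang.Number`. Because Python's `bool` is a subclass of `int`, the helper `matches` treats booleans specially to stay closer to the Java class hierarchy.

```python
from datetime import datetime
from numbers import Number

# Same order as the typeMapping entries in the configuration above.
TYPE_MAPPINGS = [
    ((bool,), "booleans"),
    ((datetime,), "tdates"),
    ((int,), "tlongs"),       # stands in for java.lang.Long and java.lang.Integer
    ((Number,), "tdoubles"),  # stands in for java.lang.Number
]
DEFAULT_FIELD_TYPE = "text_general"


def matches(value, classes):
    # Unlike java.lang.Boolean, Python's bool subclasses int, so a boolean
    # is only allowed to match a mapping that lists bool explicitly.
    if isinstance(value, bool):
        return bool in classes
    return isinstance(value, classes)


def field_type_for(values):
    """Pick the first mapping whose classes cover every value, else the default."""
    for classes, field_type in TYPE_MAPPINGS:
        if all(matches(v, classes) for v in values):
            return field_type
    return DEFAULT_FIELD_TYPE


print(field_type_for([1, 2]))      # all integers
print(field_type_for([1, 2.5]))    # mixed integer and floating point
print(field_type_for(["Jhereg"]))  # no mapping matches
```

A multi-valued field holding `1` and `2.5` falls through the integer mapping and lands on the more general numeric one, which is the behavior the fourth bullet describes.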


Schemaless mode limitations

•  Automatically adding new schema fields in production may not be a good idea
    –  Unwanted fields, e.g. field name typos, won’t trigger an error
•  First instance wins: field type detection can’t know about the full range of a field’s values
•  Wasted space: e.g. Longs are always used, when Integers might suffice
•  Limited gamut of detectable field types
•  Single analysis specification for text fields
•  Single processing model for all fields


Schema REST API


Schema REST API: read-only

•  Each element of the schema is individually readable via the Schema REST API
•  Output format can be JSON or XML (wt request param)
•  Read-only elements:
    –  The entire schema
        •  In addition to JSON and XML output formats, output can also be in schema.xml format (?wt=schema.xml)
    –  All fields, or a specified set of them
    –  All dynamic fields, or a specified set of them
    –  All field types, or a specific one
    –  All copy field directives
    –  The schema name, version, uniqueKey, and default query operator
    –  The global similarity
•  Managed schema is not required to use the read-only schema REST API.


Schema REST API: read-only examples

    $ SOLR=http://localhost:8983/solr/collection1

    $ curl $SOLR/schema/dynamicfields/*_i

    {
      "responseHeader":{
        "status":0,
        "QTime":1},
      "dynamicField":{
        "name":"*_i",
        "type":"int",
        "indexed":true,
        "stored":true}}

    $ curl $SOLR/schema/uniquekey?wt=xml

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
    </lst>
    <str name="uniqueKey">id</str>
    </response>

•  Schema REST API URLs employ the downcased form of all schema elements, but the responses use the same casing as schema.xml.

•  For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
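The downcasing rule can be captured in a small URL builder. This is a sketch only: the function name `schema_url` and its parameters are made up for illustration, and it covers just the URL shapes that appear in the examples above (an element, an optional name, an optional `wt` parameter).

```python
from urllib.parse import quote, urlencode


def schema_url(base, element, name=None, wt=None):
    """Build a read-only Schema REST API URL.

    element is written as in schema.xml (e.g. "dynamicFields", "uniqueKey")
    and is downcased for the URL, as the slide describes.
    """
    url = f"{base}/schema/{element.lower()}"
    if name is not None:
        url += "/" + quote(name, safe="*")  # keep the glob star readable
    if wt is not None:
        url += "?" + urlencode({"wt": wt})
    return url


SOLR = "http://localhost:8983/solr/collection1"
print(schema_url(SOLR, "dynamicFields", "*_i"))
print(schema_url(SOLR, "uniqueKey", wt="xml"))
```

The two printed URLs are the ones passed to curl in the examples, so the builder can be checked against a running instance by fetching them.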


Schema REST API: runtime schema modification

•  To enable schema modification via the schema REST API, the schema must be managed, and must be configured as mutable.
•  Schema modifications possible as of Solr 4.4:
    –  Fields may be added
        •  Copy field directives may optionally be added at the same time
    –  Copy field directives may be added
•  Works under both standalone Solr and SolrCloud
    –  Under SolrCloud, conflicting simultaneous requests are detected using a form of optimistic concurrency and automatically retried
•  Core/collection reload not required for schema modifications that are compatible with previously indexed documents
    –  Generally additions are not sources of schema incompatibility
•  Schema incompatibility-inducing operations will require core/collection reload:
    –  Modifying or removing (dynamic) fields or copy field directives
    –  Modifying all other schema elements
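The slide does not say how the optimistic concurrency check is implemented, so the following is a generic sketch of the pattern rather than Solr's code: read a version together with the data, write only if the version is unchanged, and retry on conflict. The classes `VersionedSchema` and `VersionConflict` and the function `add_field` are hypothetical names for an in-memory stand-in.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the one that was read."""


class VersionedSchema:
    """In-memory stand-in for a shared schema with a version counter."""

    def __init__(self):
        self.version = 0
        self.fields = {}

    def read(self):
        return self.version, dict(self.fields)

    def write_if_unchanged(self, expected_version, fields):
        if expected_version != self.version:
            raise VersionConflict(f"expected {expected_version}, found {self.version}")
        self.fields = dict(fields)
        self.version += 1


def add_field(store, name, field_type, max_retries=5):
    """Add a field, retrying when another writer got in first."""
    for _ in range(max_retries):
        version, fields = store.read()
        if name in fields:
            raise ValueError(f"field {name!r} already exists")
        fields[name] = field_type
        try:
            store.write_if_unchanged(version, fields)
            return store.version
        except VersionConflict:
            continue  # someone else modified the schema: re-read and try again
    raise RuntimeError(f"gave up adding {name!r} after {max_retries} attempts")


store = VersionedSchema()
print(add_field(store, "claimid", "string"))
```

No lock is held between the read and the write; a conflicting writer only costs the loser one extra round trip, which suits a resource that changes rarely.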


Schema REST API: add field example

    $ SOLR=http://localhost:8983/solr/collection1

    $ curl $SOLR/schema/fields/claimid -X PUT -H 'Content-type: application/json' --data-binary '
    {
      "type":"string",
      "stored":true,
      "copyFields": [
        "claims",
        "all"
      ]
    }'

•  The copyField destinations “claims” and “all” must already exist in the schema.
•  For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
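The same request can be prepared from Python with the standard library. The sketch below only builds the request object, so it runs without a server; the helper name `add_field_request` is made up for this example, and the URL, method, header, and body mirror the curl command above.

```python
import json
from urllib.request import Request


def add_field_request(base, name, field_type, stored=True, copy_fields=()):
    """Build (but do not send) a PUT request that adds one field to the schema."""
    body = {"type": field_type, "stored": stored}
    if copy_fields:
        body["copyFields"] = list(copy_fields)
    return Request(
        f"{base}/schema/fields/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


request = add_field_request(
    "http://localhost:8983/solr/collection1",
    "claimid",
    "string",
    copy_fields=["claims", "all"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a running Solr with a mutable managed schema, passing `request` to `urllib.request.urlopen` would send it; that step is left out here because it needs a live server.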


Schema REST API TODOs

•  https://issues.apache.org/jira/browse/SOLR-4898 is the umbrella JIRA issue under which further schema REST API work will be done, including:
    –  adding dynamic fields
    –  adding field types
    –  enabling wholesale replacement by PUTing a new schema.
    –  modifying and removing fields, dynamic fields, field types, and copy field directives
    –  modifying all remaining aspects of the schema: Name, Version, Unique Key, Global Similarity, and Default Query Operator


Proposal: Schema Annotations

•  Add arbitrary metadata at the top level of the schema and at each leaf node
•  Allow read/write access to that metadata via the REST API.
•  Use cases:
    –  Round-trippable documentation
        •  Conversion to managed schema format drops all comments
    –  Documentable tags
    –  When modifying the schema via REST API, a "last-modified" annotation could be automatically added.
    –  User-level arbitrary key/value metadata
•  W3C XML Schema has a similar facility: http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-annotation


Schema Annotation example

    <schema name="example" version="1.5">
      <annotation>
        <description element="tag"
                     content="plain-numeric-field-types">
          Plain numeric field types store and index the
          text value verbatim.
        </description>
        <documentation element="copyField">
          copyField commands copy one field to another at
          the time a document is added to the index.  It's
          used either to index the same field differently,
          or to add multiple fields to the same field for
          easier/faster searching.
        </documentation>
        <last-modified>2014-03-08T12:14:02Z</last-modified>
        …
      </annotation>
      …

      <fieldType name="pint" class="solr.IntField">
        <annotation>
          <tag>plain-numeric-field-types</tag>
        </annotation>
      </fieldType>
      <fieldType name="plong" class="solr.LongField">
        <annotation>
          <tag>plain-numeric-field-types</tag>
        </annotation>
      </fieldType>
      …
      <copyField source="cat" dest="text">
        <annotation>
          <todo>Copy to the catchall field?</todo>
        </annotation>
      </copyField>
      …
      <field name="text" type="text_general">
        <annotation>
          <description>catchall field</description>
          <visibility>public</visibility>
        </annotation>
      </field>
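Since annotations are only a proposal, there is no API to demonstrate. The sketch below merely shows how a client might consume the proposed `<tag>` annotations if a schema shaped like the example above were available as XML. The inlined schema is a trimmed, well-formed copy of the example, and the function name `elements_by_tag` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the proposed format from the slide (elided parts left out).
SCHEMA = """
<schema name="example" version="1.5">
  <fieldType name="pint" class="solr.IntField">
    <annotation><tag>plain-numeric-field-types</tag></annotation>
  </fieldType>
  <fieldType name="plong" class="solr.LongField">
    <annotation><tag>plain-numeric-field-types</tag></annotation>
  </fieldType>
  <field name="text" type="text_general">
    <annotation>
      <description>catchall field</description>
      <visibility>public</visibility>
    </annotation>
  </field>
</schema>
"""


def elements_by_tag(schema_xml):
    """Group the names of top-level schema elements by their annotation tags."""
    groups = defaultdict(list)
    root = ET.fromstring(schema_xml)
    for element in root:
        for tag in element.findall("annotation/tag"):
            groups[tag.text.strip()].append(element.get("name"))
    return dict(groups)


print(elements_by_tag(SCHEMA))
```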


Summary

•  Schemaless Solr mode enables quick prototyping with minimal setup
•  Schema REST API provides programmatic read/write access to Solr’s schema
    •  More elements writeable soon
•  Schema annotations would enable round-trippable documentation, tagging, and arbitrary user-provided metadata