
stargate Documentation

Release 1.1.0

Tuplejump

February 29, 2016

Contents

1 What is Stargate
    1.1 Stargate-core Features

2 Installation
    2.1 Install from binaries
    2.2 Install from source
    2.3 Important Note on Shutdown procedure
    2.4 Development usage

3 Quick start
    3.1 Pre-requisites
    3.2 Creating a Row Index
    3.3 Querying a Row Index

4 Indexing and querying JSON
    4.1 Prerequisites
    4.2 Creating an index on JSON fields
    4.3 Querying JSON

5 Index Configuration
    5.1 Index Creation Options
    5.2 Fields
    5.3 Datatypes
    5.4 JSON indexing
    5.5 CQL collections
    5.6 Tokenization
    5.7 Analyzers
    5.8 Out of box Analyzers with Stargate
    5.9 Custom Analyzers
    5.10 Norms
    5.11 Index Options
    5.12 Numeric field precision
    5.13 Striping/Sorting

6 Queries
    6.1 Query and filter options
    6.2 Types of queries
        6.2.1 Lucene
        6.2.2 Match
        6.2.3 Phrase
        6.2.4 Fuzzy
        6.2.5 Prefix
        6.2.6 Range
        6.2.7 Regex
        6.2.8 Wildcards
    6.3 Combining conditions
    6.4 Sort
    6.5 Sorting across partitions
    6.6 Pagination


CHAPTER 1

What is Stargate

Stargate is made of 2 components.

• Stargate-core - To add Lucene indexing support in Cassandra. (See installation.)

• Stargate-search - A search server like Solr/ElasticSearch (Work in progress.)

1.1 Stargate-core Features

1. Add Lucene-based row indices to Cassandra CQL tables.

2. Index and query JSON data directly.

3. Index CQL maps, sets and lists.

4. Query, filter and sort based on fields in the row index.

5. Specify different data types and analyzers for Lucene analysis and querying.

6. Use a variety of Lucene queries such as match, range, phrase, wildcard, regex, fuzzy, prefix and more.


CHAPTER 2

Installation

Stargate-core is currently tested on Cassandra 2.1.12 and 2.1.13.

2.1 Install from binaries

• Extract the archives downloaded from the download link.

• Copy the jars from the lib folder of the extracted archive into the lib folder of your Cassandra installation and you are good to go.

2.2 Install from source

• Prerequisites - Java 1.8, Cassandra 2.1.10/11, Gradle.

• Checkout the master branch in the git-repo.

• Run ‘gradle jar’ in the stargate-core directory. This will create the required libraries in build/libs.

• Drop the libraries into your cassandra installation lib folder and you are good to go.

2.3 Important Note on Shutdown procedure

Warning: For a Stargate-enabled Cassandra node, shut down Cassandra using kill -15 cassandra-pid. Alternatively, request a flush from nodetool and then shut down Cassandra using kill -9. Cassandra can be shut down with kill -9 cassandra-pid alone, but some writes to the index may not be flushed when using this method. This is usually fine when the node fails by itself: when a node fails, it is simpler to purge it and replace it back into the cluster.

Stargate flushes indexes periodically, whenever you request a flush, and through a shutdown hook. All writes to the index are guaranteed to be flushed only when you explicitly call flush or shut down using kill -15 (kill -9 does not run shutdown hooks on the JVM). Otherwise, some writes to the index may be lost.



2.4 Development usage

Stargate-core is in Maven central


CHAPTER 3

Quick start

3.1 Pre-requisites

Install Stargate as instructed in the installation.

Open cassandra/bin/cqlsh and optionally create a keyspace:

CREATE KEYSPACE my_keyspace WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};

Change into your keyspace:

USE MY_KEYSPACE;

Let us create a table named PERSON like so:

CREATE TABLE PERSON (
  id int primary key,
  isActive boolean,
  age int,
  eyeColor varchar,
  name text,
  gender varchar,
  company varchar,
  email varchar,
  phone varchar,
  address text,
  stargate text
);

3.2 Creating a Row Index

A row index with name ‘person_idx’ can be created on a table named ‘PERSON’ like so:

CREATE CUSTOM INDEX person_idx ON PERSON(stargate)
USING 'com.tuplejump.stargate.RowIndex'
WITH options = {
  'sg_options':'{
    "fields":{
      "age":{},
      "eyeColor":{},
      "name":{},
      "gender":{},
      "company":{},
      "phone":{},
      "address":{}
    }
  }'
};

Note:

• You create an index on a meta column of CQL type text. The column name can be anything.

• The meta column should be left empty. While inserting data, it is left out in the values list.

• The meta column is used to return any meta information about the search, such as score (relevance), positions for highlighting, etc.

• You specify options as WITH options={'sg_options':'<json>'}.

• The sg_options string has to be valid JSON. Note the single quotes and double quotes.

• You always need to create the RowIndex on a column with a string CQL datatype, i.e. CQL type varchar, ascii or text.

• The columns which need to be indexed are specified in the 'fields' object.

The above statement will create a row index on the table PERSON and will index the specified columns with an appropriate data type mapping derived from the Cassandra data type. More details about this can be found in the Index Options section.
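Because sg_options is a JSON document nested inside a single-quoted CQL string, quoting mistakes are easy to make by hand. The following is a minimal Python sketch of one way to generate the statement. The helper name create_row_index_cql is hypothetical and not part of Stargate; only the index class name and the 'fields' layout come from the statement above.

```python
import json


def create_row_index_cql(index_name, table, meta_column, fields):
    """Build a CREATE CUSTOM INDEX statement for a row index (hypothetical helper)."""
    # json.dumps guarantees that the mapping is valid JSON with double quotes.
    sg_options = json.dumps({"fields": fields})
    # CQL string literals are single-quoted; a literal single quote is doubled.
    escaped = sg_options.replace("'", "''")
    return (
        f"CREATE CUSTOM INDEX {index_name} ON {table}({meta_column}) "
        "USING 'com.tuplejump.stargate.RowIndex' "
        f"WITH options = {{'sg_options': '{escaped}'}};"
    )


columns = ["age", "eyeColor", "name", "gender", "company", "phone", "address"]
statement = create_row_index_cql(
    "person_idx", "PERSON", "stargate", {column: {} for column in columns}
)
print(statement)
```

An empty object per column keeps the defaults derived from the CQL type; per-field properties can be passed in the same dictionary.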

Now go ahead and insert data like so:

INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(1,false,29,'green','Davidson Hurst','male','TALKOLA','[email protected]','+1 (950) 405-2257','691 Hampton Place, Felt, North Carolina, 8466');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(2,false,27,'black','Maxwell Kemp','male','AMTAP','[email protected]','+1 (800) 495-3822','466 Kenilworth Place, Fivepointville, Maryland, 6240');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(3,false,25,'black','Cecelia Cain','female','MAINELAND','[email protected]','+1 (874) 590-2058','644 Broome Street, Rutherford, Delaware, 6271');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(4,true,28,'green','Morse Sanders','male','APEX','[email protected]','+1 (857) 427-3391','786 Division Avenue, Rose, Rhode Island, 4217');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(5,true,25,'black','Fernandez Morse','male','OPTICALL','[email protected]','+1 (911) 442-2649','116 Suydam Place, Libertytown, Michigan, 2257');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(6,false,27,'brown','Ryan Ross','male','ZAPHIRE','[email protected]','+1 (843) 423-2420','804 Erskine Loop, Robinette, Marshall Islands, 9161');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(7,true,34,'brown','Avis Mosley','female','TETRATREX','[email protected]','+1 (883) 461-3832','391 Heyward Street, Hayes, Alabama, 5934');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(8,false,29,'black','Juana Ewing','female','REPETWIRE','[email protected]','+1 (809) 410-2791','510 Lake Avenue, Austinburg, Virgin Islands, 2964');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(9,false,33,'brown','Edwards Patton','male','MANGELICA','[email protected]','+1 (977) 508-2935','131 Stone Avenue, Cucumber, Minnesota, 4601');
INSERT INTO PERSON(id,isActive,age,eyeColor,name,gender,company,email,phone,address) VALUES(10,false,38,'blue','Weaver Carson','male','ISOLOGIX','[email protected]','+1 (916) 566-2681','560 Hanson Place, Gardners, Puerto Rico, 7821');

Once you have done that, you are now ready to query.

3.3 Querying a Row Index

Here is a list of quick queries that can be made using the default options. For more information on queries, read the Queries section.

-- select all people with age more than 30
SELECT * FROM PERSON WHERE stargate = '{
  filter: {
    type: "range",
    field: "age",
    lower: 30
  }
}';

-- select all people with age more than 30 and less than 35
SELECT name,age,email FROM PERSON WHERE stargate = '{
  filter: {
    type: "range",
    field: "age",
    lower: 30,
    upper: 35
  }
}';

-- get the person called Avis
SELECT * FROM PERSON WHERE stargate = '{
  filter: {
    type: "match",
    field: "name",
    value: "Avis"
  }
}';

-- find people living on a street
SELECT * FROM PERSON WHERE stargate = '{
  filter: {
    type: "match",
    field: "address",
    value: "street"
  }
}';

-- find people whose name starts with m
SELECT * FROM PERSON WHERE stargate = '{
  filter: {
    type: "wildcard",
    field: "name",
    value: "m*"
  }
}';

-- find companies starting with a
SELECT * FROM PERSON WHERE stargate = '{
  filter: {
    type: "prefix",
    field: "company",
    value: "a"
  }
}';

-- find companies from 'a' to 'p' and sort by name in reverse order
SELECT name,company FROM PERSON WHERE stargate = '{
  filter: {
    type: "range",
    field: "company",
    lower: "a",
    upper: "p"
  },
  sort: {
    fields: [{field: "name", reverse: true}]
  }
}';

-- find people whose name starts with m and who belong to companies starting with a
SELECT * FROM PERSON WHERE stargate = '{
  filter: {
    type: "boolean",
    must: [
      {
        type: "wildcard",
        field: "name",
        value: "m*"
      },
      {
        type: "prefix",
        field: "company",
        value: "a"
      }
    ]
  }
}';
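When queries are built by an application rather than typed into cqlsh, the search JSON can be generated and embedded into the CQL string programmatically. The sketch below is one possible approach in Python. The helper stargate_select is hypothetical, and it assumes Stargate accepts strictly quoted JSON keys as well as the unquoted keys used in the examples above.

```python
import json


def stargate_select(table, meta_column, search, columns="*"):
    """Return a CQL SELECT whose meta-column condition carries the search JSON."""
    # Double any single quote so the JSON survives inside a CQL string literal.
    payload = json.dumps(search).replace("'", "''")
    return f"SELECT {columns} FROM {table} WHERE {meta_column} = '{payload}';"


# People older than 30 and younger than 35, as in the second quick query.
age_filter = {"filter": {"type": "range", "field": "age", "lower": 30, "upper": 35}}
query = stargate_select("PERSON", "stargate", age_filter, columns="name,age,email")
print(query)
```

Doubling single quotes matters as soon as a search value contains an apostrophe, which would otherwise terminate the CQL string early.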


CHAPTER 4

Indexing and querying JSON

4.1 Prerequisites

Install Stargate as instructed in the installation.

Open cassandra/bin/cqlsh and optionally create a keyspace:

CREATE KEYSPACE my_keyspace WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};

Change into your keyspace:

USE MY_KEYSPACE;

Let us create a table named PERSON_JSON like so:

CREATE TABLE PERSON_JSON (
  id int primary key,
  json text,
  stargate text
);

4.2 Creating an index on JSON fields

Apart from the key and the meta column, the above table has just one column, 'json', which is of CQL type text. We can push valid JSON into this column. To index this JSON, we create an index on the 'stargate' meta column like so:

CREATE CUSTOM INDEX json_idx ON PERSON_JSON(stargate)
USING 'com.tuplejump.stargate.RowIndex'
WITH options = {
  'sg_options':'{
    "fields":{
      "json":{
        "type":"object"
      }
    }
  }'
};


Note:

• You create an index on a meta column of CQL type text. The column name can be anything.

• The JSON column should be of CQL type text.

• In sg_options, you specify the type of the column as ‘object’. This indicates that the object is a json.

Suppose you insert data into the table like so

INSERT INTO PERSON_JSON (id,json) values (1,'{
  "age": 40,
  "eyeColor": "green",
  "name": "Casey Stone",
  "gender": "female",
  "company": "EXODOC",
  "address": "760 Gold Street, Choctaw, Iowa, 3595",
  "registered": "2014-03-30T18:24:14 -06:-30",
  "latitude": 30.904815,
  "longitude": 169.113457,
  "tags": ["idiot","fool","bad"],
  "friends": [
    {"name": "Casey Stone"},
    {"name": "Clark Wise"},
    {"name": "Letitia Holder"}
  ]
}');

INSERT INTO PERSON_JSON (id,json) values (2,'{
  "age": 20,
  "eyeColor": "brown",
  "name": "Selma Nelson",
  "gender": "female",
  "company": "WAAB",
  "address": "421 Dictum Court, Deltaville, Hawaii, 5115",
  "registered": "2014-05-13T23:42:48 -06:-30",
  "latitude": 88.721567,
  "longitude": -77.946054,
  "tags": ["good","nice","cool"],
  "friends": [
    {"name": "Sweet Chambers"},
    {"name": "Cantor Wise"}
  ]
}');

INSERT INTO PERSON_JSON (id,json) values (3,'{
  "age": 37,
  "eyeColor": "brown",
  "name": "Powers Brown",
  "gender": "male",
  "company": "EXODOC",
  "address": "527 Beard Street, Springhill, Iowa, 4189",
  "registered": "2014-05-15T01:38:29 -06:-30",
  "latitude": 11.414768,
  "longitude": -97.106062,
  "tags": ["bad","ugly","yuck"],
  "friends": [
    {"name": "Anthony Vaughan"},
    {"name": "Sweet Chambers"},
    {"name": "Cantor Hunt"}
  ]
}');

INSERT INTO PERSON_JSON (id,json) values (4,'{
  "age": 34,
  "eyeColor": "blue",
  "name": "Mercer Roberts",
  "gender": "male",
  "company": "BEDDER",
  "address": "496 Thornton Street, Gwynn, Maine, 3535",
  "registered": "2014-02-21T09:08:57 -06:-30",
  "latitude": -59.376042,
  "longitude": 68.532665,
  "tags": ["friendly","nice","cool"],
  "friends": [
    {"name": "Wooten Daugherty"},
    {"name": "Robyn Wynn"}
  ]
}');

INSERT INTO PERSON_JSON (id,json) values (5,'{
  "age": 35,
  "eyeColor": "blue",
  "name": "Avila Quinn",
  "gender": "male",
  "company": "BEDDER",
  "address": "682 Beadel Street, Cawood, Arkansas, 9088",
  "registered": "2014-01-15T13:07:00 -06:-30",
  "latitude": -21.666006,
  "longitude": 137.589547,
  "tags": ["good","bad","ugly"],
  "friends": [
    {"name": "Patty Salas"},
    {"name": "Clark Wise"},
    {"name": "Casey Stone"}
  ]
}');

Note:

• In the above data, all JSON fields become searchable as top level index fields. For example, 'age' in the JSON becomes searchable as 'age' in the index.

• Nested fields become searchable top level fields with a ‘parent.child’ notation.

• For example, ‘name’ in ‘friends’ becomes searchable as ‘friends.name’.
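The 'parent.child' naming described in the notes above can be pictured with a small Python sketch that walks a parsed JSON document and collects the dotted names under which its leaf values would be searchable. This only illustrates the naming rule; it is not Stargate's implementation, and the function name indexed_field_names is hypothetical.

```python
def indexed_field_names(value, prefix=""):
    """Collect dotted field names for every leaf value of a parsed JSON document."""
    names = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            names |= indexed_field_names(child, path)
    elif isinstance(value, list):
        # Array elements share the name of the field that holds the array.
        for item in value:
            names |= indexed_field_names(item, prefix)
    elif prefix:
        names.add(prefix)
    return names


person = {
    "age": 35,
    "tags": ["good", "bad", "ugly"],
    "friends": [{"name": "Patty Salas"}, {"name": "Clark Wise"}],
}
print(sorted(indexed_field_names(person)))  # ['age', 'friends.name', 'tags']
```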

4.3 Querying JSON

With the index created and with the data inserted as above you can make basic queries such as these:

-- find a person with name Avila
SELECT * FROM PERSON_JSON WHERE stargate = '{
  query: {
    type: "match",
    field: "name",
    value: "Avila"
  }
}';

-- find people with a friend called Patty
SELECT * FROM PERSON_JSON WHERE stargate = '{
  query: {
    type: "match",
    field: "friends.name",
    value: "Patty"
  }
}';

-- find people who have been tagged as good
SELECT * FROM PERSON_JSON WHERE stargate = '{
  query: {
    type: "match",
    field: "tags",
    value: "good"
  }
}';

However, the following query would not work:

-- find people with age 35
-- this won't work until you change the mapping.
SELECT * FROM PERSON_JSON WHERE stargate = '{
  query: {
    type: "match",
    field: "age",
    value: 35
  }
}';

This is because, although Stargate indexes numeric fields as numeric, at query time it does not know that the field needs to be queried numerically. So, you change the mapping as follows:

DROP INDEX json_idx;

CREATE CUSTOM INDEX json_idx ON PERSON_JSON(stargate)
USING 'com.tuplejump.stargate.RowIndex'
WITH options = {
  'sg_options':'{
    "fields":{
      "json":{
        "type":"object",
        "fields":{
          "age":{"type":"integer"}
        }
      }
    }
  }'
};

This mapping tells Stargate that the field needs to be queried as an integer. Now the above query will work as expected.

For more details on configuration, read the Index Options section.




CHAPTER 5

Index Configuration

5.1 Index Creation Options

An index is created with field mapping with the following syntax:

CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
ON <table_name> ( <meta_column> )
USING 'com.tuplejump.stargate.RowIndex'
WITH OPTIONS = {
  'sg_options':'<sg_options>'
}

where

• <index_name> specifies the name of the index.

• <table_name> is the name of the table on which you want to create a row index.

• <sg_options> needs to be valid JSON as specified below.

The property 'sg_options' is a JSON object describing the properties used to index columns in a CQL table. The root of this JSON object has various properties which act as defaults in the absence of per-field properties. The root also has another JSON object field called 'fields', which lists the names of the columns to index and their corresponding properties.

Here are the various properties:

<sg_options> := {
  type : <datatype>,
  analyzer : <analyzer>,
  tokenized : <tokenized>,
  omitNorms : <omitNorms>,
  striped : <striping>,
  indexOptions : <indexOptions>,
  numericPrecisionStep : <numericPrecisionStep>,
  fields : <sg_options>
}



5.2 Fields

The fields property is used to specify properties for data types with nesting, i.e. object (used to index JSON) and map (used to index CQL maps).

5.3 Datatypes

<datatype> indicates the data type of the indexed column. It is derived from the CQL type of the Cassandra column but may be overridden if specified explicitly. Datatypes give default behaviours for other properties of a field. The following are the available data types.

Data type     Behavior                           CQL type
object        JSON type. Behaviour per field     JSON in CQL type text. Field has nested fields
map           Key and value behaviours           CQL map type
text          Standard analyzer, tokenized       CQL ascii/text type
string        Keyword analyzer, not tokenized    CQL varchar type
integer       Keyword analyzer, not tokenized    CQL int type
bigint        Keyword analyzer, not tokenized    CQL bigint type
decimal       Keyword analyzer, not tokenized    CQL float type
bigdecimal    Keyword analyzer, not tokenized    CQL double/decimal type
date          Keyword analyzer, not tokenized    Field should be parsed as Date
bool          Keyword analyzer, not tokenized    CQL boolean type

5.4 JSON indexing

A data type of 'object' indicates that the CQL column (of type text) will contain JSON. Each field in the JSON will be indexed and queried separately. Nested field properties may be specified using 'parentname.childname' notation. For more details on using this, refer to the Indexing and querying JSON section.

5.5 CQL collections

The CQL set and list data types by default use the same type as that derived from the type of the collection's elements. Specifying properties for sets and lists is therefore done in the same way as for regular fields.

Map types have 2 or 3 indexed fields per entry, depending on the type of the map key. The key and value types are by default derived from the key and value CQL types. For map types with a non-tokenized key type (see the table above), an additional field with the key string value as 'name' and the value as 'value' is added to the Lucene document. The properties of the key are set using 'colname.key' notation. Similarly, 'colname.value' notation is used for the value.

5.6 Tokenization

<tokenized> default: as described above in the table.

This splits your text into chunks. Since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, KeywordAnalyzer doesn't split the text at all and takes the whole field as a single token. In contrast, StandardAnalyzer (and most other analyzers) use spaces and punctuation as split points. For example, for the phrase "I am very happy", it will produce a list ("i", "am", "very", "happy") or something like that. The text type by default inherits tokenized behaviour. This behaviour can be overridden by using the <tokenized> property.
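The difference between the two behaviours can be mimicked in a few lines of Python. This is only a rough approximation for illustration: the real Lucene analyzers apply more elaborate rules, and the function names below are hypothetical.

```python
import re


def keyword_tokens(text):
    """Keyword-style analysis: the whole field value is a single token."""
    return [text]


def standard_like_tokens(text):
    """Rough stand-in for standard analysis: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


print(keyword_tokens("I am very happy"))        # ['I am very happy']
print(standard_like_tokens("I am very happy"))  # ['i', 'am', 'very', 'happy']
```

A match on a tokenized field can therefore hit a single word such as "happy", while a non-tokenized field only matches the complete value.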

5.7 Analyzers

<analyzer> default: as described above in the table.

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. For more information on Lucene analyzers, read the Lucene docs.

5.8 Out of box Analyzers with Stargate

The following are the out of box Lucene analyzers provided with Stargate. They can be used by specifying the <analyzer> property. Read the Lucene docs for an explanation of each of them.

• StandardAnalyzer

• WhitespaceAnalyzer

• SimpleAnalyzer

• KeywordAnalyzer

5.9 Custom Analyzers

Custom analyzers may be specified using the fully qualified class name. Lucene 5.5 custom analyzers are required.

5.10 Norms

<omitNorms> default:true

Norms allow index-time boosts and field length normalization. This allows you to add boosts to fields at index time and makes shorter documents score higher. This may not be useful for short or non-full-text fields. Norms are stored in the index as a byte value per document per field. When norms are loaded into an IndexReader, they are loaded into a byte[maxdoc] array for each field, so even if only one document out of 400 million has a field, byte[maxdoc] is still loaded for that field, potentially using a lot of RAM. Consider turning norms off for certain fields, especially if you have a large number of fields in the index. Any field that is very short (i.e. not really a full-text field: ids, names, keywords, etc.) is a great candidate. For a large index, you might have to make some hard decisions and turn off norms for key full-text fields as well. As an example of how much RAM is involved, one field in a 10 million document index will take up just under 10 MB of RAM. One hundred such fields will take nearly a gigabyte of RAM. You can omit norms using the <omitNorms> property.
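The RAM figures quoted above follow directly from the one-byte-per-document-per-field layout. A short Python check of the arithmetic (the function name norms_bytes is just for illustration):

```python
def norms_bytes(max_doc, fields_with_norms):
    """One byte per document per field, regardless of how many documents use the field."""
    return max_doc * fields_with_norms


one_field = norms_bytes(10_000_000, 1)
hundred_fields = norms_bytes(10_000_000, 100)

print(f"{one_field / 2**20:.2f} MiB")       # 9.54 MiB
print(f"{hundred_fields / 2**30:.2f} GiB")  # 0.93 GiB
```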

5.11 Index Options

<indexOptions> default:DOCS

This controls how much information is stored in the postings lists of the Lucene index. For a detailed explanation, refer to the Lucene documentation. The available options are:

Option                                      Description
DOCS_AND_FREQS                              Only documents and term frequencies are indexed: positions are omitted.
DOCS_AND_FREQS_AND_POSITIONS                Indexes documents, frequencies and positions.
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS    Indexes documents, frequencies, positions and offsets.
DOCS                                        Only documents are indexed: term frequencies and positions are omitted.

5.12 Numeric field precision

<numericPrecisionStep> default:4

Read lucene docs for explanation.

5.13 Striping/Sorting

<striped> default: none

Other options: also, only

This controls whether the index value is stored in a striped/columnar fashion using Lucene doc values. Sortable fields need to be stored in this fashion. For any field which requires sorting, use "also" (indicating that a doc value field is stored in Lucene along with indexing the field) or "only" (indicating that only a doc value field is stored in Lucene) as the option.

CHAPTER 6

Queries

A query is given as follows:

SELECT <selection> FROM <table> WHERE <meta-column> = '{<query-parts>}'

A query has 3 parts: query, filter and sort.

{
  query: {<query-options>},
  filter: {<query-options>},
  sort: {<sort-options>}
}

A query and a filter have the same options, i.e. <query-options>. A query takes part in score calculation. The scores are output in the meta column of the result set. A filter simply filters the rows passing the conditions. By default, the rows of the result set are ordered by score relevance. If you want the result sorted differently, the sorting order may be specified in <sort-options>.

6.1 Query and filter options

A query or filter is as follows:

{
  type: <type>,
  <property> : <property-value>
}

6.2 Types of queries

6.2.1 Lucene

A query using the lucene standard query parser syntax.

Datatypes supported

• All

Properties

• type: lucene

• field: The default field on which this Lucene query is being made.

• value: The Lucene query using query parser syntax.

6.2.2 Match

A query to match a value.

Datatypes supported

• All

Properties

• type: match

• field: The name of the field whose value has to match.

• value: The value of the field to match.

6.2.3 Phrase

Various values forming a phrase with a slop.

Datatypes supported

• text

Properties

• type: phrase

• field: The name of the field whose value has to match.

• value: The list of values of the phrase.

• slop: How many words can be skipped between the words in the phrase.

6.2.4 Fuzzy

Fuzzy searches based on the Levenshtein Distance.

Note:

• For fuzzy queries, the index needs to store term vectors with positions.

• Hence, while creating the index, the index options need to be specified as DOCS_AND_FREQS_AND_POSITIONS.

Datatypes supported

• text

• string

Properties

• type: fuzzy

• field: The name of the field whose value has to match.

• value: The value of the field to match.

• maxEdits: (default = 2) Value between 0 and 2 (the Levenshtein automaton maximum supported distance).

• prefixLength: (default = 0) Integer representing the length of the common non-fuzzy prefix.

• maxExpansions: (default = 50) An integer for the maximum number of terms to match.
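To make maxEdits concrete, here is a small Python sketch of the classic Levenshtein distance (insertions, deletions and substitutions) and the acceptance test it implies. It illustrates the metric only. Lucene evaluates fuzzy queries with automata rather than by comparing terms pairwise, and the prefixLength and maxExpansions properties are not modelled here.

```python
def levenshtein(a, b):
    """Classic edit distance using a rolling row of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, candidate, max_edits=2):
    """True when candidate is within max_edits edits of term."""
    return levenshtein(term, candidate) <= max_edits


print(levenshtein("avis", "avila"))      # 2
print(fuzzy_match("avis", "avila"))      # True
print(fuzzy_match("kitten", "sitting"))  # False, the distance is 3
```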

6.2.5 Prefix

A query to find values with the passed prefix.

Datatypes supported

• text

• string

Properties

• type: prefix

• field: The name of the field whose value has to start with the prefix.

• value: The prefix that the field value has to start with.

6.2.6 Range

A range of values to match.

Datatypes supported

• All

Properties

• type: range

• field: The field name for which the range is being specified.

• lower: Lower bound of the range. Defaults to the lowest value of the data type.

• includeLower: (default = false) Whether the lower bound is included in the results (>=).

• upper: Upper bound of the range. Defaults to the highest value of the data type.

• includeUpper: (default = false) Whether the upper bound is included in the results (<=).
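Since includeLower and includeUpper default to false, both bounds are exclusive unless stated otherwise. The following Python sketch mirrors those semantics on the ages from the Quick start data. It is a plain-Python illustration of the documented behaviour, not Stargate code, and the function in_range is hypothetical.

```python
def in_range(value, lower=None, upper=None, include_lower=False, include_upper=False):
    """Apply range bounds with the documented defaults (both bounds exclusive)."""
    if lower is not None:
        if value < lower or (value == lower and not include_lower):
            return False
    if upper is not None:
        if value > upper or (value == upper and not include_upper):
            return False
    return True


ages = [29, 27, 25, 28, 25, 27, 34, 29, 33, 38]  # ages of the ten Quick start rows

print([a for a in ages if in_range(a, lower=30, upper=35)])            # [34, 33]
print([a for a in ages if in_range(a, lower=33)])                      # [34, 38]
print([a for a in ages if in_range(a, lower=33, include_lower=True)])  # [34, 33, 38]
```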



6.2.7 Regex

A query which can match the passed regex.

Datatypes supported

• text

• string

Properties

• type: regex

• field: The field name for which the value has to match the regex.

• value: The value of the regex.

6.2.8 Wildcards

A query which can match the passed wildcard.

Datatypes supported

• text

• string

Properties

• type: wildcard

• field: The field name for which the value has to match the wildcard.

• value: The value of the wildcard expression.

6.3 Combining conditions

Conditions can be combined using the boolean query option. A boolean query can further contain nested boolean queries. A boolean query can have must, should and not conditions.

Datatypes supported

• All

Properties

• type: boolean

• must: A list of conditions that must occur in the value. Each condition is a query.

• should: A list of conditions that should occur. Each condition is a query.

• not: A list of conditions that should not occur. Each condition is a query.
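Nested boolean conditions are easier to compose from small builder functions than to write out by hand. The Python sketch below uses the "boolean" type name from the Quick start example. The builder functions themselves are hypothetical helpers, not part of Stargate.

```python
import json


def wildcard(field, value):
    return {"type": "wildcard", "field": field, "value": value}


def prefix(field, value):
    return {"type": "prefix", "field": field, "value": value}


def boolean(must=(), should=(), must_not=()):
    """Build a boolean condition; empty clause lists are left out."""
    condition = {"type": "boolean"}
    if must:
        condition["must"] = list(must)
    if should:
        condition["should"] = list(should)
    if must_not:
        condition["not"] = list(must_not)
    return condition


# Names starting with m, in a company starting with a or o.
search = {
    "filter": boolean(
        must=[
            wildcard("name", "m*"),
            boolean(should=[prefix("company", "a"), prefix("company", "o")]),
        ]
    )
}
print(json.dumps(search, indent=2))
```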



As a reference, the following lists the queries that are possible along with the properties that are available for each type of query.

lucene
    A query using the Lucene standard query parser syntax. All datatypes supported.
    • field: The default value on which this Lucene query is being made.
    • value: The Lucene query using query parser syntax.

match
    A query to match a value exactly. All datatypes supported.
    • field: The field name for the value has to match.
    • value: The value of the field to match.

phrase
    Various values forming a phrase with a slop. For text types only.
    • field: The field name for the value has to match.
    • values: The list of values of the phrase.
    • slop: How many words can be skipped between the words in the phrase.

fuzzy
    Fuzzy searches based on the Levenshtein Distance. For text and string types only.
    * Also need to specify indexOptions during creation.
    * Index options need to have DOCS_AND_FREQS_AND_POSITIONS.
    • field: The field name for the value has to match.
    • value: The value of the field to match.
    • maxEdits (default = 2): Value between 0 and 2 (the Levenshtein automaton maximum supported distance).
    • prefixLength (default = 0): Integer representing the length of the common non-fuzzy prefix.
    • maxExpansions (default = 50): An integer for the maximum number of terms to match.

prefix
    A query to find values with the passed prefix. For text and string types only.
    • field: The field name for the value has to be prefixed with.
    • value: The value of the field to have the passed prefix.

range
    A range of values to match. All datatypes supported.
    • field: The field name for which the range is being specified.
    • lower: Lower bound of the range. Defaults to lower value of the data type.
    • includeLower (default = false): If the left value is included in the results (>=).
    • upper: Upper bound of the range. Defaults to upper value of the data type.
    • includeUpper (default = false): If the right value is included in the results (<=).

regex
    A query which can match the passed regex. For text and string types only.
    • field: The field name for which the value has to match the regex.
    • value: The regex value.

wildcard
    A query with wildcard expressions. For text and string types only.
    • field: The field name for which the value has to match the wildcard.
    • value: The value with wildcards.

boolean
    A query which joins sub queries using a boolean condition. All datatypes supported.
    • must: A list of conditions that must occur in the value. Each condition is a query.
    • should: A list of conditions that should occur. Each condition is a query.
    • not: A list of conditions that should not occur. Each condition is a query.


6.4 Sort

Sortable fields need to be stored as striped fields, as described in the Striping/Sorting section of Index Configuration. Once a field is marked as striped while indexing, a sort may be specified as follows:

{
  fields: [
    {field: <name>, reverse: <reverse>},
    {field: <name>, reverse: <reverse>}
    ...
  ]
}

where <name> is the name of the field on which the sort is to be applied, and reverse may optionally be set to true to reverse the sort order on that field.
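Putting the two halves together, a sort needs a field that was indexed with the striped option and a sort clause that names it. The Python sketch below builds both JSON fragments. The helper sort_options is hypothetical, and the choice of "also" for the name field is only an example taken from the options listed under Striping/Sorting.

```python
import json


def sort_options(*fields):
    """Build <sort-options> from (field name, reverse) pairs."""
    return {
        "fields": [{"field": name, "reverse": bool(reverse)} for name, reverse in fields]
    }


# Index mapping: "name" is indexed and also stored as a doc value so it can be sorted on.
sg_options = {"fields": {"name": {"striped": "also"}, "company": {}}}

# Search: companies from 'a' to 'p', sorted by name in reverse order.
search = {
    "filter": {"type": "range", "field": "company", "lower": "a", "upper": "p"},
    "sort": sort_options(("name", True)),
}

print(json.dumps(sg_options))
print(json.dumps(search))
```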

6.5 Sorting across partitions

Sorting should be used only when the partition key is specified in the CQL clause. Sorting should be avoided when the complete partition key is not specified, as this leads to distributed sorting. Sorting across partitions, or distributed sorting, is not fully supported. This is left out on purpose because distributed sorting leads to deep paging. Read the next section for an overview of deep paging.

6.6 Pagination

Pagination is done via the usual CQL means on clustering keys.

Pagination with Stargate sorting is efficient only when the partition key is specified, but should be avoided as this leads to deep paging. Deep paging refers to specifying a large start offset into the search results. Basic paging can be inefficient with large start values: to return rows 1,000,000 through 1,000,010 of a sorted row list (only 10 documents), the query engine must find the top 1,000,010 documents and then take the last 10 to return to the user. Although Stargate is smart enough to retrieve the data from Cassandra only for the final 10 documents, there is still the overhead of sorting the internal ids of the top 1,000,010 documents.

Deep paging via basic paging controls is even more inefficient for distributed searches (across partition keys), since the sort values for the first 1,000,010 documents from each shard need to be returned and merged at an aggregator node in order to find the correct 10. Hence this is not supported and may lead to incorrect results.
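The cost of a large offset can be seen in a small simulation. The Python sketch below is a simplified model, not Stargate's or Lucene's actual collector: it returns one page of ascending sort keys together with the number of entries that had to be ranked to produce it.

```python
import heapq
import random


def sorted_page(sort_keys, offset, limit):
    """Return one page of ascending keys and how many entries had to be ranked."""
    # Everything up to offset + limit must be ranked before the page can be cut out.
    ranked = heapq.nsmallest(offset + limit, sort_keys)
    return ranked[offset:], len(ranked)


random.seed(7)
keys = random.sample(range(1_000_000), 5_000)

page, ranked = sorted_page(keys, offset=4_000, limit=10)
print(len(page), ranked)  # 10 4010
```

Only 10 rows are returned, yet 4,010 entries are ranked. With several shards, each shard would have to contribute its own first 4,010 entries before the merged page could be determined.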

