Information Retrieval JOSA Data Science Bootcamp


Page 1: Information Retrieval - Data Science Bootcamp

Information Retrieval
JOSA Data Science Bootcamp

Page 2: Information Retrieval - Data Science Bootcamp

Kais Hassan

● Chief Data Officer @ Altibbi.com

○ Data Science
○ BI

● Created several domain specific search solutions

● Previously Assistant Professor @ PSUT

● PhD in Computer Science from England (Medical Imaging)

Page 3: Information Retrieval - Data Science Bootcamp

Agenda

Introduction to IR and search
● Unstructured text, document-based storage
● Search Engines vs. Databases
● Inverted Index

Intro to Lucene/Solr
● Available open source search libraries and engines
● Architectural diagram for Lucene and Solr

Solr basics
● Hands-on implementation of the first Solr collection
● Indexing (example: XML files)
● Retrieving Information from Solr - Basic Queries and Parameters

Fields and custom data types
● Copy fields
● Analysis Chain: Analyzers, Tokenizers and Character Filters
● Analyzers: Case Sensitivity, Lemmatization, Stemming, Synonyms, Shingles

Exercise
● Autocomplete using n-grams

Solr @ Altibbi
● Real life examples

Page 4: Information Retrieval - Data Science Bootcamp

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

– These days we frequently think first of web search, but there are many other cases:

• Corporate knowledge bases
• Text classification
• Text clustering

Information Retrieval

Page 5: Information Retrieval - Data Science Bootcamp

Basic assumptions of Information Retrieval

• Document-based storage/Collection: A set of self-contained documents, all of the data for the document is stored in the document itself — not in a related table as it would be in a relational database

• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

Page 6: Information Retrieval - Data Science Bootcamp

How good are the retrieved docs?

▪ Precision: Fraction of retrieved docs that are relevant to the user’s information need

▪ Recall: Fraction of relevant docs in the collection that are retrieved
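The two fractions can be computed directly from the sets of retrieved and relevant documents. A minimal sketch in Python, using made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that are retrieved."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical result: 4 docs retrieved, 5 relevant in the collection, 2 overlap
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
# p = 2/4 = 0.5, r = 2/5 = 0.4
```

Note the tension between the two: retrieving everything gives perfect recall but terrible precision.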

Page 7: Information Retrieval - Data Science Bootcamp

IR vs. databases: Structured vs. unstructured data

• Structured data tends to refer to information in “tables”

Typically allows numerical range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Page 8: Information Retrieval - Data Science Bootcamp

Unstructured data

• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse
• Classic model for searching text documents
• An estimated 85% of the world’s data is unstructured


Page 9: Information Retrieval - Data Science Bootcamp

The Inverted Index - key data structure in IR

Page 10: Information Retrieval - Data Science Bootcamp

Stages of text processing

• Tokenization
  – Cut character sequence into word tokens
• Normalization
  – Map text and query term to same form
    • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
    • authorize, authorization
• Stop words
  – We may omit very common words (or not)
    • the, a, to, of
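The stages above can be sketched as a toy Python chain. The regex, stop-word list, and suffix-stripping rules here are simplified stand-ins for real analyzers (they only handle the slide's examples), not how Lucene implements analysis:

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}

def stem(token):
    # Crude suffix stripping so "authorize" and "authorization" both map
    # to "author"; a real stemmer (e.g. Porter) is far more careful.
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # Normalization: lowercase and drop periods so "U.S.A." matches "usa"
    text = text.lower().replace(".", "")
    # Tokenization: cut the character sequence into word tokens
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop words: omit very common words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming: map different forms of a root to the same term
    return [stem(t) for t in tokens]

analyze("The U.S.A. authorization")  # ['usa', 'author']
```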

Page 11: Information Retrieval - Data Science Bootcamp

Inverted index construction
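The construction idea can be shown in a few lines of Python: walk each document, and map every term to the sorted list of document IDs (the postings list) that contain it. This is a toy sketch; real engines add positions, skip lists, and compression:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index, *terms):
    """Answer an AND query by intersecting postings lists."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "new home sales", 2: "home prices rise", 3: "sales rise"}
index = build_inverted_index(docs)
index["rise"]                        # [2, 3]
search_and(index, "home", "sales")   # [1]
```

The key point: queries never scan documents; they look terms up and merge postings lists.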

Page 12: Information Retrieval - Data Science Bootcamp

What is Lucene?

➔ High performance, scalable, full-text search library
➔ Focus: Indexing + Searching Documents
  ◆ “Document” is just a list of name+value pairs
➔ No crawlers or document parsing
➔ Flexible Text Analysis (tokenizers + token filters)
➔ 100% Java, no dependencies, no config files

Both Solr and ElasticSearch are based on it

Page 13: Information Retrieval - Data Science Bootcamp

What is Solr?

• A full-text search server based on Lucene
• XML/HTTP, JSON Interfaces
• Faceted Search (category counting)
• Flexible data schema to define types and fields
• Hit Highlighting
• Configurable Advanced Caching
• Index Replication
• Written in Java

Page 14: Information Retrieval - Data Science Bootcamp

Solr Architectural Diagram

Page 15: Information Retrieval - Data Science Bootcamp

Solr Terminology

core: a physical instance of a Lucene index along with all the Solr configuration files, i.e., an index with a given schema that holds a set of documents.

collection: a logical index in a SolrCloud cluster, associated with a config set stored in ZooKeeper.

In non-distributed (standalone) Solr, the terms core and collection are sometimes used interchangeably.

Page 16: Information Retrieval - Data Science Bootcamp

Understanding Solr Directory Structure

bin: bash scripts to control Solr
contrib: additional plugins (e.g. clustering)
dist: Solr libraries
docs: documentation and tutorial
example: sample data and configuration
licenses: software licenses used in Solr

Server folder
contexts + etc + lib + modules: Jetty folders
logs: Solr and Jetty log files
resources: logging configuration
scripts: utility files for ZooKeeper and map-reduce
solr: solr.home directory, contains the core directories
solr-webapp: Solr server + admin tool

Page 17: Information Retrieval - Data Science Bootcamp

Solr Important Environment Variables

solr.install.dir: The location where you extracted the Solr installation.

solr.solr.home (SolrHome): contains core configuration and data; it must also contain solr.xml (Solr's configuration file).

By default it is located at solr.install.dir/server/solr, but it can be changed to any location.

Page 18: Information Retrieval - Data Science Bootcamp

Exercise 1: getting started with Solr

Prerequisites:

1. Java 7 or higher is installed and JAVA_HOME is set
2. You have downloaded Solr 5.4.1 (tgz for Linux, zip for Windows)
3. A good text editor (anything but Notepad)
4. Downloaded bootcamp_config + nytimes_facebook_statuses.csv

Starting/Stopping Solr

1. cd to the extracted solr folder
2. To start: bin/solr start (Linux) or bin\solr.cmd start (Windows)
   ○ Solr will start and listen on port 8983
   ○ bin/solr start -help will show start options (useful for changing options)
3. To stop: bin/solr stop

Page 19: Information Retrieval - Data Science Bootcamp

Creating a Solr core

After starting Solr, you can create a core either by:

1. The bin/solr create command
2. Creating a folder inside solr.home containing
   a. core.properties (containing core configuration such as name=$CORE_NAME)
   b. a conf folder containing at least solrconfig.xml and schema.xml
   c. then loading the core using the API or Solr Admin (or restarting Solr)

➔ We will use the create command in this session
➔ Make sure you have copied the bootcamp_config folder to solr.install.dir/server/solr/configsets

bin/solr create -c hellosolr -d bootcamp_config

Page 20: Information Retrieval - Data Science Bootcamp

Why a Custom Configuration?

❖ The create command with the default confdir copies configuration from data_driven_schema_configs, which is a managed (schemaless) schema with field guessing and dynamic fields enabled. It is good for quick prototyping, but I always prefer to choose my field types manually!

❖ The basic_configs configuration: schema.xml and solrconfig.xml contain a lot of configuration/comments and can be a bit overwhelming to start with.
  ➢ Although they are well documented and I encourage you to read them at some stage

Page 21: Information Retrieval - Data Science Bootcamp

Looking at the hellosolr core folder

● core.properties file: contains the core name and other configuration, see https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
● data folder: contains the Lucene index/files
● conf folder: configuration for the core; inside it
  ○ schema.xml: main configuration file for defining fields, text analysis, etc.
  ○ solrconfig.xml: configuration for request handlers, data, caching, etc.

Live demo explaining important parts of these files

Page 22: Information Retrieval - Data Science Bootcamp

Solr Admin - Demo

Page 23: Information Retrieval - Data Science Bootcamp

Indexing NYTimes Facebook Statuses 1

● 33k NYTimes Facebook statuses in CSV format
● Add the following fields to schema.xml:

<field name="status_message" type="text_en" indexed="true" stored="true" />

<field name="link_name" type="text_en" indexed="true" stored="true" />

<field name="status_type" type="string" indexed="true" stored="true" />

<field name="status_link" type="string" indexed="true" stored="true" />

<field name="status_published" type="tdate" indexed="true" stored="true" />

<field name="num_likes" type="tint" indexed="true" stored="true" />

<field name="num_comments" type="tint" indexed="true" stored="true" />

<field name="num_shares" type="tint" indexed="true" stored="true" />

Page 24: Information Retrieval - Data Science Bootcamp

Indexing NYTimes Facebook Statuses 2

● Reload the core via Solr Admin
● Index documents via the post util

bin/post -c hellosolr nytimes_facebook_statuses.csv

● If all is good, you should have 33,295 documents in your index

You can add documents to Solr via:

● Data Import Handler (recommended)
● post util
● APIs
● ManifoldCF (not sure it is worth it if you don’t have diverse inputs)

Page 25: Information Retrieval - Data Science Bootcamp

NYTimes Basic Queries

Add the following request handler to solrconfig.xml + reload the core

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="mm">2</str>
    <str name="fl">*,score</str>
    <str name="qf">status_message^9.0 link_name^3.0</str>
    <str name="q.alt">*:*</str>
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.limit">20</str>
    <str name="facet.field">status_type</str>
    <str name="indent">true</str>
  </lst>
  <lst name="invariants">
    <str name="rows">10</str>
    <str name="wt">json</str>
  </lst>
</requestHandler>

Page 26: Information Retrieval - Data Science Bootcamp

Basic Queries

The most basic query request for Solr is as follows:

http://ServerName:Port/solr/coreName/select?q=QueryString

To find china in the previously mentioned schema:

http://localhost:8983/solr/hellosolr/search?q=china

Looking closer at the request, notice the q parameter.

q: The q parameter is the main query for the request. If you assign q=*:*, it will return all documents.
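Request URLs of this shape can be assembled from a parameter dict; `solr_query_url` below is a hypothetical helper (any HTTP client can then send the request), shown just to make the URL structure explicit:

```python
from urllib.parse import urlencode

def solr_query_url(host, core, handler, params):
    """Build a Solr query URL: http://host/solr/core/handler?key=value&..."""
    return "http://%s/solr/%s/%s?%s" % (host, core, handler, urlencode(params))

solr_query_url("localhost:8983", "hellosolr", "search", {"q": "china"})
# 'http://localhost:8983/solr/hellosolr/search?q=china'
```

Using urlencode also takes care of escaping spaces and non-ASCII characters in the query string.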

Page 27: Information Retrieval - Data Science Bootcamp

edismax Query Parser - 1

The default query parser that comes with Solr is somewhat limited

To use a more advanced query parser, use edismax

mm (Minimum 'Should' Match): this parameter is useful when searching for several words. For example:

mm=1 (at least one word in the query must match the document)

mm=2 (at least two words in the query must match the document)
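The effect of mm can be simulated with a toy matcher (this is only the counting idea; edismax also scores, analyzes, and supports percentage values for mm):

```python
def matches_mm(doc_text, query, mm):
    """A document matches when at least `mm` of the query terms occur in it."""
    doc_terms = set(doc_text.lower().split())
    hits = sum(1 for term in query.lower().split() if term in doc_terms)
    return hits >= mm

doc = "china and jordan signed a trade agreement"
matches_mm(doc, "china jordan", 2)  # True: both terms occur
matches_mm(doc, "china brazil", 2)  # False: only "china" occurs
matches_mm(doc, "china brazil", 1)  # True: one term is enough
```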

Page 28: Information Retrieval - Data Science Bootcamp

edismax Query Parser - 2

Notice the result difference between the following queries

http://localhost:8983/solr/hellosolr/search?q=china jordan&mm=1

AND

http://localhost:8983/solr/hellosolr/search?q=china jordan&mm=2

Page 29: Information Retrieval - Data Science Bootcamp

Field Definitions

• Field Attributes: name, type, indexed, stored, multiValued

<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>

Page 30: Information Retrieval - Data Science Bootcamp

Fields

▪ Fields may
  ▪ Be indexed or not
    ▪ Indexed fields may or may not be analyzed (i.e., tokenized with an Analyzer)
    ▪ Non-analyzed fields treat the entire value as a single token (useful for URLs, paths, dates, social security numbers, ...)
  ▪ Be stored or not
    ▪ Useful for fields that you’d like to display to users
  ▪ Optionally store term vectors
    ▪ Like a positional index on the Field’s terms
    ▪ Useful for highlighting, finding similar documents, categorization

Page 31: Information Retrieval - Data Science Bootcamp

copyField

• Copies one field to another at index time
• Use case #1: Analyze the same field in different ways
  – copy into a field with a different analyzer
  – boost exact-case matches

<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>

• Use case #2: Index multiple fields into a single searchable field

Page 32: Information Retrieval - Data Science Bootcamp

Custom Field Types

In Solr you can create custom field types, each of which specifies a text analysis pipeline:

<fieldType name="my_arabi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

Page 33: Information Retrieval - Data Science Bootcamp

Tokenizers and TokenFilters

Analyzers are typically composed of Tokenizers and TokenFilters

● Tokenizer: controls how your text is tokenized; there can be only one Tokenizer in each Analyzer

● TokenFilter: mutates and manipulates the stream of tokens

Solr lets you mix and match Tokenizers and TokenFilters in schema.xml to define Analyzers

Most factories have customization options

Page 34: Information Retrieval - Data Science Bootcamp

Notable Token(izers|Filters) - 1/2

WhitespaceTokenizer: Creates tokens by splitting the text on whitespace

StandardTokenizerFactory: General purpose tokenizer that strips extraneous characters

LowerCaseFilterFactory: Lowercases the letters in each token

TrimFilterFactory: Trims whitespace at either end of a token.

● Example: " Kittens! ", "Duck" ==> "Kittens!", "Duck".

PatternReplaceFilterFactory: Applies a regex replacement to each token

● Example: pattern="([^a-z])" replacement="" strips every character that is not a lowercase letter

Page 35: Information Retrieval - Data Science Bootcamp

Notable Token(izers|Filters) - 2/2

StopFilterFactory

SynonymFilterFactory

EdgeNGramFilterFactory: creates edge n-grams (prefixes of the token, from minGramSize to maxGramSize characters)

<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />

Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
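The filter's output is easy to reproduce by hand; a small sketch of the assumed behavior (prefixes from minGramSize up to maxGramSize, capped at the token length):

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Mimic EdgeNGramFilterFactory: emit leading prefixes of the token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

edge_ngrams("nigerian")
# ['ni', 'nig', 'nige', 'niger', 'nigeri', 'nigeria', 'nigerian']
```

Indexing these prefixes is what makes prefix-style autocomplete a plain term lookup at query time.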

For a list of available Filters

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Page 36: Information Retrieval - Data Science Bootcamp

Analysis Tool

Part of Solr Admin that lets you enter text and see how it would be analyzed for a given field (or field type).

Displays step-by-step information for analyzers configured using Solr factories:

● The token stream produced by the Tokenizer

● How the token stream is modified by each TokenFilter

● How the tokens produced when indexing compare with the tokens produced when querying

Helpful in deciding which Tokenizer/TokenFilters to use for each field based on your goals

Page 37: Information Retrieval - Data Science Bootcamp

Hands-on: Tokenizers and Filters

Live Demo

Page 38: Information Retrieval - Data Science Bootcamp

Exercise - Autocomplete using n-grams

Requirements:

1) Match from the edge of the field, e.g. if the document field is "مرض السكري" and the query is "مرض ال", it will match, but "السكري" will not match.

2) Match any word in the input field, with implicit truncation. This means that the field "مرض السكري" will be matched by the query "السكري". We use this to get partial matches, but these should be boosted lower.

Tip: WordDelimiterFilterFactory + EdgeNGramFilterFactory

Page 39: Information Retrieval - Data Science Bootcamp

Solr @ Altibbi

Live Demo

Page 40: Information Retrieval - Data Science Bootcamp

Further Reading

The “Apache Solr Reference Guide” is always handy.