How to Build the Best Data Matching Product

RESULT MATCHING

or top solutions for advanced data matching products: from recruitment portals to Netflix

WHAT WE’RE LOOKING AT...

1. INTRODUCTION
2. ES vs SQL
3. BASIC FILTER
4. ADVANCED SEARCH
5. KEYWORD SEARCH
6. SCORING
7. WEIGHT MANIPULATION/SCRIPTING
8. FULL PARTIAL NO MATCH
9. MATCHING
10. AGGREGATION
11. TO SUM UP...

KEYWORD SEARCH - EXAMPLE

Searching by a string or part of a string, in simple or advanced form, is very powerful!

TYPES OF SEARCHES

BEST_FIELDS - Finds documents which match any field, but uses the _score from the best field.

MOST_FIELDS - Finds documents which match any field and combines the _score from each field.

CROSS_FIELDS - Treats fields with the same analyzer as though they were one big field. Looks for each word in any field.

PHRASE - Runs a match_phrase query on each field and combines the _score from each field.

PHRASE_PREFIX - Runs a match_phrase_prefix query on each field and combines the _score from each field.

{
  "multi_match": {
    "query": "this is a test",
    "fields": ["subject", "message"]
  }
}
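The search types listed above are chosen through the `type` parameter of `multi_match`. Here is a minimal sketch that only builds the request body in Python and does not contact a cluster. The helper name `build_multi_match` is hypothetical; the `subject` and `message` fields come from the example above. When `type` is left out, Elasticsearch uses `best_fields`.

```python
import json

# The multi_match types listed on this slide, in the lowercase form the query DSL expects.
MULTI_MATCH_TYPES = {"best_fields", "most_fields", "cross_fields", "phrase", "phrase_prefix"}


def build_multi_match(query, fields, match_type="best_fields"):
    """Return a search request body for a multi_match query (hypothetical helper)."""
    if match_type not in MULTI_MATCH_TYPES:
        raise ValueError(f"unsupported multi_match type: {match_type}")
    return {
        "query": {
            "multi_match": {
                "query": query,
                "type": match_type,
                "fields": list(fields),
            }
        }
    }


# Combine the _score of every matching field instead of keeping only the best one.
body = build_multi_match("this is a test", ["subject", "message"], "most_fields")
print(json.dumps(body, indent=2))
```

Switching `match_type` is the only change needed to compare how the different types rank the same documents.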

SCORING

Each result (each document) in Elasticsearch has a score.

Results are ordered by score, but how can we predict which results will score highest and appear at the top of the list?

Score relevance comes to the rescue!

Score relevance is based on:

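❏ Term frequency - How often does the term appear in this document? The more often, the higher the weight.

❏ Inverse document frequency - How often does the term appear in all documents in the collection? The more often, the lower the weight.

❏ Field-length norm - How long is the field? The shorter the field, the higher the weight.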
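The three factors can be sketched numerically. The formulas below follow Lucene's classic TF/IDF similarity; that is an assumption, since the slides do not state them, and recent Elasticsearch versions default to BM25, which builds on the same three ingredients. The function names and the way the factors are multiplied together are simplified for illustration.

```python
import math


def term_frequency(freq):
    """More occurrences in the document give a higher weight, with diminishing returns."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """Terms found in many documents of the collection get a lower weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Shorter fields get a higher weight."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified combination of the three factors for one term in one field."""
    return (
        term_frequency(freq)
        * inverse_document_frequency(num_docs, doc_freq)
        * field_length_norm(num_terms)
    )


# A rare term in a short field versus a common term in a long field.
rare_short = term_weight(freq=1, num_docs=1000, doc_freq=9, num_terms=4)
common_long = term_weight(freq=1, num_docs=1000, doc_freq=499, num_terms=100)
print(rare_short, common_long)
```

The rare term in the short field ends up with the larger weight, which is the behavior the three questions above describe.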

The first and easiest way to manipulate weights is by using a vector space model. What you saw on the previous slide is a kind of joke: this is what happens when you type “vector space model” into Google and take it too literally.

By the way, this is how ES sometimes works: you must know what you are doing, otherwise you can end up with a collection of planets instead of an array of numbers and have no idea why ;)

WEIGHT MANIPULATION/SCRIPTING

Vector Space Model

Let’s switch to a real vector space model. A vector here is nothing more than a simple array of numbers, each saying how important a term is in the current search.

[1,2,5,22,3,8]

Now, imagine we have three documents:

1. I am happy in summer.

2. After Christmas I’m a hippopotamus.

3. The happy hippopotamus helped Harry.

We can create a similar vector for each document, consisting of the weight of each query term—happy and hippopotamus—that appears in the document, and plot these vectors on the same graph.

The nice thing about vectors is that they can be compared.

By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document.

The angle between document 1 and the query is large, so it is of low relevance. Document 2 is closer to the query, meaning that it is reasonably relevant, and document 3 is a perfect match.
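The comparison can be tried out with the three documents. This is a minimal sketch under stated assumptions: the weights 2 for happy and 5 for hippopotamus are made up for illustration, and each document vector simply keeps a term's weight when the term appears in the document and 0 otherwise. The function names are hypothetical.

```python
import math

# Assumed weights for the two query terms, in the order [happy, hippopotamus].
QUERY = [2.0, 5.0]

# A document keeps the weight of a query term only if that term appears in it.
DOCUMENTS = {
    "1. I am happy in summer.": [2.0, 0.0],
    "2. After Christmas I'm a hippopotamus.": [0.0, 5.0],
    "3. The happy hippopotamus helped Harry.": [2.0, 5.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between two vectors; a smaller angle means a more relevant document."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cosine = min(1.0, max(-1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cosine))


# Rank documents from the smallest angle (most relevant) to the largest.
ranked = sorted(DOCUMENTS, key=lambda doc: angle_degrees(QUERY, DOCUMENTS[doc]))
for doc in ranked:
    print(f"{angle_degrees(QUERY, DOCUMENTS[doc]):5.1f} degrees  {doc}")
```

With these assumed weights, document 3 comes first, then document 2, then document 1, matching the ordering described above.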

Aggregations help build complex summaries & analytics.

Elasticsearch is not only for searching data; it is also a handy way to prepare summaries. The best thing about ES is that it can handle both at once.

Searches solve the problem of finding the best matching documents, but one more crucial question remains: “What do these documents tell me about my business?” That’s where aggregations come in.

The two most frequently used kinds of aggregation are buckets and metrics.
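A bucket and a metric can be combined in one request. Here is a minimal sketch that only builds the request body in Python: a `terms` aggregation groups documents into buckets, and a nested `avg` aggregation computes a metric per bucket. The helper name and the aggregation names are hypothetical, and the `city` and `salary` fields are invented for a recruitment-portal style index.

```python
import json


def build_bucket_with_metric(bucket_field, metric_field, top_n=10):
    """Return a request body with one bucket aggregation and one nested metric."""
    return {
        "size": 0,  # skip the hits and return only the aggregation results
        "aggs": {
            "by_bucket": {
                # Bucket: one group per distinct value of bucket_field.
                "terms": {"field": bucket_field, "size": top_n},
                "aggs": {
                    # Metric: the average of metric_field inside each bucket.
                    "average_value": {"avg": {"field": metric_field}}
                },
            }
        },
    }


# Hypothetical fields: average salary per city.
body = build_bucket_with_metric("city", "salary")
print(json.dumps(body, indent=2))
```

Adding a `query` key next to `aggs` restricts the summary to the matching documents, so search and aggregation run in the same request.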

Want to see how this can work with your product? Email us!

[email protected]