Page 1: How to build your own google

How to build your own google ...

[email protected]

Data Wizards Dec 2015

Page 2: How to build your own google

Artur Grządziel: a few words about me

email: [email protected]

Currently: BigData and Machine Learning Leader

From Jan 2016: BigData Solution Architect at General Electric

PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute

Graduated from Warsaw University of Technology and Warsaw School of Economics

Big Data and Machine Learning enthusiast focused on applying both to real business cases

Privately, husband and father

pl.linkedin.com/in/ArturGrzadziel

Page 3: How to build your own google

Introduction Data Wizards

Artur represents the "Data Wizards" group, an informal group of BigData/Machine Learning/Data Science professionals located in Poland who are interested in sharing knowledge and in addressing business challenges with modern Big Data and Machine Learning methods.

Page 4: How to build your own google

Agenda

1. Cloudera search

2. How it works?

Page 5: How to build your own google

MySearch very high level architecture

Data Source

Index

Page 6: How to build your own google

Cloudera search Apache Solr and Tika

1.

Other Sources

Page 7: How to build your own google

Cloudera Search

Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple, full-text interface for searching.

Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated with Cloudera's Distribution Including Apache Hadoop (CDH). Cloudera Search provides these key capabilities:

- Near-real-time indexing

- Batch indexing

- Simple, full-text data exploration and navigated drill down

http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html

Page 8: How to build your own google

Cloudera search Tika

https://tika.apache.org/download.html

Page 9: How to build your own google

Cloudera search Tika – image

Page 10: How to build your own google

Cloudera search Tika – PDF file

Page 11: How to build your own google

Cloudera search Tika – gazeta.pl

Page 12: How to build your own google

Cloudera search Tika – formats

Supported Document Formats

• HyperText Markup Language

• XML and derived formats

• Microsoft Office document formats

• OpenDocument Format

• Portable Document Format

• Electronic Publication Format

• Rich Text Format

• Compression and packaging formats

• Text formats

• Audio formats

• Image formats

• Video formats

• Java class files and archives

• The mbox format

https://tika.apache.org/1.4/formats.html
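A minimal sketch of how text could be extracted from any of these formats, assuming a Tika server is already running locally on its default port (9998). The URL constant and the helper names `build_tika_request` and `extract_text` are illustrative choices, not something shown on the slides.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port (9998).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain-text content of a document."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send a local file (PDF, image, HTML, ...) to Tika and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("report.pdf")` (a hypothetical file name) would return the text content that can then be indexed.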

Page 13: How to build your own google

Cloudera search Solr – how to start it …

.\bin\solr start -e cloud -noprompt

http://lucene.apache.org/solr/

Page 14: How to build your own google

Cloudera Search Administration

Page 15: How to build your own google

Cloudera Search Data

id         cat  name                  price inStock author             series_t                            sequence_i genre_s
553573403  book A Game of Thrones     7.99  TRUE    George R.R. Martin A Song of Ice and Fire              1          fantasy
553579908  book A Clash of Kings      7.99  TRUE    George R.R. Martin A Song of Ice and Fire              2          fantasy
055357342X book A Storm of Swords     7.99  TRUE    George R.R. Martin A Song of Ice and Fire              3          fantasy
553293354  book Foundation            7.99  TRUE    Isaac Asimov       Foundation Novels                   1          scifi
812521390  book The Black Company     6.99  FALSE   Glen Cook          The Chronicles of The Black Company 1          fantasy
812550706  book Ender's Game          6.99  TRUE    Orson Scott Card   Ender                               1          scifi
441385532  book Jhereg                7.95  FALSE   Steven Brust       Vlad Taltos                         1          fantasy
380014300  book Nine Princes In Amber 6.99  TRUE    Roger Zelazny      the Chronicles of Amber             1          fantasy
805080481  book The Book of Three     5.99  TRUE    Lloyd Alexander    The Chronicles of Prydain           1          fantasy
080508049X book The Black Cauldron    5.99  TRUE    Lloyd Alexander    The Chronicles of Prydain           2          fantasy

Page 16: How to build your own google

Cloudera Search Output format

Page 17: How to build your own google

Cloudera Search Simple query

Page 18: How to build your own google

Cloudera Search Simple query

Page 19: How to build your own google

Cloudera Search More advanced query

Page 20: How to build your own google

Cloudera Search Query with facets
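The simple, more advanced and faceted queries can be sketched as plain HTTP requests against Solr's select handler. The queries on the slides were screenshots, so the ones below are illustrative only. Assumptions: Solr runs locally on its default port (8983), the book data was indexed into a collection called "gettingstarted", and the helper name `solr_query_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumptions: local Solr on the default port and a collection named
# "gettingstarted" holding the book documents.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def solr_query_url(q, facet_field=None, rows=10, base=SOLR_SELECT):
    """Build a select URL; optionally ask Solr to count results per facet value."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    if facet_field is not None:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base + "?" + urlencode(params)


# Simple query: books whose name contains "black".
simple = solr_query_url("name:black")

# More advanced query: fantasy books that are in stock.
advanced = solr_query_url("genre_s:fantasy AND inStock:true")

# Faceted query: match everything, return no documents, count books per genre.
faceted = solr_query_url("*:*", facet_field="genre_s", rows=0)
```

Each URL can be opened in a browser or fetched with any HTTP client; the `wt=json` parameter requests the JSON output format.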

Page 21: How to build your own google

Cloudera search Solr – other features

The MoreLikeThis search component lets users query for documents similar to a document in their result list. It uses terms from the original document to find similar documents in the index.

The SpellCheck component provides inline query suggestions based on other, similar terms.

Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response.

Synonyms, stop words

Page 22: How to build your own google

Cloudera search Solr – other features – geospatial search

Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, and even boosting results by distance.

http://lucene.apache.org/solr/quickstart.html
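As a rough sketch, a distance-limited search sorted by distance could be expressed with Solr's `geofilt` and `bbox` filters and the `geodist()` function. Assumptions: the indexed documents have a location field, here called "store" (the book data above has none), and the coordinates and helper name `geo_params` are purely illustrative.

```python
from urllib.parse import urlencode


def geo_params(lat, lon, distance_km, field="store", bounding_box=False):
    """Query parameters for documents within distance_km of (lat, lon), nearest first."""
    # {!geofilt} filters by exact radius; {!bbox} uses the enclosing bounding box.
    filter_name = "bbox" if bounding_box else "geofilt"
    return {
        "q": "*:*",
        "fq": "{!%s}" % filter_name,
        "sfield": field,          # assumed name of the location field
        "pt": f"{lat},{lon}",     # centre point as "lat,lon"
        "d": distance_km,         # distance in kilometres
        "sort": "geodist() asc",  # nearest documents first
    }


# Example: everything within 5 km of a point in central Warsaw.
query_string = urlencode(geo_params(52.23, 21.01, 5))
```

The resulting query string would be appended to a select URL of the collection that holds the location data.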

Page 23: How to build your own google

Cloudera Search Common Use Cases

Cloudera Search lets your entire business explore and analyze data quickly and easily for a variety of critical use cases all within a single platform, including:

- Threat detection

- Customer 360-degree visibility

- Improved user experience

- Interactive market segmentation

- Accessible global knowledge base

https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html

Page 24: How to build your own google

Cloudera Search Other Use Cases

Instagram: Instagram (a Facebook company) uses Solr to power its geosearch API.

WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr.

Netflix: Solr powers basic movie searching on this extremely busy site.

StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events.

https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html

Page 25: How to build your own google

How it works ... ?

Page 26: How to build your own google

How it works … ? Data Source – documents …

Document Content

1 John has a cat

2 John has a dog

3 Eva has a cat

4 George has a dog

Page 27: How to build your own google

How it works … ? Data Source – documents … space of unique terms

Document Content

1 John has a cat

2 John has a dog

3 Eva has a cat

4 George has a dog

1 2 3 4

1 2 3 5

6 2 3 4

7 2 3 5

List of unique words:

1. John

2. has

3. a

4. cat

5. dog

6. Eva

7. George
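The mapping from documents to the space of unique terms can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and IDs assigned in order of first appearance; the function names are made up for the example.

```python
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(documents):
    """Assign each unique word an ID (starting at 1) in order of first appearance."""
    vocab = {}
    for doc_id in sorted(documents):
        for word in documents[doc_id].split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Replace every word of a document with its term ID."""
    return [vocab[word] for word in text.split()]


vocabulary = build_vocabulary(docs)
encoded = {doc_id: encode(text, vocabulary) for doc_id, text in docs.items()}
# vocabulary -> {'John': 1, 'has': 2, 'a': 3, 'cat': 4, 'dog': 5, 'Eva': 6, 'George': 7}
# encoded    -> {1: [1, 2, 3, 4], 2: [1, 2, 3, 5], 3: [6, 2, 3, 4], 4: [7, 2, 3, 5]}
```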

Page 28: How to build your own google

How it works … ? Data Source – Documents … boolean search with inverted index

Dictionary              Documents
Term     Tot. freq.     Doc #
John     2              1, 2
has      4              1, 2, 3, 4
a        4              1, 2, 3, 4
cat      2              1, 3
dog      2              2, 4
Eva      1              3
George   1              4
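A minimal sketch of boolean search over an inverted index for the four example documents. Assumptions: whitespace tokenization and lowercasing of terms (the slide keeps the original case); the function names are illustrative.

```python
from collections import defaultdict

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document numbers that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def search_and(index, *terms):
    """Documents containing ALL terms: intersect the posting lists."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


def search_or(index, *terms):
    """Documents containing ANY term: union of the posting lists."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.union(*postings)) if postings else []


index = build_inverted_index(docs)
print(search_and(index, "John", "cat"))   # [1]
print(search_and(index, "has", "dog"))    # [2, 4]
print(search_or(index, "Eva", "George"))  # [3, 4]
```

The lookup touches only the posting lists of the query terms, which is why an inverted index scales much better than scanning every document.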

Page 29: How to build your own google

How it works … ? Data Source – documents as vectors

Documents

document 1 John has a cat

document 2 John has a dog

document 3 Eva has a cat

document 4 George has a dog

Space of unique terms -> John has a cat dog Eva George

vector representing doc1 -> 1 1 1 1 0 0 0

vector representing doc2 -> 1 1 1 0 1 0 0

vector representing doc3 -> 0 1 1 1 0 1 0

vector representing doc4 -> 0 1 1 0 1 0 1
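Once documents are vectors, a query can be turned into a vector too and compared with each document. The sketch below uses cosine similarity, which is one common choice and an assumption here, since the slides do not name a specific similarity measure. The function names are illustrative.

```python
import math

terms = ["John", "has", "a", "cat", "dog", "Eva", "George"]

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, 0 otherwise."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {doc_id: to_vector(text, terms) for doc_id, text in docs.items()}

# Rank documents against the query "John cat": document 1 matches both terms,
# documents 2 and 3 match one term each, document 4 matches none.
query = to_vector("John cat", terms)
ranked = sorted(docs, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
```

Unlike boolean search, this gives a ranking: partially matching documents still appear, just lower in the list.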

Page 30: How to build your own google

How it works … ? Data Source – Documents … vectors

Page 31: How to build your own google

Summary

1.

Other Sources

Page 32: How to build your own google

Thank you Data Wizards

E-mail: [email protected]

Links:

• Cloudera Search:

http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html

• Tika:

https://tika.apache.org/

• Apache Solr:

http://lucene.apache.org/solr/

https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html

• Vectors, Inverted Index, Frequency Matrix, etc.:

http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm