How to build your own Google

  • Published on
    18-Feb-2017


  • How to build your own Google

    artur.grzadziel@gmail.com

    Data Wizards, Dec 2015

  • Artur Grzadziel: a few words about me

    Email: artur.grzadziel@gmail.com

    Currently: Big Data and Machine Learning Leader

    From Jan 2016: Big Data Solution Architect at General Electric

    PhD in progress at the Systems Research Institute, Polish Academy of Sciences (PAN)

    Graduated from Warsaw University of Technology and Warsaw School of Economics

    Big Data and Machine Learning enthusiast focused on applying both to real business cases

    Privately, husband and father

    pl.linkedin.com/in/ArturGrzadziel

  • Introduction: Data Wizards

    Artur represents Data Wizards, an informal group of Big Data, Machine Learning, and Data Science professionals located in Poland who are interested in sharing knowledge and in addressing business challenges with modern Big Data and Machine Learning methods.

  • Agenda

    1. Cloudera search

    2. How it works?

  • MySearch: very high-level architecture

    [Diagram: Data Source, Index]

  • Cloudera Search: Apache Solr and Tika

    [Diagram: 1. Other Sources]

  • Cloudera Search

    Cloudera Search is one of Cloudera's near-real-time access products.

    Cloudera Search enables non-technical users to search and explore data stored

    in or ingested into Hadoop and HBase. Users do not need SQL or programming

    skills to use Cloudera Search because it provides a simple, full-text interface for

    searching.

    Cloudera Search incorporates Apache Solr, which includes Apache Lucene,

    SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated

    with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search

    provides these key capabilities:

    - Near-real-time indexing

    - Batch indexing

    - Simple, full-text data exploration and navigated drill down

    http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html

  • Cloudera Search: Tika

    https://tika.apache.org/download.html

  • Cloudera Search: Tika (image)

  • Cloudera Search: Tika (PDF file)

  • Cloudera Search: Tika (gazeta.pl)

  • Cloudera Search: Tika formats

    Supported Document Formats

    HyperText Markup Language

    XML and derived formats

    Microsoft Office document formats

    OpenDocument Format

    Portable Document Format

    Electronic Publication Format

    Rich Text Format

    Compression and packaging formats

    Text formats

    Audio formats

    Image formats

    Video formats

    Java class files and archives

    The mbox format

    https://tika.apache.org/1.4/formats.html
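To make the idea of content extraction concrete, here is a minimal sketch of what an extractor does for one of the listed formats, HTML: it separates metadata (the title) from the visible text that would later be indexed. This is not Tika and does not use Tika's API; it is a toy stand-in built on Python's standard library, and the names `TextExtractor` and `extract` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Toy stand-in for a content extractor (NOT Tika): collects title and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and non-visible content
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def extract(html):
    """Return a dict with the page title (metadata) and the visible text (content)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "content": " ".join(parser.chunks)}


sample = (
    "<html><head><title>News</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><p>John has a cat</p></body></html>"
)
result = extract(sample)
print(result)
```

A real extractor such as Tika additionally detects the file type and handles binary formats (PDF, Office documents, images); the sketch only shows the "markup in, text plus metadata out" shape of the step.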

  • Cloudera Search: Solr, how to start it

    .\bin\solr start -e cloud -noprompt

    http://lucene.apache.org/solr/

  • Cloudera Search: Administration

  • Cloudera Search: Data

    id          cat   name                   price  inStock  author              series_t                             sequence_i  genre_s
    553573403   book  A Game of Thrones      7.99   TRUE     George R.R. Martin  A Song of Ice and Fire               1           fantasy
    553579908   book  A Clash of Kings       7.99   TRUE     George R.R. Martin  A Song of Ice and Fire               2           fantasy
    055357342X  book  A Storm of Swords      7.99   TRUE     George R.R. Martin  A Song of Ice and Fire               3           fantasy
    553293354   book  Foundation             7.99   TRUE     Isaac Asimov        Foundation Novels                    1           scifi
    812521390   book  The Black Company      6.99   FALSE    Glen Cook           The Chronicles of The Black Company  1           fantasy
    812550706   book  Ender's Game           6.99   TRUE     Orson Scott Card    Ender                                1           scifi
    441385532   book  Jhereg                 7.95   FALSE    Steven Brust        Vlad Taltos                          1           fantasy
    380014300   book  Nine Princes In Amber  6.99   TRUE     Roger Zelazny       the Chronicles of Amber              1           fantasy
    805080481   book  The Book of Three      5.99   TRUE     Lloyd Alexander     The Chronicles of Prydain            1           fantasy
    080508049X  book  The Black Cauldron     5.99   TRUE     Lloyd Alexander     The Chronicles of Prydain            2           fantasy

  • Cloudera Search: Output format

  • Cloudera Search: Simple query

  • Cloudera Search: Simple query

  • Cloudera Search: More advanced query

  • Cloudera Search: Query with facets
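The query slides above were shown as screenshots. As a rough sketch of what such requests look like, the snippet below builds URLs for Solr's `/select` handler using the standard `q`, `fq`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host and port, the collection name `gettingstarted`, and the helper `select_url` are assumptions made for illustration; adjust them to your own installation. The field names come from the sample data above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local Solr on its default port
COLLECTION = "gettingstarted"             # assumption: collection name used for the demo


def select_url(q, filters=(), facet_fields=(), rows=10):
    """Build a /select URL; nothing is sent over the network here."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    params += [("fq", f) for f in filters]          # filter queries narrow the result set
    if facet_fields:
        params.append(("facet", "true"))            # switch faceting on
        params += [("facet.field", f) for f in facet_fields]
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


# Simple query: books whose name contains "black".
simple = select_url("name:black")

# Query with facets: count in-stock books per genre, returning no documents.
faceted = select_url("*:*", filters=["inStock:true"], facet_fields=["genre_s"], rows=0)

print(simple)
print(faceted)
print(parse_qs(urlsplit(faceted).query))  # decoded parameters, for readability
```

Pasting either URL into a browser against a running Solr instance that holds the sample data should return a JSON response; the exact response depends on the schema and the indexed documents.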

  • Cloudera Search: Solr, other features

    The MoreLikeThis search component lets users query for documents similar to a document in their result list. It uses terms from the original document to find similar documents in the index.

    The SpellCheck component is designed to provide inline query suggestions based on other, similar terms.

    Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response.

    Synonyms, stop words
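As an illustration of the highlighting idea, here is a minimal sketch that wraps query terms found in a text with marker tags. It is a toy, not Solr's highlighter: the function name `highlight`, the `<em>` markers, and the whole-word, case-insensitive matching are choices made for this example only.

```python
import re


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap each whole-word occurrence of a query term in pre/post markers."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text
    # One alternation pattern for all terms, matched case-insensitively.
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)


snippet = highlight("The Black Company", "black")
print(snippet)
```

A production highlighter also has to pick which fragments of a long document to return and has to respect the same analysis (stemming, synonyms, stop words) that was applied at index time, which this sketch ignores.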

  • Cloudera Search: Solr, other features, geospatial search

    Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, or even boosting results by the distance.

    http://lucene.apache.org/solr/quickstart.html
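The geometry behind "search within a distance and sort by distance" can be sketched in a few lines. The example below uses the haversine great-circle formula; it is not how Solr implements spatial search internally. The helper names and the approximate city coordinates are illustrative assumptions.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within(places, center, radius_km):
    """Return (distance, name) pairs inside the radius, nearest first."""
    hits = [(haversine_km(*center, lat, lon), name) for name, (lat, lon) in places.items()]
    return sorted((d, n) for d, n in hits if d <= radius_km)


# Illustrative data: approximate coordinates of three Polish cities.
places = {
    "Warsaw": (52.2297, 21.0122),
    "Krakow": (50.0647, 19.9450),
    "Gdansk": (54.3520, 18.6466),
}
nearby = within(places, center=(52.2297, 21.0122), radius_km=260)
for distance, name in nearby:
    print(f"{name}: {distance:.0f} km")
```

Filtering by radius corresponds to the "within a specified distance range" case, and the sort corresponds to "sorting by distance"; boosting would instead fold the distance into the relevance score.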

  • Cloudera Search: Common Use Cases

    Cloudera Search lets your entire business explore and analyze data quickly and easily for a variety of critical use cases, all within a single platform, including:

    - Threat detection

    - Customer 360-degree visibility

    - Improved user experience

    - Interactive market segmentation

    - Accessible global knowledge base

    https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html

  • Cloudera Search: Other Use Cases

    Instagram: Instagram (a Facebook company) uses Solr to power its geosearch API.

    WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr.

    Netflix: Solr powers basic movie searching on this extremely busy site.

    StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events.

    https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html

  • How it works ... ?

  • How it works? Data Source: documents

    Document  Content
    1         John has a cat
    2         John has a dog
    3         Eva has a cat
    4         George has a dog

  • How it works? Data Source: documents and the space of unique terms

    Document  Content           Term IDs
    1         John has a cat    1 2 3 4
    2         John has a dog    1 2 3 5
    3         Eva has a cat     6 2 3 4
    4         George has a dog  7 2 3 5

    List of unique words:

    1. John

    2. has

    3. a

    4. cat

    5. dog

    6. Eva

    7. George
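The numbering of unique words can be reproduced with a short sketch: walk through the documents in order, give each new word the next free number, and then rewrite every document as a sequence of those numbers. The function name `build_vocabulary` and the whitespace tokenization are assumptions made for this example.

```python
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(docs):
    """Assign each unique word an ID, numbered from 1 in order of first appearance."""
    vocab = {}
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            vocab.setdefault(term, len(vocab) + 1)
    return vocab


vocab = build_vocabulary(docs)

# Each document rewritten as the list of its term IDs.
encoded = {doc_id: [vocab[t] for t in text.split()] for doc_id, text in docs.items()}

print(vocab)
for doc_id, ids in encoded.items():
    print(doc_id, ids)
```

Real analyzers usually also lowercase the text and may drop stop words such as "a" before assigning IDs; the sketch keeps the words exactly as they appear on the slide.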

  • How it works? Data Source: documents, boolean search with inverted index

    Dictionary          Documents
    Term    Tot. freq.  Doc #
    John    2           1, 2
    has     4           1, 2, 3, 4
    a       4           1, 2, 3, 4
    cat     2           1, 3
    dog     2           2, 4
    Eva     1           3
    George  1           4
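The dictionary and its document lists can be built and queried with a small sketch. The function names `build_inverted_index`, `search_and`, and `search_or` are hypothetical, and matching is case-sensitive to mirror the slide. In this tiny corpus every term occurs at most once per document, so the number of documents in a term's list equals its total frequency.

```python
from collections import defaultdict

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):  # count each term once per document
            index[term].append(doc_id)
    return dict(index)


def search_and(index, *terms):
    """Boolean AND: documents that contain every term."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def search_or(index, *terms):
    """Boolean OR: documents that contain at least one term."""
    return sorted(set().union(*(index.get(t, ()) for t in terms)))


index = build_inverted_index(docs)
frequency = {term: len(doc_ids) for term, doc_ids in index.items()}

print(index["cat"], frequency["cat"])
print(search_and(index, "John", "cat"))
print(search_or(index, "Eva", "George"))
```

The point of the structure is that a query only touches the lists of its own terms, so answering it never requires scanning the documents themselves.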

  • How it works? Data Source: documents as vectors

    Documents

    document 1  John has a cat
    document 2  John has a dog
    document 3  Eva has a cat
    document 4  George has a dog

    Space of unique terms ->     John  has  a  cat  dog  Eva  George
    vector representing doc1 ->  1     1    1  1    0    0    0
    vector representing doc2 ->  1     1    1  0    1    0    0
    vector representing doc3 ->  0     1    1  1    0    1    0
    vector representing doc4 ->  0     1    1  0    1    0    1
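The vectors above can be generated mechanically, and once documents are vectors a query can be turned into a vector too and compared with each document. The sketch below uses cosine similarity for that comparison; the slides do not name a similarity measure, so cosine is an assumption here (a common choice), as are the helper names `to_vector` and `cosine`.

```python
from math import sqrt

terms = ["John", "has", "a", "cat", "dog", "Eva", "George"]
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, else 0."""
    words = set(text.split())
    return [1 if t in words else 0 for t in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {doc_id: to_vector(text, terms) for doc_id, text in docs.items()}

# Rank documents against the query "John cat", most similar first.
query = to_vector("John cat", terms)
scores = {doc_id: cosine(query, vec) for doc_id, vec in vectors.items()}
ranked = sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)

for doc_id in ranked:
    print(doc_id, docs[doc_id], round(scores[doc_id], 3))
```

Unlike the boolean search, this gives a ranking: document 1 contains both query words and scores highest, documents 2 and 3 contain one each, and document 4 contains neither.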

  • How it works? Data Source: documents, vectors

  • Summary

    [Diagram: 1. Other Sources]

  • Thank you: Data Wizards

    E-mail: artur.grzadziel@gmail.com

    Links:

    Cloudera Search:

    http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html

    Tika:

    https://tika.apache.org/

    Apache Solr:

    http://lucene.apache.org/solr/

    https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html

    Vectors, Inverted Index, Frequency Matrix, etc.:

    http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm