II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - Bpm-Conseil, France)

Patrick BeaucampFounder of the Vanilla Project

Mail : [email protected]

Search and Data Mining Open Source Platforms

II-SDV, Nice 15th April 2014

1II-SDV, Nice

Presentation Agenda

Open Source Search Engine & Search PlatformSome interesting Platforms

Features expected for Search Engines

Features expected for Search Platforms (Interface)

2II-SDV, Nice

Open Source Data Mining & Machine Learning PlatformDatamining subject & Usage type

Some interesting Platforms

Hadoop & Data Mining : the future of Learning Machin

If you don’t find it, it doesn’t exist… so, may be an engine will find it for you ?

Search & Data Mining Together ?

Search Interface

3II-SDV, Nice

Enhanced Search Interface

Search & Data Mining Together ?

4II-SDV, Nice

Data Mining : search for you !

• Example : correlation provides you with information about how data

are linked together

5II-SDV, Nice

Confusion Search Engine & Platforms-Search platform per domain :

- News (yahoo)

- Job (jobserve)

- Real Estate

- Human Relation

- Craig List

- Patent industry

- …

A business case : Facebook- Basic Search interface

- Proposal for connection

- Advanced graphical search

6II-SDV, Nice

Search EnginesSearch Engine Providers in Open Source - Lucene

- Lucene is a java based indexing and search API

- Solr/Lucene is the leading server extension of Lucene. 2 companies,

LucidWorks and ElasticSearch, provides packaging and extension of top of

Lucene and Solr

-Search Landscape

-• Lucene : http://lucene.apache.org

-• Solr/Lucene : http://lucene.apache.org/solr/

-• Plateform OpenSearch : http://www.open-search-server.com

-• Plateform Katta : http://katta.sourceforge.net

-• Plateform LucidWorks : http://www.lucidworks.com

-• Plateform ElasticSearch : http://www.elasticsearch.com

-• Sphinx : http://sphinxsearch.com/

http://lucene.apache.org/

7II-SDV, Nice

Before Search EnginesBefore indexing your document base, you need to access it !

-Apache Nutch is a highly extensible and scalable open source web crawler

software project.

-Reference : http://nutch.apache.org/

8II-SDV, Nice

Search Engines : focusSphinx : inside Database like MariaDb (Mysql)

Lucene : Retrieval Software library

Use existing Search Infrastructure like Solr/Lucene (Vanilla certified)

http://www.lucidworks.com/ or http://www.elasticsearch.org/

http://www.elasticsearch.org/

9II-SDV, Nice

Search Engines Basics (1/2)-Synonyms

- It is possible to extend the search to synonyms if they are listed in a

glossary. For example, to find articles containing synonyms to “TV” when

you search with the word TV.

-Metadata

- Dictionary for list of searchable keywords

10II-SDV, Nice

Search Engines Basics (2/2)-Reserved Words, Protected Words

- Indexing usually uses stemming, which is to reduce words to their root, for

example "Developp" to find items also contain the word when trying to

develop the word development. However, sometimes there are adverse

lemmatizations, indexing under one lemma two words that have no

relation. It is possible to prevent the stemming of words by listing them in

a file protwords.txt.

-StopWords

- The stopwords are meaningless words. A word considered insignificant

will be ignored. Note that some words are insignificant in some contexts,

others have homonyms signifiers. For example, can refer to a summer

season (rather mean) or past participle of the verb to be (relatively

insignificant). Stopwords.txt the file looks like this

11II-SDV, Nice

Search Engines current Limits-Multi Language support (this is where commercial search engine have still more

to bring to customer), even there is now Asian type language support (Hindi,

Thai, Chineese, …)

-Elision :

- Elisions are a feature of the French, which consist of a contraction of the

words like or when they are followed by a vowel. Example: + aircraft gives

the aircraft. It is possible to remove these elisions using a lexicon.

-Full text search interface (language with search engine)

-SubQuery support : now its better with Solr 4.7

-Scalability (this is where Solr is taking technical advantage)

12II-SDV, Nice

Hadoop SearchPlatform for BigData-Cloudera with Solr/Cloud (Solr/Lucene)

-Mapr with ElasticSearch (Lucene code)

-HortonWorks with LucidWorks (Solr/Lucene)

13II-SDV, Nice

Search Interface : expectation (1/3)-Advance indexing and querying tools.

-Provides distributed searching capabilities to prevent bottleneck for a particular

server.

-Provides document excerpts (snippets) generation that provides summary of the

search

-Relevance ranking display extracts from the documents based on the query.

14II-SDV, Nice

Search Interface : expectation (2/3)-Duplicate document detection, including fuzzy near duplicates

-Rich Document Parsing and Indexing without using Database Indexing.

-Ranking control carry out a targeted ranking of individual documents.

-Search Grouping by Type / Tag / Categories (General page, documents, images)

15II-SDV, Nice

-Multi Criteria support

-Ranking

-Natural language support

-Apps Support (Android, Ipad)

Search Interface : expectation (3/3)

16II-SDV, Nice

Migration to Open Source-Only few project, but it’s growing

SAME FUNCTIONALITY, MUCH LOWER COSTS

-http://www.searchtechnologies.com/fast-to-solr-migration.html

-http://fr.slideshare.net/janhoy/migrating-fast-to-solr

17II-SDV, Nice

Datamining-Work on your data with Dataming component such as

R : http://www.revolutionanalytics.com/

Weka : http://www.cs.waikato.ac.nz/ml/weka/

RapidMiner : http://rapidminer.com/

KNIME : http://www.knime.org/

Apache Mahout : http://mahout.apache.org/

Many DefinitionsNon-trivial extraction of implicit, previously unknown and potentially useful

information from data

Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Datamining

II-SDV, Nice

What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”

II-SDV, Nice

Datamining

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Traditional Techniques may be unsuitable due to

Enormity of data

High dimensionality of data

Heterogeneous, distributed nature of data

Machine Learning/

Pattern

Recognition

Statistics/

AI

Data Mining

Database

systems

II-SDV, Nice

Datamining

Data Mining Tasks

Description Methods

Find human-interpretable patterns that describe the data.

Prediction Methods

Use some variables to predict unknown or future values of other variables.

II-SDV, Nice

Datamining

Data Mining Tasks...

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]

II-SDV, Nice

Datamining

Classification: Definition

Given a collection of records (training set )

Each record contains a set of attributes, one of the attributes is the class.

Find a model for class attribute as a function of the values of other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.A test set is used to determine the accuracy of the model. Usually, the given

data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

II-SDV, Nice

Datamining

Classification with Weka – J48

II-SDV, Nice

Datamining

Classification: Application 1

Direct Marketing

Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

Approach:

Use the data for a similar product introduced before.

We know which customers decided to buy and which decided otherwise. This {buy, don’t buy}decision forms the class attribute.

Collect various demographic, lifestyle, and company-interaction related information about all such customers.

Type of business, where they stay, how much they earn, etc.

Use this information as input attributes to learn a classifier model.

II-SDV, NiceII-SDV, Nice

Datamining


Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.

Approach:

Use credit card transactions and the information on its account-holder as attributes.

When does a customer buy, what does he buy, how often he pays on time, etc

Label past transactions as fraud or fair transactions. This forms the class attribute.

Learn a model for the class of the transactions.

Use this model to detect fraud by observing credit card transactions on an account.

II-SDV, Nice

Datamining


Customer Attrition/Churn:Goal: To predict whether a customer is likely to be lost to a

competitor.

Approach:Use detailed record of transactions with each of the past and present

customers, to find attributes.How often the customer calls, where he calls, what time-of-the day he calls most,

his financial status, marital status, etc.

Label the customers as loyal or disloyal.

Find a model for loyalty.

Datamining

II-SDV, Nice

Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that

Data points in one cluster are more similar to one another.

Data points in separate clusters are less similar to one another.

Similarity Measures:

Euclidean Distance if attributes are continuous.

Other Problem-specific Measures.

II-SDV, Nice

Datamining

Clustering with Weka – K-Means

II-SDV, Nice

Datamining

Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

Intracluster distances

are minimized

Intercluster distances

are maximized

II-SDV, Nice

Datamining

Clustering: Application 1

Market Segmentation:

Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Approach:

Collect different attributes of customers based on their geographical and lifestyle related information.

Find clusters of similar customers.

Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

II-SDV, Nice

Datamining

Clustering: Application 2

Document Clustering:Goal: To find groups of documents that are similar to each other

based on the important terms appearing in them.

Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

II-SDV, Nice

Datamining

Association Rule Discovery: Definition

Marketing and Sales Promotion:

Let the rule discovered be

{Bagels, … } --> {Potato Chips}

Potato Chips as consequent => Can be used to determine what should be done to boost its sales.

Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.

Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

II-SDV, Nice

Datamining

Association Rules with Weka – APriori

II-SDV, Nice

Datamining

Association Rule Discovery: Application 1

Supermarket shelf management.

Goal: To identify items that are bought together by sufficiently many customers.

Approach: Process the point-of-sale data collected with barcode scanners to find

dependencies among items.A classic rule --

If a customer buys diaper and milk, then he is very likely to buy beer.

So, don’t be surprised if you find six-packs stacked next to diapers!

II-SDV, Nice

Datamining

Association Rule Discovery: Application 2

Inventory Management:

Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.

Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

II-SDV, Nice

Datamining

Challenges of Data Mining

Scalability

Dimensionality

Complex and Heterogeneous Data

Data Quality

Data Ownership and Distribution

Privacy Preservation

Streaming Data

II-SDV, Nice

Datamining

38II-SDV, Nice

Learning Machines- Weka

- RapidMiner (formerly Yale : Yet Another Learning Environment)

- Hadoop Mahout

Background infrastructure & Framework

Set of Api and Libraries

Provides Algorithms for DataMining

Your Data

Your Application

39II-SDV, Nice

Learning Machines ImplementationWith Apache Mahout

Classification- Naives Bayes

- Markov Models

- Logistic Regression

- Random Forest

-Clustering- K-Means and Fuzzy K-Means

- Canopy

- Spectral Clustering

-Association- Latent Derichlet Allocation

Working with documents :

Mahout – Lucene integration

40II-SDV, Nice

Learning Machines Implementation

41II-SDV, Nice

References (Search Domain)-Lucene : http://lucene.apache.org

-Plateform Solr/Lucene : http://www.lucidworks.com

-Plateform OpenSearch : http://www.open-search-server.com

-Plateform Katta : http://katta.sourceforge.net

-Plateform ElasticSearch : http://www.elasticsearch.com

-Plateform http://www.lucidworks.com/

-Sphinx : http://sphinxsearch.com/

-http://nutch.apache.org/

-http://en.wikipedia.org/wiki/List_of_search_engines

-http://www.thesearchenginelist.com/

-www.cloudera.com

-www.mapr.com

-www.hortonworks.com

http://www.elasticsearch.com/

http://sphinxsearch.com/

http://en.wikipedia.org/wiki/List_of_search_engines

http://www.thesearchenginelist.com/

http://www.mapr.com/

42II-SDV, Nice

References (DataMining Domain)R : http://www.revolutionanalytics.com/

Weka : http://www.cs.waikato.ac.nz/ml/weka/

RapidMiner : http://rapidminer.com/

KNIME : http://www.knime.org/

Apache Mahout : http://mahout.apache.org/

43II-SDV, Nice