View
443
Download
0
Tags:
Embed Size (px)
Citation preview
Patrick BeaucampFounder of the Vanilla Project
Mail : [email protected]
Search and Data Mining Open Source Platforms
II-SDV, Nice 15th April 2014
1II-SDV, Nice
Presentation Agenda
Open Source Search Engine & Search PlatformSome interesting Platforms
Features expected for Search Engines
Features expected for Search Platforms (Interface)
2II-SDV, Nice
Open Source Data Mining & Machine Learning PlatformDatamining subject & Usage type
Some interesting Platforms
Hadoop & Data Mining : the future of Learning Machin
If you don’t find it, it doesn’t exist… so, may be an engine will find it for you ?
Search & Data Mining Together ?
Search Interface
3II-SDV, Nice
Enhanced Search Interface
Search & Data Mining Together ?
4II-SDV, Nice
Data Mining : search for you !
• Example : correlation provides you with information about how data
are linked together
5II-SDV, Nice
Confusion Search Engine & Platforms-Search platform per domain :
- News (yahoo)
- Job (jobserve)
- Real Estate
- Human Relation
- Craig List
- Patent industry
- …
A business case : Facebook- Basic Search interface
- Proposal for connection
- Advanced graphical search
6II-SDV, Nice
Search EnginesSearch Engine Providers in Open Source - Lucene
- Lucene is a java based indexing and search API
- Solr/Lucene is the leading server extension of Lucene. 2 companies,
LucidWorks and ElasticSearch, provides packaging and extension of top of
Lucene and Solr
-Search Landscape
-• Lucene : http://lucene.apache.org
-• Solr/Lucene : http://lucene.apache.org/solr/
-• Plateform OpenSearch : http://www.open-search-server.com
-• Plateform Katta : http://katta.sourceforge.net
-• Plateform LucidWorks : http://www.lucidworks.com
-• Plateform ElasticSearch : http://www.elasticsearch.com
-• Sphinx : http://sphinxsearch.com/
7II-SDV, Nice
Before Search EnginesBefore indexing your document base, you need to access it !
-Apache Nutch is a highly extensible and scalable open source web crawler
software project.
-Reference : http://nutch.apache.org/
8II-SDV, Nice
Search Engines : focusSphinx : inside Database like MariaDb (Mysql)
Lucene : Retrieval Software library
Use existing Search Infrastructure like Solr/Lucene (Vanilla certified)
http://www.lucidworks.com/ or http://www.elasticsearch.org/
9II-SDV, Nice
Search Engines Basics (1/2)-Synonyms
- It is possible to extend the search to synonyms if they are listed in a
glossary. For example, to find articles containing synonyms to “TV” when
you search with the word TV.
-Metadata
- Dictionary for list of searchable keywords
10II-SDV, Nice
Search Engines Basics (2/2)-Reserved Words, Protected Words
- Indexing usually uses stemming, which is to reduce words to their root, for
example "Developp" to find items also contain the word when trying to
develop the word development. However, sometimes there are adverse
lemmatizations, indexing under one lemma two words that have no
relation. It is possible to prevent the stemming of words by listing them in
a file protwords.txt.
-StopWords
- The stopwords are meaningless words. A word considered insignificant
will be ignored. Note that some words are insignificant in some contexts,
others have homonyms signifiers. For example, can refer to a summer
season (rather mean) or past participle of the verb to be (relatively
insignificant). Stopwords.txt the file looks like this
11II-SDV, Nice
Search Engines current Limits-Multi Language support (this is where commercial search engine have still more
to bring to customer), even there is now Asian type language support (Hindi,
Thai, Chineese, …)
-Elision :
- Elisions are a feature of the French, which consist of a contraction of the
words like or when they are followed by a vowel. Example: + aircraft gives
the aircraft. It is possible to remove these elisions using a lexicon.
-Full text search interface (language with search engine)
-SubQuery support : now its better with Solr 4.7
-Scalability (this is where Solr is taking technical advantage)
12II-SDV, Nice
Hadoop SearchPlatform for BigData-Cloudera with Solr/Cloud (Solr/Lucene)
-Mapr with ElasticSearch (Lucene code)
-HortonWorks with LucidWorks (Solr/Lucene)
13II-SDV, Nice
Search Interface : expectation (1/3)-Advance indexing and querying tools.
-Provides distributed searching capabilities to prevent bottleneck for a particular
server.
-Provides document excerpts (snippets) generation that provides summary of the
search
-Relevance ranking display extracts from the documents based on the query.
14II-SDV, Nice
Search Interface : expectation (2/3)-Duplicate document detection, including fuzzy near duplicates
-Rich Document Parsing and Indexing without using Database Indexing.
-Ranking control carry out a targeted ranking of individual documents.
-Search Grouping by Type / Tag / Categories (General page, documents, images)
15II-SDV, Nice
-Multi Criteria support
-Ranking
-Natural language support
-Apps Support (Android, Ipad)
Search Interface : expectation (3/3)
16II-SDV, Nice
Migration to Open Source-Only few project, but it’s growing
SAME FUNCTIONALITY, MUCH LOWER COSTS
-http://www.searchtechnologies.com/fast-to-solr-migration.html
-http://fr.slideshare.net/janhoy/migrating-fast-to-solr
17II-SDV, Nice
Datamining-Work on your data with Dataming component such as
R : http://www.revolutionanalytics.com/
Weka : http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner : http://rapidminer.com/
KNIME : http://www.knime.org/
Apache Mahout : http://mahout.apache.org/
Many DefinitionsNon-trivial extraction of implicit, previously unknown and potentially useful
information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Datamining
II-SDV, Nice
What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
II-SDV, Nice
Datamining
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Machine Learning/
Pattern
Recognition
Statistics/
AI
Data Mining
Database
systems
II-SDV, Nice
Datamining
Data Mining Tasks
Description Methods
Find human-interpretable patterns that describe the data.
Prediction Methods
Use some variables to predict unknown or future values of other variables.
II-SDV, Nice
Datamining
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
II-SDV, Nice
Datamining
Classification: Definition
Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
II-SDV, Nice
Datamining
Classification with Weka – J48
II-SDV, Nice
Datamining
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy}decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
II-SDV, NiceII-SDV, Nice
Datamining
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on its account-holder as attributes.
When does a customer buy, what does he buy, how often he pays on time, etc
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
II-SDV, Nice
Datamining
Classification: Application 3
Customer Attrition/Churn:Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:Use detailed record of transactions with each of the past and present
customers, to find attributes.How often the customer calls, where he calls, what time-of-the day he calls most,
his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Datamining
II-SDV, Nice
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
II-SDV, Nice
Datamining
Clustering with Weka – K-Means
II-SDV, Nice
Datamining
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances
are minimized
Intercluster distances
are maximized
II-SDV, Nice
Datamining
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
II-SDV, Nice
Datamining
Clustering: Application 2
Document Clustering:Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
II-SDV, Nice
Datamining
Association Rule Discovery: Definition
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
II-SDV, Nice
Datamining
Association Rules with Weka – APriori
II-SDV, Nice
Datamining
Association Rule Discovery: Application 1
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find
dependencies among items.A classic rule --
If a customer buys diaper and milk, then he is very likely to buy beer.
So, don’t be surprised if you find six-packs stacked next to diapers!
II-SDV, Nice
Datamining
Association Rule Discovery: Application 2
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
II-SDV, Nice
Datamining
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
II-SDV, Nice
Datamining
38II-SDV, Nice
Learning Machines- Weka
- RapidMiner (formerly Yale : Yet Another Learning Environment)
- Hadoop Mahout
Background infrastructure & Framework
Set of Api and Libraries
Provides Algorithms for DataMining
Your Data
Your Application
39II-SDV, Nice
Learning Machines ImplementationWith Apache Mahout
Classification- Naives Bayes
- Markov Models
- Logistic Regression
- Random Forest
-Clustering- K-Means and Fuzzy K-Means
- Canopy
- Spectral Clustering
-Association- Latent Derichlet Allocation
Working with documents :
Mahout – Lucene integration
40II-SDV, Nice
Learning Machines Implementation
41II-SDV, Nice
References (Search Domain)-Lucene : http://lucene.apache.org
-Plateform Solr/Lucene : http://www.lucidworks.com
-Plateform OpenSearch : http://www.open-search-server.com
-Plateform Katta : http://katta.sourceforge.net
-Plateform ElasticSearch : http://www.elasticsearch.com
-Plateform http://www.lucidworks.com/
-Sphinx : http://sphinxsearch.com/
-http://nutch.apache.org/
-http://en.wikipedia.org/wiki/List_of_search_engines
-http://www.thesearchenginelist.com/
-www.cloudera.com
-www.mapr.com
-www.hortonworks.com
42II-SDV, Nice
References (DataMining Domain)R : http://www.revolutionanalytics.com/
Weka : http://www.cs.waikato.ac.nz/ml/weka/
RapidMiner : http://rapidminer.com/
KNIME : http://www.knime.org/
Apache Mahout : http://mahout.apache.org/
43II-SDV, Nice