Patrick Beaucamp
Founder of the Vanilla, AklaBox & Data4Citizen Projects
Mail : [email protected]
Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment
II-PIC, Bangalore, 2nd November 2017
Presentation Agenda
Open Source Search Engine & Search Platform
Features expected for Search Platforms (Interface)
Open Source Platform at French Ministry
Project Context
Platform Architecture
WebSite Powered by a Search engine
Personal Experience of Search – Search Ideas
Do you know Solr ?
Part 1 – Search concepts and Ideas
« Sharing and awakening your mind »
Searching … and finding !
How many times per day do you Google ? (search,
maps, translate …)
Tribute to Open Source at II-PIC … thanks Christoph !
Search is the first Step : collecting information
Searching ???
Using Search Engines (and being influenced by SEO)
Search is a subject in itself :
Register to News Feed and Alerts : « Push Mode »
« Artificial Intelligence » in practice : an algorithm is working
for you : Facebook suggestions, Gmail reminders …
« Minority Report » is already here !
User Behavior Analysis for Sales & Marketing Team, Web Design Team
WebSite as a Showcase :
Which menus & sub-menus are visited ?
Where are the dead branches ?
No real « Search Approach »
Before
Browsing behavior
Browsing behavior
User Behavior Analysis for Sales & Marketing Team, Web Design Team
WebSite as a Search Interface
What are people looking for ?
How are they searching?
Now
Review your SEO
Searching … and finding !
Searching … and finding !
We all became private investigators one day or another
Searching … and finding !
Searching … and finding !
Different search engines lead to different results
Searching … and finding !
Different search engines by country
Searching … and finding !
Funny word : SEO … it's more « how to be found on
the Internet » … and you need to pay for it !
Searching … and finding !
My personal experience
I tried to find a person for 23 years, roughly from 1993
to 2016
From 1993 to 1998 : no search engine available …
only private investigators ?
From 1999 to 2015 : regular searches – no results
I found this person on Facebook, not on Google
From a browser : « f + tab » … « g + tab », « y + tab » …
Some years : no search, other years : multiple searches
Searching … and finding !
The person I was looking for published on Facebook using
his/her real name – it's his/her decision to be visible or not
Where do we stand with the « Right to be Forgotten » ?
Searching … and finding !
Companies like Facebook have tons of data : they need to
provide search infrastructure (indexing + search interface)
I was lucky to try Facebook's search interface
Searching … and finding !
Discovery of the source of a cholera outbreak – 1854 (John Snow)
http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
Searching … and finding !
Bicycle accidents in the street : who is taking care of traffic management ?
Example in Boston : http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html
Open Data
Searching … and finding !
LION – 2016 (Garth Davis)
Mistake 1 : Ganesh Tanei – Mistake 2 : Saroo
« Internal » Searching Strategy
It’s easy to add a « search » feature
In WebSite (Drupal Hosting)
Companies don't want to live through
this again !
You need a Strategy for your internal data : they are your digital assets
Part 2 – Search Components
The « Recipe »
OpenSource LandScape
Crawling
Indexing
Storing
WebSite
Reference
WebSite
Accessibility
Update Management
Search Interface
Result Visualization
Auto Completion
Natural Language
Voice Recognition
Maps
Ads
Unstructured data
Access Management
Search Platform Objectives
Constraints : being able to reach WebSites and content :
Internal WebSites (Intranet) & External WebSites
Internal Document Repositories
Being able to index WebSite content (and page updates)
Being able to store unstructured data
Crawling
Storing
Indexing
Search Platform Objectives
Provide usable Search results (auto classification,
visualization)
Don’t Forget why and what you search :
• You search in existing documents
• You need visualization tools
• It's not a crystal ball : search reflects the past
Provide usable Search interfaces (semantic search, multi
language search …)
Search Interface
Result Visualization
Before indexing your document base, you need to access it !
Apache Nutch is a highly extensible and scalable open source web crawler
software project.
Reference : http://nutch.apache.org/
Nutch
Solr
• What is Solr
– Indexing and Search Engine
• Promoted by the Apache Foundation
• Built on top of Apache Lucene (Java search library)
– Major engine characteristics
• Scalable, fault tolerant, distributed indexing process, dynamic
workload balancing, centralized configuration
– Technical environment
• Java
• Embedded Jetty server for platform administration
Solr
Main characteristics
Admin Interface
Flexible and scalable Configuration
Modular
Multiple index management with a single instance
Solr
Main characteristics
Standard communication interfaces (html, xml, json)
Configuration can be done with or without schema
Real-time indexing
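The JSON interface can be illustrated with a minimal sketch, assuming a local Solr instance on the default port and a collection named "ministry" (both hypothetical, not taken from the project). It only builds the HTTP request that would add documents through Solr's update handler; it does not send it.

```python
import json
import urllib.request

# Hypothetical host and collection name, used for illustration only.
SOLR_URL = "http://localhost:8983/solr/ministry"


def build_update_request(docs, base_url=SOLR_URL):
    """Build (but do not send) a POST request that adds documents to Solr."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/update?commit=true",  # commit=true makes the documents searchable at once
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [{"id": "page-1", "title": "Air quality report"}]
request = build_update_request(docs)
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) would require a running Solr server, so it is left out of the sketch.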
Solr
Main characteristics
Customizable Full Text analysis
Rich document indexing (using Tika)
Solr
Main characteristics
Search by facet and filters
Term suggestion and spelling correction
Geospatial Search
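Facets and filters are passed to Solr as plain query parameters. The sketch below builds such a search URL; the host, the collection name "ministry" and the field name "doc_type" are assumptions made for the example, not details of the project.

```python
from urllib.parse import urlencode


def build_search_url(base_url, text, facet_field, filters=()):
    """Build a Solr /select URL with one facet field and optional filter queries."""
    params = [
        ("q", text),                   # main full-text query
        ("wt", "json"),                # ask for a JSON response
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # count results per value of this field
    ]
    # Each filter query (fq) narrows the result set without changing the ranking.
    params += [("fq", f) for f in filters]
    return f"{base_url}/select?{urlencode(params)}"


url = build_search_url(
    "http://localhost:8983/solr/ministry",  # hypothetical host and collection
    "air quality",
    facet_field="doc_type",                 # hypothetical field name
    filters=['doc_type:"report"'],
)
print(url)
```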
Solr
Solr behavior
-Synonyms
- It is possible to extend the search to synonyms if they are listed in a
glossary. For example, a search for the word "TV" also finds articles
containing synonyms of "TV", such as "television".
-Metadata
- Dictionary listing the searchable keywords
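Synonym expansion can be sketched in a few lines. This is a simplified illustration and not Solr's own implementation; the glossary content and the function names are hypothetical.

```python
def build_synonym_index(groups):
    """Map every term to the full set of terms in its synonym group."""
    index = {}
    for group in groups:
        terms = {term.lower() for term in group}
        for term in terms:
            index.setdefault(term, set()).update(terms)
    return index


def expand_query(query, index):
    """Return, for each query word, the sorted list of words to search for."""
    return [sorted(index.get(word, {word})) for word in query.lower().split()]


# Hypothetical glossary: each inner list is one group of equivalent terms.
glossary = [["TV", "television", "telly"]]
synonyms = build_synonym_index(glossary)
print(expand_query("TV repair", synonyms))
```

Each query word is replaced by its whole synonym group, so a document mentioning only "television" still matches a search for "TV".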
Search Engine Basics (1/2)
-Reserved Words, Protected Words
- Indexing usually uses stemming, which reduces words to their root, so that
a search for "development" also finds items containing "develop" or
"developing". However, stemming sometimes goes wrong and indexes two
unrelated words under the same root. It is possible to prevent the stemming
of specific words by listing them in the file protwords.txt.
-StopWords
- Stopwords are words that carry little meaning. A word considered insignificant
is ignored. Note that some words are insignificant only in some contexts, and
others have meaningful homonyms. For example, the French word "été" can refer
to the summer season (rather meaningful) or to the past participle of the verb
"être", to be (relatively insignificant). Stopwords are listed in the file
stopwords.txt.
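The interplay of stopwords, stemming and protected words can be sketched as a tiny analysis chain. The suffix-stripping stemmer below is a deliberately naive stand-in for a real stemmer, and the word lists are hypothetical examples of what stopwords.txt and protwords.txt might contain.

```python
STOPWORDS = {"the", "a", "of", "and"}  # hypothetical stopwords.txt content
PROTECTED = {"news"}                   # hypothetical protwords.txt content
SUFFIXES = ("ment", "ing", "s")        # naive suffix list, for illustration only


def stem(word):
    """Strip one known suffix unless the word is protected."""
    if word in PROTECTED:
        return word
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOPWORDS]


print(analyze("The development of developing news"))
```

Without the protected-word list, "news" would be stemmed to "new" and indexed together with an unrelated word, which is the kind of bad stemming the protected list prevents.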
Search Engine Basics (2/2)
-Multi-language support (this is where commercial search engines still have more
to offer customers), even though Asian languages are now supported (Hindi,
Thai, Chinese, …)
-Elision :
- Elisions are a feature of French : words like "le" or "de" are contracted
when they are followed by a vowel. Example : "le" + "avion" gives
"l'avion". It is possible to remove these elisions using a lexicon.
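Elision removal can be sketched with a small lexicon of elided forms and a regular expression. The lexicon below is an illustrative assumption, not the list used in the project.

```python
import re

# Hypothetical lexicon of French elided forms (l', d', j', qu', ...).
ELIDED_FORMS = ("l", "d", "j", "m", "t", "s", "n", "c", "qu")

# Match an elided form at the start of a word, followed by an apostrophe
# (straight or typographic) and at least one more word character.
_ELISION = re.compile(
    r"\b(?:%s)['\u2019](?=\w)" % "|".join(ELIDED_FORMS),
    re.IGNORECASE,
)


def strip_elisions(text):
    """Remove leading elided articles and pronouns so that l'avion indexes as avion."""
    return _ELISION.sub("", text)


print(strip_elisions("l'avion d'Air France"))
```

The word-boundary check keeps words such as "aujourd'hui" intact, because the "d" there is not at the start of a word.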
-Limits solved over the past 3 years
• Full text search interface (language with search engine)
• SubQuery support : now OK, starting with Solr 4.7 (we are at v6)
• Scalability (this is where Solr is taking technical advantage)
Search Engine Current Limits
-Advanced indexing and querying tools.
-Distributed searching capabilities, to prevent a bottleneck on any particular
server.
-Generation of document excerpts (snippets) that summarize each search
result.
-Relevance ranking, with extracts from the documents displayed based on the query.
Search Interface expectation (1/3)
-Duplicate document detection, including fuzzy near duplicates
-Rich Document Parsing and Indexing without using Database Indexing.
-Ranking control, to carry out a targeted ranking of individual documents.
-Search Grouping by Type / Tag / Categories (General page, documents, images)
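Fuzzy near-duplicate detection is commonly approached by comparing overlapping word sequences (shingles). The sketch below uses Jaccard similarity over three-word shingles; the threshold value and the sample sentences are arbitrary assumptions, and this is not presented as the method used by Solr or by the project.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of elements common to both sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_near_duplicate(text_a, text_b, threshold=0.5):
    """Flag two texts whose shingle sets overlap at or above the threshold."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


page_a = "the ministry publishes a new report on air quality"
page_b = "the ministry publishes a new report on water quality"
page_c = "budget figures for next year are available"
print(is_near_duplicate(page_a, page_b), is_near_duplicate(page_a, page_c))
```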
Search Interface expectation (2/3)
-Multi Criteria support
-Ranking
-Natural language support
-Apps Support (Android, iPad)
Search Interface expectation (3/3)
Part 3 – A Real Project
Project at Ministry
Initial decision and guidelines from Ministry
New WebSite will be done using Drupal CMS 8.2
WebSite should be powered by a « Google-like Search Toolbar »
WebSite – Infrastructure – should connect with multiple other
WebSites
All Infra (Software) must be Open Source components
Project at Ministry
http://www.developpement-durable.gouv.fr/
https://www.ecologique-solidaire.gouv.fr/
Project at Ministry
http://www.developpement-durable.gouv.fr/
Project at Ministry - Architecture
Project at Ministry - Architecture
Project at Ministry - Technical
Project Steps
Nutch crawler for various WebSites
• Facebook, LinkedIn, Twitter, Youtube …
• Internal WebSite, Previous WebSite
Drupal Forms for Metadata & indexing
• Specific Forms for different kind of documents
• Drupal CMS process to add new content
Drupal 8 Module for Solr : custom search, monitoring, reporting
• The existing Drupal Solr module is limited to a single Drupal instance
• Not possible to use Solr Admin interface
Project at Ministry - Technical
Additional PHP libraries
Curl : Communication Drupal-Solr (http-get http-post & attached file)
Ssh2 : server administration command
Zookeeper : Communication Drupal-Zookeeper
MemCached : Communication Drupal-Memcached
Solarium : Communication Drupal-Solr (abstraction layer)
GoogleApi : YouTube content indexing
Paragraph : News and Content edition
Piwik : Statistics (like Google Analytics)
Project at Ministry – Admin Interface
Drupal 8 add-on to set up the global infrastructure (Zookeeper, Solr)
Project at Ministry – Admin Interface
Drupal 8 add-on to monitor the global infrastructure - Statistics
Project at Ministry - Validation
Project Validation & Deployment
No problems with Zookeeper, Solr, Nutch
Stress tests for the global platform : initial slowdown with 10 000
simultaneous connections
Sub-Project : Addressing the Single Point of Failure
Solution : Problems with Drupal & MySql -> MemCached
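The idea of putting a cache in front of Drupal and MySQL can be sketched with the cache-aside pattern. In this illustration an in-process dictionary stands in for Memcached, and the loader function, key and expiry time are hypothetical; the project's actual caching configuration is not described in the slides.

```python
import time


class TTLCache:
    """Minimal cache-aside helper; a dict stands in for a Memcached server."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def get_or_load(self, key, loader):
        """Return the cached value, calling the loader only on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


database_calls = []


def load_page_from_database(key):
    """Hypothetical slow loader standing in for a Drupal/MySQL page build."""
    database_calls.append(key)
    return f"<html>page {key}</html>"


cache = TTLCache()
first = cache.get_or_load("home", load_page_from_database)
second = cache.get_or_load("home", load_page_from_database)
print(len(database_calls))
```

The second request is served from the cache, so the database is queried only once for the same page within the expiry window.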
Project at Ministry - Next
Next Steps
Review of WebSite content … new Ministry
New Content to be indexed :
• Other WebSite and Social Content
• New set of documents to be added in the repository