By: Ramez Ibrahim AL Fayez

Solr Architecture


Agenda
• Introductions
• What is Solr?
• Main Solr Features and Attributes
• Content, Query, Facet, API, Scalability
• Interface and useful commands
• Live Demo

Introduction
• Search has become mission critical for most enterprises
  ◦ Intranet
  ◦ Web presence
  ◦ E-commerce
• Exponential growth of data
• Cost of not finding information
  ◦ Knowledge (sharing)
  ◦ Time
  ◦ Money
• Information black hole

What is Solr?
Official definition:

  “Solr is an open source enterprise search platform based on the Lucene Java search library, with an HTTP interface using XML, JSON or other formats. It provides hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Apache Tomcat.”

• http://lucene.apache.org/solr

What is Solr?
• In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability to the company website.
• Open-source search engine with no license fees (released under the Apache License 2.0)
• Built on top of the Apache Lucene library, adding enterprise search server features and capabilities
• Web-based application that processes requests and returns responses via HTTP and APIs

Why choose Solr?
• Customizable
• High-quality and easily modifiable relevancy
• Very fast query and indexing performance
• Open source software is free
• Highly flexible data processing/transformation
• Easy scalability and great performance
• Modern solution architecture based on XML and Java
• Well integrated with the ecosystem around Big Data, such as Hadoop (also Nutch, Tika)

Solr’s Main Features
• Full text search
• Field search
• Number and date searching
• Facets
• Spelling assistance – “Did you mean…?”
• Related hits
• Query completion
• Admin GUI
• Data Import Handler
  ◦ Index databases, mail, RSS, XML, etc.
• Rich document support
  ◦ PDF, MS Office, images, etc.
• Replication for high query volume
• Distributed search for large indexes
  ◦ Production systems with 1B+ documents
• Very extensible and customizable
  ◦ Embedded in commercial search products from LucidWorks, DataStax, Cloudera, Hortonworks, Amazon CloudSearch and Riak

Main Attributes
• Index(ing)
  ◦ Inverted index
• Document
• Field
  ◦ Stored and/or indexed fields
• Analysis
  ◦ Tokenization
  ◦ Filters
  ◦ Terms
• Query
  ◦ Filter
  ◦ Function
  ◦ Facet
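To make the index, document, field, and analysis terms above concrete, here is a minimal Python sketch of an inverted index. It is an illustration only, not Solr or Lucene code: the `analyze` function, the stopword list, and the sample documents are all assumptions made up for this example. Analysis is reduced to a tokenizer plus two filters (lowercasing and stopword removal) that produce terms, and the index maps each (field, term) pair to the documents containing it.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption, not Solr's default list).
STOPWORDS = {"a", "an", "the", "is", "of"}


def analyze(text):
    """Tokenize on non-alphanumeric characters, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


def build_inverted_index(docs):
    """Map each (field, term) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for term in analyze(value):
                index[(field, term)].add(doc_id)
    return index


# Hypothetical documents, each with two fields.
docs = {
    1: {"title": "Solr in Action", "body": "Solr is a search server"},
    2: {"title": "Lucene Basics", "body": "The Lucene search library"},
}
index = build_inverted_index(docs)
print(sorted(index[("body", "search")]))  # [1, 2]
```

A field search such as `body:search` then becomes a single dictionary lookup instead of a scan over every document, which is the reason search engines index this way.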

Content
• Out-of-the-box support for JSON
• Solr handles CSV, XML, and rich content out of the box, without having to install plugins

Indexing and Ranking
• Solr uses an inverted index
• For ranking, Solr uses TF-IDF and Similarity
• Similarity is a combination of the Boolean model (BM) and the Vector Space Model (VSM)
• In addition, users can re-rank the results of a query
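The TF-IDF idea can be sketched in a few lines of Python. This is a textbook variant chosen for readability, not the exact formula used by Solr's Similarity classes; the function name, the `+ 1.0` smoothing, and the sample documents are assumptions for illustration. A term contributes more to a document's score the more often it occurs in that document (TF) and the fewer documents contain it overall (IDF).

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document by summing tf * idf over the query terms."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in term_counts.values() if term in c)
            if df == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log(n_docs / df) + 1.0
            score += counts[term] * idf
        scores[doc_id] = score
    return scores


# Hypothetical, already-analyzed documents.
docs = {
    "d1": ["solr", "search", "server"],
    "d2": ["lucene", "search", "library"],
    "d3": ["hadoop", "nutch", "tika"],
}
scores = tf_idf_scores(["solr", "search"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Here "solr" is rarer than "search", so it carries a higher IDF weight; d1 matches both terms and ranks first, while d3 matches neither and scores zero.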

Query
• Common parameters
  ◦ start, rows, fl, fq, sort

?q=*:*&start=0&rows=10&fl=title&fq=collection:popular&sort=title asc

• Slightly more advanced
  ◦ &facet
  ◦ &qf

&qf=keyword^4 content1^8 content2^3 content3^2 stem1^1.5 stem2^1.2 stem3^0.5
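The query string above contains characters that need URL encoding (the space in `title asc`, the colons in `*:*`), so in practice it is usually assembled by a client library. A minimal Python sketch follows; the host `http://localhost:8983/solr`, the collection name `mycollection`, and the helper `build_select_url` are assumptions for illustration, and the code only builds the URL without contacting a server.

```python
from urllib.parse import urlencode, parse_qs


def build_select_url(base_url, collection, params):
    """Build a URL for a collection's /select handler with encoded parameters."""
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# The common parameters from the slide.
params = {
    "q": "*:*",                  # match all documents
    "start": 0,                  # offset of the first result (paging)
    "rows": 10,                  # number of results to return
    "fl": "title",               # fields to include in each result
    "fq": "collection:popular",  # filter query, restricts the result set
    "sort": "title asc",         # sort order
}

# Hypothetical host and collection name.
url = build_select_url("http://localhost:8983/solr", "mycollection", params)
print(url)
```

The `qf` parameter with its `field^boost` weights is read by the DisMax and eDisMax query parsers, so a request that relies on it would typically also select one of those parsers.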

Facet
“Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely) by any value in any field.”

• Navigation/discovery technique
• Tally of docs for each distinct field value
• Parameters
  ◦ &facet=true
  ◦ &facet.field=category
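The "tally of docs for each distinct field value" can be sketched directly in Python. This illustrates the concept only, not how Solr computes facets internally; the function name and the sample documents are assumptions. It mirrors what `facet=true&facet.field=category` asks for: one count per distinct value of `category` among the matching documents.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count matching documents per distinct value of `field`, highest first."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field are not counted
        # Treat single-valued and multi-valued fields uniformly.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts.most_common()


# Hypothetical result set for some query.
results = [
    {"id": 1, "category": "books"},
    {"id": 2, "category": "electronics"},
    {"id": 3, "category": "books"},
    {"id": 4},  # no category field
]
print(facet_counts(results, "category"))  # [('books', 2), ('electronics', 1)]
```

A user interface would show these counts as clickable links; clicking "books" adds a filter such as `fq=category:books` to the next request, which is the drill-down described in the quote.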

API
• REST API for adding field types and dynamic fields
• Managing request handlers through the API
• Improved APIs for managing collections
• Implicit registration of the replication, Real Time Get, and administration handlers
• Out-of-the-box support for JSON
• Solr handles CSV, XML, and rich content out of the box, without having to install plugins
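As a rough illustration of the REST API for field types and dynamic fields, the Python sketch below builds the kind of JSON body that Solr's Schema API accepts via the `add-field-type` and `add-dynamic-field` commands. The field type name `text_simple`, the dynamic field pattern `*_txt_simple`, and the chosen analyzer are hypothetical values for this example; the code only constructs and prints the payload, it does not send it.

```python
import json

# Hypothetical field type: standard tokenizer plus a lowercase filter.
schema_commands = {
    "add-field-type": {
        "name": "text_simple",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    },
    # Hypothetical dynamic field: any field ending in _txt_simple uses the type above.
    "add-dynamic-field": {
        "name": "*_txt_simple",
        "type": "text_simple",
        "indexed": True,
        "stored": True,
    },
}

body = json.dumps(schema_commands, indent=2)
print(body)
```

Such a body would be sent as an HTTP POST with content type `application/json` to the collection's `/schema` endpoint, assuming the collection uses a managed schema.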

Scalability
• Architecture goals:
  ◦ More queries per second (qps)
  ◦ Faster query execution
  ◦ Bigger indexes
  ◦ Faster indexing
• Scaling options
  ◦ Multicore
  ◦ Replication
  ◦ Sharding
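Sharding can be pictured with a short Python sketch: documents are assigned to shards by hashing their id, and a distributed query merges each shard's sorted hits into one global top-k list. This is a conceptual illustration under simplifying assumptions, not Solr's actual document router or merge logic; the function names and the CRC32 hash are choices made for this example.

```python
import heapq
import zlib
from itertools import islice


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id (deterministic across runs)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distributed_top_k(shard_results, k):
    """Merge per-shard hit lists (each sorted by descending score) into a top-k."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


# Indexing side: route hypothetical documents to 2 shards.
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", shard_for(doc_id, 2))

# Query side: each shard returns its own (score, id) hits, best first.
shard_a = [(0.9, "a"), (0.4, "b")]
shard_b = [(0.7, "c"), (0.1, "d")]
top = distributed_top_k([shard_a, shard_b], 3)
print(top)  # [(0.9, 'a'), (0.7, 'c'), (0.4, 'b')]
```

Sharding addresses the "bigger indexes" and "faster indexing" goals because each node holds only part of the data, while replication addresses "more queries per second" by serving copies of the same shard from several nodes.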

Useful commands
• ./bin/solr {start|stop}
• ./bin/solr create -c <COLL_NAME>
• ./bin/post -c <COLL_NAME> <Files to index>
• ./bin/solr delete -c <COLL_NAME>

Main  Interface  

Finish  !