(ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Ton van Daelen, Sr. Director, Platform Product Management

[email protected]


DESCRIPTION

Accelrys Catalog is a new technology for searching protocol databases and usage logs. It uses the Apache Solr search engine to index protocol and usage information. As a Pro Client developer, you can quickly find example protocols that use a particular component; as a Web Port user, you can find popular protocols in a particular scientific domain; and as an administrator, you can search user protocols to verify compliance with IT policies or to prepare for a server migration or upgrade. In this talk we will discuss how to configure and administer the search index, the algorithms employed in the search, and how the different types of end users can get the most out of this technology.


Page 2: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Information on the roadmap and future software development efforts is intended to outline general product direction and should not be relied on in making a purchasing decision.

Page 3: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Outline

• Search use cases

• Deployment architecture

• Solr search index

• Search syntax

• Administration

• Demo

– Pro Client UI

– Web UI

– Admin UI

Page 4: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Accelrys Catalog Vision

Pro Client – Pers Productivity

• Search from Pro Client

– Examples that use the ‘Http Connector’ component

– PilotScript referencing ‘rsplit()’

– Protocols using MAO data

Web User

• Search from Web Port

– Recent protocols

– Popular protocols

– Protocols searching ‘Corporate Database’

Admin

• Administer

– Generate index

– Update frequency

• Search

– Canned reports: security issues, bad design, bad documentation

• Next steps

– Mail users

– Post report

[Diagram: Catalog, fed by the Xml log]

Page 5: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

The Size of the Challenge

• 10-100 Pro client users

• 50-1000 Web users

• 1-10 servers

• → 5000+ protocols to be managed

Page 6: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Admin Use Cases …

• Bad design practices. Find protocols that:

– have shortcuts as copies

– have saved checkpoints

– store passwords

– have components that are owner access only

– don’t have top level parameters (Web Port)

– have components with absolute file paths

• Bad documentation practices. Find protocols that:

– don’t have help text (or default help)

– have components with missing captions

Page 7: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

More Admin Use Cases

• General queries. Find protocols:

– with components that are deprecated (ad hoc / report)

– not run in n days

– not changed in n days

– by client type (pro client, web port, web service, Notebook, Isentris, …)

– with components with GUID x

– with SQL components with specific DSN
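
The "not run in n days" and "not changed in n days" cases map naturally onto Solr range queries with date math. The sketch below only builds query strings. It borrows the field names lastrun and modifieddate from the Catalog field list later in this talk, and it assumes both are indexed as date fields; the helper names are hypothetical.

```python
def not_run_in(days):
    """Query for items whose last run is at least `days` days old."""
    # Solr date math: NOW-<n>DAYS; an open lower bound is written as *.
    return f"lastrun:[* TO NOW-{days}DAYS]"


def not_changed_in(days):
    """Query for items last modified at least `days` days ago."""
    return f"modifieddate:[* TO NOW-{days}DAYS]"


def all_of(*clauses):
    """Require every clause to match."""
    return " AND ".join(clauses)


# Example: stale protocols that nobody has run or edited for 90 days.
stale = all_of(not_run_in(90), not_changed_in(90))
```

A range on lastrun presumably only matches items that have a lastrun value at all, so protocols that were never run would need a separate check such as runcount:0.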

Page 8: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Introduction to Text Searching

• Unstructured or minimally-structured searches

– Think “Google”

– Keyword-based, non-relational; wide range of user input

– Based on lookups using pre-built word (token) indexes
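
To make "pre-built word (token) indexes" concrete, here is a minimal inverted-index sketch. It is a toy illustration of the idea, not how Lucene or Solr store their indexes, and the protocol descriptions are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical protocol descriptions, for illustration only.
docs = {
    "p1": "Read MAO data from corporate database",
    "p2": "Cluster molecules by fingerprint",
    "p3": "Export MAO clusters to report",
}
index = build_index(docs)
hits = search(index, "MAO report")
```

The index is built once, so each search is a handful of dictionary lookups and set intersections rather than a scan over every document.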

Page 9: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Introduction to Text Searching (cont’d)

• Strategies to make searches more effective

– Stop word removal: and, the, by, for, of, …

– Stemming: started → start, clusters → cluster, etc.

– Synonym aliasing: oncology=cancer, MB=megabyte, etc. (supported but only minimally implemented; extensible)

– Language-specific document and query processing (support for Asian languages)
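
The first three strategies can be sketched as a small analysis function. This is a deliberately naive illustration: the word lists are taken from the examples on this slide, and the suffix-stripping rule is an assumption standing in for a real stemming algorithm.

```python
STOP_WORDS = {"and", "the", "by", "for", "of"}
SYNONYMS = {"oncology": "cancer", "mb": "megabyte"}


def naive_stem(token):
    """Strip a couple of common English suffixes (real stemmers do far more)."""
    for suffix in ("ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, drop stop words, map synonyms, then stem each token."""
    tokens = []
    for raw in text.lower().split():
        if raw in STOP_WORDS:
            continue
        token = SYNONYMS.get(raw, raw)
        tokens.append(naive_stem(token))
    return tokens


tokens = analyze("Started the clusters of oncology")
```

Applying the same function to both documents and queries is what lets a search for "cluster" match a protocol that mentions "clusters".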

Page 10: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Apache Solr

• Open source text search server

• Part of Apache Software Foundation

• Uses and extends Lucene Java search library

• Hosted by a web application server

• http://lucene.apache.org/solr/

Page 11: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Solr: Under the Hood…

• Schema

– XML specification of document fields and their types

– Specifies how fields are tokenized and processed for indexing

• Solr config file

– XML specification of query and result set processing rules

– E.g. field weights

• Optional auxiliary files

– Stop words, synonyms, protected words (unstemmed)

• Host application container

– For AEP this is Tomcat

Page 12: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Tokenization and Filtering

• Tokenization options in Solr

– Break on whitespace

– Break on all non-letter characters

– Break on case changes (for CamelCaseTokenization)

– Break on character set changes (alphanum/ideographic/katakana)

• Additional filters

– Lowercase filter: converts all characters to lowercase

– CJK bigram filter: outputs adjacent character pairs for Asian languages

– Stem filter: applies stemming rules (many language-specific variants)

• Field indexing and query processing use same tokenization

– Better search results may be obtained by using slightly different analysis for indexing versus querying

• See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
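
The tokenization options and filters above can be mimicked in a few lines of Python. This is only a sketch of the behaviour described on the slide, not Solr's implementation; all function names are hypothetical.

```python
import re


def split_non_letters(text):
    """Break on every run of non-letter characters."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]


def split_camel_case(token):
    """Break where a lowercase letter is followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def cjk_bigrams(text):
    """Emit overlapping pairs of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def tokenize_field(text):
    """Chain the steps: non-letter split, case-change split, lowercase."""
    tokens = []
    for chunk in split_non_letters(text):
        for part in split_camel_case(chunk):
            tokens.append(part.lower())
    return tokens


tokens = tokenize_field("HttpConnector xml-reader")
```

Splitting on case changes is what would allow a query for "connector" to find a component named HttpConnector.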

Page 13: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Customizing Solr

Page 14: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Mapping XMLDB to Solr Documents

• XML Database = Component/Protocol Database

• For each item in XMLDB, an indexing protocol

– reads the item from the database

– creates data record properties corresponding to Solr fields

– joins in statistics from usage log

– converts the data record to a JSON “document”

– POSTs the document to Apache/Tomcat/Solr via HTTP

• Weighting

– Protocol name and description have higher weight

– Proximity has higher weight
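
The indexing steps above can be sketched in Python. In AEP this work is done by an indexing protocol, so the code below is only an analogy. The endpoint URL, core name, and input dictionaries are assumptions; the field names come from the Catalog field list in this talk. The request is built but not sent.

```python
import json
import urllib.request

# Assumed endpoint; the real host, port, and core name depend on the installation.
SOLR_UPDATE_URL = "http://localhost:8080/solr/catalog/update"


def to_solr_document(item, usage):
    """Merge one XMLDB item with its usage statistics into a flat record."""
    stats = usage.get(item["path"], {})
    doc = {
        "name": item["name"],
        "path": item["path"],
        "type": item["type"],
        "author": item["author"],
        "uses": item.get("uses", []),
        "runcount": stats.get("runcount", 0),
        "lastrun": stats.get("lastrun"),
    }
    # Leave out fields with no value instead of sending nulls.
    return {key: value for key, value in doc.items() if value is not None}


def build_update_request(documents, url=SOLR_UPDATE_URL):
    """Wrap the documents in an HTTP POST with a JSON body."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical item and usage-log entry, for illustration only.
item = {"name": "MAO Report", "path": "Protocols/Demo/MAO Report",
        "type": "protocol", "author": "jdoe", "uses": ["Xml Reader"]}
usage = {"Protocols/Demo/MAO Report": {"runcount": 3,
                                       "lastrun": "2012-05-01T00:00:00Z"}}
doc = to_solr_document(item, usage)
request = build_update_request([doc])
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

Joining usage statistics by path is a simplification here; how the real indexing protocol matches log entries to XMLDB items is not described in the talk.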

Page 15: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Some Catalog Fields (defined in schema)

• name: protocol or component name

• path: location in XMLDB

• type: “component” or “protocol”

• parameters: names of parameters

• author: user who created protocol/component

• modifieddate: date protocol/component last changed

• runcount: number of times protocol has been run

• lastrun: date protocol was last run

• uses: list of components used by protocol

• alltext: composite field for keyword search

Page 16: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Administration

• Configure servers

• Specify update interval

• Manual rebuild

Page 17: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Configuring Accelrys Catalog

• Configuration (admin portal)

– AEP servers to index

– Indexing schedule

• Note

– Indexer runs as scheduled service

– Indexing takes ~1 to 3 minutes per 1000 XMLDB items

– Two index copies; users can continue search while index is rebuilt

– Tomcat and Solr automatically installed and launched with Apache
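
The indexing rate quoted above gives a quick way to estimate rebuild time when choosing an update interval. The helper below is plain arithmetic on the slide's figure of about 1 to 3 minutes per 1000 XMLDB items; the function name is hypothetical.

```python
def indexing_minutes(item_count, min_per_1000=1.0, max_per_1000=3.0):
    """Return the (low, high) estimated indexing time in minutes."""
    thousands = item_count / 1000
    return thousands * min_per_1000, thousands * max_per_1000


# The 5000+ protocols mentioned earlier in the talk, as a lower bound on items.
low, high = indexing_minutes(5000)
```

For 5000 items this gives roughly 5 to 15 minutes. XMLDB items include components as well as protocols, so the real item count may be higher than the protocol count.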

Page 18: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Limitations

• Usage info can be incorrect because log file doesn’t store full protocol path (“Protocol 1” !)

• No indexing at runtime – it can take a day before index is updated

Page 19: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Demo

Page 21: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Example Queries

• MAO type:"Component"

– Any components referencing ‘MAO’

• uses:"Xml Reader" -author:Accelrys

– Components/protocols that use an Xml Reader and are not authored by Accelrys

• lastrun:[* TO NOW-6MONTH]

– Last run at least six months ago

• runcount:0

– Never been run
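
Queries like these contain quotes, spaces, and brackets, so they have to be URL-encoded if they are sent to Solr over HTTP from a script. The sketch below builds a request URL using Solr's standard q, rows, fl, and wt parameters. The base URL and core name are assumptions, and whether a given installation exposes the select handler directly is not stated in the talk.

```python
from urllib.parse import urlencode

# Assumed endpoint; adjust host, port, and core name for a real installation.
SOLR_SELECT_URL = "http://localhost:8080/solr/catalog/select"


def build_search_url(query, rows=10, fields=("name", "path", "type"),
                     base=SOLR_SELECT_URL):
    """Return a select URL with the query string safely encoded."""
    params = {
        "q": query,                # the search expression, as typed
        "rows": rows,              # maximum number of results
        "fl": ",".join(fields),    # fields to return for each hit
        "wt": "json",              # response format
    }
    return f"{base}?{urlencode(params)}"


url = build_search_url('uses:"Xml Reader" -author:Accelrys')
```

Keeping the raw query separate from the encoding step means the same strings can be pasted unchanged into the search box of the Pro Client or Web UI.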

Page 26: (ATS4-PLAT05) Accelrys Catalog: A Search Index for AEP

Summary

• Accelrys Catalog is a powerful search technology built into AEP

• Become a beta tester (beta-2)

• Plan for 9.0 upgrade now

• See also: (ATS4-PLAT10) Planning your deployment for a 64 bit world