Microsoft Confidential 1
ALM Search
Sunita Shrivastava
6/20/2014
ALM Search
• Start with Code Search, but eventually support search for other artefacts
Agenda
• Discuss the current architecture and concerns
• Share the investigations
• Share the learnings
• Get feedback on open design issues
Indexing Engine Choices
• BING and Elastic Search
  • Our requirements: Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
  • Indexing: support for continuous indexing, performant, scale-out feasibility, real-timeness, different schemas
  • Our evaluation is shared at: https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1
• ES observations so far in the context of Code Search
  • Schema-less
    • Multiple artefacts can be stored in the same index
    • Can deal with changes in the data schema of the artefact
  • The main value add of ES over Lucene is aggregation!
    • If you need aggregation of search across different indexes, aim to use the aggregator of ES!
    • This makes it likely that sharing a large ES cluster is the right thing to do for search aggregation across VSO artefacts
  • Highly extensible
    • Code Element Search
      • Move from Nested Documents to Custom Analyzer
    • Highlighting
      • ES allows the REST APIs to be extended/added
      • We chose a custom query extension mechanism
    • Azure Search Service, though layered on ES, hid these mechanisms, making it intractable for code element search
  • Feeding ES
    • For large-scale indexing, being able to feed ES fast enough is important, so the scalability of the pipeline matters
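As a rough illustration of how the query requirements on this slide (phrase search, AND/OR/NOT, wildcard, faceting, highlighting, pagination) could map onto a single Elasticsearch request body, here is a minimal Python sketch. The field names (`content`, `path`, `extension`) and the helper `build_search_body` are hypothetical and not taken from the actual schema; only the request keys (`bool`, `match_phrase`, `wildcard`, `term`, `aggs`, `highlight`, `from`, `size`) are standard Elasticsearch query DSL.

```python
import json


def build_search_body(phrase, path_glob, exclude_ext, page=0, page_size=20):
    """Build an Elasticsearch search request body as a plain dict.

    All field names are illustrative assumptions, not the real code search schema.
    """
    return {
        "query": {
            "bool": {
                # AND: the exact phrase must occur in the file content
                "must": [{"match_phrase": {"content": phrase}}],
                # OR-style clause: wildcard match on the file path boosts the hit
                "should": [{"wildcard": {"path": path_glob}}],
                # NOT: drop files with the excluded extension
                "must_not": [{"term": {"extension": exclude_ext}}],
            }
        },
        # Faceting: count hits per file extension
        "aggs": {"by_extension": {"terms": {"field": "extension"}}},
        # Highlighting: ask ES to return matching fragments of the content field
        "highlight": {"fields": {"content": {}}},
        # Pagination: offset and page size
        "from": page * page_size,
        "size": page_size,
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("open file", "src/*", "designer"), indent=2))
```

The body is deliberately built as a plain dict so it can be handed to any client; the deck itself uses NEST, which is not shown here.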
High level Architecture
[Diagram: Datacenter A and Datacenter B, each containing a Search Service (Search Service Front End, Search Service Backend, REST API), Index Servers, a VSO Service, a Query Pipeline, a Crawl/Parse/Feed Pipeline, and Mapper Data. Web UX appears under Datacenter A.]
Planned Service Architecture
• XSS Scripting and circular dependency problems during build force the Search Client
  • This is going to be more and more common as more standalone services come into existence
• Thanks to Patrick/Phecda for this picture
Deployment for MsEng
Thanks to Sharad Agarwal for this slide!
• Elastic Search cluster (Indexer)
  • 3 (Master + Query) nodes (A2)
  • 3 Data nodes (A5)
  • Probably 1 Marvel node (A2); need data from AppInsight's team
• Search Service (CPF + Query)
  • 3 Job Agent nodes (A2)
  • 3 App Tier nodes (A2)
  • Config DB (SQL Azure)
  • 1 Azure Storage account
• Portal UX in TFS
• The Search Service, ES, and Marvel cluster are all within a VNET
  • Search Service talks to the ES query/ingestion nodes through an ILB
  • This helped take care of DNS issues
Logging, Diagnostics and Monitoring
• Logging
  • All our code will be instrumented, including the code inside ES.
  • Developers can get these logs.
• Diagnostics
  • Each team provides diagnostics data: higher-level data that gives insight into the usage and activities happening in the context of the component.
  • Query Pipeline Telemetry
    • Total number of queries
    • Successful queries
    • Failed queries
    • Slow queries
  • Portal Telemetry
    • Total number of queries
    • Queries that don't result in a click on the facet or result page in the top 20 results
    • Queries that result in a click beyond 20 results
    • Search usage per account
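The four query pipeline counters listed on this slide (total, successful, failed, slow) can be captured with a very small recorder. The sketch below is only illustrative: the class name `QueryTelemetry` and the `slow_threshold_ms` cutoff are assumptions, since the deck does not say what counts as a slow query.

```python
from dataclasses import dataclass


@dataclass
class QueryTelemetry:
    """Running counters for the query pipeline (hypothetical helper)."""

    slow_threshold_ms: float = 500.0  # assumption: the deck does not define "slow"
    total: int = 0
    successful: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, elapsed_ms: float, ok: bool) -> None:
        """Update the counters for one finished query."""
        self.total += 1
        if ok:
            self.successful += 1
        else:
            self.failed += 1
        # A query can be both successful and slow, so this is counted separately
        if elapsed_ms > self.slow_threshold_ms:
            self.slow += 1


if __name__ == "__main__":
    telemetry = QueryTelemetry(slow_threshold_ms=100.0)
    telemetry.record(20.0, ok=True)
    telemetry.record(250.0, ok=True)
    telemetry.record(5.0, ok=False)
    print(telemetry)
```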
Diagnostics, Monitoring (cont)
• Indexing Telemetry
  • Storage used for temporary data (Blobs)
  • Storage used for entity state data (Tables?)
  • Storage used for metadata
  • Storage used for provisioning data
  • Amount of data (MB) indexed in the last hour
  • Number of commits handled in the last hour
  • Number of pending tasks
  • Number of pending pipelines
  • Number of pending commits
  • Cold start summary
• Monitoring
  • Use Marvel
Query Pipeline
• Quick, low overhead
• Query Builder uses the Arriba parser; NEST is used to talk to Elastic Search
• Mapper (not required): the scope of search defines a unique ES index alias, which refers to the appropriate indices
• Security Trimming (cause of concern): for TFVC at file level, for Work Items at area level
[Diagram: Query Pipeline. REST Endpoint, Repo Access/Auth Mgmt, Query Builder (Format Checking, Query Parsing), Security Trimmer (only for TFVC), Aggregator, Mapper, FileHash Mapper (for Dedup), Highlighter AddIn, Indexer (ElasticSearch).]
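The point on this slide that the scope of a search defines a unique ES index alias can be pictured as a deterministic naming function plus an alias table. Everything in the sketch below is hypothetical: the `search-` prefix, the account/project/repository scope levels, and the index names are illustrative only, and the `ALIASES` dict merely stands in for the alias table that Elasticsearch itself would hold.

```python
def alias_for_scope(account, project=None, repository=None):
    """Derive a deterministic alias name from the search scope.

    The naming scheme is an assumption made for illustration.
    """
    parts = [account, project, repository]
    return "search-" + "-".join(p.lower() for p in parts if p)


# Stand-in for the alias-to-indices mapping that Elasticsearch would maintain.
ALIASES = {
    "search-mseng": ["code-mseng-tfs", "code-mseng-tools"],
    "search-mseng-vso-tfs": ["code-mseng-tfs"],
}


def indices_for_scope(account, project=None, repository=None):
    """Return the indices a query in this scope would hit (empty if unknown)."""
    return ALIASES.get(alias_for_scope(account, project, repository), [])


if __name__ == "__main__":
    print(alias_for_scope("MsEng", "VSO", "TFS"))
    print(indices_for_scope("mseng"))
```

With a scheme like this the query pipeline only needs the scope to pick a target, which is consistent with the slide's note that a separate mapper is not required.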
Query Pipeline Component Diagram
Thanks to Bittu and Neeraj for this diagram!
[Diagram: Search UI, REST API for Query Interaction, Query Builder, Search String & Filters, Search Query Backend, Query String Parser, Custom Highlighter, TFS GIT Repo, ES Client, Query Monitor, OI, Query Executor, ES Cluster, Custom Query, Custom Analyzer, Repo Access Management/Authentication, Search Response, ES Search Results.]
Security
Three options
• Use Remote Security Name Spaces for caching artefact permissions
• GIT + WIT - Index level permissions
• Mostly Open Model
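To make the first option concrete, the sketch below combines a permission cache with file-level trimming of search hits, the kind of trimming the Query Pipeline slide mentions for TFVC. It is a sketch under stated assumptions: `remote_check` stands in for whatever remote permission lookup would really be used, and the hit shape (a dict with a `path` key) is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple


def make_cached_checker(
    remote_check: Callable[[str, str], bool]
) -> Callable[[str, str], bool]:
    """Wrap a (hypothetical) remote permission lookup with an in-memory cache."""
    cache: Dict[Tuple[str, str], bool] = {}

    def can_read(user: str, path: str) -> bool:
        key = (user, path)
        if key not in cache:
            # Only call the remote service on a cache miss
            cache[key] = remote_check(user, path)
        return cache[key]

    return can_read


def trim_results(
    user: str, hits: List[dict], can_read: Callable[[str, str], bool]
) -> List[dict]:
    """Drop every hit whose file path the user is not allowed to read."""
    return [hit for hit in hits if can_read(user, hit["path"])]
```

Trimming after the query, as here, costs one permission check per hit, which may be part of why the deck flags security trimming as a cause of concern; the cache has no invalidation, which a real system would need.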
Indexing Pipeline
• Currently built on the VSSF Framework
• Backend REST APIs: the seminal objects are Tasks, Pipelines, and Entities
  • Example of task creation: a request to perform an indexing-pipeline operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
  • Tasks create pipelines; a task is completed when all the pipelines it spawned are finished
[Diagram: Indexing Pipeline. BE REST Endpoint; stages shown: Meta Data Analysis, Cold Start Index Prep, Index Provisioning, Crawl, Parse, Feed, Dedup Detection (opt), Mapper Update, Ready Index for Query, Cold Start Cleanup; target: Indexer (ElasticSearch).]
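The relationship between Tasks, Pipelines, and Entities described on this slide (an operation on an entity creates a task, the task spawns pipelines, and the task completes when all of them finish) can be sketched as a small data model. The class and field names are illustrative, not the actual backend API, and treating a task with no pipelines as not yet complete is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pipeline:
    """One unit of indexing work spawned by a task (hypothetical shape)."""

    name: str
    finished: bool = False


@dataclass
class Task:
    """A requested operation (e.g. reindex, start, stop) on some entity."""

    operation: str
    entity: str
    pipelines: List[Pipeline] = field(default_factory=list)

    def spawn(self, name: str) -> Pipeline:
        """Create a pipeline owned by this task."""
        pipeline = Pipeline(name)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self) -> bool:
        # Assumption: a task that has spawned nothing yet is not complete
        return bool(self.pipelines) and all(p.finished for p in self.pipelines)


if __name__ == "__main__":
    task = Task(operation="reindex", entity="repo:TFS")
    crawl = task.spawn("crawl-parse-feed")
    cleanup = task.spawn("cold-start-cleanup")
    crawl.finished = True
    print(task.completed)
    cleanup.finished = True
    print(task.completed)
```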
Indexing Pipeline Component Design
Thanks to Tapas for this diagram!
[Diagram: TFS Commit Sync, TFS Account Sync, Re-indexer, Crawler Abstraction Layer, Crawler Extensions, Parser, Parser Extensions, Feeder, ES Wrapper, CPF Arbitrator, ES Map and Topology, Configurator, Index Monitor, OI, Job Scheduler, ES Extensions (Custom Analyzer/Plugins), De-dup, Multi-tenancy, Logger/Telemetry, Repo Content DB Abstraction Layer, Parser DB Abstraction Layer, ES Cluster, Data.]
Indexing Pipeline (cont)
• Cold Start Crawl Spec:
  • For GIT, the 'default branch' is enabled for indexing by default
    • Others will need to get whitelisted explicitly
    • The TFS repo has many topic and feature branches
    • Need closure on the UX experience for this
  • For TFVC: TBD
  • For Work Items: TBD
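The Git rule in the crawl spec above (index the default branch, plus any branch that was explicitly whitelisted) reduces to a small filter. The function name and the branch names in the usage example are hypothetical.

```python
def branches_to_index(all_branches, default_branch, whitelist=()):
    """Return the branches a cold-start crawl would index, in repository order.

    The default branch is always eligible; any other branch must be whitelisted.
    """
    allowed = {default_branch} | set(whitelist)
    return [branch for branch in all_branches if branch in allowed]


if __name__ == "__main__":
    branches = ["master", "features/search", "users/alice/topic", "releases/m66"]
    print(branches_to_index(branches, "master", whitelist=["releases/m66"]))
```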
Performance Summary
• For up to 5 million files, 90% of queries remained under 60 msec
• Feeder ran into issues quickly on A2 configurations because of low memory
• By not storing the file content, but only term vectors, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations
• Following in progress
  • Multiple smaller indexes on the same node
  • Queries during continuous indexing
  • Indexing performance with multiple replicas
  • Multi-index search
• Detailed analysis available at https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default
UX
• Requirements:
  • Search UI needs to be uncluttered and simple
  • User should not lose context of what he was doing
  • Experience should be largely similar for searching different artefacts
• SharePoint has a precedent for multi-artefact search
  • Search launches a different page
  • Seems like a reasonable model to follow
Indexing Pipeline (Cont)
• Crawler Strategy: the current plan is to use LibGit2Sharp
  • The following methods were compared
    • Crawl file by file with GitHttpClient (current implementation)
    • Download zipped trees using GitHttpClient
    • Clone a repo using the Git command line
    • LibGit2Sharp
  • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
  • Implications
    • Entire Git repo is brought down to Azure storage (Blob Store)
• To Dedup or not to Dedup
  • TFS repo on mseng has ~35 feature branches, ~300 scope branches
  • Results (10 million files, 0.4 M unique files, duplication ratio 1:25 (what is seen in Windows SD depots/branches))
    • No Deduplication: 60 GB index size, 19 hours indexing time, 9 ms avg query time
    • Single Document: ~3 GB index size, 50 minutes indexing time, 3.7 ms avg query time
    • Parent Child Mappings: 11 GB index size, ~1 hour 50 min indexing time, 122 ms avg query time
  • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
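The deck does not spell out how the deduplicated variants are built. One plausible reading of the "Single Document" option is to key each file by a hash of its content, so that identical files across branches collapse into one indexed document carrying all of their paths. The sketch below follows that assumption (it also matches the "FileHash Mapper" box in the query pipeline diagram); the use of SHA-1 and the function names are illustrative only.

```python
import hashlib
from collections import defaultdict


def dedup_by_content(files):
    """Group (path, content_bytes) pairs by content hash.

    Returns {sha1_hex: [paths...]}: one entry per unique file body.
    """
    groups = defaultdict(list)
    for path, content in files:
        groups[hashlib.sha1(content).hexdigest()].append(path)
    return dict(groups)


def duplication_ratio(total_files, unique_files):
    """Average number of copies per unique file."""
    return total_files / unique_files


if __name__ == "__main__":
    files = [
        ("main/a.cs", b"class A {}"),
        ("feature1/a.cs", b"class A {}"),
        ("feature1/b.cs", b"class B {}"),
    ]
    for digest, paths in dedup_by_content(files).items():
        print(digest[:8], paths)
    # The slide's numbers: 10 million files, 0.4 million unique
    print(duplication_ratio(10_000_000, 400_000))
```

With the slide's figures the ratio works out to 25 copies per unique file, in line with the stated 1:25, and the index shrinks from 60 GB to about 3 GB, a factor of roughly 20.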
• Backend APIs: for diagnostics, dealing with corruption, etc.
  • The seminal objects are Tasks, Pipelines, and Entities
  • Example of task creation: a request to perform an indexing-pipeline operation X (e.g. reindex, start, stop) on some 'Entity' results in the creation of a task
  • Tasks create pipelines; a task is completed when all the pipelines it spawned are finished
Indexing Pipeline Scaleout
• We want to host indexes for different artefacts on the same ES cluster
  • This will enable search aggregation through ES
  • This opens up several interesting scenarios in the future
• Scale-out and isolation for different pipelines based on Job Infra is not possible
• To leverage efforts across teams
  • Implies that the Crawl/Parse/Feed pipeline should be generalized
  • Potentially we might want to think about extensibility at the query pipeline as well
ALM Search Deployment Topology
[Diagram: Load Balancer; AT and Job nodes; ALM Search Service with AT and Job nodes; Internal Load Balancer; Private Network; ES with TR Data Nodes, Search Data Nodes, TR Query/Indexing Nodes, Search Query/Indexing Nodes, and Shared Master Nodes.]
Cross Account Search and Public Repositories
• There is a desire to include all public repositories in Search, either by default or as an option to the user
  • How will VSO support the notion of a public repository?
  • Will there be public accounts?
On Premise and Cloud Search Federation
• SharePoint and Office 365 have a precedent; they support 3 models based on OAuth
  • http://msdn.microsoft.com/en-us/library/dn155905.aspx
• Three models for federation
  • Outbound: searching on the portal for the on-premise service endpoint will return results from the cloud as well
  • Inbound: searching on the portal for the cloud service returns results from the on-premise TFS indexes as well
  • Both ways: search is symmetric
• Look out for more details in this space next time!
[Diagram: two Code Search Services, each with an Aggregator in front of Indexers (a, b, c) and their Repositories, one side labelled Cloud; clients shown are the VS IDE and the VSO Web UX.]
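The Aggregator boxes in the federation picture suggest merging ranked hits from several indexers (cloud and on-premise) into one result list. The deck gives no details, so the following is only a sketch under strong assumptions: each source returns hits as dicts with a `score`, scores from different sources are treated as directly comparable (which is not guaranteed across separate indexes), and the function and source names are invented.

```python
def aggregate(results_by_source, top_n=10):
    """Merge per-source hit lists into one list ordered by descending score.

    results_by_source: {source_name: [{"path": ..., "score": ...}, ...]}
    Each returned hit is a copy tagged with the source it came from.
    """
    tagged = [
        dict(hit, source=source)
        for source, hits in results_by_source.items()
        for hit in hits
    ]
    # sorted() is stable, so ties keep the order in which sources were given
    tagged.sort(key=lambda hit: hit["score"], reverse=True)
    return tagged[:top_n]


if __name__ == "__main__":
    merged = aggregate(
        {
            "cloud": [{"path": "a.cs", "score": 3.0}, {"path": "b.cs", "score": 1.0}],
            "on-premise": [{"path": "c.cs", "score": 2.0}],
        }
    )
    for hit in merged:
        print(hit["source"], hit["path"], hit["score"])
```

In the outbound and inbound models only one side would run such a merge; in the symmetric model both would.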
Futures
• Semantic Search
• OSS Search Requirements
• Extensions to Code Search for Test Cases
Appendix
Perf Testing on 1 Node for up to 4 M Files
Indexing Rate Analysis (Thanks to Perf Crew)
[Chart: Indexing Rate. Docs indexed/sec (axis 0 to 350) versus files indexed (10K, 100K, 500K, 1M, 2M, 3M, 4M) for series A5-1N1S, A5-1N3S, A5-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A7-1N1S, A7-1N3S, A7-1N5S.]
Setup
• A5: 6 GB allocated to JVM heap.
• A6: 12 GB allocated to JVM heap.
• A7: 20 GB allocated to JVM heap.
• Feeder: A4 machine feeding asynchronously.
Observation
• On A5, the indexing rate remains the same across shards.
• On A6 and A7, using more than 1 shard improved the indexing rate. Behavior with 3 and 5 shards remained the same.
• Indexing rate remained linear past 500K files for the whole indexing period.
• On A5, the maximum indexing rate is 160 docs/sec while the minimum is 107 docs/sec.
• On A6, the maximum indexing rate is 200 docs/sec while the minimum is 125 docs/sec.
• On A7, the maximum indexing rate is 302 docs/sec while the minimum is 120 docs/sec.
Conclusion
• Indexing rate remained linear once 500K docs were indexed.
• For onboarding a new repo we can clearly predict/estimate the maximum time needed to index the repo.
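The conclusion above amounts to simple arithmetic: if the rate never drops below the observed minimum, the worst-case indexing time is the file count divided by that minimum rate. A minimal sketch, assuming one indexed document per file and using the minimum rates reported on this slide (the function name and the `MIN_DOCS_PER_SEC` table are illustrative):

```python
# Minimum observed indexing rates (docs/sec) per machine size, from the slide
MIN_DOCS_PER_SEC = {"A5": 107, "A6": 125, "A7": 120}


def max_indexing_hours(file_count, min_docs_per_sec):
    """Upper-bound estimate of indexing time in hours at the minimum rate.

    Assumes one indexed document per file and a rate that never falls
    below min_docs_per_sec.
    """
    return file_count / min_docs_per_sec / 3600.0


if __name__ == "__main__":
    for size, rate in MIN_DOCS_PER_SEC.items():
        hours = max_indexing_hours(4_000_000, rate)
        print(f"{size}: at most {hours:.1f} hours for 4M files")
```

For example, 4 million files on an A5 at 107 docs/sec comes to roughly 10.4 hours.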