13 June 2019 © MARKLOGIC CORPORATION
Data Hub Performance Optimization
JAMES CLIPPINGER, VP Strategic Accounts
ERIN MILLER, Senior Manager, Performance and Reliability Engineering
What to check when your Data Hub is slow
Code or Infrastructure?
- Finding infrastructure problems
- Tracking down resource bottlenecks
- Debugging slow harmonization
- Figuring out a slow production application
Agenda
MarkLogic version
Performance expectations
MarkLogic system ErrorLog
Find the resource bottleneck using Meters and testing
If there is no bottleneck:
- Increase workload
- Isolate workload
Infrastructure performance checklist
Are you running the most recent version of MarkLogic?
Yes, upgrading can be a pain, but:
- Performance issues are getting fixed all the time
- Metrics and performance monitoring are improving all the time
Do you want to spend time chasing a problem that’s been fixed or that could be detected?
Don’t kick it old school
Five key resources used by MarkLogic
- Disk bandwidth, disk space, CPU, RAM, network bandwidth
Resource needs of ingest and harmonization are generally easy to predict based on design
Can infrastructure meet the resource needs? Test and do the math.
Don’t be surprised by ingest, harmonization, or reindexing requirements
Reasonable performance expectations
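As an illustration of "test and do the math", the sketch below estimates the disk write bandwidth an ingest workload would need and compares it with what the storage can sustain. Every number is hypothetical, and the write amplification factor (extra writes for journaling, saving stands, and merges) is an assumption that should be replaced with a value measured on your own system.

```python
def required_write_mb_per_sec(docs_per_sec, avg_doc_kb, write_amplification):
    """Estimate sustained disk write bandwidth (MB/s) for an ingest workload."""
    return docs_per_sec * avg_doc_kb * write_amplification / 1024.0


def fits(required_mb_per_sec, available_mb_per_sec, headroom=0.8):
    """True if the requirement stays within a fraction (headroom) of capacity."""
    return required_mb_per_sec <= available_mb_per_sec * headroom


# Hypothetical workload: 5,000 docs/s at 10 KB each, assumed 3x write amplification.
need = required_write_mb_per_sec(docs_per_sec=5000, avg_doc_kb=10, write_amplification=3.0)

# Hypothetical storage baseline of 200 MB/s, e.g. from an fio run or cloud documentation.
print(f"need {need:.1f} MB/s, fits in 200 MB/s with 20% headroom: {fits(need, 200.0)}")
```

The same arithmetic can be repeated for harmonization and reindexing, which also read and rewrite the data.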
Recent releases of MarkLogic tell you much more about infrastructure performance issues
- “Slow” messages: write, read, fsync, and background
- “Memory” messages and warnings
- “Hung” messages
Check ErrorLog.txt
2018-12-12 09:00:14.510 Info: Forest f1 state changed from open to error
2018-12-12 09:00:14.510 Info: Database DB-1 is offline
2018-12-12 09:00:14.512 Alert: XDMP-FORESTERR: Error in merge of forest f1: XDMP-MERGESPACE: Not merging due to disk space limitations, need=17039MB, have=14696MB
Analyzing logs
2017-07-06 12:10:33.553 Warning: Slow fsync /var/opt/MarkLogic/Forests/doc-stress-F4/Label, 2.044 sec
2017-07-06 12:11:50.868 Notice: Slow open /var/opt/MarkLogic/Forests/doc-stress-F10/00000183/Label, 1.659 sec
2017-07-06 12:11:59.734 Warning: Slow utime /var/opt/MarkLogic/Forests/doc-stress-F10/Label, 2.006 sec
Analyzing logs
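A minimal sketch of summarizing "Slow" messages per forest, assuming each message has the shape "Slow <operation> <path>, <seconds> sec" as in the sample above. The regular expression and the helper names are illustrative, not part of MarkLogic; other message variants would need their own patterns.

```python
import re
from collections import defaultdict

# Assumed message shape: "<timestamp> <Level>: Slow <op> <path>, <seconds> sec"
SLOW_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>\w+): Slow (?P<op>\w+) (?P<path>\S+), (?P<secs>\d+\.\d+) sec"
)


def forest_of(path):
    """Return the directory name that follows 'Forests' in the path, else the path."""
    parts = path.split("/")
    if "Forests" in parts:
        i = parts.index("Forests")
        if i + 1 < len(parts):
            return parts[i + 1]
    return path


def summarize_slow(lines):
    """Count slow operations and total their reported seconds per forest."""
    summary = defaultdict(lambda: {"count": 0, "seconds": 0.0})
    for line in lines:
        m = SLOW_RE.match(line)
        if m is None:
            continue  # not a "Slow" message in the assumed shape
        entry = summary[forest_of(m.group("path"))]
        entry["count"] += 1
        entry["seconds"] += float(m.group("secs"))
    return dict(summary)


SAMPLE = [
    "2017-07-06 12:10:33.553 Warning: Slow fsync /var/opt/MarkLogic/Forests/doc-stress-F4/Label, 2.044 sec",
    "2017-07-06 12:11:50.868 Notice: Slow open /var/opt/MarkLogic/Forests/doc-stress-F10/00000183/Label, 1.659 sec",
    "2017-07-06 12:11:59.734 Warning: Slow utime /var/opt/MarkLogic/Forests/doc-stress-F10/Label, 2.006 sec",
]

slow_summary = summarize_slow(SAMPLE)
for forest, stats in sorted(slow_summary.items()):
    print(forest, stats["count"], round(stats["seconds"], 3))
```

Grouping by forest helps show whether the slowness is concentrated on one volume or spread across the host.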
2017-02-21 10:24:47.093 Debug: Retrying xdmp:invoke 8616837058919558659 Update 1 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml
2017-02-21 10:24:47.990 Debug: Retrying xdmp:invoke 8616837058919558659 Update 2 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml
2017-02-21 10:24:48.851 Debug: Retrying xdmp:invoke 8616837058919558659 Update 3 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml
….
Analyzing logs
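Repeated deadlock retries on the same URI usually point to contention on one document. The sketch below counts retries per URI, assuming the Debug message layout shown in the sample above; the pattern and function name are illustrative.

```python
import re
from collections import Counter

# Assumed layout: "Retrying <function> <request-id> Update <n> because
# XDMP-DEADLOCK: Deadlock detected locking <uri>"
DEADLOCK_RE = re.compile(
    r"Retrying (?P<fn>\S+) (?P<req>\d+) Update (?P<attempt>\d+) "
    r"because XDMP-DEADLOCK: Deadlock detected locking (?P<uri>\S+)"
)


def deadlock_hotspots(lines):
    """Return a Counter of deadlock retries keyed by the contended URI."""
    counts = Counter()
    for line in lines:
        m = DEADLOCK_RE.search(line)
        if m:
            counts[m.group("uri")] += 1
    return counts


SAMPLE = [
    "2017-02-21 10:24:47.093 Debug: Retrying xdmp:invoke 8616837058919558659 Update 1 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml",
    "2017-02-21 10:24:47.990 Debug: Retrying xdmp:invoke 8616837058919558659 Update 2 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml",
    "2017-02-21 10:24:48.851 Debug: Retrying xdmp:invoke 8616837058919558659 Update 3 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml",
]

hotspots = deadlock_hotspots(SAMPLE)
for uri, retries in hotspots.most_common(5):
    print(uri, retries)
```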
Disk bandwidth
- Cloud: check expected bandwidth
- On prem: test using fio to establish storage performance baseline (tough for shared)
CPU
- 80-85% utilization tends to be maximum for responsive system
Lock Wait Load: how does it compare to expectations?
Resource bottlenecks visible in Meters
Crank up the workload
- More client threads
- More MarkLogic app server/task server threads
Work to get to a bottleneck
- Isolate workloads: ingest, harmonize, and user queries separately
If request rate does not go up and there’s no bottleneck, contact Support
If there is no bottleneck
MarkLogic Data Hub Platform
Transformation from “as-is” to “curated data”
What are your SLAs?
But the ingest is so fast! Managing expectations
It’s just a black box—how do I figure this out?!
Challenge: Harmonization is slow
USE methodology (Gregg, http://www.brendangregg.com/usemethod.html)
For every component in your architecture, monitor and analyze:
- Utilization
- Saturation
- Errors
For MarkLogic: use Meters to isolate bottlenecks
- Disk space/storage; IOPS; CPU; RAM
Code or Infrastructure?
Meet your friend, Request Monitoring!
- New in 9.0-7
- Fine-grained data about every query run by app server
- https://docs.marklogic.com/guide/performance/request_monitoring
- Output goes to /var/opt/MarkLogic/Logs/APP-SRV-PORT_RequestLog.txt
Challenge: Slow transformations
I just want to capture stats about all requests that are > 1 second
Configure request monitoring on the data-hub-FINAL and STAGING app servers
- In the root directory of the data-hub-MODULES database, place a .api file that contains info about the metrics you want to capture and the constraints (if any)
Request Monitoring with thresholds
My transform job
{"time":"2019-04-25T16:53:41Z", "url":"/v1/resources/ml:sjsFlow?rs=job-id=e9576f6b-7aa3-4243-bd1a-edc634343631=rs=flow-name=join=rs=target-database=data-hub-FINAL=rs=options=%7B dhf.collection = source=small%2F00%2F %2C entity = WebPage %2C flow = join %2C flowType= harmonize %7D=rs=entity-name=WebPage=rs=identifiers=...200-urls here for docs harmonized--=database=data-hub-STAGING", "user":"admin", "elapsedTime":112.311659, "requests":1, "listCacheHits":1344, "listCacheMisses":200, "listSize":811708, "inMemoryListHits":201, "expandedTreeCacheHits":369, "expandedTreeCacheMisses":631, "compressedTreeCacheHits":623, "compressedTreeCacheMisses":8, "compressedTreeSize":292496, "valueCacheHits":78, "valueCacheMisses":2943, "regexpCacheHits":408, "regexpCacheMisses":7, "filterHits":600, "fragmentsAdded":200, "dbProgramCacheHits":1002, "fsLibraryModuleCacheHits":203, "dbLibraryModuleCacheHits":26, "readLocks":48470, "writeLocks":200, "lockTime":4.382538, "commitTime":0.000703, "runTime":112.312699, "indexingTime":1.89982}
Monitoring on a DHF endpoint: Output
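A minimal sketch of reading request log output, assuming one JSON object per line with the field names shown in the sample above. It keeps only requests slower than one second and derives a few ratios. The sample record reuses values from the output above, with the URL shortened; the helper names are illustrative.

```python
import json

SLOW_SECONDS = 1.0  # threshold from the slide: requests longer than 1 second


def ratio(hits, misses):
    """Cache hit ratio, or None when there was no cache traffic."""
    total = hits + misses
    return hits / total if total else None


def summarize_request(record):
    """Derive a few indicators from one request log record."""
    elapsed = record["elapsedTime"]
    return {
        "elapsed": elapsed,
        "lock_share": record.get("lockTime", 0.0) / elapsed if elapsed else None,
        "expanded_tree_hit_ratio": ratio(
            record.get("expandedTreeCacheHits", 0),
            record.get("expandedTreeCacheMisses", 0),
        ),
        "list_cache_hit_ratio": ratio(
            record.get("listCacheHits", 0), record.get("listCacheMisses", 0)
        ),
        "read_locks": record.get("readLocks", 0),
        "write_locks": record.get("writeLocks", 0),
    }


def slow_requests(lines, threshold=SLOW_SECONDS):
    """Yield summaries for records whose elapsedTime exceeds the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("elapsedTime", 0.0) > threshold:
            yield summarize_request(record)


# Subset of the fields from the sample output; the URL is shortened for readability.
SAMPLE_LINE = (
    '{"time":"2019-04-25T16:53:41Z", "url":"/v1/resources/ml:sjsFlow", '
    '"user":"admin", "elapsedTime":112.311659, "listCacheHits":1344, '
    '"listCacheMisses":200, "expandedTreeCacheHits":369, '
    '"expandedTreeCacheMisses":631, "readLocks":48470, "writeLocks":200, '
    '"lockTime":4.382538}'
)

results = list(slow_requests([SAMPLE_LINE]))
for summary in results:
    print(summary)
```

For this record the expanded tree cache hit ratio is 0.369 and there are 48470 read locks against 200 write locks, which are the kinds of figures worth comparing with what the flow is expected to do.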
In DHF 4.x, harmonization code is in plugins
- Main, Collector, Content, Header, Triples all run in Query mode
- Writer runs in Update mode
Thought experiment—what happens if I do a search in my Writer plugin?
Good news: in DHF 5.0, we give you better guardrails!
DHF and transactions
My writer plugin
Characterizing the problem
- Lots of possible reasons: data growth over time, resource bottlenecks, un-optimized code
- When did it start? What’s the pattern?
Different ingest/harmonization flows impacting database search performance
“When I run ingest and transform, my search application slows down”
Challenge: Production application slows down over time
This assumes that you’ve followed USE and Clip’s suggestions and found an infrastructure bottleneck
Look for hockey stick—try to provision more infrastructure resources before you get there
- What to expand when you’re expanding?
- IOPS, CPU, RAM? Forests/hosts?
Solution: Cluster Expansion
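One rough way to provision before the hockey stick is to extrapolate a utilization trend toward a ceiling. The sketch below fits a straight line to hypothetical weekly utilization samples and estimates how many weeks remain before the line crosses 80%. A linear fit is a deliberate simplification: real saturation curves bend sharply, so the estimate is best read as an early warning, not a forecast.

```python
def weeks_until(samples, threshold):
    """Least-squares line through weekly samples; weeks from the last sample
    until the line reaches threshold. Returns None if the trend is not rising."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical weekly peak CPU utilization, in percent.
cpu_weekly = [50, 53, 56, 59, 62]
remaining = weeks_until(cpu_weekly, threshold=80)
print(f"about {remaining:.1f} weeks until 80% at the current trend")
```

The same function can be applied to disk space, IOPS, or RAM samples to decide which resource to expand first.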
Remember, all your data uses resources—memory and storage, impacts on search, term list sizes, etc. If you’re not using the data and don’t need it, archive
Use Tiered Storage for less frequently accessed data
- HDFS
Archive unused data
- S3 and Azure BLOB storage
- Even if you are an on-prem customer, this is a cheap and effective storage mechanism
Solution: Archive strategy
Keep requests independent
- Isolate your workload by request. DHF does this for you
Watch your locks
Avoid unnecessary bottlenecks
- Don’t create a serial number generator
Putting it together: writing scalable code
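To illustrate the serial number point: if every request increments one shared counter document, every request has to lock that document and the workload serializes. A minimal sketch of the alternative is to build identifiers that need no coordination. The URI prefix and function name here are hypothetical.

```python
import uuid


def new_document_uri(prefix="/harmonized/"):
    """Build a unique URI without reading or updating any shared counter."""
    return f"{prefix}{uuid.uuid4()}.json"


uri_a = new_document_uri()
uri_b = new_document_uri()
print(uri_a)
print(uri_b)
```

Each caller can generate identifiers independently, so no request waits on another just to obtain the next number.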
Limit the request’s resources
- Using SQL or SPARQL? Use cts or Optic search clauses to limit scope
- Big result set? Paginate
- Don’t write queries where result set grows with size of data
- i.e., give me all the trades in the database—what happens as DB grows?
- If you need to do this, batch!
Putting it together: writing scalable code
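The paginate and batch advice can be sketched as a loop that never holds more than one page in memory. Here fetch_page is a hypothetical stand-in for whatever call returns a slice of results (for example a search request with a start position and page length); it is not a MarkLogic API.

```python
def paged(fetch_page, page_size):
    """Yield results one page at a time using a 1-based start position."""
    start = 1
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # a short page means the result set is exhausted
        start += len(page)


# Stand-in data source: 7 hypothetical trade identifiers.
TRADES = [f"trade-{i}" for i in range(7)]


def fetch_trades(start, page_size):
    """Hypothetical fetch: return up to page_size items from a 1-based start."""
    return TRADES[start - 1 : start - 1 + page_size]


pages = list(paged(fetch_trades, page_size=3))
for page in pages:
    print(page)
```

Because each request touches a bounded number of results, its cost stays roughly constant as the database grows.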
Realistically, bottlenecks are often a combination of un-optimized code and resource limits
To figure out what’s what, use Utilization Saturation Errors methodology
Best process to efficiently scale:
- First, optimize your code as best you can
- Then, look at expansion—add RAM, add hosts, scale out and/or up
Putting it together: USE to resolve bottlenecks
[email protected]. Really. Email Support. You can email us: [email protected] and [email protected], but Support is monitored 24/7
Trying to figure out what those logs mean? https://help.marklogic.com/Knowledgebase/
Oh look! Erin wrote a whitepaper about this:
- https://www.marklogic.com/resources/performance-testing-marklogic/
And another:
- https://www.marklogic.com/resources/understanding-system-resources/
Resources
Thank you