This presentation summarizes how we use Elasticsearch for analytics at Wingify for our product, Visual Website Optimizer (http://vwo.com). It was prepared for my poster session at The Fifth Elephant (https://funnel.hasgeek.com/fifthel2014/1143-using-elasticsearch-for-analytics).
Using Elasticsearch for Analytics
How we use Elasticsearch for analytics at Wingify
Vaidik Kapoor
github.com/vaidik
twitter.com/vaidikkapoor
Problem Statement

VWO collects the number of visitors and conversions per goal per variation for every campaign created. Our customers use these numbers to make optimization decisions - very useful, but limiting, as they are overall numbers and drilling down is not possible. There was a need to develop an analytics engine that:
- is capable of storing millions of daily data points, essentially JSON docs.
- exposes a flexible and powerful query interface for segmenting visitors and conversions data. This is extremely useful for our customers to derive insights.
- is not extremely slow to query - response times of 2-5 seconds are acceptable.
- is not too difficult to maintain in production - operations should be easy for a lean team.
- is easy to extend to provide new features.
Elasticsearch to the rescue

A distributed, near real-time search engine that is also widely used as an analytics engine - a proven solution. Highly available, fault tolerant, distributed - built from the ground up to work in the cloud.
- Elasticsearch is distributed - cluster management takes care of node downtimes, which makes operations easy instead of a headache.
- Application development remains the same no matter how you deploy Elasticsearch, i.e. as a cluster or a single node.
- Capable of performing all the major types of searches, matches and aggregations. Also supports limited regular expressions.
- Easy index and replica creation on a live cluster.
- Easy management of the cluster and indices through the REST API (see the sketch below).
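For illustration, a minimal sketch of the kind of index and cluster management the REST API allows, using Python's requests library; the index name and the shard/replica counts are hypothetical:

import json
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch node

# Create an index with explicit shard/replica settings (values are hypothetical).
requests.put(ES + "/visitors", data=json.dumps({
    "settings": {"number_of_shards": 4, "number_of_replicas": 1}
}))

# Bump the replica count on the live cluster - no downtime needed.
requests.put(ES + "/visitors/_settings", data=json.dumps({
    "index": {"number_of_replicas": 2}
}))

# Check cluster health.
print(requests.get(ES + "/_cluster/health").json())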
How we use Elasticsearch

1. Store a document for every unique visitor per campaign in Elasticsearch. The document contains:
   a. Visitor-related segment properties like geo data, platform information, referral, etc.
   b. Information related to conversion of goals.
2. Use Nested Types for creating a hierarchy between every unique visitor's visit and conversions.
3. Use the Aggregations/Facets framework for generating date-wise counts of visitors and conversions and basic stats like average and total revenue, sum of squares of revenue, etc.
4. Never use script facets/aggs to get counts of a combination of values from the same document. Scripts are slow. Instead, index the result of the script at index time (the "facet_term" field below, which pre-combines the combination and goal IDs, is an example).

Visitor documents in Elasticsearch:

{
    "account": 196,
    "experiment": 77,
    "combination": "5",
    "hit_time": "2014-07-09T23:21:15",
    "ip": "71.12.234.0",
    "os": "Android",
    "os_version": "4.1.2",
    "device": "Huawei Y301A2",
    "device_type": "Mobile",
    "touch_capable": true,
    "browser": "Android",
    "browser_version": "4.1.2",
    "document_encoding": "UTF-8",
    "user_language": "en-us",
    "city": "Mandeville",
    "country": "United States",
    "region": "Louisiana",
    "url": "https://vwo.com/free-trial",
    "query_params": [],
    "direct_traffic": true,
    "search_traffic": false,
    "email_traffic": false,
    "returning_visitor": false,
    "converted_goals": [...],
    ...
}

"converted_goals": [
    {
        "id": 2,
        "facet_term": "5_2",
        "conversion_time": "2014-07-09T23:32:41"
    },
    {
        "id": 6,
        "facet_term": "5_6",
        "conversion_time": "2014-07-09T23:37:04"
    }
]
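As a rough sketch of what this looks like in practice, the mapping below marks converted_goals as a nested type, and the query uses a date_histogram aggregation with a nested sub-aggregation to get date-wise visitor and conversion counts. Field names follow the document above, but the index/type names and the exact mapping are assumptions:

import json
import requests

ES = "http://localhost:9200"  # assumed local node; "visitors"/"visitor" are hypothetical names

# Mapping sketch (Elasticsearch 1.x style): "converted_goals" is a nested type,
# so each conversion keeps its own id/facet_term/conversion_time together.
requests.put(ES + "/visitors/_mapping/visitor", data=json.dumps({
    "visitor": {
        "properties": {
            "hit_time": {"type": "date"},
            "converted_goals": {
                "type": "nested",
                "properties": {
                    "id": {"type": "integer"},
                    "facet_term": {"type": "string", "index": "not_analyzed"},
                    "conversion_time": {"type": "date"}
                }
            }
        }
    }
}))

# Date-wise visitor counts, with a nested sub-aggregation for conversions per goal.
query = {
    "size": 0,
    "aggs": {
        "visitors_per_day": {
            "date_histogram": {"field": "hit_time", "interval": "day"},
            "aggs": {
                "conversions": {
                    "nested": {"path": "converted_goals"},
                    "aggs": {
                        "per_goal": {"terms": {"field": "converted_goals.facet_term"}}
                    }
                }
            }
        }
    }
}
print(requests.post(ES + "/visitors/_search", data=json.dumps(query)).json())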
Alongside Elasticsearch as our primary data store, we use a bunch of other things:
- RabbitMQ - our central queue, which receives all the analytics data and pushes it to all the consumers that write to different data stores, including Elasticsearch and MySQL.
- MySQL - stores overall counters of visitors and conversions per goal per variation of every campaign. This serves as a cache in front of Elasticsearch - it prevents us from calculating total counts by iterating over all the documents, and makes loading of reports faster.
- Consumers - written in Python, responsible for sanitizing and storing data in Elasticsearch and MySQL. New visitors are inserted as documents in Elasticsearch. Conversions of existing visitors are recorded in the document previously inserted for the visitor that converted, using Elasticsearch's Update API with script updates (see the sketch below).
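A minimal sketch of what such a scripted update could look like; the document ID, script and parameter names are assumptions, not our production code:

import json
import requests

ES = "http://localhost:9200"  # assumed local node; names below are hypothetical

# Append a conversion to an existing visitor document via the Update API.
# Script is in the Elasticsearch 1.x style; "visitor_id" identifies the
# document inserted when the visitor was first seen.
requests.post(ES + "/visitors/visitor/visitor_id/_update", data=json.dumps({
    "script": "ctx._source.converted_goals += goal",
    "params": {
        "goal": {
            "id": 2,
            "facet_term": "5_2",
            "conversion_time": "2014-07-09T23:32:41"
        }
    }
}))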
- Analytics API Server - written in Python using Flask, Gevent and Celery. Exposes APIs for querying segmented data and for other tasks such as starting to track a campaign, flushing campaign data, flushing account data, etc. Provides a custom JSON-based Query DSL which makes the Query API easy to consume. The API server translates this Query DSL to Elasticsearch's DSL. Example:

{
    "and": [
        { "or": [ { "city": "New Delhi" }, { "city": "Gurgaon" } ] },
        { "not": { "device_type": "Mobile" } }
    ]
}
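For illustration, a rough sketch of how such a DSL could be translated to Elasticsearch's filter DSL; this simplified translator is an assumption, not our actual implementation:

def translate(node):
    """Recursively translate the custom Query DSL into an
    Elasticsearch 1.x filter clause (and/or/not/term)."""
    if "and" in node:
        return {"and": [translate(child) for child in node["and"]]}
    if "or" in node:
        return {"or": [translate(child) for child in node["or"]]}
    if "not" in node:
        return {"not": translate(node["not"])}
    # Leaf: a single {field: value} pair becomes a term filter.
    (field, value), = node.items()
    return {"term": {field: value}}

query_dsl = {
    "and": [
        {"or": [{"city": "New Delhi"}, {"city": "Gurgaon"}]},
        {"not": {"device_type": "Mobile"}},
    ]
}

# Wrap as a filtered query for the _search endpoint (Elasticsearch 1.x style).
es_query = {"query": {"filtered": {"filter": translate(query_dsl)}}}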
Current Architecture

[Architecture diagram: data acquisition servers in USA West, USA East, Europe and Asia feed a central queue; consumers/workers read from the queue to update counters and sync visitors and conversions; the Analytics API Server serves the front-end application.]
Plan for Scaling

Elasticsearch scales, but only when planned for. Consider the following:
- Make your data shardable - we cannot emphasize this enough. If you cannot shard your data, then scaling out will always be a problem, especially with time-series data, as it always grows. There are options like user-based and time-based indices. You may shard according to something else. Find what works for you.
- Use routing to scale reads. Without routing, queries hit all the shards to find a small number of documents out of the total documents per shard (it is difficult to find a needle in a larger haystack). If you have a lot of shards, then Elasticsearch will not return until responses from all the shards have arrived and been aggregated at the node that received the request. A sketch of routed indexing and search follows this list.
- Avoid hotspots caused by routing. Sometimes some shards can end up with a lot more data than the rest.
- Use the Bulk API for the right things - updating or deleting a large number of documents on an ad-hoc basis, bulk indexing from another source, etc.
- Increase the number of shards per index for data distribution, but keep it sane if you are creating many indices (like per day), as shards are resource hungry.
- Increase the replica count to get higher search throughput.
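A minimal sketch of routing with the official Python client; routing visitors by account ID is an assumption made for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # assumed local node

account_id = 196  # hypothetical routing key: an account's visitors share a shard

# Index with a routing value so the document lands on a predictable shard.
es.index(index="visitors", doc_type="visitor", routing=str(account_id),
         body={"account": account_id, "city": "Mandeville"})

# Search with the same routing value: only that shard is queried,
# instead of fanning out to every shard in the index.
res = es.search(index="visitors", routing=str(account_id),
                body={"query": {"term": {"account": account_id}}})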
Ops - What We Learned

- Elasticsearch does not have ACLs - important if you are dealing with user data. There are existing 3rd-party plugins for ACLs. In our opinion, run Elasticsearch behind Nginx (or Apache) and let Nginx take care of access control. This can be easily achieved using Nginx + Lua. You may use something equivalent.
- Have dedicated master nodes - these will ensure that Elasticsearch's cluster management does not stop (important for high availability). Master-only nodes can run on relatively small machines as compared to data nodes.
- Disable deleting of indices using wildcards or _all to avoid the most obvious disaster.
- Spend some time with the JVM. Monitor resource consumption, especially memory, and see which garbage collector works best for you. For us, G1GC worked better than CMS due to our high indexing rate requirement.
- Consider using doc values - the major advantage is that they move memory management out of the JVM and let the kernel manage memory through the disk cache.
- Use the Snapshot API and be prepared to use the Restore API, hoping you never really have to (see the sketch below).
- Consider rolling restarts, with indices optimized before the restart.
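A brief sketch of snapshot and restore with the Python client; the repository name and filesystem path are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # assumed local node

# Register a filesystem snapshot repository (the path is hypothetical and
# must be allowed in the node's configuration).
es.snapshot.create_repository(repository="backups", body={
    "type": "fs",
    "settings": {"location": "/mnt/es-backups"},
})

# Take a snapshot of all indices.
es.snapshot.create(repository="backups", snapshot="snapshot_1",
                   wait_for_completion=True)

# Restore from the snapshot - hopefully never needed.
es.snapshot.restore(repository="backups", snapshot="snapshot_1")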