Elasticsearch Atlanta Meetup 3/15/16

Elasticsearch & Big Data

Elasticsearch Meetup 03/15/2016

WHO AM I

• Roy Russo
• VP Engineering, Predikto
• ElasticHQ.org

AGENDA

• What is “Predictive Analytics”?
• Why Elasticsearch & Spark for ETL?
• Elasticsearch in production
• Lessons learned

WHY AM I HERE?

& (Big Data) Predictive Analytics

WHO IS PREDIKTO?

• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers / Statisticians
• Big Data Architects
• Global 1000

WHAT WE DO?

• Predictive Maintenance
• Predictive Analytics
• Asset Health
• Anomalies
• SaaS

HOW DOES PREDICTIVE ANALYTICS WORK?

[Diagram: data sources feeding the Predikto SaaS platform]

Data Sources: OEM Specs, Inventory, Operator, Weather, Financial, Sensors, Asset Mgmt.

PREDIKTO SAAS PLATFORM: INPUT (APIs), DATA LAKE, TRANSFORM, BUSINESS INTELLIGENCE, LEARN, OUTPUT (APIs), MAX, PREDIKTO APPS

THE PREDIKTO ADVANTAGE (PATENT PENDING)

MAX: Raw data, Auto dynamic engineered features, Auto dynamic feature selection, Method and algorithm selection, Dynamic self learning

MACHINE LEARNING AUTOMATION & SCALABILITY

1. Load Raw Data (1,000)
Load data from Sensors, Maintenance and Repairs, Usage, Topology, Weather, …

2. Auto dynamic engineered features (125,000; X 125)
Example: average std. deviation of the pressure sensor in the last hour, compared to the average std. deviation of the pressure sensor over the last week for locomotives leaving the same depot and in the same series of make and model.

3. Auto dynamic feature selection (37)
Software algorithms and logic find the best raw data and/or engineered features that correlate to the best representation of the health and condition of the equipment.
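The engineered-feature example above (std. deviation over the last hour compared to the last week across a peer group) can be sketched in plain Python. This is only an illustration under stated assumptions, not Predikto's implementation: the function names, the `(timestamp, value)` reading format, and the choice of a simple ratio as the "comparison" are all hypothetical.

```python
from statistics import mean, pstdev

HOUR = 3600          # seconds
WEEK = 7 * 24 * HOUR

def window_std(readings, now, span):
    """Std. deviation of values whose timestamp lies in (now - span, now]."""
    values = [v for ts, v in readings if now - span < ts <= now]
    return pstdev(values) if len(values) > 1 else 0.0

def pressure_std_ratio(asset_readings, peer_readings, now):
    """Last-hour std. dev for one asset, divided by the average last-week
    std. dev across its peer group (assumed: same depot, same make/model series).
    Returns None when there is no usable peer baseline."""
    recent = window_std(asset_readings, now, HOUR)
    peer_stds = [window_std(r, now, WEEK) for r in peer_readings]
    baseline = mean(peer_stds) if peer_stds else 0.0
    return recent / baseline if baseline else None
```

A ratio well above 1 would mean the asset's pressure is fluctuating more in the last hour than its peers typically did over the last week.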

IN PRODUCTION

• Apache Spark Integration
• Why Elasticsearch?
• Nginx
• Elasticsearch Configuration

SO WE USE SPARK…

• ETL:
  • Shared Memory
  • Not Disk-Bound
  • Distributed workloads
• Scale horizontally
• New node = more capacity
• Spin up. Spin down.
• Runs from Instruction-set
• DAGs

ES-Hadoop: Setup

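from elasticsearch import Elasticsearch

# Python client for queries made outside Spark (es_uri is defined elsewhere)
client = Elasticsearch(hosts=[es_uri])

# some_var must already be a JSON array string, e.g. "[1458000000, 1458086400]"
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var + '}}}}}'

# ES-Hadoop connector settings: where to connect, what to read, which query to run
es_conf = {
    "es.nodes": es_host_name,
    "es.port": es_host_port,
    "es.resource": es_index + '/' + es_mapping,
    "es.query": query,
}

# Read the matching documents into an RDD of (key, document) pairs
es_RDD = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

# Keep only the document (second element of each pair), then reformat its location
dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))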
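The query string in the setup above is assembled by string concatenation, which breaks easily if the inserted value is not valid JSON. A minimal sketch of an alternative, assuming the epoch values are available as a Python list: `build_terms_query` is a hypothetical helper and the sample timestamps are illustrative. It keeps the same `filtered` syntax as the slide, which Elasticsearch deprecated in 2.0 in favor of `bool` with a `filter` clause.

```python
import json

def build_terms_query(field, values):
    """Return a filtered terms query (ES 1.x/2.x syntax) as a JSON string."""
    body = {"query": {"filtered": {"filter": {"terms": {field: list(values)}}}}}
    return json.dumps(body)

# Illustrative values only: two Unix timestamps in seconds
query = build_terms_query("date_epoch", [1458000000, 1458086400])
```

The resulting string can be passed as the "es.query" value in the connector settings.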

ES-Hadoop: Tips

• Connects to specific shards
• Spark workers - to - ES (primary) shards
• Typing matters
• Beware of ‘index.mapping.coerce’
• Ignores ‘size’ parameter!
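On the typing tip: with coercion enabled, Elasticsearch will quietly accept a value like the string "80" for a numeric field, and the mismatch tends to surface later on the Spark side. A sketch of an index body that turns coercion off at the index level, so badly typed documents are rejected at write time. The type name "reading" and the field names are hypothetical; the mapping layout follows the 2.x style with a mapping type.

```python
import json

# Hypothetical index body: "reading", "oil_temp" and "rpm" are illustrative names.
index_body = {
    "settings": {"index.mapping.coerce": False},  # reject "80" for a numeric field
    "mappings": {
        "reading": {
            "properties": {
                "oil_temp": {"type": "double"},
                "rpm": {"type": "long"},
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```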

WHY ELASTICSEARCH?

• Time-Series Data
• Fast-Reads
• Fast writes with Bulk Inserts
• Async
• Dynamic querying
• Differing schema
• Everything is indexed
• Visualization
• GeoJSON
• Scale horizontally
• New node = more capacity
• Spark-ES Connector
• Python lib
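For the bulk-insert point in the list above, the `_bulk` endpoint takes a newline-delimited body: one action line, then one source line, per document. A small sketch of building that body by hand, assuming 2.x-style `_index`/`_type` metadata. `bulk_body` and the sample index, type, and field names are hypothetical.

```python
import json

def bulk_body(index, doc_type, docs):
    """Serialize docs into the newline-delimited body expected by the _bulk API."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index and type
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline

body = bulk_body("sensors-2016-03", "reading", [{"oil_temp": 80}, {"RPM": 2500}])
```

In practice the Python client's bulk helpers do this serialization for you; the sketch only shows what goes over the wire.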

MILLIONS OF QUERIES…

FRONTING WITH NGINX

• Reverse Proxy
• Load-Balance across cluster
• Authentication
• read-only access
• block HTTP DELETE
• Log: request-body, response-time, response-size, etc…
• Keep-alive connections
• t2.micro often suffices

ES SETUP: CONFIG

• index.*.slowlog = DEBUG
• Plugins:
  • Cloud-AWS
    • IAM Roles for auto-discovery
    • S3 Snapshot
    • Changes in 2.x!
  • Curator
• Master nodes:
  • (N/2)+1 = discovery.zen.minimum_master_nodes
  • bootstrap.mlockall: true
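The master-node setting above is a quorum: a majority of the master-eligible nodes, which is what prevents split-brain. Reading the rule as (N / 2) + 1 with integer division, a tiny sketch (the function name is hypothetical):

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1
```

With three master-eligible nodes this gives 2, so a single isolated node can never elect itself master.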

ES SETUP: HARDWARE

• r3.2xlarge instance - 61 GiB RAM, 8 vCPUs
• 30GB Heap
• AMI deployment
• 500GB mounted EBS (commonly)
• path.data = /mnt-point
• Cluster-per-customer
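The 30GB heap above is in line with commonly cited guidance: give the JVM about half of the machine's RAM and leave the rest to the filesystem cache, but stay below roughly 32GB so compressed object pointers remain enabled. A sketch of that rule of thumb; the function name and the 30GB default ceiling are assumptions for illustration.

```python
def suggested_heap_gb(ram_gb, ceiling_gb=30):
    """Half of RAM, capped at ceiling_gb so the JVM can keep using
    compressed object pointers (the ceiling is an assumed safe value)."""
    return min(ram_gb // 2, ceiling_gb)
```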

ES SETUP: INDICES

• Many indices per project (use-case)
• Mappings differ considerably
• Unixtime in seconds (2.x changes)
• Type coercion (GIGO)
• _source: true
• Groovy scripting: off!
• Index segmentation depends
  • by… month, month-year, device, etc…
• Sparse data sucks

oil_temp   RPM
80
           2500
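For the month-based index segmentation mentioned on this slide, the target index can be derived from each document's Unix timestamp at write time. A minimal sketch, assuming timestamps in seconds and UTC boundaries; `monthly_index` and the "prefix-YYYY-MM" naming scheme are hypothetical.

```python
from datetime import datetime, timezone

def monthly_index(prefix, epoch_seconds):
    """Name of the month-based index for a Unix timestamp given in seconds (UTC)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{prefix}-{ts:%Y-%m}"
```

Old months can then be snapshotted or dropped as whole indices rather than deleted document by document.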

UI IMPLICATIONS

• REST API access
• Predikto QDSL
• Dynamic Queries
• Reporting BI Interface