Elasticsearch & Big Data Elasticsearch Meetup 03/15/2016

Elasticsearch Atlanta Meetup 3/15/16


Page 1: Elasticsearch Atlanta Meetup 3/15/16

Elasticsearch & Big Data

Elasticsearch Meetup 03/15/2016

Page 2: Elasticsearch Atlanta Meetup 3/15/16

WHO AM I

• Roy Russo
• VP Engineering, Predikto
• ElasticHQ.org

Page 3: Elasticsearch Atlanta Meetup 3/15/16

AGENDA

• What is “Predictive Analytics”?
• Why Elasticsearch & Spark for ETL?
• Elasticsearch in production
• Lessons learned

Page 4: Elasticsearch Atlanta Meetup 3/15/16

WHY AM I HERE?

& (Big Data) Predictive Analytics

Page 5: Elasticsearch Atlanta Meetup 3/15/16

WHO IS PREDIKTO?

• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers / Statisticians
• Big Data Architects
• Global 1000

Page 6: Elasticsearch Atlanta Meetup 3/15/16

WHAT DO WE DO?

• Predictive Maintenance
• Predictive Analytics
• Asset Health
• Anomalies
• SaaS

Page 7: Elasticsearch Atlanta Meetup 3/15/16

HOW DOES PREDICTIVE ANALYTICS WORK?

Page 8: Elasticsearch Atlanta Meetup 3/15/16

Data Sources

Page 9: Elasticsearch Atlanta Meetup 3/15/16

PREDIKTO SAAS PLATFORM

[Diagram] Data sources: OEM Specs, Inventory, Operator, Weather, Financial, Sensors, Asset Mgmt.

[Diagram] Platform blocks: INPUT (APIs), DATA LAKE, TRANSFORM, BUSINESS INTELLIGENCE, LEARN, OUTPUT (APIs), MAX, PREDIKTO APPS

Page 10: Elasticsearch Atlanta Meetup 3/15/16

THE PREDIKTO ADVANTAGE (PATENT PENDING)

MAX

• Raw data
• Auto dynamic engineered features
• Auto dynamic feature selection
• Method and algorithm selection
• Dynamic self learning

Page 11: Elasticsearch Atlanta Meetup 3/15/16

MACHINE LEARNING AUTOMATION & SCALABILITY

1. Load Raw Data (1,000)

Load data from Sensors, Maintenance and Repairs, Usage, Topology, Weather,

2. Auto dynamic engineered features (125,000; X 125)

Average std. deviation of pressure sensor in the last hour, compared to avg. std. dev. of pressure sensor last week for locomotives leaving the same depot, and in the same series of make and model.

3. Auto dynamic feature selection (37)

Software algorithms and logic find the best raw data and/or engineered features that correlate to the best representation of the health and condition of the equipment.

MAX
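
The engineered-feature example on this slide (last-hour standard deviation of a pressure sensor compared with last week's) might be computed roughly as follows. This is a minimal sketch, not Predikto's implementation: the function name `stddev_ratio`, the `(epoch_seconds, value)` reading layout, and the use of a simple ratio are all assumptions, and the grouping by depot and make/model is left out.

```python
from statistics import pstdev

HOUR = 3600
WEEK = 7 * 24 * HOUR


def stddev_ratio(readings, now):
    """Ratio of last-hour std. dev. to last-week std. dev. of one sensor.

    readings: iterable of (epoch_seconds, value) pairs (hypothetical layout).
    Returns None when either window has fewer than two points or the
    weekly std. dev. is zero.
    """
    readings = list(readings)
    last_hour = [v for t, v in readings if now - HOUR <= t <= now]
    last_week = [v for t, v in readings if now - WEEK <= t <= now]
    if len(last_hour) < 2 or len(last_week) < 2:
        return None
    weekly = pstdev(last_week)
    if weekly == 0:
        return None
    return pstdev(last_hour) / weekly
```

A ratio well above 1 would suggest the sensor has been noisier in the last hour than over the past week; what threshold matters is a modelling decision this sketch does not address.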

Page 12: Elasticsearch Atlanta Meetup 3/15/16

IN PRODUCTION

• Apache Spark Integration
• Why Elasticsearch?
• Nginx
• Elasticsearch Configuration

Page 13: Elasticsearch Atlanta Meetup 3/15/16

SO WE USE SPARK…

• ETL:
• Shared Memory
• Not Disk-Bound
• Distributed workloads

• Scale horizontally
• New node = more capacity
• Spin up. Spin down.

• Runs from Instruction-set
• DAGs

Page 14: Elasticsearch Atlanta Meetup 3/15/16

ES-Hadoop: Setup

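from elasticsearch import Elasticsearch

# Assumes es_uri, es_host_name, es_host_port, es_index, es_mapping, some_var,
# reformat_location and the SparkContext `sc` are defined elsewhere.
client = Elasticsearch(hosts=[es_uri])

# some_var must already be a JSON array string for the terms filter
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var + '}}}}}'

es_conf = {
    "es.nodes": es_host_name,
    "es.port": es_host_port,
    "es.resource": es_index + '/' + es_mapping,
    "es.query": query,
}

es_RDD = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

# Each record is a (key, document) pair; keep only the document
dicts = es_RDD.map(lambda x: x[1]).map(reformat_location)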

Page 15: Elasticsearch Atlanta Meetup 3/15/16

ES-Hadoop: Tips

• Connects to specific shards
• Spark workers - to - ES (primary) shards

• Typing matters
• Beware of ‘index.mapping.coerce’

• Ignores ‘size’ parameter!
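
On the typing tip: one way to keep badly typed values out is to declare the mapping explicitly and turn coercion off when the index is created. The sketch below assumes Elasticsearch 2.x mapping syntax; the type name `reading` and the field names are hypothetical, and whether to disable coercion at all is a per-project choice.

```python
def index_body(doc_type="reading"):
    """Build an index-creation body with coercion disabled (ES 2.x style)."""
    return {
        "settings": {
            # With coercion off, a string such as "80" sent to a numeric
            # field is rejected instead of being silently converted.
            "index.mapping.coerce": False
        },
        "mappings": {
            doc_type: {
                "properties": {
                    "date_epoch": {"type": "long"},
                    "oil_temp": {"type": "double"},
                    "rpm": {"type": "integer"},
                }
            }
        },
    }
```

With the Python client of that era, the body could be passed as `client.indices.create(index=..., body=index_body())`.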

Page 16: Elasticsearch Atlanta Meetup 3/15/16

WHY ELASTICSEARCH?

• Time-Series Data
• Fast-Reads

• Fast writes with Bulk Inserts
• Asynch

• Dynamic querying
• Differing schema
• Everything is indexed
• Visualization

• GeoJSON
• Scale horizontally

• New node = more capacity
• Spark-ES Connector
• Python lib
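
For the "fast writes with bulk inserts" point, a minimal sketch using the `helpers.bulk` function of the official Python client. The index name, type name, and document shape are placeholders, and the `_type` key assumes a 2.x-era cluster.

```python
def bulk_actions(index, doc_type, docs):
    """Yield one bulk action per document, in the shape helpers.bulk accepts."""
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}


def bulk_index(client, index, doc_type, docs, chunk_size=500):
    """Send docs through the bulk API; returns (success_count, errors)."""
    # Imported here so bulk_actions works without the client installed.
    from elasticsearch import helpers
    return helpers.bulk(client, bulk_actions(index, doc_type, docs),
                        chunk_size=chunk_size)
```

Because `bulk_actions` is a generator, documents are streamed to the cluster in chunks rather than held in memory all at once.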

Page 17: Elasticsearch Atlanta Meetup 3/15/16

MILLIONS OF QUERIES…

Page 18: Elasticsearch Atlanta Meetup 3/15/16

FRONTING WITH NGINX

• Reverse Proxy
• Load-Balance across cluster
• Authentication

• read-only access
• block HTTP DELETE

• Log:
• request-body, response-time, response-size, etc…

• Keep-alive connections
• t2.micro often suffices

Page 19: Elasticsearch Atlanta Meetup 3/15/16

ES SETUP: CONFIG

• index.*.slowlog = DEBUG
• Plugins:

• Cloud-AWS
• IAM Roles for auto-discovery
• S3 Snapshot
• Changes in 2.x!

• Curator
• Master nodes:

• (N/2)+1 = discovery.zen.minimum_master_nodes
• bootstrap.mlockall : true
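
The `discovery.zen.minimum_master_nodes` value is normally a quorum of the master-eligible nodes, that is floor(N/2) + 1. A small sketch of the arithmetic (the function name is illustrative):

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1
```

With three master-eligible nodes this gives 2, so a single isolated node cannot elect itself master.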

Page 20: Elasticsearch Atlanta Meetup 3/15/16

ES SETUP: HARDWARE

• r3.2xlarge instance - 64GB Quad Core
• 30GB Heap

• AMI deployment
• 500GB mounted EBS (commonly)

• path.data = /mnt-point
• Cluster-per-customer

Page 21: Elasticsearch Atlanta Meetup 3/15/16

ES SETUP: INDICES

• Many indices per project (use-case)
• Mappings differ considerably

• Unixtime in seconds (2.x changes)
• Type coercion (GIGO)
• _source: true

• Groovy scripting : off!
• Index segmentation depends

• by… month, month-year, device, etc…
• Sparse data sucks

oil_temp: 80

RPM: 2500
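
Segmenting indices by month, as mentioned on this slide, could be done by deriving the index name from each document's Unix timestamp. A minimal sketch, assuming timestamps in seconds and UTC; the `prefix-YYYY-MM` naming scheme is an illustration, not necessarily the one used in production.

```python
from datetime import datetime, timezone


def monthly_index(prefix, epoch_seconds):
    """Return an index name such as 'sensors-2016-03' for a Unix timestamp (UTC)."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return "{}-{:%Y-%m}".format(prefix, dt)
```

Date-stamped names like these also make it simple to expire old data by dropping whole indices, for example with Curator.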

Page 22: Elasticsearch Atlanta Meetup 3/15/16

UI IMPLICATIONS

• REST API access
• Predikto QDSL
• Dynamic Queries
• Reporting BI Interface