Elasticsearch & Big Data
Elasticsearch Meetup 03/15/2016
WHO AM I
• Roy Russo
• VP Engineering, Predikto
• ElasticHQ.org
AGENDA
• What is “Predictive Analytics”?
• Why Elasticsearch & Spark for ETL?
• Elasticsearch in production
• Lessons learned
WHY AM I HERE?
& (Big Data) Predictive Analytics
WHO IS PREDIKTO?
• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers / Statisticians
• Big Data Architects
• Global 1000
WHAT WE DO?
• Predictive Maintenance
• Predictive Analytics
• Asset Health
• Anomalies
• SaaS
HOW DOES PREDICTIVE ANALYTICS WORK?
Data Sources
PREDIKTO SAAS PLATFORM
OEM Specs
Inventory
Operator
Weather
Financial
Sensors
Asset Mgmt.
INPUT (APIs)
DATA LAKE, TRANSFORM, BUSINESS INTELLIGENCE
LEARN, OUTPUT (APIs)
MAX
PREDIKTOAPPS
THE PREDIKTO ADVANTAGE (PATENT PENDING)
MAX
Raw data
Auto dynamic engineered features
Auto dynamic feature selection
Method and algorithm selection
Dynamic self learning
Auto dynamic engineered features
Average std. deviation of the pressure sensor in the last hour, compared to the avg. std. dev. of the pressure sensor last week for locomotives leaving the same depot, and in the same series of make and model.
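The engineered feature described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Predikto's implementation: the function name `stddev_ratio`, the data layout (one list of readings per comparable locomotive), and the sample values are all hypothetical.

```python
from statistics import mean, pstdev


def stddev_ratio(last_hour, last_week_by_asset):
    """Std. dev. of one asset's pressure readings in the last hour, divided by
    the average std. dev. of last week's readings across comparable assets
    (same depot, same make/model series). Returns None if the baseline is 0."""
    baseline = mean(pstdev(readings) for readings in last_week_by_asset.values())
    if baseline == 0:
        return None
    return pstdev(last_hour) / baseline


# Hypothetical sample data: pressure readings for one locomotive and its peers.
last_hour = [100.0, 104.0, 96.0, 100.0]
last_week = {
    "loco-1": [100.0, 102.0, 98.0, 100.0],
    "loco-2": [99.0, 101.0, 99.0, 101.0],
}
ratio = stddev_ratio(last_hour, last_week)
```

A ratio well above 1 means the sensor was noisier in the last hour than its peers were over the last week, which is the kind of signal a later selection step can keep or discard.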
Step 2: 125,000 (X 125)
MACHINE LEARNING AUTOMATION & SCALABILITY
Load Raw Data
Load data from Sensors, Maintenance and Repairs, Usage, Topology, Weather, …
Step 1: 1,000
Auto dynamic feature selection
Step 3: 37
Software algorithms and logic find the raw data and/or engineered features that correlate best with the health and condition of the equipment.
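The slides do not say which selection method is used. As a stand-in, the sketch below ranks features by the absolute Pearson correlation with a health label and keeps the top k. The feature names, the label `failed_within_week`, and the data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(vx * vy)


def select_features(features, target, k):
    """Return the names of the k features with the largest |r| against target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical feature columns, one value per observation window.
features = {
    "pressure_std_ratio": [1.0, 1.2, 2.5, 3.1, 0.9],
    "ambient_temp": [20.0, 25.0, 21.0, 24.0, 22.0],
    "firmware_version": [3.0, 3.0, 3.0, 3.0, 3.0],
}
failed_within_week = [0, 0, 1, 1, 0]
selected = select_features(features, failed_within_week, k=1)
```

Here `selected` contains only `pressure_std_ratio`; the constant `firmware_version` column scores 0 and ranks last.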
MAX
IN PRODUCTION
• Apache Spark Integration
• Why Elasticsearch?
• Nginx
• Elasticsearch Configuration
SO WE USE SPARK…
• ETL:
• Shared Memory
• Not Disk-Bound
• Distributed workloads
• Scale horizontally
• New node = more capacity
• Spin up. Spin down.
• Runs from Instruction-set
• DAGs
ES-Hadoop: Setup
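from elasticsearch import Elasticsearch

# es_uri, es_host_name, es_host_port, es_index, es_mapping, some_var and
# reformat_location are defined elsewhere; sc is the active SparkContext.
client = Elasticsearch(hosts=[es_uri])

# some_var must already be a JSON-encoded list of epoch values.
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var + '}}}}}'

# ES-Hadoop connector settings: which nodes, which index/type, which query.
es_conf = {
    "es.nodes": es_host_name,
    "es.port": es_host_port,
    "es.resource": es_index + '/' + es_mapping,
    "es.query": query,
}

# Each record in the RDD is a (key, document) pair.
es_RDD = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

# Keep only the document, then pass it through reformat_location.
dicts = es_RDD.map(lambda x: x[1]).map(reformat_location)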
ES-Hadoop: Tips
• Connects to specific shards
• Spark workers - to - ES (primary) shards
• Typing matters
• Beware of ‘index.mapping.coerce’
• Ignores ‘size’ parameter!
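Two of the tips above can be sketched in Python, under assumptions. The first part is an index-settings body that turns coercion off, so that a value such as "80" sent to a numeric field is rejected instead of silently converted. The second builds the `es.query` string with `json.dumps`; since the connector ignores `size`, the filter itself has to bound the result set. The helper name `build_es_query` and the epoch values are hypothetical, and the `filtered` query mirrors the setup slide.

```python
import json

# Index settings that disable coercion for every field in the index.
index_body = {"settings": {"index.mapping.coerce": False}}


def build_es_query(epochs):
    """Serialize a terms filter on date_epoch. No 'size' key is added because
    the connector ignores it; the filter alone limits what is read."""
    return json.dumps(
        {"query": {"filtered": {"filter": {"terms": {"date_epoch": sorted(epochs)}}}}}
    )


query = build_es_query({1458000000, 1458003600})
```

Building the query from Python values avoids hand-concatenated JSON, which is where type mistakes tend to slip in.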
WHY ELASTICSEARCH?
• Time-Series Data
• Fast Reads
• Fast writes with Bulk Inserts
• Async
• Dynamic querying
• Differing schema
• Everything is indexed
• Visualization
• GeoJSON
• Scale horizontally
• New node = more capacity
• Spark-ES Connector
• Python lib
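As a rough sketch of the bulk-insert point above: the Python client's `elasticsearch.helpers.bulk` accepts an iterable of action dicts, so a generator can stream readings without building one large request by hand. The index name, type name, and readings below are hypothetical, and the `_type` key assumes a 2.x-era cluster that still uses mapping types.

```python
def bulk_actions(index, doc_type, readings):
    """Yield one bulk action per sensor reading, in the dict shape
    accepted by elasticsearch.helpers.bulk."""
    for reading in readings:
        yield {"_index": index, "_type": doc_type, "_source": reading}


# Hypothetical readings; documents need not share the same fields.
readings = [
    {"asset_id": "loco-1", "date_epoch": 1458000000, "oil_temp": 80},
    {"asset_id": "loco-1", "date_epoch": 1458000060, "RPM": 2500},
]
actions = list(bulk_actions("sensors-2016-03", "reading", readings))

# With a live cluster the generator would be handed to the helper directly:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(hosts=[es_uri]), bulk_actions(...))
```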
MILLIONS OF QUERIES…
FRONTING WITH NGINX
• Reverse Proxy
• Load-Balance across cluster
• Authentication
• read-only access
• block HTTP DELETE
• Log:
• request-body, response-time, response-size, etc…
• Keep-alive connections
• t2.micro often suffices
ES SETUP: CONFIG
• index.*.slowlog = DEBUG
• Plugins:
• Cloud-AWS
• IAM Roles for auto-discovery
• S3 Snapshot
• Changes in 2.x!
• Curator
• Master nodes:
• (N/2)+1 = discovery.zen.minimum_master_nodes
• bootstrap.mlockall : true
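The quorum rule for `discovery.zen.minimum_master_nodes` is a majority of the master-eligible nodes, using integer division. A small sketch, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum of master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Quorum for a few common cluster sizes.
quorums = {n: minimum_master_nodes(n) for n in (1, 2, 3, 5)}
```

With three master-eligible nodes the value is 2, so a single isolated node cannot elect itself master.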
ES SETUP: HARDWARE
• r3.2xlarge instance - 64GB Quad Core
• 30GB Heap
• AMI deployment
• 500GB mounted EBS (commonly)
• path.data = /mnt-point
• Cluster-per-customer
ES SETUP: INDICES
• Many indices per project (use-case)
• Mappings differ considerably
• Unixtime in seconds (2.x changes)
• Type coercion (GIGO)
• source: true
• Groovy scripting: off!
• Index segmentation depends
• by… month, month-year, device, etc…
• Sparse data sucks
oil_temp RPM
80 2500
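The index conventions above can be sketched as a mapping body plus a per-month index name. This is an illustration under assumptions, not the production mapping: the type name `reading`, the `sensors` prefix, and the numeric field types are hypothetical. The `epoch_second` date format is what lets a 2.x cluster read Unix time in seconds.

```python
import json
from datetime import datetime, timezone

# Mapping sketch: _source kept, Unix time in seconds, coercion off per field.
mapping = {
    "mappings": {
        "reading": {
            "_source": {"enabled": True},
            "properties": {
                "date_epoch": {"type": "date", "format": "epoch_second"},
                "oil_temp": {"type": "double", "coerce": False},
                "RPM": {"type": "integer", "coerce": False},
            },
        }
    }
}
body = json.dumps(mapping)


def index_for(prefix, epoch_seconds):
    """Name the index by UTC year and month of the reading."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return "{}-{:%Y-%m}".format(prefix, stamp)


index_name = index_for("sensors", 1458000000)
```

Segmenting by month keeps each index small enough to drop or snapshot on its own; the same helper could key on device instead.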
UI IMPLICATIONS
• REST API access
• Predikto QDSL
• Dynamic Queries
• Reporting BI Interface