Upload
wso2
View
109
Download
3
Tags:
Embed Size (px)
DESCRIPTION
Citation preview
Learn with WSO2 -‐ Building your Big Data Solu8on
Srinath Perera Director of Research
WSO2 Inc.
About WSO2
• Providing the only complete open source componentized cloud platform
– Dedicated to removing all the stumbling blocks to enterprise agility – Enabling you to focus on business logic and business value
• Recognized by leading analyst firms as visionaries and leaders
– Gartner cites WSO2 as visionaries in all 3 categories of application infrastructure
– Forrester places WSO2 in top 2 for API Management • Global corporation with offices in USA, UK & Sri Lanka
– 200+ employees and growing
• Business model of selling comprehensive support & maintenance for our products
150+ globally positioned support customers
Consider a day in your life • What is the best road to take? • Would there be any bad weather?
• What is the best way to invest the money?
• Should I take that loan? • Can I op8mize my day? • Is there a way to do this faster?
• What have others done in similar cases?
• Which product should I buy?
People wanted to (through ages) • To know (what happened?)
• To Explain (why it happened)
• To Predict (what will happen?)
What is Big data?
• There is lot of data available – E.g. Internet of things
• We have compu8ng power • We have technology • Goal is same
– To know – To Explain – To predict
• Challenge is the full lifecycle
Drivers of Big Data
Data Avalanche/ Moore’s law of data
• We are now collec8ng and conver8ng large amount of data to digital forms
• 90% of the data in the world today was created within the past two years.
• Amount of data we have doubles very fast
In real life, most data are Big • Web does millions of ac8vi8es per second, and so much server logs are created.
• Social networks e.g. Facebook, 800 Million ac8ve users, 40 billion photos from its user base.
• There are >4 billion phones and >25% are smart phones. There are billions of RFID tags.
• Observa8onal and Sensor data – Weather Radars, Balloons – Environmental Sensors – Telescopes – Complex physics simula8ons
Why Big Data is hard? • How store? Assuming 1TB bytes it
takes 1000 computers to store a 1PB • How to move? Assuming 10Gb
network, it takes 2 hours to copy 1TB, or 83 days to copy a 1PB
• How to search? Assuming each record is 1KB and one machine can process 1000 records per sec, it needs 277CPU days to process a 1TB and 785 CPU years to process a 1 PB
• How to process? – How to convert algorithms to work in
large size – How to create new algorithms
hap://www.susanica.com/photo/9
Why it is hard (Contd.)? • System build of many computers
• That handles lots of data • Running complex logic • This pushes us to fron8er of Distributed Systems and Databases
• More data does not mean there is a simple model
• Some models can be complex as the system
hap://www.flickr.com/photos/mariachily/5250487136, Licensed CC
Big Data Architecture
WSO2 Offerings
• Two tools – WSO2 BAM for store and process – WSO2 CEP for real8me processing
• These tools covers whole processing life cycle for your Big Data with help of few other products as needed. – WSO2 Storage server – WSO2 User Experience Server
Big Data Architecture Implementa8on
Sensors • Built sensors in WSO2 Products
• Event logs – Click streams, Emails, chat, search, tweets ,Transac8ons …
• Custom Sensors – Video surveillance, Cash flows, Traffic, Surveillance, Smart Grid, Produc8on line, RFID (e.g. Walmart), GPS sensors, Mobile Phone, Internet of Things
hap://www.flickr.com/photos/imuaoo/4257813689/ by Ian Muaoo,
hap://www.flickr.com/photos/eastcapital/4554220770/, hap://www.flickr.com/photos/patdavid/4619331472/ by Pat David copyright CC
Collec8ng Data • Data collected at sensors and sent to big data system via events or flat files
• Event Streams: we name the events by its content/ originator
• Get data through – Point to Point – Event Bus
• E.g. Data bridge – a thrij based transport we did that do about 400k events/ sec
Storing Data • Historically we used databases
– Scale is a challenge: replica8on, sharding
• Scalable op8ons – NoSQL (Cassandra, Hbase) [If data is structured]
• Column families Gaining Ground – Distributed file systems (e.g. HDFS) [If data is unstructured]
• New SQL – In Memory compu8ng, VoltDB
• Specialized data structures – Graph Databases, Data structure servers hap://www.flickr.com/photos/keso/
363133967/
Storing Data (Contd.)
• WSO2 Offerings (WSO2 Storage Server) – Small Structured Data: keep in rela8onal databases.
– Large structured data : Cassandra – Large unstructured data: HDFS
Making Sense of Data • To know (what happened?)
– Basic analy8cs + visualiza8ons (min, max, average, histogram, distribu8ons … )
– Interac8ve drill down • To explain (why)
– Data mining, classifica8ons, building models, clustering
• To forecast – Neural networks, decision models
Making Sense of Data (Contd.)
• Batch processing – WSO2 BAM – Hive Scripts – Map Reduce Jobs
• Real 8me processing – CEP – Event Query Language
• Above two are the plarorm, you need to program your usecase.
To know (what happened?) • Mainly Analy8cs
– Min, Max, average, correla8on, histograms
– Might join group data in many ways
• Implemented with MapReduce or Queries
• Data is ojen presented with some visualiza8ons
• Examples – forensics – Assessments – Historical data/ reports/ trends
hap://www.flickr.com/photos/isriya/2967310333/
To Explain (Paaerns) • Correla8on
– Scaaer plot, sta8s8cal correla8on
• Data Mining (Detec8ng Paaerns) – Clustering and classifica8on – Finding Similar items – Finding Hubs and authori8es in a Graph
– Finding frequent item sets – Making recommenda8on
• Apache Mahout hap://www.flickr.com/photos/eriwst/2987739376/ and hap://www.flickr.com/photos/focx/5035444779/
To Predict: Forecasts and Models • Trying to build a model for the
data • Theore8cally or empirically
– Analy8cal models (e.g. Physics) – Neural networks – Reinforcement learning – Unsupervised learning (clustering, dimensionality reduc8on, kernel methods)
• Examples – Transla8on – Weather Forecast models – Building profiles of users – Traffic models – Economic models
• Lot of domain specific work
hap://misterbijou.blogspot.com/2010_09_01_archive.html
Informa8on Visualiza8on
• Presen8ng informa8on – To end user – To decision takers – To scien8st
• Interac8ve explora8on • Sending alerts • WSO2 UES
– Jaggery based • BAM/ CEP can Work with most other UI tools
hap://www.flickr.com/photos/stevefaeembra/3604686097/
WSO2 UES
• Dashboards, and Store • Build your own Uis with Jaggery
MapReduce/ Hadoop • First introduced by Google, and used as the processing model for their architecture
• Implemented by opensource projects like Apache Hadoop and Spark
• Users writes two func8ons: map and reduce
• The framework handles the details like distributed processing, fault tolerance, load balancing etc.
• Widely used, and the one of the catalyst of Big data
void map(ctx, k, v){ tokens = v.split(); for t in tokens ctx.emit(t,1)
} void reduce(ctx, k, values[]){
count = 0; for v in values count = count + v; ctx.emit(k,count);
}
MapReduce (Contd.)
Data In the Move • Idea is to process data as they
are received in streaming fashion
• Used when we need – Very fast output – Lots of events (few 100k to millions)
– Processing without storing (e.g. too much data)
• Two main technologies – Stream Processing (e.g. Strom, hap://storm-‐project.net/ )
– Complex Event Processing (CEP) hap://wso2.com/products/complex-‐event-‐processor/
Complex Event Processing (CEP) • Sees inputs as Event streams and queried with SQL like language
• Supports Filters, Windows, Join, Paaerns and Sequences
from p=PINChangeEvents#win.time(3600) join t=TransactionEvents[p.custid=custid][amount>10000] #win.time(3600)
return t.custid, t.amount;
Summary
Case Study 1: Tracing Business Process • Business process is built using many services
• Track trace each step, and analyze to understand how to op8mize
• E.g. sales pipeline
Some Queries
• Conversion rate? • How many deals in pipeline at each month? • Average size of the deals? • Average 8me deal takes? • Can we guess an large size deals early? • Which is beaer? Going for few large ones or many small ones?
• Was there any delays from Ourside?
Hive: Average Size of the Deal
• Hive uses an SQL like synatax. • Easy to understand and learn
hive> LOAD DATA ..
hive> SELECT avg(value) from LEAD_ACTIVITY WHERE action=“closedWon” groupby month;
Map Reduce: How many deals in Pipeline?
How many deals in Pipeline?(Contd.) void map(ctx, k, v){
Deals deal= parse(v); int month = getMonth(deal.time); ctx.emit(month,1)
} void reduce(ctx, k, values[]){
count = 0; for v in values count = count + v; ctx.emit(k,count);
}
Case study 2: DEBS Challenge • Event Processing challenge
• Real football game, sensors in player shoes + ball
• Events in 15k Hz • Event format
– Sensor ID, TS, x, y, z, v, a
• Queries – Running Stats – Ball Possession – Heat Map of Ac8vity – Shots at Goal
Example: Detect ball Possession • Possession is 8me a player hit the ball un8l someone else hits it or it goes out of the ground
from Ball#window.length(1) as b join Players#window.length(1) as p
unidirectional on debs: getDistance(b.x,b.y,b.z, p.x, p.y, p.z) < 1000 and b.a > 55 select ... insert into hitStream from old = hitStream ,
b = hitStream [old. pid != pid ], n= hitStream[b.pid == pid]*,
( e1 = hitStream[b.pid != pid ] or e2= ballLeavingHitStream)
select ... insert into BallPossessionStream
hap://www.flickr.com/photos/glennharper/146164820/
Conclusions
• What is Big Data? • Big Data Architecture
– Collec8ng data – Storing data – Processing Data
• WSO2 Offerings • Case Studies
Ques%ons?
Engage with WSO2
• Helping you get the most out of your deployments • From project evaluation and inception to development
and going into production, WSO2 is your partner in ensuring 100% project success