BigData @ Digital Factory
A short story still being written
Olivier Varene, DSIF/DFY
Orange DevDay 2013
Hadoop

Main Distributions | Licence                               | Business Model                       | Support
Apache             | Apache 2.0                            | Foundation                           | community only
HortonWorks        | Apache 2.0, HortonWorks (add-on)      | PS + Training + support              | community + Professional
Cloudera           | Apache 2.0, Closed Source (not core)  | PS + Licensing + Training + support  | community + Professional
MapR               | Apache 2.0, Closed Source (FS)        | PS + Licensing + support             | community + Professional
WanDisco           | Apache 2.0, Closed Source (DConE)     | PS + Licensing + Training + support  | community + Professional

PS: Professional Services
Big Name Distributions
• IBM InfoSphere BigInsights
• GreenPlum (EMC)
• Intel Distribution for Hadoop
• …
Paid and closed source
Tools (1st level)

Tool             | Description               | Licence
Apache Pig       | Scripting Platform        | Apache 2
Apache Hive      | Data Access & Query       | Apache 2
Apache HCatalog  | Metadata Services         | Apache 2
Apache HBase     | NoSQL Database            | Apache 2
Apache ZooKeeper | Cluster Coordination      | Apache 2
Apache Tez       | Query processing          | Apache 2
Apache Oozie     | Workflow Scheduler        | Apache 2
Apache Sqoop     | Data Integration Services | Apache 2
Tools (add-ons)

Tool                | Description                    | Licence
Teradata connector  | Connector                      | Teradata + Distribution
Hive ODBC           | ODBC                           | Distribution
Mahout              | Data Mining                    | Apache 2
Cascading           | Fault Tolerant API / Framework | Apache 2
Cassandra Connector | Connector to Cassandra NoSQL   | Apache 2
MongoDB Connector   | Connector to MongoDB           | Apache 2
…
Landscape
@ Digital Factory
DSIF / Digital Factory
Back in Time (3 years ago)
• PageRank computation on billions of nodes and tens of billions of edges
• Regularly failed (hardware ...)
• 4 to 8 weeks per computation
• Not scalable
• Failure rate around 80%
• One person full time to supervise
Answer?
Internal Development
+ full control
- long term
- €€
OpenSource
+ €€
+ short term
- support
- evolution
How does it work?
Hadoop Axioms
• System shall manage and heal itself
• Performance shall scale linearly
• Compute shall move to data
• Modular and extensible
HDFS (Simple)
Self-healing High-Bandwidth Clustered Storage
YARN
Allows plugging in new paradigms
MapReduce V1
[Diagram: data on HDFS feeds several Map() tasks; their output is sorted and partitioned, then passed to the Reduce() tasks, which write the partXX output files. Two phases: Map, Reduce.]
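The map, sort/partition and reduce stages of the MapReduce V1 slide can be sketched on a single machine. This is only an illustration in plain Python, not Hadoop code: the word-count job, the names map_fn, reduce_fn and run_job, and the crc32-based partitioner are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def map_fn(_key, line):
    """Map(): emit (word, 1) for every word of one input record."""
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce(): sum every value received for one key."""
    yield key, sum(values)


def run_job(lines, num_reducers=2):
    """Run map -> partition -> sort -> reduce in memory.

    Returns one list of (key, value) pairs per reducer, which plays
    the role of the partXX output files.
    """
    # Map phase: every record is processed independently.
    mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]

    # Partition phase: a key always goes to the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)

    # Sort + reduce phase: each reducer sees its keys in sorted order.
    output = []
    for part in partitions:
        part_out = []
        for key in sorted(part):
            part_out.extend(reduce_fn(key, part[key]))
        output.append(part_out)
    return output
```

Calling run_job(["a b a", "b c"]) returns two lists, one per reducer. Taken together they hold the counts a=2, b=2, c=1.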
Before map()
[Diagram: data on HDFS is sliced and partitioned into blocks of data, each handed to a Map() task. The JobTracker calculates locality for job assignment and input split data. Each Map() turns (Kin,Vin) into (Kout,Vout).]
Java (API): Mapper
class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
    [void setup(Context context);]    // optional
    [void cleanup(Context context);]  // optional
    void map(Kin key, Vin value, Context context) {
        … your program …
    }
}
Before reduce()
[Diagram: Map() emits (Kout,Vout) pairs, which are sorted in RAM and written to disk as temporary intermediate files, sorted within each file. OPTIONAL: Combine() runs 1 or more times over these files, again with RAM sorting and disk writes, producing new temporary intermediate files of (Kout,Vout). Partition() then splits the key namespace into parts, which the JobTracker distributes to the reducers.]
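The map-side steps of this slide (sort in RAM, optional Combine(), Partition()) can be sketched as follows. The sketch follows the order drawn on the slide and is an illustration in plain Python, not Hadoop's actual implementation: combine, partition and map_side are hypothetical names, and the combiner assumes a sum-style job where pre-aggregation is safe.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def combine(sorted_pairs):
    """Optional combiner: pre-aggregate one sorted spill by summing per key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def partition(key, num_reducers):
    """Key namespace partitioning: the same key always maps to the same part."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side(pairs, num_reducers):
    """Sort map output, combine it, then split it into one part per reducer."""
    spill = sorted(pairs, key=itemgetter(0))   # "RAM sorting"
    combined = combine(spill)                  # optional Combine()
    parts = [[] for _ in range(num_reducers)]
    for key, value in combined:                # Partition()
        parts[partition(key, num_reducers)].append((key, value))
    return parts
```

The combiner only reduces the volume shipped to the reducers. The reducers still have to merge values for the same key coming from different map tasks.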
Java (API): Reducer
class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
    [void setup(Context context);]    // optional
    [void cleanup(Context context);]  // optional
    void reduce(Kin key, Iterable<Vin> values, Context context) {
        … your program …
    }
}
Optimization tips
• JVM
• Algorithm in MapReduce paradigm
• Combiner
• Sort algorithm
• Partitioning
Streaming
… | <mapper> | … | <reducer> | …
• STDIN
• STDOUT
• Text as input and output by default
• '\t' as default separator
• Use your language: perl, python, shell, ruby, …
• (interpreter needed on all nodes)
hadoop jar $streamingJar -input <inputDir> -output <outputDir>
    -mapper <mapProg> -reducer <reduceProg> -file <files>
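A streaming mapper and reducer only need to read lines on STDIN and write tab-separated key/value lines on STDOUT. Here is a minimal word-count sketch in Python. The script name wordcount.py and the map/reduce command-line argument are assumptions made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; input lines must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: "python wordcount.py map" or "python wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same pipeline can be tried locally, without a cluster, with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. On the cluster, the two commands would be passed as <mapProg> and <reduceProg>, and the script shipped with -file.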
Pipes - C++
… | <mapper> | … | <reducer> | …
• Socket communication
• Bytes as input and output
• C++ API
hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir>
    -program <path/binfile> [-conf <confFile>]
class MyMap: public HadoopPipes::Mapper { … }
class MyReducer: public HadoopPipes::Reducer { … }
Too difficult
Fortunately, there are tools that can generate code for you or let you run SQL queries!
Tools
Algo / Libs
PIG
Scripting Language:
• Simple
• Parallel execution
• Data oriented
• Extensible via UDF
• Automatic performance enhancement via compiler

set job.name calculateGraphDegres

%default nbpigreducers 10
set default_parallel $nbpigreducers

-- out-degrees
A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';

-- distribution of out-degrees
C = foreach A generate out_deg,1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';
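To make the two outputs of the Pig script concrete, here is the same logic on an in-memory list. This is a plain Python illustration, not something Pig generates, and out_degree_report is a hypothetical name. The first result mirrors B (entries with out_deg > 1, highest degree first). The second mirrors F (how many URLs have each out-degree, in ascending order, computed on the unfiltered input as in the script).

```python
from collections import Counter


def out_degree_report(rows):
    """rows: iterable of (url, out_deg) pairs, like relation A in the script."""
    rows = list(rows)

    # Equivalent of A2/B: filter out_deg > 1, order by out_deg DESC.
    ordered = sorted(
        (row for row in rows if row[1] > 1),
        key=lambda row: row[1],
        reverse=True,
    )

    # Equivalent of C..F: count occurrences of each out_deg, order ASC.
    distribution = sorted(Counter(deg for _, deg in rows).items())
    return ordered, distribution
```

The Pig version expresses the same two steps declaratively, and Pig compiles them into MapReduce jobs that run in parallel on the cluster.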
Hive
Querying Language:
• HiveQL (SQL-like)
• ETL Tool
• HDFS, HBase, Thrift …
• MapReduce interface (with streaming to python …)
• Extensible via UDF

CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

-- output columns match the four aggregates selected below
CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

-- rlike is used because the patterns are regular expressions anchored with ^
INSERT INTO TABLE b_packet_out
select count(*) as overall,
       sum(if(protocol rlike '^ip:tcp', 1, 0)) as tcp,
       sum(if(protocol rlike '^ip:udp', 1, 0)) as udp,
       sum(if(protocol rlike '^ip:icmp', 1, 0)) as icmp
from b_packet;
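For readers who want to check what the query computes, here is a sketch of the same aggregation over "|"-delimited lines. It assumes the '^ip:tcp' style patterns are meant as prefix matches on the protocol column. protocol_counts is a hypothetical helper written for this example, not a Hive API.

```python
import re


def protocol_counts(lines):
    """Count all packets, plus those whose protocol starts with ip:tcp, ip:udp or ip:icmp.

    Each line is expected as 'timestamp|packet_length|protocol'.
    """
    counts = {"overall": 0, "tcp": 0, "udp": 0, "icmp": 0}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # A short row still counts in 'overall', like count(*) in the query.
        protocol = fields[2] if len(fields) > 2 else ""
        counts["overall"] += 1
        for name in ("tcp", "udp", "icmp"):
            # re.match anchors at the start of the string, like ^ in the pattern.
            if re.match(f"ip:{name}", protocol):
                counts[name] += 1
    return counts
```

Hive runs this as a single MapReduce job whose output is one row with the four counters.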
R
RHadoop:
https://github.com/RevolutionAnalytics/RHadoop/wiki
• rmr: functions providing mapreduce in R
• rhdfs: functions providing hdfs operations in R
• rhbase: functions providing hbase operations in R

library(rmr2)
library(rhdfs)

gdp <- read.csv("GDP_converted.csv")
hdfs.init()
gdp.values <- to.dfs(gdp)

# aaplRevenue is not defined on the slide; it must exist in the session
gdp.map.fn <- function(k,v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}
count.reduce.fn <- function(k,v) {
  keyval(k, length(v))
}
count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
from.dfs(count)
GUI Tools
PoC
• Time saver
• Prototyping
• Visualize complex processes
• Fast changes
But you still need to know the internals to optimize
SQL
[Diagram: SQL on Hadoop. Storage: HBase, HDFS. Engines: Phoenix, Hive, Tajo, Impala, Presto. Interfaces: ODBC/JDBC, HiveQL, JDBC. Dialects: SQL, HQL, ISO, PSQL. Legend: Prod / Beta & Alpha products.]
Sqoop
Transfer from/to HDFS to/from structured storage via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …
[Diagram: RDBMS and NoSQL stores on one side, Hadoop processing on the other, linked by Sqoop import and Sqoop export.]
Oozie
Nowadays @ Digital Factory?
In Production
• Since 2010
• Growth driven by internal project needs
• Recycling servers (€€ savings)
• We learned as we went:
  * tar -> cdh3 -> cdh4 …
  * optimizations
  * Run processes …
Production « PFS »
• Shared among different teams
• xx nodes on COTS
• xxx TBytes
• >xxx jobs per day
• Monitoring: Xymon
• Graphing via NetStat (SNMP / RRD: x’s oids/second)
• Automatic Configuration
Architecture
[Diagram: HDFS, MapReduce, ZooKeeper, HIVE, HIVE Server, PIG, Mahout, Oozie, Khiops, Sqoop, Flume, R, Real Time Query Engine, HCatalog, HBase, Cassandra, Cascading, Web Service, App Services. Legend: some components are marked "in POC".]
Benefits
• Infrastructure cost
• Development cost
• Robustness
• Scalability
• New development areas (Graph Mining, Logs statistics …)
€: -70% LOC, -50% dev time, -75% run cost
A few of our use cases

Scoring - Search Engine
Graph algorithms for http://www.lemoteur.fr/
• xx TB compressed
• xx billion nodes
• >xxx billion edges
• xxx TB in RAM
• xRank

Profiling
Customers’ statistical behaviors, ads display optimization, …
• xxx GB daily + xxx GB monthly (customer DB)
• Customer profile

Log Analysis
OJD certified measurements: Internet and Mobile, customers’ journey analysis, …
• xx billion events daily
• KPIs
Benefits & Drawbacks
Benefits: Scalable, Stable, Run cost, Development cost, Performance, Very fast evolution, New dev areas
Drawbacks: Learning curve, Debug, Algorithms, Complex, Very fast evolution
Future
• Enhance security and robustness
• Create Services & Functional Catalog
• Continue building our expertise: Fast Data, Cascading, MR2, …
• A thousand-node cluster!
• Help other teams go to production
CONTACT US: [email protected]
My Thanks to
• Apache http://www.apache.org/
• http://hadooper.blogspot.com/
• Cloudera http://www.cloudera.com/
• HortonWorks http://www.hortonworks.com/
• And the whole community!