Hadoop @ eBuddy

eBuddy

Web based chat (started in 2003)
● Initially no statistics, MSN only
● Started basic logging in 2004
● Today
  ○ 34,467,010,693 login records (34 × 10^9)
  ○ It takes about 40 min to select them all.

XMS (launched May 23, 2011)
● Today
  ○ 1,334,794,121 records (1.3 × 10^9)
Website (Google Analytics)
Banners (OpenX)

Warehousing needs

● Product owners
  ○ Comparing product versions
    ■ avg duration
    ■ msg sent/received
  ○ Churn analysis
  ○ Feature analysis
● Marketing
  ○ What countries should we focus on?
  ○ What people should we target?
● Sales
  ○ Sell banners in countries/products.
● Operations/Dev
  ○ Help solve bugs
  ○ Blocked in countries/providers

Interesting to know

● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 Hadoop nodes, each having
  ○ 16 cores
  ○ 24 GB RAM
  ○ 4 × 400 GB HDs
● We make money with banners
  ○ So don't expect deep pockets

Warehouse timeline
● Traditional RDBMS (2004)
● Custom MapReduce code (2008)
  ○ Joining two files (merge join/map join?)
  ○ Repeating code
  ○ Consider abstraction
  ○ Changing data, changing code?
● Pig scripts (2008/2009)
  ○ Much simpler to read but domain specific
● Hive (2009)
  ○ Generic SQL but with some limitations
  ○ Existing tools can be used
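The "merge join/map join?" question from the custom MapReduce period can be illustrated outside Hadoop. The sketch below is plain Python, not the original job code; the (key, value) record layout and the names `map_join` and `merge_join` are hypothetical. A map join loads the smaller input into memory as a lookup table, while a merge join walks two inputs that are already sorted on the join key.

```python
def map_join(small, large):
    """Hash (map-side) join: build an in-memory table from the small input,
    then stream the large input past it. Rows are (key, value) tuples."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    for key, value in large:
        for other in lookup.get(key, []):
            yield (key, other, value)


def merge_join(left, right):
    """Merge join: both inputs must already be sorted on the key."""
    left, right = list(left), list(right)
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on each side, emit the cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    yield (lkey, lval, rval)
            i, j = i_end, j_end
```

Both functions produce the same inner-join rows; which one fits depends on whether one side is small enough for memory or both sides arrive sorted.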

Hive

● Hey, I already know this:
    select *
    from table1 t1
      left outer join table2 t2 on (t1.id = t2.id)
    where t2.id is null;
● Java programmers will like this:
  ○ Spring JdbcTemplates
  ○ Existing JDBC tools (SQuirreL)
  ○ Syntax highlighting
  ○ Code completion

Present
● App servers log to MySQL
  ○ Brittle but it works
● Hive
  ○ SQL (most developers know this)
  ○ Partition pruning issues
  ○ No rollup queries
● ETL
  ○ Star schema
  ○ Fair scheduling (ETL vs BI)
    ■ reserved for ETL pool
    ■ don't start reducers until 90% of mappers are done
  ○ LZO on all jobs
● MicroStrategy (ODBC)
● SQuirreL (JDBC)

Future
● Look at users from A to Z
  ○ website logs
  ○ banners
● Cassandra handler for Hive
  ○ Looking at contact lists (not just size)
● Streaming ETL
  ○ Flume
    ■ No more MySQL & scripts
    ■ Directly write into the correct partition
  ○ Avro
    ■ Fewer schema-related problems
  ○ Snappy
    ■ Lightweight compression

Questions?

Hive partition pruning

● Won't work
    select count(*)
    from chatsessions cs
      inner join calendar c on (c.cldr_id = cs.login_cldr_id)
    where c.iso_date = '2012-06-14';
● Will work
    select cldr_id from calendar where iso_date = '2012-06-14';
    select count(*) from chatsessions where login_cldr_id in (1234);
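The working variant is a two-step lookup: resolve the calendar ids first, then pass them to the second query as literals. A minimal Python sketch of the second step, assuming the ids were already fetched by the calendar query and that chatsessions is partitioned on login_cldr_id (the slides imply this but do not state it). The function name `pruned_count_query` is hypothetical, and it only builds the query text.

```python
def pruned_count_query(cldr_ids):
    """Build the second query from ids returned by the calendar lookup.

    The ids are forced through int() so only numeric literals end up in
    the statement text.
    """
    ids = [int(i) for i in cldr_ids]
    if not ids:
        raise ValueError("no calendar ids found for the requested date")
    id_list = ", ".join(str(i) for i in ids)
    return (
        "select count(*) from chatsessions "
        f"where login_cldr_id in ({id_list})"
    )
```

The resulting string can then be sent through whatever client is already in use, for example the JDBC tools mentioned earlier.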

Left outer join in Pig
    A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
    B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
    C = COGROUP A BY a1, B BY b1 OUTER;
    X = FILTER C BY IsEmpty(B);
    Z = FOREACH X GENERATE flatten(A.a2);
    DUMP Z;
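To make the data flow of the Pig script easier to follow, here is a rough Python equivalent that works on in-memory lists instead of files. It is only a sketch: the function name `rows_without_match` and the tuple layout are made up for illustration, and the order of the output rows should not be relied on.

```python
from collections import defaultdict


def rows_without_match(a_rows, b_rows):
    """Return a2 for every row of A whose key a1 has no match in B."""
    # COGROUP A BY a1, B BY b1: one group per key holding a bag from each side.
    groups = defaultdict(lambda: ([], []))
    for a1, a2 in a_rows:
        groups[a1][0].append((a1, a2))
    for b1, b2 in b_rows:
        groups[b1][1].append((b1, b2))
    result = []
    for bag_a, bag_b in groups.values():
        # FILTER C BY IsEmpty(B): keep groups where nothing came from B.
        if not bag_b:
            # FOREACH X GENERATE flatten(A.a2): one output row per a2.
            result.extend(a2 for _, a2 in bag_a)
    return result
```

This returns the same rows as the Hive query with `where t2.id is null`: rows of the first input that have no partner in the second.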

● Avro & Hive: https://issues.apache.org/jira/browse/HIVE-895
● Flume: https://cwiki.apache.org/FLUME/
