“Best friend” Big Data & Hadoop showcase Dušan Zamurović @codecentricRS

Coding serbia


Page 1: Coding serbia

“Best friend” Big Data & Hadoop showcase

Dušan Zamurović

@codecentricRS

Page 2: Coding serbia

Who am I?

Name: Dušan Zamurović
Where do I come from?
◦ codecentric Novi Sad
What do I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
  ◦ Ajax with DWR lib
◦ Android
◦ currently Big Data (reporting QA)

Page 3: Coding serbia

What will I talk about?

Me
Big Data
Map/Reduce algorithm
Hadoop platform
Pig language
Showcase
◦ Java Map/Reduce implementation
◦ Pig implementation
Conclusion

Page 4: Coding serbia
Page 5: Coding serbia

Big Data

A revolution that will transform how we live, work, and think.

3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity

Everyday use cases
◦ Beautiful
◦ Useful
◦ Funny

Page 6: Coding serbia

Big Data - Volume

The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data were created worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow by a factor of 300
◦ By 2020 the digital universe will amount to 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020)

Page 7: Coding serbia

Big Data - Variety

The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media

>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020

Page 8: Coding serbia

Big Data – Velocity

Generated at high speed
Needs real-time processing

Example I
◦ Financial world
◦ Thousands or millions of transactions

Example II
◦ Retail
◦ Analyze click streams to offer recommendations

Page 9: Coding serbia

Big Data - Value

The value of Big Data is potentially great but can be released only with the right combination of people, processes and technologies.

…unlock significant value by making information transparent and usable at much higher frequency

Page 10: Coding serbia

Big Data - Value

Measuring the heartbeat of a city - Rio de Janeiro

More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation

Beer and diapers

Possible issues?
◦ Privacy, security, intellectual property, liability…

Page 11: Coding serbia

Map/Reduce

"Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key."
- research publication http://research.google.com/archive/mapreduce.html
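The model in the quote can be tried out without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: `MapReduceSketch`, `map`, `reduce` and `run` are hypothetical names chosen for illustration, and word counting is used only because it is the customary first example. The `run` method stands in for the framework, which would normally do the grouping of intermediate pairs by key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {

    // Hypothetical map function: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Hypothetical reduce function: merges all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Stand-in for the framework: runs map, groups pairs by key, runs reduce per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, hadoop=1}
        System.out.println(run(Arrays.asList("big data", "Big Hadoop")));
    }
}
```

On a real cluster the map and reduce calls run on different machines; only the two user-supplied functions keep the same shape.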

Page 12: Coding serbia

Map/Reduce

Page 13: Coding serbia

Hadoop

In the beginning, there was Nutch

Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally extensive

Hadoop && RDBMS
◦ “Get data to process” or “send code where data is”
◦ Designed to run on a large number of machines
◦ Separate storage

Page 14: Coding serbia

HDFS

Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX
  ◦ To enable streaming access to file system data

Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data

Page 15: Coding serbia

HDFS Architecture

NameNode
◦ Master server, central component
◦ HDFS cluster has a single NameNode
◦ Manages clients’ access
◦ Keeps track of where data is kept
◦ Single point of failure

Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
  ◦ Does not provide any real redundancy

Page 16: Coding serbia

HDFS Architecture

DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes
  ◦ Data replication

TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
◦ ♥__ ♥__ ♥__ ________ ♥_ ♥ ♥ ♥__________________ (heartbeat messages sent to the JobTracker)

Page 17: Coding serbia

HDFS Architecture

JobTracker
◦ Farms tasks to specific nodes in the cluster
◦ Point of failure for MapReduce

How does it go?
1. Client submits jobs to the JobTracker
2. JobTracker asks the NameNode where the data is
3. JobTracker locates TaskTracker nodes
4. JobTracker submits tasks to the TaskTrackers
5. TaskTracker ♥__ ♥__ ♥__ (sends heartbeats to the JobTracker)
   1. Job failed: TaskTracker informs, JobTracker decides
   2. Job done: JobTracker updates status
6. Client can poll JobTracker for information

Page 18: Coding serbia

Apache Pig

Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell

Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan

Page 19: Coding serbia

Apache Pig – Pig Latin

Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

Page 20: Coding serbia

Apache Pig – Pig Latin data types

Data types
◦ Simple
  int – signed 32-bit integer
  long – signed 64-bit integer
  float – 32-bit floating point
  double – 64-bit floating point
  chararray – UTF-8 string
  bytearray – blob
  boolean – since Pig 0.10
  datetime
◦ Complex
  tuple – an ordered set of fields (21,32)
  bag – a collection of tuples {(21,32),(32,43)}
  map – a set of key value pairs [pig#latin]

Page 21: Coding serbia

Apache Pig – Pig Latin schemas

Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?

/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}

/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown

Page 22: Coding serbia

Apache Pig – Pig Latin operators

Arithmetic: +, -, *, /, %, ? :
Boolean: AND, OR, NOT
Cast
Comparison: ==, !=, <, >, <=, >=, matches
Type construction: (), {}, [] incl. eq. functions
Relational
◦ GROUP
◦ DEFINE
◦ FILTER
◦ FOREACH
◦ JOIN
◦ UNION
◦ STORE
◦ LOAD
◦ SPLIT

Page 23: Coding serbia

Apache Pig – Pig Latin built in

Eval functions
◦ AVG, MAX, MIN, COUNT, SUM, …

Load/Store functions
◦ BinStorage
◦ JsonLoader, JsonStorage
◦ PigStorage

Math functions
◦ ABS, COS, …, EXP, RANDOM, ROUND, …

String functions
◦ TRIM, LOWER, SUBSTRING, REPLACE, …

Datetime functions
◦ *Between, Get*, …

Tuple, Bag, Map functions
◦ TOTUPLE, TOBAG, TOMAP

Page 24: Coding serbia

Apache Pig – extend Pig Latin

User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy

How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc

How to use a UDF?
◦ Register
◦ Define the name of the UDF if you like
◦ Call it

Page 25: Coding serbia

“Best friend” Hadoop showcase

Page 26: Coding serbia

Showcase: a problem

Imaginary social network
Lots of users…
… with their friends, girlfriends, boyfriends, wives, husbands, mistresses, etc…

A new relationship arises…
◦ … but the new friend is not shown in the news feed

Where are his/her activities?
◦ Hidden, marked as not important

Page 27: Coding serbia

Showcase: a solution

Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Record users’ activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of “important” friends

Page 28: Coding serbia

Showcase: input data

Events recorded in JSON format

{
  "timestamp": 1341161607860,
  "sourceUser": "marry.lee",
  "targetUser": "ruby.blue",
  "eventName": "VIEW_PHOTO",
  "eventWeight": 1
}
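The showcase code reads these records through `MyJsonParser.parse(...)`, whose implementation is not shown in the slides. The sketch below is one possible stand-in, written under two assumptions: records are flat JSON objects like the one above, and the returned array follows the field order of the record, so that indexes 1, 2 and 4 hold `sourceUser`, `targetUser` and `eventWeight`. The class name `EventJsonSketch` and the regex approach are illustrative only; a real job would more likely use a JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EventJsonSketch {

    // Assumed field order, matching the sample record.
    static final String[] FIELDS = {
        "timestamp", "sourceUser", "targetUser", "eventName", "eventWeight"
    };

    // Extracts the value of each known field from one flat JSON record.
    // Handles only unnested objects whose values are plain strings or numbers.
    static String[] parse(String json) {
        String[] values = new String[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            Pattern pattern = Pattern.compile(
                "\"" + FIELDS[i] + "\"\\s*:\\s*(\"([^\"]*)\"|[-0-9.]+)");
            Matcher matcher = pattern.matcher(json);
            if (matcher.find()) {
                // group(2) is set for quoted strings, group(1) alone for numbers
                values[i] = matcher.group(2) != null ? matcher.group(2) : matcher.group(1);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String record = "{ \"timestamp\": 1341161607860, \"sourceUser\": \"marry.lee\", "
            + "\"targetUser\": \"ruby.blue\", \"eventName\": \"VIEW_PHOTO\", \"eventWeight\": 1}";
        String[] tokens = parse(record);
        // prints marry.lee -> ruby.blue weight 1
        System.out.println(tokens[1] + " -> " + tokens[2] + " weight " + tokens[4]);
    }
}
```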

Page 29: Coding serbia

Showcase: input data

public enum EventType {
    VIEW_DETAILS(3),
    VIEW_PROFILE(10),
    VIEW_PHOTO(1),
    COMMENT(2),
    COMMENT_LIKE(1),
    WALL_POST(3),
    MESSAGE(1);
    …
}
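The slide elides the body of the enum. As a rough illustration of how a constant can carry its weight, the sketch below completes it with a private field, a constructor and an accessor. The names `weight`, `getWeight` and `weightOf`, and the wrapper class `EventTypeSketch`, are assumptions and do not come from the original code.

```java
class EventTypeSketch {

    enum EventType {
        VIEW_DETAILS(3), VIEW_PROFILE(10), VIEW_PHOTO(1), COMMENT(2),
        COMMENT_LIKE(1), WALL_POST(3), MESSAGE(1);

        private final int weight; // hypothetical field name

        EventType(int weight) {
            this.weight = weight;
        }

        int getWeight() { // hypothetical accessor
            return weight;
        }
    }

    // Looks up the weight for an event name as it appears in the JSON log.
    // Throws IllegalArgumentException for names that are not enum constants.
    static int weightOf(String eventName) {
        return EventType.valueOf(eventName).getWeight();
    }

    public static void main(String[] args) {
        System.out.println(weightOf("VIEW_PHOTO"));   // prints 1
        System.out.println(weightOf("VIEW_PROFILE")); // prints 10
    }
}
```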

Page 30: Coding serbia

Showcase: Java M/R

static public class InteractionMap extends Mapper<LongWritable, Text, Text, InteractionWritable> {
    @Override
    protected void map(LongWritable offset, Text text, Context context) … {
        …
    }
}

// reduce() belongs to a Reducer subclass; its class declaration is not shown on the slide
@Override
protected void reduce(Text token, Iterable<InteractionWritable> interactions, Context context) … {
    …
}

Page 31: Coding serbia

Showcase: Java M/R

void map(LongWritable offset, Text text, Context context) {
    String[] tokens = MyJsonParser.parse(text);
    String sourceUser = tokens[1];
    String targetUser = tokens[2];
    int eventWeight = Integer.parseInt(tokens[4]);
    context.write(new Text(sourceUser), new InteractionWritable(targetUser, eventWeight));
}

Page 32: Coding serbia

Showcase: Java M/R

void reduce(Text token, Iterable<InteractionWritable> iActions, Context context) … {
    Map<Text, InteractionValuesWritable> iActionsGroup = new HashMap<Text, InteractionValuesWritable>();
    Iterator<InteractionWritable> iActionsIterator = iActions.iterator();
    while (iActionsIterator.hasNext()) {
        InteractionWritable iAction = iActionsIterator.next();
        Text targetUser = new Text(iAction.getTargetUser().toString());
        int weight = iAction.getEventWeight().get();
        int count = 1;

Page 33: Coding serbia

Showcase: Java M/R

        …
        // merge with the totals already collected for this target user
        InteractionValuesWritable iActionValues = iActionsGroup.get(targetUser);
        if (iActionValues != null) {
            weight += iActionValues.getWeight().get();
            count = iActionValues.getCount().get() + 1;
        }
        iActionsGroup.put(targetUser, new InteractionValuesWritable(weight, count));
    }

    List<Entry<Text, InteractionValuesWritable>> orderedInteractions = sortInteractionsByWeight(iActionsGroup);
    for (Entry<Text, InteractionValuesWritable> entry : orderedInteractions) {
        InteractionValuesWritable value = entry.getValue();
        String resLine = … // entry.key + value.weight + value.count
        context.write(token, new Text(resLine));
    }
}
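The reduce step above can be tried without Hadoop. The following plain-Java sketch aggregates, for a single source user, the total weight and the event count per target user and then orders the targets by total weight. `ReduceSketch` and `Interaction` are hypothetical stand-ins for the Writable classes, and the descending order is an assumption about what `sortInteractionsByWeight` does, since its body is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReduceSketch {

    // Stand-in for InteractionWritable: one logged event aimed at a target user.
    static class Interaction {
        final String targetUser;
        final int eventWeight;

        Interaction(String targetUser, int eventWeight) {
            this.targetUser = targetUser;
            this.eventWeight = eventWeight;
        }
    }

    // Sums weight and counts events per target user, then returns
    // "target weight count" lines ordered by total weight, highest first.
    static List<String> reduce(List<Interaction> interactions) {
        Map<String, int[]> totals = new HashMap<>(); // value = {weight, count}
        for (Interaction interaction : interactions) {
            int[] total = totals.computeIfAbsent(interaction.targetUser, k -> new int[2]);
            total[0] += interaction.eventWeight;
            total[1] += 1;
        }
        List<Map.Entry<String, int[]>> ordered = new ArrayList<>(totals.entrySet());
        ordered.sort((a, b) -> Integer.compare(b.getValue()[0], a.getValue()[0]));
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : ordered) {
            lines.add(entry.getKey() + " " + entry.getValue()[0] + " " + entry.getValue()[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Interaction> events = Arrays.asList(
            new Interaction("ruby.blue", 1),
            new Interaction("jane.doe", 10),
            new Interaction("ruby.blue", 3));
        // prints [jane.doe 10 1, ruby.blue 4 2]
        System.out.println(reduce(events));
    }
}
```

Keeping both the weight and the count per target is what later allows an average weight per interaction to be derived.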

Page 34: Coding serbia

Showcase: M/R result

casie.keller petar.petrovic 97579 32554
casie.keller marry.lee 97284 32094
casie.keller jane.doe 97247 32400
casie.keller domenico.quatro-formaggi 96712 32106
casie.keller esmeralda.aguero 96665 32251
casie.keller jason.bourne 96499 32043
casie.keller jose.miguel 96304 31927
casie.keller steve.smith 95929 32267
casie.keller john.doe 95664 31996
casie.keller swatka.mawa 95421 31785
casie.keller lee.young 95400 31758
casie.keller ruby.blue 95132 32181
domenico.quatro-formaggi jane.doe 97442 32492
domenico.quatro-formaggi ruby.blue 97072 31916
domenico.quatro-formaggi jason.bourne 96967 3223…

Page 35: Coding serbia

Showcase: Pig M/R

class JsonLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    public ResourceSchema getSchema(String location, Job job) … {
        ResourceSchema schema = new ResourceSchema();
        ResourceFieldSchema[] fieldSchemas = new ResourceFieldSchema[SCHEMA_FIELDS_COUNT];
        fieldSchemas[0] = new ResourceFieldSchema();
        fieldSchemas[0].setName(FIELD_NAME_TIMESTAMP);
        fieldSchemas[0].setType(DataType.LONG);
        …
        schema.setFields(fieldSchemas);
        return schema;
    }
}

Page 36: Coding serbia

Showcase: Pig M/R

class JsonLoader extends LoadFunc {
    …
    @Override
    public Tuple getNext() throws IOException {
        try {
            boolean notDone = in.nextKeyValue();
            if (!notDone) {
                return null;
            }
            Text jsonRecord = (Text) in.getCurrentValue();
            String[] values = MyJsonParser.parse(jsonRecord);
            Tuple tuple = tuppleFactory.newTuple(Arrays.asList(values));
            return tuple;
        } catch (Exception exc) {
            throw new IOException(exc);
        }
    }
}

Page 37: Coding serbia

Showcase: Pig M/R

class AverageWeight extends EvalFunc<String> {
    …
    @Override
    public String exec(Tuple input) … {
        String output = null;
        if (input != null && input.size() == 2) {
            Integer totalWeight = (Integer) input.get(0);
            Integer totalCount = (Integer) input.get(1);
            BigDecimal average = new BigDecimal(totalWeight)
                .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
            output = average.stripTrailingZeros().toPlainString();
        }
        return output;
    }
}
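The arithmetic inside `exec` can be checked in isolation from Pig. The sketch below repeats the same `BigDecimal` steps in a plain static method. The value `SCALE = 2` is an assumption, since the slide does not show the constant, and the guard against a zero count is an addition for illustration. `AverageWeightSketch` and `average` are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class AverageWeightSketch {

    static final int SCALE = 2; // assumed; the slide does not show the value

    // Divides total weight by total count, rounds half up to SCALE decimals,
    // and drops trailing zeros so that 2.50 is rendered as 2.5.
    static String average(Integer totalWeight, Integer totalCount) {
        if (totalWeight == null || totalCount == null || totalCount == 0) {
            return null; // nothing sensible to report
        }
        BigDecimal average = new BigDecimal(totalWeight)
            .divide(new BigDecimal(totalCount), SCALE, RoundingMode.HALF_UP);
        return average.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(average(10, 4));   // prints 2.5
        System.out.println(average(100, 10)); // prints 10
    }
}
```

`toPlainString()` matters here: after `stripTrailingZeros()` a value such as 10.00 is held as 1E+1, and `toString()` would print it in that scientific form.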

Page 38: Coding serbia

Showcase: Pig M/R

REGISTER codingserbia-udf.jar;
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();

interactionRecords = LOAD '/blog/user_interaction_big.json' USING com.codingserbia.udf.JsonLoader();

interactionData = FOREACH interactionRecords GENERATE sourceUser, targetUser, eventWeight;

groupInteraction = GROUP interactionData BY (sourceUser, targetUser);
…

Page 39: Coding serbia

Showcase: Pig M/R

…
summarizedInteraction = FOREACH groupInteraction GENERATE
    group.sourceUser AS sourceUser,
    group.targetUser AS targetUser,
    SUM(interactionData.eventWeight) AS eventWeight,
    COUNT(interactionData.eventWeight) AS eventCount,
    AVG_WEIGHT(
        SUM(interactionData.eventWeight),
        COUNT(interactionData.eventWeight)) AS averageWeight;

result = ORDER summarizedInteraction BY sourceUser, eventWeight DESC;

STORE result INTO '/results/pig_mr' USING PigStorage();
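For readers more at home in Java than in Pig Latin, the GROUP, FOREACH and ORDER statements of this script map roughly onto a stream pipeline. The sketch below is an in-memory analogy, not what Pig generates: `PigPipelineSketch`, `Event` and `summarize` are hypothetical names, and the tab-separated output only imitates the default `PigStorage` delimiter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PigPipelineSketch {

    // Stand-in for one row of interactionData.
    static class Event {
        final String sourceUser;
        final String targetUser;
        final int eventWeight;

        Event(String sourceUser, String targetUser, int eventWeight) {
            this.sourceUser = sourceUser;
            this.targetUser = targetUser;
            this.eventWeight = eventWeight;
        }
    }

    static List<String> summarize(List<Event> events) {
        // GROUP ... BY (sourceUser, targetUser), collecting SUM and COUNT per group
        Map<List<String>, IntSummaryStatistics> groups = events.stream()
            .collect(Collectors.groupingBy(
                e -> Arrays.asList(e.sourceUser, e.targetUser),
                Collectors.summarizingInt(e -> e.eventWeight)));

        // ORDER ... BY sourceUser, eventWeight DESC
        List<Map.Entry<List<String>, IntSummaryStatistics>> rows =
            new ArrayList<>(groups.entrySet());
        rows.sort((a, b) -> {
            int bySource = a.getKey().get(0).compareTo(b.getKey().get(0));
            return bySource != 0
                ? bySource
                : Long.compare(b.getValue().getSum(), a.getValue().getSum());
        });

        // One tab-separated line per group: source, target, sum, count
        return rows.stream()
            .map(r -> String.join("\t",
                r.getKey().get(0),
                r.getKey().get(1),
                String.valueOf(r.getValue().getSum()),
                String.valueOf(r.getValue().getCount())))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("casie.keller", "petar.petrovic", 3),
            new Event("domenico.quatro-formaggi", "jane.doe", 2),
            new Event("casie.keller", "marry.lee", 10),
            new Event("casie.keller", "petar.petrovic", 1));
        summarize(events).forEach(System.out::println);
    }
}
```

The Pig script expresses the same three steps declaratively and leaves the distribution across the cluster to the compiler.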

Page 40: Coding serbia

Conclusion

Page 41: Coding serbia