Homework 4
• Code for word count: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-examples/0.20.2-320/org/apache/hadoop/examples/WordCount.java#WordCount


Data Bases in Cloud Environments

Based on: Md. Ashfakul Islam
Department of Computer Science, The University of Alabama


Data Today

• Data sizes are increasing exponentially every day.
• Key difficulties in processing large-scale data:
  – acquiring the required amount of on-demand resources
  – automatically scaling up and down with dynamic workloads
  – distributing and coordinating a large-scale job across several servers
  – replication and maintaining update consistency
• A cloud platform can solve most of the above.


Large Scale Data Management

• Large-scale data management is attracting attention.
• Many organizations produce data at the petabyte (PB) level.
• Managing such an amount of data requires huge resources.
• The ubiquity of huge data sets inspires researchers to think in new ways.


Issues to Consider

• Distributed or centralized application?
• How can ACID guarantees be maintained?
• CAP theorem
  – Consistency, Availability, Partition tolerance
  – Data availability and reliability (even under a network partition) are achieved by compromising consistency
  – Traditional consistency techniques become obsolete
• Consistency becomes the bottleneck of data management deployment in the cloud
  – Costly to maintain


Evaluation Criteria for Data Management

• Evaluation criteria:
  – Elasticity
    • scalable, distributes new resources, offloads unused resources, parallelizable, low coupling
  – Security
    • untrusted host, moving off premises, new rules/regulations
  – Replication
    • available, durable, fault tolerant, replication across the globe


Evaluation of Analytical DB

• An analytical DB handles historical data with few or no updates, so ACID properties are not needed.
• Elasticity
  – Easier, since ACID is not required
    • E.g. no updates, so locking is not needed
  – A number of commercial products support elasticity.
• Security
  – requires sensitive and detailed data
  – a third-party vendor stores the data
  – potential risk of data leakage and privacy violation
• Replication
  – A recent snapshot of the DB serves the purpose.
  – Strong consistency isn’t required.


Analytical DBs - Data Warehousing

• Data warehousing (DW): a popular application of Hadoop
• Typically a DW is relational (OLAP)
  – but can also hold semi-structured and unstructured data
• Can also be parallel DBs (Teradata)
  – column oriented
  – expensive, $10K per TB of data
• Hadoop for DW
  – Facebook abandoned Oracle for Hadoop (Hive)
  – Also Pig, for semi-structured data


Evaluation of Transactional DM

• Elasticity
  – data is partitioned over sites
  – locking and commit protocols become complex and time consuming
  – huge distributed data processing overhead
• Security
  – requires sensitive and detailed data
  – a third-party vendor stores the data
  – potential risk of data leakage and privacy violation


Evaluation of Transactional DM

• Replication
  – data is replicated in the cloud
  – CAP theorem: of Consistency, Availability, and Partition tolerance, only two are achievable
  – between consistency and availability, one must be chosen
  – availability is the main goal of the cloud
  – consistency is sacrificed
  – ACID violation


Transactional Data Management


Transactional Data Management

Needed because:
• Transactional data management
  – is the heart of the database industry
  – almost all financial transactions are conducted through it
  – relies on ACID guarantees
• ACID properties are the main challenge in deploying transactional DM in the cloud.


Scalable Transactions for Web Applications in the Cloud

• Two important properties of Web applications
  – all transactions are short-lived
  – a data request can be answered with a small set of well-identified data items
• Scalable database services like Amazon SimpleDB and Google BigTable allow data to be queried only by primary key.
• These database services maintain eventual data consistency.


Relational Joins

• Hadoop is not a DB
• Debate between parallel DBs and MapReduce (MR) for OLAP
  – DeWitt/Stonebraker call MR a “step backwards”
  – Parallel DBs are faster because they can create indexes


Relational Joins - Example

• Given 2 data sets S and T:
  – (k1, (s1, S1)): k1 is the join attribute, s1 is the tuple ID, S1 is the rest of the attributes
  – (k2, (s2, S2))
  – (k1, (t1, T1)): the same information for T
  – (k2, (t2, T2))
• S could be user profiles: k is the PK, and the tuple holds info about age, gender, etc.
• T could be logs of online activity: the tuple is a particular URL, and k is the FK


Reduce side Join 1:1

• Map over both datasets, emit (join key, tuple)
• All tuples are grouped by join key, which is what the join needs
• Which type of join is this?
  – Parallel sort-merge join
• If the join is one-to-one, at most 1 tuple from S and 1 tuple from T match
• If there are 2 values, one must be from S and the other from T (we don’t know which, since there is no order); join them
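
The 1:1 reduce-side join can be sketched in plain Python as a small in-memory simulation of the map, shuffle, and reduce steps. This is only an illustration, not Hadoop code: the function names and the sample record layout (join key, tuple ID, attributes) are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(s_records, t_records):
    """Emit (join key, tuple) pairs from both datasets."""
    for key, s_id, s_attrs in s_records:
        yield key, (s_id, s_attrs)
    for key, t_id, t_attrs in t_records:
        yield key, (t_id, t_attrs)

def shuffle(pairs):
    """Group values by join key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_one_to_one(key, values):
    """In a 1:1 join, two values for a key mean one came from S and one from T."""
    if len(values) == 2:
        # A real reducer cannot rely on which value arrives first,
        # so the pair is joined in whatever order it was received.
        return [(key, values[0], values[1])]
    return []  # key appears in only one dataset: no join output

def reduce_side_join(s_records, t_records):
    groups = shuffle(map_phase(s_records, t_records))
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_one_to_one(key, groups[key]))
    return joined

if __name__ == "__main__":
    S = [("k1", "s1", "S1"), ("k2", "s2", "S2")]
    T = [("k1", "t1", "T1"), ("k3", "t3", "T3")]
    print(reduce_side_join(S, T))
```

Only k1 appears in both datasets, so it is the only key that produces a joined record.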


Reduce side Join 1:N

• If one-to-many
  – If S is the “one” side (based on PK), the same approach as 1-to-1 will work
  – But which value is from S? (no ordering)
  – Solution: buffer all values for the join key in memory
    • Pick out the tuple from S and perform the join
    • Scalability issue: uses memory
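
A minimal sketch of the buffering approach, again as an in-memory Python simulation. It assumes the mapper tags each value with its source dataset ("S" or "T") so the reducer can pick out the S tuple; the tags and function names are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def reduce_one_to_many(key, tagged_values):
    """Buffer the whole group, pick out the S tuple, and join it with every T tuple."""
    buffered = list(tagged_values)  # the entire group is held in memory
    s_tuples = [value for tag, value in buffered if tag == "S"]
    t_tuples = [value for tag, value in buffered if tag == "T"]
    if len(s_tuples) != 1:
        return []  # a 1:N join expects exactly one S tuple per key
    s = s_tuples[0]
    return [(key, s, t) for t in t_tuples]

def join_one_to_many(s_records, t_records):
    groups = defaultdict(list)
    # T values are added first to show that the S tuple may arrive anywhere.
    for key, value in t_records:
        groups[key].append(("T", value))
    for key, value in s_records:
        groups[key].append(("S", value))
    out = []
    for key in sorted(groups):
        out.extend(reduce_one_to_many(key, groups[key]))
    return out

if __name__ == "__main__":
    users = [("k1", "alice")]
    logs = [("k1", "/a"), ("k1", "/b"), ("k2", "/c")]
    print(join_one_to_many(users, logs))
```

The `buffered` list is the scalability concern: its size grows with the number of T tuples that share a join key.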


Reduce side Join 1:N

• Use value-to-key conversion
  – Create a composite key: join key and tuple ID
  – Define the sort order so that keys are:
    • sorted by join key
    • then by IDs from S first
    • then by IDs from T
  – Define the partitioner to use only the join key, so all composite keys with the same join key go to the same reducer


Reduce side Join 1:N

• Can remove the join key and tuple ID from the value to save space
• Whenever the reducer finds a new join key, the first tuple will be from S and not T
  – put it into memory (only the S one)
  – join it with the other tuples until the next new join key
  – No more bottleneck
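
The composite-key approach can be sketched as follows. This is a plain-Python simulation under stated assumptions, not Hadoop's partitioner or comparator API: the numeric source tags (0 for S, 1 for T), the function names, and the default of two reducers are all hypothetical choices made for the example.

```python
S_TAG, T_TAG = 0, 1  # hypothetical tags; 0 sorts before 1, so S tuples come first

def map_composite(s_records, t_records):
    """Emit ((join key, source tag, tuple ID), attributes)."""
    for key, s_id, attrs in s_records:
        yield (key, S_TAG, s_id), attrs
    for key, t_id, attrs in t_records:
        yield (key, T_TAG, t_id), attrs

def partition(composite_key, num_reducers):
    """Partition on the join key only, so all tuples for a key meet at one reducer."""
    return hash(composite_key[0]) % num_reducers

def reduce_stream(sorted_pairs):
    """Walk one reducer's sorted input, holding only the current S tuple in memory."""
    current_key, current_s = None, None
    for (key, tag, tuple_id), attrs in sorted_pairs:
        if key != current_key:
            # New join key: its first tuple is from S, if S has this key at all.
            current_key = key
            current_s = (tuple_id, attrs) if tag == S_TAG else None
            if tag == S_TAG:
                continue
        if tag == T_TAG and current_s is not None:
            yield key, current_s, (tuple_id, attrs)

def value_to_key_join(s_records, t_records, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for composite_key, attrs in map_composite(s_records, t_records):
        buckets[partition(composite_key, num_reducers)].append((composite_key, attrs))
    joined = []
    for bucket in buckets:
        # Sort by join key, then S before T, then tuple ID.
        bucket.sort(key=lambda pair: pair[0])
        joined.extend(reduce_stream(bucket))
    return sorted(joined)

if __name__ == "__main__":
    S = [("k1", "s1", "age=30"), ("k2", "s2", "age=41")]
    T = [("k1", "t1", "/home"), ("k1", "t2", "/cart"), ("k3", "t3", "/x")]
    for row in value_to_key_join(S, T):
        print(row)
```

Because the sort order delivers the S tuple first, the reducer keeps a single tuple per join key in memory instead of buffering the whole group.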


Consistency in Clouds


Transactional DM

• A transaction is a sequence of read and write operations.
• Guarantee the ACID properties of transactions:
  – Atomicity: either all operations execute or none.
  – Consistency: the DB remains consistent after each transaction execution.
  – Isolation: the impact of a transaction can’t be altered by another one.
  – Durability: the impact of a committed transaction is guaranteed to persist.
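
Atomicity can be illustrated on a single node with Python's built-in `sqlite3` module. This is a sketch of the all-or-nothing property only, not of a cloud deployment; the `accounts` table and the helper names are assumptions made for the example.

```python
import sqlite3

def make_bank():
    """Create an in-memory database with two sample accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates take effect together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

if __name__ == "__main__":
    conn = make_bank()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails after the first `UPDATE` has already run, and the rollback undoes that partial write, so the balances are unchanged.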


ACID Properties

• Atomicity is maintained by two-phase commit (2PC).
• Eventual consistency is maintained.
• Isolation is maintained by decomposing transactions.
• Timestamp ordering is introduced to order conflicting transactions.
• Durability is maintained by replicating data items across several local transaction managers (LTMs).
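
The two-phase commit idea can be sketched as a toy coordinator. This is a simplified illustration that assumes no crashes, timeouts, or message loss; the `Participant` class and its methods are hypothetical and do not model the LTMs in any detail.

```python
class Participant:
    """A hypothetical resource manager that votes in phase 1 and applies the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work could be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit everywhere only if every participant votes yes; otherwise abort everywhere."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    decision = all(votes)
    for p in participants:  # phase 2: broadcast the decision
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


if __name__ == "__main__":
    group = [Participant("ltm-1"), Participant("ltm-2", can_commit=False)]
    print(two_phase_commit(group), [p.state for p in group])
```

A single "no" vote aborts the transaction at every participant, which is how the protocol keeps the outcome all-or-nothing across sites.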


Consistency in Clouds

• A consistent database must remain consistent after the execution of successful operations.
• Inconsistency may cause huge damage.
• Consistency is typically sacrificed to achieve availability and scalability.
• Maintaining strong consistency in the cloud is very costly.


• Traditional DM is becoming obsolete.
• Thin portable devices and concentrated computing power show a new way.
• ACID guarantees become the main challenge.
• Some solutions have been proposed to overcome the challenge.
• Consistency remains the bottleneck.
• Our goal is to provide low-cost solutions to ensure data consistency in the cloud.


Current DB Market Status

• MS SQL doesn’t support auto scaling and load.
• MySQL is recommended for “lower traffic”.
• New products advertise: “replace MySQL with us”.
• Oracle recently released on-demand resource allocation.
• IBM DB2 can auto scale with dynamic workload.
• Azure Relational DB: great performance