COP 6727: Advanced Database Systems Spring 2013 Dr. Tao Li Florida International University


COP6727 2

Student Self-Introduction

• Name
  – I will try to remember your names. If you have a long name, please let me know what I should call you.

• Anything you want us to know


Course Overview

• Meeting time
  – Tuesday and Thursday, 12:30pm – 1:45pm
• Office hours
  – Thursday, 2:30pm – 4:30pm, or by appointment
• Course webpage
  – http://www.cs.fiu.edu/~taoli/class/CAP6727-S13/index.html


Course Objectives

• This is an advanced database course
  – Students should already have taken COP5725

• Assume knowledge of the fundamental concepts of relational databases.

• Cover the core principles and techniques of data and information management

• Discuss advanced techniques that can be applied to traditional database systems in order to provide efficient support of new emerging applications.

Tentative Topics

• Query processing and optimization
• Transaction management
• Database tuning
• Data stream systems
• Spatial databases
• XML
• Information retrieval and Web data management
• Scalable data processing
• Readings in recent developments in database systems and applications
  – SQL vs. non-SQL databases
  – Nearest neighbor queries
  – High-dimensional indexing
  – Database retrieval and ranking
  – Stream processing
  – Big Data
  – Incremental and online query processing
  – Mobile databases


Assignments and Grading

• Reading/Written Assignments
• Programming Projects
• Midterm Exam
• Final Project/Presentations
• Class attendance is mandatory.
• Evaluation will be a subjective process
  – Effort is a very important component
• Regular In-class Students
  – Quizzes and Class Participation: 5%
  – Midterm Exam: 30%
  – Final Project: 30%
  – Assignments and Projects: 35%
• Online Students
  – Midterm Exam: 30%
  – Final Project: 30%
  – Homework Assignments: 40%


Text and References

Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003. ISBN: 0-07-246563-8.

In addition, course materials will be drawn from recent research literature.

Lectures 1 & 2

• Lectures 1 & 2: Introduction to MapReduce (most of the slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera tutorials)


Outline

• Motivation for MapReduce

• What is MapReduce?

• What is Hadoop?

• What is Hive?


Motivation for MapReduce

• The Big Data

• How to handle big data?


The Big Data

• Big data is everywhere

• Documents
  – Blogs (77 million Tumblr and 56.6 million WordPress blogs as of 2012), microblogs, news, reviews
• Images
  – Instagram, Flickr (more than 6 billion images)
• Videos
  – YouTube, all broadcast video
• Others
  – Maps (Google Maps)
  – Human genome
  – Aeronautics and space data


Another view on “big”

• 2008: Google processes 20 PB a day

• 2009: Facebook has 2.5 PB user data + 15 TB/day

• 2009: eBay has 6.5 PB user data + 50 TB/day

• 2011: Yahoo! has 180-200 PB of data

• 2012: Facebook ingests 500 TB/day


Why do we care about those data?

• Modeling and predicting information flow
• Recommending/predicting links in social networks
• Relevance classification / information filtering
• Sentiment analysis and opinion mining
• Topic modeling and evolution
• Measuring influence in social networks
• Concept mapping
• Search
• …


Big data analysis

• Scalability (with reasonable cost)
  – Algorithm improvements
  – Intuitive way: divide and conquer


Divide and Conquer


Challenges

• Parallel processing is complicated
  – How do we assign tasks to workers?
  – What if we have more tasks than slots?
  – What happens when tasks fail?
  – How do we handle distributed synchronization?


Challenges (cont'd)

• Data storage is not trivial
  – Traditional databases are not reliable enough at this scale
• Data volumes are massive
• Reliably storing PBs of data is challenging
  – Disk/hardware/network failures
  – The probability of a failure event increases with the number of machines
• For example:
  – 1000 hosts, each with 10 disks; a disk lasts 3 years
  – How many failures per day?
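The question above can be answered with a quick estimate. The sketch below assumes that a disk lasts exactly 3 years (1,095 days) and that failures are spread evenly over that lifetime; both are simplifying assumptions, and the function name is only illustrative.

```python
# Back-of-the-envelope estimate of disk failures per day in a cluster.
HOSTS = 1000
DISKS_PER_HOST = 10
DISK_LIFETIME_DAYS = 3 * 365  # assumption: every disk lasts exactly 3 years


def expected_failures_per_day(hosts, disks_per_host, lifetime_days):
    """Average failures per day if failures are spread evenly over the lifetime."""
    total_disks = hosts * disks_per_host
    return total_disks / lifetime_days


failures = expected_failures_per_day(HOSTS, DISKS_PER_HOST, DISK_LIFETIME_DAYS)
print(f"{failures:.1f} disk failures per day on average")
```

Under these assumptions the cluster loses roughly 9 disks every day, which is why failure handling has to be automatic rather than manual.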


What is MapReduce?

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop


Workflow of Large Data Problem


MapReduce paradigm

• Implement two functions:

  Map(k1, v1) -> list(k2, v2)
  Reduce(k2, list(v2)) -> list(v3)

• Framework handles everything else*

• Values with the same key go to the same reducer
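To make the two signatures concrete, here is a minimal single-process sketch of a word count job. It is not Hadoop code: `run_job` is a hypothetical stand-in for the framework, and the two input documents are made up for illustration.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the text."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum all counts for one word."""
    return [sum(counts)]


def run_job(inputs):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle: all values with the same key end up in the same group.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: keys are visited in sorted order.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


docs = [("doc1", "hello world"), ("doc2", "hello hadoop")]
result = run_job(docs)
print(result)
```

The only job-specific code is `map_fn` and `reduce_fn`; grouping, ordering, and invocation are left to the framework.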


MapReduce Flow


An Example


MapReduce paradigm (cont'd)

• There's more!
• Partitioners decide which key goes to which reducer
  – partition(k', numPartitions) -> partNumber
  – Divides the key space into chunks for parallel reducers
  – Default is hash-based
• Combiners can combine mapper output before it is sent to the reducer
  – A combiner has the same signature as a reducer: Reduce(k2, list(v2)) -> list(v3)
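The two hooks above can be sketched in a few lines. This is an illustration only: `zlib.crc32` is used as a deterministic stand-in for whatever hash function the framework applies, and the mapper output is made up.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    """partition(k', numPartitions) -> partNumber, hash-based."""
    # crc32 is an assumed stand-in for the framework's hash; unlike Python's
    # built-in hash() on strings, it gives the same value on every run.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Pre-aggregate (word, count) pairs from one mapper before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


mapper_output = [("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)]
combined = combine(mapper_output)  # fewer pairs to send over the network

# Route each combined pair to one of 4 hypothetical reducers.
by_reducer = {}
for word, count in combined:
    by_reducer.setdefault(partition(word, 4), []).append((word, count))
print(combined, by_reducer)
```

A combiner is only safe when the reduce operation is associative and commutative (such as a sum), because the framework may apply it to any subset of a key's values.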


MapReduce Flow


MapReduce additional details

• Reduce starts after all mappers complete

• Mapper output gets written to disk

• Intermediate data can be copied sooner

• Reducer gets keys in sorted order

• Keys not sorted across reducers

• Global sort requires 1 reducer or smart partitioning
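One reading of "smart partitioning" is range partitioning: if reducer i only receives keys smaller than every key sent to reducer i+1, concatenating the sorted reducer outputs gives a global sort. The sketch below assumes that interpretation; the boundary values and keys are made up.

```python
import bisect


def range_partition(key, boundaries):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)


def run_sort(keys, boundaries):
    """Send each key to a reducer by range, then sort within each reducer."""
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        reducers[range_partition(key, boundaries)].append(key)
    # Each reducer sees its own keys in sorted order.
    return [sorted(part) for part in reducers]


keys = [42, 7, 19, 88, 3, 56, 71, 25]
parts = run_sort(keys, boundaries=[20, 60])  # 3 reducers: <20, 20..59, >=60
merged = [key for part in parts for key in part]
print(parts, merged)
```

With hash partitioning each part would still be sorted internally, but the concatenation generally would not be, which is the point of the "keys not sorted across reducers" bullet.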


MapReduce is good at

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset


MapReduce can do

• Iterative jobs (e.g., PageRank, K-means clustering)
  – Each iteration must read/write data to disk
  – The I/O and latency cost of an iteration is high


MapReduce is not good at

• Jobs that need shared state/coordination
  – Tasks are shared-nothing
  – Shared state requires a scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records


Summary of MapReduce

• Simple programming model

• Scalable, fault-tolerant

• Ideal for (pre-)processing large volumes of data


What is Hadoop?

• Hadoop is an open-source implementation based on GFS and MapReduce from Google

• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System. SOSP 2003

• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004


Hadoop provides

• Redundant, fault-tolerant data storage

• Parallel computation framework

• Job coordination


Hadoop Stack


Who uses Hadoop?

• Yahoo!

• Facebook

• Last.fm

• Rackspace

• Digg

• Apache Nutch

• ...


HDFS

• The Hadoop Distributed File System

• Redundant storage

• Designed to reliably store data using commodity hardware

• Designed to expect hardware failures

• Intended for large files

• Designed for batch inserts


Some Concepts about HDFS

• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) periodically checkpoints the NN metadata; despite its name, it is not a hot standby for the NN
• DataNodes (DN) store and serve blocks
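As a rough illustration of the block and replication numbers above, the sketch below splits a file into 64 MB blocks and assigns each block to 3 distinct DataNodes. The round-robin placement and the node names are assumptions made for the example; they are not HDFS's actual placement policy.

```python
import math

BLOCK_SIZE_MB = 64  # default block size from the list above
REPLICATION = 3     # default replication factor from the list above


def block_layout(file_size_mb, datanodes):
    """Split a file into blocks and pick REPLICATION distinct nodes per block."""
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    layout = {}
    for block_id in range(num_blocks):
        # Round-robin placement: an illustrative stand-in, not the real policy.
        layout[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return layout


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
layout = block_layout(200, nodes)     # a 200 MB file
for block_id, holders in layout.items():
    print(block_id, holders)
```

A 200 MB file becomes 4 blocks (the last one only partly filled), and the NameNode's metadata is essentially this block-to-node map.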


Write


Read


If a DataNode fails

• DNs check in with the NN to report health

• Upon failure, the NN orders DNs to re-replicate under-replicated blocks
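The re-replication step can be sketched as a pure function over the NameNode's block map. The block names, node names, and the "first available node" choice are all assumptions for illustration; the real NameNode applies its own placement rules.

```python
def handle_datanode_failure(block_map, failed, live_nodes, replication=3):
    """Return a new block map after reacting to one failed DataNode."""
    new_map = {}
    for block, holders in block_map.items():
        # Replicas on the failed node are lost.
        survivors = [node for node in holders if node != failed]
        # Candidate targets: live nodes that do not already hold this block.
        candidates = [node for node in live_nodes if node not in survivors]
        # Copy the block to new nodes until it is fully replicated again.
        while len(survivors) < replication and candidates:
            survivors.append(candidates.pop(0))
        new_map[block] = survivors
    return new_map


block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
repaired = handle_datanode_failure(
    block_map, failed="dn2", live_nodes=["dn1", "dn3", "dn4", "dn5"]
)
print(repaired)
```

After dn2 fails, both blocks are back at three replicas, none of them on the failed node.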


Jobs and Tasks in Hadoop

• Job: a user-submitted map and reduce implementation to apply to a data set

• Task: a single mapper or reducer task
  – Failed tasks get retried automatically
  – Tasks run local to their data, ideally

• JobTracker (JT) manages job submission and task delegation

• TaskTrackers (TT) ask for work and execute tasks


Architecture


How to handle failed tasks?

• The JT retries a failed task up to N times

• After N failed attempts for a task, the job fails

• Some tasks are slower than others

• Speculative execution: the JT starts multiple instances of the same task

• The first one to complete wins; the others are killed
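The retry rule in the first two bullets can be sketched as follows. This covers retries only, not speculative execution; `run_with_retries` and `flaky_task` are hypothetical names, and the value of N is chosen arbitrarily.

```python
def run_with_retries(task, max_attempts):
    """Run `task` up to max_attempts times; fail the job if every attempt fails."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as exc:  # any task failure triggers a retry
            last_error = exc
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


def flaky_task(attempt):
    """Hypothetical task that fails on its first two attempts."""
    if attempt < 3:
        raise IOError("simulated task failure")
    return f"done on attempt {attempt}"


outcome = run_with_retries(flaky_task, max_attempts=4)
print(outcome)
```

With `max_attempts=2` the same task would exhaust its attempts and the whole job would be reported as failed.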


Data locality

• Move computation to the data

• Moving data between nodes has a cost

• Hadoop tries to schedule tasks on nodes with the data

• When that is not possible, the TT has to fetch the data from a remote DN
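A locality-aware scheduling decision can be sketched as a simple preference rule. This is a simplified illustration under assumed inputs (a set of nodes holding the block and a list of nodes with free task slots); the actual JobTracker scheduler is more involved.

```python
def pick_tracker(block_locations, free_trackers):
    """Prefer a free TaskTracker on a node that holds the block."""
    for node in free_trackers:
        if node in block_locations:
            return node, "local"      # computation moves to the data
    if free_trackers:
        return free_trackers[0], "remote"  # data must be fetched from a DN
    return None, "wait"               # no free slot anywhere


# Hypothetical cluster state: the block lives on dn2, dn3 and dn4.
locations = {"dn2", "dn3", "dn4"}
choice_local = pick_tracker(locations, ["dn1", "dn3"])
choice_remote = pick_tracker(locations, ["dn1", "dn5"])
print(choice_local, choice_remote)
```

The first call finds a free node that already stores the block; the second has to fall back to a node that reads the block over the network.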


Hadoop execution environment

• Local machine (standalone or pseudo-distributed)

• Virtual machine

• Cloud (e.g. Amazon EC2)

• Own cluster


Demo: word count

• Demo


Homework

• Write a Hadoop program to index the words within the text document dataset
  – Example:
    • Input:
      – Doc1: Hello World!
      – Doc2: Hello Java!
    • Expected output:
      – Hello \t Doc1 Doc2
      – World \t Doc1
      – Java \t Doc2
• Due: beginning of class on 01/10
• If you have any questions, send an email to Jingxuan Li ([email protected])


Login Info

• Below is the login information for our Hadoop cluster
  – Server: datamining-node03.cs.fiu.edu
  – Username: dbstudent; password: ******* (announced in class)
  – To access the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent
• Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset
• Output directory format (it holds the source code and the indexing results): /user/dbstudent/output-PID


What is Hive?

• Data warehousing tool on top of Hadoop
• Originally developed at Facebook
  – Now an Apache project (it started as a Hadoop sub-project)
• Data warehouse infrastructure
  – Execution: MapReduce
  – Storage: HDFS files
• Large datasets, e.g. Facebook daily logs
  – 30GB (Jan’08), 200GB (Mar’08), 15+TB (2009)

• Hive QL: SQL-like query language


Motivation

• Missing components when using Hadoop MapReduce jobs to process data
  – Command-line interface for “end users”
  – Ad-hoc query support (without writing full MapReduce jobs)
  – Schema information


Hive Applications

• Log processing

• Text mining

• Document indexing

• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling, hypothesis testing


Hive Components

• Shell: allows interactive queries, like a MySQL shell connected to a database
  – Also supports web and JDBC clients
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS


Data Model

• Tables
  – Typed columns (int, float, string, date, boolean)
  – Also list and map types (for JSON-like data)
• Partitions
  – e.g., to range-partition tables by date
• Buckets
  – Hash partitions within ranges (useful for sampling, join optimization)

Metastore

• Database: namespace containing a set of Tables

• Holds table definitions (column types, physical layout)

• Partition data

• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases


Physical Layout

• Warehouse directory in HDFS
  – e.g., /home/hive/warehouse
• Tables stored in subdirectories of the warehouse
  – Partitions and buckets form subdirectories of tables
• Actual data stored in flat files
  – Control-character-delimited text, or SequenceFiles
  – With a custom SerDe, can use an arbitrary format
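The directory scheme above can be illustrated by computing where a row would land. The table name `logs`, the partition column `ds`, the bucketing column value, and the crc32 hash are all hypothetical choices for this sketch; Hive's own hash function and file naming may differ.

```python
import posixpath
import zlib

WAREHOUSE = "/home/hive/warehouse"  # example warehouse directory from above


def partition_dir(table, partition_col, partition_value):
    """Directory of one partition: <warehouse>/<table>/<col>=<value>."""
    return posixpath.join(WAREHOUSE, table, f"{partition_col}={partition_value}")


def bucket_for(value, num_buckets):
    """Hash-partition a bucketing column value into one of num_buckets buckets."""
    # crc32 is an assumed stand-in for Hive's hash; it is stable across runs.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table "logs", partitioned by date column "ds", 32 buckets.
path = partition_dir("logs", "ds", "2009-03-01")
bucket = bucket_for("user_12345", 32)
print(path, bucket)
```

Because the partition value is encoded in the directory name, a query that filters on the partition column only needs to read the matching subdirectories.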


Useful command examples

• Start Hive: bin/hive
• Show all the tables: SHOW TABLES;
• Create a new table:
  CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
• Load data into the table:
  LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;


Useful command examples (cont'd)

• Select data:
  SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;

• Join:
  INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1;
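To show what the join query computes, here is a small in-memory equivalent. It assumes the kjv table has the same (freq, word) layout as shakespeare, and the rows are invented for the example; Hive would compile the real query into MapReduce stages rather than run anything like this.

```python
from collections import defaultdict


def join_on_word(shakespeare_rows, kjv_rows):
    """Inner join on word, keeping rows where both frequencies are >= 1."""
    # Build a hash index on the join key for one side.
    kjv_by_word = defaultdict(list)
    for freq, word in kjv_rows:
        kjv_by_word[word].append(freq)
    # Probe the index with every row from the other side.
    merged = []
    for s_freq, word in shakespeare_rows:
        for k_freq in kjv_by_word.get(word, []):
            if s_freq >= 1 and k_freq >= 1:
                merged.append((word, s_freq, k_freq))
    return merged


# Made-up (freq, word) rows; not real corpus counts.
shakespeare = [(120, "the"), (15, "love"), (8, "thou")]
kjv = [(300, "the"), (40, "thou"), (5, "begat")]
merged = join_on_word(shakespeare, kjv)
print(merged)
```

Only words that appear in both tables survive the join, which is why "love" and "begat" are absent from the result.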


Summary of Hive

• Supports rapid iteration of ad-hoc queries

• Can perform complex joins with minimal code

• Scales to handle much more data than many similar systems


References

• White, T. Hadoop: The Definitive Guide. O'Reilly, 2012

• http://hadoop.apache.org/

• http://hive.apache.org/

• MapReduce tutorial: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0

• Bill Graham, http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf

• Spiros Papadimitriou, Jimeng Sun, and Rong Yan, http://cs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/07-1.pdf

• Cloudera, http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf


Exercises

• To be announced
