THIRD EDITION
Hadoop: The Definitive Guide
Tom White
O'REILLY® Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Table of Contents
Foreword ........ xv
Preface ........ xvii
1. Meet Hadoop ........ 1 Data! 1 Data Storage and Analysis 3 Comparison with Other Systems 4
Relational Database Management System 4 Grid Computing 6 Volunteer Computing 8
A Brief History of Hadoop 9 Apache Hadoop and the Hadoop Ecosystem 12 Hadoop Releases 13
What's Covered in This Book 15 Compatibility 15
2. MapReduce ........ 17 A Weather Dataset 17
Data Format 17 Analyzing the Data with Unix Tools 19 Analyzing the Data with Hadoop 20
Map and Reduce 20 Java MapReduce 22
Scaling Out 30 Data Flow 30 Combiner Functions 33 Running a Distributed MapReduce Job 36
Hadoop Streaming 36 Ruby 36 Python 39
Hadoop Pipes 40
Compiling and Running 41
3. The Hadoop Distributed Filesystem ........ 43
The Design of HDFS 43 HDFS Concepts 45
Blocks 45 Namenodes and Datanodes 46 HDFS Federation 47 HDFS High-Availability 48
The Command-Line Interface 49 Basic Filesystem Operations 50
Hadoop Filesystems 52 Interfaces 53
The Java Interface 55 Reading Data from a Hadoop URL 55 Reading Data Using the FileSystem API 57 Writing Data 60 Directories 62 Querying the Filesystem 62 Deleting Data 67
Data Flow 67 Anatomy of a File Read 67 Anatomy of a File Write 70 Coherency Model 72
Data Ingest with Flume and Sqoop 74 Parallel Copying with distcp 75
Keeping an HDFS Cluster Balanced 76 Hadoop Archives 77
Using Hadoop Archives 77 Limitations 79
4. Hadoop I/O ........ 81
Data Integrity 81
Data Integrity in HDFS 81
LocalFileSystem 82
ChecksumFileSystem 83
Compression 83
Codecs 85 Compression and Input Splits 89
Using Compression in MapReduce 90
Serialization 93
The Writable Interface 94
Writable Classes 96 Implementing a Custom Writable 103 Serialization Frameworks 108
Avro 110 Avro Data Types and Schemas 111 In-Memory Serialization and Deserialization 114 Avro Datafiles 117 Interoperability 118 Schema Resolution 121 Sort Order 123 Avro MapReduce 124 Sorting Using Avro MapReduce 128 Avro MapReduce in Other Languages 130
File-Based Data Structures 130 SequenceFile 130 MapFile 137
5. Developing a MapReduce Application ........ 143 The Configuration API 144
Combining Resources 145 Variable Expansion 146
Setting Up the Development Environment 146 Managing Configuration 148 GenericOptionsParser, Tool, and ToolRunner 150
Writing a Unit Test with MRUnit 154 Mapper 154 Reducer 156
Running Locally on Test Data 157 Running a Job in a Local Job Runner 157 Testing the Driver 160
Running on a Cluster 161 Packaging a Job 162 Launching a Job 163 The MapReduce Web UI 165 Retrieving the Results 168 Debugging a Job 170 Hadoop Logs 175 Remote Debugging 177
Tuning a Job 178 Profiling Tasks 179
MapReduce Workflows 181 Decomposing a Problem into MapReduce Jobs 181 JobControl 183
Apache Oozie 183
6. How MapReduce Works ........ 189
Anatomy of a MapReduce Job Run 189
Classic MapReduce (MapReduce 1) 190 YARN (MapReduce 2) 196
Failures 202 Failures in Classic MapReduce 202 Failures in YARN 204
Job Scheduling 206 The Fair Scheduler 207 The Capacity Scheduler 207
Shuffle and Sort 208 The Map Side 208 The Reduce Side 210 Configuration Tuning 211
Task Execution 214 The Task Execution Environment 215 Speculative Execution 215 Output Committers 217 Task JVM Reuse 219 Skipping Bad Records 220
7. MapReduce Types and Formats ........ 223
MapReduce Types 223 The Default MapReduce Job 227
Input Formats 234 Input Splits and Records 234
Text Input 245
Binary Input 249
Multiple Inputs 250
Database Input (and Output) 251
Output Formats 251
Text Output 252
Binary Output 253
Multiple Outputs 253
Lazy Output 257
Database Output 258
8. MapReduce Features ........ 259
Counters 259
Built-in Counters 259 User-Defined Java Counters 264
User-Defined Streaming Counters 268
Sorting 268 Preparation 269 Partial Sort 270 Total Sort 274 Secondary Sort 277
Joins 283 Map-Side Joins 284 Reduce-Side Joins 285
Side Data Distribution 288 Using the Job Configuration 288 Distributed Cache 289
MapReduce Library Classes 295
9. Setting Up a Hadoop Cluster ........ 297
Cluster Specification 297 Network Topology 299
Cluster Setup and Installation 301 Installing Java 302 Creating a Hadoop User 302 Installing Hadoop 302 Testing the Installation 303
SSH Configuration 303 Hadoop Configuration 304
Configuration Management 305 Environment Settings 307 Important Hadoop Daemon Properties 311 Hadoop Daemon Addresses and Ports 316 Other Hadoop Properties 317 User Account Creation 320
YARN Configuration 320 Important YARN Daemon Properties 321 YARN Daemon Addresses and Ports 324
Security 325 Kerberos and Hadoop 326 Delegation Tokens 328 Other Security Enhancements 329
Benchmarking a Hadoop Cluster 331 Hadoop Benchmarks 331 User Jobs 333
Hadoop in the Cloud 334 Apache Whirr 334
10. Administering Hadoop ........ 339 HDFS 339
Persistent Data Structures 339 Safe Mode 344 Audit Logging 346 Tools 347
Monitoring 351 Logging 352 Metrics 352 Java Management Extensions 355
Maintenance 358 Routine Administration Procedures 358 Commissioning and Decommissioning Nodes 359 Upgrades 362
11. Pig ........ 367 Installing and Running Pig 368
Execution Types 368 Running Pig Programs 370 Grunt 370 Pig Latin Editors 371
An Example 371 Generating Examples 373
Comparison with Databases 374 Pig Latin 375
Structure 376 Statements 377 Expressions 381 Types 382 Schemas 384 Functions 388 Macros 390
User-Defined Functions 391 A Filter UDF 391 An Eval UDF 394 A Load UDF 396
Data Processing Operators 399 Loading and Storing Data 399 Filtering Data 400 Grouping and Joining Data 402 Sorting Data 407 Combining and Splitting Data 408
Pig in Practice 409
Parallelism 409 Parameter Substitution 410
12. Hive ........ 413 Installing Hive 414
The Hive Shell 415 An Example 416 Running Hive 417
Configuring Hive 417 Hive Services 419 The Metastore 421
Comparison with Traditional Databases 423 Schema on Read Versus Schema on Write 423 Updates, Transactions, and Indexes 424
HiveQL 425 Data Types 426 Operators and Functions 428
Tables 429 Managed Tables and External Tables 429 Partitions and Buckets 431 Storage Formats 435 Importing Data 441 Altering Tables 443 Dropping Tables 443
Querying Data 444 Sorting and Aggregating 444 MapReduce Scripts 445 Joins 446 Subqueries 449 Views 450
User-Defined Functions 451 Writing a UDF 452 Writing a UDAF 454
13. HBase ........ 459 HBasics 459
Backdrop 460 Concepts 460
Whirlwind Tour of the Data Model 460 Implementation 461
Installation 464 Test Drive 465
Clients 467
Java 467 Avro, REST, and Thrift 470
Example 472 Schemas 472 Loading Data 473 Web Queries 476
HBase Versus RDBMS 479 Successful Service 480 HBase 481 Use Case: HBase at Streamy.com 481
Praxis 483 Versions 483 HDFS 484 UI 485 Metrics 485 Schema Design 486 Counters 486 Bulk Load 487
14. ZooKeeper ............................... . . . . ... . ...... . .......... . .. 489 Installing and Running ZooKeeper 490 An Example 492
Group Membership in ZooKeeper 492 Creating the Group 493 Joining a Group 495 Listing Members in a Group 496 Deleting a Group 498
The ZooKeeper Service 499 Data Model 499 Operations 501 Implementation 506 Consistency 507 Sessions 509 States 511
Building Applications with ZooKeeper 512 A Configuration Service 512 The Resilient ZooKeeper Application 515 A Lock Service 519 More Distributed Data Structures and Protocols 521
ZooKeeper in Production 522 Resilience and Performance 523 Configuration 524
15. Sqoop ........ 527
Getting Sqoop 527 Sqoop Connectors 529
A Sample Import 529
Text and Binary File Formats 532 Generated Code 532
Additional Serialization Systems 533 Imports: A Deeper Look 533
Controlling the Import 535 Imports and Consistency 536 Direct-mode Imports 536
Working with Imported Data 536 Imported Data and Hive 537
Importing Large Objects 540 Performing an Export 542 Exports: A Deeper Look 543
Exports and Transactionality 545 Exports and SequenceFiles 545
16. Case Studies ....... .. ... . . . . . ........... . .................... . . . ... . . 547 Hadoop Usage at Last.fm 547
Last.fm: The Social Music Revolution 547 Hadoop at Last.fm 547 Generating Charts with Hadoop 548 The Track Statistics Program 549 Summary 556
Hadoop and Hive at Facebook 556 Hadoop at Facebook 556 Hypothetical Use Case Studies 559 Hive 562 Problems and Future Work 566
Nutch Search Engine 567 Data Structures 568 Selected Examples of Hadoop Data Processing in Nutch 571 Summary 580
Log Processing at Rackspace 581 Requirements/The Problem 581 Brief History 582 Choosing Hadoop 582 Collection and Storage 582 MapReduce for Logs 583
Cascading 589 Fields, Tuples, and Pipes 590
Operations 593 Taps, Schemes, and Flows 594 Cascading in Practice 595 Flexibility 598 Hadoop and Cascading at ShareThis 599 Summary 603
TeraByte Sort on Apache Hadoop 603 Using Pig and Wukong to Explore Billion-edge Network Graphs 607
Measuring Community 609 Everybody's Talkin' at Me: The Twitter Reply Graph 609 Symmetric Links 612 Community Extraction 613
A. Installing Apache Hadoop ........ 617
B. Cloudera's Distribution Including Apache Hadoop ............ .. ............ 623
C. Preparing the NCDC Weather Data ......... . .......... . ...... . ........... 625
Index ..... . . . . . ................................... . ................ . ...... 629
Hadoop got its s web search engi handful of comr route became cit having with Nut< as a part of Nutc
We managed to ~ to handle the Wt moreover, that t1
Around that timt We split off the d of Yahoo!, Hado
In 2006, Tom Wi excellent article l in clear prose. I ~ to read as his pre
From the beginn: for the project. U in tweaking the s anyone to use.
Initially, Tom sg ices. Then he mfj MapReduce API work. In all case role of Hadoop c· Management Co
Tom is now are~ he's an expert in easier to use and