→ What's the “Need”? ←
❏ Big data ocean
❏ Expensive hardware
❏ Frequent failures and difficult recovery
❏ Scaling up with more machines
2
→ Hadoop ←
❏ Open-source software
  ■ A Java framework
  ■ Initial release: December 10, 2011
❏ It provides both:
  ■ Storage → [HDFS]
  ■ Processing → [MapReduce]
❏ HDFS: Hadoop Distributed File System
3
→ How Hadoop addresses the need? ←
❏ Big data ocean
  ■ Have multiple machines. Each stores some portion of the data, not the entire data set.
❏ Expensive hardware
  ■ Use commodity hardware: simple and cheap.
❏ Frequent failures and difficult recovery
  ■ Keep multiple copies of the data, and keep the copies on different machines.
❏ Scaling up with more machines
  ■ If more processing power is needed, add new machines on the fly.
4
→ HDFS ←
❏ Runs on commodity hardware: doesn't require expensive machines
❏ Large files; write-once, read-many (WORM)
❏ Files are split into blocks
  ■ Actual blocks go to DataNodes
  ■ The metadata is stored on the NameNode
❏ Blocks are replicated to different nodes
❏ Default configuration:
  ■ Block size = 128 MB
  ■ Replication factor = 3
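With these defaults, the mapping from file size to blocks and raw storage is simple arithmetic. A quick sketch in plain shell (the 1 GB file here is a hypothetical example, not a figure from the slides):

```shell
# Hypothetical 1 GB file, default 128 MB block size, replication 3.
FILE_SIZE=$((1024 * 1024 * 1024))   # bytes
BLOCK_SIZE=$((128 * 1024 * 1024))   # bytes
REPLICATION=3

# Blocks needed, rounding up for a partial last block
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))

# Raw storage consumed across the cluster, in MB
RAW_MB=$(( FILE_SIZE * REPLICATION / 1024 / 1024 ))

echo "blocks=$BLOCKS raw_storage=${RAW_MB}MB"   # blocks=8 raw_storage=3072MB
```

So a single 1 GB file occupies 8 blocks and, with replication factor 3, about 3 GB of raw disk across the cluster.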
5
→ Where NOT to use HDFS ←
❏ Low-latency data access
  ■ HDFS is optimized for high throughput at the expense of latency.
❏ Large numbers of small files
  ■ The NameNode holds the entire file-system metadata in memory.
  ■ Many small files means too much metadata relative to the actual data.
❏ Multiple writers / arbitrary file modifications
  ■ No support for multiple writers to a file
  ■ Writes always append to the end of a file
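The small-files point can be made concrete with a back-of-the-envelope estimate. The ~150 bytes of NameNode heap per metadata object used below is a commonly quoted rule of thumb (an assumption here, not a figure from these slides):

```shell
# Rule-of-thumb sketch: assume ~150 bytes of NameNode heap per
# file/directory/block object (a rough, commonly quoted estimate).
BYTES_PER_OBJECT=150
SMALL_FILES=1000000   # 1 million files, each smaller than one block

# Each small file costs ~2 objects: one file entry + one block entry
SMALL_HEAP_MB=$(( SMALL_FILES * 2 * BYTES_PER_OBJECT / 1024 / 1024 ))

echo "~${SMALL_HEAP_MB}MB of NameNode heap for 1M small files"   # ~286MB
```

A million tiny files burn hundreds of megabytes of NameNode heap regardless of how little actual data they hold, which is why HDFS prefers a few large files over many small ones.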
9
→ Some Key Concepts ←
❏ NameNode
❏ DataNodes
❏ JobTracker
❏ TaskTrackers
❏ ResourceManager (MRv2)
❏ NodeManager (MRv2)
❏ ApplicationMaster (MRv2)
10
→ NameNode & DataNodes ←
❏ NameNode:
  ■ Centerpiece of HDFS: the master
  ■ Stores only the block metadata: block name, block location, etc.
  ■ Critical component; when it is down, the whole cluster is considered down: a single point of failure
  ■ Should be configured with more RAM
❏ DataNode:
  ■ Stores the actual data: the slave
  ■ In constant communication with the NameNode
  ■ When one is down, the availability of the data/cluster is not affected
  ■ Should be configured with more disk space
❏ SecondaryNameNode:
  ■ Does not actually act as a NameNode
  ■ Stores an image of the primary NameNode at certain checkpoints
  ■ Used as a backup to restore the NameNode
11
→ JobTracker & TaskTrackers ←
❏ JobTracker:
  ■ Talks to the NameNode to determine the location of the data
  ■ Monitors all TaskTrackers and submits the status of the job back to the client
  ■ When it is down, HDFS is still functional, but no new MR jobs can start and existing jobs are halted
  ■ Replaced by ResourceManager/ApplicationMaster in MRv2
❏ TaskTracker:
  ■ Runs on all DataNodes
  ■ Communicates with the JobTracker, signaling task progress
  ■ A TaskTracker failure is not considered fatal
  ■ Replaced by NodeManager in MRv2
13
→ ResourceManager & NodeManager ←
❏ Present in Hadoop v2.0
❏ Equivalent of JobTracker & TaskTracker in v1.0
❏ ResourceManager (RM):
  ■ Usually runs on the NameNode host; distributes resources among applications
  ■ Two main components: Scheduler and ApplicationsManager
❏ NodeManager (NM):
  ■ Per-node framework agent
  ■ Responsible for containers
  ■ Monitors their resource usage
  ■ Reports the stats to the RM
❏ The central ResourceManager and the per-node NodeManagers together are called YARN
14
→ Hadoop 1.0 vs. 2.0 ←
❏ HDFS 1.0:
  ■ Single point of failure
  ■ Horizontal scaling performance issues
❏ HDFS 2.0:
  ■ HDFS High Availability
  ■ HDFS Snapshots
  ■ Improved performance
  ■ HDFS Federation
16
→ Interacting with HDFS ←
❏ Command prompt:
  ■ Similar to Linux terminal commands
  ■ Unix is the model, POSIX is the API
❏ Web interface:
  ■ Similar to browsing an FTP site on the web
18
→ Notes ←
File paths on HDFS:
  ■ hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
  ■ hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
  ■ /user/USERNAME/demo/file.txt
  ■ demo/file.txt
File systems:
  ■ Local: local file system (Linux)
  ■ HDFS: Hadoop file system
In some places, the terms “file” and “directory” are used interchangeably.
20
→ Before we start ←
❏ Command:
  ■ hdfs
❏ Usage:
  ■ hdfs [--config confdir] COMMAND
❏ Examples:
  ■ hdfs dfs
  ■ hdfs dfsadmin
  ■ hdfs fsck
  ■ hdfs namenode
  ■ hdfs datanode
21
→ General syntax for `dfs` commands ←
hdfs dfs -<COMMAND> [OPTIONS] <PARAMETERS>

e.g.
hdfs dfs -ls -R /user/USERNAME/demo/data/
23
0. Do it yourself
❏ Syntax:
  ■ hdfs dfs -help [COMMAND …]
  ■ hdfs dfs -usage [COMMAND …]
❏ Examples:
  ■ hdfs dfs -help cat
  ■ hdfs dfs -usage cat
24
1. List the file/directory
❏ Syntax:
  ■ hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>
❏ Examples:
  ■ hdfs dfs -ls
  ■ hdfs dfs -ls /
  ■ hdfs dfs -ls /user/USERNAME/demo/list-dir-example
  ■ hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example
25
2. Create a directory
❏ Syntax:
  ■ hdfs dfs -mkdir [-p] <hdfs-dir-path>
❏ Examples:
  ■ hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example
  ■ hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3
26
3. Create a file locally & put it on HDFS
❏ Syntax:
  ■ vi filename.txt
  ■ hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
❏ Examples:
  ■ vi file-copy-to-hdfs.txt
  ■ hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
27
4. Get a file from HDFS to local
❏ Syntax:
  ■ hdfs dfs -get <hdfs-file-path> [local-dir-path]
❏ Examples:
  ■ hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
28
5. Copy From LOCAL To HDFS
❏ Syntax:
  ■ hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
29
6. Copy To LOCAL From HDFS
❏ Syntax:
  ■ hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
❏ Examples:
  ■ hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
30
7. Move a file from local to HDFS
❏ Syntax:
  ■ hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
❏ Examples:
  ■ hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
31
8. Copy a file within HDFS
❏ Syntax:
  ■ hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
❏ Example:
  ■ hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/
32
9. Move a file within HDFS
❏ Syntax:
  ■ hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
❏ Example:
  ■ hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/
33
10. Merge files on HDFS
❏ Syntax:
  ■ hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>
❏ Examples:
  ■ hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt
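Conceptually, `-getmerge` concatenates every file under an HDFS directory into one local file, with `-nl` adding a newline after each part. A local-only sketch of the same behavior (made-up paths, no cluster needed):

```shell
# Local-only analogy of `hdfs dfs -getmerge -nl` (assumption: no HDFS
# involved; plain files stand in for the HDFS parts).
mkdir -p /tmp/merge-example
printf 'alpha' > /tmp/merge-example/part-0
printf 'beta'  > /tmp/merge-example/part-1

# Concatenate the parts, adding a newline after each one (the -nl part)
for f in /tmp/merge-example/part-*; do
  cat "$f"
  printf '\n'
done > /tmp/all-files.txt

cat /tmp/all-files.txt   # alpha and beta, one per line
```

This is handy for collecting the `part-*` outputs of a MapReduce job into a single readable file.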
34
11. View file contents
❏ Syntax:
  ■ hdfs dfs -cat <hdfs-file-path>
  ■ hdfs dfs -tail <hdfs-file-path>
  ■ hdfs dfs -text <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt
  ■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head
35
12. Remove files/dirs from HDFS
❏ Syntax:
  ■ hdfs dfs -rm [options] <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt
  ■ hdfs dfs -rm -R /user/USERNAME/demo/remove-example/
  ■ hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/
36
13. Change file/dir properties
❏ Syntax:
  ■ hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
  ■ hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
  ■ hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
37
14. Check the file size
❏ Syntax:
  ■ hdfs dfs -du <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -du /user/USERNAME/demo/data/file.txt
  ■ hdfs dfs -du -s -h /user/USERNAME/demo/data/
38
15. Create a zero byte file in HDFS
❏ Syntax:
  ■ hdfs dfs -touchz <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt
39
16. File test operations
❏ Syntax:
  ■ hdfs dfs -test -[defsz] <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -test -e /user/USERNAME/demo/data/file.txt
  ■ echo $?
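Note that `-test` prints nothing: the result comes back in the exit status, the same convention as POSIX `test(1)`. The pattern can be tried locally with the shell's own `[` builtin standing in for `hdfs dfs -test` (made-up paths, no cluster needed):

```shell
# The exit-status convention of `hdfs dfs -test`, demonstrated with
# the shell's `[` builtin as a local stand-in.
touch /tmp/exists.txt
rm -f /tmp/missing.txt

[ -e /tmp/exists.txt ]
echo $?    # 0 -> the file exists

[ -e /tmp/missing.txt ]
echo $?    # 1 -> the file does not exist
```

In scripts, this is typically combined with `if` or `&&`/`||` rather than an explicit `echo $?`.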
40
17. Get FileSystem Statistics
❏ Syntax:
  ■ hdfs dfs -stat [format] <hdfs-file-path>
❏ Format options:
  ■ %b - file size in blocks
  ■ %g - group name of owner
  ■ %n - filename
  ■ %o - block size
  ■ %r - replication
  ■ %u - user name of owner
  ■ %y - modification date
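GNU `stat(1)` uses the same format-string idea for local files, though its format codes differ from those of `hdfs dfs -stat`. A runnable local analogy (hypothetical path; assumes GNU coreutils, where `%n` is the name and `%s` is the size in bytes):

```shell
# Local analogy of `hdfs dfs -stat` using GNU stat(1).
# Note the different format codes: here %s is size in bytes.
printf 'hello' > /tmp/stat-example.txt
stat -c '%n %s' /tmp/stat-example.txt   # /tmp/stat-example.txt 5
```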
41
18. Get File/Dir Counts
❏ Syntax:
  ■ hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>
❏ Example:
  ■ hdfs dfs -count -v /user/USERNAME/demo/
42
19. Set replication factor
❏ Syntax:
  ■ hdfs dfs -setrep [-w] [-R] <n> <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt
43
20. Set Block Size
❏ Syntax:
  ■ hdfs dfs -D dfs.blocksize=<blocksize> -copyFromLocal <local-file-path> <hdfs-file-path>
❏ Examples:
  ■ hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
44
22. HDFS Admin Commands: fsck
❏ Syntax:
  ■ hdfs fsck <hdfs-file-path>
❏ Options:
  [-list-corruptfileblocks]
  [-move | -delete | -openforwrite]
  [-files [-blocks [-locations | -racks]]]
  [-includeSnapshots]
47
23. HDFS Admin Commands: dfsadmin
❏ Syntax:
  ■ hdfs dfsadmin
❏ Options:
  [-report [-live] [-dead] [-decommissioning]]
  [-safemode enter | leave | get | wait]
  [-refreshNodes]
  [-refresh <host:ipc_port> <key> [arg1..argn]]
  [-shutdownDatanode <datanode:port> [upgrade]]
  [-getDatanodeInfo <datanode_host:ipc_port>]
  [-help [cmd]]
❏ Examples:
  ■ hdfs dfsadmin -report -live
49
24. HDFS Admin Commands: namenode
❏ Syntax:
  ■ hdfs namenode
❏ Options:
  [-checkpoint] |
  [-format [-clusterid cid] [-force] [-nonInteractive]] |
  [-upgrade [-clusterid cid]] |
  [-rollback] |
  [-recover [-force]] |
  [-metadataVersion]
❏ Examples:
  ■ hdfs namenode -help
51
25. HDFS Admin Commands: getconf
❏ Syntax:
  ■ hdfs getconf [-options]
❏ Options:
  [-namenodes]
  [-secondaryNameNodes]
  [-backupNodes]
  [-includeFile]
  [-excludeFile]
  [-nnRpcAddresses]
  [-confKey [key]]
52
Again, THE most important command!
❏ Syntax:
  ■ hdfs dfs -help [options]
  ■ hdfs dfs -usage [options]
❏ Examples:
  ■ hdfs dfs -help help
  ■ hdfs dfs -usage usage
53
Web HDFS
URL:
  http://namenode:50070/explorer.html
Examples:
  http://localhost:50070/explorer.html
  http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html
55
© 2016 DataTorrent
Resources
58
• Apache Apex website - http://apex.incubator.apache.org/
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Startup Program – Free Enterprise License for Startups, Educational Institutions, Non-Profits - https://www.datatorrent.com/product/startup-accelerator/
• Cloud Trial - http://web.datatorrent.com/cloudtrial.html
We Are Hiring
59
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
Upcoming Events
60
• March 15th – …
• March 17th 6pm PST – Title
• March 24th 9am PST – Title
• …
Copy data from one cluster to another in HDFS (distcp)
❏ Description:
  ■ Copy data between clusters
❏ Syntax:
  ■ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
  ■ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
  ■ hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
    where srclist.file contains:
    ■ hdfs://nn1:8020/foo/a
    ■ hdfs://nn1:8020/foo/b
62