
HADOOP

Interacting with HDFS

For University Program on Apache Hadoop & Apache Apex

→ What's the “Need”? ←

❏ Big data ocean
❏ Expensive hardware
❏ Frequent failures and difficult recovery
❏ Scaling up with more machines

2

→ Hadoop ←

❏ Open-source software
  ■ A Java framework
  ■ Initial release: December 10, 2011

❏ It provides both:
  ■ Storage → [HDFS]
  ■ Processing → [MapReduce]

❏ HDFS: Hadoop Distributed File System

3

→ How Hadoop addresses the need? ←

❏ Big data ocean
  ■ Have multiple machines. Each stores a portion of the data, not the entire data set.

❏ Expensive hardware
  ■ Use commodity hardware: simple and cheap.

❏ Frequent failures and difficult recovery
  ■ Keep multiple copies of the data, on different machines.

❏ Scaling up with more machines
  ■ If more processing is needed, add new machines on the fly.

4

→ HDFS ←

❏ Runs on commodity hardware: doesn't require expensive machines
❏ Large files; write-once, read-many (WORM)
❏ Files are split into blocks
❏ Actual blocks go to DataNodes
❏ The metadata is stored at the NameNode
❏ Blocks are replicated to different nodes
❏ Default configuration (sizing sketch below):
  ■ Block size = 128 MB
  ■ Replication factor = 3
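A quick sizing sketch with these defaults (the 500 MB file is an assumed example):

  500 MB file → ⌈500 / 128⌉ = 4 blocks (3 × 128 MB + 1 × 116 MB)
  4 blocks × replication factor 3 = 12 block replicas ≈ 1.5 GB of raw disk for 500 MB of data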

5


→ Where NOT TO use HDFS ←

❏ Low-latency data access
  ■ HDFS is optimized for high throughput of data at the expense of latency.

❏ Large numbers of small files
  ■ The NameNode keeps the entire file-system metadata in memory.
  ■ Too much metadata as compared to the actual data (estimate below).

❏ Multiple writers / arbitrary file modifications
  ■ No support for multiple writers to a file.
  ■ Writes always append to the end of a file.
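A back-of-the-envelope estimate (the ~150 bytes per namespace object is a commonly quoted rule of thumb, not an exact figure):

  10,000,000 small files, one block each → ~20,000,000 objects (files + blocks)
  20,000,000 × ~150 bytes ≈ 3 GB of NameNode heap, no matter how little data the files hold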

9

→ Some Key Concepts ←

❏ NameNode
❏ DataNodes
❏ JobTracker
❏ TaskTrackers
❏ ResourceManager (MRv2)
❏ NodeManager (MRv2)
❏ ApplicationMaster (MRv2)

10

→ NameNode & DataNodes ←

❏ NameNode:
  ■ Centerpiece of HDFS: the master
  ■ Stores only the block metadata: block name, block location, etc.
  ■ Critical component; when it is down, the whole cluster is considered down (single point of failure)
  ■ Should be configured with more RAM

❏ DataNode:
  ■ Stores the actual data: the slave
  ■ In constant communication with the NameNode
  ■ When one goes down, the availability of the data/cluster is not affected
  ■ Should be configured with more disk space

❏ SecondaryNameNode:
  ■ Does not actually act as a NameNode
  ■ Stores an image of the primary NameNode's namespace at checkpoints
  ■ That checkpoint image can be used to help restore the NameNode (a quick daemon check follows)
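On a running pseudo-distributed installation, the JDK's jps tool shows which of these daemons are up; the process list below is illustrative, not real output:

  $ jps
  21345 NameNode
  21467 DataNode
  21589 SecondaryNameNode
  21802 Jps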

11


→ JobTracker & TaskTrackers ←

❏ JobTracker:
  ■ Talks to the NameNode to determine the location of the data
  ■ Monitors all TaskTrackers and submits the status of the job back to the client
  ■ When it is down, HDFS is still functional, but no new MapReduce jobs start and existing jobs halt
  ■ Replaced by the ResourceManager/ApplicationMaster in MRv2

❏ TaskTracker:
  ■ Runs on all DataNodes
  ■ Communicates with the JobTracker, signaling task progress
  ■ A TaskTracker failure is not considered fatal
  ■ Replaced by the NodeManager in MRv2

13

→ ResourceManager & NodeManager ←

❏ Present in Hadoop v2.0
❏ Equivalents of the JobTracker & TaskTrackers in v1.0

❏ ResourceManager (RM):
  ■ Usually runs on the NameNode machine; distributes resources among applications
  ■ Two main components: Scheduler and ApplicationsManager

❏ NodeManager (NM):
  ■ Per-node framework agent
  ■ Responsible for containers
  ■ Monitors their resource usage
  ■ Reports the stats to the RM

The central ResourceManager and the per-node NodeManagers together are called YARN (quick check below).
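With YARN running, you can list the NodeManagers registered with the ResourceManager; this is a sketch, and the exact columns vary by Hadoop version:

  $ yarn node -list
  Total Nodes:1
        Node-Id   Node-State   Node-Http-Address   Number-of-Running-Containers
  localhost:45454    RUNNING      localhost:8042                              0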

14


→ Hadoop 1.0 vs. 2.0 ←

❏ HDFS 1.0:
  ■ Single point of failure
  ■ Horizontal-scaling performance issues

❏ HDFS 2.0:
  ■ HDFS High Availability
  ■ HDFS Snapshots
  ■ Improved performance
  ■ HDFS Federation

16

→ Interacting with HDFS ←

❏ Command prompt:
  ■ Similar to Linux terminal commands
  ■ Unix is the model, POSIX is the API

❏ Web interface:
  ■ Similar to browsing an FTP site on the web (see the note below)
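On a stock Hadoop 2.x installation, the NameNode serves a status page and file browser over HTTP; port 50070 is the 2.x default (dfs.namenode.http-address), so adjust for your cluster:

  http://localhost:50070/        # cluster summary, DataNode status, file browser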

18

Interacting With HDFS

On the Command Prompt

19

→ Notes ←

❏ File paths on HDFS:
  ■ hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
  ■ hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
  ■ /user/USERNAME/demo/file.txt
  ■ demo/file.txt
  (a path-equivalence sketch follows)

❏ File system:
  ■ Local: the local (Linux) file system
  ■ HDFS: the Hadoop file system

❏ In some places the terms “file” and “directory” are used with the same meaning.
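All four path forms above can name the same file: relative paths resolve against your HDFS home directory, /user/USERNAME by convention. A sketch, assuming the NameNode listens on localhost:8020:

  hdfs dfs -ls hdfs://localhost:8020/user/USERNAME/demo
  hdfs dfs -ls /user/USERNAME/demo
  hdfs dfs -ls demo            # all three list the same directory for user USERNAME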

20

→ Before we start ←

❏ Command:
  ■ hdfs

❏ Usage:
  ■ hdfs [--config confdir] COMMAND

❏ Examples:
  ■ hdfs dfs
  ■ hdfs dfsadmin
  ■ hdfs fsck
  ■ hdfs namenode
  ■ hdfs datanode

21

hdfs `dfs` commands

22

→ General syntax for `dfs` commands ←

hdfs dfs -<COMMAND> [OPTIONS] <PARAMETERS>

e.g.
hdfs dfs -ls -R /user/USERNAME/demo/data/

23

0. Do it yourself

❏ Syntax:
  ■ hdfs dfs -help [COMMAND … ]
  ■ hdfs dfs -usage [COMMAND … ]

❏ Examples:
  ■ hdfs dfs -help cat
  ■ hdfs dfs -usage cat

24

1. List the file/directory

❏ Syntax:
  ■ hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>

❏ Examples (sample output below):
  ■ hdfs dfs -ls
  ■ hdfs dfs -ls /
  ■ hdfs dfs -ls /user/USERNAME/demo/list-dir-example
  ■ hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example
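The listing format resembles ls -l, with the replication factor in the second column (a dash for directories); the output below is illustrative, not from a real cluster:

  $ hdfs dfs -ls /user/USERNAME/demo
  Found 2 items
  drwxr-xr-x   - USERNAME supergroup          0 2016-03-01 10:15 /user/USERNAME/demo/data
  -rw-r--r--   3 USERNAME supergroup    1048576 2016-03-01 10:20 /user/USERNAME/demo/file.txt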

25

2. Creating a directory

❏ Syntax:
  ■ hdfs dfs -mkdir [-p] <hdfs-dir-path>

❏ Examples:
  ■ hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example
  ■ hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3

26

3. Create a file on local & put it on HDFS

❏ Syntax:
  ■ vi filename.txt
  ■ hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>

❏ Example:
  ■ vi file-copy-to-hdfs.txt
  ■ hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/

27

4. Get a file from HDFS to local

❏ Syntax:
  ■ hdfs dfs -get <hdfs-file-path> [local-dir-path]

❏ Example:
  ■ hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/

28

5. Copy From LOCAL To HDFS

❏ Syntax:
  ■ hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>

❏ Example:
  ■ hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/

29

6. Copy To LOCAL From HDFS

❏ Syntax:
  ■ hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>

❏ Example:
  ■ hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/

30

7. Move a file from local to HDFS

❏ Syntax:
  ■ hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>

❏ Example:
  ■ hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/

31

8. Copy a file within HDFS

❏ Syntax:
  ■ hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>

❏ Example:
  ■ hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/

32

9. Move a file within HDFS

❏ Syntax:
  ■ hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>

❏ Example:
  ■ hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/

33

10. Merge files on HDFS

❏ Syntax:
  ■ hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>

❏ Example:
  ■ hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt

34

11. View file contents

❏ Syntax:
  ■ hdfs dfs -cat <hdfs-file-path>
  ■ hdfs dfs -tail <hdfs-file-path>
  ■ hdfs dfs -text <hdfs-file-path>

❏ Examples (see the -text note below):
  ■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt
  ■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head
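-cat prints bytes verbatim, while -text also decodes formats Hadoop knows how to read, such as gzip-compressed files and SequenceFiles; a sketch (the .gz file name is an assumed example):

  hdfs dfs -text /user/USERNAME/demo/data/log-example.txt.gz | head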

35

12. Remove files/dirs from HDFS

❏ Syntax:
  ■ hdfs dfs -rm [options] <hdfs-file-path>

❏ Examples:
  ■ hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt
  ■ hdfs dfs -rm -R /user/USERNAME/demo/remove-example/
  ■ hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/

36

13. Change file/dir properties

❏ Syntax:
  ■ hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
  ■ hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
  ■ hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>

❏ Example:
  ■ hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt

37

14. Check the file size

❏ Syntax:
  ■ hdfs dfs -du <hdfs-file-path>

❏ Examples:
  ■ hdfs dfs -du /user/USERNAME/demo/data/file.txt
  ■ hdfs dfs -du -s -h /user/USERNAME/demo/data/

38

15. Create a zero byte file in HDFS

❏ Syntax:
  ■ hdfs dfs -touchz <hdfs-file-path>

❏ Example:
  ■ hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt

39

16. File test operations

❏ Syntax:
  ■ hdfs dfs -test -[defsz] <hdfs-file-path>

❏ Examples:
  ■ hdfs dfs -test -e /user/USERNAME/demo/data/file.txt
  ■ echo $?        (0 if the test passed, 1 otherwise; scripting sketch below)
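Because -test reports through its exit code, it composes naturally with shell conditionals; a minimal sketch (the path is an assumed example):

  #!/bin/sh
  # -d: is a directory, -e: exists, -f: is a file, -s: is non-empty, -z: is zero length
  if hdfs dfs -test -e /user/USERNAME/demo/data/file.txt; then
    echo "file exists on HDFS"
  else
    echo "file missing"
  fi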

40

17. Get FileSystem Statistics

❏ Syntax:
  ■ hdfs dfs -stat [format] <hdfs-file-path>

❏ Format options (example below):
  ■ %b - file size in blocks
  ■ %g - group name of owner
  ■ %n - filename
  ■ %o - block size
  ■ %r - replication
  ■ %u - user name of owner
  ■ %y - modification date
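The slide gives no example, so here is a minimal sketch (the path and the printed values are assumptions):

  $ hdfs dfs -stat "%n %o %r %u %y" /user/USERNAME/demo/data/file.txt
  file.txt 134217728 3 USERNAME 2016-03-01 10:20:00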

41

18. Get File/Dir Counts

❏ Syntax:
  ■ hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>

❏ Example (output sketch below):
  ■ hdfs dfs -count -v /user/USERNAME/demo/
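Without -q, the columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME; -v prints that header row. An illustrative run (the numbers are assumptions):

  $ hdfs dfs -count -v /user/USERNAME/demo/
     DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
             5           12            4194304 /user/USERNAME/demo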

42

19. Set replication factor

❏ Syntax:
  ■ hdfs dfs -setrep [-w] [-R] <n> <hdfs-file-path>

❏ Example:
  ■ hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt

43

20. Set Block Size

❏ Syntax:
  ■ hdfs dfs -D dfs.blocksize=<blocksize> -copyFromLocal <local-file-path> <hdfs-file-path>

❏ Example:
  ■ hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
    (67108864 bytes = 64 MB)

44

21. Empty the HDFS trash

❏ Syntax:
  ■ hdfs dfs -expunge

❏ Location:
  ■ /user/USERNAME/.Trash

45

Other hdfs commands (admin)

46

22. HDFS Admin Commands: fsck

❏ Syntax:
  ■ hdfs fsck <hdfs-file-path>

❏ Options (example below):
  [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
  [-includeSnapshots]
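A common health check walks a subtree and prints per-file block information (the path is an assumed example):

  hdfs fsck /user/USERNAME -files -blocks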

47


23. HDFS Admin Commands: dfsadmin

❏ Syntax:
  ■ hdfs dfsadmin

❏ Options:
  [-report [-live] [-dead] [-decommissioning]]
  [-safemode enter | leave | get | wait]
  [-refreshNodes]
  [-refresh <host:ipc_port> <key> [arg1..argn]]
  [-shutdownDatanode <datanode:port> [upgrade]]
  [-getDatanodeInfo <datanode_host:ipc_port>]
  [-help [cmd]]

❏ Example:
  ■ hdfs dfsadmin -report -live

49


24. HDFS Admin Commands: namenode

❏ Syntax:
  ■ hdfs namenode

❏ Options:
  [-checkpoint] |
  [-format [-clusterid cid] [-force] [-nonInteractive]] |
  [-upgrade [-clusterid cid]] |
  [-rollback] |
  [-recover [-force]] |
  [-metadataVersion]

❏ Example:
  ■ hdfs namenode -help

51

25. HDFS Admin Commands: getconf

❏ Syntax:
  ■ hdfs getconf [-options]

❏ Options (examples below):
  [-namenodes]          [-secondaryNameNodes]
  [-backupNodes]        [-includeFile]
  [-excludeFile]        [-nnRpcAddresses]
  [-confKey [key]]
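getconf reads effective configuration values without opening the XML files; a minimal sketch:

  hdfs getconf -namenodes              # which hosts act as NameNodes
  hdfs getconf -confKey dfs.blocksize  # effective block size, in bytes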

52

Again, THE most important command!

❏ Syntax:
  ■ hdfs dfs -help [COMMAND … ]
  ■ hdfs dfs -usage [COMMAND … ]

❏ Examples:
  ■ hdfs dfs -help help
  ■ hdfs dfs -usage usage

53

Interacting With HDFS

In the Web Browser

54


© 2016 DataTorrent

Resources

58

• Apache Apex website - http://apex.incubator.apache.org/

• Subscribe - http://apex.incubator.apache.org/community.html

• Download - http://apex.incubator.apache.org/downloads.html

• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex

• Facebook - https://www.facebook.com/ApacheApex/

• Meetup - http://www.meetup.com/topics/apache-apex

• Startup Program – Free Enterprise License for Startups, Educational Institutions, Non-Profits - https://www.datatorrent.com/product/startup-accelerator/

• Cloud Trial - http://web.datatorrent.com/cloudtrial.html


We Are Hiring

59

[email protected]

• Developers/Architects

• QA Automation Developers

• Information Developers

• Build and Release


Upcoming Events

60

• March 15th – …

• March 17th 6pm PST – Title

• March 24th 9am PST – Title

• …

APPENDIX

61

Copy data from one cluster to another in HDFS (distcp)

❏ Description:
  ■ Copy data between clusters

❏ Syntax / Examples:
  ■ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
  ■ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
  ■ hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
    where srclist.file contains:
      hdfs://nn1:8020/foo/a
      hdfs://nn1:8020/foo/b

62