Computational Tools for Data Science

Week 2: The UNIX Shell & Version Control & Amazon EC2

Evaluations for week 1

• The lecture was too long. Break it up with exercises or have students read beforehand.

• Focus on the most important aspects and explain them in more depth.

• Lectures on syntax do not work.

• The slides for week 1 were nowhere to be found.

• Clear info about which exercises the students are expected to finish. This info should be on the webpage.

I expect you to finish all exercises, unless otherwise stated. For the weeks where I have not created the material, I will give further instructions.

• Clearer description of the requirements for passing the course. What actually happens at the final presentation, and does the work in lectures/exercises count?

• Is the final presentation a project, and can we start working on it earlier?

The final presentation is a presentation. It does not have to be a large project. You can start working now if you want.

Your work during the lectures/exercises does not count towards you passing/failing the course.

Ideas for final presentation

‣ Awk
‣ Julia
‣ Vowpal Wabbit
‣ BigTable
‣ Cool library for Python
‣ Matlab tricks
‣ Apache Mahout
‣ A cool deep learning library
‣ Hashing Tricks
   ‣ Locality-sensitive hashing
   ‣ Feature hashing

Which commands?

Standard tools: top, screen, chmod, diff, find, which, apt-get, ssh, wget, curl

Editors: nano, vi, emacs

Working with files: head, tail, more, less, cat, grep, sed, cut, sort, uniq, awk

Covered here (in the sections that follow): ssh, cat, cut, sort, uniq, grep, sed

SSH

~ ssh dawi@hald.gbar.dtu.dk
The authenticity of host 'hald.gbar.dtu.dk (192.38.95.41)' can't be established.
RSA key fingerprint is 78:74:43:13:9d:23:02:95:78:18:48:24:47:cf:6d:05.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hald.gbar.dtu.dk,192.38.95.41' (RSA) to the list of known hosts.
Password:
~
gray1(dawi) $ pwd
/zhome/5b/c/51358

Secure Shell (SSH) is a cryptographic network protocol for secure data communication, remote command-line login, remote command execution, and other secure network services between two networked computers.

We could do a whole lecture on RSA keys for SSH.

Pipes and redirections

>    Redirect output from a command to a file on disk, overwriting any existing content.

>>   Append output from a command to a file on disk.

<    Read a command's input from a file on disk.

|    Pass the output of one command to another for further processing.

tee  Redirect output to a file and pass it to further commands.
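
The five operators above can be tried together in one short session. This is only a minimal sketch: the file names fruits.txt and sorted.txt are made up for illustration, and everything happens in a scratch directory created with mktemp.

```shell
# Work in a scratch directory so nothing in the current folder is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# >  creates (or overwrites) a file with the command's output.
printf 'banana\napple\n' > fruits.txt

# >> appends to the same file instead of overwriting it.
printf 'cherry\n' >> fruits.txt

# <  feeds the file to the command's standard input.
# tr removes the padding some wc implementations print before the number.
line_count=$(wc -l < fruits.txt | tr -d ' ')

# |  passes one command's output to the next; tee saves a copy on the way.
first_sorted=$(sort fruits.txt | tee sorted.txt | head -n 1)

echo "lines: $line_count, first after sorting: $first_sorted"
# prints: lines: 3, first after sorting: apple
```

After the last pipeline, sorted.txt holds the full sorted list even though only its first line was printed, which is the point of tee.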

CAT

~ cat > temp1
This will go in temp1
~ cat > temp2
This will go in temp2
~ cat temp1
This will go in temp1
~ cat temp1 temp2
This will go in temp1
This will go in temp2
~ cat temp1 temp2 > temp3
~ cat temp3
This will go in temp1
This will go in temp2

The cat program outputs the contents of the files it is given and can be used to concatenate and list files. With no file argument, as in cat > temp1, it reads from standard input until end of input (Ctrl-D).

~ cat temp3
This will go in temp1
This will go in temp2
~ cat >> temp3
This will also go in temp3
~ cat temp3
This will go in temp1
This will go in temp2
This will also go in temp3
~ cat -n temp3
     1  This will go in temp1
     2  This will go in temp2
     3  This will also go in temp3

CUT

~ cat > temp
This is a temp file
1234567890
With some content
~ cut -c4 temp
s
4
h
~ cut -c4,6 temp
si
46
hs

~ cut -c4-6 temp
s i
456
h s
~ cut -c-6 temp
This i
123456
With s
~ cut -d' ' -f2 temp
is
1234567890
some
~ cut -d' ' -f2,3 temp
is a
1234567890
some content
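
In the field examples above, the line 1234567890 is printed whole because it contains no delimiter. The sketch below shows that behaviour on comma-separated data and how the -s option suppresses such lines. The file people.csv and its contents are hypothetical sample data.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: the middle line contains no comma at all.
printf 'David,1001,DTU\nno delimiter here\nEmil,33,KU\n' > people.csv

# Field 1 of every line; the comma-free line is passed through unchanged.
names_all=$(cut -d',' -f1 people.csv)

# -s suppresses lines that do not contain the delimiter.
names_only=$(cut -s -d',' -f1 people.csv)

# A field range: fields 2 through 3, still comma-separated in the output.
tail_fields=$(cut -s -d',' -f2-3 people.csv)

printf '%s\n---\n%s\n---\n%s\n' "$names_all" "$names_only" "$tail_fields"
```

Whether to use -s depends on the data: it is handy for skipping comment or banner lines, but it also silently drops malformed records.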

SORT

~ cat > temp
1001 Søren
33 Emil
25 Helge
1001 David
~ sort temp
1001 David
1001 Søren
25 Helge
33 Emil
~ sort -n temp
25 Helge
33 Emil
1001 David
1001 Søren

~ sort -k2 temp
1001 David
33 Emil
25 Helge
1001 Søren
~ sort -k2 -r temp
1001 Søren
25 Helge
33 Emil
1001 David
~ sort -k1,1 -k2,2 temp
1001 David
1001 Søren
25 Helge
33 Emil
~ sort -k1,1 -k2,2r temp
1001 Søren
1001 David
25 Helge
33 Emil

~ cat > temp2
David,1
Helge,10
Søren,5
~ sort -t',' -nk2 temp2
David,1
Søren,5
Helge,10
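
The difference between text and numeric ordering on a single field can be seen side by side in the following sketch. The file scores.csv is made-up sample data, and LC_ALL=C is set on the text sort only to make the byte-wise ordering explicit.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Helge,10\nEmil,5\nDavid,1\n' > scores.csv

# Text sort on field 2 only (-k2,2): "10" comes before "5" because '1' < '5'.
as_text=$(LC_ALL=C sort -t',' -k2,2 scores.csv)

# Numeric sort on field 2 only (-k2,2n): 1, 5, 10.
as_numbers=$(sort -t',' -k2,2n scores.csv)

# Numeric and reversed: the largest score comes first.
top=$(sort -t',' -k2,2nr scores.csv | head -n 1)

printf '%s\n---\n%s\n---\n%s\n' "$as_text" "$as_numbers" "$top"
```

Writing the key as -k2,2 rather than -k2 limits the comparison to the second field. A bare -k2 means "from field 2 to the end of the line", which only matters once there are more fields after it.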

UNIQ

~ cat > temp
1
1
2
3
2
~ uniq temp
1
2
3
2
~ sort -n temp | uniq
1
2
3
~ sort -n temp | uniq -c
   2 1
   2 2
   1 3
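
As the session above shows, uniq only collapses equal lines that are adjacent, which is why it is usually preceded by sort. A common extension of that idea is a frequency table, sketched here on a made-up file letters.txt. The awk step is just one way to tidy the padded counts that uniq -c prints.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'b\na\nb\nc\nb\na\n' > letters.txt

# uniq alone removes nothing here: no two equal lines are adjacent.
unsorted_count=$(uniq letters.txt | wc -l | tr -d ' ')

# Sorting first groups equal lines, so uniq can collapse them.
distinct_count=$(sort letters.txt | uniq | wc -l | tr -d ' ')

# Frequency table, most common value first; awk reorders "count value".
most_common=$(sort letters.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2 " x" $1}')

echo "$unsorted_count lines, $distinct_count distinct, most common: $most_common"
# prints: 6 lines, 3 distinct, most common: b x3
```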

GREP

~ cat > temp
big
bad bug
bag
bigger
boogy
~ grep big temp
big
bigger
~ grep b.g temp
big
bad bug
bag
bigger
~ grep "b.*g" temp
big
bad bug
bag
bigger
boogy
~ grep -w big temp
big

~ cat > temp1
this is a line in the first file
~ cat > temp2
this is a line in the second file
~ grep this temp*
temp1:this is a line in the first file
temp2:this is a line in the second file

Display N lines after match: grep -A <N> "string" FILENAME
Display N lines before match: grep -B <N> "string" FILENAME
Search recursively in folders: grep -r "string" *
Invert match: grep -v "string" FILENAME
Count the number of matching lines: grep -c "string" FILENAME
Display only the file names: grep -l "string" FILENAME
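
A few of these options are sketched below on two made-up log files, app.log and other.log. The file names and contents are assumptions for illustration only.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'start\nerror: disk full (error 28)\nretrying\ndone\n' > app.log
printf 'all good\n' > other.log

# -c counts matching lines, so a line with two hits still counts once.
error_lines=$(grep -c 'error' app.log)

# -v inverts the match: every line that does NOT contain "error".
clean_lines=$(grep -v 'error' app.log | wc -l | tr -d ' ')

# -A 1 prints each matching line plus one line of trailing context.
with_context=$(grep -A 1 'error' app.log)

# -l prints only the names of the files that contain a match.
files_with_errors=$(grep -l 'error' app.log other.log)

printf '%s\n%s\n%s\n%s\n' "$error_lines" "$clean_lines" "$with_context" "$files_with_errors"
```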

SED

~ cat > temp
one two three, one two three 
four three two one
one hundred
~ sed 's/one/ONE/' temp
ONE two three, one two three 
four three two ONE
ONE hundred
~ sed 's_one_ONE_' temp
ONE two three, one two three 
four three two ONE
ONE hundred
~ sed 's/[a-z]*/LOL/' temp
LOL two three, one two three 
LOL three two one
LOL hundred
~ sed 's/[a-z]*/(&)/' temp
(one) two three, one two three 
(four) three two one
(one) hundred
~ sed 's/[a-z]*/(&)/g' temp
(one) (two) (three),() (one) (two) (three) ()
(four) (three) (two) (one)
(one) (hundred)

Note: the first line of temp appears to end with a trailing space, which would explain the final () on that line in the last command.
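
The stray () in the last sed command above comes from the pattern [a-z]*, which also matches the empty string. A minimal sketch of one way to avoid that is to require at least one letter with [a-z][a-z]*. The variable names are only illustrative.

```shell
text='one two three, one'

# [a-z]* can match zero letters, so with the g flag it may also fire at
# positions where no word starts. [a-z][a-z]* needs at least one letter
# and therefore only wraps real words.
wrapped=$(printf '%s\n' "$text" | sed 's/[a-z][a-z]*/(&)/g')

# Without the g flag only the first match on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/[a-z][a-z]*/(&)/')

printf '%s\n%s\n' "$wrapped" "$first_only"
# prints:
# (one) (two) (three), (one)
# (one) two three, one
```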

Combinations

~ head temp
Id,Prediction
17000,0
17001,0
17002,1
17003,1
17004,1
17005,0
17006,0
17007,0
17008,0
~ cut -d',' -f2 temp | grep -c '0'
1983
~ cut -d',' -f2 temp | grep -c '1'
2075
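
Counting with grep -c '0' works above because the column only holds 0 or 1. The sketch below uses a small made-up file, predictions.csv, with a deliberately added value 10 to show where that shortcut would over-count. It also shows two alternatives: an exact match with -x, and a tally of every distinct value with sort and uniq -c.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the prediction file; the value 10 is added on purpose.
printf 'Id,Prediction\n17000,0\n17001,0\n17002,1\n17003,10\n' > predictions.csv

# grep -c '0' counts every line containing a 0 anywhere, so "10" is included.
loose=$(cut -d',' -f2 predictions.csv | grep -c '0')

# -x only accepts lines that match the pattern in full.
exact=$(cut -d',' -f2 predictions.csv | grep -cx '0')

# tail -n +2 drops the header; sort | uniq -c tallies every distinct value.
tally=$(tail -n +2 predictions.csv | cut -d',' -f2 | sort | uniq -c | awk '{print $2 "=" $1}')

printf 'loose=%s exact=%s\n%s\n' "$loose" "$exact" "$tally"
```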

“Git is a distributed revision control and source code management system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.”

GitHub is a web-based hosting service for software development projects that use the Git revision control system.

➜  ~ mkdir temp_git
➜  ~ cd temp_git
➜  temp_git cat > file1
This is in file1
➜  temp_git cat > file2
This is in file2
➜  temp_git git init
Initialized empty Git repository in /Users/dawi/temp_git/.git/
➜  temp_git git:(master) ✗ git add file1 file2
➜  temp_git git:(master) ✗ git commit -m "first commit"
 2 files changed, 2 insertions(+)
 create mode 100644 file1
 create mode 100644 file2
➜  temp_git git:(master) git status
On branch master
nothing to commit, working directory clean
➜  temp_git git:(master) git remote add origin https://github.com/utdiscant/ctfds.git
➜  temp_git git:(master) git push -u origin master
Username for 'https://github.com': utdiscant
Password for 'https://utdiscant@github.com':
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 284 bytes | 0 bytes/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/utdiscant/ctfds.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.

➜ n-62-14-11(dawi) $ git clone https://github.com/utdiscant/ctfds.git
Initialized empty Git repository in /zhome/5b/c/51358/git_temp/ctfds/.git/
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0)
Unpacking objects: 100% (4/4), done.
➜ n-62-14-11(dawi) $ ls
ctfds
➜ n-62-14-11(dawi) $ cd ctfds/
➜ n-62-14-11(dawi) $ ls
file1  file2
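
The same init, add, commit, push and clone cycle can be rehearsed without a GitHub account. In the sketch below, a bare repository on local disk stands in for the GitHub remote. That substitution, the example identity, and the directory names are all assumptions made for illustration, and the branch name is read from git rather than assumed to be master.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A bare repository on disk plays the role of the GitHub remote (assumption).
git init --quiet --bare remote.git

mkdir temp_git
cd temp_git || exit 1
git init --quiet
echo 'This is in file1' > file1
echo 'This is in file2' > file2
git add file1 file2

# The identity is passed inline so the sketch does not depend on global config.
git -c user.name='Example User' -c user.email='user@example.com' \
    commit --quiet -m 'first commit'

# Ask git for the branch name instead of assuming master or main.
branch=$(git symbolic-ref --short HEAD)
git remote add origin "$workdir/remote.git"
git push --quiet -u origin "$branch"

# Clone into a second directory, as on the other machine in the session above.
cd "$workdir" || exit 1
git clone --quiet remote.git ctfds
cloned_files=$(ls ctfds)
echo "$cloned_files"
```

Against a real GitHub repository, only the URL given to git remote add and git clone would change, and the push would prompt for credentials as in the session above.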

Free tier

As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:

• 750 hours of EC2 running Linux, RHEL, or SLES t2.micro instance usage

• 750 hours of Elastic Load Balancing plus 15 GB data processing

• 30 GB of Amazon Elastic Block Storage in any combination of General Purpose (SSD) or Magnetic, plus 2 million I/Os (with Magnetic) and 1 GB of snapshot storage

• 15 GB of bandwidth out aggregated across all AWS services

• 1 GB of Regional Data Transfer

Paid

http://aws.amazon.com/ec2/pricing/

Recommended