Computational Tools for Data Science
Week 2: The UNIX Shell & Version Control & Amazon EC2
Evaluations for week 1
• The lecture was too long. Break it up with exercises or have students read beforehand.
• Focus on the most important aspects and explain them in more depth.
• Lectures on syntax do not work.
• The slides for week 1 were nowhere to be found.
• Clear info about which exercises the students are expected to finish. This info should be on the webpage.

I expect you to finish all exercises, unless otherwise stated. For the weeks where I have not created the material, I will give further instructions.
• A clearer description of the requirements for passing the course. What actually happens at the final presentation, and does the work in lectures/exercises count?
• Is the final presentation a project, and can we start working on it earlier?

The final presentation is a presentation. It does not have to be a large project. You can start working now if you want.

Your work during the lectures/exercises does not count towards passing or failing the course.
Ideas for final presentation
‣ Awk
‣ Julia
‣ Vowpal Wabbit
‣ BigTable
‣ Cool library for Python
‣ Matlab tricks
‣ Apache Mahout
‣ A cool deep learning library
‣ Hashing Tricks
‣ Locality-sensitive hashing
‣ Feature hashing
Which commands?
Standard tools: top, screen, chmod, diff, find, which, apt-get, ssh, wget, curl
Working with files: head, tail, more, less, cat, grep, sed, cut, sort, uniq, awk
Editors: nano, vi, emacs
SSH
~ ssh [email protected]
The authenticity of host 'hald.gbar.dtu.dk (192.38.95.41)' can't be established.
RSA key fingerprint is 78:74:43:13:9d:23:02:95:78:18:48:24:47:cf:6d:05.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hald.gbar.dtu.dk,192.38.95.41' (RSA) to the list of known hosts.
Password:
~gray1(dawi) $ pwd
/zhome/5b/c/51358
Secure Shell (SSH) is a cryptographic network protocol for secure data communication, remote command-line login, remote
command execution, and other secure network services between two networked computers.
We could do a whole lecture on RSA-keys for SSH.
Pipes and redirections
>    Redirect output from a command to a file on disk.
>>   Append output from a command to a file on disk.
<    Read a command’s input from a file on disk.
|    Pass the output of one command to another for further processing.
tee  Redirect output to a file and pass it to further commands.
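The operators above can be tried together in one short script. This is a minimal sketch, not part of the original slides: the file names fruits.txt and sorted.txt and their contents are made up for illustration, and everything runs in a temporary directory.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# '>' creates (or overwrites) a file with the command's output.
printf 'banana\napple\n' > fruits.txt

# '>>' appends to the same file instead of overwriting it.
printf 'cherry\n' >> fruits.txt

# '<' feeds the file to the command's standard input.
line_count=$(wc -l < fruits.txt | tr -d ' ')

# '|' passes output along; tee saves a copy in sorted.txt on the way.
sorted_count=$(sort fruits.txt | tee sorted.txt | wc -l | tr -d ' ')
first_sorted=$(head -n 1 sorted.txt)

echo "lines in fruits.txt: $line_count"
echo "lines passed through tee: $sorted_count"
echo "first line of sorted.txt: $first_sorted"
```

Here tee is what lets the sorted output end up both in a file and in the next command of the pipeline.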
CAT
~ cat > temp1
This will go in temp1
~ cat > temp2
This will go in temp2
~ cat temp1
This will go in temp1
~ cat temp1 temp2
This will go in temp1
This will go in temp2
~ cat temp1 temp2 > temp3
~ cat temp3
This will go in temp1
This will go in temp2
The cat program is a utility that will output the contents of a specific file and can be used to concatenate and list files.
~ cat temp3
This will go in temp1
This will go in temp2
~ cat >> temp3
This will also go in temp3
~ cat temp3
This will go in temp1
This will go in temp2
This will also go in temp3
~ cat -n temp3
     1  This will go in temp1
     2  This will go in temp2
     3  This will also go in temp3
CUT
~ cat > temp
This is a temp file
1234567890
With some content
~ cut -c4 temp
s
4
h
~ cut -c4,6 temp
si
46
hs
~ cut -c4-6 temp
s i
456
h s
~ cut -c-6 temp
This i
123456
With s
~ cut -d' ' -f2 temp
is
1234567890
some
~ cut -d' ' -f2,3 temp
is a
1234567890
some content
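In the session above, the line 1234567890 is printed unchanged by cut -d' ' -f2 because it contains no delimiter. The sketch below shows that behaviour next to the -s option, which drops such lines. The file people.csv and its contents are hypothetical sample data.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: two comma-separated lines and one line without commas.
printf 'id,name,score\n1,Emil,33\nno commas here\n' > people.csv

# -d sets the delimiter and -f picks fields; the line without a comma passes through as-is.
names=$(cut -d',' -f2 people.csv)

# -s suppresses lines that do not contain the delimiter at all.
names_strict=$(cut -s -d',' -f2 people.csv)

# -c selects by character position instead of by field.
first_two_chars=$(cut -c1-2 people.csv)

echo "$names"
echo "---"
echo "$names_strict"
echo "---"
echo "$first_two_chars"
```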
SORT
~ cat > temp
1001 Søren
33 Emil
25 Helge
1001 David
~ sort temp
1001 David
1001 Søren
25 Helge
33 Emil
~ sort -n temp
25 Helge
33 Emil
1001 David
1001 Søren
~ sort -k2 temp
1001 David
33 Emil
25 Helge
1001 Søren
~ sort -k2 -r temp
1001 Søren
25 Helge
33 Emil
1001 David
~ sort -k1,1 -k2,2 temp
1001 David
1001 Søren
25 Helge
33 Emil
~ sort -k1,1 -k2,2r temp
1001 Søren
1001 David
25 Helge
33 Emil
~ cat > temp2
David,1
Helge,10
Søren,5
~ sort -t',' -nk2 temp2
David,1
Søren,5
Helge,10
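The difference between comparing a key as text and as a number can be seen side by side in the following sketch. The file scores.csv is made-up sample data (ASCII names only), and LC_ALL=C is set only to make the ordering reproducible across machines.

```shell
#!/bin/sh
LC_ALL=C
export LC_ALL
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: name,score
printf 'Helge,10\nDavid,1\nSoren,5\n' > scores.csv

# Field 2 compared as text: "10" sorts before "5".
lexical=$(sort -t',' -k2,2 scores.csv)

# Field 2 compared as a number (n modifier on the key).
numeric=$(sort -t',' -k2,2n scores.csv)

# The last line of the numeric sort holds the highest score.
largest=$(sort -t',' -k2,2n scores.csv | tail -n 1)

echo "$lexical"
echo "---"
echo "$numeric"
echo "---"
echo "$largest"
```

Writing the key as -k2,2 rather than -k2 limits the comparison to the second field instead of everything from field 2 to the end of the line.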
UNIQ
~ cat > temp
1
1
2
3
2
~ uniq temp
1
2
3
2
~ sort -n temp | uniq
1
2
3
~ sort -n temp | uniq -c
      2 1
      2 2
      1 3
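Because uniq only collapses adjacent duplicates, it is usually paired with sort. A common extension of the pipeline above is a frequency table ordered by count. This is a sketch with made-up data in values.txt, not something taken from the slides.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data with both adjacent and non-adjacent repeats.
printf '2\n2\n1\n3\n2\n1\n' > values.txt

# Without sorting, only the adjacent pair of 2s is collapsed.
lines_after_uniq=$(uniq values.txt | wc -l | tr -d ' ')

# Sorting first brings all duplicates together.
distinct_values=$(sort -n values.txt | uniq | wc -l | tr -d ' ')

# Count each value, then order by count (highest first) and keep the top value.
most_common=$(sort -n values.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "lines after plain uniq: $lines_after_uniq"
echo "distinct values: $distinct_values"
echo "most common value: $most_common"
```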
GREP
~ cat > temp
big
bad bug
bag
bigger
boogy
~ grep big temp
big
bigger
~ grep b.g temp
big
bad bug
bag
bigger
~ grep "b.*g" temp
big
bad bug
bag
bigger
boogy
~ grep -w big temp
big
~ cat > temp1
this is a line in the first file
~ cat > temp2
this is a line in the second file
~ grep this temp*
temp1:this is a line in the first file
temp2:this is a line in the second file

Display N lines after match
grep -A <N> "string" FILENAME
Display N lines before match
grep -B <N> "string" FILENAME
Search recursively in folders
grep -r "string" *
Invert match
grep -v "string" FILENAME
Count number of matches
grep -c "string" FILENAME
Display only the file names
grep -l "string" FILENAME
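Several of the options listed above can be tried on a small log-like file. The sketch below uses made-up files log.txt and other.txt. Note that -c counts matching lines, not individual occurrences within a line.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample files.
printf 'error: disk full\ninfo: started\nerror: timeout\ninfo: done\n' > log.txt
printf 'nothing to see here\n' > other.txt

# -c counts the lines that match.
error_count=$(grep -c 'error' log.txt)

# -v keeps only the lines that do NOT match.
non_errors=$(grep -v 'error' log.txt)

# -l prints just the names of files containing a match.
files_with_errors=$(grep -l 'error' log.txt other.txt)

# -A 1 adds one line of context after each match, -B 1 one line before.
after_disk=$(grep -A 1 'disk' log.txt)
before_timeout=$(grep -B 1 'timeout' log.txt)

echo "error lines: $error_count"
echo "files with errors: $files_with_errors"
echo "$after_disk"
echo "$before_timeout"
```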
SED
~ cat > temp
one two three, one two three
four three two one
one hundred
~ sed 's/one/ONE/' temp
ONE two three, one two three
four three two ONE
ONE hundred
~ sed 's_one_ONE_' temp
ONE two three, one two three
four three two ONE
ONE hundred
~ sed 's/[a-z]*/LOL/' temp
LOL two three, one two three
LOL three two one
LOL hundred
~ sed 's/[a-z]*/(&)/' temp
(one) two three, one two three
(four) three two one
(one) hundred
~ sed 's/[a-z]*/(&)/g' temp
(one) (two) (three),() (one) (two) (three) ()
(four) (three) (two) (one)
(one) (hundred)

The pattern [a-z]* also matches the empty string, which is where the empty "()" after the comma comes from. The final "()" on the first line suggests that this line ends with a trailing space in temp.
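The substitutions above can be repeated non-interactively. The sketch below also uses [a-z][a-z]* instead of [a-z]*, an assumption about what is usually intended, so that at least one letter is required and no empty "()" appears. The file numbers.txt is made-up sample data.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data (no trailing space).
printf 'one two three, one two three\n' > numbers.txt

# Without a flag, only the first match on each line is replaced.
first_only=$(sed 's/one/ONE/' numbers.txt)

# The g flag replaces every match on the line.
every_match=$(sed 's/one/ONE/g' numbers.txt)

# & stands for the matched text; requiring one letter avoids empty matches.
wrapped=$(sed 's/[a-z][a-z]*/(&)/g' numbers.txt)

# Another delimiter (here _) is handy when the pattern itself contains slashes.
new_path=$(printf '/usr/local/bin\n' | sed 's_/usr/local_/opt_')

echo "$first_only"
echo "$every_match"
echo "$wrapped"
echo "$new_path"
```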
Combinations
~ head temp
Id,Prediction
17000,0
17001,0
17002,1
17003,1
17004,1
17005,0
17006,0
17007,0
17008,0
~ cut -d',' -f2 temp | grep -c '0'
1983
~ cut -d',' -f2 temp | grep -c '1'
2075
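A variation on the pipeline above, sketched under the assumption that the file has a single header row: tail -n +2 skips the header, and grep -x only counts lines that are exactly 0 or 1, which matters if a column could also hold values such as 10. The file predictions.csv is a made-up five-row stand-in for the real data.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical miniature version of the prediction file.
printf 'Id,Prediction\n17000,0\n17001,0\n17002,1\n17003,1\n17004,1\n' > predictions.csv

# Skip the header, keep the second column, and count exact matches per class.
zeros=$(tail -n +2 predictions.csv | cut -d',' -f2 | grep -c -x '0')
ones=$(tail -n +2 predictions.csv | cut -d',' -f2 | grep -c -x '1')

# The same tally in one pass: sort groups equal values, uniq -c counts them.
tally=$(tail -n +2 predictions.csv | cut -d',' -f2 | sort | uniq -c)

echo "zeros: $zeros, ones: $ones"
echo "$tally"
```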
“Git is a distributed revision control
and source code management
system with an emphasis on speed,
data integrity, and support for
distributed, non-linear workflows.”
GitHub is a web-based hosting service for software development projects that use the Git revision control system.
➜ ~ mkdir temp_git
➜ ~ cd temp_git
➜ temp_git cat > file1
This is in file1
➜ temp_git cat > file2
This is in file2
➜ temp_git git init
Initialized empty Git repository in /Users/dawi/temp_git/.git/
➜ temp_git git:(master) ✗ git add file1 file2
➜ temp_git git:(master) ✗ git commit -m "first commit"
 2 files changed, 2 insertions(+)
 create mode 100644 file1
 create mode 100644 file2
➜ temp_git git:(master) git status
On branch master
nothing to commit, working directory clean
➜ temp_git git:(master) git remote add origin https://github.com/utdiscant/ctfds.git
➜ temp_git git:(master) git push -u origin master
Username for 'https://github.com': utdiscant
Password for 'https://[email protected]':
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 284 bytes | 0 bytes/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/utdiscant/ctfds.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.
➜ n-62-14-11(dawi) $ git clone https://github.com/utdiscant/ctfds.git
Initialized empty Git repository in /zhome/5b/c/51358/git_temp/ctfds/.git/
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0)
Unpacking objects: 100% (4/4), done.
➜ n-62-14-11(dawi) $ ls
ctfds
➜ n-62-14-11(dawi) $ cd ctfds/
➜ n-62-14-11(dawi) $ ls
file1 file2
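The same init, add, commit, push, clone cycle can be rehearsed without a GitHub account. In this sketch a bare repository in a local directory stands in for the GitHub remote. The directory names (remote.git, work, copy) and the user name and e-mail are placeholders, and it assumes git is installed.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A bare repository in a local directory plays the role of the GitHub remote.
git init -q --bare remote.git

# Create a working repository and set a placeholder identity for this repo only.
git init -q work
cd work || exit 1
git config user.name "Example User"
git config user.email "user@example.com"

# Same steps as in the session above: create a file, stage it, commit it.
echo "This is in file1" > file1
git add file1
git commit -q -m "first commit"

# Register the remote and push the current branch, whatever its default name is.
git remote add origin "$workdir/remote.git"
branch=$(git rev-parse --abbrev-ref HEAD)
git push -q -u origin "$branch"

# Clone into a second directory, as done on the other machine in the session above.
cd "$workdir" || exit 1
git clone -q remote.git copy
cloned_content=$(cat copy/file1)
echo "cloned file1 contains: $cloned_content"
```

Looking up the branch name with git rev-parse avoids hard-coding master, since newer Git installations may use a different default branch name.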
Free tier
As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:
• 750 hours of EC2 running Linux, RHEL, or SLES t2.micro instance usage
• 750 hours of Elastic Load Balancing plus 15 GB data processing
• 30 GB of Amazon Elastic Block Storage in any combination of General Purpose (SSD) or Magnetic, plus 2 million I/Os (with Magnetic) and 1 GB of snapshot storage
• 15 GB of bandwidth out aggregated across all AWS services
• 1 GB of Regional Data Transfer
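A quick back-of-the-envelope check of the 750-hour figure, as a sketch: it assumes that instance hours are summed across all running instances, so two instances use up the allowance twice as fast. The variable names are made up.

```shell
#!/bin/sh
# Monthly free allowance quoted above.
free_hours=750

# The longest month has 31 days of 24 hours.
hours_in_longest_month=$((31 * 24))

# One instance running non-stop stays within the allowance.
spare_hours=$((free_hours - hours_in_longest_month))

# Two instances running non-stop exceed it (assuming hours are summed).
two_instance_hours=$((2 * hours_in_longest_month))
billable_hours=$((two_instance_hours - free_hours))

echo "one instance, full month: $hours_in_longest_month hours ($spare_hours to spare)"
echo "two instances, full month: $two_instance_hours hours ($billable_hours beyond the free tier)"
```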
Paid
http://aws.amazon.com/ec2/pricing/