2000 Copyrights, Danielle S. Lahmani

UNIX Tools G22.2245-001, Fall 2000

Danielle S. Lahmani

email: lahmani@cs.nyu.edu

Lecture 6

Overview

• Awk

• SED

AWK

• developed in 1977 at Bell Labs by Aho, Weinberger, and Kernighan.

• pattern scanning and processing language

• programmable filter for text files

AWK: programming language

• searches a set of files for patterns and performs specified actions upon lines or fields that contain instances of those patterns.

• does not alter input files.

• processes one input line at a time.

AWK: features

• convenient numeric processing, variables, general selection (based on patterns), and control flow in the actions.

• convenient way of accessing fields within lines.

AWK: usage

• Usage: awk 'program' [filename]*

awk -f cmdfile [filename]*

(enclose 'program' in single quotes so that the shell does not perform parameter substitution on it)

• program or cmdfile contains a set of statements of the form:

pattern {action}
pattern {action}
…
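The two invocation forms can be tried side by side. This is a minimal sketch; the sample data and the temporary command file are made up for illustration.

```shell
# Hypothetical sample data; any whitespace-separated text would do.
input='alpha 1
beta 2'

# Form 1: the program is given inline, in single quotes.
inline=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Form 2: the same program is read from a command file with -f.
cmdfile=$(mktemp)
printf '%s\n' '{ print $1 }' > "$cmdfile"
fromfile=$(printf '%s\n' "$input" | awk -f "$cmdfile")
rm -f "$cmdfile"

echo "$inline"
echo "$fromfile"
```

Both forms should print the same first column.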

AWK: Examples

• print the third and second columns of a table, in that order:

{ print $3, $2 }

• print all lines in which the first field is different from the previous first field:

$1 != prev { print; prev = $1 }
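A small run of both examples, assuming a made-up three-column table as input:

```shell
# Hypothetical table: three whitespace-separated columns.
table='a x 1
a y 2
b z 3'

# Third and second columns, in that order (the comma inserts a space).
cols=$(printf '%s\n' "$table" | awk '{ print $3, $2 }')

# Lines whose first field differs from the previous first field.
firsts=$(printf '%s\n' "$table" | awk '$1 != prev { print; prev = $1 }')

echo "$cols"
echo "$firsts"
```

Note that prev is never initialized: it starts as the null string, so the first line always prints.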

AWK: patterns

• a pattern is a selector that determines whether the action is to be executed.

• a pattern can be:

– the special token BEGIN or END
– regular expressions
– arithmetic relational expressions
– string-valued expressions
– an arbitrary combination of the above
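A sketch of three of these pattern kinds on made-up data (no action is given, so matching lines are simply printed):

```shell
# Hypothetical data: a name and a count per line.
data='apple 3
banana 12
cherry 7'

by_regex=$(printf '%s\n' "$data" | awk '/an/')                 # regular expression
by_number=$(printf '%s\n' "$data" | awk '$2 > 5')              # arithmetic relation
combined=$(printf '%s\n' "$data" | awk '$2 > 5 && $1 ~ /^c/')  # combination

echo "$by_regex"
echo "$by_number"
echo "$combined"
```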

BEGIN and END patterns

• BEGIN and END provide a way to gain control before and after processing, for initialization and wrap-up.

• BEGIN: actions are performed before the first input line is read.

• END: actions are done after the last input line has been processed.
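For instance, a sum over a column might use BEGIN for a header and END for the total. A minimal sketch with made-up input:

```shell
total=$(printf '%s\n' 10 20 30 | awk '
BEGIN { print "start" }      # runs before any input is read
      { sum += $1 }          # runs for every input line
END   { print "sum:", sum }  # runs after the last line
')

echo "$total"
```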

AWK: actions

• an action may include a list of one or more C-like statements, as well as arithmetic and string expressions, assignments, and multiple output streams.

• the action is performed on every line that matches the pattern.

• if no pattern is provided, the action is performed on every input line.

AWK: actions (continued)

If action is not provided, all matching lines are sent to standard output.

Since patterns and actions are both optional, actions must be enclosed in braces to distinguish them from patterns.
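The two defaults can be seen with a pattern-only and an action-only program. A sketch on made-up input:

```shell
lines='keep one
drop two
keep three'

# Pattern without action: matching lines go to standard output.
pattern_only=$(printf '%s\n' "$lines" | awk '/keep/')

# Action without pattern: the action runs on every input line.
action_only=$(printf '%s\n' "$lines" | awk '{ print $2 }')

echo "$pattern_only"
echo "$action_only"
```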

AWK: RECORDS

• newline: default record separator

• So, by default, AWK processes its input a line at a time.

• NR is the variable whose value is the number of the current record.

• RS: record separator

AWK: FIELDS

• Each input line is split into fields.

• FS: field separator; the default is blanks or tabs

• -Fc option sets FS to the character c

• $0 is the entire line

• $1 is the first field, $2 is the second field, …, $NF is the last field

• NF is a built-in variable whose value is set to the number of fields.

• Only fields begin with $; variables are unadorned
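A quick illustration of -F, NF and $NF, assuming a colon-separated record in the style of /etc/passwd (the record itself is made up):

```shell
record='root:x:0:0:root:/root:/bin/sh'

# -F: sets FS to ":"; NF is the field count, $NF the last field.
fields=$(printf '%s\n' "$record" | awk -F: '{ print NF, $1, $NF }')

echo "$fields"
```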

Printing:

• print and printf (for formatted output)

• the following prints the first two fields in reverse order:

print $2, $1

• the following numbers all the lines:

$ awk '{ print NR, $0 }'

• output may be diverted to multiple files (maximum 10 output files):

{ print $1 > "foo1" ; print $2 > "foo2" }
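A sketch combining printf with output diverted to two files. The temporary directory, the file names (names, scores, report) and the data are assumptions made for the example.

```shell
workdir=$(mktemp -d)

printf '%s\n' 'ann 3.5' 'bob 12' |
  awk -v dir="$workdir" '{
    printf "%-5s %6.2f\n", $1, $2     # formatted output to stdout
    print $1 > (dir "/names")         # diverted to one file
    print $2 > (dir "/scores")        # diverted to another
  }' > "$workdir/report"

report=$(cat "$workdir/report")
names=$(cat "$workdir/names")
scores=$(cat "$workdir/scores")
rm -rf "$workdir"

echo "$report"
```

The file name after > is parenthesized because it is built by concatenation.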

Built-in functions include:

• "length" function to compute length of a string

{ print length, $0}

• substr(s, m, n) produces the substring of s that begins at position m and is at most n characters long.
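Both functions on a made-up line (positions in substr are counted from 1):

```shell
# length with no argument is the length of $0; substr takes 3 characters from position 7.
out=$(printf '%s\n' 'hello world' | awk '{ print length, substr($0, 7, 3) }')

echo "$out"
```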

Arithmetic and variables

• AWK variables take on numeric (floating point) or string values according to context.

• User-defined variables are unadorned; they need not be declared.

• By default, user-defined variables are initialized to the null string, which has numerical value zero.
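A sketch of these rules; the variable names n, sum and never_set are arbitrary choices for the example.

```shell
result=$(printf '%s\n' 4 6 | awk '
{ n = n + 1; sum += $1 }     # n and sum start as the null string, numerically 0
END { print n, sum, sum / n, "[" never_set "]" }
')

echo "$result"
```

The same uninitialized variable acts as 0 in arithmetic and as an empty string in concatenation.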

Flow of control statements:

• Supports most of the standard control structures of C

• This program looks for pairs of identical adjacent words

NF > 0 {
    # first word repeats the last word of the previous non-empty line
    if ($1 == lastword)
        print "double:", $1, "Line:", NR
    # adjacent identical words within the line
    for (i = 2; i <= NF; i++) {
        if ($i == $(i-1))
            { print "double:", $i, "Line:", NR }
    }
    lastword = $NF
}

Arrays and associative arrays

• Array elements are not declared.

• Subscripts may have any non-null value, including non-numeric strings
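For example, counting occurrences with a string-subscripted array. A sketch on made-up input, where count is an arbitrary array name:

```shell
counts=$(printf '%s\n' 'red' 'blue' 'red' | awk '
{ count[$1]++ }              # the subscript is the word itself
END { print "red", count["red"]; print "blue", count["blue"] }
')

echo "$counts"
```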

SED: Stream-oriented, Non-Interactive, Text Editor

• Typical Usage:

– edit files too large for interactive editing
– edit files of any size where the editing sequence is too complicated to type in interactive mode
– perform “multiple global” editing functions efficiently in one pass through the input
– edit multiple files automatically
– good tool for writing conversion programs

SED Usage

• sed ‘list of ed commands’ filenames….

• reads one line at a time from the input file

• applies the commands from list in order to each line

• writes its edited form on standard output

SED Usage

• sed [-n] -e 'command' [file]*

• sed [-n] -f scriptfile [file]*

• -n suppresses default output (except for lines specified with the p command, or the p flag of the s (substitute) command).
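The effect of -n is easiest to see by running the same command with and without it. A sketch on made-up input:

```shell
text='one
two
three'

default=$(printf '%s\n' "$text" | sed -e '/two/p')      # "two" appears twice
quiet=$(printf '%s\n' "$text" | sed -n -e '/two/p')     # only "two"
flagged=$(printf '%s\n' "$text" | sed -n -e 's/t/T/p')  # only lines where s succeeded

echo "$default"
echo "$quiet"
echo "$flagged"
```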

SED: Overall Operation

References: Unix In a Nutshell (O'Reilly)

• input file is unchanged

• processes one line at a time

• copies standard input to standard output, perhaps performing one or more editing commands on each input line

SED: pattern and hold spaces

• pattern space: workspace or temporary buffer where a single line of input (with N command, multi-line) is held while the editing commands are applied

• hold space: secondary temporary buffer for temporary storage only (see discussion later)

SED: conceptual overview

• Each line of input is copied into a pattern space (range of pattern matches).

• Before any editing is done, all editing commands are compiled into a form that is more efficient during the execution phase.

All editing commands in a sed script are applied in order to each input line.

SED: conceptual overview (cont’)

• If a command changes the input, the addresses of subsequent commands are applied to the current line in the pattern space, not the original input line.

• The original input file is unchanged (editing commands modify a copy of the input file).

• The copy is sent to standard output (but can be redirected to a file).

• Editing commands are applied to all lines (globally) unless line addressing restricts the lines affected.
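A sketch showing that later commands see the edited pattern space rather than the original input line (the input word is made up):

```shell
# The second s sees "dog", the result of the first s.
chained=$(printf '%s\n' 'cat' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# The address /cat/ is tested against "dog", so the second command is skipped.
addressed=$(printf '%s\n' 'cat' | sed -e 's/cat/dog/' -e '/cat/s/$/!/')

echo "$chained"
echo "$addressed"
```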

SED: GENERAL FORMAT OF AN EDITING COMMAND

• [address1[,address2]] function [arguments]

• addresses select lines for editing by:

– line numbers (decimal integers)
– context addresses (using regular expressions)

SED: REGULAR EXPRESSIONS

c: ordinary character, matches that character

^ matches the beginning of the line

$ matches the end of the line

'\n' matches an embedded newline character, but not the newline at the end of the pattern space.

. (period) matches any single character, but not newline

r* matches any number (zero or more) of occurrences of the regular expression r preceding it.

SED: Regular Expressions (cont’)

[…] matches any character in the …

[^…] matches any character not in …

r1r2 matches the concatenation of r1 and r2 (a match for r1 followed by a match for r2)

\(..\) is a tagged regular expression

'\d' means the same string of characters matched by an expression enclosed in '\(' and '\)' earlier in the same pattern; d is a single digit

// null regular expression is equivalent to the last regular expression compiled.
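A sketch of tagged expressions, back-references and the null regular expression on made-up input:

```shell
# \1 inside the pattern: "ab" immediately followed by the same string again.
doubled=$(printf '%s\n' 'abab' 'abcd' | sed -n '/\(ab\)\1/p')

# Two tagged expressions, reused in the replacement in swapped order.
swapped=$(printf '%s\n' 'key=value' | sed 's/\(.*\)=\(.*\)/\2=\1/')

# The empty pattern in s//X/ reuses the last regular expression, /ab/.
reused=$(printf '%s\n' 'abab' | sed '/ab/s//X/')

echo "$doubled"
echo "$swapped"
echo "$reused"
```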

Sed: examples

• $ print last line of last input file

• 1 print first line of first input file

• /pattern/ print lines containing pattern
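These addresses could be combined with -n and the p command as follows. A sketch on a made-up three-line input:

```shell
text='first
middle
last'

last_line=$(printf '%s\n' "$text" | sed -n '$p')     # $ addresses the last line
first_line=$(printf '%s\n' "$text" | sed -n '1p')    # 1 addresses the first line
matching=$(printf '%s\n' "$text" | sed -n '/st/p')   # lines containing "st"

echo "$last_line"
echo "$first_line"
echo "$matching"
```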

Sed: pattern addressing

If the command has             then the command is applied to

• No address                   each input line

• One address                  all lines that match the address. Some commands accept only one address: a, i, r, q and =

• Two comma-separated          the first matching line and all succeeding lines up to and
  addresses                    including a line matching the second address

• address followed by !        all lines that do not match the address
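The address forms in the table can be sketched on a made-up four-line input:

```shell
text='a
b
c
d'

range=$(printf '%s\n' "$text" | sed -n '/b/,/c/p')     # two context addresses
negated=$(printf '%s\n' "$text" | sed -n '/b/,/c/!p')  # everything outside the range
by_number=$(printf '%s\n' "$text" | sed -n '2,3p')     # two line-number addresses

echo "$range"
echo "$negated"
echo "$by_number"
```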

SED: number of addresses (cont')

• Braces {} are used to apply multiple commands to one address or address pair

[/pattern1/][,/pattern2/] {
command1
command2
}

(give examples)
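One possible example of grouping, assuming made-up BEGIN and END marker lines in the input:

```shell
text='x
BEGIN
x
END
x'

# Both commands apply only to lines in the range /BEGIN/,/END/.
grouped=$(printf '%s\n' "$text" | sed -n '/BEGIN/,/END/{
s/x/inside/
p
}')

echo "$grouped"
```

The x lines outside the range are neither substituted nor printed.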

SED: Whole line oriented functions

• DELETE d

• APPEND a

• CHANGE c

• SUBSTITUTE s

• INSERT i

• NEXT n

SED: Whole line oriented functions

• DELETE:

[address1][,address2]d

• deletes the addressed line(s) from the pattern space; the line(s) are not passed to standard output.

• A new line of input is read and editing resumes with the first command of the script.
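A sketch of d ending the current cycle: the second command never sees a deleted line. The input lines are made up.

```shell
kept=$(printf '%s\n' 'keep' '# comment' 'also keep' |
  sed -e '/^#/d' -e 's/$/ (seen)/')

echo "$kept"
```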

SED: whole line functions:

• APPEND

[address]a\
<text>

• appends text after each line matched by address

• text is not available in the pattern space

• subsequent commands cannot be applied to it (no change in line-number counter)

SED: whole line functions

• INSERT:

[address]i\
<text>

• insert text before each line matched by address.

• Same as function a for text treatment.

SED: Whole line functions (cont')

• CHANGE:

[address1][,address2]c\
<text>

• replaces the lines selected by the address with text.

• Contents of the pattern space are deleted; no subsequent editing can be applied to it or to <text>.
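A sketch of i, a and c in one script, followed by a substitution to show that the added text is not edited further. The input and the added text are made up.

```shell
text='one
two
three'

edited=$(printf '%s\n' "$text" | sed '/two/i\
before two
/two/a\
after two
/three/c\
THREE
s/o/0/')

echo "$edited"
```

The final s changes the "o" in the original lines one and two, but leaves "before two" and "after two" alone, and is never applied to the changed line.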

SED: Whole line functions

• n reads the next input line into the pattern space, replacing the current line.

• Current line is written to output if it should be.

• Control passes to the command following n instead of resuming at the top of the script.
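Two small sketches of n on made-up input:

```shell
# With -n nothing is auto-printed; n skips to the next line, p prints it.
evens=$(printf '%s\n' 1 2 3 4 | sed -n 'n;p')

# Without -n, n first writes the current line, then the s applies to the next one.
after=$(printf '%s\n' 'title' 'body' | sed '/title/{
n
s/^/> /
}')

echo "$evens"
echo "$after"
```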

SED:s: Substitute function

• [address]s/<pattern>/<replacement>/<flags>

• substitute replacement for pattern on each addressed line.

• [address] can be 0, 1, or 2 addresses.

SED:s: substitute command

• <flags> that modify the substitution can be:

n: a number (1 to 512); replace only the nth occurrence of pattern.

g: replace all instances of <pattern> on each addressed line, not just the first instance.

p: print the pattern space if successful replacement was done

w file: write pattern space to file if a successful replacement was done. A maximum of 10 different files can be opened.
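The flags can be compared on a single made-up line:

```shell
line='a-a-a'

first=$(printf '%s\n' "$line" | sed 's/a/X/')       # no flag: first occurrence only
second=$(printf '%s\n' "$line" | sed 's/a/X/2')     # numeric flag: second occurrence
all=$(printf '%s\n' "$line" | sed 's/a/X/g')        # g: every occurrence
printed=$(printf '%s\n' "$line" 'b-b' | sed -n 's/a/X/p')  # p: print only on success

echo "$first $second $all $printed"
```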

SED: SUBSTITUTE FUNCTION (cont')

• <replacement> is a string of characters; it may contain special metacharacters:

& is replaced by the string matched by <pattern>

\d is replaced by the dth substring (d is a single digit) matched by the part of <pattern> enclosed by '\(' and '\)'.

• (give examples here)
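Two possible examples, one for each metacharacter (the input strings are made up):

```shell
# & stands for the whole match.
quoted=$(printf '%s\n' 'hello world' | sed 's/hello/[&]/')

# \1 and \2 stand for the first and second tagged substrings.
reordered=$(printf '%s\n' 'Lahmani, Danielle' | sed 's/\(.*\), \(.*\)/\2 \1/')

echo "$quoted"
echo "$reordered"
```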

SED: Input-output functions

• p print

• w <filename> write the addressed lines (the pattern space) to filename

• r <filename> read the contents of filename and copy them to the output after the addressed line

• q quit the sed script (no further input is processed)

SED Line information

= display the line number of a line

l display the line, showing control characters in ASCII

p display the line

Flow of control functions

• ! don't

• { grouping

• b <label> branch to label, or to the end of the script if no label is given

• t <label> same as b, but branch only after a successful substitution

• : label place a label branched to by t or b
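A sketch of a loop built from a label and t: innermost parenthesized groups are removed repeatedly until no substitution succeeds. The label name again and the input are arbitrary.

```shell
flat=$(printf '%s\n' 'a(b(c)d)e' |
  sed -e ':again' -e 's/([^()]*)//' -e 't again')

echo "$flat"
```

A single s with the g flag would not be enough here, because removing "(c)" is what makes "(bd)" match on the next pass.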

Sed Drawbacks

(references: The Unix Programming Environment, Kernighan & Pike)

• hard to remember text from one line to another

• not possible to go backward in the file

• no way to do forward references like /…./+1

• no facilities to manipulate numbers

SED: Multiple input-output functions

• Functions spelled in capital letters deal with pattern spaces containing embedded newlines, to provide pattern matches across lines in the input.

• N next input line is appended to the current line in the pattern space. (create embedded newline)

• D delete first part of the pattern space up to embedded newline

• P print first part of the pattern space up to embedded newline
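A sketch of N and P on made-up input. The $! address keeps N from running on the last line, where there is no next line to append.

```shell
# Join each pair of lines: N appends the next line, s replaces the embedded newline.
joined=$(printf '%s\n' 1 2 3 4 | sed '$!N;s/\n/ /')

# P prints only the part of the pattern space before the embedded newline.
first_of_pair=$(printf '%s\n' 1 2 3 4 | sed -n '$!N;P')

echo "$joined"
echo "$first_of_pair"
```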

Hold and get Functions

• h hold pattern space:

– copies the contents of the pattern space into the hold area (wiping out the previous contents of the hold area)

• H hold pattern space:

– appends the contents of the pattern space to what is already in the hold area

Hold and Get Functions (cont’)

• g get contents of hold area:

– copies the contents of the hold space into the pattern space; destroys the previous contents of the pattern space.

• G get contents of hold area:

– appends the contents of the hold area to the contents of the pattern space; former and new contents are separated by a newline

• x exchange contents of hold space and pattern space
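Two sketches that use the hold space to remember text from one line to the next (the input is made up):

```shell
# Reverse the lines: G appends the lines saved so far, h saves the result.
reversed=$(printf '%s\n' a b c | sed -n '1!G;h;$p')

# Collect every line with H, then on the last line swap it in with x and join.
joined=$(printf '%s\n' a b c | sed -n 'H;${x;s/\n/,/g;s/^,//;p;}')

echo "$reversed"
echo "$joined"
```

In the second command the hold space starts empty, so H leaves a leading newline, which the second s removes after it has become a comma.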