TEXT PROCESSING UTILITIES

THE cat COMMAND

$ cat emp1.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m |product | 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product | 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000
3456 | anil |chman |sales | 30/02/69 |40000
6789 | lalith |mrg | mark. | 17/01/80 |60000
5678 | a | d | m | 12/12/80 |12000
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.

DISPLAYING THE BEGINNING OF A FILE – THE head COMMAND

The head command, as the name implies, displays the top lines of a file. When used without an option, it displays the first ten records of the argument file.

$ head emp.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m |product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000
3456 | anil |chman |sales | 30/02/69 |40000
6789 | lalith |mrg | mark. | 17/01/80 |60000
5678 | a | d | m | 12/12/80 |12000
This is the emp
database which stores

You can specify the line count and display, say, the first three lines of the file. Use the - symbol, followed by a numeric argument.

Ex: $ head -3 emp.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m |product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000

If the line count specified exceeds the number of lines actually present in the file, head displays the entire file.

You can also find the “record length” by counting the characters in the first line of the file (wc -c counts bytes, so the trailing newline is included):

$ head -1 emp.lst | wc -c
47
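The same operations can be sketched with the -n form of the line count, which is the spelling POSIX specifies for the -3 shorthand. The file demo.lst and its contents are hypothetical stand-ins for emp.lst, chosen only so that the results are easy to check by eye.

```shell
# demo.lst is a hypothetical five-line stand-in for emp.lst.
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon > "$workdir/demo.lst"

# -n 3 is the standard spelling of the -3 shorthand.
first_three=$(head -n 3 "$workdir/demo.lst")
printf '%s\n' "$first_three"

# A count larger than the file simply yields every line.
line_total=$(( $(head -n 50 "$workdir/demo.lst" | wc -l) ))

# Length of the first record in bytes: "alpha" plus the newline.
record_len=$(( $(head -n 1 "$workdir/demo.lst" | wc -c) ))
printf '%s lines, first record is %s bytes\n' "$line_total" "$record_len"

rm -rf "$workdir"
```

Wrapping wc in $(( )) strips the leading blanks that some wc implementations print before the number.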

head also works with multiple files. For each file, it indicates the filename and the lines extracted:

$ head -2 emp.lst f1.lst
==> emp.lst <==
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03 60 | 15000

==> f1.lst <==
root tty7 2009-07-25 09:56 (:0)
root pts/1 2009-07-25 09:56 (:0)

DISPLAYING THE END OF A FILE – THE tail COMMAND

The tail command displays the end of the file. It provides an additional method of addressing lines, and can also extract information in units of blocks and characters.

Like head, it displays the last ten lines when used without options.

Ex: $ tail -3 emp.lst
department
date of birth
and their salary.

$ tail emp.lst
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.

With a c suffix, the count is taken in characters rather than lines:

[itlaxmi@snist ~]$ tail -40c emp.lst
artment
date of birth
and their salary.

-v    If you use this option, tail always prints the header giving the file name.

Ex: $ tail -v emp.lst
==> emp.lst <==
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.

A disadvantage of head and tail is that neither command, used alone, can display a range of lines from the middle of a file. Moreover, what is displayed is final: if we have displayed the first 50 lines of a file, we cannot move back and view, say, the first 10 lines.

tail can also address lines from the beginning of the file instead of the end. The +count option allows you to do that, where count represents the line number from which the selection should begin.

Ex: $ tail -n +8 emp.lst
5678 | a | d | m | 12/12/80 |12000
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.
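Although neither command selects a range by itself, the +count form can feed head in a pipeline. The sketch below is one possible way to extract lines 3 to 4; the file demo.txt and its contents are hypothetical.

```shell
# demo.txt is a hypothetical six-line file.
workdir=$(mktemp -d)
printf '%s\n' one two three four five six > "$workdir/demo.txt"

# Start at line 3, then keep two lines: this yields lines 3 and 4.
middle=$(tail -n +3 "$workdir/demo.txt" | head -n 2)
printf '%s\n' "$middle"

rm -rf "$workdir"
```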

SLITTING A FILE VERTICALLY – THE cut COMMAND

While head and tail are used to slice a file horizontally, you can slice a file vertically with the cut command. cut identifies both columns and fields.

Syntax:
cut <options> <character or field list> <file(s)>

Ex: Store the first 5 lines of the file emp.lst in a file shortlist.
$ head -5 emp.lst > shortlist

$ cat shortlist
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000

cut can be used to extract specific columns from this file. Use the -c (columns) option for cutting columns:

$ cut -c5-20 shortlist
| shukla | g.m
| sharma |d.g.m
| akash |dir.
| tiwary |g.m
| kumar | mgr

Column numbers must immediately follow the option. Ranges are permitted, and commas are used to separate the column chunks.

$ cut -c2-5,10-15,40- shortlist
233 ukla || 20000
876 arma || 15000
898 ash ||9000
456 wary ||23000
234 mar ||15000

The expression 40- indicates column number 40 to the end of the line.

Tracking fields by column positions is tedious, and the file may not contain fixed-length records. You can extract specific fields using two options: -d (delimiter) to specify the field delimiter and -f (field) to specify the field list.

When you use the -f option, don’t forget to use the -d option too, unless the file has the default delimiter (the tab).

Ex: $ cut -d"|" -f2,3 shortlist | tee clist1
shukla | g.m
sharma |d.g.m
akash |dir.
tiwary |g.m
kumar | mgr

The tee command saves the output in the file clist1, and also displays it on the terminal.

$ cat clist1
shukla | g.m
sharma |d.g.m
akash |dir.
tiwary |g.m
kumar | mgr
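The field options can be tried on a small file without fixed-width padding. The following is a minimal sketch; staff.lst and its records are hypothetical. It also shows an open-ended field range, which works the same way as the open-ended column range.

```shell
# staff.lst is a hypothetical pipe-delimited file.
workdir=$(mktemp -d)
printf '%s\n' '101|asha|mgr|sales' '102|ravi|dir|mark' > "$workdir/staff.lst"

# Fields 2 and 3; cut joins the selected fields with the same delimiter.
names_and_titles=$(cut -d'|' -f2,3 "$workdir/staff.lst")
printf '%s\n' "$names_and_titles"

# An open-ended range: field 3 through the last field.
from_third=$(cut -d'|' -f3- "$workdir/staff.lst")
printf '%s\n' "$from_third"

rm -rf "$workdir"
```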

PASTING FILES – THE paste COMMAND

What you “cut” with the previous command can be pasted back with the paste command. In this respect it resembles the cat command, but while cat joins files end to end, paste joins them side by side, line by line.

$ cut -d"|" -f6 shortlist | tee clist2
20000
15000
9000
23000
15000

cut was used to create two files, clist1 and clist2, containing two cut-out portions of the same file.

$ paste clist1 clist2
shukla | g.m 20000
sharma |d.g.m 15000
akash |dir. 9000
tiwary |g.m 23000
kumar | mgr 15000

By default, paste uses the tab character for pasting files. You can specify a delimiter of your choice:

$ paste -d"#" clist1 clist2
shukla | g.m # 20000
sharma |d.g.m # 15000
akash |dir. # 9000
tiwary |g.m # 23000
kumar | mgr # 15000

When using the -d option with several files on the command line, you can specify more than one delimiter. For example:

$ paste -d" |#~" file1 file2 file3 file4 file5

The above example uses the space character for pasting file1 and file2, the | character for pasting file2 and file3, and so forth.
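A small sketch of the delimiter list in action, assuming three hypothetical two-line files named ids, names and titles. The first delimiter in the list separates the first pair of files and the second separates the next pair.

```shell
# ids, names and titles are hypothetical two-line files.
workdir=$(mktemp -d)
printf '%s\n' 1 2 > "$workdir/ids"
printf '%s\n' asha ravi > "$workdir/names"
printf '%s\n' mgr dir > "$workdir/titles"

# "|" goes between ids and names, "#" between names and titles.
joined=$(paste -d'|#' "$workdir/ids" "$workdir/names" "$workdir/titles")
printf '%s\n' "$joined"

rm -rf "$workdir"
```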

ORDERING A FILE – THE sort COMMAND

The sort command sorts the contents of a file. It can merge multiple sorted files and store the result in the specified output file. When the command is invoked without options, it sorts on the entire line:

Ex: $ sort shortlist
1234 | kumar | mgr |accnts | 18/03/79 |15000
2233 | shukla | g.m | sales | 12/12/52 | 20000
3456 | tiwary |g.m |product| 05/02/89 |23000
7898 | akash |dir. |mark. | 11/06/70 |9000
9876 | sharma |d.g.m|product| 12/03 60 | 15000

Sorting starts with the first character of each line in the file. If the first characters of two lines are the same, the second character in each line is compared, and so on.

The sorting is done according to the ASCII collating sequence. Broadly, it sorts spaces and tabs first, then the punctuation marks, followed by numbers, uppercase letters and lowercase letters, in that order. (This ordering applies in the C/POSIX locale; other locales may collate differently.)

Like cut and paste, sort also works on fields, and the default field separator is the space character. The -t option, followed immediately by the delimiter, overrides the default. This lets you sort the file on any field, for instance, the second field (name):

$ sort -t"|" -k2 shortlist
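Since no output is shown for the field sort, here is a minimal sketch on a hypothetical three-record file, people.lst. LC_ALL=C is set as an assumption to force plain ASCII ordering, so the result does not depend on the locale of the machine.

```shell
# people.lst is a hypothetical file; field 2 holds the name.
workdir=$(mktemp -d)
printf '%s\n' '1|zed|clerk' '2|amy|mgr' '3|bob|dir' > "$workdir/people.lst"

# -k2 starts the sort key at field 2 and runs to the end of the line.
by_name=$(LC_ALL=C sort -t'|' -k2 "$workdir/people.lst")
printf '%s\n' "$by_name"

rm -rf "$workdir"
```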

The sort order can be reversed with the -r (reverse) option.

Ex: $ sort -r shortlist
9876 | sharma |d.g.m|product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
2233 | shukla | g.m | sales | 12/12/52 | 20000
1234 | kumar | mgr |accnts | 18/03/79 |15000

We can sort the contents of several files in one shot, as in:

$ sort file1 file2 file3

Instead of displaying the sorted output on the screen, we can store it in a file:

$ sort -o result clist1
$ cat result
akash |dir.
kumar | mgr
sharma |d.g.m
shukla | g.m
tiwary |g.m

To check whether the file has actually been sorted, use:

$ sort -c shortlist
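sort -c reports its verdict through the exit status, so it fits naturally into an if test. The sketch below assumes two hypothetical files and a helper function named check, which is not part of sort.

```shell
# sorted.lst and unsorted.lst are hypothetical files.
workdir=$(mktemp -d)
printf '%s\n' akash kumar sharma > "$workdir/sorted.lst"
printf '%s\n' sharma akash kumar > "$workdir/unsorted.lst"

# check is a hypothetical helper: sort -c exits 0 only if the file is in order.
check() {
    if LC_ALL=C sort -c "$1" 2>/dev/null; then
        echo "sorted"
    else
        echo "not sorted"
    fi
}

first=$(check "$workdir/sorted.lst")
second=$(check "$workdir/unsorted.lst")
printf '%s\n%s\n' "$first" "$second"

rm -rf "$workdir"
```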

Sorting on secondary key:

You can sort on more than one key, i.e., you can provide a secondary key to sort. For example, if the primary key is the 3rd field and the secondary key is the 2nd field, then you need to specify, for every -k option, where the sort ends. This is done in this way:

$ sort -t"|" -k3,3 -k2,2 shortlist
9876 | sharma |d.g.m|product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
2233 | shukla | g.m | sales | 12/12/52 | 20000
3456 | tiwary |g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000

This sorts the file by designation and name. The -k3,3 option indicates that sorting starts on the 3rd field and ends on the same field.
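The effect of the secondary key is easiest to see when two records share the primary key. A minimal sketch, assuming a hypothetical file desg.txt with the name in field 1 and the designation in field 2, and LC_ALL=C to pin the ordering:

```shell
# desg.txt is a hypothetical file: name|designation.
workdir=$(mktemp -d)
printf '%s\n' 'kumar|mgr' 'tiwary|gm' 'akash|dir' 'shukla|gm' > "$workdir/desg.txt"

# Primary key: designation (field 2 only). Secondary key: name (field 1 only).
by_desg_then_name=$(LC_ALL=C sort -t'|' -k2,2 -k1,1 "$workdir/desg.txt")
printf '%s\n' "$by_desg_then_name"

rm -rf "$workdir"
```

The two gm records come out in name order, which is the work of the second -k option.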

Sorting on columns:

You can also specify a character position within a field to be the beginning of the sort. For example, to sort the file according to the year of birth, you need to sort on the 7th and 8th column positions within the 5th field:

$ sort -t"|" -k5.7,5.8 shortlist
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03 60 | 15000
1234 | kumar | mgr |accnts | 18/03/79 |15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000

Note: character positions are counted from the start of the field, including any leading space. The 5th field in this file appears to begin with a space, which would put positions 7 and 8 one character short of the two year digits; that may be why 79 is listed before 70 above. In that case -k5.8,5.9 would select the year.

Numeric sort (-n):

When sort acts on numerals, strange things can happen.

[itlaxmi@snist ~]$ cat > nfile
2
4
10
27
[itlaxmi@snist ~]$ sort nfile
10
2
27
4

This is probably not what you expected, but the ASCII collating sequence places 1 above 2, and 2 above 4. That’s why 10 preceded 2 and 27 preceded 4. This can be overridden by the -n (numeric) option.

[itlaxmi@snist ~]$ sort -n nfile
2
4
10
27
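The numeric flag can also be attached to a single key, which is useful for a salary column. This is a sketch under assumptions: pay.lst and its records are hypothetical, and LC_ALL=C is set only to make the non-numeric ordering predictable.

```shell
# pay.lst is a hypothetical file: name|salary.
workdir=$(mktemp -d)
printf '%s\n' 'shukla|20000' 'akash|9000' 'sharma|15000' > "$workdir/pay.lst"

# Character-by-character comparison: "15000" < "20000" < "9000".
as_text=$(LC_ALL=C sort -t'|' -k2,2 "$workdir/pay.lst")

# The n modifier makes field 2 compare as a number.
as_number=$(LC_ALL=C sort -t'|' -k2,2n "$workdir/pay.lst")
printf '%s\n' "$as_number"

rm -rf "$workdir"
```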

Removing Repeated Lines (-u):

The -u (unique) option lets you remove repeated lines from a file. To find out the unique designations that occur in the file, cut out the designation field and pipe it to sort:

$ cut -d"|" -f3 e.lst | sort -u | tee desg.lst
dir.
g.m
mgr

Merge sort (-m):

When sort is used with multiple filenames as arguments, it concatenates them and sorts them collectively. When large files are sorted in this way, performance often suffers. The -m (merge) option can merge two or more files that are sorted individually.

$ sort -m f1 f2 f3
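A minimal sketch of a merge, assuming two hypothetical files that are each already in order. With -m, sort only interleaves the inputs instead of sorting them from scratch.

```shell
# part1 and part2 are hypothetical files, each already sorted.
workdir=$(mktemp -d)
printf '%s\n' apple cherry > "$workdir/part1"
printf '%s\n' banana date > "$workdir/part2"

# Interleave the two sorted inputs into one sorted stream.
merged=$(LC_ALL=C sort -m "$workdir/part1" "$workdir/part2")
printf '%s\n' "$merged"

rm -rf "$workdir"
```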

sort options

Option       Description
-tchar       Uses delimiter char to identify fields
-k n         Sorts on nth field
-k m,n       Starts sort on mth field and ends sort on nth field
-k m.n       Starts sort on nth column of mth field
-u           Removes repeated lines
-n           Sorts numerically
-r           Reverses sort order
-f           Folds lowercase to equivalent uppercase (case-insensitive sort)
-m list      Merges sorted files in list
-c           Checks if the file is sorted
-o flname    Places output in file flname

THE uniq COMMAND

There is often a problem of duplicate entries creeping in due to faulty data entry. Unix offers a special tool to handle these records: the uniq command.

The command is most useful when placed in pipelines, and can be used as an SQL-type query tool (distinct).

Ex: $ cat dept.lst
01 | accounts | 6213
01 | accounts | 6213
02 | admin | 5423
03 | marketing | 6521
03 | marketing | 6521

$ uniq dept.lst
01 | accounts | 6213
02 | admin | 5423
03 | marketing | 6521

uniq simply fetches one copy of each set of redundant records, writing them to the standard output.

Since uniq detects only adjacent duplicate lines, it needs sorted input; the general procedure is to sort a file and pipe the output to uniq. The following pipeline produces the same output, except that the output is saved in a file:

$ sort dept.lst | uniq - ulist
[itlaxmi@snist d1]$ cat ulist
01 | accounts | 6213
02 | admin | 5423
03 | marketing | 6521

Like sort, uniq also accepts the output filename as an argument. Since this is done without using an option (unlike -o in sort), make sure that you don’t specify multiple filenames as input to this command; uniq uses only one input file at a time.

If we use two filenames, uniq simply processes the first file and overwrites the second with its output, so you lose the data in the second file.

If uniq is merely to select unique lines, it is preferable to use sort -u. But uniq has a couple of options which can be used to make simple database queries.

Ex: To determine the designation that occurs uniquely in the file e.lst, cut out the 3rd field, sort it, and then pipe it to uniq.

$ cat e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma | mgr |product| 12/03 60 | 15000
7898 | akash | dir. |mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |1500

The -u (unique) option selects only the non-repeated lines.

Ex: $ cut -d"|" -f3 e.lst | sort | uniq -u
dir.

The -d (duplicate) option selects only one copy of the repeated lines:

Ex: $ cut -d"|" -f3 e.lst | sort | uniq -d
g.m
mgr

And the -c (count) option displays the frequency of occurrence of all lines, along with the lines:

Ex: $ cut -d"|" -f3 e.lst | sort | uniq -c
1 dir.
2 g.m
2 mgr
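The counts from uniq -c can be fed back into sort to rank values by frequency. The sketch below is one way to do it, using a hypothetical list of designations supplied on standard input; the final sed only strips the padding that uniq puts in front of each count.

```shell
# A hypothetical list of designations, one per line.
ranked=$(printf '%s\n' mgr gm dir gm mgr gm \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -rn \
    | sed 's/^ *//')

# Most frequent designation first.
printf '%s\n' "$ranked"
```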

LINE NUMBERING – THE nl COMMAND

There is a separate command in the UNIX system that has elaborate schemes for numbering lines: the nl command.

nl numbers only logical lines, i.e. lines containing something apart from the newline character.

By default, nl simply adds line numbers to its input, and prints them in a space six characters wide:

Ex: $ nl clist1
1 shukla | g.m
2 sharma |d.g.m
3 akash |dir.
4 tiwary |g.m
5 kumar | mgr

nl uses the tab character to separate the numbers from the text. Use the -w (width) option to specify the width of the number format, and -s (separator) to specify the separator:

Ex: $ nl -w2 -s":" clist1
1: shukla | g.m
2: sharma |d.g.m
3: akash |dir.
4: tiwary |g.m
5: kumar | mgr

To have leading zeroes in the first field, use the -n option:

Ex: $ nl -w2 -s":" -nrz clist1
01: shukla | g.m
02: sharma |d.g.m
03: akash |dir.
04: tiwary |g.m
05: kumar | mgr

The -n option, followed immediately by the parameter rz, right justifies the number, with leading zeroes to fill the gaps. The other format you can use is ln, which left justifies the number and removes the leading zeroes.

In many applications, you have code tables starting from a number different from 1 (or 01 or 001). The -v option, followed by a number, determines the initial value that is to be used to number the lines. You can use the number 40 as the initial value:

Ex: $ nl -w2 -s":" -nrz -v40 clist1
40: shukla | g.m
41: sharma |d.g.m
42: akash |dir.
43: tiwary |g.m
44: kumar | mgr

You can set the increment too, with the -i (increment) option:

Ex: $ nl -w2 -s":" -nrz -v40 -i5 clist1
40: shukla | g.m
45: sharma |d.g.m
50: akash |dir.
55: tiwary |g.m
60: kumar | mgr
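The numbering options combine freely. As a minimal sketch on hypothetical input, the following numbers three names in steps of ten, zero-padded to three digits, with a two-character separator:

```shell
# Three hypothetical names on standard input.
numbered=$(printf '%s\n' shukla sharma akash \
    | nl -w3 -s'. ' -nrz -v10 -i10)

# Width 3, zero-filled, starting at 10 and stepping by 10.
printf '%s\n' "$numbered"
```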

TRANSLATING CHARACTERS – THE tr COMMAND

The tr (translate) filter manipulates individual characters in a line. It translates characters using one or two compact expressions:

Syntax:
tr options expression1 expression2 standard input

tr takes input only from the standard input; it doesn’t take a filename as argument.

By default, it translates each character in expression1 to its mapped counterpart in expression2. The 1st character in the 1st expression is replaced with the 1st character in the 2nd expression, and similarly for the other characters.


Ex: To replace the "|" with a ~ (tilde) and the "/" with a -:

$ tr '|/' '~-' < shortlist | head -2
2233 ~ shukla ~ g.m ~ sales ~ 12-12-52 ~ 20000
9876 ~ sharma ~d.g.m~product~ 12-03-60 ~ 15000

Changing case of text: To change the case of the first two lines from lower to upper:

$ head -2 e.lst | tr '[a-z]' '[A-Z]'
2233 | SHUKLA | G.M | SALES | 12/12/52 | 20000
9876 | SHARMA | MGR |PRODUCT| 12/03 60 | 15000
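The positional mapping between the two expressions can be seen on a single record. In this sketch the record is a made-up line in the style of shortlist, fed through a pipe instead of a file.

```shell
# Made-up record in the style of shortlist.
record='2233 | shukla | g.m | sales | 12/12/52 | 20000'

# '|' is the 1st character of expression1, so it becomes '~';
# '/' is the 2nd, so it becomes '-'.
mapped=$(printf '%s\n' "$record" | tr '|/' '~-')

# Each lowercase letter maps to the uppercase letter at the same position.
upper=$(printf '%s\n' "$record" | tr '[a-z]' '[A-Z]')

printf '%s\n' "$mapped" "$upper"
```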


Using ASCII octal values and escape sequences: tr also uses octal values and escape sequences to represent characters. To have each field on a separate line, replace the "|" with the LF character (octal value 012):

$ tr '|' '\012' < emp.lst | head -n 6
2233
shukla
g.m
sales
12/12/52
20000
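As a small illustration, the octal value and the escape sequence \n name the same LF character, so both commands below should split a record in the same way. The record is made-up sample data without padding spaces.

```shell
# Made-up record; fields are separated by '|' with no padding.
record='2233|shukla|g.m|sales'

# \012 is the octal value of LF ...
by_octal=$(printf '%s\n' "$record" | tr '|' '\012')
# ... and \n is the escape sequence for the same character.
by_escape=$(printf '%s\n' "$record" | tr '|' '\n')

printf '%s\n' "$by_octal"
```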


Deleting characters (-d): To delete the characters "|" and "/" from the file:

$ tr -d '|/' < shortlist | head -n 2
2233 shukla g.m sales 121252 20000
9876 sharma d.g.m product 1203 60 15000

Compressing multiple consecutive characters (-s): We can eliminate all redundant spaces in files with delimited fields with the -s (squeeze) option. The -s option squeezes multiple consecutive occurrences of its argument to a single character.

$ tr -s ' ' < shortlist | head -n 3
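A brief sketch of both options on made-up records (the strings below are illustrative, not taken from shortlist):

```shell
# -d removes every occurrence of the listed characters.
deleted=$(printf '%s\n' '2233|shukla|12/12/52' | tr -d '|/')

# -s collapses each run of repeated spaces into a single space.
squeezed=$(printf '%s\n' '2233   |  shukla    | g.m' | tr -s ' ')

printf '%s\n' "$deleted" "$squeezed"
```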


File Utilities

cut

paste

head

tail

cmp

comm

diff


Filters

A group of commands, each of which accepts some data as input, performs some manipulation on it, and produces some output. Since they perform some filtering action on the data, they are appropriately called filters.

grep
egrep
fgrep
sed
awk
sort
uniq
nl


SEARCHING FOR A PATTERN – THE grep COMMAND

The grep command (the name comes from g/re/p: globally search for a regular expression and print) scans a file for occurrences of a pattern.

It accepts several options and, depending on their usage, outputs the lines containing the pattern, the filenames, or the line numbers.

Syntax:

grep <options> <pattern> <filename(s)>

Most of grep's options are shared by the other members of its family (egrep and fgrep).


In addition to options, grep requires an expression to represent the pattern to be searched for. The first argument (barring the options) is always treated as the expression, and the remaining ones as filenames.

grep looks for all occurrences of the expression in its input and, by default, outputs the lines containing the expression.


Ex:
$ grep "sales" e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000

When grep is used with multiple filenames, it displays the filenames along with the output.

$ grep "sales" e.lst shortlist
e.lst:2233 | shukla | g.m | sales | 12/12/52 | 20000
shortlist:2233 | shukla | g.m | sales | 12/12/52 | 20000
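The difference between searching one file and several can be reproduced with two small files. This is a sketch under assumptions: the contents of e.lst and shortlist below are shortened, made-up stand-ins, created in a temporary directory.

```shell
# Made-up, shortened stand-ins for e.lst and shortlist.
tmp=$(mktemp -d)
printf '%s\n' '2233 | shukla | g.m | sales' '9876 | sharma | mgr | product' > "$tmp/e.lst"
printf '%s\n' '2233 | shukla | g.m | sales' > "$tmp/shortlist"

# One file: only the matching line is printed.
one=$(cd "$tmp" && grep "sales" e.lst)

# Two files: each matching line is prefixed with its filename.
many=$(cd "$tmp" && grep "sales" e.lst shortlist)

printf '%s\n' "$one" "$many"
rm -rf "$tmp"
```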


Because grep is also a filter, it can search its standard input for the pattern and store the output in a file:

$ who | grep itlaxmi > fff

Quoting in grep:

Quoting is essential if the search string consists of more than one word, or uses any of the shell's special characters like *, $ etc.

grep simply returns the prompt when the pattern can't be located.

$ grep president shortlist
$
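Although grep prints nothing when the pattern is absent, it does report the outcome through its exit status (0 when a line matched, 1 when none did), which a script can test. A minimal sketch, using a made-up one-line shortlist:

```shell
# Made-up one-line stand-in for shortlist.
tmp=$(mktemp -d)
printf '%s\n' '2233 | shukla | g.m | sales' > "$tmp/shortlist"

# No line contains "president", so grep exits with status 1.
if grep "president" "$tmp/shortlist" > /dev/null; then
    found=yes
else
    found=no
fi

# A pattern of more than one word must be quoted so that the shell
# passes it to grep as a single argument.
if grep "g.m | sales" "$tmp/shortlist" > /dev/null; then
    quoted_found=yes
else
    quoted_found=no
fi

printf '%s %s\n' "$found" "$quoted_found"
rm -rf "$tmp"
```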


grep options

Option      Significance
-c          Displays a count of the lines that contain the pattern
-l          Displays only the names of the files that contain the pattern
-n          Displays line numbers along with the lines
-v          Displays the lines that do not match the expression
-i          Ignores case for matching
-h          Omits filenames when handling multiple files
-f flname   Takes expressions from file flname (egrep and fgrep only)
-x          Displays lines matched in entirety (fgrep only)

Note: the restrictions on -f and -x apply to older systems; POSIX grep accepts both options.


Examples

1. $ grep -h mgr emp.lst shortlist
1234 | kumar | mgr |accnts | 18/03/79 |15000
1234 | kumar | mgr |accnts | 18/03/79 |15000

2. $ grep -c 'mgr' e.lst emp.lst
e.lst:2
emp.lst:1

3. $ grep -n 'mgr' e.lst emp.lst
e.lst:2:9876 | sharma | mgr |product| 12/03 60 | 15000
e.lst:5:1234 | kumar | mgr |accnts | 18/03/79 |1500
emp.lst:5:1234 | kumar | mgr |accnts | 18/03/79 |15000


Examples

4. $ grep -v 'mgr' e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
7898 | akash | dir.|mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000

The -v option selects the lines that do not match the pattern. The matching lines are left out of the output only; the file itself is unchanged.

5. $ grep -l 'mgr' *.lst
desg.lst
desig.lst
e1.lst
e.lst
emp1.lst
emp.lst


Examples

6. $ grep -i 'SHUKLA' e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
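The options in the examples above can be compared side by side on one small file. The four-line e.lst below is a made-up, shortened stand-in written to a temporary directory, so the line numbers differ from those on the slides.

```shell
# Made-up, shortened stand-in for e.lst.
tmp=$(mktemp -d)
printf '%s\n' \
    '2233 | shukla | g.m | sales' \
    '9876 | sharma | mgr | product' \
    '7898 | akash | dir. | mark.' \
    '1234 | kumar | mgr | accnts' > "$tmp/e.lst"

count=$(grep -c 'mgr' "$tmp/e.lst")       # number of matching lines
numbered=$(grep -n 'mgr' "$tmp/e.lst")    # matching lines with line numbers
inverted=$(grep -v 'mgr' "$tmp/e.lst")    # lines that do NOT match
nocase=$(grep -i 'SHUKLA' "$tmp/e.lst")   # case-insensitive match

printf '%s\n' "$count" "$numbered" "$inverted" "$nocase"
rm -rf "$tmp"
```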


Basic Regular Expressions (BRE)

You don't always search a file with simple strings. It is possible that you may be looking for a name, but don't know exactly how it is spelt. Or, you may be interested in the occurrences of a pattern only at a certain location, e.g. the beginning of a record.

The importance of grep lies not merely in its simple pattern-matching capability but in its acceptance of a regular expression for a pattern.

A regular expression is a string of ordinary characters and metacharacters which can be used to match more than one type of pattern.


The BRE Character Set Used by grep, sed and awk

Pattern     Matches
*           Zero or more occurrences of the previous character
.           A single character
[pqr]       A single character p, q or r
[c1-c2]     A single character within the ASCII range represented by c1 and c2
[^pqr]      A single character which is not a p, q or r
^pat        Pattern pat at beginning of line
pat$        Pattern pat at end of line


Examples

g*          Nothing or g, gg, ggg, etc.
gg*         g, gg, ggg, etc.
.*          Nothing or any number of characters
[1-3]       A digit between 1 and 3
[^a-zA-Z]   A nonalphabetic character
bash$       bash at end of line
^bash$      bash as the only word in line
^$          Lines containing nothing


Examples

$ grep "k.*" e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
7898 | akash | dir.|mark. | 11/06/70 |9000
1234 | kumar | mgr |accnts | 18/03/79 |1500

$ grep "9000$" e.lst
7898 | akash | dir.|mark. | 11/06/70 |9000

$ grep '[Ss]h*arma' e.lst
9876 | sharma | mgr |product| 12/03/60 |15000
8888 | Sarma | dir.| sales | 05/09/60 |25000

$ grep '[1-2]...$' e.lst
1234 | kumar | mgr |accnts | 18/03/79 |1500
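To see why only the 1500 record matches '[1-2]...$', it helps to run the patterns against a file whose contents are fully known. The file below is a made-up, shortened stand-in for e.lst.

```shell
# Made-up, shortened stand-in for e.lst (the last field is the salary).
tmp=$(mktemp -d)
printf '%s\n' \
    '2233 | shukla | g.m | 20000' \
    '9876 | sharma | mgr | 15000' \
    '8888 | Sarma | dir. | 25000' \
    '1234 | kumar | mgr | 1500' > "$tmp/e.lst"

# [Ss] matches either case; h* allows zero or more h characters.
sarma=$(grep '[Ss]h*arma' "$tmp/e.lst")

# The line must END in a 1 or 2 followed by exactly three more characters,
# so the character four places from the end has to be 1 or 2.
four_digit=$(grep '[1-2]...$' "$tmp/e.lst")

# ^ anchors the pattern to the beginning of the line.
starts_with_2=$(grep '^2' "$tmp/e.lst")

printf '%s\n' "$sarma" "$four_digit" "$starts_with_2"
rm -rf "$tmp"
```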


EXTENDING grep – THE egrep

The egrep command extends grep's pattern-matching capabilities.

It offers all the options of grep, but its most useful feature is the facility to specify more than one pattern for search.

Each pattern is separated from the other by a | (pipe).


The extended regular expression set used by egrep and awk

Expression   Significance
ch+          Matches one or more occurrences of the character ch
ch?          Matches zero or one occurrence of the character ch
exp1|exp2    Matches the expression exp1 or exp2
(x1|x2)x3    Matches the expression x1x3 or x2x3


Examples

g+               At least one g
g?               Nothing or one g
GIF|JPEG         Matches GIF or JPEG
(lock|ver)wood   Matches lockwood or verwood

$ egrep 'sales |mark.' e.lst
2233 | shukla | g.m | sales | 12/12/52 |20000
7898 | akash | dir.|mark. | 11/06/70 |9000
8888 | Sarma | dir.| sales | 05/09/60 |25000

$ egrep -i '(sh|s)arma' e.lst
9876 | sharma | mgr |product| 12/03/60 |15000
8888 | Sarma | dir.| sales | 05/09/60 |25000
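The same alternation can be tried with grep -E, which POSIX specifies as the equivalent of egrep; some current systems warn that the egrep name is obsolescent. The sketch below uses a made-up, shortened e.lst, and the dot in mark\. is escaped so that it matches a literal dot only.

```shell
# Made-up, shortened stand-in for e.lst.
tmp=$(mktemp -d)
printf '%s\n' \
    '2233 | shukla | g.m | sales' \
    '9876 | sharma | mgr | product' \
    '7898 | akash | dir. | mark.' \
    '8888 | Sarma | dir. | sales' > "$tmp/e.lst"

# Either alternative may match; \. is a literal dot.
depts=$(grep -E 'sales|mark\.' "$tmp/e.lst")

# Grouping: (sh|s) followed by arma, ignoring case.
names=$(grep -E -i '(sh|s)arma' "$tmp/e.lst")

printf '%s\n' "$depts" "$names"
rm -rf "$tmp"
```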


$ egrep -f pat.lst emp.lst

The command takes the expressions from the file pat.lst. This file must contain the patterns, suitably delimited in the same way as they are specified on the command line.
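A pattern file might look like the following. This is a sketch with made-up file contents, using grep -E as the POSIX spelling of egrep. It also shows that putting each pattern on its own line (pat2.lst, a hypothetical second file) selects the same lines as the | form.

```shell
# Made-up, shortened stand-in for emp.lst.
tmp=$(mktemp -d)
printf '%s\n' \
    '2233 | shukla | g.m | sales' \
    '9876 | sharma | mgr | product' \
    '7898 | akash | dir. | mark.' > "$tmp/emp.lst"

# pat.lst: alternatives delimited by |, exactly as on the command line.
printf '%s\n' 'sales|mark\.' > "$tmp/pat.lst"
# pat2.lst (hypothetical): one pattern per line.
printf '%s\n' 'sales' 'mark\.' > "$tmp/pat2.lst"

by_alternation=$(grep -E -f "$tmp/pat.lst" "$tmp/emp.lst")
by_lines=$(grep -E -f "$tmp/pat2.lst" "$tmp/emp.lst")

printf '%s\n' "$by_alternation"
rm -rf "$tmp"
```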


MULTIPLE STRING SEARCHING – THE fgrep

Like egrep, fgrep accepts alternative patterns, both from the command line and from a file, but unlike grep and egrep, it doesn't accept regular expressions.

If the pattern to search for is a simple string, or a group of them, then fgrep is recommended.

It is faster than its two fellow members, and should be used when searching for fixed strings.

Alternative patterns in fgrep are specified by separating one pattern from another by the newline character. This is unlike egrep, which uses the | to delimit two expressions.


Ex: If you search for three specific departments (without regular expressions), fgrep used in the following manner can produce a list, sorted in reverse order, of the lines containing the three patterns:

$ fgrep 'sales
> personnel
> admin' emp.lst | sort -t "|" +3r | tee newlist

(+3r is the older way of naming the sort key; current versions of sort write it as -k 4r.)

Like egrep, fgrep also takes patterns from a file, except that each string has to be stored on a separate line.

Ex:
$ cat pat1.lst
sales
personnel
admin

$ fgrep -f pat1.lst emp.lst


Examples

$ fgrep 'sales
> mark.
> product' e.lst
2233 | shukla | g.m | sales | 12/12/52 |20000
9876 | sharma | mgr |product| 12/03/60 |15000
7898 | akash | dir.|mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000
8888 | Sarma | dir.| sales | 05/09/60 |25000
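The practical effect of "no regular expressions" is that characters such as the dot are taken literally. The sketch below uses grep -F, the POSIX equivalent of fgrep, on made-up data in which one record contains the word gym.

```shell
# Made-up data: "gym" is there to show the difference between the two modes.
tmp=$(mktemp -d)
printf '%s\n' \
    '2233 | shukla | g.m | sales' \
    '5555 | gupta | gym | admin' \
    '7898 | akash | dir. | mark.' > "$tmp/emp.lst"

# As a regular expression, the dot in g.m matches any character, so gym matches too.
as_regex=$(grep 'g.m' "$tmp/emp.lst")
# As a fixed string, only the literal text g.m matches.
as_fixed=$(grep -F 'g.m' "$tmp/emp.lst")

# Alternative fixed strings, one per line, taken from a file.
printf '%s\n' 'sales' 'admin' > "$tmp/pat1.lst"
from_file=$(grep -F -f "$tmp/pat1.lst" "$tmp/emp.lst")

printf '%s\n' "$as_regex" "$as_fixed" "$from_file"
rm -rf "$tmp"
```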


RELATIONAL JOIN – THE join COMMAND

join helps to establish a logical relationship between two tables.

It uses a common column in each table to establish this relationship and, by default, creates a single row which contains all the columns of the two tables.

The prerequisite is that both tables be sorted on the joined columns.

Syntax:
join <options> <join field specifiers> <output fieldlist> file1 file2

When no field delimiters are specified, it assumes that the fields are delimited by spaces.


join uses numbers to identify fields, but it also uses numbers to identify files. Since you can join only two files with a single command, the file number can take the value 1 or 2, depending on the position of the file argument in the command line.


Examples

$ cat > emp_table
empid designation deptno
111 director 10
112 manager 10
113 dgm 20

[itlaxmi@snist ~]$ cat > dept_table
deptno deptname
10 sales
20 production

$ join -j1 3 -j2 1 emp_table dept_table
deptno empid designation deptname
10 111 director sales
10 112 manager sales
20 113 dgm production
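The same join can be written with the POSIX options -1 and -2, which name the join field of the first and second file. This sketch recreates the two tables without their header lines (a simplification, so that both files are plainly sorted on the join field) in a temporary directory.

```shell
# The two tables from the example, minus the header lines.
tmp=$(mktemp -d)
printf '%s\n' '111 director 10' '112 manager 10' '113 dgm 20' > "$tmp/emp_table"
printf '%s\n' '10 sales' '20 production' > "$tmp/dept_table"

# -1 3 : join on field 3 (deptno) of the first file
# -2 1 : join on field 1 (deptno) of the second file
# Each output row is the join field, then the remaining fields of file 1,
# then the remaining fields of file 2.
joined=$(join -1 3 -2 1 "$tmp/emp_table" "$tmp/dept_table")

printf '%s\n' "$joined"
rm -rf "$tmp"
```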


CREATING A TEE – THE tee COMMAND

tee is an external command and not a feature of the shell. It handles a character stream by splitting its input into two components. It saves one component in a file and writes the other to the standard output.

Being also a filter, tee can be placed anywhere in a pipeline.

tee doesn't perform any filtering action on its input; it gives out exactly what it takes.

The following command sequence uses tee to display the output of who and save this output in a file as well.


Examples

$ who | tee user.lst
root tty7 2009-08-04 09:51 (:0)
root pts/1 2009-08-04 09:51 (:0)
itlaxmi pts/2 2009-08-04 12:52 (10.4.8.19)

[itlaxmi@snist ~]$ cat user.lst
root tty7 2009-08-04 09:51 (:0)
root pts/1 2009-08-04 09:51 (:0)
itlaxmi pts/2 2009-08-04 12:52 (10.4.8.19)


Since tee uses standard output, you can pipe its output to another command, say wc:

$ who | tee user.lst | wc -l
3

The -a (append) option appends the output to the file specified as argument.

$ cal 2009 | tee -a calfile > calfile2

The sequence appends one stream to the file calfile, while overwriting the file calfile2 with the other stream.
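Both behaviours can be checked without depending on who or cal, whose output varies from system to system. In this sketch three made-up user names are fed through tee, and a fourth (guest, purely illustrative) is appended afterwards.

```shell
tmp=$(mktemp -d)

# tee saves the three lines in user.lst and also passes them on to wc.
# tr strips the padding that some versions of wc add around the count.
count=$(printf '%s\n' root root itlaxmi | tee "$tmp/user.lst" | wc -l | tr -d ' ')

# -a appends to user.lst instead of overwriting it.
printf '%s\n' guest | tee -a "$tmp/user.lst" > /dev/null

saved=$(cat "$tmp/user.lst")
printf '%s\n' "$count" "$saved"
rm -rf "$tmp"
```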


THE pg COMMAND

The disadvantage of head and tail is that neither can, on its own, display a range of lines from the middle of a file. Moreover, what is displayed is final. That is, if we have displayed the first 50 lines in a file, we cannot move back and view, say, the 10th line.

Unix provides two commands which offer more flexibility in viewing files. These are pg and more.

They work in more or less the same manner, except for a few minor differences.
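For comparison, a fixed range can still be pulled out non-interactively by chaining head and tail in a pipeline, although this gives none of the back-and-forth movement that a pager offers. A sketch, assuming a made-up 20-line file named myfile:

```shell
# Build a made-up 20-line file: "line 1" ... "line 20".
tmp=$(mktemp -d)
i=1
while [ "$i" -le 20 ]; do
    printf 'line %d\n' "$i"
    i=$((i + 1))
done > "$tmp/myfile"

# Lines 10 to 15: keep the first 15 lines, then the last 6 of those.
range=$(head -n 15 "$tmp/myfile" | tail -n 6)

printf '%s\n' "$range"
rm -rf "$tmp"
```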


Each of them helps you view a file page by page, with a lot of useful options such as:

(a) Set the number of lines to be displayed per page.

(b) Move either forwards or backwards in a file at the touch of a key.

(c) Skip pages while viewing the file page by page.

(d) Search the file for a pattern in the forward or backward direction.

On executing either of these commands, one pageful of the file's contents is displayed on the screen, after which a prompt is displayed at which the user can give the various commands that are understood by pg or more.


Example

$ pg +10 -15 -p "Page no. %d" myfile

This command displays the contents of myfile 15 lines at a time, starting from the 10th line. At the end of each displayed page a prompt appears, showing the number of the page on view. This prompt overrides the default ':' prompt of the pg command.