
8/8/2019 Week Text Processing)

http://slidepdf.com/reader/full/week-text-processing 1/34

Text Processing and Pattern Searching

Chapter 6


Text line length adjustment

Given a set of lines of text of arbitrary length, reformat the text so that no line of more than n characters is printed. Each output line should hold the maximum number of words that fit in n characters or fewer, and no word should extend across two lines. Paragraphs should also remain indented.

Tabs are not treated specially (take each one as a single space).

Reading the end of an input line returns a space.


Algorithm

1. Establish the line length limit and add one to it to allow for a space.
2. Initialize the word and line character counts to zero and the end-of-line flag to false.
3. While not end-of-file do
   a) read and store next character;
   b) if character is a space then
      b.1) if a new paragraph then
           1.a) move to next line and reset character count for the new line;
      b.2) add current word length to current line length;
      b.3) if current word causes line length limit to be exceeded then
           3.a) move to next line and set line length to current word length;
      b.4) write out current word and its trailing space and reinitialize character count;
      b.5) turn off end-of-input-line flag;
      b.6) if at end-of-input-line then
           6.a) set end-of-input-line flag and move to next input line.


Pascal Implementation

procedure textformat(limit: integer);
var i, linecnt, wordcnt: integer;
    chr, space: char;
    eol: boolean;
    word: array[1..30] of char;   { holds the current word, at most 30 characters }
begin
  wordcnt := 0; linecnt := 0; eol := false; space := ' ';
  limit := limit + 1;             { allow for the space that follows each word }
  while not eof(input) do
  begin
    read(chr);
    wordcnt := wordcnt + 1;
    word[wordcnt] := chr;
    if chr = space then
    begin
      if eol and (wordcnt = 1) then
      begin                       { a space at the start of a line: new paragraph }
        writeln;
        linecnt := 0
      end;
      linecnt := linecnt + wordcnt;
      if linecnt > limit then
      begin                       { word does not fit: start a new output line }
        writeln;
        linecnt := wordcnt
      end;
      for i := 1 to wordcnt do
        write(word[i]);
      wordcnt := 0;
      eol := false;
      if eoln(input) then
      begin
        eol := true;
        readln
      end
    end
  end;
  writeln
end;
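As a cross-check on the procedure above, here is a minimal Python sketch of the same greedy line-filling idea. It is not a port of the Pascal: the name fill_lines, the list-of-lines input, and the rule that an input line starting with a space opens a new paragraph are assumptions made for illustration. A word longer than n is still written on a line of its own.

```python
def fill_lines(lines, n):
    """Greedily pack whole words into output lines of at most n characters."""
    out = []        # finished output lines
    current = ""    # output line under construction
    for line in lines:
        line = line.replace("\t", " ")               # a tab counts as one space
        indent = len(line) - len(line.lstrip(" "))
        if indent > 0 and line.strip():
            # Assumption: an indented input line opens a new paragraph.
            if current:
                out.append(current)
            current = " " * indent
            fresh = True                             # first word follows the indent directly
        else:
            fresh = current == ""
        for word in line.split():
            if fresh:
                current += word
                fresh = False
            elif len(current) + 1 + len(word) <= n:
                current += " " + word                # word and its separating space fit
            else:
                out.append(current)                  # word does not fit: start a new line
                current = word
    if current:
        out.append(current)
    return out


if __name__ == "__main__":
    for s in fill_lines(["  the quick brown", "fox jumps over", "  the lazy dog"], 12):
        print(repr(s))
```

Collecting the output in a list rather than printing character by character keeps the sketch easy to test; the Pascal version streams its output instead.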


Left and Right Justification of Text

Design and implement a procedure that will left and right justify text in a way that avoids splitting words and leaves paragraphs indented. An attempt should also be made to distribute the additional blanks as evenly as possible in the justified line.


A fixed line length is achieved by inserting additional spaces between words.

For any particular line, one of the following holds:

 - The line is already the correct length, so no processing is needed.
 - The number of extra spaces needed to expand the line to the required length equals the number of spaces already present in the line. This case simply adds 1 space to each existing space.
 - The number of spaces to be added is greater than the number of existing spaces.
 - The number of spaces to be added is less than the number of existing spaces.


Example: add 10 extra spaces to a line that already has 7 spaces.

First add 7 of them evenly, one to each existing space; 3 are left over. Leftover spaces are placed as follows:

 - If 1 space is left, add it to the middle space.
 - If 2 spaces are left, position the first one-third of the way across the line and the second two-thirds of the way across.
 - If 3 spaces are left, add them after the 2nd, 4th and 6th words.
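The arithmetic of this example can be sketched in a few lines of Python. The function name spread_extra and the integer placement rule (the k-th leftover space goes to gap k(g+1)/(l+1), rounded half up, for g gaps and l leftovers) are assumptions chosen to reproduce the three cases listed above; the slides do not give a formula.

```python
def spread_extra(gaps, extra):
    """Return the width of each inter-word gap after adding `extra` spaces."""
    if gaps == 0:
        return []
    base, leftover = divmod(extra, gaps)      # every gap gets `base` extra spaces
    widths = [1 + base] * gaps                # each gap starts as a single space
    for k in range(1, leftover + 1):
        # Evenly spaced gap number (1-based), rounded half up with integer arithmetic.
        pos = (k * (gaps + 1) + (leftover + 1) // 2) // (leftover + 1)
        widths[pos - 1] += 1
    return widths


if __name__ == "__main__":
    print(spread_extra(7, 10))   # 7 gaps, 10 extra spaces
    print(spread_extra(7, 1))    # a single leftover lands on the middle gap
```

For 7 gaps and 10 extra spaces the widths sum to 17, with the three leftover spaces on gaps 2, 4 and 6.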


Algorithm

1. Establish the line to be justified, its current length and the justification length.
2. Include a test to see if it can be justified.
3. Initialize the space count and the alphabetic start of line.
4. While the current character is a space do
   a) shift to next character;
   b) increment the alphabetic start of line;
   c) write out a space.
5. For each position from the alphabetic start to the end of line do
   a) if current character is a space then
      a.1) increment space count;
      a.2) set current position in space count table to 1.
6. Remove any spaces from the end of the line.
7. Determine the extra spaces to be added from the new and old line lengths.
8. While there are still extra spaces to add and it is possible to do so do
   a) compute current template increment from space count and extra spaces count;
   b) if increment is 0 then set it to 1, since there are more extras than spaces;
   c) if extra spaces > space count then
      c.1) determine space block size using extra spaces and space count
      else
      c.1') set space block size to 1;
   d) determine starting position for template;
   e) while not end-of-line and still spaces to add do
      e.1) add space block to current template position;
      e.2) move to next position in template;
      e.3) decrement extra space count by space block size.
9. For each position from start to end-of-line do
   a) if next character is a space then
      a.1) move to next position in space count table;
      a.2) write out number of spaces as per space count table
      else
      a.1') write out current character.
10. Finish off with an end-of-line.


Pascal Implementation

procedure justify(line: nchars; oldlen, newlen: integer);
const tsize = 40;
var delta, exspace, j, ispace, nspaces, next, pos, st, spaceblock: integer;
    space: char;
    template: array[1..tsize] of integer;   { spaces to write at each gap }
begin
  if oldlen > newlen then
    writeln('line too long')
  else
  begin
    space := ' '; st := 1;
    { copy the paragraph indentation unchanged }
    while (line[st] = space) and (st <= oldlen) do
    begin
      st := st + 1; write(space)
    end;
    nspaces := 0;
    { strip trailing spaces }
    if st <= oldlen then
      while line[oldlen] = space do oldlen := oldlen - 1;
    for pos := st to oldlen do
      if line[pos] = space then
      begin
        nspaces := nspaces + 1;
        template[nspaces] := 1
      end;
    exspace := newlen - oldlen;
    while (exspace > 0) and (nspaces > 0) do
    begin
      delta := round(nspaces / exspace);
      if delta = 0 then delta := 1;
      if exspace > nspaces then
        spaceblock := exspace div nspaces
      else
        spaceblock := 1;
      next := (delta + 1) div 2;
      while (next <= nspaces) and (exspace > 0) do
      begin
        template[next] := template[next] + spaceblock;
        next := next + delta;
        exspace := exspace - spaceblock
      end
    end;
    ispace := 0;
    for pos := st to oldlen do
      if line[pos] = space then
      begin
        ispace := ispace + 1;
        for j := 1 to template[ispace] do
          write(space)
      end
      else
        write(line[pos]);
    writeln
  end
end;
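The template arithmetic of the procedure above can be tried out with a short Python sketch. It follows the Pascal step by step, with two stated assumptions: it returns the justified string instead of writing it, and it reproduces Pascal's round for positive values with integer arithmetic, because Python's built-in round uses a different tie-breaking rule. The name justify and the ValueError are illustrative choices.

```python
def justify(line, newlen):
    """Pad the gaps of `line` so that it is exactly `newlen` characters long."""
    body = line.rstrip(" ")                      # strip trailing spaces
    if len(body) > newlen:
        raise ValueError("line too long")
    st = len(body) - len(body.lstrip(" "))       # paragraph indentation is kept as is
    text = body[st:]
    nspaces = text.count(" ")
    template = [1] * nspaces                     # spaces to write at each gap
    exspace = newlen - len(body)
    while exspace > 0 and nspaces > 0:
        # round(nspaces / exspace) with halves rounded up, never less than 1
        delta = max(1, (2 * nspaces + exspace) // (2 * exspace))
        block = exspace // nspaces if exspace > nspaces else 1
        nxt = (delta + 1) // 2                   # 1-based starting gap
        while nxt <= nspaces and exspace > 0:
            template[nxt - 1] += block
            nxt += delta
            exspace -= block
    out, ispace = [" " * st], 0
    for ch in text:
        if ch == " ":
            out.append(" " * template[ispace])
            ispace += 1
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(repr(justify("the quick brown fox", 24)))
```

A line with no internal spaces cannot be stretched and is returned unchanged, which matches the nspaces > 0 guard in the Pascal loop.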


Keyword Searching in Text

Count the number of times a particular word occurs in a given text.

Algorithm

1. Establish the search word and its length wlength.
2. Initialize the match count nmatches, set the preceding character, and set the pointer for the word array i to 1.
3. While not at end-of-file do
   a) while not end-of-line do
      a.1) read next character;
      a.2) if current text character chr matches the ith character in word then
           2.a) extend partial match i by 1,
           2.b) if a word-pattern match then
                b.1) read next character post,
                b.2) if preceding and following characters are not alphabetic then
                     2.a) update match count nmatches,
                b.3) reinitialize pointer to word array i,
                b.4) save following character post as preceding character
           else
           2.a') save current text character as preceding character for match,
           2.b') reset word array pointer i to first position;
   b) read past end-of-line.
4. Return word-match count nmatches.


Pascal Implementation

procedure wordsearch(word: nchars; wlength: integer; var nmatches: integer);
type letters = 'a'..'z';
var i: integer;
    chr, pre, post: char;
    alphabet: set of letters;
begin
  alphabet := ['a'..'z'];
  pre := ' '; i := 1;
  nmatches := 0;                  { step 2: initialize the match count }
  while not eof(input) do
  begin
    while not eoln(input) do
    begin
      read(chr);
      if chr = word[i] then
      begin
        i := i + 1;
        if i > wlength then
        begin                     { all characters of the word matched }
          read(post);
          if (not (pre in alphabet)) and (not (post in alphabet)) then
            nmatches := nmatches + 1;
          i := 1;
          pre := post
        end
      end
      else
      begin
        pre := chr;
        i := 1
      end
    end;
    readln
  end
end;
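For comparison, the same whole-word test can be sketched in Python on an in-memory string. This is a simplification under stated assumptions rather than a port: it checks every starting position with a slice instead of reading character by character, it treats the start and end of the text as non-alphabetic, and it uses str.isalpha, which accepts more than the lowercase letters in the Pascal set. The name count_word is illustrative.

```python
def count_word(text, word):
    """Count occurrences of `word` that are not part of a longer alphabetic run."""
    w = len(word)
    if w == 0:
        return 0
    count = 0
    for start in range(len(text) - w + 1):
        if text[start:start + w] != word:
            continue
        # Characters on either side of the candidate; text boundaries act like spaces.
        pre = text[start - 1] if start > 0 else " "
        post = text[start + w] if start + w < len(text) else " "
        if not pre.isalpha() and not post.isalpha():
            count += 1
    return count


if __name__ == "__main__":
    print(count_word("the cat sat on the mat; then the end", "the"))
```

In the sample line, "then" is rejected because the character following the candidate is a letter.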


Text Line Editing

Design and implement an algorithm that will search a line of text for a particular pattern or substring. Should the pattern be found, it is to be replaced by another given pattern.

the two wrongs in this line are wrong -- original line
the two rights in this line are right -- edited line


Algorithm

1. Establish the text line, the search pattern and the replacement pattern, and their associated lengths.
2. Set initial values for the position in the old text, the new text and the search pattern.
3. While all pattern positions in the text have not been examined do
   a) if current text and pattern characters match then
      a.1) extend indices to next pattern/text character pair;
      a.2) if a complete match then
           2.a) copy new pattern into current position in edited line,
           2.b) move past old pattern in text,
           2.c) reset pointer for search pattern
      else
      a'.1) copy current text character to next position in edited text;
      a'.2) reset search pattern pointer;
      a'.3) move pattern to next text position.
4. Copy the leftover characters in the original text line.
5. Return the edited line of text.


Pascal Implementation

procedure textedit(var text, newtext: nchars; var pattern, newpattern: nchars;
                   var newtextlen: integer; textlen, patlen, newpatlen: integer);
var i, j, k, l: integer;
begin
  i := 1; j := 1; k := 0;
  while i <= textlen - patlen + 1 do
  begin
    if text[i + j - 1] = pattern[j] then
    begin
      j := j + 1;
      if j > patlen then
      begin                       { complete match: copy the replacement }
        for l := 1 to newpatlen do
        begin
          k := k + 1;
          newtext[k] := newpattern[l]
        end;
        i := i + patlen; j := 1
      end
    end
    else
    begin                         { mismatch: copy one text character }
      k := k + 1;
      newtext[k] := text[i];
      i := i + 1; j := 1
    end
  end;
  { copy the leftover characters }
  while i <= textlen do
  begin
    k := k + 1; newtext[k] := text[i];
    i := i + 1
  end;
  newtextlen := k
end;
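The search-and-replace loop above translates almost line for line into Python. The sketch below uses 0-based indices and builds the edited line from a list of pieces; the name text_edit and the early return for an empty pattern are assumptions added for illustration.

```python
def text_edit(text, pattern, newpattern):
    """Replace every non-overlapping occurrence of `pattern`, scanning left to right."""
    n, m = len(text), len(pattern)
    if m == 0:
        return text                  # assumption: an empty pattern matches nothing
    out = []                         # pieces of the edited line
    i = j = 0                        # i: current alignment in text, j: position in pattern
    while i <= n - m:
        if text[i + j] == pattern[j]:
            j += 1
            if j == m:               # complete match: emit the replacement
                out.append(newpattern)
                i += m
                j = 0
        else:                        # mismatch: copy one character and realign
            out.append(text[i])
            i += 1
            j = 0
    out.append(text[i:])             # leftover characters
    return "".join(out)


if __name__ == "__main__":
    print(text_edit("the two wrongs in this line are wrong", "wrong", "right"))
```

On the sample line from the problem statement, this sketch gives the same result as Python's built-in str.replace.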


Linear Pattern Search

Design and implement a pattern searching algorithm with a performance that is linearly dependent on the length of the string or text being searched. A count should be made of the number of times the search pattern occurs in the string.


Algorithm Description

Partial-match table setup algorithm

Linear pattern searching algorithm

Procedure for recovering from mismatches and complete matches


Partial-match table setup algorithm

1. Establish the search pattern.
2. Set the initial displacement between the pattern and itself to one.
3. Initialize the zeroth and first positions in the partial match array to zero.
4. While all positions of the pattern relative to itself have not been considered do
   a) if current pattern and displaced pattern character pairs match then
      a.1) save current degree of partial match;
      a.2) move to next position in pattern and displaced pattern
      else
      a'.1) a mismatch, so set partial match to zero;
      a'.2) reset pointer to start of displaced pattern;
      a'.3) move the start of the displaced pattern to the next available position.
5. Return the partial match table.


Linear pattern searching algorithm

1. Establish the pattern to be searched for and the string in which it is to be sought, together with the lengths of the pattern and the string.
2. Set initial values for the start of the pattern and the string, and zero the match count.
3. While all appropriate pattern positions in the string have not been examined do
   a) if current string and pattern characters match then
      a.1) extend indices to next pattern/string pair;
      a.2) if a complete match then
           2.a) update complete match count,
           2.b) reset recovery position from the partial match table
      else
      a'.1) reset recovery position from the partial match table.
4. Return the count of the number of complete matches of the pattern in the string.


Procedure for recovering from mismatches and complete matches

1. Establish the partial match table, the current position in the string and the position in the pattern.
2. If there is no smaller partial match then
   a) move to next position in string;
   b) return to start of pattern
   else
   a') recover from the mismatch or complete match by using the table to set a new, smaller partial match for the current position in the string.
3. Return the smaller partial match and the current pattern position.


Pascal Implementation

{ nchars is a character array type and ntchars an integer array type
  indexed from 0; both are assumed to be declared elsewhere. }
procedure kmpsearch(pattern: nchars; string: nchars; var recover: ntchars;
                    var nmatches: integer; patlength, slength: integer);
var position, match: integer;

  procedure restart(recover: ntchars; var match, position: integer);
  begin
    if match = 1 then
      position := position + 1                 { no smaller partial match }
    else
      match := recover[match - 1] + 1          { fall back using the table }
  end;

  procedure partialmatch(pattern: nchars; var recover: ntchars; patlength: integer);
  var position, match: integer;
  begin
    position := 2; match := 1;
    recover[0] := 0; recover[1] := 0;
    while position <= patlength do
      if pattern[position] = pattern[match] then
      begin
        recover[position] := match;
        match := match + 1;
        position := position + 1
      end
      else if match > 1 then
        { Fall back to the next smaller partial match and compare again.
          Resetting straight to zero here would underestimate the overlap
          for patterns such as 'abaab'. }
        match := recover[match - 1] + 1
      else
      begin
        recover[position] := 0;
        position := position + 1
      end
  end;

begin
  partialmatch(pattern, recover, patlength);
  position := 1; match := 1; nmatches := 0;
  while position <= slength do
  begin
    if pattern[match] = string[position] then
    begin
      match := match + 1;
      position := position + 1;
      if match > patlength then
      begin
        nmatches := nmatches + 1;
        restart(recover, match, position)
      end
    end
    else
      restart(recover, match, position)
  end
end;
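A compact Python sketch of the same linear search is given below as a cross-check. It uses 0-based strings and the standard failure-function fallback when building the table, so on a mismatch it steps back through smaller partial matches rather than resetting to zero. The names failure_table and kmp_count are illustrative, and overlapping occurrences are counted.

```python
def failure_table(pattern):
    """fail[k] = length of the longest proper prefix of pattern[:k] that is also its suffix."""
    fail = [0] * (len(pattern) + 1)
    match = 0
    for pos in range(1, len(pattern)):
        while match > 0 and pattern[pos] != pattern[match]:
            match = fail[match]              # fall back to a smaller partial match
        if pattern[pos] == pattern[match]:
            match += 1
        fail[pos + 1] = match
    return fail


def kmp_count(pattern, text):
    """Count occurrences of `pattern` in `text` without ever moving back in `text`."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    count = match = 0
    for ch in text:
        while match > 0 and ch != pattern[match]:
            match = fail[match]              # recover from a mismatch
        if ch == pattern[match]:
            match += 1
        if match == len(pattern):
            count += 1
            match = fail[match]              # recover from a complete match
    return count


if __name__ == "__main__":
    print(failure_table("abaab"))
    print(kmp_count("abaab", "abaabaab"))
```

Each text character is read once, and the fallback loop can only undo progress made earlier, which is what keeps the total work linear in the length of the text plus the pattern.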


SubLinear Pattern Search

Design and implement an algorithm that will efficiently search a given text for a particular keyword or pattern and record the number of times the keyword or pattern is found.


Algorithm

1. Establish the word and text to be searched.
2. Set up the skip table.
3. Set keyword match count to zero.
4. Set character position i to keyword length.
5. While current character position does not exceed the text length do
   a) get numeric value nxt of current character at position i;
   b) index into skip table at position nxt;
   c) if skip value for current character > 0 then
      c.1) increase current position by skip value
      else
      c'.1) backwards-match text and word;
      c'.2) if match made, update match count;
      c'.3) recover from mismatch.
6. Return match count.


Pascal Implementation

{ tc is a character array type assumed to be declared elsewhere.
  The table type and setskips are placed first so that quicksearch can use them. }
const asize = 127;
type ascii = array[0..asize] of integer;

procedure setskips(word: tc; var skip: ascii; wlength, asize: integer);
var i, j, p: integer;
begin
  for i := 0 to asize do
    skip[i] := wlength;
  for j := 1 to wlength - 1 do
  begin
    p := ord(word[j]);
    skip[p] := wlength - j
  end;
  { a negative entry marks the last character of the word }
  p := ord(word[wlength]);
  skip[p] := -skip[p]
end;

procedure quicksearch(text, word: tc; tlength, wlength: integer; var nmatches: integer);
var i, j, k, nxt: integer;
    match: boolean;
    skip: ascii;
begin
  setskips(word, skip, wlength, asize);
  nmatches := 0; i := wlength;
  while i <= tlength do
  begin
    nxt := ord(text[i]);
    if skip[nxt] > 0 then
      i := i + skip[nxt]
    else
    begin                         { last character matches: compare backwards }
      j := i - 1;
      k := wlength - 1;
      match := true;
      while (k > 0) and match do
      begin
        if text[j] = word[k] then
        begin
          j := j - 1;
          k := k - 1
        end
        else
          match := false
      end;
      if match then
        nmatches := nmatches + 1; { step c'.2: update match count }
      i := i - skip[nxt]          { skip is negative here, so this moves forward }
    end
  end
end;
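The skip-table search above can be sketched in Python as follows. Instead of negating the entry for the last character, this version tests the last character directly and keeps all skip distances positive in a dictionary; that is a presentational choice of the sketch. The name skip_search_count is illustrative, and overlapping occurrences are counted.

```python
def skip_search_count(text, word):
    """Count occurrences of `word` in `text` using a last-character skip table."""
    m, n = len(word), len(text)
    if m == 0 or m > n:
        return 0
    # Distance from each character's last occurrence (excluding the final
    # position) to the end of the word; characters not in the table skip m.
    skip = {ch: m - 1 - j for j, ch in enumerate(word[:-1])}
    count = 0
    i = m - 1                                # text index aligned with the word's last character
    while i < n:
        ch = text[i]
        if ch == word[-1]:
            j, k = i - 1, m - 2              # backwards-match the rest of the word
            while k >= 0 and text[j] == word[k]:
                j -= 1
                k -= 1
            if k < 0:
                count += 1
        i += skip.get(ch, m)                 # move on, whether or not a match was made
    return count


if __name__ == "__main__":
    print(skip_search_count("abracadabra abracadabra", "abra"))
```

When the text character under the end of the word does not occur in the word at all, the search jumps a full word length, which is why many text characters are never examined.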
