8/8/2019 Week Text Processing)
http://slidepdf.com/reader/full/week-text-processing 1/34
Text Processing and Pattern Searching
Chapter 6
Text line length adjustment
Given a set of lines of text of arbitrary length, reformat the text
so that no line of more than n characters is printed. Each output
line should hold the maximum number of words that fit in n
characters or fewer, and no word should extend across 2 lines.
Paragraphs should also remain indented.
Tabs are not treated specially (take each one as a single space).
Every end of line is read as a space.
Algorithm
1. Establish the line length limit limit and add one to it to allow for a space.
2. Initialize word and line character counts to zero and end-of-line flag to false.
3. While not end-of-file do
   a) read and store next character
   b) if character is a space then
      b.1) if a new paragraph then
           1.a) move to next line and reset character count for the new line.
      b.2) add current word length to current line length
      b.3) if current word causes line length limit to be exceeded then
           3.a) move to next line and set line length to current word length.
      b.4) write out current word and its trailing space and reinitialize
           character count.
      b.5) turn off end-of-input-line flag
      b.6) if at end-of-input-line then
           6.a) set end-of-input-line flag and move to next input line.
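The steps above can be sketched in Python (used here instead of Pascal only for brevity). This is a minimal sketch, not the slides' implementation: the name `reformat`, the list-of-lines input, and the rule that a line starting with a space opens a new paragraph are all assumptions made for illustration. It works word by word rather than character by character, and a word longer than the limit is simply placed on a line of its own.

```python
def reformat(lines, limit):
    """Greedy refill: pack as many whole words as fit within `limit` columns."""
    out = []        # finished output lines
    current = ""    # output line being built
    for line in lines:
        line = line.replace("\t", " ")          # a tab counts as one space
        starts_paragraph = line.startswith(" ")  # assumed paragraph marker
        for n, word in enumerate(line.split()):
            if starts_paragraph and n == 0:
                # New paragraph: flush the pending line and keep the indent.
                if current:
                    out.append(current)
                indent = line[: len(line) - len(line.lstrip(" "))]
                current = indent + word
            elif not current:
                current = word
            elif len(current) + 1 + len(word) <= limit:
                current += " " + word            # word still fits
            else:
                out.append(current)              # limit exceeded: new line
                current = word
    if current:
        out.append(current)
    return out


if __name__ == "__main__":
    sample = ["  The quick brown", "fox jumps over the", "lazy dog."]
    for row in reformat(sample, 12):
        print(row)
```

Unlike the character-level algorithm, this sketch does not add one to the limit, because it counts the separating space explicitly before each appended word.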
Pascal Implementation
procedure textformat(limit: integer);
var i, linecnt, wordcnt: integer;
    chr, space: char;
    eol: boolean;
    word: array[1..30] of char;
begin
  wordcnt := 0; linecnt := 0; eol := false; space := ' ';
  limit := limit + 1;   { allow for the space that follows each word }
  while not eof(input) do
  begin
    read(chr);
    wordcnt := wordcnt + 1;
    word[wordcnt] := chr;
    if chr = space then
    begin
      { a space at the very start of an input line marks a new paragraph }
      if eol and (wordcnt = 1) then
      begin
        writeln;
        linecnt := 0
      end;
      linecnt := linecnt + wordcnt;
      if linecnt > limit then
      begin
        writeln;
        linecnt := wordcnt
      end;
      for i := 1 to wordcnt do
        write(word[i]);
      wordcnt := 0;
      eol := false;
      if eoln(input) then
      begin
        eol := true;
        readln
      end
    end
  end;
  writeln
end;
Left and Right Justification of Text
Design and implement a procedure that will left and
right justify text in a way that avoids splitting words
and leaves paragraphs indented. An attempt should
also be made to distribute the additional blanks as
evenly as possible in the justified line.
Fixed line length is achieved by inserting additional spaces
between words.
For any particular line, one of the following holds:
- The line is already the correct length, so no processing is needed.
- The number of extra spaces needed to expand the current
  line to the required length is equal to the number of spaces
  already present in the line. This simply means adding 1 space
  to each existing space.
- The number of spaces to be added is greater than the number of existing spaces.
- The number of spaces to be added is less than the number of existing spaces.
Example:
10 extra spaces must be added to a line that already has 7 spaces.
First add 7 spaces evenly, one to each existing space.
3 are left over. For the leftover spaces:
- if 1 space is left, add it to the middle space.
- if 2 spaces are left, the first is positioned one-third of the
  way across and the second two-thirds of the way across.
- if 3 spaces are left, add them after the 2nd, 4th and 6th words.
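The arithmetic in this example can be checked with a short Python sketch. The function name `spread_extras` and the placement formula for the leftover blanks are assumptions, chosen only because they reproduce the positions listed above (middle, thirds, and after the 2nd, 4th and 6th words); the slides do not give a formula at this point. The sketch assumes the line has at least one gap.

```python
def spread_extras(gaps, extras):
    """Return the width of each inter-word gap after adding `extras` blanks.

    Every gap starts one blank wide. Whole multiples of `gaps` are shared
    out evenly; the remainder is placed at evenly spaced gap positions.
    """
    base, remainder = divmod(extras, gaps)
    widths = [1 + base] * gaps
    for k in range(1, remainder + 1):
        # 1-based gap index of the k-th leftover blank (assumed formula)
        pos = round(k * (gaps + 1) / (remainder + 1))
        widths[pos - 1] += 1
    return widths


if __name__ == "__main__":
    print(spread_extras(7, 10))   # [2, 3, 2, 3, 2, 3, 2]
    print(spread_extras(7, 1))    # [1, 1, 1, 2, 1, 1, 1]
```

With 7 gaps and 10 extra blanks, each gap first grows to width 2, and the 3 leftover blanks land in gaps 2, 4 and 6, so the widths sum to 17 (7 original plus 10 added).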
Algorithm
1. Establish line to be justified, its current length and justification length.
2. Include test to see if it can be justified.
3. Initialize space count and alphabetic start of line.
4. While current character is a space do
   a) shift to next character
   b) increment alphabetic start of line
   c) write out a space
5. For each position from alphabetic start to end of line do
   a) if current character is a space then
      a.1) increment space count
      a.2) set current position in space count table to 1
6. Remove any spaces from end of line.
7. Determine the extra spaces to be added from new and old
   line lengths.
8. While still extra spaces to add and possible to do so do
   a) compute current template increment from space count
      and extra spaces count
   b) if increment is 0 then set it to 1, since there are more extras than spaces
   c) if extra spaces > space count then
      c.1) determine space block using extra spaces and space count
      else
      c.1`) set space block size to 1
   d) determine starting position for template
   e) while not end-of-line and still spaces to add do
      e.1) add space block to current template position
      e.2) move to next position in template
      e.3) decrement extra space count by space block size
9. For each position from start to end-of-line do
   a) if next character is a space then
      a.1) move to next position in space count table
      a.2) write out number of spaces as per space count table
      else
      a.1`) write out current character
10. Finish off with an end-of-line.
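Steps 1 to 10 can be followed in a compact Python sketch. It is a hedged port of the template idea, not a definitive implementation: the name `justify`, returning a string instead of writing it, and raising `ValueError` for an over-long line are assumptions. Rounding is done as `int(x + 0.5)` so that halves round up; this is a choice made for the sketch.

```python
def justify(line, newlen):
    """Left- and right-justify `line` to `newlen` columns, keeping its indent."""
    text = line.rstrip(" ")                    # step 6: drop trailing spaces
    if len(text) > newlen:
        raise ValueError("line too long")      # step 2: cannot be justified
    body = text.lstrip(" ")
    indent = text[: len(text) - len(body)]     # step 4: paragraph indentation
    template = [1 for ch in body if ch == " "]  # step 5: one entry per space
    nspaces = len(template)
    exspace = newlen - len(text)               # step 7
    while exspace > 0 and nspaces > 0:         # step 8
        delta = max(1, int(nspaces / exspace + 0.5))
        block = exspace // nspaces if exspace > nspaces else 1
        nxt = (delta + 1) // 2                 # 1-based starting position
        while nxt <= nspaces and exspace > 0:
            template[nxt - 1] += block
            nxt += delta
            exspace -= block
    widths = iter(template)                    # step 9: emit widened spaces
    return indent + "".join(
        " " * next(widths) if ch == " " else ch for ch in body
    )


if __name__ == "__main__":
    print(repr(justify("the cat sat on the mat", 26)))
```

A line with no interior spaces (a single word) is returned unchanged, because there is nowhere to insert the extra blanks.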
Pascal Implementation
procedure justify(line: nchars; oldlen, newlen: integer);
const tsize = 40;
var delta, exspace, j, ispace, nspaces, next, pos, st, spaceblock: integer;
    space: char;
    template: array[1..tsize] of integer;
begin
  if oldlen > newlen then writeln('line too long')
  else begin
    space := ' '; st := 1;
    { copy the paragraph indentation straight to the output }
    while (line[st] = space) and (st <= oldlen) do
    begin
      st := st + 1; write(space)
    end;
    nspaces := 0;
    { drop any trailing spaces }
    if st <= oldlen then
      while line[oldlen] = space do oldlen := oldlen - 1;
    { give every interior space an initial width of 1 }
    for pos := st to oldlen do
      if line[pos] = space then
      begin
        nspaces := nspaces + 1;
        template[nspaces] := 1
      end;
    exspace := newlen - oldlen;
    while (exspace > 0) and (nspaces > 0) do
    begin
      delta := round(nspaces / exspace);
      if delta = 0 then delta := 1;
      if exspace > nspaces then
        spaceblock := exspace div nspaces
      else
        spaceblock := 1;
      next := (delta + 1) div 2;
      while (next <= nspaces) and (exspace > 0) do
      begin
        template[next] := template[next] + spaceblock;
        next := next + delta;
        exspace := exspace - spaceblock
      end
    end;
    ispace := 0;
    { write the line, widening each space as recorded in the template }
    for pos := st to oldlen do
      if line[pos] = space then
      begin
        ispace := ispace + 1;
        for j := 1 to template[ispace] do
          write(space)
      end
      else
        write(line[pos]);
    writeln
  end
end;
Keyword Searching in Text
Count the number of times a particular word occurs in a given text.
Algorithm
1. Establish the search word and its length wlength.
2. Initialize the match count nmatches, set the preceding character
   and set the pointer i for the word array to 1.
3. While not at end-of-file do
   a) while not end-of-line do
      a.1) read next character
      a.2) if current text character chr matches ith character in word
           then
           2.a) extend partial match i by 1
           2.b) if a word-pattern match then
                b.1) read next character post
                b.2) if preceding and following characters are not alphabetic then
                     2.a) update match count nmatches
                b.3) reinitialize pointer i to word array
                b.4) save following character post as preceding character
           else
           2.a`) save current text character as preceding character for
                 match
           2.b`) reset word array pointer i to first position
   b) read past end-of-line.
4. Return word-match count nmatches.
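The whole-word test in steps b.1 to b.4 is easier to see in a small Python sketch. The function name `count_word`, working on an in-memory string instead of an input file, and treating only lowercase a to z as alphabetic are assumptions for illustration. The sketch also assumes the search word is non-empty and consists of letters.

```python
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")


def count_word(text, word):
    """Count occurrences of `word` bounded by non-alphabetic characters."""
    nmatches = 0
    i = 0          # number of characters of `word` matched so far
    pre = " "      # character that preceded the current partial match
    pos = 0        # index of the next unread text character
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == word[i]:
            i += 1
            if i == len(word):
                # Complete pattern match: look at the following character
                # (a space stands in for end of text).
                post = text[pos] if pos < len(text) else " "
                pos += 1
                if pre not in ALPHABET and post not in ALPHABET:
                    nmatches += 1
                i = 0
                pre = post
        else:
            pre = ch
            i = 0
    return nmatches


if __name__ == "__main__":
    print(count_word("the cat and the other theme, the", "the"))  # 3
```

The occurrences inside "other" and "theme" are rejected because a letter precedes or follows the pattern.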
Pascal Implementation
procedure wordsearch(word: nchars; wlength: integer; var nmatches: integer);
type letters = 'a'..'z';
var i: integer;
    chr, pre, post: char;
    alphabet: set of letters;
begin
  alphabet := ['a'..'z'];
  pre := ' '; i := 1;
  nmatches := 0;
  while not eof(input) do
  begin
    while not eoln(input) do
    begin
      read(chr);
      if chr = word[i] then
      begin
        i := i + 1;
        if i > wlength then
        begin
          { complete pattern match: check that it is a whole word }
          read(post);
          if (not (pre in alphabet)) and
             (not (post in alphabet)) then
            nmatches := nmatches + 1;
          i := 1;
          pre := post
        end
      end
      else
      begin
        pre := chr;
        i := 1
      end
    end;
    readln
  end
end;
Text Line Editing
Design and implement an algorithm that will search a
line of text for a particular pattern or substring.
Should the pattern be found, it is to be replaced by
another given pattern.
the two wrongs in this line are wrong -- original line
the two rights in this line are right -- edited line
Algorithm
1. Establish the text line, the search pattern and the replacement
   pattern, and their associated lengths.
2. Set initial values for the position in the old text, the new text
   and the search pattern.
3. While all pattern positions in the text have not been
   examined do
   a) if current text and pattern characters match then
      a.1) extend indices to next pattern/text character pair
      a.2) if a complete match then
           2.a) copy new pattern into current position in edited line
           2.b) move past old pattern in text
           2.c) reset pointer for search pattern
   else
      a`.1) copy current text character to next position in edited text
      a`.2) reset search pattern pointer
      a`.3) move pattern to next text position
4. Copy the leftover characters in the original text line.
5. Return the edited line of text.
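A minimal Python sketch of the five steps, assuming a non-empty search pattern. The name `text_edit` and the use of a list of output pieces are choices made for illustration. In practice Python's built-in `str.replace` does the same job; the loop is spelled out only to mirror the algorithm.

```python
def text_edit(text, pattern, new_pattern):
    """Replace every occurrence of `pattern` in `text` by `new_pattern`."""
    out = []   # pieces of the edited line
    i = 0      # start of the current pattern alignment in text
    j = 0      # number of pattern characters matched at this alignment
    while i <= len(text) - len(pattern):
        if text[i + j] == pattern[j]:
            j += 1
            if j == len(pattern):
                out.append(new_pattern)   # complete match: copy new pattern
                i += len(pattern)         # move past old pattern in text
                j = 0
        else:
            out.append(text[i])           # copy one text character
            i += 1                        # move pattern to next text position
            j = 0
    out.append(text[i:])                  # copy the leftover characters
    return "".join(out)


if __name__ == "__main__":
    print(text_edit("the two wrongs in this line are wrong", "wrong", "right"))
    # the two rights in this line are right
```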
Pascal Implementation
procedure textedit(var text, newtext: nchars; var pattern, newpattern: nchars;
                   var newtextlen: integer; textlen, patlen, newpatlen: integer);
var i, j, k, l: integer;
begin
  i := 1; j := 1; k := 0;
  while i <= textlen - patlen + 1 do
  begin
    if text[i+j-1] = pattern[j] then
    begin
      j := j + 1;
      if j > patlen then
      begin
        { complete match: copy the replacement and skip the old pattern }
        for l := 1 to newpatlen do
        begin
          k := k + 1;
          newtext[k] := newpattern[l]
        end;
        i := i + patlen; j := 1
      end
    end
    else
    begin
      k := k + 1;
      newtext[k] := text[i];
      i := i + 1; j := 1
    end
  end;
  { copy the leftover characters }
  while i <= textlen do
  begin
    k := k + 1; newtext[k] := text[i];
    i := i + 1
  end;
  newtextlen := k
end;
Linear Pattern Search
Design and implement a pattern searching algorithm
with a performance that is linearly dependent on the
length of the string or text being searched. A count
should be made of the number of times the search
pattern occurs in the string.
Algorithm Description
Partial-match table setup algorithm
Linear pattern searching algorithm
Procedure for recovering from mismatches and
complete matches.
Partial-match table setup algorithm
1. Establish the search pattern.
2. Set initial displacement between the pattern and itself to one.
3. Initialize the zero and first position in the partial match array to zero.
4. While all positions of pattern relative to itself not considered do
   a) if current pattern and displaced pattern character pairs match then
      a.1) save current degree of partial match
      a.2) move to next position in pattern and displaced pattern
   else
      a`.1) a mismatch so set partial match to zero
      a`.2) reset pointer to start of displaced pattern
      a`.3) move the start of the displaced pattern to the next available
            position
5. Return the partial match table.
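To make the table concrete, here is a Python sketch of one way to build it. Note that this is the standard prefix-function construction, swapped in deliberately: on a mismatch it falls back through the table to a smaller partial match instead of always resetting to the start of the pattern, which matters for patterns such as "abaab". The name `partial_match_table` and the 0-based indexing are assumptions.

```python
def partial_match_table(pattern):
    """recover[k] is the length of the longest proper prefix of pattern[:k]
    that is also a suffix of pattern[:k]; recover[0] is a zero sentinel."""
    recover = [0] * (len(pattern) + 1)
    match = 0       # length of the current partial match
    position = 1    # 0-based index of the character being examined
    while position < len(pattern):
        if pattern[position] == pattern[match]:
            match += 1
            position += 1
            recover[position] = match      # save degree of partial match
        elif match > 0:
            match = recover[match]         # fall back to a smaller match
        else:
            position += 1
            recover[position] = 0          # no partial match here
    return recover


if __name__ == "__main__":
    print(partial_match_table("abaab"))   # [0, 0, 0, 1, 1, 2]
    print(partial_match_table("aaaa"))    # [0, 0, 1, 2, 3]
```

For "abaab" the last entry is 2 because the prefix "ab" reappears as the final two characters.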
Linear pattern searching algorithm
1. Establish the pattern to be searched for and the string in which it is to
be sought together with lengths of the pattern and the string.
2. Set initial values for start of pattern and string and zero the match
count.
3. while all appropriate pattern positions in the string have not been
examined do
a) if current string and pattern characters match then
a.1) extend indices to next pattern/string pair.
a.2) if a complete match then
2.a) update complete match count
2.b) reset recovery position from the partial match table
else
a`.1) reset recovery position from the partial match table
4. return count of the number of complete matches of the pattern in the
string.
Procedure for recovering from
mismatches and complete matches.
1. Establish the partial match table, the current position in the string and
   the position in the pattern.
2. if no smaller partial match then
a) move to next position in string
b) return to start of pattern
else
a`) recover from mismatch or complete match by using table to set new
smaller partial match for current position in string
3. return smaller partial match and current pattern position.
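The search loop and the recovery step can be sketched together in Python. This is a hedged illustration that assumes a non-empty pattern and uses the standard Knuth-Morris-Pratt fallback; the names `build_recover` and `kmp_count` are hypothetical. Overlapping occurrences are counted, because recovery after a complete match also goes through the table.

```python
def build_recover(pattern):
    """Border lengths: recover[k] covers the first k pattern characters."""
    recover = [0] * (len(pattern) + 1)
    match = 0
    for position in range(1, len(pattern)):
        while match > 0 and pattern[position] != pattern[match]:
            match = recover[match]
        if pattern[position] == pattern[match]:
            match += 1
        recover[position + 1] = match
    return recover


def kmp_count(pattern, string):
    """Count (possibly overlapping) occurrences of `pattern` in `string`."""
    recover = build_recover(pattern)
    nmatches = 0
    match = 0       # pattern characters matched so far
    position = 0    # current position in the string
    while position < len(string):
        if pattern[match] == string[position]:
            match += 1
            position += 1
            if match == len(pattern):
                nmatches += 1
                match = recover[match]    # recover from a complete match
        elif match > 0:
            match = recover[match]        # smaller partial match, same position
        else:
            position += 1                 # nothing matched: next string position
    return nmatches


if __name__ == "__main__":
    print(kmp_count("abaab", "abaabaab"))   # 2
    print(kmp_count("ab", "aab"))           # 1
```

The string position never moves backwards, which is what keeps the running time linear in the length of the string.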
Pascal Implementation
procedure kmpsearch(pattern: nchars; string: nchars; var recover: ntchars;
                    var nmatches: integer; patlength, slength: integer);
{ ntchars is taken to be an integer array type indexed from 0, declared elsewhere }
var position, match: integer;

procedure restart(recover: ntchars; var match, position: integer);
begin
  { nothing matched yet: move on in the string;
    otherwise fall back to a smaller partial match at the same position }
  if match = 1 then
    position := position + 1
  else
    match := recover[match-1] + 1
end;

procedure partialmatch(pattern: nchars; var recover: ntchars; patlength: integer);
var position, match: integer;
begin
  position := 2; match := 1;
  recover[0] := 0; recover[1] := 0;
  while position <= patlength do
  begin
    if pattern[position] = pattern[match] then
    begin
      recover[position] := match;
      match := match + 1;
      position := position + 1
    end
    else if match > 1 then
      { retry this position against a smaller partial match }
      match := recover[match-1] + 1
    else
    begin
      recover[position] := 0;
      position := position + 1
    end
  end
end;
begin
  partialmatch(pattern, recover, patlength);   { build the table before searching }
  nmatches := 0;
  position := 1; match := 1;
  while position <= slength do
  begin
    if pattern[match] = string[position] then
    begin
      match := match + 1;
      position := position + 1;
      if match > patlength then
      begin
        nmatches := nmatches + 1;
        restart(recover, match, position)
      end
    end
    else
      restart(recover, match, position)
  end
end;
SubLinear Pattern Search
Design and implement an algorithm that will efficiently
search given text for a particular keyword or pattern
and record the number of times the keyword or
pattern is found.
Algorithm
1. Establish the word and text to be searched.
2. Set up the skip table.
3. Set keyword match count to zero.
4. Set character position i to keyword length.
5. While current character position i <= text length do
   a) get numeric value nxt of current character at position i
   b) index into skip table at position nxt
   c) if skip value for current character > 0 then
      c.1) increase current position by skip value
   else
      c`.1) backwards-match text and word
      c`.2) if match made update match count
      c`.3) recover from mismatch
6. Return match count.
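The skip-table idea can be tried out with the following Python sketch, which assumes a non-empty word. It is an illustration in the spirit of the steps above, not a transcription: the name `quick_search` is hypothetical, a dictionary stands in for the fixed-size skip array, and the special handling of the word's last character is done with an explicit comparison instead of a negated table entry.

```python
def quick_search(text, word):
    """Count occurrences of `word` in `text` using a last-character skip table."""
    wlength = len(word)
    # Distance from each character's last occurrence (excluding the final
    # position) to the end of the word; unseen characters skip a full word.
    skip = {}
    for j, ch in enumerate(word[:-1]):
        skip[ch] = wlength - 1 - j
    last = word[-1]
    last_shift = skip.get(last, wlength)   # shift used after a backwards match
    nmatches = 0
    i = wlength - 1                        # text index under the word's last char
    while i < len(text):
        ch = text[i]
        if ch != last:
            i += skip.get(ch, wlength)     # cannot end a match here: skip ahead
        else:
            # Last characters agree: compare the rest of the word.
            if text[i - wlength + 1 : i] == word[:-1]:
                nmatches += 1
            i += last_shift                # recover and move on
    return nmatches


if __name__ == "__main__":
    print(quick_search("the cat sat on the mat", "at"))   # 3
```

Most text characters are never examined when the word is long and shares few characters with the text, which is where the sublinear behaviour comes from.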
Pascal Implementation
procedure quicksearch(text, word: tc; tlength, wlength: integer;
                      var nmatches: integer);
const asize = 127;
type ascii = array[0..127] of integer;
var i, j, k, nxt: integer;
    match: boolean;
    skip: ascii;
begin
  { setskips must be declared before this procedure }
  setskips(word, skip, wlength, asize);
  nmatches := 0; i := wlength;
  while i <= tlength do
  begin
    nxt := ord(text[i]);
    if skip[nxt] > 0 then
      i := i + skip[nxt]
    else
    begin
      { text[i] equals the last character of word: match backwards }
      j := i - 1;
      k := wlength - 1;
      match := true;
      while (k > 0) and match do
      begin
        if text[j] = word[k] then
        begin
          j := j - 1;
          k := k - 1
        end
        else
          match := false
      end;
      if match then nmatches := nmatches + 1;
      i := i - skip[nxt]   { skip[nxt] is negative here, so i still advances }
    end
  end
end;
procedure setskips(word: tc; var skip: ascii; wlength, asize: integer);
var i, j, p: integer;
begin
  for i := 0 to asize do
    skip[i] := wlength;
  for j := 1 to wlength - 1 do
  begin
    p := ord(word[j]);
    skip[p] := wlength - j
  end;
  { mark the last character of word with a negative skip value }
  p := ord(word[wlength]);
  skip[p] := -skip[p]
end;