Fall 20151 Week 4 CSCI-141 Scott C. Johnson. Computers can process text as well as numbers ◦ Example: a news agency might want to find all the articles

Strings and FilesFall 20151 Week 4

CSCI-141Scott C. Johnson

Computers can process text as well as numbers◦ Example: a news agency might want to find all

the articles on Hurricane Katrina as part of the tenth anniversary of this disaster. They would have to search for the words “hurricane and “Katrina”

String

How do we represent text such as:◦ Words◦ Sentences◦ Characters

In programming we call these sequences Strings.◦ A sequence is a collection or composition of elements in a

specific order◦ For strings, the elements are characters

This includes: Punctuation Spaces Numeric characters

◦ The order is a spelling of a word or structure of a sentence. Not always true.

String

A string can be:◦ Empty◦ Non-empty

Must start with a character and be followed by a string

Definition 1 A string is one of the following:◦ An empty string◦ A non-empty string, which has the following parts:

A head, which is a single character, followed by, A tail, which is a string

Strings

When processing strings we must be able to check for:◦ Empty string

def strIsEmpty(str) ◦ The head

def strHead(str)◦ The tail

def strTail(str)◦ To construct strings from other strings

def strConcat(str1, str2)◦ We will implemeant these later….

String Opertations

A Python string is:◦ A sequence of characters◦ The sequence can be explicitly written between

two quotation marks Python allows single or double quotes

◦ Examples: ‘Hello World’ “Hello World” ‘’ (the empty string is a valid string) ‘a’ “abc”

Python Strings

Like numbers, string can be assigned to variables◦ str = ‘Hello World’

You can asscess parts of strings via indexing◦ str[n] means give me the nth-1 character of the

string str◦ str[0] == ‘H’◦ str[1] == ‘e’◦ str[2] == ‘l’◦ str[5] == ‘ ‘

Python Strings

Another way to access parts of a string is via slicing◦ str[m:n] means give me the part of the string from

character m up to but not including n◦ Both m and n are optional

If m is omitted then it starts from the beginning of the string

If n is omitted it ends at the end of the string◦ Examples:

Str = ‘Hello World’ Str[1:4] == ‘ell’ Str[:5] == ‘Hello’ Str[1:] == ‘ello World’

Python Strings

You can concatenate string, or put two string together◦ The plus-sign (+)

Examples◦ ‘Hello’ + ‘World’ == ‘Hello World’◦ ‘a’ + ‘ab’ == ‘aab’◦ ‘ab’ + ‘c’ == ‘abc’◦ ‘a’ + ‘b’ + ‘c’ == ‘abc’

Python Strings

Python strings are immutable◦ Means parts of strings cannot be changed using the

assignment operator◦ If we assign a new value to a string it replaces the old value

Basically a new string Example:

◦ Str = ‘abc’◦ Str[0] == ‘a’◦ Str[0] = ‘z’

We get: ◦ “Traceback (most recent call last):

File "<stdin >", line 1, in <module >TypeError: ’str’ object does not support item assignment“

Python Strings

def strIsEmpty(str): if str == ‘’: return True else: return False

def strHead(str): return str[:1]

def strTail(str): return str[1:]

def strConcat(str1, str2): return str1 + str2

Python String Operations

Computing length of strings◦ Python has a function to do this◦ But we will write our own version so we can:

Learn about strings And how to process them

Think about this: lengthRec(‘abc’) == 3


Lets break down string length as a recursive function◦ ‘’: the empty string, length is 0◦ ‘a’: string with one character, length = 1

Head length = 1 tail length is empty string, length = 0

◦ ‘ab’: string with two character, length = 2 Head length = 1 tail is string with one character, length = 1

◦ ‘abc’: string with three characters. Length = 3 Head length = 1 tail is a string with two characters, length = 2

Notice the pattern????


The pattern leads us to: def lengthRec(str): if str == ‘’: return 0 else: return 1 + lengthRec(strTail(str))


Computing the reversal of a string◦ reverseRec(‘abc’) == ‘cba’

Lets solve for this like any other recursive function


String reversal cases:◦ ‘’: emptystring reversed is the empty string◦ ‘a’: a single character reversed is the same string◦ ‘ab’: two character string

‘b’ + ‘a’ == ‘ba’ strTail(‘ab’) + strHead(‘ab’)

◦ ‘abc’: three character string ‘c’ + ‘b’ + ‘a’ == ‘cba’ strTail(strTail(‘abc’)) + strHead(strTail(‘abc’)) + strHead(‘abc’) strTail(‘bc’) + strHead(‘bc’) + strHead(‘abc’)

Reversal of string ‘bc’ + strHead(‘abc’)


From this we get: def reverseRec(str): if str == ‘’: return ‘’ else: return reverseRec(strTail(str) + strHead(str)


Substitution Traces

Substitution Traces

Accumulative Length◦ Recall the recursive form we did earlier:

def lengthRec(str): if str == ‘’: return 0 else: return 1 + lengthRec(strTail(str))

◦ How can we change this to use an accumulator variable?? Notice we are returning zero for the empty case and

1 + the recursive call for the other case

Accumulative Recursion

Accumulative Length◦ We can add the accumulator variable and add 1

for the recursive call and return it for the empty string: def lengthAccum(str, ac):

if str == ‘’: return ac else: return lengthAccum(strTail(str), ac + 1)

Basically: if we have a head, add one to the acummulator and

check the tail Else return the accumulator variable since we have

nothing else to count


Accumulative Reverse◦ Recall the recursive form we did earlier:

def reverseRec(str): if str == ‘’: return ‘’ else: return reverseRec(strTail(str) + strHead(str)

◦ How can we change this to use an accumulator variable?? Notice we are returning the empty string for the

empty case and the recursive call + the strHead for the other case


Accumulative Reverse◦ We can add the accumulator variable and add the

strHead to the accumulator for each recursive call: def reverseAccum(str, ac):

if str == ‘’: return ac else: return reverseAccum(strTail(str), strHead(str) +ac)


We often can replace recursion with iteration◦ This requires the use of a new type of Python

statement◦ The for loop

String Iteration

for loop example◦ for ch in ‘abc’:

print(ch)◦ This will print:

a b c

String Iteration

With the for loop when can convert our recursive string operations to iterative forms◦ Recall the accumulative form:

def lengthAccum(str, ac): if str == ‘’: return ac else: return lengthAccum(strTail(str), ac + 1)

◦basically we ran the function over and over again adding one to ac until we hit the empty string…

How could we make that into a for loop???

String Iteration

With a for loop we can avoid recursion def lenghtIter(str): ac = 0 for ch in str: ac = 1 + ac return ac

String Iteration

This can work for reverse too!◦ def reverseAccum(str, ac):

if str == ‘’: return ac else: return reverseAccum(strTail(str), strHead(str) +ac)

Becomes:◦ def reverseIter(str):

ac = ‘’ for ch in str: ac = hd + ac return ac

String Iteration

Some times we want to access parts of a string by index◦ This means iterating over the range of values 0,

1, 2, …, len(str) -1 len(str) is a built in Python function for string length

◦ To do this we have a special for loop in Python for i in range(0, len(str)) This says for all character in str starting at index 0 to

the last index of the string, do x It does not have to be the whole string for I in range(2, 5)… this will do all i’s 2, 3, 4

Index Values and Range

example


Example:


Say we do not want to type in a long string every time…◦ For instance we want to find and remove all of the

instances of a word in a report… How can we do that without using an input

statement and entering the entire text manually?

Files

Python can read files!◦ Lets look at a basic function to hide a word in a

string◦ def hide(textFileString, hiddenWord):

for currentWord in textFileString: if currentWord == hiddenWord: print(‘---’) else: print currentWord

Files

How do we do this from a text file and not a string?◦ To make this problem simple we will assume only one

word per line in the file Do the file reads, the spaces are shown on purpose as _:

word1__word2___word3__word4_word5word6

word7

Files

How can we read this file?◦ It is actually pretty easy in Python◦ for line in open(‘text1.txt’):

print(line) ◦ This give us:

word1◦

__word2

◦ ___word3

◦ __word4

◦ _word5

◦ word6

◦ word7

Files

Notice the spaces are still there and it

appears to have more space between lines!

The extra space between lines is due to:◦ The print function adds a new line◦ The original new line from the file is still in the string◦ If we were to make a single concatenated string we would see the

original file contents◦ str = ‘’

for line in open(‘text1.txt’): str = str + lineprint(str) word1

__word2___word3__word4_word5word6

word7

Files

We can make printing better◦ print(str, end=‘’)

Print will not generate newlines◦ We still have an issue for our problem

Newlines are still there from the file◦ for line in open(‘text1.txt’):

print(line == ‘word1’)◦ We would see false for all line

Even thought word1 looks like it exists Due to the new line from the file!

Files

We can use a Python feature called strip◦ This removes all whitespace from a string

Whitespace: Newlines Spaces

◦ We call it a bit different that other string functions str.strip() It returns the ‘stripped’ string

Files

Using strip we get for line in open(‘text1.txt’):

print(line.strip() == ‘word1’) Which results in:

True False False False False False False False

Files

Using all these ideas we can make the hide and a helper function

Files

We orginally made this function to hide a word…◦ But it can be used to find a word too◦ We look for, or search for, the word to replace it◦ We look for all occurrences of the word…

Sometimes we only want to find the first time a word happens

Files

Linear Search◦ The process of visiting each element in a

sequence in order stopping when either: we find what we are looking for We have visited every element, and not found a

match◦ Often we wish to characterize how long an

algorithm takes Measuring in seconds often depends on the

hardware, operating system, etc◦ We want to avoid such details

How do we do this?

Files

We avoid this by focusing on the size of the input, N…◦ And characterize the time spend as a function of

N◦ This function is called the time complexity

One way to measure performance is to count the number of operations or statements the code makes

Files

For the hide function we:◦ Strip the current word◦ Compare it to the hidden word◦ And then prints the results

Since this is a loop, this occurs for each word◦ Even thought the operations can take different

times individually, as the number of words grow this time difference are very small They are considered a constant amount

Files

For a linear search we can see that different size of N can lead to different behaviors and times…◦ Say the element we are looking for is the first

element… it’s very fast….◦ What if it is the last element?

We must then look at every element If n is 10 then the slow case is no problem.. What if it was 10 billion?

Files

Typically we are interested in how bad it can get…◦ Or the Worst-case analysis

Consider the worst-case analysis of the linear search…◦ For each element we spend some constant time◦ Plus some fixed time to startup and end the

search◦ If processing of an element takes constant time k

Then search all N elements should take k * N Similarly the startup and end time is some constant c So the time to run the linear search is (k * N) + c

Files

So the time to run the linear search is (k * N) + c◦ What are k and c?

Most of the time we simply do not care….◦ It is often good enough to know that the time

complexity is a linear function with a non-zero slope! The ,mathematical way to ignore this is to say the

time complexity of linear search is O(N) It is pronounced “Order N” This is known as “Big-O” notation

Files

“Big-O” notation◦ Makes it easy to compare algorithms…◦ Constant time, like comparing two characters,

O(1)◦ There are some algorithms that are O(N2) and

O(N3)◦ We prefer O(1) to O(N), O(N) to O(N2)…◦ There are possibly time complexities in between

these…

Files

O(N2) example:def counter(n): for i in range(0,n): for j in range(0, n): print(i * j)

O(N3) example:def counter(n): for i in range(0,n): for j in range(0, n): for k in range(0,n): print(i * j * k)

Files

Documents

Fall 20151 Week 4 CSCI-141 Scott C. Johnson. Computers can process text as well as numbers ◦ Example: a news agency might want to find all the articles