Copyright © 2007-2009 Curt Hill

C Style Strings

A Specialized Form of Array

Introduction

• The discussion of strings is short on new syntax and long on new functions

• This makes it somewhat easier, I hope

Strings are Different

• Most of the things we have dealt with are machine constructs:
  – int, double, char, for, while, functions
• They map very nicely to things that machines handle very well
• However, the human-machine interface always has to deal with the notion that people read lines of text
• cout can handle this with \n, but cin has problems: to it, blanks, \n and \t are all just whitespace, whereas to us blanks and newlines mean very different things

Storage

• We also have the problem of storage of strings

• Strings are inherently variable length

• When we read in a line of text we may get any number of actual characters

What do we do?

• Historically, there are several main approaches to handling strings in memory or as file records
• Fixed Length Records
• Variable Length Records
  – Delimiter
  – Descriptor

Fixed length records

• Each item has some positive constant length
  – Originally, the most common was the 80 character punch card image
• Items cannot be longer
• Shorter items are padded on the right with some character, usually a blank
• This is the FORTRAN approach

Variable length with delimiter

• Delimiter
• There is some special character that says: I am the end of the string or line
• Usually a control character
• This is less general, since the delimiter itself cannot appear inside the data
  – Consider object files where any character can legally occur
  – Usually there is an escape sequence

Variable length with descriptor

• Descriptor is an explicit length
• There is an integer with the string which says how long it is
• Usually immediately before the first character, usually one, two or four bytes
• One byte: string is at most 255 characters
• Two bytes: string is at most 65,535 (64K) characters
• Four bytes: string is at most about 4G characters
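
A descriptor-style string can be sketched in C++ roughly as follows. The names ShortString and assign are made up for this illustration, and a one byte descriptor is assumed, so this is only one possible layout.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical descriptor-style string: a one byte length, then the characters.
struct ShortString {
    unsigned char length;   // the descriptor, 0..255
    char text[255];         // the characters; no terminator is stored
};

// Store src in s, cutting it off at 255 characters because
// a one byte descriptor cannot describe anything longer.
void assign(ShortString &s, const char src[]) {
    std::size_t n = std::strlen(src);
    if (n > 255) n = 255;
    s.length = static_cast<unsigned char>(n);
    std::memcpy(s.text, src, n);
}
```

The length is available without searching, at the cost of a fixed upper limit set by the size of the descriptor.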

Storage

• This storage problem is rather vexing from a machine view

• Variable lengths are difficult to allocate on the stack

• We must know the length to access what follows them

• Thus we must allocate a maximum and waste what we do not use

Examples:

• IBM Mainframe systems employ the first two in file systems
• Fixed length files: each record is always the same length
• Card files
  – Tape or disk as well
  – This is also possible in C++ with just ordinary arrays of characters
  – Standard Pascal, FORTRAN and COBOL also use this, among others

Examples (continued)

• IBM Mainframe systems also employed a variable length record, among others
  – CBuilder AnsiStrings keep a length as well, among others
  – Allocate the maximum number of bytes and then maintain a length indicator
  – File systems do not need to allocate the maximum, only the length actually used

Delimited variable length

• UNIX, DOS and Windows use this for text files
• CR, LF or CR/LF is the line delimiter
  – UNIX and Linux use linefeed
  – Windows and DOS use CR/LF
  – Each file occupies a whole number of allocation units (sectors or blocks), and the end of the file may also be marked with a character or character string
• C/C++ uses this for strings
  – Null character is the delimiter

Delimiter Again

• Allocate the maximum amount of memory needed for the string
• Use a byte with a binary zero to mark the end
• This is '\0'
• Nothing after the \0 is considered valid contents

Discussion

• All of these approaches are a concession to how people do things
• They are not neat and clean compared to other kinds of things, such as integers
• Mostly because of the variable length
• The delimiter approach resembles the unfull array technique
  – Built in to string libraries

Strings usage

• We have already seen strings
• A constant string is enclosed in double quotes, whereas a character constant is one character inside apostrophes
• The string "hi" is how many characters?
  – Two for hi and one for \0 = 3

Null character

• A byte with a value of zero
  – Not the zero digit
• Automatically provided by a double quoted string
• May also be supplied by escape sequence: '\0'
• Initialization:
    char str[3] = "Hi";
    char c = '\0';
    char d = 0;

Common mistake:

• char x[5] = "Hello";
• We do not have room for \0
  – This is a compile error in standard C++ (C accepts it and stores no \0)
  – Not detected in CBuilder6
• The absence of the \0 can cause runtime errors that will be noted later
• The \0 is always appended to any string in quotes

Declaration

• Declaration of a string is just the same as declaring an array of characters
• Recall that an array of characters can be handled as a string or in any other way consistent with an array of type char
• Either of these declares a nine character array:
    char str[9] = "Hi there";
    char str[] = "Hi there";

Declaration Again

• char str[10] = "Hi there";
  – Declares str as an array of 10 characters
  – Initializes the first eight with the letters and the blank
  – Ninth with \0
  – Tenth is also set to \0, since leftover elements are zero filled
• The only real difference between this and any other array is the shorthand for strings:
    char str[10] = {'H','i',' ','t','h','e','r','e','\0'};
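
The sizes involved in these declarations can be checked with sizeof. This is a small sketch; the function names are invented for the illustration.

```cpp
#include <cstddef>

// Size of the array the compiler builds for the literal "hi": 'h', 'i', '\0'.
std::size_t literal_size() {
    return sizeof("hi");
}

// With [] the compiler counts the characters and adds one for the \0.
std::size_t inferred_size() {
    char str[] = "Hi there";
    return sizeof(str);
}

// With an explicit size larger than needed, the unused tail is zero filled.
bool tail_is_zero() {
    char str[10] = "Hi there";
    return str[8] == '\0' && str[9] == '\0';
}
```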

String usage

• Most other differences between a string and any other array are found in the standard functions
• First we will consider the fstreams
• Second, two libraries

cout

• cout (and all ofstreams) may handle a string as we have seen
• However, since it does not know the length, it must search for the Null character to terminate
• If there is no Null, it treats the string as longer than it actually is, until it finds a coincidental Null in memory
• The Null is usually common in memory, for example the three high-order bytes of a small positive four byte int
• Nevertheless, it is easy to get tens, hundreds or thousands of extra bytes displayed

cin

• A different story
• In cin, whitespace is still skipped
• So if you read in the string
  – Hello there
  – you will get 6 characters: the Hello plus Null
• The leading whitespace is skipped and the string is terminated at the blank between the o and t
• Solutions:
  – There are three versions of cin.get that will be helpful
    • A no parameter version
    • A one parameter version
    • A two or three parameter version
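
The whitespace behavior of >> can be tried without a keyboard by reading from a string stream. In this sketch the function first_word and the buffer size 32 are assumptions chosen for the illustration; an istringstream stands in for cin.

```cpp
#include <sstream>
#include <string>

// Read one word with >>, the way cin >> would read it from the keyboard.
std::string first_word(const std::string &text) {
    std::istringstream in(text);
    char word[32] = "";        // starts out as an empty string
    in.width(sizeof(word));    // limit >> so it cannot overrun word
    in >> word;                // skips leading whitespace, stops at the next whitespace
    return word;
}
```

Given "  Hello there", only Hello is stored; the blank and "there" are left in the stream.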

Get

• These are methods of istreams, so cin and ifstreams both have them
• int get()
  – Gets one character and returns it, or EOF at the end of input
  – Does not skip whitespace
• istream & get(char &)
  – Gets one character without whitespace skipping
  – Returns the stream itself, which we will mostly ignore, but which can be used to indicate success
  – Used as a condition, the stream is false when the read was unsuccessful

Examples

• Read all the characters:
    char ch[10000];
    int i = 0, c;
    while(i < 10000 && (c = cin.get()) != EOF)
      ch[i++] = c;
• Also:
    char ch[10000];
    int i = 0;
    while(i < 10000 && cin.get(ch[i]))
      i++;

String Get

• istream & get(char p[], int n, char delim = '\n')
• The initial argument is the string to read the characters into
• n is the maximum number of characters to obtain
• Since this form always terminates strings with a \0, the maximum number of input characters is only n-1
• Hence cin.get(st,1) only loads the \0

String Get

• The third parameter is a terminator character
• This can be anything, though the default is an excellent choice
• The get will read characters and store them in p until one of the following conditions is met:
  – Too many characters
  – Delimiter is found
• When we are done, if the delimiter was found it will be the next unread character
  – Hence it will never read a delimiter

Getline

• istream & getline(char [], int, char = '\n')
• Essentially the same as the three parameter get, except that it eats the delimiter and does not copy it to the buffer
• This is my favorite

Examples

• Declaration:
    char line[MAX];
• This will read the line but leave the end of line in the input buffer:
    cin.get(line,MAX);
• This will read the line and discard the end of line:
    cin.getline(line,MAX);
• A comma delimited file might be read:
    cin.getline(line,MAX,',');
• There is no good way to read where two or more different delimiters are possible
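
The difference between get and getline shows up in what is left in the stream afterwards. In this sketch an istringstream stands in for cin, MAX is given an assumed value of 80, and the two function names are invented for the illustration.

```cpp
#include <sstream>
#include <string>

const int MAX = 80;   // assumed buffer size for this sketch

// After get, the delimiter is still waiting in the stream.
char next_after_get(const std::string &text) {
    std::istringstream in(text);
    char line[MAX];
    in.get(line, MAX);                      // stops in front of '\n'
    return static_cast<char>(in.peek());    // look at the next unread character
}

// After getline, the delimiter has been consumed.
char next_after_getline(const std::string &text) {
    std::istringstream in(text);
    char line[MAX];
    in.getline(line, MAX);                  // reads and discards '\n'
    return static_cast<char>(in.peek());
}
```

With the input "one\ntwo", the next character after get is the newline, while after getline it is the t of the second line. This is why a loop of get calls stalls on the first newline unless the delimiter is removed by hand.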

String assignment

• Given char a[10],b[10];
• Can we:
  – a = b;
• No
• Can we:
  – a = "Hi there";
• NO
• How then do we do string assignment?
• Like any array manipulation

The Hard Way

• Usually by function call or something involving a for loop
• Like all arrays the following is possible:
    char a[10], b[10];
    for(int i=0;i<10;i++)
      a[i] = b[i];
• Or we can define a function to do the same thing:
    void str_asgn (char target[], const char src[], int size);

str_asgn

void str_asgn (char target[], const char src[], int size){
  // Copy at most size characters, stopping once the \0 has been copied
  for(int i = 0;i<size;i++){
    target[i] = src[i];
    if(target[i] == 0)
      break;
  }
}

Overlapping Arrays

• One of the problems with this function is that overlapping arguments will cause weird results
• For example, given char a[10];
  – str_asgn(&a[1],a,9);
• However, it uses next to no memory
• What actually happens?

Overlap

• Suppose the following array:
    char a[5] = "hi";
• And we call:
    str_asgn(&a[1],a,4);
• Then a[0] is copied to a[1]
  – This is the 'h', which now occupies the first two characters
• Next a[1] is copied to a[2]
  – This is the 'h', which now occupies the first three characters
  – The \0 has been overwritten, so the loop never sees it

At beginning of copy

    index    0   1   2   3   4
    a        h   i  \0  \0  \0
    source points at a[0], target points at a[1]

First Copy

    a        h   h  \0  \0  \0
    Copy source[0] to target[0]

Second Copy

    a        h   h   h  \0  \0
    Copy source[1] to target[1]

Third Copy

    a        h   h   h   h  \0
    Copy source[2] to target[2]

You see the pattern. Handy if this is what you want.
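
The smearing can be reproduced, and compared with memmove, in a few lines. This is a sketch only: str_asgn repeats the forward copy loop from the slides, the two shift functions are invented for the illustration, and each expects an array of at least 6 characters.

```cpp
#include <cstring>

// Same forward copy loop as str_asgn in the slides.
void str_asgn(char target[], const char src[], int size) {
    for (int i = 0; i < size; i++) {
        target[i] = src[i];
        if (target[i] == 0) break;
    }
}

// Overlapping forward copy: the 'h' is smeared across the array.
void shift_with_str_asgn(char a[]) {
    std::strcpy(a, "hi");
    str_asgn(&a[1], a, 4);   // copies a[0..3] onto a[1..4], one character at a time
    a[5] = '\0';             // terminate by hand so the result can be printed
}

// memmove is defined for overlapping operands, so the string really shifts right.
void shift_with_memmove(char a[]) {
    std::strcpy(a, "hi");
    std::memmove(&a[1], a, 3);   // 'h', 'i' and the \0 move together
}
```

The first version leaves "hhhhh" in the array, the second leaves "hhi".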

String operations

• What can we do to an integer (assume int i,j;)?
• Many things
  – Comparison: if(i<j)
  – Arithmetic: i*j-2
  – Assignment: i=j;
• What can we do to two arrays (assume int x[5],y[5])?
  – Next to nothing without resorting to a function
• Should we consider a string an elementary type or a structured type (in this case an array)?

Structured Types

• Clearly C/C++ thinks of strings as arrays, so we can do next to nothing
• We cannot assign one string to another
• It seems like we can do nothing to strings other than write functions that manipulate them or use existing functions that manipulate them
• Fortunately most of the useful functions have already been written

Utility string functions

• The first library to consider is string.h

• Inside this are some utility functions that help us to perform string manipulation

• Some of these we will consider and many others not

strlen

• size_t strlen(const char *source)
  – size_t is an unsigned integer type
• Takes a string as an argument and finds the length of the string
• Not the physical length but the position of the \0 character
• It is the length of the usable string and the subscript of the \0 character
• Extremely handy
• It may run off the end of the array
  – If the \0 is missing it may give a logical length greater than the physical length
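
Physical and logical length can be compared directly with sizeof and strlen. The function names and the 20 character array are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <cstring>

// Physical length: the bytes reserved for the array.
std::size_t physical_length() {
    char name[20] = "Curt Hill";
    return sizeof(name);
}

// Logical length: the characters in front of the \0.
std::size_t logical_length() {
    char name[20] = "Curt Hill";
    return std::strlen(name);
}

// Moving the \0 changes the logical length, not the physical one.
std::size_t length_after_cut() {
    char name[20] = "Curt Hill";
    name[4] = '\0';              // the string is now just "Curt"
    return std::strlen(name);
}
```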

memcpy

• The two mem functions are not string functions but array functions
• void *memcpy(void *s, const void *ct, size_t n)
• Copy n bytes from ct to s
• Return pointer to s
• The operands must not overlap

memmove

• Same as memcpy except works if operands overlap

• Moves (copies really) length characters from source to dest.

• Often folds into one machine language instruction

• Does not care about \0, is guided only by length

Example

• The mem functions can be used for gross array movement of any sort
• For example:
    int a[10], b[10];
    ...
    memcpy(a,b,10*sizeof(int));
  – sizeof is an operator that takes an expression or parenthesized type

Characteristics

• String functions have a number of characteristics making them easier to remember
• They all start with str
  – Usually followed by three or four letters
  – This is descriptive
• The first parameter is usually a string and the most important one
  – Only one to be changed

strcpy

• char * strcpy(char s[], const char ct[])
• Copy ct to s, including the \0
• The return value is the pointer to s
• No overlap is allowed and there had better be a \0

Two Flavors

• Almost all string functions come in two flavors
  – Brave and bold
  – Cautious
• The brave version always believes that a null character will be found
• The cautious version takes an additional integer which is the maximum length
  – Always has an n in the name right after the str

strncpy

• char * strncpy(char s[], const char ct[], size_t n)
• Copy ct to s, including the \0, or at most n characters, whichever comes first
  – If ct has n or more characters, no \0 is stored in s
  – If ct is shorter, the rest of the n characters in s are filled with \0
• The return value is the pointer to s
• No overlap is allowed
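
Because strncpy can leave the target without a \0, a common habit is to add the terminator by hand. The helper safe_copy below is a hypothetical name for this sketch, and it assumes capacity is greater than zero.

```cpp
#include <cstddef>
#include <cstring>

// Copy src into dest, which holds capacity bytes, and always terminate it.
void safe_copy(char dest[], const char src[], std::size_t capacity) {
    std::strncpy(dest, src, capacity - 1);   // may fill dest without writing a \0
    dest[capacity - 1] = '\0';               // so the last byte is set by hand
}
```

With a 6 byte target, "Hello there" is cut down to "Hello" and is still a valid string.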

strcat

• Short for concatenate
• char * strcat(char s[], const char ct[])
• Copy ct to end of s
  – The \0 of s is replaced and the end of the string is supplied from ct
• The return value is the pointer to s
• No overlap is allowed and there had better be a \0

strncat

• char * strncat(char s[], const char ct[], size_t n);
• Copy ct to end of s
  – The \0 of s is replaced and the end of the string is supplied from ct
• Copy at most n characters onto s, then always add a \0
• The new length is the sum of the length of s and the copied characters
• The return value is the pointer to s
• No overlap is allowed
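
Since n limits the characters appended and not the size of the target, the room that is left has to be computed first. The helper safe_append is a hypothetical name for this sketch; it assumes dest already holds a terminated string shorter than capacity.

```cpp
#include <cstddef>
#include <cstring>

// Append src to dest without writing past capacity bytes.
void safe_append(char dest[], const char src[], std::size_t capacity) {
    std::size_t used = std::strlen(dest);
    // Leave one byte for the \0 that strncat adds itself.
    std::strncat(dest, src, capacity - used - 1);
}
```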

Recall

• All these functions are straight from the C library
• Standard in every implementation of C/C++ since the 70s
• C had no bool until the 90s, so comparisons return an int
• Also functions that return a character will actually return an int
  – This will automatically be converted to char

strcmp

• Comparison
• int strcmp(const char s[], const char t[])
• Compare s to t
• Returns
  – <0 if s<t
  – 0 if s==t
  – >0 if s>t
• Overlap does no harm here, since nothing is changed, but there had better be a \0

Comparing characters

• When two integers are compared, the whole integer participates
• String comparison is somewhat different
• We sequentially compare corresponding characters
• The result is the result for the first pair that is different
• A string that is the beginning of a longer string is always less than the longer string
• Character comparison is based on the character codes, the collating sequence of the character set

Example

• Compare two strings:
    "bbbazz"
    "bbbbaa"
• First string is less
• Compare two strings:
    "zzz"
    "zzza"
• The shorter is less than the longer
• "Z" < "a" in ASCII
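
These comparisons can be checked with strcmp. Only the sign of its result is specified, so this sketch reduces it to -1, 0 or +1; compare_sign is a name invented for the illustration, and the "Z" versus "a" case assumes an ASCII machine.

```cpp
#include <cstring>

// Reduce the result of strcmp to -1, 0 or +1.
int compare_sign(const char s[], const char t[]) {
    int r = std::strcmp(s, t);
    return (r > 0) - (r < 0);
}
```

compare_sign("bbbazz", "bbbbaa") and compare_sign("zzz", "zzza") both give -1, and two equal strings give 0.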

strncmp

• int strncmp(const char s[], const char t[], size_t n)
• Compare at most the first n characters of s and t
• Returns
  – <0 if s<t
  – 0 if s==t
  – >0 if s>t
• Overlap does no harm here, since nothing is changed

strchr

• char * strchr(const char s[], int c)
• Looks for first c in s
• Returns the pointer to the character if found and NULL otherwise
• There had better be a \0

strrchr

• char * strrchr(const char s[], int c)
  – Nearly the same but starts at the right side
• Looks for last c in s
• Returns the pointer to the character if found and NULL otherwise
• There had better be a \0
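
A returned pointer can be turned into a subscript by subtracting the start of the string. The helpers first_index and last_index are invented names for this sketch, and returning -1 for "not found" is a convention chosen here, not part of the library.

```cpp
#include <cstring>

// Subscript of the first c in s, or -1 if c does not occur.
int first_index(const char s[], char c) {
    const char *p = std::strchr(s, c);
    return p ? static_cast<int>(p - s) : -1;
}

// Subscript of the last c in s, or -1 if c does not occur.
int last_index(const char s[], char c) {
    const char *p = std::strrchr(s, c);
    return p ? static_cast<int>(p - s) : -1;
}
```

In "Hi there" the first e is at subscript 5 and the last at subscript 7.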

Many others

• There are many others here as well that are less important:
  – strspn
  – strcspn
  – strpbrk
  – strstr
  – strerror
  – strtok
  – memcmp
  – memchr
  – memset

Utility character functions

• Another library of importance is ctype.h
• These are functions that do something with a single character
  – Classify it
  – Convert its case

isalpha

• int isalpha(int c);
• Is the character c a letter (upper or lower)?
• Returns 0 for false and a nonzero value, not necessarily 1, for true

isupper and islower

• int isupper(int c);
• Is c an upper case letter?
• int islower(int c);
• Is c a lower case letter?

More

• int isdigit(int c);
  – Is c a digit?
• int isalnum(int c);
  – Is c a letter or digit?
• int iscntrl(int c);
  – Is c a control character?
• int isspace(int c);
  – Is c white space (blank, tab, newline...)?

More

• int isprint(int c);
• Is c printable (printables and space)?
• int ispunct(int c);
• Is c a printing character other than space, letters or digits?
• int isxdigit(int c);
• Is c a digit in hexadecimal (0-9, A-F, a-f)?
• int isgraph(int c);
• Is c a graphic character (printing except space)?

Conversion

• int tolower(int c);
• Convert c to lower case
• If !(isupper(c)) then c is returned
• int toupper(int c);
• Convert c to upper case
• If !(islower(c)) then c is returned

Advantages

• Strings have several privileges over any other array
• Easy constant array notation
  – May be used other than in declarations
• Integrated unfull array scheme

String Objects

• Despite these advantages the string objects are the better approach

• They allow easy assignment and comparison

• Their methods provide all the extra things needed

• Strings were good for C, but object use is the C++ way
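
For comparison, here is what assignment and comparison look like with a string object. The sketch uses std::string from the standard library, which is an assumption on my part since these slides do not name it; the function names are also invented.

```cpp
#include <string>

// Assignment and concatenation work directly on string objects.
std::string greet(const std::string &name) {
    std::string a;
    a = "Hi ";            // assignment from a literal, not allowed for char arrays
    std::string b = a;    // copying one string object to another
    b += name;            // concatenation, with the object finding the room
    return b;
}

// Comparison uses the ordinary relational operators.
bool comes_before(const std::string &s, const std::string &t) {
    return s < t;
}
```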

Object Strings Strategy

• Store the string on the heap
• Keep in the object a pointer to the string
• Other info, such as lengths, may also be retained
• Examples: AnsiString, cstring