Data Structures and Algorithms - Vilniaus universitetasalgis/dsax/Data-structures-0.pdf ·...

Preview:

Citation preview

Data Structuresand

Algorithms

Introduction, Basic data types

www.mif.vu.lt/~algis

Motto• Intermediate-level survey course: •mandatory - “Discrete Mathematics” and “Mathematics

for Informatics”.

• Programming and problem solving with applications.

• Algorithm: method for solving a problem.

• Data structure: method to store information.

• Data Structures and Algorithms:

• A much more dramatic effect can be made on the performance of a program by changing to a better algorithm

MottoAlgorithms impact is broad and far-reaching:• Internet: Web search, packet routing, distributed file sharing• Biology: Human genome project, protein folding.• Computers. Circuit layout, file system, compilers• Computer graphics: Movies, video games, virtual reality• Security: Cell phones, e-commerce, voting machines•Multimedia: CD player, DVD, MP3, JPG, DivX, HDTV• Transportation: Airline crew scheduling, map routing• Physics: N-body simulation, particle collision simulation, etc.Study of algorithms dates at least to Euclid (500 BC)• Some important algorithms were discovered by students!

Motto• Example: Network connectivity

6

Why stud

y algorithm

s?

To b

e ab

le solve

proble

ms th

at could not oth

erw

ise b

e ad

dre

ssed

Exam

ple: N

etw

ork connectivity

[stay tuned]

Content

• 1. Introduction, computing model, von Neuman principles, data, abstract data types, data structures, basic data types• 2. Sorting, internal sorting, quicksort • 3. Merge sort, von Neuman sorting, external sorting• 4. Abstract data types, stack, queue, programming of stack

and queue• 5. Heap, priority queue, priority queue by heap structure,

lists, list programming, dynamic sets ADT• 6. Hierarchical structures, binary search trees, tree allocation

in memory• 7. AVL trees, 2-3-4 trees, red-black trees

Content

• 8. B-trees and other similar trees, Huffman algorithm for data compression• 9. Hashing idea, hashing functions and tables, hashing

procedures and algorithms, extendable hashing• 10. Radix search algorithms, radix trees, radix algorithms,

radix search• 11. Patricia trees, splay trees, suffix tree• 12. Text search• 13. Analysis of algorithms

Introduction

• Informally, algorithm means is a well-defined computational procedure that takes value (set of values) as input and produces some value (set of values) as output.

• An algorithm is thus a sequence of computational steps that transform the input into the output.

• Algorithm is also viewed as a tool for solving a well-specified problem, involving computers.

• There exist many points of view to algorithms. A good example of this is a famous Euclid’s algorithm:

Introduction“ for integers x, y calculate greatest common divisor gcd (x, y) “ Direct implementation of the algorithm looks like:program euclid (input, output);var x,y: integer;function gcd (u,v: integer): integer;

var t: integer; beginrepeat if u<v then begin t := u; u := v; v := t end; u := u-v; until u = 0;

gcd := v end; begin while not eof do

begin readln (x, y); if (x>0) and (y>0) then writeln(x, y, gcd (x, y)) end; end.

Introduction• This algorithm has some exceptional features• it is applicable only to numbers; • it has to be changed every time when something of the

environment changes, say if numbers are very long and does not fit into a size of variable (numbers like 1000!).

• For algorithms of applications, like databases, information systems, etc., they are usually understood in a slightly different way:• is repeated many times; • in a various circumstancies; • with different types of data.

Introduction

• 404800479988610197196058631666872994808558901323↵ 829669944590997424504087073759918823627727188732↵519779505950995276120874975462497043601418278094↵ 646496291056393887437886487337119181045825783647↵849977012476632889835955735432513185323958463075↵ 557409114262417474349347553428646576611667797396↵668820291207379143853719588249808126867838374559↵ 731746136085379534524221586593201928090878297308↵431392844403281231558611036976801357304216168747↵ 609675871348312025478589320767169132448426236131↵412508780208000261683151027341827977704784635868↵ 170164365024153691398281264810213092761244896359↵928705114964975419909342221566832572080821333186↵ 116811553615836546984046708975602900950537616475↵847728421889679646244945160765353408198901385442↵ 487984959953319101723355556602139450399736280750↵137837615307127761926849034352625200015888535147↵ 331611702103968175921510907788019393178114194545↵257223865541461062892187960223838971476088506276↵ 862967146674697562911234082439208160153780889893↵964518263243671616762179168909779911903754031274↵ 622289988005195444414282012187361745992642956581↵746628302955570299024324153181617210465832036786↵ 906117260158783520751516284225540265170483304226↵143974286933061690897968482590125458327168226458↵ 066526769958652682272807075781391858178889652208↵164348344825993266043367660176999612831860788386↵ 150279465955131156552036093988180612138558600301↵435694527224206344631797460594682573103790084024↵ 432438465657245014402821885252470935190620929023↵136493273497565513958720559654228749774011413346↵ 962715422845862377387538230483865688976461927383↵814900140767310446640259899490222221765904339901↵ 886018566526485061799702356193897017860040811889↵729918311021171229845901641921068884387121855646↵ 124960798722908519296819372388642614839657382291↵123125024186649353143970137428531926649875337218↵ 940694281434118520158014123344828015051399694290↵153483077644569099073152433278288269864602789864↵ 321139083506217095002597389863554277196742822248↵757586765752344220207573630569498825087968928162↵ 753848863396909959826280956121450994871701244516↵461260379029309120889086942028510640182154399457↵ 156805941872748998094254742173582401063677404595↵741785160829230135358081840096996372524230560855↵ 903700624271243416909004153690105933983835777939↵410970027753472000000000000000000000000000000000↵ 000000000000000000000000000000000000000000000000↵000000000000000000000000000000000000000000000000↵ 000000000000000000000000000000000000000000000000↵000000000000000000000000000000000000000000000000↵ 000000000000000000000000

Von Neuman computing model

1943: ENIACPresper Eckert and John Mauchly -- first general electronic computer. Hard-wired program -- settings of dials and switches.

1944: Beginnings of EDVAC

among other improvements, includes program stored in memory

1945: John von Neumann

wrote a report on the stored program concept, known as the

First Draft of a Report on EDVAC

Von Neuman computing modelThe main principles of the “von Neumann machine” (or model):

• a memory, containing instructions and data all together

• a processing unit, for performing arithmetic and logical operations

• a control unit, for interpreting instructions

This was a revolutionary step:

• clearly, requiring hardware changes with each new programming operation was time-consuming, error-prone, and costly

Von Neuman computing model

Memory types: RAM

• RAM is typically volatile memory (meaning it doesn’t retain voltage settings once power is removed)

• RAM is an array of cells, each with a unique address

• A cell is the minimum unit of access.

• Originally, this was 8 bits taken together as a byte. In today’s computer, word-sized cells (16 bits, grouped in 4) are more typical.

• RAM gets its name from its access performance. In RAM memory, theoretically, it would take the same amount of time to access any memory cell, regardless of its location with the memory bank (“random” access).

The ALU• The third component in the von Neumann architecture is

called the Arithmetic Logic Unit.

• This is the component that performs the arithmetic and logic operations for which we have been building parts.

• This component may be duplicated more than once.

• The ALU is the “brain” of the computer.

•We have to define also a basic rules, how to read/write data in/from the memory – this leads to so-called basic data types or just data types.

The ALU

• It houses the special memory locations, called registers (a very fast and special memory, of which we have already considered).

• It maybe not just one, really a different kind of registers (for example L1, L2, etc.

• The ALU is important enough that we will come back to it later.

• For now, just realize that it contains the circuitry to perform addition, substraction, multiplication and division, as well as logical comparisons (less than, equal to and greater than).

Data types or basic data typesData Types:

• Data type is a classification of a type of information, id esthow to prescribe value to bites or bytes in computer memory.

• Data types are essential to any computer programming language.

•Without them, it becomes very difficult to maintain information within a computer program.

• Different data types have different sizes in memory depending on the machine and compilers.

Data types or basic data typesData Types • Integer• Floating-point • Character• String• Boolean • ConstantsA constant is a value that never changes, it used less often than variables in programming. A constant can be : • a number, like 25 or 3.6 • a character, like a or $ • a character string, like "this is a string"

Logical data typesBoolean Types • Introduced by ALGOL 60 • Used to represent switched and flags in programs • Range of values: two elements, one for “true” and one for

“false” • One popular exception is C89, in which numeric expressions

are used as conditionals (all operands with nonzero values are considered true, and zero is considered false) • A Boolean value could be represented by a single bit, but

often statured in the smallest efficiently addressable cell of memory, typically a byte

Basic data types - Boolean data

&& 0 1

0 0 0

1 0 1

Data values: {false, true}

In any programming language, for instance, in C/C++: false = 0, true = 1 (or any nonzero object)

Could store 1 value per bit, but usually use a byte (or word)

Operations: and &&or ||not !

| | 0 1

0 0 1

1 1 1

x !x0 11 0

Character data typesCharacter Types: • Char types are stored as numeric codings (ASCII / Unicode) • Traditionally, the most commonly used coding was the 8-bit

code ASCII (American Standard Code for Information Interchange)

• An alternative, 16-bit coding: Unicode (UCS-2) • Java was the first widely used language to use the Unicode

character set. Since then, it has found its way into JavaScript, Python, Perl, C# and F#

• After 1991, the Unicode Consortium, in cooperation with the International Standards Organization (ISO), developed a 4-byte character code named UCS-4 or UTF-32, which is described in the ISO/IEC 10646 Standard, published in 2000

Basic data types - Character dataStore numeric codes (ASCII, EBCDIC, Unicode) 1 byte for ASCII and EBCDIC, 2 bytes for Unicode (see examples on p. 35).

Basic operation: comparison of chars to determine if ==, <, >, etc. uses their numeric codes (i.e. uses their ordinal values)

Integer data typeInteger:

• The most common primitive numeric data type is integer.

• The hardware of many computers supports several sizes of integers.

• These sizes of integers, and often a few others, are supported by some programming languages.

• Java includes four signed integer sizes: byte, short, int, and long.

• C++ and C#, include unsigned integer types, which are types for integer values without sings.

Basic data types - Integer data

Nonegative (unsigned) integer: type unsigned (and variations) in C++Store its base-two representation in a fixed number w of bits

(e.g., w = 16 or w = 32)

88 = 00000000010110002

Signed integer: type int (and variations) in C++Store in a fixed number w of bits using one of the following

representations:

Sign-magnitude representationSave one bit (usually most significant) for sign

(0 = +, 1 = – )

Use base-two representation in the other bits.

88 ® _0000000010110000

sign bit

1. Cumbersome for arithmetic computations

2. 2 0’s in this scheme

3. Incrementing by one results in subtraction of one, not addition!

–88 ® _000000001011000¯1

Both 0 and -0

Complement representation

For negative n (–n):(1) Find w-bit base-2 representation of n(2) Complement each bit.(3) Add 1

Example: –881. 88 as a 16-bit base-two number 0000000001011000

Same as subtracting the number from 0!

For non-negative n:Use ordinary base-two representation with leading (sign) bit 0

2. Complement this bit string3. Add 1

11111111101001111111111110101000

5 + 7:

0000000000000101+0000000000000111

5 + –6:0000000000000101

+1111111111111010

These work for both + and – integers

0000000000001100

111¬¾¾ carry bits

1111111111111111

+ 0 1

0 0 1

1 1 10

x 0 1

0 0 0

1 0 1

Good for arithmetic computation

Problems with integer representation• Limited capacity -- a finite number of bits

• Overflow and Underflow:

Overflow

- addition or multiplication can exceed largest value permitted by storage scheme

Underflow

- subtraction or multiplication can exceed smallest value permitted by storage scheme

Not a perfect representation of (mathematical) integers:

• can only store a finite (sub)range of them

Floating-point data type•Model real numbers, but only as approximations for most

real values. • On most computers, floating-point numbers are stored in

binary, which exacerbates the problem. • Another problem is the loss of accuracy through arithmetic

operations. • Languages for scientific use support at least two floating-

point types; sometimes more (e.g. float, and double.) • The collection of values that can be represented by a

floating-point type is defined in terms of precision and range: • Precision: is the accuracy of the fractional part of a value,

measured as the number of bits.• Range: is the range of fractions and exponents.

Floating-point data type

IEEE floating-point formats: (a) single precision, (b) double precision

How is real data represented?Types float and double (and variations) in C++Single precision (IEEE Floating-Point Format )

1. Write binary representation in floating-point form:b1.b2b3 . . . ´ 2k with each bi a bit and b1 = 1 (unless number is 0)

­mantissa exponent

or fractional part

Example: 22.625 = Floating point form:

10110.1012

1.01101012 ´ 24

+ 127

double:Exp: 11 bits,

bias 1023Mant: 52 bits

Round-off Errorsbase

Basic data types for C

C programming language:

char - smallest addressable unit of the machine that can contain basic character set. It is an integer type, actual type can be either signed or unsigned depending on the implementation.

signed char - same size as char, but guaranteed to be signed.

unsigned char - same size as char, but guaranteed to be unsigned.

shortshort intsigned shortsigned short int - short signed integer type, at least 16 bits in size.

Basic data types for Cunsignedunsigned int - same as int, but unsigned.

longlong intsigned longsigned long int - long signed integer type, at least 32 bits in size.

unsigned longunsigned long int - same as long, but unsigned.

long longlong long intsigned long longsigned long long int - long long signed integer type, at least 64 bits in size

Basic data types for C

unsigned long long int - same as long long, but unsigned.

float - single precision floating-point type.

double - double precision floating-point type, actual properties unspecified (except minimum limits), however on most systems this is the IEEE 754 double-precision binary floating-point format

long double - extended precision floating-point type, actual properties unspecified.

Boolean type

Structures:

• struct birthday { char name[20]; int day; int month; int year; };

Basic data types for CArray - array of N elements of type T

• int cat[10]; // array of 10 elements, each of type int

• int bob[]; // array of an unspecified number of 'int' elements.

• int a[10][8]; // array of 10 elements, each of type 'array of 8 intelements'

• float f[][32]; // array of unspecified number of 'array of 32 float elements’

Pointer - char *square; long *circle;

Unions - union types are special structures which allow access to the same memory using different type descriptions

Complex data typesComplex (numbers):

• Some languages support a complex type: e.g., Fortran and Python

• Each value consists of two floats: the real part and the imaginary part

• Literal form (in Python):

(7 + 3j) where 7 is the real part and 3 is the imaginary part

Decimal data typesDecimal

•Most larger computers that are designed to support business applications have hardware support for decimal data types

• Decimal types store a fixed number of decimal digits, with the decimal point at a fixed position in the value

• These are the primary data types for business data processing and are therefore essential to COBOL

• Advantage: accuracy of decimal values

• Disadvantages: limited range since no exponents are allowed, and its representation wastes memory

Strings and their operationsTypical operations:

• Assignment

• Comparison (=, >, etc.)

• Concatenation

• Substring reference

• Pattern matching

C and C++ use char arrays to store char strings and provide a collection of string operations through a standard library whose header is string.h

Recommended