Dialog Databases Structure & Indexing Dr. Dania Bilal IS 530 Fall 2009


Definition

A database is a collection of information organized in a way that a computer program can quickly retrieve desired pieces of data.

Database Components

Fields
Records
Files

Database Fields

Pieces of information a user can access:
  Author
  Title
  Journal name
  Abstract
  Descriptors
  Other

Field Attributes

Numeric (e.g., accession number)
Textual (e.g., author name)

Data Structure

A scheme for organizing related pieces of information.

Basic types of data structures:
  Files
  Records
  Trees
  Tables

Files

File
  A collection of records
  In Dialog, a file also refers to a specific database
  Every file/database has a number and/or a name
  ERIC is a database with file no. 1 in Dialog.

Records

Record
  A collection of fields which constitutes a complete set of information
  Author, title, journal name, abstract, etc.

A collection of records constitutes a file.
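The relationship between fields, records, and files can be sketched in Python. This is only an illustration: the class names `Record` and `File` and the placeholder author and journal values are assumptions made for the example, not Dialog's actual record layout.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record: a collection of fields describing a single document."""
    accession_number: int   # numeric field
    author: str             # textual field
    title: str
    journal_name: str
    abstract: str
    descriptors: list[str] = field(default_factory=list)


@dataclass
class File:
    """A file: a collection of records. In Dialog, a file is also a database."""
    number: int
    name: str
    records: list[Record] = field(default_factory=list)


# ERIC is file no. 1 in Dialog. The author and journal values are placeholders.
eric = File(number=1, name="ERIC")
eric.records.append(
    Record(
        accession_number=101,
        author="(hypothetical author)",
        title="The origins of Don Giovanni",
        journal_name="(hypothetical journal)",
        abstract="Discusses the history and sources Mozart used in his opera Don Giovanni.",
        descriptors=["Mozart", "Opera", "Historical Analysis"],
    )
)
```

Each attribute of `Record` plays the role of a field, and the `records` list is what makes `eric` a file.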

Trees

Data is organized in a hierarchical structure
  Each element is attached to one or more elements directly beneath it.
  Connections between elements -> branches
  Elements at the bottom of a tree with no elements below them -> leaves
  Example: Yahoo directory.
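A tree of this kind can be sketched as nodes that each hold a list of the elements directly beneath them. The `Node` class, the `leaves` helper, and the category names are hypothetical, chosen only to mimic a small web directory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the tree; `children` are the elements directly beneath it."""
    label: str
    children: list["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf is an element with no elements below it.
        return not self.children


def leaves(node: Node) -> list[str]:
    """Collect the labels of all leaves under `node`, left to right."""
    if node.is_leaf():
        return [node.label]
    found = []
    for child in node.children:   # each parent-child link is a branch
        found.extend(leaves(child))
    return found


# Hypothetical directory categories.
directory = Node("Directory", [
    Node("Arts", [Node("Music"), Node("Theater")]),
    Node("Science", [Node("Biology")]),
])

print(leaves(directory))
```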

Tables

Data is organized in rows and columns
  Example: Excel spreadsheet

Relational database management systems store data in the form of related tables
  Aleph system (Hodges online catalog) is based on a relational database management system called Oracle.
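As a rough illustration of "related tables", the sketch below uses SQLite from Python's standard library. SQLite stands in here for a relational system in general; it is not Oracle or Aleph, and the table names, column names, and journal name are invented for the example.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE journals (
        journal_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE articles (
        article_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        journal_id INTEGER REFERENCES journals(journal_id)
    );
""")

# One row per table; the journal name is hypothetical.
conn.execute("INSERT INTO journals VALUES (1, 'Hypothetical Music Review')")
conn.execute("INSERT INTO articles VALUES (101, 'The origins of Don Giovanni', 1)")

# The shared journal_id column is what relates the two tables.
rows = conn.execute("""
    SELECT a.title, j.name
    FROM articles AS a
    JOIN journals AS j ON j.journal_id = a.journal_id
""").fetchall()
print(rows)
```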

Dialog Database

Documents or surrogates are stored in a linear file
  Example of linear organization is cassette tape
  Access to songs on the tape is not “direct” or “random” in nature.

Linear file is transformed into an inverted file (in Dialog)

Dialog Database Structure

Linear file
  Composed of document surrogates (abstracts) stored in their full, original form.

Inverted file
  Composed of all words included in document surrogates excluding stop words.

Problem with Linear File

Documents or surrogates will have to be searched in their entirety to locate specific information needed.
  Slow
  Inefficient
  Access to information may cause frustration
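A minimal sketch of searching a linear file shows why it is slow: every surrogate is read in full for every query. The function name `linear_search` and record 102 are hypothetical; the text of record 101 comes from the example used later in these slides.

```python
import re


def linear_search(surrogates: dict[int, str], term: str) -> list[int]:
    """Return document numbers containing `term`, reading every surrogate in full."""
    term = term.lower()
    hits = []
    for doc_no, text in surrogates.items():
        # The whole text is split into words on every search.
        words = re.findall(r"[a-z0-9]+", text.lower())
        if term in words:
            hits.append(doc_no)
    return hits


surrogates = {
    101: "The origins of Don Giovanni. Discusses the history and sources "
         "Mozart used in his opera Don Giovanni.",
    102: "A hypothetical record about Verdi and Italian opera.",  # made-up record
}

print(linear_search(surrogates, "Mozart"))
print(linear_search(surrogates, "opera"))
```

The work grows with the total amount of text in the file, no matter how rare the search term is.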

Inverted File

Words in all document surrogates can be searched instead of the whole text of the documents themselves
  Music CD is an analogy to an inverted structure.
  Divided into tracks
  Random and direct access to each track is easy

Faster access to information

Dialog Inverted File

A list of words in each document surrogate is made.

Each word is numbered, including phrases and excluding stop words (the, a, an, etc.).

Words that are numbered are alphabetized (numbers precede letters).
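These three steps can be sketched as follows. The stop word set is an assumption for illustration (the slide names only "the, a, an, etc."), and the function names are hypothetical. Each word keeps the number of its position in the original text, so dropping a stop word leaves a gap in the numbering.

```python
import re

# Assumed stop words; Dialog's full list is not given in these slides.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def numbered_words(text: str) -> list[tuple[str, int]]:
    """Number every word by position, then drop the stop words."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [
        (word, pos)
        for pos, word in enumerate(words, start=1)
        if word.lower() not in STOP_WORDS
    ]


def alphabetize(entries: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Sort entries alphabetically; digits sort before letters in this ordering."""
    return sorted(entries, key=lambda entry: entry[0].lower())


entries = numbered_words("The origins of Don Giovanni")
print(entries)
print(alphabetize(entries))
```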

Dialog Inverted File

Alphabetized entries are followed by
  the document number (based on its acquisition and addition to the database)
  the field or fields in which the entry appears
    Author field
    Title field
    Abstract field
    Descriptor field
    Other fields, as applicable

Linear File: Example

101. The origins of Don Giovanni.

Discusses the history and sources Mozart used in his opera Don Giovanni.

DE: Mozart, Opera, Historical Analysis.

Inverted File

Word        Doc no.   Field   Word sequence
Origins     101       Ti      2
Don         101       Ti      4
Giovanni    101       Ti      5
Discusses   101       Ab      1
History     101       Ab      3
Sources     101       Ab      5
Mozart      101       Ab      6
Used        101       Ab      7
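The entries above can be reproduced with a short sketch that records, for each word, the document number, the field, and the word's position within that field. The stop word set and the function name `index_record` are assumptions. The sketch indexes the whole abstract, so it also produces entries for words past position 7 that the table does not list.

```python
import re
from collections import defaultdict

# Assumed stop words; Dialog's full list is not given in these slides.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def index_record(index, doc_no, fields):
    """Add one record's words to `index` as (doc no., field, word sequence) entries."""
    for field_code, text in fields.items():
        words = re.findall(r"[A-Za-z0-9]+", text)
        for pos, word in enumerate(words, start=1):
            if word.lower() not in STOP_WORDS:
                index[word.lower()].append((doc_no, field_code, pos))


index = defaultdict(list)
index_record(index, 101, {
    "Ti": "The origins of Don Giovanni",
    "Ab": "Discusses the history and sources Mozart used in his opera Don Giovanni.",
})

# A search is now a single lookup instead of a scan of every surrogate.
print(index["mozart"])
print(index["don"])
```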

Inverted File Cont’d.

Word                  Doc no.   Field   Word sequence
Mozart                101       DE      1
Opera                 101       DE      2
Historical            101       DE      3
Analysis              101       DE      4
Historical Analysis   101       DE      3,4

Indexing

Words (keywords)
  Every important word in a document is indexed
  Example: Historical analysis
    Indexed as 2 separate words and as a phrase
      Historical (word)
      Analysis (word)
      Historical analysis (phrase)
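Word-plus-phrase indexing of a descriptor field can be sketched as below. The function name `index_descriptors` and the tuple layout are hypothetical. The sketch assumes that words are numbered consecutively across all descriptors in the field, and that a multi-word descriptor is additionally entered as a phrase covering its word positions.

```python
def index_descriptors(doc_no, descriptors):
    """Return (entry, doc no., field, positions) tuples for a descriptor field."""
    entries = []
    pos = 0
    for descriptor in descriptors:
        positions = []
        for word in descriptor.split():
            pos += 1
            positions.append(pos)
            entries.append((word, doc_no, "DE", (pos,)))        # single-word entry
        if len(positions) > 1:
            # A multi-word descriptor is also indexed once as a whole phrase.
            entries.append((descriptor, doc_no, "DE", tuple(positions)))
    return entries


for entry in index_descriptors(101, ["Mozart", "Opera", "Historical Analysis"]):
    print(entry)
```

Under these assumptions, a search on either "Historical" or "Analysis" alone, or on the full phrase, would reach document 101.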

Google Indexing

Example 1. Google Phrase/Sentence Indexing.
Example 2. Google Phrase/Keywords Indexing.
Example 3. Google Natural Language Search and Retrieval???

Demos

Dialog - ERIC database
EBSCO - ERIC database
Discussion of differences in interface features