Dialog Databases: Structure & Indexing
Dr. Dania Bilal
IS 530
Fall 2009
Definition
A database is a collection of information organized in a way that a computer program can quickly retrieve desired pieces of data.
Database Fields
Pieces of information a user can access:
  Author
  Title
  Journal name
  Abstract
  Descriptors
  Other
Field Attributes
  Numeric (e.g., accession number)
  Textual (e.g., author name)
Data Structure
A scheme for organizing related pieces of information.
Basic types of data structures:
  Files
  Records
  Trees
  Tables
Files
File
  A collection of records
  In Dialog, a file also refers to a specific database.
Every file/database has a number and/or a name.
  ERIC is file no. 1 in Dialog.
Records
Record
  A collection of fields which constitutes a complete set of information
    Author, title, journal name, abstract, etc.
A collection of records constitutes a file.
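A minimal Python sketch of the record/file relationship described above. The class name Record, the attribute names, and the list name eric_like_file are illustrative assumptions, not Dialog's actual storage format. The author and journal name are left empty because the sample record in these slides does not give them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """One record: a collection of fields forming a complete set of information."""
    accession_number: int                 # numeric field
    author: str                           # textual field
    title: str
    journal_name: str
    abstract: str
    descriptors: List[str] = field(default_factory=list)


# A file is simply a collection of records (hypothetical variable name).
eric_like_file = [
    Record(
        accession_number=101,
        author="",                        # not given in the slides
        title="The origins of Don Giovanni",
        journal_name="",                  # not given in the slides
        abstract="Discusses the history and sources Mozart used in his opera Don Giovanni.",
        descriptors=["Mozart", "Opera", "Historical Analysis"],
    ),
]

print(len(eric_like_file), eric_like_file[0].title)
```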
Trees
Data is organized in a hierarchical structure.
  Each element is attached to one or more elements that are directly beneath it.
  Connections between elements -> branches
  Elements at the bottom of a tree with no elements below them -> leaves
  Example: Yahoo directory.
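The tree idea can be sketched in a few lines of Python. The category names below are hypothetical stand-ins for a web directory and are not taken from any real directory. The children list of each node plays the role of the branches, and a node with no children is a leaf.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)  # branches to elements beneath


def leaves(node):
    """Collect the labels of all elements that have nothing below them."""
    if not node.children:
        return [node.label]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# Hypothetical directory hierarchy, for illustration only.
directory = Node("Directory", [
    Node("Arts", [Node("Music"), Node("Theatre")]),
    Node("Science", [Node("Biology")]),
])

print(leaves(directory))  # ['Music', 'Theatre', 'Biology']
```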
Tables
Data is organized in rows and columns.
  Example: Excel spreadsheet
Relational database management systems store data in the form of related tables.
  The Aleph system (Hodges online catalog) is based on a relational database management system called Oracle.
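To make "related tables" concrete, here is a small sketch using Python's built-in sqlite3 module rather than Oracle. The table names, column names, and the journal name 'Example Journal' are assumptions made up for illustration. The two tables are related through the journal_id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE journal (
    journal_id INTEGER PRIMARY KEY,
    name       TEXT
);
CREATE TABLE article (
    accession_number INTEGER PRIMARY KEY,
    title            TEXT,
    journal_id       INTEGER REFERENCES journal(journal_id)
);
""")

# 'Example Journal' is a placeholder; the slides do not name a journal.
conn.execute("INSERT INTO journal VALUES (1, 'Example Journal')")
conn.execute("INSERT INTO article VALUES (101, 'The origins of Don Giovanni', 1)")

# Join the two tables through their shared column.
rows = conn.execute("""
    SELECT article.title, journal.name
    FROM article
    JOIN journal ON article.journal_id = journal.journal_id
""").fetchall()

print(rows)  # [('The origins of Don Giovanni', 'Example Journal')]
```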
Dialog Database
Documents or surrogates are stored in a linear file.
  A cassette tape is an example of linear organization.
  Access to songs on the tape is not “direct” or “random” in nature.
The linear file is transformed into an inverted file (in Dialog).
Dialog Database Structure
Linear file
  Composed of document surrogates (abstracts) stored in their full, original form.
Inverted file
  Composed of all words included in document surrogates, excluding stop words.
Problem with Linear File
Documents or surrogates have to be searched in their entirety to locate the specific information needed.
  Slow
  Inefficient
  Access to information may cause frustration
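A rough sketch of why the linear file is slow: every surrogate has to be read in full for every query. The function name linear_search and the list name linear_file are hypothetical, and the single record is the sample surrogate used in these slides.

```python
# Each entry: (document number, full text of the surrogate).
linear_file = [
    (101,
     "The origins of Don Giovanni. "
     "Discusses the history and sources Mozart used in his opera Don Giovanni. "
     "DE: Mozart, Opera, Historical Analysis."),
]


def linear_search(records, term):
    """Scan every surrogate from start to end, like playing a tape through."""
    term = term.lower()
    hits = []
    for doc_no, text in records:
        if term in text.lower():   # the whole text is examined each time
            hits.append(doc_no)
    return hits


print(linear_search(linear_file, "opera"))
```

The work grows with the total amount of text in the file, because nothing lets the search skip ahead to the matching documents.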
Inverted File
Words in all document surrogates can be searched instead of the whole text of the documents themselves.
A music CD is an analogy to an inverted structure.
  Divided into tracks
  Random and direct access to each track is easy
Faster access to information
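The "direct access" idea can be illustrated with a Python dictionary that maps each word straight to the documents containing it, much like jumping to a CD track. This is only a sketch: the names inverted and lookup are assumptions, and the three entries are hand-picked from the sample surrogate in these slides.

```python
# Word -> set of document numbers in which the word occurs.
inverted = {
    "giovanni": {101},
    "mozart": {101},
    "opera": {101},
}


def lookup(index, word):
    """Go directly to the entry for a word; no document text is scanned."""
    return sorted(index.get(word.lower(), set()))


print(lookup(inverted, "Mozart"))
```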
Dialog Inverted File
A list of words in each document surrogate is made.
Each word is numbered, including phrases and excluding stop words (the, a, an, etc.).
Words that are numbered are alphabetized (numbers precede letters).
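The three steps above can be sketched as follows. The stop-word set is an assumption chosen for illustration, since the slides list only "the, a, an, etc."; the function names are hypothetical as well. Stop words are dropped from the list but still occupy a position, which is consistent with the example in these slides where "origins" is word 2 of the title.

```python
import re

# Assumed stop words; the real Dialog list is not given in the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "his"}


def numbered_words(text):
    """Steps 1 and 2: list the words and number them, skipping stop words."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [
        (word.lower(), position)
        for position, word in enumerate(words, start=1)
        if word.lower() not in STOP_WORDS
    ]


def alphabetize(entries):
    """Step 3: sort the entries; digits sort before letters in Python strings."""
    return sorted(entries)


entries = alphabetize(numbered_words("The origins of Don Giovanni"))
print(entries)  # [('don', 4), ('giovanni', 5), ('origins', 2)]

# "42" is a made-up numeric token, used only to show numbers sorting first.
print(alphabetize([("opera", 1), ("42", 2)]))  # [('42', 2), ('opera', 1)]
```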
Dialog Inverted File
Alphabetized entries are followed by:
  Document number (based on its acquisition and addition to the database)
  Field or fields in which the entry appears
    Author field
    Title field
    Abstract field
    Descriptor field
    Other fields, as applicable
Linear File: Example
101. The origins of Don Giovanni.
Discusses the history and sources Mozart used in his opera Don Giovanni.
DE: Mozart, Opera, Historical Analysis.
Inverted File
Word        Doc no.  Field  Word sequence
Origins     101      Ti     2
Don         101      Ti     4
Giovanni    101      Ti     5
Discusses   101      Ab     1
History     101      Ab     3
Sources     101      Ab     5
Mozart      101      Ab     6
Used        101      Ab     7
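The rows in the table above can be reproduced with a short Python sketch. The stop-word set, the dictionary layout of the record, and the function name postings are assumptions for illustration. The sketch also produces entries for the remaining abstract words, which the table does not show, so only the first eight rows are printed.

```python
import re

# Assumed stop words; the slides list only "the, a, an, etc."
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "his"}

record = {
    "doc_no": 101,
    "Ti": "The origins of Don Giovanni.",
    "Ab": "Discusses the history and sources Mozart used in his opera Don Giovanni.",
}


def postings(rec):
    """Build (word, doc no., field, word sequence) rows for title and abstract."""
    rows = []
    for field_code in ("Ti", "Ab"):
        words = re.findall(r"[A-Za-z0-9]+", rec[field_code])
        for seq, word in enumerate(words, start=1):
            if word.lower() not in STOP_WORDS:   # stop words keep their number
                rows.append((word.capitalize(), rec["doc_no"], field_code, seq))
    return rows


rows = postings(record)
for row in rows[:8]:
    print(*row)
```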
Inverted File Cont’d.
Word                 Doc no.  Field  Word sequence
Mozart               101      DE     1
Opera                101      DE     2
Historical           101      DE     3
Analysis             101      DE     4
Historical Analysis  101      DE     3,4
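A sketch of how the descriptor entries above could be generated: each word is indexed on its own, and a multi-word descriptor is also indexed as a phrase that carries the positions of all its words. The function name descriptor_postings and the tuple layout are hypothetical and are not Dialog's internal format.

```python
def descriptor_postings(doc_no, descriptors):
    """Index each descriptor word, plus the whole descriptor when it is a phrase."""
    rows = []
    seq = 0
    for descriptor in descriptors:
        words = descriptor.split()
        positions = []
        for word in words:
            seq += 1                      # words are numbered across the field
            positions.append(seq)
            rows.append((word, doc_no, "DE", (seq,)))
        if len(words) > 1:                # phrase entry, e.g. positions 3,4
            rows.append((descriptor, doc_no, "DE", tuple(positions)))
    return rows


rows = descriptor_postings(101, ["Mozart", "Opera", "Historical Analysis"])
for word, doc_no, field_code, positions in rows:
    print(word, doc_no, field_code, ",".join(str(p) for p in positions))
```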
Indexing
Words (keywords)
  Every important word in a document is indexed.
  Example: Historical analysis
    Indexed as 2 separate words and as a phrase:
      Historical (word)
      Analysis (word)
      Historical analysis (phrase)
Example 3. Google Natural Language Search and Retrieval???