Space-Efficient Support for Temporal Text Indexing in a Document Archive Context
Kjetil Nørvåg

Space-Efficient Support for Temporal Text Indexing in a Document Archive Context
Kjetil Nørvåg
Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
(Work done during a visit at Aalborg University, Denmark)

Slide 2 (ECDL 2003, August 20, 2003)
Outline
- Motivation and example application
- The temporal text-indexing approach used in V2
- A more space-efficient approach: ITTX
- Comparison
- Summary and further work

Slide 3
Motivation
- Amount of data available in various documents is rapidly increasing
- Storage is getting cheaper
- Less need for deleting data
- Can more often afford to store previous versions

Slide 4
Example application: temporal web warehouse
Related projects:
- Internet Archive Wayback Machine
- Several projects at national level in different countries

Slide 5
Our goal
Want to query:
- Historical versions, e.g., all documents containing "bin Laden" and created before September 11, 2001
- Changes, e.g., all documents that did not contain "bin Laden" before September 11, 2001, but contained these words afterwards
Why? For example: identifying trends, web archive mining, investigations, etc.

Slide 6
What is the problem?
Temporal text-containment queries:
Q: Give me all document versions that contained the word "Kjetil" on August 25, 2002.
Expensive query without a suitable index.

Slide 7
Context: the V2 temporal document database system
- Supports storage, retrieval, and querying of transaction-time temporal documents
- Support for temporal text-containment queries
- Emphasis on using/developing techniques that are easy to integrate into existing systems

Slide 8
Temporal text indexing in V2 prototype: first version
- Document versions uniquely identified by version identifiers (VIDs)
- Given by name and timestamp (mapped to a VID)
- Basic text index indexes all versions
- Simple (but fairly efficient) support structure: the VP index, which maps from VID to the validity time period of each version
Temporal text query processing:
1. Text index query on all versions
2. Time-select step using the VP index
Efficient under the assumption that the VP index fits in main memory.

Slide 9
From the V2 approach to ITTX: Interval-based Temporal Text indeXing
- Problem of original approach: size of text index grows proportionally with size of document database
- Want: size of text index to grow proportionally with size of changes
- Solution: interval-based indexing
- Use document identifier (DID) and document-version identifier (DVID) to identify a version
- Conceptually, the text index holds, for each word occurrence in a document version valid from T_S to T_E: (Word, DID, DVID, T_S, T_E)
- Entries for consecutive DVIDs are stored as an interval: (Word, DID, DVID_i, DVID_j, T_S, T_E)

Slide 10
Separate indexes for word occurrences in current and historical documents
- Assume queries for current documents will still be most frequent: with a separate index for entries that are still valid, fewer entries have to be processed
- Avoid storing unknown end timestamps for current versions: saves some space

Slide 11
Temporal text-index structures

Slide 12
Operation: insert document at time t
1. Allocate document identifier d
2. Insert document into version database
3. For all distinct words W in document, insert (Word=W, DID=d, DVID=0, T_S=t) into CTxtIdx

Slide 13
Operation: update document d at time t
1. Read previous version with DVID=j
2. DVID=j+1 allocated for new version
3. For all new distinct words W in document, insert (Word=W, DID=d, DVID=j+1, T_S=t) into CTxtIdx
4. For all words that disappeared between versions:
   1. Remove (Word, DID, DVID=i, T_S) from CTxtIdx
   2. Insert (Word, DID, DVID=i, T_S, T_E=t) into HTxtIdx

Slide 14
Operation: temporal snapshot single-word text-containment query
Task: querying for all document versions that contained a particular word W_S at time t
1. HTxtIdx: retrieve (Word, DID, DVID_i, DVID_j, T_S, T_E) where Word=W_S and t lies between T_S and T_E
2. CTxtIdx: retrieve (Word, DID, DVID_j, T_S) where Word=W_S and t ≥ T_S
3. Interesting part of result: set of (DID, DVID_i, DVID_j) tuples
4. The exact DVID is not known; a lookup in the doc-version database and doc-name index is needed
Multi-word query: retrieval of all postings is only necessary for one of the words; for the other words, only selective (Word, DID_x) lookups are needed.

Slide 15
Comparison: ITTX vs. original V2
Advantages of ITTX:
- Smaller index size
- More efficient non-temporal (current) text-containment queries
- Average cost of updating document/index entries much lower

Slide 16
Possible problem with ITTX: data reduction
- Granularity reduction: results in fragmented intervals in the text index, so more space is needed
- Vacuuming (physically removing some non-current versions or deleted documents): no problem with ITTX

Slide 17
Summary and further work
- The motivation and context
- The (previous) approach, currently used in V2
- The new/improved approach
Ongoing work:
- New version of the V2 document database system
  - Will include implementation of ITTX
  - Will support XML and temporal XML queries
- Study approaches that can achieve better clustering in the temporal dimension, e.g., TSB-tree-like approaches
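
Illustrative sketch (not from the slides): the insert, update, and snapshot-query operations of slides 12 to 14 can be mimicked with two in-memory dictionaries standing in for CTxtIdx and HTxtIdx. The class name `ITTXSketch`, its method names, and the dictionary layout are hypothetical. The sketch also makes three assumptions that the slides do not state: validity periods are treated as half-open, [T_S, T_E); a historical entry records the last DVID that still contained the word; and the version database is reduced to the word set of the latest version.

```python
from collections import defaultdict


class ITTXSketch:
    """Toy in-memory model of the CTxtIdx / HTxtIdx pair (illustration only)."""

    def __init__(self):
        # CTxtIdx stand-in: word -> {did: (first_dvid, t_s)} for still-valid entries.
        self.ctxt = defaultdict(dict)
        # HTxtIdx stand-in: word -> [(did, dvid_i, dvid_j, t_s, t_e)] for closed intervals.
        self.htxt = defaultdict(list)
        # Stand-in for the version database: did -> (latest dvid, words of that version).
        self.latest = {}

    def insert_document(self, words, t):
        did = len(self.latest)            # allocate document identifier d
        words = set(words)
        self.latest[did] = (0, words)     # "insert document into version database"
        for w in words:                   # one current entry per distinct word
            self.ctxt[w][did] = (0, t)
        return did

    def update_document(self, did, words, t):
        j, old_words = self.latest[did]   # read previous version, DVID = j
        new_words = set(words)
        for w in new_words - old_words:   # words that are new in version j+1
            self.ctxt[w][did] = (j + 1, t)
        for w in old_words - new_words:   # words that disappeared
            dvid_i, t_s = self.ctxt[w].pop(did)
            # Assumption: close the interval at DVID j, the last version with the word.
            self.htxt[w].append((did, dvid_i, j, t_s, t))
        self.latest[did] = (j + 1, new_words)
        return j + 1

    def snapshot_query(self, word, t):
        # Historical entries whose validity period covers t (half-open by assumption).
        hits = [(did, i, j)
                for did, i, j, t_s, t_e in self.htxt.get(word, [])
                if t_s <= t < t_e]
        # Current entries that already existed at time t; the end DVID is still open.
        hits += [(did, i, None)
                 for did, (i, t_s) in self.ctxt.get(word, {}).items()
                 if t_s <= t]
        return sorted(hits, key=lambda h: h[:2])


if __name__ == "__main__":
    idx = ITTXSketch()
    d = idx.insert_document({"temporal", "index"}, t=10)
    idx.update_document(d, {"temporal", "archive"}, t=20)
    idx.update_document(d, {"archive"}, t=30)
    print(idx.snapshot_query("temporal", 25))
    print(idx.snapshot_query("archive", 25))
```

As on slide 14, the query returns only (DID, DVID_i, DVID_j) ranges; picking the exact version valid at time t would still need a lookup in the version database, which this sketch leaves out. A word that stays in a document across many versions occupies a single entry here, which is the source of the space saving the comparison slide claims.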
