LIS618 lecture 4 Thomas Krichel 2004-02-28


LIS618 lecture 4

Thomas Krichel

2004-02-28


Structure

• Document preprocessing

• Practice: Nexis


document preprocessing

• There are some operations that may be done to the documents before indexing
  – lexical analysis
  – stemming of words
  – elimination of stop words
  – selection of index terms
  – construction of term categorization structures
  we will look at those in turn
• in many cases, document preprocessing is not well documented by the provider.
• but searchers need to be aware of them…


lexical analysis

• divides a stream of characters into a stream of words

• seems easy enough but…
  – should we keep numbers?
  – hyphens. compare "state-of-the-art" with "b-52"
  – removal of punctuation, but consider "The battle of Cannae took place in 216 B.C. It was a bad defeat for the Romans."
  – casing. compare "bank" and "Bank"
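
The choices listed above can be made explicit in a small tokenizer. This is only a sketch to show how each policy changes the word stream; the function name `tokenize` and its flags are illustrative assumptions, not the behaviour of any particular search system.

```python
import re

def tokenize(text, keep_numbers=True, split_hyphens=False, lowercase=True):
    """Split a character stream into word tokens under explicit policy choices."""
    if lowercase:
        text = text.lower()  # folds "Bank" into "bank"
    if split_hyphens:
        text = text.replace("-", " ")  # "state-of-the-art" becomes four words
    # A token is a run of letters or digits, optionally joined by inner
    # hyphens or dots, so "b-52" can survive as one token.
    tokens = re.findall(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*", text)
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("The B-52 is state-of-the-art."))
print(tokenize("The B-52 is state-of-the-art.", split_hyphens=True, keep_numbers=False))
```

Note that splitting on hyphens and then dropping numbers reduces "B-52" to the single letter "b", which is exactly the kind of loss the slide warns about.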


stemming

• in general, users search for the occurrence of a term irrespective of grammar

• plural, gerund forms, past tense can be subject to stemming

• important algorithm by Porter, applicable to English only

• evidence about the effect of stemming on information retrieval is mixed

• stemming is relatively rare these days.
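
To give a feel for what stemming does, here is a deliberately naive suffix stripper. It is NOT the Porter algorithm, which has many more rules and conditions; the function name `toy_stem` and the suffix list are assumptions made for illustration.

```python
def toy_stem(word):
    """Strip a few common English suffixes. This is not the Porter algorithm."""
    word = word.lower()
    rules = (("ies", "y"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", ""))
    for suffix, replacement in rules:
        # Require at least three remaining letters so "sing" and "bed" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("companies", "searching", "indexed", "documents", "sing"):
    print(w, "->", toy_stem(w))
```

A stripper this crude also makes mistakes (it turns "running" into "runn"), which hints at why the evidence on stemming is mixed.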


elimination of stop words

• some words carry no meaning and should be eliminated

• in fact any word that appears in 80% of all documents is pretty much useless, but

• consider a searcher for "to be or not to be".

• It is better to reduce the index weight of terms that appear very frequently
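
One common way to reduce the weight of very frequent terms, instead of deleting them, is inverse document frequency. The sketch below assumes the weight log(N / df), where N is the number of documents and df the number of documents containing the term; the sample documents are made up.

```python
import math

def idf_weights(documents):
    """Return {term: log(N / df)} for whitespace-separated, lowercased terms."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc.lower().split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = ["to be or not to be", "the art of war", "to the lighthouse", "the state of the art"]
weights = idf_weights(docs)
for term in ("the", "art", "war"):
    print(term, round(weights[term], 3))
```

A term that occurs in every document gets weight zero, so it stops influencing ranking, but it stays in the index and "to be or not to be" remains searchable.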


index term selection

• In printed indexes, we use nouns only
• some nouns that appear heavily together can be considered to be one index term, such as "computer science"
• Dialog deals with this through phrase indexing.
• Nexis has the smart indexing feature that groups terms into concepts
• Most web engines index all words, and all of them individually


thesauri

• a list of words and for each word, a list of related words
  – synonyms
  – broader terms
  – narrower terms
• used
  – to provide a consistent vocabulary for indexing and searching
  – to assist users with locating terms for query formulation
  – to allow users to broaden or narrow a query
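
A thesaurus of this kind can be pictured as a mapping from each word to its related words. The sketch below uses a made-up entry and a hypothetical helper `expand` to show how a query term could be broadened or narrowed.

```python
# Hypothetical thesaurus data, for illustration only.
THESAURUS = {
    "car": {"synonyms": ["automobile"], "broader": ["vehicle"], "narrower": ["taxi"]},
}

def expand(term, relation, thesaurus=THESAURUS):
    """Return the term followed by its related terms of one kind."""
    entry = thesaurus.get(term, {})
    return [term] + entry.get(relation, [])

print(expand("car", "synonyms"))  # same concept, other vocabulary
print(expand("car", "broader"))   # broaden the query
print(expand("car", "narrower"))  # narrow the query
```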


use of thesauri

• Thesauri are limited to experimental systems, or some high-quality systems, examples are

– http://www.sosig.ac.uk/roads/cgi-bin/thesaurus.pl

– Nexis

• They can be confusing to users.

• There is very little free thesaurus data available.


Back to Nexis: customization

• You can customize the search results on the top right corner of the screen.
  – You should set your default result list to the expanded list, so you can see your keywords in context of the documents.
  – You may wish to increase the size of the default page to 99.

• You can hide the subject directory.


test sources

• The ones that I have used are
  – News 60 days (the top source)
  – Two foreign sources
    • Le Monde
    • Der Spiegel


dating

• Dates can be entered in any of the following forms
  – 07/24/2000
  – 7/24/00
  – July 24, 2000
  – Jul 24, 2000
  – July 2000

• You can set the dates in the menu, but they will not go beyond the stated frame of the source.
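
Accepting several date forms usually means trying a list of patterns in turn. The following sketch does that with Python's `datetime.strptime`; it only illustrates the idea and says nothing about how Nexis parses dates internally. The name `parse_date` and the order of the patterns are assumptions, and month names are matched in the default (English) locale.

```python
from datetime import datetime

# Four-digit years are tried before two-digit years.
FORMATS = ["%m/%d/%Y", "%m/%d/%y", "%B %d, %Y", "%b %d, %Y", "%B %Y"]

def parse_date(text):
    """Return a date for any of the accepted forms, or raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next form
    raise ValueError(f"unrecognized date form: {text!r}")

for form in ("07/24/2000", "7/24/00", "July 24, 2000", "Jul 24, 2000", "July 2000"):
    print(form, "->", parse_date(form).isoformat())
```

A form without a day, such as "July 2000", is resolved here to the first day of the month; a real system might instead treat it as the whole month.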


Document preprocessing in Nexis

• The following are always considered word limits
  – hyphens -
  – slashes /
  – parentheses ()
  – spaces
• examples
  – “state of the art” will also find “state-of-the-art”
  – “co-operative” will not find “cooperative”.

• The documentation says you have to leave them out of the query. I think this is wrong, you can use slash and hyphen.
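
The two examples follow directly from treating hyphens, slashes, parentheses and spaces as word limits. The sketch below reproduces that reasoning; the function names are hypothetical and the code is a model of the stated rule, not of the actual Nexis implementation.

```python
import re

WORD_LIMITS = r"[-/()\s]+"  # hyphens, slashes, parentheses, spaces

def words(text):
    """Split text at the word limits and lowercase the pieces."""
    return [w for w in re.split(WORD_LIMITS, text.lower()) if w]

def phrase_matches(query, document):
    """True if the query words occur as a consecutive run in the document."""
    q, d = words(query), words(document)
    return bool(q) and any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

print(phrase_matches("state of the art", "a state-of-the-art index"))
print(phrase_matches("co-operative", "a cooperative bank"))
```

"co-operative" is split into the two words "co" and "operative", which is why it cannot match the single word "cooperative".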


query preprocessing in Nexis

• apostrophe:
  – if followed by "s", it is a possessive. Singular and plural and plural possessive are also found. Thus “company’s” also finds
    • company
    • companies
    • companies’
  – if not, it counts like a character in a word

• at-sign: contrary to documentation, it does not appear to be preprocessed. But you can leave it out when you search for an email address. “president@whitehouse.gov” gives the same results as “president whitehouse.gov”


query preprocessing in Nexis

• ampersand: if it is surrounded by blanks, most of the time, Nexis treats it as "and". If it is not, it treats it as a normal character

• But the & is not always equivalent to “and”.

• Example: search in Le Monde for
  – “cable & wireless” same as “cable wireless”
  – “cable and wireless” gives a different result


query preprocessing in Nexis

• colon and comma are read as a space unless adjacent characters are numbers.

• percent and pound sign mean themselves and are not equivalent to anything.

• ? $ ; in the query are all ignored, according to the documentation, but that is not true
  – compare “$ 4711” in Der Spiegel with “4711” in the same source
• ® is replaced by the word "R", ™ is replaced by the word "TM", according to documentation.


query preprocessing in Nexis

• if you enter a dot, it is interpreted as a decimal point if surrounded by numbers

• if the dot is followed by just one letter, you have to keep it
  – “10:14 a.m.” will get some results
  – “10:14 am” will get different results
• but if there are several letters, you can leave it out
  – “ebay com” gives same results as “ebay.com”


query preprocessing in Nexis

• The double quote is ignored in a query.
• It can be used to make the word not, a reserved word, searchable, according to the documentation.
• Example
  – “to be or not to be”, Der Spiegel, all dates
• This method cannot be applied to other noise words. It is not even possible to get lists of noise words.


effort to be easy

• Overall, Nexis makes an effort to be easy at the expense of precise query semantics.

• One positive example is long quotes
  – who said: “For the past three years there has been no growth. Sooner or later they have to figure out that the melody has changed.”
  – Such long quotes work most of the time; they can be entered without surrounding quotation marks.


noise and reserved words

• Noise words are common words
  – in power search, noise words are ignored and replaced by a space
  – in quick search, you can use phrases
  – no list of noise words
• Reserved words are
  – and
  – or
  – not
  They are used in Boolean expressions. They are not indexed.


plurals

• Nexis indexes plural and possessive as the singular.

• But in power search, you can use the following
  – PLURAL (term) only the plural of term
  – SINGULAR (term) only the singular of term
  – ALLCAPS (term) only capitals of term
  – NOCAPS (term) no capitals of term
  – CAPS (term) capitalized term only
• Note that term can be a sequence of words.
• This feature does not work properly on certain databases. Example: allcaps(RFA) in Le Monde.


plural, singular etc functions

• Such functions are not reliable. Example, search in Der Spiegel
  – national plural(archives) no results
  – national archives
  – singular(haus) same results as “haus”
  – plural(haus) no results
  – upper(USA) no results


entering phrases in power search

• If you just put words one after the other, it looks for them one after the other, subject to
  – latin chars only
    • Путин does not work
      – no hits (opera)
      – erroneous hits (explorer)
  – removal of noise words
    • “Thomas Bishoff” finds Thomas of Bishoff
  – plural ORing
    • “José Carreras” will also find José Carrera


searching for phrases

• If your phrase contains a word that is a reserved word you are best off using connectors.

• Example, current news, today only
  – “black white” 221 hits
    • “black and white”
    • “black or white”
    • “black/white”
  – “black and white” 444 hits
    • interpreted “and” as a reserved word.


searching for phrases

• If your phrase contains a word that is likely to be a noise word, you are best off using connectors.
• Otherwise, your search seems to be translated with the noise word removed and then replaced by a connector.
• Example, current news, 60 days
  – “cream of white” you get
    • “cream or white”
    • “cream/white”
    • “cream into the white”


treatment of accents

• I have done a lot of research, but found no systematic treatment of accents. It seems source dependent.

• Therefore search several variations, e.g.
  – Müntefering
  – Muentefering
  – Muntefering

• Same thing for plurals. Some foreign language sources are not searched for plurals. Example, search “archive” in Der Spiegel.


order of connectors

• The order is
  – OR
  – W/n, PRE/n
  – W/S (cannot be combined with W/n or PRE/n)
  – W/P (cannot be combined with W/n or PRE/n)
  – AND
  – AND NOT

• When you use two or more of the same connectors in a search, they normally operate from left to right.

• When a search contains multiple W/n or PRE/n connectors, the connectors operate in numerical order with the smallest number first.
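
The ordering rules above can be written down as a sort key. The sketch below is a simplification that only ranks a flat list of connectors; it does not parse queries, and the names `PRIORITY`, `rank` and `evaluation_order` are assumptions.

```python
import re

# Priority classes as listed on the slide; W/n and PRE/n share class 1.
PRIORITY = {"or": 0, "w/s": 2, "w/p": 3, "and": 4, "and not": 5}

def rank(connector):
    """Sort key: (priority class, proximity number)."""
    c = connector.lower()
    proximity = re.fullmatch(r"(?:w|pre)/(\d+)", c)
    if proximity:
        return (1, int(proximity.group(1)))  # smallest number first
    return (PRIORITY[c], 0)

def evaluation_order(connectors):
    # sorted() is stable, so identical connectors keep their left-to-right order.
    return sorted(connectors, key=rank)

print(evaluation_order(["and", "w/3", "or", "pre/2", "and not", "w/s"]))
```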


examples

• use news most current 60 days
  – “Harry pre/1 Miller” 18 hits
    • 5th has the sentence “HAWTHORN has discovered another Harry Miller”.
  – “Harry pre/2 Miller” 93 hits
• “another pre/2 Miller w/3 Hawthorn” matches “HAWTHORN has discovered another Harry Miller”, because the pre/2 is evaluated first.


problem formats

• Sometimes Nexis gets confused!
• Problem Format:        Corrected Format:
  – A w/n (B and C)      A w/n B and A w/n C
  – A w/s (B and C)      A w/s B and A w/s C
  – A w/p (B and C)      A w/p B and A w/p C

• but of course you can do something like “another pre/2 Miller and Miller w/s Hawthorn”
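
The corrected formats all apply the same rewrite: the proximity connector is distributed over the parenthesized "and". A sketch of that rewrite for the simple single-word case follows; the function name `distribute` and the regular expression are assumptions, and anything that does not fit the pattern is returned unchanged.

```python
import re

# Matches "A w/n (B and C)", "A w/s (B and C)" and "A w/p (B and C)" with single-word A, B, C.
PROBLEM = re.compile(
    r"^\s*([^\s()]+)\s+(w/(?:\d+|s|p))\s+\(\s*([^\s()]+)\s+and\s+([^\s()]+)\s*\)\s*$",
    re.IGNORECASE,
)

def distribute(query):
    """Rewrite 'A w/x (B and C)' as 'A w/x B and A w/x C'."""
    m = PROBLEM.match(query)
    if not m:
        return query  # not a problem format this sketch knows about
    a, connector, b, c = m.groups()
    return f"{a} {connector} {b} and {a} {connector} {c}"

print(distribute("abuse w/10 (drug and substance)"))
```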


Nexis segments

• Nexis does some document preprocessing for characters, discussed in a later slide.

• The processed document has a number of field/value pairs that are called segments

• Not every source has every segment.

• Searches using segments will return no hits for sources that don’t have the segment.


segment types

• I make a distinction between
  – native
  – smart-indexed
  segments.

• Smart-indexed segments contain the result of smart indexing. Thus, they are not native to the source.


segment types

• Some segments can be sorted. These are the ones that have numbers or dates as values.
• These segments use the following arithmetic operators:
  – = or is: is equal to
  – > or aft: greater than or after
  – < or bef: less than or before
• Example for Der Spiegel
  – atleast3(foster) and date(<2000) and length(>1000)
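
To make the example concrete, here is how the three conditions could be checked against one document held as a plain dictionary. The document layout, the reading of `length` as a number and of `date(<2000)` as "before 1 January 2000", and the helper names are all assumptions for illustration.

```python
from datetime import date

def atleast(n, term, text):
    """True if term occurs at least n times as a whole word (case-insensitive)."""
    return text.lower().split().count(term.lower()) >= n

def matches(doc):
    """Mimic atleast3(foster) and date(<2000) and length(>1000) on one document."""
    return (
        atleast(3, "foster", doc["text"])
        and doc["date"] < date(2000, 1, 1)
        and doc["length"] > 1000
    )

doc = {"text": "Foster met foster and Foster", "date": date(1999, 5, 1), "length": 1200}
print(matches(doc))
```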


how to know about segments

• This is virtually impossible, because there is no comprehensive documentation.

• Look at the example of Le Monde, where the list of segments seems incomplete.

• For Der Spiegel “rubrik(hausmitteilung) and text(foster)” does find the example given. “section(hausmitteilung) and body(foster)”


typical segments in news

• BYLINE
• CORRECTION
• CORRECTION-DATE (date)
• DATE (date)
• DATELINE (not a date)
• GRAPHIC
• HEADLINE
• HIGHLIGHT
• LEAD
• HLEAD is HEADLINE, HIGHLIGHT, & LEAD


typical segments in news

• PUBLICATION name and copyright

• SECTION

• SERIES

• SOURCE

• TICKER  

• TYPE


typical smart-indexed segments

• CITY
• COMPANY
• COUNTRY
• GEOGRAPHIC
• INDUSTRY
• KEYWORD
• ORGANIZATION
• PERSON
• PRODUCT
• SUBJECT
• TICKER
• TYPE
• TERMS includes all these


terms example

• compare, in recent news, person(osama bin laden) with “osama bin laden”

• index terms can be useful to hone in on complex topics that can have many names. Example
  – start with “sex” as an index term
  – collect terms related to gay marriage and civil unions


segment search

• You can place query terms and connectors in a segment and then search for it. Example:

hlead((drug or substance) w/10 abuse)
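
What this query asks for can be sketched on a document stored as a dictionary of segments. The code assumes that HLEAD is the HEADLINE, HIGHLIGHT and LEAD segments taken together, that w/10 means "within ten words, in either order", and that words are split on whitespace; the function names are hypothetical and none of this is confirmed Nexis behaviour.

```python
def within(words, left_terms, right_term, n):
    """True if any of left_terms occurs within n words of right_term."""
    left = [i for i, w in enumerate(words) if w in left_terms]
    right = [i for i, w in enumerate(words) if w == right_term]
    return any(abs(i - j) <= n for i in left for j in right)

def hlead_match(doc):
    """Mimic hlead((drug or substance) w/10 abuse) on one document."""
    # Only the three HLEAD segments are searched; the body is ignored.
    segments = ("headline", "highlight", "lead")
    words = " ".join(doc.get(s, "") for s in segments).lower().split()
    return within(words, {"drug", "substance"}, "abuse", 10)

print(hlead_match({"headline": "Substance abuse rising", "body": "other text"}))
print(hlead_match({"headline": "Drug prices", "body": "drug abuse"}))
```

The second document does not match because "abuse" only occurs in the body, which lies outside the segment being searched.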


using segments for news

• uses power search expressions, plus
• hlead (expression) ?
• headline (expression)
• company (expression) for a company
• byline (expression) for the author
• show (expression) for a television show transcript

expression is a simple keyword or an expression using several words, possibly combined with connectors.


segments for legal data

• name (expression) for the name of a party

• cite (expression) for a citation expression for case law

• title (expression) for the title of a law article

expression is a simple keyword or an expression using several words, possibly combined with connectors.


Search forms

• There are special forms for
  – News
  – Company reports
  – Market indicators
  – Portfolio
  – News and quotes about companies


Personal news alert

• do a search

• then click on “track in personal news” to get to a screen where you can enter
  – periodicity
  – what documents to be sent
  – subject

• This works for real estate for me.


Real time news

• This uses a different query language
  – terms are implicitly ANDed
  – explicit AND and OR allowed
  – phrases have to be put in quotes
  – * stands for any number of characters, not just one as in power search
  – parentheses can be used
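
The difference between the two readings of * can be shown by translating a wildcard pattern into a regular expression. This is a sketch under assumptions: the helper name is made up, "any number of characters" is taken to include zero, and characters are limited to word characters.

```python
import re

def wildcard_to_regex(pattern, star_any_length=True):
    """Compile a wildcard pattern; * is any run of word characters or exactly one."""
    star = r"\w*" if star_any_length else r"\w"
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(star.join(literal_parts), re.IGNORECASE)

real_time = wildcard_to_regex("librar*")                           # real time news reading
power = wildcard_to_regex("librar*", star_any_length=False)        # power search reading

print(bool(real_time.fullmatch("libraries")), bool(power.fullmatch("libraries")))
print(bool(real_time.fullmatch("library")), bool(power.fullmatch("library")))
```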

• I have poor experience with this.


Summary on Nexis

• Nexis has a rich set of resources.

• It can be searched by inexperienced users, but they are likely to get poor results.

• Learning about its features can get you quite far; however, the features are not well documented online.

• Nexis seems frequently to violate its stated rules.


http://openlib.org/home/krichel

Thank you for your attention!