(C) 2003, The University of Michigan 1
Information Retrieval
Handout #3
February 10, 2003
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M&F 11-12
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
TF*IDF (cont’d)
Vector-based matching
• The cosine measure
$$\mathrm{sim}(D,C) = \frac{\sum_k d_k \, c_k \, \mathrm{idf}(k)}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$$
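As a rough illustration of the cosine measure, the following Python sketch computes an idf-weighted cosine between two sparse term vectors. The dictionary representation, the function name `cosine_sim`, and the toy values are assumptions made for illustration; they are not part of the handout.

```python
import math

def cosine_sim(d, c, idf):
    """idf-weighted cosine between two sparse term-weight vectors (dicts)."""
    # Numerator: sum over shared terms of d_k * c_k * idf(k).
    num = sum(w * c[k] * idf.get(k, 1.0) for k, w in d.items() if k in c)
    # Denominator: sqrt(sum_k d_k^2 * sum_k c_k^2).
    den = math.sqrt(sum(w * w for w in d.values()) * sum(w * w for w in c.values()))
    return num / den if den else 0.0

# Toy vectors (hypothetical values).
doc = {"nato": 2, "alliance": 1}
centroid = {"nato": 1, "russia": 1}
idf = {"nato": 1.0, "alliance": 1.0, "russia": 1.0}
score = cosine_sim(doc, centroid, idf)
```

Because idf appears only in the numerator, the score is bounded by 1 only when no idf value exceeds 1.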
IDF: Inverse document frequency
$N$: number of documents
$d_k$: number of documents containing term $k$
$f_{ik}$: absolute frequency of term $k$ in document $i$
$w_{ik}$: weight of term $k$ in document $i$
$$\mathrm{idf}_k = \log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$$
TF * IDF is used for automated indexing and for topic discrimination:
Asian and European news
Asian                   European
622.941 deng            97.487 nato
306.835 china           92.151 albright
196.725 beijing         74.652 belgrade
153.608 chinese         46.657 enlargement
152.113 xiaoping        34.778 alliance
124.591 jiang           34.778 french
108.777 communist       33.803 opposition
102.894 body            32.571 russia
 85.173 party           14.095 government
 71.898 died             9.389 told
 68.820 leader           9.154 would
 43.402 state            8.459 their
 38.166 people           6.059 which
Other topics
120.385 shuttle         74.652 compuserve
 99.487 space           65.321 massey
 90.128 telescope       55.989 salizzoni
 70.224 hubble          29.996 bob
 59.992 rocket          27.994 online
 50.160 astronauts      27.198 executive
 49.722 discovery       15.890 interim
 47.782 canaveral       15.271 chief
 47.782 cape            11.647 service
 40.889 mission         11.174 second
 35.778 florida          6.781 world
 27.063 center           6.315 president
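Ranked term lists of this kind can be produced by scoring every term in a document cluster and sorting by weight. The sketch below assumes the weight is the raw frequency times the idf defined earlier (w_ik = f_ik * idf_k); the handout does not spell out the exact weighting, and the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with w_ik = f_ik * idf_k."""
    n_docs = len(docs)
    # d_k: number of documents that contain term k.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freqs = Counter(doc)  # f_ik: raw count of term k in document i
        weights.append({
            term: f * (math.log2(n_docs / doc_freq[term]) + 1)
            for term, f in freqs.items()
        })
    return weights

# Toy collection (hypothetical documents, already tokenized).
docs = [
    "shuttle space shuttle mission".split(),
    "nato alliance mission".split(),
    "space telescope".split(),
    "nato russia".split(),
]
weights = tfidf(docs)
# Terms of the first document, highest weight first.
top_terms = sorted(weights[0].items(), key=lambda kv: -kv[1])
```

Terms that are frequent in one document but rare across the collection rise to the top, which is why content words dominate the lists while function words such as "which" or "their" score low.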
Semantic networks
Semantic Networks
• Used to represent relationships between words
• Example: WordNet - created by George Miller’s team at Princeton
• Based on synsets (synonyms, interchangeable words) and lexical matrices
Lexical matrix
Word        Word Forms
Meanings    F1      F2      F3      …      Fn
M1          E1,1    E1,2
M2                  E2,2
…                           …
Mm                                         Em,n
Synsets
• Disambiguation
  – {board, plank}
  – {board, committee}
• Synonyms
  – substitution
  – weak substitution
  – synonyms must be of the same part of speech
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board
   => committee, commission
       => administrative unit
           => unit, social unit
               => organization, organisation
                   => social group
                       => group, grouping
Sense 2
board
   => sheet, flat solid
       => artifact, artefact
           => object, physical object
               => entity, something
Sense 3
board, plank
   => lumber, timber
       => building material
           => artifact, artefact
               => object, physical object
                   => entity, something
Sense 4
display panel, display board, board
   => display
       => electronic device
           => device
               => instrumentality, instrumentation
                   => artifact, artefact
                       => object, physical object
                           => entity, something
Sense 5
board, gameboard
   => surface
       => artifact, artefact
           => object, physical object
               => entity, something
Sense 6
board, table
   => fare
       => food, nutrient
           => substance, matter
               => object, physical object
                   => entity, something
Sense 7
control panel, instrument panel, control board, board, panel
   => electrical device
       => device
           => instrumentality, instrumentation
               => artifact, artefact
                   => object, physical object
                       => entity, something
Sense 8
circuit board, circuit card, board, card
   => printed circuit
       => computer circuit
           => circuit, electrical circuit, electric circuit
               => electrical device
                   => device
                       => instrumentality, instrumentation
                           => artifact, artefact
                               => object, physical object
                                   => entity, something
Sense 9
dining table, board
   => table
       => furniture, piece of furniture, article of furniture
           => furnishings
               => instrumentality, instrumentation
                   => artifact, artefact
                       => object, physical object
                           => entity, something
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}
Other relations
• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).
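To make the hypernymy hierarchy concrete, here is a small Python sketch that walks hypernym links and finds the lowest hypernym two synsets share. The links are copied from senses 1 to 3 of the `wn board -hypen` output; the dictionary layout, the labels "board (committee)" and "board (sheet)", and the function names are assumptions made for illustration, not WordNet's own API.

```python
# Hypernym links copied from senses 1-3 of the `wn board -hypen` output.
# The keys "board (committee)" and "board (sheet)" are hypothetical labels
# used only to tell the senses apart.
HYPERNYM = {
    "board (committee)": "committee, commission",
    "committee, commission": "administrative unit",
    "administrative unit": "unit, social unit",
    "unit, social unit": "organization, organisation",
    "organization, organisation": "social group",
    "social group": "group, grouping",
    "board (sheet)": "sheet, flat solid",
    "sheet, flat solid": "artifact, artefact",
    "board, plank": "lumber, timber",
    "lumber, timber": "building material",
    "building material": "artifact, artefact",
    "artifact, artefact": "object, physical object",
    "object, physical object": "entity, something",
}

def hypernym_chain(synset):
    """Follow hypernym links from a synset up to its top-level concept."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def lowest_common_hypernym(a, b):
    """First node on b's chain that also lies on a's chain, or None."""
    ancestors = set(hypernym_chain(a))
    for node in hypernym_chain(b):
        if node in ancestors:
            return node
    return None
```

In this toy graph the sheet and plank senses meet at "artifact, artefact", while the committee sense shares no ancestor with either of them.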
Other features of WordNet
• Index of familiarity
• Polysemy
Familiarity and polysemy
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)
Compound nouns
advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
Top-level concepts
{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
Properties of words
Word distributions
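• Negative binomial distribution
$$F(k) = \binom{\alpha + k - 1}{k} \, p^k \, (1 + p)^{-\alpha - k}$$
  where $F(k)$ is the fraction of documents in which the word occurs $k$ times
• In the Brown corpus
  – the word “said” has p = 9.24 and α = 0.42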
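A quick numerical check of the negative binomial model can be sketched as follows, reading F(k) as C(α + k - 1, k) · p^k · (1 + p)^(-α - k). The function name `neg_binomial` is hypothetical, and the generalized binomial coefficient is evaluated through log-gamma because α is not an integer.

```python
import math

def neg_binomial(k, p, alpha):
    """F(k) = C(alpha + k - 1, k) * p^k * (1 + p)^(-alpha - k)."""
    # Generalized binomial coefficient via log-gamma (alpha is not an integer).
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    log_f = log_coeff + k * math.log(p) - (alpha + k) * math.log1p(p)
    return math.exp(log_f)

# Parameters quoted for the word "said" in the Brown corpus.
P_SAID, ALPHA_SAID = 9.24, 0.42
f0 = neg_binomial(0, P_SAID, ALPHA_SAID)  # share of documents without the word
total = sum(neg_binomial(k, P_SAID, ALPHA_SAID) for k in range(2000))
```

Summing F(k) over k gives a value very close to 1, which is a useful sanity check that the formula defines a proper distribution for these parameters.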
Vocabulary growth
• Heaps’ Law
• V = vocabulary size, n = text size in words
• $V = K n^{\beta}$, where K and β depend on the text
• K is typically between 10 and 100, and β is less than 1 (for TREC-2 it’s between 0.4 and 0.6)
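The sketch below shows Heaps' Law as a one-line function, next to a helper that measures actual vocabulary growth on a token stream so the two can be compared. Both function names and the sample values of K and β are assumptions chosen for illustration.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted vocabulary size V = K * n**beta for a text of n tokens."""
    return K * n ** beta

def vocabulary_growth(tokens):
    """Observed number of distinct words after each token of the text."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

# Hypothetical parameters within the ranges quoted above.
predicted = heaps_vocabulary(1_000_000, K=10, beta=0.5)
observed = vocabulary_growth("a b a c b d".split())
```

Fitting K and β to the observed curve of a real collection (for example by linear regression in log-log space) is the usual next step; it is left out here.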
Word length
• In TREC-2, word length is 5 characters on average.
• If stop words are removed, average length increases to a range from 6 to 7.
Word similarity
• Hamming distance - when words are of the same length
• Levenshtein distance - number of edits (insertions, deletions, replacements)
  – color --> colour (1)
  – survey --> surgery (2)
  – com puter --> computer ?
• Longest common subsequence (LCS)
  – lcs (survey, surgery) = surey
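The three measures above can be sketched in Python with standard dynamic programming. The function names are hypothetical, and unit cost for every insertion, deletion, and replacement is an assumption consistent with the counts shown in the examples.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # replace or match
        prev = cur
    return prev[-1]

def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to rebuild the subsequence.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)
```

Under these unit costs, the "com puter" example comes out as a single edit: deleting the space.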
Approximate string matching
• The Soundex algorithm (Odell and Russell)
• Uses:
  – spelling correction
  – hash function
  – non-recoverable
The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
b, f, p, v : 1
c, g, j, k, q, s, x, z : 2
d, t : 3
l : 4
m, n : 5
r : 6
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” (one letter followed by three digits) by adding terminal zeros or by dropping rightmost digits
Examples:
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
These are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively
Some problems: Rogers and Rodgers, or Sinclair and StClair, receive different codes even though they sound alike
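The four steps can be sketched in Python as follows. This follows the steps as listed here, treating "adjacent in the original name" literally; other Soundex variants handle letters separated by h or w differently. The function name and table layout are assumptions.

```python
# Digit assigned to each consonant class (step 2).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Soundex code of a name in the form LDDD."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0])  # code of the retained first letter
    for ch in letters[1:]:
        code = CODES.get(ch)      # None for a, e, h, i, o, u, w, y (step 1)
        # Step 3: skip a letter whose code equals that of its left neighbour.
        if code is not None and code != prev:
            digits.append(code)
        prev = code
    # Step 4: pad with zeros or truncate to one letter plus three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]
```

For example, `soundex("Rogers")` yields R262 while `soundex("Rodgers")` yields R326, which shows the kind of mismatch noted above.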
Compression
Compression
• Huffman coding (prefix property)
• Ziv-Lempel codes (better)
Huffman coding
• Developed by David Huffman (1952)
• Average of 5 bits per character
• Based on frequency distributions of symbols
• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
[Figure: Huffman tree for the symbols a through j, with each branch labeled 0 or 1]
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
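The construction and the resulting code table can be checked with a short Python sketch. It builds a Huffman code from the frequency table with a priority queue and decodes bit strings with the code table shown above. The function names are hypothetical, and because ties can be broken in different ways, the codewords it builds may differ from the table while the total encoded length stays the same.

```python
import heapq
from itertools import count

FREQ = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
        "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
SLIDE_CODE = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
              "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}

def huffman_code(freq):
    """Repeatedly merge the two least frequent subtrees."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        # Prefix 0 to every code in one subtree and 1 in the other.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]

def total_bits(code, freq):
    """Encoded length of a text with the given symbol frequencies."""
    return sum(freq[s] * len(code[s]) for s in freq)

def decode(bits, code):
    """Decode a bit string; returns (symbols, undecoded leftover bits)."""
    inverse = {c: s for s, c in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:  # safe because no codeword is a prefix of another
            out.append(inverse[buf])
            buf = ""
    return "".join(out), buf

built = huffman_code(FREQ)
encoded = "".join(SLIDE_CODE[s] for s in "BADGE")
```

With these frequencies both codes need 227 bits for the 72 symbols, about 3.15 bits per symbol. The `decode` function returns any leftover bits, which helps when experimenting with corrupted input.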
Exercise 1
• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, and switching some bits at random locations and try decoding.
Ziv-Lempel coding
• Two types - one is known as LZ77 (used in GZIP)
• Code: set of triples <a,b,c>
  • a: how far back in the decoded text to look for the upcoming text segment
  • b: how many characters to copy
  • c: new character to add to complete segment
• <0,0,p> p
• <0,0,e> pe
• <0,0,t> pet
• <2,1,r> peter
• <0,0,_> peter_
• <6,1,i> peter_pi
• <8,2,r> peter_piper
• <6,3,c> peter_piper_pic
• <0,0,k> peter_piper_pick
• <7,1,d> peter_piper_picked
• <7,1,a> peter_piper_picked_a
• <9,2,e> peter_piper_picked_a_pe
• <9,2,_> peter_piper_picked_a_peck_
• <0,0,o> peter_piper_picked_a_peck_o
• <0,0,f> peter_piper_picked_a_peck_of
• <17,5,l> peter_piper_picked_a_peck_of_pickl
• <12,1,d> peter_piper_picked_a_peck_of_pickled
• <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
No. of code triples   Average text length   No. of code triples   Average text length
1                     1.00                  11                    1.82
2                     1.00                  12                    1.92
3                     1.00                  13                    2.00
4                     1.25                  14                    1.93
5                     1.20                  15                    1.87
6                     1.33                  16                    2.13
7                     1.57                  17                    2.12
8                     1.88                  18                    2.22
9                     1.78                  19                    2.26
10                    1.80                  20                    2.20
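The decoding side of LZ77 is short enough to sketch in Python. The triples are the ones from the "peter piper" example; the function name `lz77_decode` and the choice to return the running average of decoded characters per triple alongside the text are assumptions made for illustration.

```python
# Triples <a, b, c> from the example: look back a characters, copy b, append c.
TRIPLES = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]

def lz77_decode(triples):
    """Return the decoded text and the average text length after each triple."""
    text, averages = [], []
    for n, (back, length, char) in enumerate(triples, 1):
        start = len(text) - back
        for i in range(length):
            # Copy one character at a time so a copy may overlap its own output.
            text.append(text[start + i])
        text.append(char)
        averages.append(len(text) / n)  # decoded characters per triple so far
    return "".join(text), averages

decoded, averages = lz77_decode(TRIPLES)
```

The running averages agree with the table up to rounding: the average text length per triple grows as the decoder accumulates more history to copy from.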
Markup languages
Markup languages
• HTML
• SGML
• XML
HTML
• Focus on presentation, not content
SGML
<!SGML "ISO 8879:1986"
CHARSET
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET   0  9 UNUSED
          9  2 9
         11  2 UNUSED
         13  1 13
         14 18 UNUSED
         32 95 32
        127  1 UNUSED
BASESET "ISO Registration Number 109//CHARSET ECMA-94 Right Part of Latin-1 Alphabet Nr.3//ESC 2/9 4/3"
DESCSET 128 32 UNUSED -- no such characters --
        160  1 UNUSED -- nbs character --
        161 94 161 -- 161 through 254 inclusive --
        255  1 UNUSED
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
FUNCTION RE 13
         RS 10
         SPACE 32
         TAB SEPCHAR 9
NAMING LCNMSTRT ""
       UCNMSTRT ""
       LCNMCHAR "_-."
       UCNMCHAR "_-."
       NAMECASE GENERAL NO
                ENTITY NO
DELIM GENERAL SGMLREF
      SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
         ATTCNT 99999999
         ATTSPLEN 99999999
         DTEMPLEN 24000
         ENTLVL 99999999
         GRPCNT 99999999
         GRPGTCNT 99999999
         GRPLVL 99999999
         LITLEN 24000
         NAMELEN 99999999
         PILEN 24000
         TAGLEN 99999999
         TAGLVL 99999999
FEATURES
MINIMIZE DATATAG NO OMITTAG YES RANK YES SHORTTAG YES
LINK SIMPLE YES 1000 IMPLICIT YES EXPLICIT YES 1
OTHER CONCUR NO SUBDOC YES 99999999 FORMAL YES
APPINFO NONE>
<!DOCTYPE DOCSET [
<!--
File: asr.dtd
Author: Jon Fiscus, NIST
Desc: This DTD is intended to parse a TDT2 .tkn file.
-->
<!ELEMENT DOCSET - O (X|W)+>
<!ELEMENT X - O EMPTY >
<!ELEMENT W - O CDATA >
<!ATTLIST DOCSET type (ASRTEXT|NEWSWIRE|CAPTION|TRANSCRIPT|SYSTRAN|ASR_SYSTRAN) #REQUIRED
                 fileid CDATA #REQUIRED
                 collect_date CDATA #REQUIRED
                 collect_src CDATA #REQUIRED
                 src_lang CDATA #REQUIRED
                 content_lang CDATA #REQUIRED
                 proc_remarks CDATA #IMPLIED >
<!ATTLIST W recid CDATA #REQUIRED
            Bsec CDATA #IMPLIED
            Dur CDATA #IMPLIED
            Clust CDATA #IMPLIED
            Conf CDATA #IMPLIED
            tr (Y|N) #IMPLIED >
<!ATTLIST X Bsec CDATA #IMPLIED
            Dur CDATA #IMPLIED
            Conf (NA) #IMPLIED >
]>
XML
example.docsent
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE DOCSENT SYSTEM "../../../../../dtd/docsent.dtd" >
<DOCSENT DID='D-20000408_011.e' DOCNO='17706' LANG='ENG' CORR-DOC='D-20000408_017.c'>
<BODY>
<HEADLINE><S PAR="1" RSNT="1" SNO="1"> Beat Drugs Fund Grants $16 million in Support of 29 Anti-Drug Projects </S></HEADLINE>
<TEXT>
<S PAR='2' RSNT='1' SNO='2'>The Governing Committee of the Beat Drugs Fund , chaired by the Secretary for Security , has approved grants of $16 .39 million for 29 anti-drug projects this year .</S>
<S PAR='3' RSNT='1' SNO='3'>The Commissioner for Narcotics , Mrs Clarie Lo , who is also a member of the Governing Committee , said , "The number of drug abusers aged below 21 dropped by 13 .6 per cent from 2829 in 1998 to 2 443 in 1999 .</S>
<S PAR='3' RSNT='2' SNO='4'>Despite the continuing drop in recent years , we recognise that youths-at-risk are a highly vulnerable group and deserve the full attention of all those working in the anti-drug field . "</S>
<S PAR='4' RSNT='1' SNO='5'> "To prevent our younger generation from abusing drugs , education and publicity is an on-going campaign; and any relaxation in efforts might have adverse consequences , " Mrs Lo added .</S>
<S PAR='5' RSNT='1' SNO='6'>In considering this year 's applications for the Fund , the Governing Committee attached importance to those aiming to steer youths-at-risk away from drugs .</S>
<S PAR='6' RSNT='1' SNO='7'>Amongst the 29 projects approved this year , 22 are related to drug prevention education and publicity ($10 .72 million) , five to treatment and rehabilitation ($2 .98 million)and two to research ($2 .69 million) .</S>
<S PAR='7' RSNT='1' SNO='8'>An amount of $2 .08 million was granted to conduct a pioneering longitudinal research on the development and validation of a drug prevention programme in Hong Kong .</S>
<S PAR='8' RSNT='1' SNO='9'>Youths-at-risk aged between 10 to 15 in selected areas including Tuen Mun and Kwun Tong will be invited to take part in the project .</S>
<S PAR='8' RSNT='2' SNO='10'>Participants will be taught on the adverse effect of drug abuse , social and personal skills to help them identify and resist peer influence to use drugs .</S>
</TEXT>
</BODY>
</DOCSENT>
docsent.dtd
<!-- DTD for sentence-segmented text -->
<!ELEMENT DOCSENT (EXTRACTION-INFO?, BODY)>
<!ATTLIST DOCSENT DID CDATA #REQUIRED
                  DOCNO CDATA #IMPLIED
                  LANG (CHIN|ENG) "ENG"
                  CORR-DOC CDATA #IMPLIED>
<!-- DID : documentid
     LANG: language -->
<!ELEMENT EXTRACTION-INFO EMPTY>
<!ATTLIST EXTRACTION-INFO SYSTEM CDATA #REQUIRED
                          RUN CDATA #IMPLIED
                          COMPRESSION CDATA #REQUIRED
                          QID CDATA #REQUIRED>
<!ELEMENT BODY (HEADLINE?,TEXT)>
<!ELEMENT HEADLINE (S)*>
<!ELEMENT TEXT (S)*>
<!ELEMENT S (#PCDATA)>
<!ATTLIST S PAR CDATA #REQUIRED
            RSNT CDATA #REQUIRED
            SNO CDATA #REQUIRED>
<!-- PAR: paragraph no
     RSNT: relative sentence no (within paragraph)
     SNO: absolute sentence no -->
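As an illustration of why this kind of markup is convenient for retrieval work, the sketch below reads a shortened copy of the example with Python's standard `xml.etree.ElementTree` and pulls out each sentence with its PAR, RSNT, and SNO attributes. The sample omits the XML declaration, the DOCTYPE line, and most sentences to stay short; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.docsent (declaration, DOCTYPE, and most
# sentences left out to keep the sketch small).
SAMPLE = """<DOCSENT DID='D-20000408_011.e' DOCNO='17706' LANG='ENG'>
<BODY>
<HEADLINE><S PAR="1" RSNT="1" SNO="1"> Beat Drugs Fund Grants $16 million in Support of 29 Anti-Drug Projects </S></HEADLINE>
<TEXT>
<S PAR='8' RSNT='1' SNO='9'>Youths-at-risk aged between 10 to 15 in selected areas including Tuen Mun and Kwun Tong will be invited to take part in the project .</S>
</TEXT>
</BODY>
</DOCSENT>"""

root = ET.fromstring(SAMPLE)
# One tuple per sentence: (paragraph no, relative sentence no,
# absolute sentence no, text), in document order.
sentences = [
    (int(s.get("PAR")), int(s.get("RSNT")), int(s.get("SNO")), s.text.strip())
    for s in root.iter("S")
]
```

Note that `ElementTree` only checks well-formedness; it does not validate a document against docsent.dtd, so a validating parser would be needed to enforce the required attributes.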