Search and Decoding Final Project
Identify Type of Articles Using Property of Perplexity
By Chih-Ti Shih
Advisor: Dr. V. Kepuska
2007/12/13  Chih-Ti Shih
Project Outline
• Project Objective
• Building a specialized corpus using the BootCaT toolkit
• Building language models using the CMU language toolkit
• Computing the perplexity of the corpora
• Project results
Project Objective – Introduction to Perplexity
• To measure the performance of a language model, the best way is end-to-end evaluation.
• End-to-end evaluation, however, is expensive and time consuming.
• Perplexity is the most common evaluation metric and provides a fast, efficient way to evaluate the performance of a language model.
Perplexity - 1
PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}

• The inverse of the probability of the test set,
• normalized by the number of words N.
Perplexity - 2
PP(W) = \left[ \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})} \right]^{1/N}

• Bi-gram example:
• P(w_i | w_{i-1}) is the probability of w_i following w_{i-1},
• normalized by the total number of words N.
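The bi-gram perplexity formula can be sketched in plain Python. This is a minimal illustration only: the add-one smoothing and the toy corpus are my assumptions, not part of the project's CMU-toolkit setup.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    # Count unigrams and bigrams in the training data.
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab_size = len(unigrams)

    # PP(W) = [ prod_i 1 / P(w_i | w_{i-1}) ]^(1/N), computed in log space.
    log_sum, n = 0.0, 0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        # Add-one smoothing so unseen bigrams get a nonzero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_sum -= math.log(p)
        n += 1
    return math.exp(log_sum / n)

train = "the market fell the market rose the market fell".split()
test = "the market rose".split()
print(bigram_perplexity(train, test))  # ≈ 2.47
```

A lower value means the test text is better predicted by (i.e. closer in content to) the training data, which is exactly the property the project exploits.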
Project Objective
• Inverse application: use perplexity to identify the content of an article or paper.
• The lower the perplexity, the closer the content of the test corpus is to the training corpus.
• A corpus from the same field will show relatively low perplexity compared to the other corpora.
Project Objective
• Specialized corpora from different fields need to be built.
• In this project, 3 specialized corpora are built: Business, History, and Computer Engineering.
• To test them, 12 articles (4 from each of the 3 fields) are chosen as test corpora.
Building Specialized Corpus using BootCaT toolkit
Steps:
• Select seeds
• Generate n-tuples
• Retrieve URLs
• Fetch the corresponding pages and build the corpus
• Check the corpus content and remove unwanted information
Building Specialized Corpus using BootCaT toolkit: select seeds
The seeds, or keywords, of each corpus are the main factor that directly affects the specialty of the corpus. The more specific the seeds, the more specialized the corpus can be.
Seeds of Business corpus:
Business, Finance, Credit, Loan, Stock, Dow, Nasdaq, Currency, Mutual Funds, ETFs, Bonds, Investing, Taxes, Rea Estate, Property, Wall Street, S&P500, DJIA, Gas price, DAX, Trade, Great Depression, Credit Card, Investment, Market
Building Specialized Corpus using BootCaT toolkit: tuples
• Tuples are generated randomly from the seeds.
• No repeated word is allowed within the same tuple.
Business tuples:
Dow Business "Great Depression" Finance
Stock Business Property S&P500
Property Dow Nasdaq Taxes
Market DJIA "Gas price" Bonds
ETFs Bonds "Gas price" Taxes
"Gas price" Credit Bonds ETFs
Dow ETFs "Gas price" "Wall Street"
Loan Trade Property "Wall Street"
Finance Credit DJIA ETFs
"Rea Estate" Stock Property ETFs
Stock DJIA Bonds Business
Investing Nasdaq "Credit Card" Loan
Finance "Wall Street" Investing "Rea Estate"
Credit Market Investing "Credit Card"
Property "Rea Estate" Credit Loan
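The random tuple generation described above can be sketched as follows. This is a minimal illustration: the tuple length of 4 matches the examples on the slide, but the function name, the number of tuples, and the fixed random seed are my assumptions, not BootCaT's actual implementation.

```python
import random

def generate_tuples(seeds, n_tuples, tuple_size=4, rng_seed=0):
    # Each tuple is a random draw of distinct seeds, so no word
    # repeats within a tuple; the set also avoids duplicate tuples.
    rng = random.Random(rng_seed)
    tuples = set()
    while len(tuples) < n_tuples:
        tuples.add(tuple(rng.sample(seeds, tuple_size)))
    return sorted(tuples)

seeds = ["Business", "Finance", "Credit", "Loan", "Stock",
         "Dow", "Nasdaq", "Currency", "Mutual Funds", "ETFs"]
for t in generate_tuples(seeds, 5):
    print(" ".join(t))
```

Each printed line is one query later sent to the search engine.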
Building Specialized Corpus using BootCaT toolkit: collect information from Yahoo!
• Send the tuples to Yahoo! and collect the URLs of the search-result pages.
• Remove repeated URLs.
• Retrieve the article from each URL.
• Manually remove unwanted information.
Building Specialized Corpus using BootCaT toolkit: resulting corpora
• CBusiness_Corpus_50k.txt: 50k-word business corpus.
• CBusiness_Corpus_100k.txt: 100k-word business corpus.
• CBusiness_Corpus_200k.txt: 200k-word business corpus.
• CHistory_Corpus_50k.txt: 50k-word history corpus.
• CHistory_Corpus_100k.txt: 100k-word history corpus.
• CHistory_Corpus_200k.txt: 200k-word history corpus.
• CComputereng_Corpus_50k.txt: 50k-word Computer Engineering corpus.
• CComputereng_Corpus_100k.txt: 100k-word Computer Engineering corpus.
• CComputereng_Corpus_200k.txt: 200k-word Computer Engineering corpus.
Building language model using CMU LM toolkit
Building language model using CMU LM toolkit
• Build a list of every word that occurs in the training corpus, along with its number of occurrences.
• Build a vocabulary file containing the 20,000 most frequent words.
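These two steps, counting word frequencies and keeping the top 20,000 words, can be sketched in plain Python (roughly what the CMU toolkit's word-frequency and vocabulary stages do; the function name and file handling here are my own, not the toolkit's):

```python
from collections import Counter

def build_vocab(corpus_path, vocab_path, top_n=20000):
    # Count every word in the training corpus.
    with open(corpus_path, encoding="utf-8") as f:
        counts = Counter(f.read().split())
    # Keep only the top_n most frequent words as the vocabulary.
    with open(vocab_path, "w", encoding="utf-8") as f:
        for word, _ in counts.most_common(top_n):
            f.write(word + "\n")
    return counts  # returned for inspection
```

Words outside this vocabulary are treated as out-of-vocabulary by the language model.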
Building language model using CMU LM toolkit
• Generate N-grams; in this project, 5-grams are used.
• Build the language model.
Business Corpus Test Message

                        B1            B2            B3            B4
Perplexity              704.9         846.77        526.39        589.85
Hit 5-grams (count/%)   4 / 0.27      15 / 2.77     9 / 0.97      2 / 0.28
Hit 4-grams (count/%)   37 / 2.48     11 / 2.03     20 / 2.16     17 / 2.4
Hit 3-grams (count/%)   93 / 12.96    67 / 12.36    143 / 15.44   96 / 13.56
Hit 2-grams (count/%)   648 / 43.52   209 / 38.56   402 / 43.41   309 / 43.6
Hit 1-grams (count/%)   607 / 40.77   240 / 44.28   352 / 38.01   284 / 40.11
Building language model using CMU LM toolkit
• Calculate the perplexity of the test articles against each training corpus.
• A better model assigns a higher probability to the test data, which lowers the perplexity.
• Average the perplexity over the 3 corpora from the same field but of different sizes.
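The decision rule implied by these steps, averaging the perplexity over the three corpus sizes per field and picking the field with the lowest average, can be sketched as follows. The function name and the sample numbers are my own illustration, not values from the project's tables.

```python
def identify_field(perplexities):
    # perplexities: {field: [pp_50k, pp_100k, pp_200k]} for one test article.
    # Average over the corpus sizes, then pick the field whose language
    # models give the lowest average perplexity.
    averages = {field: sum(pps) / len(pps) for field, pps in perplexities.items()}
    return min(averages, key=averages.get)

# Illustrative values only.
print(identify_field({
    "Business": [698.6, 710.2, 690.1],
    "History": [959.9, 940.3, 1001.7],
    "ComputerEng": [803.6, 820.4, 795.5],
}))  # → Business
```

The tables on the next slide apply exactly this rule to the 12 test articles.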
Project Result

Test Corpus                  B1         B2         B3         B4
Test article type            Business   Business   Business   Business
Avg. PP of Business Corpus   698.6      1104.25    698.6      1104.25
Avg. PP of History Corpus    959.9      1977.98    959.9      1977.98
Avg. PP of Computer Corpus   803.6      2393.19    803.6      2393.19
Identified type              Business   Business   Business   Business

Test Corpus                  H1         H2         H3         H4
Test article type            History    History    History    History
Avg. PP of Business Corpus   579.403    843.503    668.057    576.917
Avg. PP of History Corpus    520.373    663.913    714.975    827.95
Avg. PP of Computer Corpus   628.535    865.558    734.805    686.505
Identified type              History    History    Business   Business

Test Corpus                  C1           C2           C3           C4
Test article type            Computereng  Computereng  Computereng  Computereng
Avg. PP of Business Corpus   1580.403     663.303      1017.26      777.63
Avg. PP of History Corpus    1868.145     950.668      1070.14      1035.12
Avg. PP of Computer Corpus   1598.153     473.123      859.79       779.373
Identified type              Business     Computereng  Computereng  Business
Project Result
There are a total of 12 test corpora; 8 of them are correctly identified and 4 are wrong. Thus, the error rate is about 33%. Please refer to /perplexity.xls for the detailed experimental results.
Possible ways to improve the result
• Remove the most common words from the vocabulary, because words such as "the", "and", and "it" are not related to the specialized field.
• Adjust the training corpus: usually the best ratio between the training corpus and the test corpus is 1:10. We can use that as a target and dynamically change the size of the training corpus.
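The first suggestion, dropping the most frequent function words before building the vocabulary, can be sketched as follows. The stop list here is a tiny assumed example, and the function name is mine; neither comes from the project.

```python
from collections import Counter

# Assumed toy stop list; a real one would be much longer.
STOPWORDS = {"the", "and", "it", "a", "of", "to", "in", "is"}

def filtered_vocab(tokens, top_n=20000):
    # Count words, skipping stopwords, then keep the top_n most frequent.
    counts = Counter(t for t in tokens if t.lower() not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

print(filtered_vocab("the market and the loan it is a stock".split()))
```

With function words removed, the remaining vocabulary carries more field-specific signal, which should sharpen the perplexity differences between fields.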
Reference
• BootCaT toolkit: Simple Utilities for Bootstrapping Corpora and Terms from the Web, by Marco Baroni and Silvia Bernardini. http://sslmit.unibo.it/~baroni/bootcat.html
• CMU language Toolkit: Carnegie Mellon University. http://www.speech.cs.cmu.edu/
Questions?