Search and Decoding Final Project
Identify Type of Articles Using Property of Perplexity
By Chih-Ti Shih
Advisor: Dr. V. Kepuska
2007/12/13  Chih-Ti Shih
Project Outline
• Project Objective
• Building a specialized corpus using the BootCaT toolkit
• Building language models using the CMU language toolkit
• Computing the perplexity of the corpora
• Project results
Project Objective – Introduction to Perplexity
• To measure the performance of a language model, the best way is end-to-end evaluation.
• End-to-end evaluation, however, is expensive and time consuming.
• Perplexity is the most common evaluation metric and provides a fast, efficient way to evaluate the performance of a language model.
Perplexity - 1
PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}

• The inverse of the probability of the test set,
• normalized by the number of words N.
Perplexity - 2
PP(W) = \left[ \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})} \right]^{1/N}

• Bi-gram example:
• P(w_i | w_{i-1}) is the probability of w_i following w_{i-1},
• normalized by the total number of words N.
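The bi-gram perplexity formula can be sketched in plain Python. This is a minimal illustration only: the add-one smoothing and the toy corpus are my assumptions, not part of the project's CMU-toolkit setup.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    # Count unigrams and bigrams in the training data.
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab_size = len(unigrams)

    # PP(W) = [ prod_i 1 / P(w_i | w_{i-1}) ]^(1/N), computed in log space.
    log_sum, n = 0.0, 0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        # Add-one smoothing so unseen bigrams get a nonzero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_sum -= math.log(p)
        n += 1
    return math.exp(log_sum / n)

train = "the market fell the market rose the market fell".split()
test = "the market rose".split()
print(bigram_perplexity(train, test))  # ≈ 2.47
```

A lower value means the test text is better predicted by (i.e. closer in content to) the training data, which is exactly the property the project exploits.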
Project Objective
• Inverse application: use perplexity to identify the content of an article or paper.
• The lower the perplexity, the closer the content of the test corpus is to the training corpus.
• A corpus from the same field will show relatively low perplexity compared to the other corpora.
Project Objective
• Specialized corpora from different fields need to be built.
• In this project, 3 specialized corpora are built: Business, History, and Computer Engineering.
• To test them, 12 articles (4 from each of the 3 fields) are chosen as test corpora.
Building Specialized Corpus using BootCaT toolkit
Steps:
• Select seeds
• Generate n-tuples
• Retrieve URLs
• Fetch the corresponding pages and build the corpus
• Check the corpus content and remove unwanted information
Building Specialized Corpus using BootCaT toolkit: select seeds
The seeds, or keywords, of each corpus are the main factor that directly affects the specialty of the corpus. The more specific the seeds, the more specialized the corpus can be.
Seeds of Business corpus:
Business, Finance, Credit, Loan, Stock, Dow, Nasdaq, Currency, Mutual Funds, ETFs, Bonds, Investing, Taxes, Rea Estate, Property, Wall Street, S&P500, DJIA, Gas price, DAX, Trade, Great Depression, Credit Card, Investment, Market
Building Specialized Corpus using BootCaT toolkit: tuples
• Tuples are generated randomly from the seeds.
• No repeated word is allowed within the same tuple.
Business tuples:
Dow Business "Great Depression" Finance
Stock Business Property S&P500
Property Dow Nasdaq Taxes
Market DJIA "Gas price" Bonds
ETFs Bonds "Gas price" Taxes
"Gas price" Credit Bonds ETFs
Dow ETFs "Gas price" "Wall Street"
Loan Trade Property "Wall Street"
Finance Credit DJIA ETFs
"Rea Estate" Stock Property ETFs
Stock DJIA Bonds Business
Investing Nasdaq "Credit Card" Loan
Finance "Wall Street" Investing "Rea Estate"
Credit Market Investing "Credit Card"
Property "Rea Estate" Credit Loan
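The random tuple generation described above can be sketched as follows. This is a minimal illustration: the tuple length of 4 matches the examples on the slide, but the function name, the number of tuples, and the fixed random seed are my assumptions, not BootCaT's actual implementation.

```python
import random

def generate_tuples(seeds, n_tuples, tuple_size=4, rng_seed=0):
    # Each tuple is a random draw of distinct seeds, so no word
    # repeats within a tuple; the set also avoids duplicate tuples.
    rng = random.Random(rng_seed)
    tuples = set()
    while len(tuples) < n_tuples:
        tuples.add(tuple(rng.sample(seeds, tuple_size)))
    return sorted(tuples)

seeds = ["Business", "Finance", "Credit", "Loan", "Stock",
         "Dow", "Nasdaq", "Currency", "Mutual Funds", "ETFs"]
for t in generate_tuples(seeds, 5):
    print(" ".join(t))
```

Each printed line is one query later sent to the search engine.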
Building Specialized Corpus using BootCaT toolkit: collect information from Yahoo!
• Send the tuples to Yahoo! and collect the URLs of the search-result pages.
• Remove repeated URLs.
• Retrieve the article from each URL.
• Manually remove unwanted information.
Building Specialized Corpus using BootCaT toolkit: resulting corpora
• CBusiness_Corpus_50k.txt: 50k-word business corpus.
• CBusiness_Corpus_100k.txt: 100k-word business corpus.
• CBusiness_Corpus_200k.txt: 200k-word business corpus.
• CHistory_Corpus_50k.txt: 50k-word history corpus.
• CHistory_Corpus_100k.txt: 100k-word history corpus.
• CHistory_Corpus_200k.txt: 200k-word history corpus.
• CComputereng_Corpus_50k.txt: 50k-word Computer Engineering corpus.
• CComputereng_Corpus_100k.txt: 100k-word Computer Engineering corpus.
• CComputereng_Corpus_200k.txt: 200k-word Computer Engineering corpus.
Building language model using CMU LM toolkit
Building language model using CMU LM toolkit
• Build a list of every word that occurs in the training corpus, along with its number of occurrences.
• Build a vocabulary file containing the 20,000 most frequent words.
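These two steps, counting word frequencies and keeping the top 20,000 words, can be sketched in plain Python (roughly what the CMU toolkit's word-frequency and vocabulary stages do; the function name and file handling here are my own, not the toolkit's):

```python
from collections import Counter

def build_vocab(corpus_path, vocab_path, top_n=20000):
    # Count every word in the training corpus.
    with open(corpus_path, encoding="utf-8") as f:
        counts = Counter(f.read().split())
    # Keep only the top_n most frequent words as the vocabulary.
    with open(vocab_path, "w", encoding="utf-8") as f:
        for word, _ in counts.most_common(top_n):
            f.write(word + "\n")
    return counts  # returned for inspection
```

Words outside this vocabulary are treated as out-of-vocabulary by the language model.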
Building language model using CMU LM toolkit
• Generate N-grams; in this project, 5-grams are used.
• Build the language model.
Business Corpus Test Message

                        B1            B2            B3            B4
Perplexity              704.9         846.77        526.39        589.85
Hit 5-grams (count/%)   4 / 0.27      15 / 2.77     9 / 0.97      2 / 0.28
Hit 4-grams (count/%)   37 / 2.48     11 / 2.03     20 / 2.16     17 / 2.4
Hit 3-grams (count/%)   93 / 12.96    67 / 12.36    143 / 15.44   96 / 13.56
Hit 2-grams (count/%)   648 / 43.52   209 / 38.56   402 / 43.41   309 / 43.6
Hit 1-grams (count/%)   607 / 40.77   240 / 44.28   352 / 38.01   284 / 40.11
Building language model using CMU LM toolkit
• Calculate the perplexity of the test articles against each training corpus.
• A better model assigns a higher probability to the test data, which lowers the perplexity.
• Average the perplexity over the 3 corpora from the same field but of different sizes.
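The decision rule implied by these steps, averaging the perplexity over the three corpus sizes per field and picking the field with the lowest average, can be sketched as follows. The function name and the sample numbers are my own illustration, not values from the project's tables.

```python
def identify_field(perplexities):
    # perplexities: {field: [pp_50k, pp_100k, pp_200k]} for one test article.
    # Average over the corpus sizes, then pick the field whose language
    # models give the lowest average perplexity.
    averages = {field: sum(pps) / len(pps) for field, pps in perplexities.items()}
    return min(averages, key=averages.get)

# Illustrative values only.
print(identify_field({
    "Business": [698.6, 710.2, 690.1],
    "History": [959.9, 940.3, 1001.7],
    "ComputerEng": [803.6, 820.4, 795.5],
}))  # → Business
```

The tables on the next slide apply exactly this rule to the 12 test articles.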
Project Result

Test Corpus                  B1         B2         B3         B4
Test article type            Business   Business   Business   Business
Avg. PP of Business Corpus   698.6      1104.25    698.6      1104.25
Avg. PP of History Corpus    959.9      1977.98    959.9      1977.98
Avg. PP of Computer Corpus   803.6      2393.19    803.6      2393.19
Identified type              Business   Business   Business   Business

Test Corpus                  H1         H2         H3         H4
Test article type            History    History    History    History
Avg. PP of Business Corpus   579.403    843.503    668.057    576.917
Avg. PP of History Corpus    520.373    663.913    714.975    827.95
Avg. PP of Computer Corpus   628.535    865.558    734.805    686.505
Identified type              History    History    Business   Business

Test Corpus                  C1           C2           C3           C4
Test article type            Computereng  Computereng  Computereng  Computereng
Avg. PP of Business Corpus   1580.403     663.303      1017.26      777.63
Avg. PP of History Corpus    1868.145     950.668      1070.14      1035.12
Avg. PP of Computer Corpus   1598.153     473.123      859.79       779.373
Identified type              Business     Computereng  Computereng  Business
Project Result
There are a total of 12 test corpora; 8 of them are correctly identified and 4 are wrong. Thus, the error rate is about 33%. Please refer to /perplexity.xls for the detailed experimental results.
Possible ways to improve the result
• Remove the most common words from the vocabulary, because words such as "the", "and", and "it" are not related to the specialized field.
• Adjust the training corpus: usually the best ratio between the training corpus and the test corpus is 1:10. We can use that as a target and dynamically change the size of the training corpus.
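The first suggestion, dropping the most frequent function words before building the vocabulary, can be sketched as follows. The stop list here is a tiny assumed example, and the function name is mine; neither comes from the project.

```python
from collections import Counter

# Assumed toy stop list; a real one would be much longer.
STOPWORDS = {"the", "and", "it", "a", "of", "to", "in", "is"}

def filtered_vocab(tokens, top_n=20000):
    # Count words, skipping stopwords, then keep the top_n most frequent.
    counts = Counter(t for t in tokens if t.lower() not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

print(filtered_vocab("the market and the loan it is a stock".split()))
```

With function words removed, the remaining vocabulary carries more field-specific signal, which should sharpen the perplexity differences between fields.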
Reference
• BootCaT toolkit: Simple Utilities for Bootstrapping Corpora and Terms from the Web, by Marco Baroni and Silvia Bernardini. http://sslmit.unibo.it/~baroni/bootcat.html
• CMU language Toolkit: Carnegie Mellon University. http://www.speech.cs.cmu.edu/
Questions?