Content Classification Analysis based on LDA Topic Model PROJECT LEADER: HONGBO ZHAO


Page 2: Content Classification Analysis based on LDA Topic Model

Web crawler: acquiring web news; Chinese parsing & extracting

Advanced TF-IDF: contents processing; adding content-based tests; finding best parameters in small data

Testing parameters: testing in big data; comparing to content-based algorithm

Page 3: Web crawler

Acquired nearly 17,000 web news articles through the Sougou Database. The raw data still includes some insignificant HTML characters.

Page 4: Web crawler

Used ICTCLAS to segment the text and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals.
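The filtering step can be illustrated with a minimal sketch. It assumes the segmenter has already produced (word, tag) pairs and that conjunctions, prepositions and numerals carry tags starting with "c", "p" and "m"; the tag prefixes, the stop-word list and the function name are illustrative assumptions, not taken from the slides or from the ICTCLAS API.

```python
# Assumed part-of-speech tag prefixes: conjunction, preposition, numeral.
EXCLUDED_TAG_PREFIXES = ("c", "p", "m")

# Tiny illustrative stop-word list (hypothetical; a real list is much longer).
STOP_WORDS = {"的", "了", "是"}


def filter_tokens(tagged_tokens):
    """Keep words that are neither stop words nor excluded parts of speech."""
    return [
        word
        for word, tag in tagged_tokens
        if word not in STOP_WORDS and not tag.startswith(EXCLUDED_TAG_PREFIXES)
    ]


tagged = [("中国", "ns"), ("的", "u"), ("和", "c"),
          ("在", "p"), ("三", "m"), ("足球", "n")]
kept = filter_tokens(tagged)
```

Here `kept` holds only the two content words, since the other four tokens are a stop word, a conjunction, a preposition and a numeral.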

Page 5: Advanced TF-IDF

Each news article is split into TITLE, BEGIN, CONTENT and END sections, which are given different weights.

Using TF-IDF to compute the top 5 keywords, the accuracy is 81% compared with the sorted database.

[Figure: an article divided into TITLE, BEGIN, CONTENT and END sections]
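One way to read the section weighting is that a term occurrence counts more when it appears in a more important section. The sketch below follows that reading; the weight values, the function names and the exact TF-IDF variant (normalized weighted count times log(N/df)) are assumptions, since the slides do not give them.

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not state the actual values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}


def weighted_tf(doc):
    """doc maps a section name to its token list; returns weighted term counts."""
    tf = Counter()
    for section, tokens in doc.items():
        for token in tokens:
            tf[token] += SECTION_WEIGHTS[section]
    return tf


def top_keywords(docs, index, k=5):
    """Return the k highest-scoring TF-IDF keywords of docs[index]."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update({tok for tokens in doc.values() for tok in tokens})
    tf = weighted_tf(docs[index])
    total = sum(tf.values())
    scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    {"TITLE": ["football"], "BEGIN": ["match"],
     "CONTENT": ["match", "today"], "END": ["today"]},
    {"TITLE": ["stock"], "BEGIN": ["market"],
     "CONTENT": ["market", "today"], "END": ["today"]},
]
keywords = top_keywords(docs, 0, k=2)
```

In this toy corpus "today" occurs in both articles, so its inverse document frequency is zero and it ranks below the two topic words.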

Page 6: Advanced TF-IDF

Adding the content-based algorithm: the accuracy moves only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, which is not a significant change. We conclude that the semantic component is not useful in this setting.

Page 7: Advanced TF-IDF

Searching for the best parameters on a small data set (fewer than 2,000 news articles), considering both accuracy and time efficiency.

testing set = 30% of the whole data

training set = 70% of the whole data
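A 70/30 split like the one above can be sketched as follows. The slides do not say how articles were assigned to each set, so the random shuffle, the fixed seed and the function name are assumptions.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items and split it into (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


articles = list(range(10))  # stand-in for 10 news articles
train, test = split_train_test(articles)
```

The fixed seed keeps the split stable between runs, which matters when different parameter settings are compared on the same data.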

Page 8: Advanced TF-IDF

The number of keywords in the training sets equals the number in the testing sets.

keywords number   error score   accuracy
5                 2256          0.4983277591973244
10                1579          0.5735785953177257
15                1335          0.6304347826086957
20                1276          0.6789297658862876
ALL               1720          0.7190635451505016

Unstable

Page 9: Advanced TF-IDF

Using all keywords in the training sets.

keywords number   error score   accuracy
5                 1877          0.6471571906354515
10                1423          0.7073578595317725
15                1457          0.7006688963210702
20                1474          0.7056856187290970

Extremely low speed

Page 10: Advanced TF-IDF

Using all keywords in the testing sets.

keywords number   error score   accuracy
5                 1378          0.6321070234113713
10                1413          0.7040133779264214
15                1333          0.7107023411371237
20                1468          0.7257525083612040

Using 10 keywords in the training sets gives the best balance of accuracy, error score and time efficiency.
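To make the "keywords number" parameter concrete, here is a minimal sketch of a sweep over it. The slides do not describe the classifier, so this assumes a simple one: each category keeps its k most frequent training tokens, and a test article goes to the category whose keyword set it overlaps most. All names and the toy data are illustrative, and the error score is left out because the slides do not define it.

```python
from collections import Counter


def category_keywords(train, k):
    """train: list of (category, tokens). Keep the k most frequent tokens per category."""
    counts = {}
    for category, tokens in train:
        counts.setdefault(category, Counter()).update(tokens)
    return {c: {t for t, _ in cnt.most_common(k)} for c, cnt in counts.items()}


def classify(tokens, keywords):
    """Pick the category with the largest keyword overlap (ties: alphabetical order)."""
    token_set = set(tokens)
    return max(sorted(keywords), key=lambda c: len(keywords[c] & token_set))


def accuracy(train, test, k):
    """Fraction of test articles assigned to their true category."""
    keywords = category_keywords(train, k)
    correct = sum(classify(tokens, keywords) == category for category, tokens in test)
    return correct / len(test)


train_docs = [("sports", ["match", "goal", "team", "match"]),
              ("finance", ["stock", "market", "bank", "stock"])]
test_docs = [("sports", ["team", "goal"]),
             ("finance", ["bank", "stock"]),
             ("sports", ["market"])]

# Sweep the keyword count, as in the tables above (toy values of k).
results = {k: accuracy(train_docs, test_docs, k) for k in (1, 3)}
```

On real data the same loop would run over k = 5, 10, 15, 20 and also record the running time, since the slides weigh accuracy against speed.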

Page 11: Testing parameters

Testing on big data: as the training set for each section grows to 200, 450, 750 and finally 1,343 (all words), the accuracy changes as shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
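The two accuracy figures differ only in whether one section is left out of the evaluation. A minimal sketch of that calculation, assuming the exclusion is applied by true label (the slides do not say) and using made-up labels rather than the project's data:

```python
def accuracy(pairs, exclude=None):
    """pairs: list of (true_label, predicted_label); optionally skip one true label."""
    kept = [(true, pred) for true, pred in pairs if true != exclude]
    return sum(true == pred for true, pred in kept) / len(kept)


# Toy predictions, for illustration only.
pairs = [("sports", "sports"), ("culture", "sports"),
         ("finance", "finance"), ("culture", "culture")]

overall = accuracy(pairs)
without_culture = accuracy(pairs, exclude="culture")
```

A gap between the two numbers indicates that the excluded section is harder to classify than the rest.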

Page 12: Testing parameters

Compared with the content-based algorithm, the accuracy is higher; however, the time efficiency is lower.

Page 13: Summary

Some encoding & decoding problems remain.

Errors in keyword parsing lead to classification faults.

Some repeated passages lead to errors in accuracy.

The algorithm is successful in general.

Page 14: Thanks