Content Classification Analysis based on LDA Topic Model

PROJECT LEADER: HONGBO ZHAO


Web crawler
  collecting web news
  Chinese parsing & extracting

Advanced TF-IDF
  content processing
  adding content-based tests
  finding the best parameters on small data

Testing parameters
  testing on big data
  comparing to the content-based algorithm

Web crawler: collecting web news

The crawler collected nearly 17,000 web news articles from the Sogou database. The raw text includes HTML characters and other insignificant content.

Web crawler: Chinese parsing & extracting

ICTCLAS is used to parse the text and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals.
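
The filtering step described above can be sketched as follows. This is only an illustration: it assumes the segmenter has already produced (word, part-of-speech tag) pairs and does not call ICTCLAS itself. The tag letters (c, p, m for conjunction, preposition, numeral) and the tiny stop-word list are assumptions made for the example.

```python
# Hypothetical stop-word list and tag set, chosen only for illustration.
STOP_WORDS = {"的", "了", "是"}
EXCLUDED_TAGS = {"c", "p", "m"}  # assumed: conjunction, preposition, numeral


def filter_tokens(tagged_tokens):
    """Keep words that are neither stop words nor in an excluded word class."""
    kept = []
    for word, tag in tagged_tokens:
        if word in STOP_WORDS:
            continue
        # Compare on the first letter so sub-tags such as "mq" are also dropped.
        if tag[:1] in EXCLUDED_TAGS:
            continue
        kept.append(word)
    return kept


sample = [("北京", "ns"), ("和", "c"), ("上海", "ns"), ("的", "u"),
          ("三", "m"), ("在", "p"), ("新闻", "n")]
print(filter_tokens(sample))
```

Only the place names and the noun survive in this toy sentence; a real run would use the full stop-word list and tag set of the segmenter in use.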

Advanced TF-IDF: content processing

Each news article is split into TITLE, BEGIN, CONTENT and END sections, and each section is given a different weight.

The top 5 keywords are computed with TF-IDF; the accuracy is 81% compared with the sorted database.

[Figure: an article divided into TITLE, BEGIN, CONTENT and END sections]
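
A minimal sketch of section-weighted TF-IDF keyword extraction is given below. The slides do not state the actual section weights or the exact IDF formula, so the weights in SECTION_WEIGHTS and the smoothed IDF used here are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not give the actual values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}


def weighted_tf(doc):
    """doc maps a section name to its word list; returns weighted term counts."""
    tf = Counter()
    for section, words in doc.items():
        for w in words:
            tf[w] += SECTION_WEIGHTS[section]
    return tf


def top_keywords(doc, corpus, k=5):
    """Rank the words of doc by weighted TF times a smoothed IDF over corpus."""
    n_docs = len(corpus)
    df = Counter()
    for other in corpus:
        vocab = set()
        for words in other.values():
            vocab.update(words)
        df.update(vocab)  # each document counts a word at most once
    tf = weighted_tf(doc)
    scores = {w: c * math.log((1 + n_docs) / (1 + df[w])) for w, c in tf.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


doc1 = {"TITLE": ["football", "final"], "BEGIN": ["football", "team"],
        "CONTENT": ["team", "goal", "news"], "END": ["news"]}
doc2 = {"TITLE": ["stock", "market"], "BEGIN": ["market"],
        "CONTENT": ["price", "news"], "END": ["news"]}
corpus = [doc1, doc2]
print(top_keywords(doc1, corpus, k=2))
```

A word that appears in the title counts more than the same word in the body, while a word found in every document (here "news") gets an IDF of zero and drops to the bottom of the ranking.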

Advanced TF-IDF: adding content-based tests

Adding the content-based algorithm changes the accuracy only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, which is not a significant change. We conclude that semantics is not useful in this circumstance.



Advanced TF-IDF: finding the best parameters on small data

Testing for the best parameters on small data (fewer than 2,000 news articles), considering accuracy and time efficiency.

Testing set = 30% of the whole data
Training set = 70% of the whole data
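
The 70/30 split can be sketched as below. The slides do not say whether the split was random or done per category, so the seeded random shuffle here is an assumption, and split_train_test is a hypothetical helper name.

```python
import random


def split_train_test(items, test_ratio=0.3, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(2000))
print(len(train), len(test))
```

If the categories are unbalanced, applying the same function to each category separately keeps the 70/30 proportion within every category.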



Advanced TF-IDF: number of keywords in the training sets equal to that in the testing sets

keywords number    error score    accuracy
5                  2256           0.4983277591973244
10                 1579           0.5735785953177257
15                 1335           0.6304347826086957
20                 1276           0.6789297658862876
ALL                1720           0.7190635451505016

Note: unstable.

Advanced TF-IDF: using all keywords in the training sets

keywords number    error score    accuracy
5                  1877           0.6471571906354515
10                 1423           0.7073578595317725
15                 1457           0.7006688963210702
20                 1474           0.7056856187290970

Note: extremely low speed.

Advanced TF-IDF: using all keywords in the testing sets

keywords number    error score    accuracy
5                  1378           0.6321070234113713
10                 1413           0.7040133779264214
15                 1333           0.7107023411371237
20                 1468           0.7257525083612040

Using 10 keywords in the training sets gives the best combination of accuracy, error score and time efficiency.
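
How the keyword count enters such an experiment can be illustrated with a toy keyword-overlap classifier. The slides do not describe the classifier itself, so this is an assumed stand-in rather than the method used in the project: each category keeps its k most frequent training keywords, and a test article is assigned to the category it shares the most keywords with. All function names and the sample data are hypothetical.

```python
from collections import Counter


def category_keywords(train_docs, k=10):
    """train_docs: list of (category, keyword list). Top-k keywords per category."""
    counts = {}
    for category, keywords in train_docs:
        counts.setdefault(category, Counter()).update(keywords)
    return {c: {w for w, _ in cnt.most_common(k)} for c, cnt in counts.items()}


def classify(keywords, profiles):
    """Return the category whose keyword set shares the most words with keywords."""
    # Iterating in sorted order makes ties resolve to the first name alphabetically.
    return max(sorted(profiles), key=lambda c: len(profiles[c] & set(keywords)))


def accuracy(test_docs, profiles):
    """Fraction of test articles assigned to their true category."""
    correct = sum(classify(kw, profiles) == cat for cat, kw in test_docs)
    return correct / len(test_docs)


train_docs = [("sports", ["football", "goal", "team"]), ("sports", ["team", "match"]),
              ("finance", ["stock", "market"]), ("finance", ["market", "price"])]
test_docs = [("sports", ["team", "goal"]), ("finance", ["price", "bank"]),
             ("finance", ["team", "coach"])]
profiles = category_keywords(train_docs, k=10)
print(accuracy(test_docs, profiles))
```

Repeating the last two lines for k = 5, 10, 15, 20 and timing each run would give an accuracy and speed comparison of the kind tabulated above.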

Testing parameters: testing on big data

On big data, the training set in every section is increased gradually to 200, 450, 750 and finally 1343 (all words); the accuracy is shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
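
The two final figures differ only in whether one section is left out of the evaluation. A small sketch of that calculation follows; the (true, predicted) pairs are made-up toy data, not the project's results, and accuracy_excluding is a hypothetical helper.

```python
def accuracy_excluding(pairs, exclude=()):
    """pairs: list of (true_category, predicted_category) tuples."""
    kept = [(t, p) for t, p in pairs if t not in exclude]
    return sum(t == p for t, p in kept) / len(kept)


# Toy predictions for illustration only.
pairs = [("sports", "sports"), ("sports", "sports"), ("finance", "finance"),
         ("culture", "finance"), ("culture", "culture"), ("culture", "sports")]
print(accuracy_excluding(pairs))
print(accuracy_excluding(pairs, exclude={"culture"}))
```

Here the articles are filtered by their true category, so a harder section lowers the overall figure without affecting the score of the remaining sections.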

Testing parameters: comparing to the content-based algorithm

Compared to the content-based algorithm, the accuracy is higher; however, the time efficiency is lower.

Summary

Some encoding and decoding problems remain.
Errors in keyword parsing lead to classification faults.
Partially repeated passages lead to errors in accuracy.
The algorithm is successful in general.

Thanks

