Content Classification Analysis based on LDA Topic Model
PROJECT LEADER: HONGBO ZHAO
Web crawler: acquiring web news; Chinese parsing & extracting
Advanced TF-IDF: content processing; adding content-based tests; finding best parameters in small data
Testing parameters: testing in big data; comparing to content-based algorithm
Web crawler: acquiring nearly 17,000 web news articles from the Sogou database. The raw text includes some HTML characters, but their amount is insignificant.
Web crawler: using ICTCLAS to segment the text and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals.
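The filtering step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the segmenter has already produced `(word, tag)` pairs, and the tag labels and the tiny stop-word list are assumptions chosen for the example.

```python
# Assumed part-of-speech labels; the tag set actually used may differ.
EXCLUDED_POS = {"c", "p", "m"}  # conjunction, preposition, numeral
# Tiny illustrative stop-word list, not the one used in the project.
STOP_WORDS = {"的", "了", "是"}

def filter_tokens(tagged_tokens, stop_words=STOP_WORDS, excluded_pos=EXCLUDED_POS):
    """Keep words that are neither stop words nor in an excluded word class."""
    return [word for word, pos in tagged_tokens
            if word not in stop_words and pos not in excluded_pos]

# Hypothetical segmenter output for one sentence.
sample = [("北京", "ns"), ("的", "u"), ("和", "c"),
          ("在", "p"), ("三", "m"), ("新闻", "n")]
print(filter_tokens(sample))
```

Only the place name and the noun survive the filter in this sample; the other tokens are dropped as a stop word, a conjunction, a preposition and a numeral.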
Advanced TF-IDF: each news article is split into TITLE, BEGIN, CONTENT and END sections, each with a different weight.
TF-IDF is used to compute the top 5 keywords; compared against the sorted database, the accuracy is 81%.
[Figure: a news article divided into TITLE, BEGIN, CONTENT and END sections]
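A section-weighted TF-IDF of this kind might look like the sketch below. The weight values, the smoothed IDF formula and the function names are assumptions made for illustration; the slides do not state the actual weights or formula.

```python
import math
from collections import Counter

# Assumed section weights; the real values are not given in the slides.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}

def weighted_tf(doc):
    """doc maps a section name to its word list; returns weighted term counts."""
    tf = Counter()
    for section, words in doc.items():
        for word in words:
            tf[word] += SECTION_WEIGHTS[section]
    return tf

def top_keywords(doc, corpus, k=5):
    """Rank the words of doc by weighted TF times a smoothed IDF over corpus."""
    n = len(corpus)
    df = Counter()
    for other in corpus:
        # Count each word once per document.
        df.update({w for words in other.values() for w in words})
    tf = weighted_tf(doc)
    scores = {w: c * math.log((1 + n) / (1 + df[w])) for w, c in tf.items()}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

doc_a = {"TITLE": ["football"], "BEGIN": ["match"],
         "CONTENT": ["team", "news"], "END": ["score"]}
doc_b = {"TITLE": ["stock"], "BEGIN": ["market"],
         "CONTENT": ["price", "news"], "END": ["index"]}
print(top_keywords(doc_a, [doc_a, doc_b], k=2))
```

In this toy corpus the title word outranks the body words because of its higher section weight, and "news" scores zero because it appears in every document.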
Advanced TF-IDF: adding a content-based algorithm. As the semantic weight goes from 1.0 to 0.0, the accuracy only moves from 81% to 82%, so there is no significant change. We conclude that semantics is not useful in this setting.
Advanced TF-IDF: searching for the best parameters on small data (fewer than 2,000 news articles), taking both accuracy and time efficiency into account.
Testing set = 30% of the whole data
Training set = 70% of the whole data
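The 70/30 split can be sketched as below. The function name, the shuffling and the fixed seed are assumptions for illustration; the slides do not say how the articles were assigned to each set.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into training and testing lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# 2,000 placeholder article ids standing in for the small data set.
train, test = split_dataset(range(2000))
print(len(train), len(test))
```

Shuffling before cutting avoids a split that follows the order in which the articles were crawled.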
Advanced TF-IDF: the same number of keywords is used in the training and testing sets
keywords number   error score   accuracy
5                 2256          0.4983277591973244
10                1579          0.5735785953177257
15                1335          0.6304347826086957
20                1276          0.6789297658862876
ALL               1720          0.7190635451505016
Unstable
Advanced TF-IDF: using all keywords in the training set
keywords number   error score   accuracy
5                 1877          0.6471571906354515
10                1423          0.7073578595317725
15                1457          0.7006688963210702
20                1474          0.7056856187290970
Extremely low speed
Advanced TF-IDF: using all keywords in the testing set
keywords number   error score   accuracy
5                 1378          0.6321070234113713
10                1413          0.7040133779264214
15                1333          0.7107023411371237
20                1468          0.7257525083612040
With 10 keywords in the training set, accuracy, error score and time efficiency are best balanced.
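One way keywords could drive the classification is sketched below. The slides do not describe the matching rule, so the per-category keyword profiles, the overlap count and all names here are assumptions, shown only to make the role of the keyword number concrete.

```python
from collections import Counter

def build_profiles(training, k=10):
    """training is a list of (category, keywords) pairs.
    Keep the k most frequent keywords per category (k is the assumed tuning parameter)."""
    counts = {}
    for category, keywords in training:
        counts.setdefault(category, Counter()).update(keywords)
    return {c: {w for w, _ in cnt.most_common(k)} for c, cnt in counts.items()}

def classify(keywords, profiles):
    """Return the category whose profile shares the most keywords with the article.
    Ties go to the alphabetically first category."""
    article = set(keywords)
    return max(sorted(profiles), key=lambda c: len(profiles[c] & article))

# Made-up training data for illustration only.
training = [
    ("sports", ["football", "match", "team"]),
    ("sports", ["match", "score"]),
    ("finance", ["stock", "market", "price"]),
]
profiles = build_profiles(training, k=10)
print(classify(["match", "team", "price"], profiles))
```

A larger k gives richer profiles but more set comparisons per article, which is consistent with the accuracy and speed trade-off reported in the tables.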
Testing parameters: testing on big data. As the training set in each section grows gradually to 200, 450, 750 and finally 1,343 (all words), the accuracy changes as shown in the figure. The final accuracy reaches 82.5%, or 85.1% when the culture section is excluded. The results support the parameters we selected.
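Reporting accuracy both overall and with one section excluded can be sketched as follows. The labels below are made up for illustration and are not results from the project; the function name and the `exclude` parameter are assumptions.

```python
def accuracy(results, exclude=()):
    """results is a list of (true_category, predicted_category) pairs.
    Articles whose true category is in exclude are left out of the count."""
    kept = [(t, p) for t, p in results if t not in exclude]
    if not kept:
        raise ValueError("no results left after exclusion")
    return sum(t == p for t, p in kept) / len(kept)

# Illustrative predictions only.
results = [
    ("sports", "sports"), ("sports", "sports"), ("finance", "finance"),
    ("culture", "finance"), ("culture", "culture"),
]
print(accuracy(results))
print(accuracy(results, exclude={"culture"}))
```

Excluding a section that is often misclassified raises the reported figure, which is why both numbers are worth stating side by side.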
Testing parameters: compared to the content-based algorithm, the accuracy is higher; however, the time efficiency is lower.
Summary
Partial encoding and decoding problems
Errors in keyword parsing lead to classification faults
Partially repeated passages lead to errors in accuracy
The algorithm is successful in general
Thanks