A SVM Applied Text Categorization of Academia-Industry Collaborative
Research and Development Documents on the Web
Kei Kurakawa1, Yuan Sun1, Nagayoshi Yamashita2, Yasumasa Baba3
1. National Institute of Informatics 2. GMO Research (formerly Japan Society for the Promotion of Science) 3. The Institute of Statistical Mathematics
Analysis and Modeling of Complex Data in Behavioral and Social Sciences Joint meeting of Japanese and Italian Classification Societies Anacapri (Capri Island, Italy), 3-4 September 2012
U-I-G relations
• In making science and technology R&D policy, university-industry-government (U-I-G) relations are an important aspect to investigate (Leydesdorff and Meyer, 2003).
• Web documents are one research target for clarifying the state of these relations.
• In that clarification process, the first requirement is obtaining exact resources on U-I-G relations.

[Figure: Venn diagram of U, I, and G]
Objective
• Our objective is to automatically extract resources on U-I relations from the web.
• We target the "press release articles" of organizations and build a framework to crawl them automatically and decide which are about U-I relations.

[Figure: Venn diagram of U, I, and G]
Automatic extraction framework for U-I relations documents on the web
1. Crawling web documents: press release articles published on university or company web sites → crawled documents
2. Extracting text from the documents → extracted texts
3. Learning to classify the documents → learned model file
4. Classifying the documents
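The four stages above can be sketched end-to-end. Everything here is illustrative: the helper names are hypothetical, stage 1 (crawling) is stubbed with a fixed page, and a keyword set stands in for the learned SVM model file.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Stage 2: strip tags to recover plain text from a crawled page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def classify(text, model):
    """Stage 4: decide U-I relation with a learned model.
    Here `model` is a stub: a set of evidence keywords."""
    return any(kw in text for kw in model)

# Stage 1 (crawling) stubbed with a fixed press-release page.
page = "<html><body><p>MIT and MicroCHIPS Inc. start joint research.</p></body></html>"
model = {"joint research", "collaboration"}  # stands in for the learned model file
text = extract_text(page)
print(classify(text, model))  # True: the page mentions joint research
```

In the real framework, stage 3 would train the SVM on labelled extracted texts and persist it as the model file consumed by stage 4.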
Support Vector Machine (1) (Vapnik, 1995)
• Two-class classifier
• Input vectors: $\mathbf{x}_1, \ldots, \mathbf{x}_N$; target values: $t_1, \ldots, t_N$, where $t_n \in \{-1, 1\}$
• Linear model: $y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b$, where $\phi$ is a fixed feature-space transformation and $b$ is a bias parameter
• For all input vectors, $t_n y(\mathbf{x}_n) > 0$
• Maximize the margin between the decision hyperplane $y(\mathbf{x}) = 0$ and the support vectors on $y(\mathbf{x}) = 1$ and $y(\mathbf{x}) = -1$

[Figure: maximum-margin decision boundary, with the margin and the support vectors marked]
Support Vector Machine (2)
• Optimization problem:
$$\arg\min_{\mathbf{w},\,b} \; \frac{1}{2}\|\mathbf{w}\|^{2}, \quad \text{subject to the constraints } t_n(\mathbf{w}^{T}\phi(\mathbf{x}_n) + b) \ge 1, \; n = 1, \ldots, N$$
• By means of the Lagrangian method:
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b,$$
where the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{T}\phi(\mathbf{x}')$, and the $a_n \ge 0$ are Lagrange multipliers (only the support vectors have $a_n > 0$).
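The margin-maximization objective above can be illustrated with a stochastic sub-gradient method on the primal hinge-loss formulation (Pegasos-style). This is an assumption for illustration only, not the SVM^light solver used in the experiments; the bias term is omitted, as is usual for this variant.

```python
import random

def train_linear_svm(xs, ts, lam=0.01, epochs=200):
    """Minimise (lam/2)*||w||^2 + average hinge loss by stochastic
    sub-gradient descent (Pegasos-style sketch, bias omitted)."""
    random.seed(0)
    w = [0.0] * len(xs[0])
    t_step = 0
    for _ in range(epochs):
        for i in random.sample(range(len(xs)), len(xs)):
            t_step += 1
            eta = 1.0 / (lam * t_step)                 # decaying step size
            margin = ts[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
            w = [wj * (1.0 - eta * lam) for wj in w]   # regulariser shrinkage
            if margin < 1:                             # hinge active inside the margin
                w = [wj + eta * ts[i] * xj for wj, xj in zip(w, xs[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Linearly separable toy data: +1 class around (2, 2), -1 class around (-2, -2).
xs = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]]
ts = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(xs, ts)
print([predict(w, x) for x in xs])  # [1, 1, 1, -1, -1, -1]
```

The dual, kernelized form in the slide generalizes this to non-linear decision boundaries (polynomial, RBF) via $k(\mathbf{x}, \mathbf{x}')$.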
U-I relations documents on the web
• Texts extracted from web documents are very noisy for content analysis.
– Irrelevant text, e.g. menu labels, page headers and footers, and ads, still remains.
• In our observation,
– irrelevant text tends to consist of isolated terms rather than full sentences, and
– for detecting U-I relations, the exact evidence of relevance occurs in two or three consecutive, formal sentences.
• For example, "the MIT researchers and scientists from MicroCHIPS Inc. reported that...",
• or a Japanese target such as "東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く...." ("The University of Tokyo and OMRON Corporation, through joint research, [developed a technique] robust to overlap and occlusion....")
• It is therefore enough to keep only text containing sentence-ending punctuation marks, which indicates a fully formal sentence.
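The filtering heuristic above can be sketched as follows. The exact rule is an assumption, since the slide only says "text including punctuation marks"; here a segment qualifies when it contains a Japanese or Latin sentence-ending mark.

```python
import re

# Segments qualify as "formal sentences" when they contain sentence-ending
# punctuation; isolated menu labels and header fragments are dropped.
SENTENCE_END = re.compile(r"[。．.!?！？]")

def filter_formal_sentences(segments):
    return [s for s in segments if SENTENCE_END.search(s)]

segments = [
    "ホーム",                      # menu label ("Home"): no punctuation, dropped
    "お問い合わせ",                # "Contact us": dropped
    "東京大学とオムロン株式会社は、共同研究により新技術を開発しました。",  # kept
    "The MIT researchers and scientists from MicroCHIPS Inc. reported the results.",  # kept
]
print(filter_formal_sentences(segments))  # keeps only the two full sentences
```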
Feature selection
• tf-idf (term frequency – inverse document frequency)
• tf-idf is defined by
$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D),$$
where $t$ is a term, $d$ a document, and $D$ the whole document collection.
• The feature vector of a document $d$ is defined by
$$\mathbf{x}_d = (x_{t_1,d}, x_{t_2,d}, \cdots, x_{t_M,d}), \quad x_{t,d} = \text{tf-idf}(t, d, D) \times b_{t,d}, \quad b_{t,d} = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{if } t \notin d \end{cases}$$
• In our experiment a term can be a word in a document, the POS (part of speech) of a morpheme, or the analytical output of external tools.
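The definitions above translate directly into code. Since the slides do not fix the tf and idf variants, this sketch assumes raw term counts and $\text{idf}(t, D) = \ln(|D| / \text{df}(t))$; the indicator $b_{t,d}$ is implicit because tf is already 0 when $t \notin d$.

```python
import math

def tf(t, d):
    """Raw count of term t in tokenised document d."""
    return d.count(t)

def idf(t, D):
    """Inverse document frequency over collection D (natural log assumed)."""
    df = sum(1 for d in D if t in d)
    return math.log(len(D) / df) if df else 0.0

def tf_idf(t, d, D):
    return tf(t, d) * idf(t, D)

def feature_vector(terms, d, D):
    """x_d = (x_{t1,d}, ..., x_{tM,d}) for a fixed term list."""
    return [tf_idf(t, d, D) for t in terms]

D = [["共同", "研究", "開発"], ["研究", "発表"]]
terms = ["共同", "研究"]
# 共同 occurs once in 1 of 2 docs: 1 * ln(2); 研究 occurs in both: idf = ln(1) = 0.
print(feature_vector(terms, D[0], D))
```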
Mapping a document into a feature vector

A document:
東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
("Tohoku University, in joint research with NEC, developed and demonstrated the world's first technology for an electronic circuit used inside CPUs (CAM: content-addressable memory processor) that combines high-speed operation equal to existing circuits with non-volatile operation retaining data on the circuit even when power is cut during processing.")

Feature selection:
x = (tf-idf(産官学, d, D), tf-idf(協力, d, D), tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D), tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D), tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D), tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D), tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))

The resulting feature vector:
x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)
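The mapping on this slide can be imitated with plain substring matching in place of MeCab's morphological analysis; the mini-collection and keyword subset below are invented for illustration, so the weights differ from the slide's values.

```python
import math

# Hypothetical mini-collection; the real experiment uses crawled press releases.
D = [
    "東北大学は、NECとの共同研究により不揮発動作を両立する技術を開発しました。",
    "新製品の発表を開始しました。",
]
keywords = ["産官学", "協力", "共同", "開発"]  # a subset of the U-I keywords

def tf_idf(t, d, D):
    """Raw count times ln(|D|/df); substring matching stands in for morphemes."""
    df = sum(1 for doc in D if t in doc)
    return d.count(t) * (math.log(len(D) / df) if df else 0.0)

x = [tf_idf(t, D[0], D) for t in keywords]
print(x)  # zeros for absent keywords; positive weights for 共同 and 開発
```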
Features (1)
1) BoW
– Bag of words: the full output of MeCab (a Japanese morphological analyzer). The tf-idf of each word forms the feature vector $\mathbf{x}_n$.
2) BoW(N)
– Only nouns are chosen.
3) BoW(N-3)
– Words are restricted to proper nouns, general nouns, and Sahen nouns (nouns that form verbs by adding ”する” ([suru], do)).
4) K(14)
– Fourteen keywords related to U-I relations: ”研究” ([kenkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discovery), ”開始” ([kaishi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relations), ”産官学” ([sankangaku], U-I-G (university-industry-government) relations), and ”連携” ([renkei], coordination).
5) K(18)
– K(14) + 4 keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kenkyuin], researcher).
Features (2)
6) K(18)+NM
– Each keyword is checked together with the POS (part of speech) of the next morpheme in the text, so that grammatical connections of the keywords are restricted to verbs, auxiliary verbs, and Sahen nouns.
7) Corp.
– Corporation marks: ”株式会社” ([kabushikigaisha], incorporated company), ㈱ (the Unicode character U+3231), (株), or （株）.
8) Univ.
– A university name is checked: ”大学” ([daigaku], university) or ”大” ([dai], a shortened form of university).
9) C.+U.
– Both a corporation mark and a university name occur in the same sentence.
10) ORG
– The presence of an organization name, detected by CaboCha's Japanese named-entity extraction function.
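Features 7)–9) amount to simple pattern matching, sketched below with regular expressions. The patterns are assumptions: the shortened university form ”大” is omitted here because matching it alone would over-trigger, and sentence segmentation is taken as given.

```python
import re

CORP_MARK = re.compile(r"株式会社|㈱|\(株\)|（株）")  # feature 7) Corp.
UNIV_MARK = re.compile(r"大学")                        # feature 8) Univ. (”大” omitted)

def corp_univ_cooccur(sentence):
    """Feature 9) C.+U.: both a corporation mark and a university name
    occur in the same sentence."""
    return bool(CORP_MARK.search(sentence)) and bool(UNIV_MARK.search(sentence))

s = "東京大学とオムロン株式会社は、共同研究を開始した。"
print(corp_univ_cooccur(s))  # True: both 大学 and 株式会社 occur
```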
Feature selection and SVM kernel functions

Test ID | TF-IDF feature elements                               | Kernel function
1-1     | (1) BoW                                               | Linear
1-2     | (2) BoW(N)                                            | Linear
1-3     | (3) BoW(N-3)                                          | Linear
2-1     | (4) K(14)                                             | Linear
2-2     | (4) K(14)                                             | Polynomial
2-3     | (4) K(14)                                             | RBF
3-1     | (5) K(18)                                             | Linear
3-2     | (5) K(18)                                             | Polynomial
3-3     | (5) K(18)                                             | RBF
4-1     | (6) K(18)+NM                                          | Linear
4-2     | (6) K(18)+NM                                          | Polynomial
4-3     | (6) K(18)+NM                                          | RBF
5-1     | (6) K(18)+NM, (10) ORG                                | Linear
5-2     | (6) K(18)+NM, (10) ORG                                | Polynomial
5-3     | (6) K(18)+NM, (10) ORG                                | RBF
6-1     | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG          | Linear
6-2     | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG          | Polynomial
6-3     | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG          | RBF
7-1     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.         | Linear
7-2     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.         | Polynomial
7-3     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.         | RBF
7-4     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.         | RBF (γ tuned)
8-1     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | Linear
8-2     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | Polynomial
8-3     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | RBF
8-4     | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | RBF (γ tuned)
Data set for experiment

Organization         | Crawled articles (positive / negative) | Articles for experiment (positive / negative)
Tohoku Univ.         | 44 / 499                               | 44 / 44
The Univ. of Tokyo   | 106 / 848                              | 106 / 106
Kyoto Univ.          | 40 / 329                               | 40 / 40
Tokyo Inst. of Tech. | 37 / 343                               | 37 / 37
Hitachi Corp.        | 103 / 450                              | 103 / 103
Total                | 330 / 2469                             | 330 / 330
Classification results (SVM^light (Joachims))

Test ID | Accuracy | Precision | Recall | F-measure | Feature elements
1-1     | 61.21    | 64.04     | 42.12  | 47.28     | BoW
1-2     | 60.61    | 63.75     | 40.00  | 45.54     |
1-3     | 61.52    | 67.44     | 40.00  | 46.72     |
2-1     | 67.58    | 72.02     | 61.52  | 63.70     | K(14)
2-2     | 58.03    | 69.76     | 23.33  | 34.45     |
2-3     | 66.51    | 62.53     | 86.37  | 71.89     |
3-1     | 68.18    | 72.02     | 63.33  | 64.78     | K(18)
3-2     | 57.88    | 69.00     | 23.03  | 34.08     |
3-3     | 66.67    | 62.22     | 88.18  | 72.43     |
4-1     | 70.61    | 74.66     | 63.64  | 67.40     | K(18)+NM
4-2     | -        | -         | -      | -         |
4-3     | 70.76    | 65.49     | 90.30  | 75.66     |
5-1     | 70.61    | 74.61     | 63.64  | 67.31     | K(18)+NM, ORG
5-2     | -        | -         | -      | -         |
5-3     | 70.76    | 65.49     | 90.30  | 75.66     |
6-1     | -        | -         | -      | -         | K(18)+NM, Corp., Univ., ORG
6-2     | -        | -         | -      | -         |
6-3     | 70.15    | 64.64     | 93.64  | 76.09     |
7-1     | 78.79    | 85.01     | 71.52  | 76.99     | K(18)+NM, Corp., Univ., C.+U.
7-2     | -        | -         | -      | -         |
7-3     | 72.27    | 66.07     | 94.85  | 77.61     |
7-4     | 80.15    | 78.81     | 83.94  | 81.05     |
8-1     | 78.94    | 85.03     | 71.82  | 77.16     | K(18)+NM, Corp., Univ., C.+U., ORG
8-2     | -        | -         | -      | -         |
8-3     | 71.82    | 65.73     | 94.85  | 77.35     |
8-4     | 79.85    | 78.51     | 83.94  | 80.86     |

Scores are averages over 10-fold cross-validation.
"-": not calculated because of zero precision or a learning-optimization fault.
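The scores in the table are the standard classification metrics. Their exact definitions are not stated on the slides, so this sketch assumes the usual confusion-matrix formulas, with F-measure as the harmonic mean of precision and recall:

```python
def scores(gold, pred):
    """Accuracy, precision, recall, and F-measure from gold/predicted labels
    in {1, -1}, with 1 the positive (U-I relation) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == -1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == -1)
    tn = sum(1 for g, p in zip(gold, pred) if g == -1 and p == -1)
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
gold = [1, 1, 1, 1, -1, -1, -1, -1]
pred = [1, 1, 1, -1, 1, -1, -1, -1]
print(scores(gold, pred))  # (0.75, 0.75, 0.75, 0.75)
```

The "-" rows above correspond to folds where tp + fp was zero (undefined precision) or the optimizer failed.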
Findings and discussion (1)
• In test IDs 1-1, 1-2, and 1-3, the feature vectors consist of BoW elements numbering over 15,800, 13,000, and 12,000 respectively. Their F-measures are worse than those of the other feature sets under the same linear kernel; learning appears to have failed.
• The likely reason is that the training set is far too small relative to the feature dimensionality. With enough training data, the number of examples would exceed the feature-vector size, i.e. the number of SVM basis functions, so learning could proceed without over-fitting.
Findings and discussion (2)
• In test IDs 2-1 through 8-3, the feature-vector size is only about 14 to 33.
• Accuracy and F-measure gradually improve as the feature set grows more complex.
Findings and discussion (3)
• Test IDs 7-* and 8-* involve the occurrence of university and company symbols. In ID 7-3 especially, recall and F-measure reach the highest values among the untuned kernels. This suggests the co-occurrence of the two symbols in a sentence is a sensitive indicator of U-I relations.
• Scores depend strongly on the type of kernel function.
• The kernel-function parameters and the loss-function cost affect the balance between precision and recall. In this experiment, the γ of the RBF (radial basis function) kernel was chosen to maximize the F-measure under cross-validation.
Conclusion and future work
• To automatically extract resources on U-I relations from the web,
– we targeted the "press release articles" of organizations, and
– applied a classification technique, the support vector machine (SVM), to the decision.
• We conducted an experiment over several combinations of feature-vector elements and SVM kernel-function types.
• The combinations reveal that
– U-I relations keywords, and
– university and company symbols occurring in the same sentence
are effective feature elements.
• The SVM parameters were tuned for a higher F-measure, which also affects the balance between precision and recall.
• Finally, we obtained an accuracy of 80.15 and an F-measure of 81.05 for classifying U-I relations documents on the web.
• In future work, we will build the classifier into a context crawler to automatically crawl organizations' press-release web sites and gather more resources.