A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web Kei Kurakawa 1 , Yuan Sun 1 , Nagayoshi Yamashita 2 , Yasumasa Baba 3 1. National Institute of Informatics 2. GMO Research (ex- Japan Society for the Promotion of Science) 3. The Institute of Statistical Mathematics Analysis and Modeling of Complex Data in Behavioral and Social Sciences Joint meeting of Japanese and Italian Classification Societies Anacapri (Capri Island, Italy), 3-4 September 2012



A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web

Kei Kurakawa (1), Yuan Sun (1), Nagayoshi Yamashita (2), Yasumasa Baba (3)

1. National Institute of Informatics
2. GMO Research (ex- Japan Society for the Promotion of Science)
3. The Institute of Statistical Mathematics

Analysis and Modeling of Complex Data in Behavioral and Social Sciences
Joint meeting of Japanese and Italian Classification Societies
Anacapri (Capri Island, Italy), 3-4 September 2012


U-I-G relations

•  University-industry-government (U-I-G) relations are an important aspect to investigate when making policy for science and technology research and development (Leydesdorff and Meyer, 2003).

•  Web documents are one of the research targets for clarifying the state of these relations.

•  In the clarification process, obtaining accurate resources on U-I-G relations is the first requirement.

[Figure: triangle diagram with nodes U (university), G (government), and I (industry)]


Objective

•  The objective is to automatically extract resources on U-I relations from the web.

•  We target “press release articles” of organizations and build a framework that automatically crawls them and decides which of them concern U-I relations.

[Figure: triangle diagram with nodes U, G, and I]


Automatic extraction framework for U-I relations documents on the web

1.  Crawling Web Documents: press release articles published on university or company web sites

2.  Extracting Text From the Documents

3.  Learning to Classify the Document

4.  Classifying the Document

[Figure: pipeline diagram of the four steps; the intermediate artifacts are the Crawled Documents, the Extracted Texts, and the Learned Model File]
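The four steps above can be sketched as a small driver function. This is only an illustration of the data flow: the function name `run_pipeline`, the three injected callables, and the toy pages are assumptions made for this sketch, not part of the authors' framework.

```python
def run_pipeline(urls, fetch, extract_text, is_ui_relation):
    """Crawl -> extract -> classify; return the URLs judged to be about U-I relations.

    fetch, extract_text and is_ui_relation are caller-supplied stand-ins for
    step 1 (crawling), step 2 (text extraction) and step 4 (classification
    with the model learned in step 3).
    """
    hits = []
    for url in urls:
        html = fetch(url)              # step 1: crawled document
        text = extract_text(html)      # step 2: extracted text
        if is_ui_relation(text):       # step 4: apply the learned classifier
            hits.append(url)
    return hits


# Toy stand-ins (hypothetical): two "pages" and a keyword rule instead of an SVM.
pages = {
    "u1": "<p>東京大学と株式会社Aは共同研究を開始した。</p>",
    "u2": "<p>入試情報</p>",
}
hits = run_pipeline(
    pages,
    pages.get,
    lambda html: html.replace("<p>", "").replace("</p>", ""),
    lambda text: "共同研究" in text,
)
```

Passing the stages in as functions keeps the sketch independent of any particular crawler, text extractor, or SVM library.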


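Support Vector Machine (1) (Vapnik, 1995)

•  Two-class classifier

$$y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b$$

where $\phi(\mathbf{x})$ is a fixed feature space transformation and $b$ is the bias parameter.

•  $N$ input vectors
   –  Input vectors: $\mathbf{x}_1, \ldots, \mathbf{x}_N$
   –  Target values: $t_1, \ldots, t_N$, where $t_n \in \{-1, 1\}$

•  For all input vectors, $t_n y(\mathbf{x}_n) > 0$
•  Maximize the margin between the hyperplanes $y(\mathbf{x}) = 1$ and $y(\mathbf{x}) = -1$

[Figure: decision boundary $y = 0$ between the margin boundaries $y = 1$ and $y = -1$, with the margin and the support vectors labeled]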
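The decision function on this slide can be illustrated with a minimal Python sketch. It assumes the identity feature map (phi(x) = x) and hand-picked values for `w` and `b`; the toy points and all function names are illustrative assumptions, not values from the experiment.

```python
import math


def decision_value(w, b, x):
    """y(x) = w^T phi(x) + b, assuming the identity feature map phi(x) = x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class label in {-1, 1} taken from the sign of y(x)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def separates(w, b, xs, ts):
    """True when t_n * y(x_n) > 0 holds for every training pair."""
    return all(t * decision_value(w, b, x) > 0 for x, t in zip(xs, ts))


def margin_width(w):
    """Distance between the hyperplanes y(x) = 1 and y(x) = -1, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Toy data (hypothetical): two points per class.
xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
ts = [1, 1, -1, -1]
w, b = (0.5, 0.0), 0.0
```

With these values the points (2, 0) and (-2, 0) lie exactly on the hyperplanes y(x) = 1 and y(x) = -1, so they play the role of support vectors.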


Support Vector Machine (2)

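•  Optimization problem

$$\arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^{2}$$

subject to the constraints $t_n(\mathbf{w}^{T}\phi(\mathbf{x}_n) + b) \geq 1$, $n = 1, \ldots, N$.

•  By means of the Lagrangian method

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

where the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{T}\phi(\mathbf{x}')$, and $a_n \geq 0$ are the Lagrange multipliers ($a_n > 0$ only for the support vectors).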
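The step from the optimization problem to the kernel form can be written out as follows. This is the standard textbook derivation for the hard-margin SVM, given as a sketch in the notation of this slide; it is not taken from the slides themselves.

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^{2} - \sum_{n=1}^{N} a_n \left\{ t_n\left(\mathbf{w}^{T}\phi(\mathbf{x}_n) + b\right) - 1 \right\}, \qquad a_n \geq 0$$

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0$$

$$y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n)^{T}\phi(\mathbf{x}) + b = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

Because the feature map enters only through inner products, the classifier can be evaluated with the kernel alone, without computing $\phi(\mathbf{x})$ explicitly.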


U-I relations documents on the web

•  Texts extracted from web documents are very noisy for content analysis.
   –  Irrelevant text, e.g. menu labels, page headers or footers, and ads, still remains.
•  In our observation,
   –  irrelevant text tends to consist of isolated terms rather than sentences,
   –  for detecting U-I relations, the exact evidence of relevance occurs in two or three consecutive formal sentences.
      •  For example, "the MIT researchers and scientists from MicroCHIPS Inc. reported that...",
      •  or, in the target Japanese text, "東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く...." ("The University of Tokyo and OMRON Corporation, through joint research, ... robust to overlap and occlusion ....")
•  It is therefore enough to keep only text that includes punctuation marks, which indicate a fully formal sentence.
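The punctuation filter described on this slide could look like the following minimal sketch. The slides do not list the exact punctuation marks, so the choice of the Japanese full stop and comma, the function name `keep_formal_text`, and the sample lines are all assumptions.

```python
# Assumption: the Japanese full stop and comma indicate a formal sentence;
# the slides do not say which punctuation marks were actually used.
PUNCTUATION_MARKS = ("。", "、")


def keep_formal_text(lines):
    """Keep only the lines that contain at least one punctuation mark."""
    return [
        line for line in lines
        if any(mark in line for mark in PUNCTUATION_MARKS)
    ]


extracted = [
    "ホーム",            # menu label ("Home"), hypothetical
    "お問い合わせ",      # footer link ("Contact"), hypothetical
    "東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....",
]
filtered = keep_formal_text(extracted)
```

Menu labels and similar fragments carry no punctuation, so only the sentence from the press release body survives the filter.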


Feature selection

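•  tf-idf (Term Frequency – Inverse Document Frequency)

•  tf-idf is defined by

$$\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

where $t$ is a term, $d$ is a document, and $D$ is the set of all documents.

•  A feature is defined by

$$\mathbf{x}_d = (x_{t_1,d}, x_{t_2,d}, \cdots, x_{t_M,d}), \qquad x_{t,d} = \text{tf-idf}(t, d, D) \times b_{t,d}$$

$$b_{t,d} = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{if } t \notin d \end{cases}$$

•  In our experiment, the term can be a word in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.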
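The feature definition above can be tried out with a short Python sketch. The slides do not give the exact forms of tf and idf, so this sketch assumes a raw term count for tf and log(|D| / df) for idf; the toy documents and vocabulary are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Assumed form: raw count of the term in the tokenized document."""
    return Counter(doc)[term]


def idf(term, docs):
    """Assumed form: log(|D| / df); 0.0 when the term occurs in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


def feature_vector(doc, docs, vocabulary):
    """x_d = (x_{t,d}) with x_{t,d} = tf-idf(t, d, D) * b_{t,d}."""
    return [
        tf_idf(t, doc, docs) * (1 if t in doc else 0)
        for t in vocabulary
    ]


# Toy tokenized documents and vocabulary (hypothetical).
docs = [["共同", "研究", "研究"], ["開発"], ["研究", "開発"]]
vocabulary = ["共同", "研究", "開発", "産学"]
vec = feature_vector(docs[0], docs, vocabulary)
```

With raw counts the indicator b is redundant, because tf is already zero for absent terms; it is kept here only to mirror the definition on the slide.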


Mapping a document into a feature vector

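A document:

東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。

(English: Tohoku University, through joint research with NEC, has developed and demonstrated, for the first time in the world, a technology for electronic circuits used inside CPUs (CAM: content-addressable memory processor) that achieves both high-speed operation equal to existing circuits and non-volatile operation that retains data on the circuit even if the power is cut during processing.)

Feature selection:

x = (tf-idf(産官学, d, D), tf-idf(協力, d, D),
     tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D),
     tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D),
     tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D),
     tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D),
     tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))

Here 動詞 is the POS tag "verb" and 名詞,サ変接続 is the POS tag "noun, Sahen-connection".

A feature vector:

x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)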


Features (1)

1)  BoW
   –  Bag of Words. Full output of MeCab (a Japanese morphological analyzer). The tf-idf values of the words make up the feature vector $\mathbf{x}_n$.
2)  BoW(N)
   –  Only nouns are chosen.
3)  BoW(N-3)
   –  Words are restricted to proper nouns, general nouns, and Sahen-nouns (nouns that form a verb when "する" ([suru], do) is added).
4)  K(14)
   –  Fourteen keywords related to U-I relations. The keywords are "研究" ([kenkyu], research), "開発" ([kaihatsu], development), "実験" ([jikken], experiment), "成功" ([seikou], success), "発見" ([hakken], discovery), "開始" ([kaishi], start), "受賞" ([jushou], award), "表彰" ([hyoushou], honor), "共同" ([kyoudou], collaboration), "協同" ([kyoudou], cooperation), "協力" ([kyouryoku], join forces), "産学" ([sangaku], U-I relations), "産官学" ([sankangaku], UIG (University-Industry-Government) relations), and "連携" ([renkei], coordination).
5)  K(18)
   –  K(14) + 4 keywords: "受託" ([jutaku], entrusted with), "委託" ([itaku], consignment), "締結" ([teiketsu], conclusion), and "研究員" ([kenkyuin], researcher).


Features (2)

6)  K(18)+NM
   –  The keywords and the POS (part of speech) of the next morpheme in the text sequence are checked; the grammatical connections of those keywords are restricted to verbs, auxiliary verbs, and Sahen-nouns.
7)  Corp.
   –  Corporation marks.
   –  "株式会社" ([kabushikigaisha], Incorporated), ㈱ (a single Unicode character, U+3231), （株）, or (株).
8)  Univ.
   –  A university name is checked.
   –  "大学" ([daigaku], university), or "大" ([dai], a shortened form of university).
9)  C.+U.
   –  Both a corporation mark and a university name occur in a sentence.
10)  ORG
   –  The existence of an organization, detected by CaboCha's Japanese named entity extraction function.
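The Corp., Univ., and C.+U. checks can be sketched with plain substring tests, as below. How the authors actually turn these checks into tf-idf feature values is not stated on the slide; the function names, the full-width and half-width parenthesis variants, and the sentence splitting on the Japanese full stop are assumptions of this sketch.

```python
# Marks named on the slide. The full-width and half-width parenthesis
# variants of (株) are an assumption about what the slide distinguishes.
CORP_MARKS = ("株式会社", "\u3231", "（株）", "(株)")
UNIV_MARKS = ("大学", "大")


def has_corp(sentence):
    """Corp.: a corporation mark occurs in the sentence."""
    return any(mark in sentence for mark in CORP_MARKS)


def has_univ(sentence):
    """Univ.: naive check; "大" also matches unrelated words such as "拡大"."""
    return any(mark in sentence for mark in UNIV_MARKS)


def split_sentences(text):
    """Split on the Japanese full stop (an assumed sentence boundary)."""
    return [s for s in text.split("。") if s]


def corp_and_univ(text):
    """C.+U.: some sentence holds both a corporation mark and a university name."""
    return any(has_corp(s) and has_univ(s) for s in split_sentences(text))


same_sentence = "東京大学とオムロン株式会社は、共同研究により開発した。"
separate_sentences = "東京大学は発表した。オムロン株式会社は発表した。"
```

The second example contains both marks, but in different sentences, so it does not count for C.+U. under this reading of "in a sentence".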


Feature selection and SVM kernel functions

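Test ID   TF-IDF feature elements                                        Kernel function
1-1       (1) BoW                                                        Linear
1-2       (2) BoW(N)                                                     Linear
1-3       (3) BoW(N-3)                                                   Linear
2-1       (4) K(14)                                                      Linear
2-2       (4) K(14)                                                      Polynomial
2-3       (4) K(14)                                                      RBF
3-1       (5) K(18)                                                      Linear
3-2       (5) K(18)                                                      Polynomial
3-3       (5) K(18)                                                      RBF
4-1       (6) K(18)+NM                                                   Linear
4-2       (6) K(18)+NM                                                   Polynomial
4-3       (6) K(18)+NM                                                   RBF
5-1       (6) K(18)+NM, (10) ORG                                         Linear
5-2       (6) K(18)+NM, (10) ORG                                         Polynomial
5-3       (6) K(18)+NM, (10) ORG                                         RBF
6-1       (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG                   Linear
6-2       (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG                   Polynomial
6-3       (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG                   RBF
7-1       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.                  Linear
7-2       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.                  Polynomial
7-3       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.                  RBF
7-4       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.                  RBF (γ tuned)
8-1       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG        Linear
8-2       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG        Polynomial
8-3       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG        RBF
8-4       (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG        RBF (γ tuned)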
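The three kernel types in this table can be sketched in plain Python as below. The forms are the standard textbook definitions; the default parameter values (`degree`, `coef0`, `gamma`) are illustrative assumptions, since the slides do not state the settings used in the experiment.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    """k(x, z) = x^T z"""
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """k(x, z) = (x^T z + coef0)^degree"""
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

A larger `gamma` makes the RBF kernel fall off faster with distance, which is the parameter the "γ tuned" rows refer to.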


Data set for experiment

                        Crawled articles         Articles for experiment
Organization            Positive    Negative     Positive    Negative
Tohoku Univ.            44          499          44          44
The Univ. of Tokyo      106         848          106         106
Kyoto Univ.             40          329          40          40
Tokyo Inst. of Tech.    37          343          37          37
Hitachi Corp.           103         450          103         103
Total                   330         2469         330         330


Classification results (SVMlight (Joachims))

Test ID   Feature elements                        Accuracy   Precision   Recall   F-measure
1-1       BoW                                     61.21      64.04       42.12    47.28
1-2       BoW                                     60.61      63.75       40.00    45.54
1-3       BoW                                     61.52      67.44       40.00    46.72
2-1       K(14)                                   67.58      72.02       61.52    63.70
2-2       K(14)                                   58.03      69.76       23.33    34.45
2-3       K(14)                                   66.51      62.53       86.37    71.89
3-1       K(18)                                   68.18      72.02       63.33    64.78
3-2       K(18)                                   57.88      69.00       23.03    34.08
3-3       K(18)                                   66.67      62.22       88.18    72.43
4-1       K(18)+NM                                70.61      74.66       63.64    67.40
4-2       K(18)+NM                                -          -           -        -
4-3       K(18)+NM                                70.76      65.49       90.30    75.66
5-1       K(18)+NM, ORG                           70.61      74.61       63.64    67.31
5-2       K(18)+NM, ORG                           -          -           -        -
5-3       K(18)+NM, ORG                           70.76      65.49       90.30    75.66
6-1       K(18)+NM, Corp, Univ., ORG              -          -           -        -
6-2       K(18)+NM, Corp, Univ., ORG              -          -           -        -
6-3       K(18)+NM, Corp, Univ., ORG              70.15      64.64       93.64    76.09
7-1       K(18)+NM, Corp, Univ., C+U              78.79      85.01       71.52    76.99
7-2       K(18)+NM, Corp, Univ., C+U              -          -           -        -
7-3       K(18)+NM, Corp, Univ., C+U              72.27      66.07       94.85    77.61
7-4       K(18)+NM, Corp, Univ., C+U              80.15      78.81       83.94    81.05
8-1       K(18)+NM, Corp, Univ., C+U, ORG         78.94      85.03       71.82    77.16
8-2       K(18)+NM, Corp, Univ., C+U, ORG         -          -           -        -
8-3       K(18)+NM, Corp, Univ., C+U, ORG         71.82      65.73       94.85    77.35
8-4       K(18)+NM, Corp, Univ., C+U, ORG         79.85      78.51       83.94    80.86

Values are percentages averaged over 10-fold cross validation.

-: not calculated because precision was zero or the learning optimization failed.
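Because the scores are averaged over folds, the reported F-measure need not equal the harmonic mean of the reported precision and recall; for test 1-1, for instance, the harmonic mean of 64.04 and 42.12 is about 50.8, while the table reports 47.28. The sketch below shows how per-fold averaging can produce such a gap. The confusion counts are hypothetical, and returning 0.0 for an undefined score is an assumption of this sketch (the slides report such cases as not calculated).

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure for one fold (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f = 2 * precision * recall / (precision + recall)
    else:
        f = 0.0
    return precision, recall, f


def average_over_folds(fold_counts):
    """Average each score over the folds, one (tp, fp, fn) triple per fold."""
    scores = [precision_recall_f(*counts) for counts in fold_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical confusion counts (tp, fp, fn) for two folds.
folds = [(30, 10, 3), (10, 2, 23)]
avg_p, avg_r, avg_f = average_over_folds(folds)

# F computed from the averaged precision and recall, for comparison.
f_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)
```

In this toy case the fold-averaged F is lower than the F computed from the averaged precision and recall, the same direction as the gap seen for test 1-1.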


Findings and discussion (1)

•  In test IDs 1-1, 1-2, and 1-3, the feature elements consist of BoW terms, numbering over 15800, 13000, and 12000 respectively. Their f-measures are worse than those of the other features with the same linear kernel function. Learning seems to have failed for them.

•  A likely reason for the failure is that the training data is far too small to learn from. With enough training data, the number of training examples would exceed the feature vector size. The training data size would then surpass the number of basis functions of the SVM, so that learning could be done without over-fitting.


Findings and discussion (2)

•  In test IDs 2-1 to 8-3, the number of feature elements is about 14 to 33.

•  Accuracy and f-measure gradually increase as more feature elements are added.


Findings and discussion (3)

•  Test IDs 7-* and 8-* are related to the occurrence of university and company symbols. In ID 7-3 in particular, recall (94.85) is the highest of all tests (tied with 8-3), and the f-measure (77.61) is the highest among the tests without γ tuning. This suggests that the occurrence of the two symbols in a sentence is a sensitive indicator of U-I relations.

•  Scores depend strongly on the kernel function type.
•  The parameters of the kernel function and the efficiency of the loss function affect the balance between precision and recall. γ of the Radial Basis Function was chosen to give the highest F-value under cross validation in this experiment.
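Choosing γ by the cross-validated F-value amounts to a simple search over candidate values. The sketch below assumes a caller-supplied scoring function `mean_cv_f` that trains and evaluates an RBF-kernel SVM under cross validation; that function, the candidate grid, and the stand-in scores are hypothetical and are not results from the slides.

```python
def select_gamma(gammas, mean_cv_f):
    """Return the candidate gamma with the highest cross-validated F-value.

    mean_cv_f is a hypothetical callable: given a gamma, it should train and
    evaluate an RBF-kernel SVM under cross validation and return the mean F.
    """
    return max(gammas, key=mean_cv_f)


# Stand-in scores for illustration only; these are NOT results from the slides.
toy_scores = {0.01: 0.70, 0.1: 0.78, 1.0: 0.74}
best_gamma = select_gamma(sorted(toy_scores), toy_scores.get)
```

Selecting on the F-value rather than on accuracy targets the balance between precision and recall that this slide discusses.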


Conclusion and future work

•  To automatically extract resources on U-I relations from the web,
   –  we targeted “press release articles” of organizations,
   –  a classification technique, i.e. a support vector machine (SVM), was applied to the decision.
•  We conducted an experiment with several combinations of feature vector elements and SVM kernel function types.
•  The combinations reveal that
   –  U-I relations keywords,
   –  university and company symbols in a sentence
   are effective elements for features.
•  The SVM parameters were tuned to obtain a higher f-measure, which also affects the balance between precision and recall.
•  Finally, we obtained an accuracy of 80.15 and an f-measure of 81.05 for classifying U-I relations documents on the web.
•  In future work, we will build the classifier into a context crawler that automatically crawls the press release web sites of organizations to obtain more resources.