A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web


Kei Kurakawa (1), Yuan Sun (1), Nagayoshi Yamashita (2), Yasumasa Baba (3)

1. National Institute of Informatics
2. GMO Research (ex- Japan Society for the Promotion of Science)
3. The Institute of Statistical Mathematics

Analysis and Modeling of Complex Data in Behavioral and Social Sciences: Joint Meeting of Japanese and Italian Classification Societies, Anacapri (Capri Island, Italy), 3-4 September 2012

U-I-G relations

•  University-industry-government (U-I-G) relations are an important aspect to investigate when making policy for science and technology research and development (Leydesdorff and Meyer, 2003).

•  Web documents are one of the research targets for clarifying the state of these relations.

•  In the clarification process, the first requirement is to obtain the exact resources of U-I-G relations.

[Figure: diagram of the U, I, and G nodes]

Objective

•  The objective is to automatically extract resources of U-I relations from the web.

•  We target “press release articles” of organizations and build a framework that automatically crawls them and decides which ones concern U-I relations.

Automatic extraction framework for U-I relations documents on the web

1. Crawling web documents (press release articles published on university or company web sites) → crawled documents

2. Extracting text from the documents → extracted texts

3. Learning to classify the documents → learned model file

4. Classifying the documents, using the learned model file

Support Vector Machine (1) (Vapnik, 1995)

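•  Two-class classifier

•  $N$ input vectors
   –  Input vectors: $\mathbf{x}_1, \ldots, \mathbf{x}_N$
   –  Target values: $t_1, \ldots, t_N$, where $t_n \in \{-1, 1\}$

•  Decision function: $y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b$, where $\phi(\mathbf{x})$ is a fixed feature space transformation and $b$ is the bias parameter

•  For all input vectors, $t_n y(\mathbf{x}_n) > 0$

•  Maximize the margin between the hyperplanes $y(\mathbf{x}) = 1$ and $y(\mathbf{x}) = -1$

[Figure: separating hyperplane $y = 0$ with the margin boundaries $y = 1$ and $y = -1$; the support vectors lie on the margin boundaries.]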
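As a rough illustration of the definitions on this slide, the following Python sketch evaluates the decision function with the identity feature map, checks the condition t_n y(x_n) > 0, and computes the distance between the hyperplanes y(x) = 1 and y(x) = -1. The weight vector, bias, and data points are invented toy values, not values from the experiment.

```python
import math

def decision(w, b, x):
    """y(x) = w^T phi(x) + b, assuming phi is the identity map."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def all_correct(w, b, xs, ts):
    """True when t_n * y(x_n) > 0 holds for every training pair."""
    return all(t * decision(w, b, x) > 0 for x, t in zip(xs, ts))

def margin_width(w):
    """Distance between the hyperplanes y(x) = 1 and y(x) = -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical toy data: two points per class in two dimensions.
xs = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
ts = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0

print(all_correct(w, b, xs, ts), margin_width(w))
```

In this toy setting every point lies exactly on one of the two margin boundaries, so all four points would act as support vectors.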

Support Vector Machine (2)

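•  Optimization problem

$$\arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^{2}$$

subject to the constraints $t_n\left(\mathbf{w}^{T}\phi(\mathbf{x}_n) + b\right) \ge 1$, $n = 1, \ldots, N$

•  By means of the Lagrangian method

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

where the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{T}\phi(\mathbf{x}')$, and $a_n \ge 0$ are the Lagrange multipliers (only the support vectors have $a_n > 0$).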
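The kernel form of the decision function can be sketched in a few lines of Python. This is a minimal illustration, not the SVM light implementation: the support vectors, multipliers, and bias below are hand-picked toy values, and the RBF kernel is written in its usual form exp(-γ‖x − z‖²).

```python
import math

def linear_kernel(x, z):
    """k(x, z) = x^T z, i.e. phi is the identity map."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_decision(x, support, a, t, b, kernel=linear_kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b over the support vectors."""
    return sum(an * tn * kernel(x, xn) for an, tn, xn in zip(a, t, support)) + b

# Hypothetical two-point problem: one support vector per class.
support = [(1.0, 1.0), (-1.0, -1.0)]
t = [1, -1]
a = [0.25, 0.25]   # toy multipliers, chosen so that y = +1 / -1 on the two points
b = 0.0

print(kernel_decision((1.0, 1.0), support, a, t, b))
print(kernel_decision((2.0, 0.0), support, a, t, b, kernel=rbf_kernel))
```

Swapping the `kernel` argument changes the classifier without touching the rest of the code, which is the practical benefit of the kernel formulation.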

U-I relations documents on the web

•  Texts extracted from web documents are very noisy for content analysis.
   –  Irrelevant text, e.g. menu labels, page headers or footers, and ads, still remains.

•  In our observation,
   –  irrelevant text tends to be isolated terms rather than part of a sentence,
   –  for detecting U-I relations, the exact evidence of relevance occurs in two or three sequential, formal sentences.

•  For example, in English: ”the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”,

•  and in Japanese, our target language: ”東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....” (”The University of Tokyo and OMRON Corporation, through joint research, ... robust against overlap and occlusion ....”)

•  It is enough to keep only the text that includes punctuation marks, which indicates a fully formed sentence.
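The punctuation filter described above can be sketched as follows. The slides do not list the exact punctuation marks used, so the character set here is an assumption, and the menu and footer labels are invented noise examples.

```python
import re

# Assumed set of Japanese sentence punctuation; the exact marks are not given in the slides.
PUNCTUATION = re.compile(r"[。、．，]")

def keep_sentence_lines(lines):
    """Keep only the lines that contain at least one punctuation mark."""
    return [line for line in lines if PUNCTUATION.search(line)]

extracted = [
    "ホーム",          # hypothetical menu label ("Home")
    "東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....",
    "お問い合わせ",    # hypothetical footer link ("Contact")
]
kept = keep_sentence_lines(extracted)
print(kept)
```

Only the sentence from the press release body survives; the isolated labels are dropped because they contain no punctuation.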


Feature selection

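•  tf-idf (Term Frequency – Inverse Document Frequency)

•  tf-idf is defined by

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$

where $t$ is a term, $d$ is a document, and $D$ is the set of all documents.

•  The feature vector is defined by

$$\mathbf{x}_d = (x_{t_1,d}, x_{t_2,d}, \cdots, x_{t_M,d}), \qquad x_{t,d} = \text{tf-idf}(t, d, D) \times b_{t,d}$$

$$b_{t,d} = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{if } t \notin d \end{cases}$$

•  In our experiment, a term can be a word in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of an external tool.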
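A minimal sketch of these definitions in Python is shown below. The slides do not spell out tf and idf, so this sketch assumes the raw term count for tf and log(|D| / document frequency) for idf; the tokenized toy corpus is invented for illustration.

```python
import math
from collections import Counter

def tf(term, doc):
    """Raw count of the term in a tokenized document (assumed definition)."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(|D| / document frequency) (assumed definition); 0.0 if the term never occurs."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def feature_vector(doc, docs, terms):
    """x_d = (x_{t_1,d}, ..., x_{t_M,d}) with x_{t,d} = tf-idf(t, d, D) * b_{t,d}."""
    vec = []
    for t in terms:
        b = 1 if t in doc else 0   # indicator b_{t,d}
        vec.append(tf(t, doc) * idf(t, docs) * b)
    return vec

# Hypothetical tokenized corpus D and feature terms t_1..t_M.
docs = [["共同", "研究", "開発"], ["研究", "発表"], ["採用", "情報"]]
terms = ["共同", "研究", "協力"]
x = feature_vector(docs[0], docs, terms)
print(x)
```

A term that appears in fewer documents gets a larger weight, and a term absent from the document contributes zero.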

Mapping a document into a feature vector

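A document

東北大学は、ＮＥＣとの共同研究によりCPU内で使用される電子回路（CAM：連想メモリプロセッサ）において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。

(English: Tohoku University, through joint research with NEC, has developed and demonstrated, for the first time in the world, a technology for electronic circuits used in CPUs (CAM: content-addressable memory processor) that achieves both high-speed operation equivalent to existing circuits and nonvolatile operation that retains data on the circuit even when the power is cut during processing.)

Feature selection

x = (tf-idf(産官学, d, D), tf-idf(協力, d, D),
     tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D),
     tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D),
     tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D),
     tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D),
     tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))

Here ”+動詞” marks a keyword followed by a verb, and ”+名詞,サ変接続” marks a keyword followed by a Sahen-noun.

A feature vector

x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)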

Features (1)

1)  BoW
    –  Bag of words. Full output of MeCab (Japanese morphological analyzer). The tf-idf values of the words make up the feature vector $\mathbf{x}_n$.

2)  BoW(N)
    –  Only nouns are chosen.

3)  BoW(N-3)
    –  Words are restricted to proper nouns, general nouns, and Sahen-nouns (nouns that form a verb by adding ”する” ([suru], do)).

4)  K(14)
    –  Fourteen keywords related to U-I relations. The keywords are ”研究” ([kennkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discover), ”開始” ([kaisi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relations), ”産官学” ([sankangaku], U-I-G (University-Industry-Government) relations), and ”連携” ([renkei], coordination).

5)  K(18)
    –  K(14) + 4 keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kennkyuin], researcher).

Features (2)

6)  K(18)+NM
    –  The keywords and the POS (part of speech) of the next morpheme in the text sequence are checked; the grammatical connections of the keywords are restricted to verbs, auxiliary verbs, and Sahen-nouns.

7)  Corp.
    –  Corporation marks.
    –  ”株式会社” ([kabushikigaisha], Incorporated), ㈱ (the Unicode character U+3231), (株), or （株）.

8)  Univ.
    –  University names are checked.
    –  ”大学” ([daigaku], university), or ”大” ([dai], a shortened form of university)

9)  C.+U.
    –  Both a corporation mark and a university name occur in a sentence.

10)  ORG
    –  The presence of an organization, detected by CaboCha’s Japanese named entity extraction function
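The Corp., Univ., and C.+U. checks can be sketched with regular expressions. This is only an illustration under assumptions: the slides do not say how sentences are split or how the short form ”大” is matched, so the patterns below are a loose approximation and the second example sentence is invented.

```python
import re

# Corporation marks listed on the slide: 株式会社, ㈱ (U+3231), and (株) in half- or full-width parentheses.
CORP = re.compile(r"株式会社|㈱|\(株\)|（株）")
# University names: 大学, or the short form 大 (matched loosely here; an assumption).
UNIV = re.compile(r"大学|大")

def symbol_features(sentence):
    """Return the Corp., Univ., and C.+U. indicators for one sentence."""
    corp = bool(CORP.search(sentence))
    univ = bool(UNIV.search(sentence))
    return {"Corp.": corp, "Univ.": univ, "C.+U.": corp and univ}

both = symbol_features("東京大学とオムロン株式会社は、共同研究により")
corp_only = symbol_features("㈱日立製作所は新製品を発売")  # hypothetical sentence
print(both, corp_only)
```

Matching a bare ”大” will also fire on unrelated words that contain the character, so a real system would likely need a stricter rule.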


Feature selection and SVM kernel functions

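Test ID | TF-IDF feature elements | Kernel function
1-1 | (1) BoW | Linear
1-2 | (2) BoW(N) | Linear
1-3 | (3) BoW(N-3) | Linear
2-1 | (4) K(14) | Linear
2-2 | (4) K(14) | Polynomial
2-3 | (4) K(14) | RBF
3-1 | (5) K(18) | Linear
3-2 | (5) K(18) | Polynomial
3-3 | (5) K(18) | RBF
4-1 | (6) K(18)+NM | Linear
4-2 | (6) K(18)+NM | Polynomial
4-3 | (6) K(18)+NM | RBF
5-1 | (6) K(18)+NM, (10) ORG | Linear
5-2 | (6) K(18)+NM, (10) ORG | Polynomial
5-3 | (6) K(18)+NM, (10) ORG | RBF
6-1 | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG | Linear
6-2 | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG | Polynomial
6-3 | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG | RBF
7-1 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U. | Linear
7-2 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U. | Polynomial
7-3 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U. | RBF
7-4 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U. | RBF (γ tuned)
8-1 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | Linear
8-2 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | Polynomial
8-3 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | RBF
8-4 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | RBF (γ tuned)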

Data set for experiment

Organization | Crawled articles: positive | Crawled articles: negative | Articles for experiment: positive | Articles for experiment: negative
Tohoku Univ. | 44 | 499 | 44 | 44
The Univ. of Tokyo | 106 | 848 | 106 | 106
Kyoto Univ. | 40 | 329 | 40 | 40
Tokyo Inst. of Tech. | 37 | 343 | 37 | 37
Hitachi Corp. | 103 | 450 | 103 | 103
Total | 330 | 2469 | 330 | 330
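The table shows that, for each organization, the experiment uses as many negative articles as positive ones. The slides do not say how the negatives were chosen; the sketch below assumes simple random undersampling, with placeholder article identifiers.

```python
import random

def balance(positives, negatives, seed=0):
    """Randomly undersample negatives to the number of positives (assumed procedure)."""
    rng = random.Random(seed)
    return positives, rng.sample(negatives, len(positives))

# Tohoku Univ. counts from the table: 44 positive and 499 negative crawled articles.
pos = [f"pos-{i}" for i in range(44)]     # placeholder identifiers
neg = [f"neg-{i}" for i in range(499)]    # placeholder identifiers
pos_used, neg_used = balance(pos, neg)
print(len(pos_used), len(neg_used))
```

Balancing the two classes keeps a trivial "always negative" classifier from scoring a high accuracy.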

Classification results (SVM light (Joachims))

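Test ID | Feature elements | Accuracy (%) | Precision (%) | Recall (%) | F-measure (%)
1-1 | BoW | 61.21 | 64.04 | 42.12 | 47.28
1-2 | BoW | 60.61 | 63.75 | 40.00 | 45.54
1-3 | BoW | 61.52 | 67.44 | 40.00 | 46.72
2-1 | K(14) | 67.58 | 72.02 | 61.52 | 63.70
2-2 | K(14) | 58.03 | 69.76 | 23.33 | 34.45
2-3 | K(14) | 66.51 | 62.53 | 86.37 | 71.89
3-1 | K(18) | 68.18 | 72.02 | 63.33 | 64.78
3-2 | K(18) | 57.88 | 69.00 | 23.03 | 34.08
3-3 | K(18) | 66.67 | 62.22 | 88.18 | 72.43
4-1 | K(18)+NM | 70.61 | 74.66 | 63.64 | 67.40
4-2 | K(18)+NM | - | - | - | -
4-3 | K(18)+NM | 70.76 | 65.49 | 90.30 | 75.66
5-1 | K(18)+NM, ORG | 70.61 | 74.61 | 63.64 | 67.31
5-2 | K(18)+NM, ORG | - | - | - | -
5-3 | K(18)+NM, ORG | 70.76 | 65.49 | 90.30 | 75.66
6-1 | K(18)+NM, Corp., Univ., ORG | - | - | - | -
6-2 | K(18)+NM, Corp., Univ., ORG | - | - | - | -
6-3 | K(18)+NM, Corp., Univ., ORG | 70.15 | 64.64 | 93.64 | 76.09
7-1 | K(18)+NM, Corp., Univ., C.+U. | 78.79 | 85.01 | 71.52 | 76.99
7-2 | K(18)+NM, Corp., Univ., C.+U. | - | - | - | -
7-3 | K(18)+NM, Corp., Univ., C.+U. | 72.27 | 66.07 | 94.85 | 77.61
7-4 | K(18)+NM, Corp., Univ., C.+U. | 80.15 | 78.81 | 83.94 | 81.05
8-1 | K(18)+NM, Corp., Univ., C.+U., ORG | 78.94 | 85.03 | 71.82 | 77.16
8-2 | K(18)+NM, Corp., Univ., C.+U., ORG | - | - | - | -
8-3 | K(18)+NM, Corp., Univ., C.+U., ORG | 71.82 | 65.73 | 94.85 | 77.35
8-4 | K(18)+NM, Corp., Univ., C.+U., ORG | 79.85 | 78.51 | 83.94 | 80.86

Scores are averages over 10-fold cross validation.

”-”: not calculated, because precision was zero or the learning optimization failed.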
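For reference, the four scores can be computed from confusion counts and averaged over folds as in the sketch below. The fold counts are invented, and the standard definitions of the metrics are assumed; the slides do not show the evaluation code.

```python
def prf(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # 0.0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f

def average_over_folds(folds):
    """Average each score over the cross-validation folds."""
    scores = [prf(*counts) for counts in folds]
    return tuple(sum(col) / len(scores) for col in zip(*scores))

# Hypothetical (tp, fp, fn, tn) counts for two folds of 66 test articles each.
folds = [(30, 5, 3, 28), (25, 10, 8, 23)]
avg = average_over_folds(folds)
print(avg)
```

Averaging the F-measure per fold is not the same as taking the harmonic mean of the averaged precision and recall, which may be why the F-measure column does not always equal the harmonic mean of the two neighbouring columns.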

Findings and discussion (1)

•  In test IDs 1-1, 1-2, and 1-3, the feature elements consist of BoW terms, numbering over 15,800, 13,000, and 12,000 respectively. Their f-measures are worse than those of the other features with the same linear kernel function. Learning appears to have failed.

•  A likely reason for the failure is that the training data (660 articles in total) is far too small for the feature vector size. With enough training data, the number of training samples exceeds the feature vector size. The training data size then surpasses the number of basis functions of the SVM, so that learning could be done without over-fitting.

Findings and discussion (2)

•  In test IDs 2-1 to 8-3, the number of feature elements is about 14 to 33.

•  Accuracy and f-measure gradually increase as more complex feature elements are added.

Findings and discussion (3)

•  Test IDs 7-* and 8-* are related to the occurrence of university and company symbols. In ID 7-3 in particular, recall is the highest (94.85, tied with 8-3) and the f-measure is the highest among the runs without γ tuning (77.61). This suggests that the co-occurrence of the two symbols in a sentence is a sensitive indicator of U-I relations.

•  Scores depend strongly on the kernel function type.

•  Parameters of the kernel function and efficiency of the loss function affect the balance between precision and recall. The parameter γ of the radial basis function is chosen to give the highest F-value under cross validation in this experiment.

Conclusion and future work

•  To automatically extract resources of U-I relations from the web,
   –  we targeted “press release articles” of organizations,
   –  a classification technique, i.e. a support vector machine (SVM), was applied to the decision.

•  We conducted an experiment on several combinations of feature vector elements and SVM kernel function types.

•  The combinations reveal that
   –  U-I relations keywords and
   –  university and company symbols in a sentence
   are effective elements for features.

•  Parameters of the SVM were tuned to get a higher f-measure, which also affects the balance between precision and recall.

•  Finally, we obtained an accuracy of 80.15% and an f-measure of 81.05% for classifying U-I relations documents on the web.

•  In future work, we will build the classifier into a context crawler that automatically crawls press release web sites of organizations to obtain more resources.
