Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation, for Rocling 2011. Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu. Institute of Information Science, Academia Sinica; Department of Computer Science & Engineering, Yuan Ze University


Page 1: for  Rocling 2011

Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation

for Rocling 2011

Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu

Institute of Information Science, Academia Sinica
Department of Computer Science & Engineering, Yuan Ze University

Page 2: for  Rocling 2011

Introduction

  • Term Contributed Boundary (TCB) feature using Conditional Random Fields (2010)
  • A unified view of several unsupervised feature selection methods based on frequent strings

Page 3: for  Rocling 2011

Flow chart

  • Unlabeled corpus
  • Automatically extracted patterns
  • Labeled training data
  • Chinese word segmentation model

Page 5: for  Rocling 2011

Toolkit

  • SRILM
  • YASA

Page 6: for  Rocling 2011

SRILM

  • C++ libraries
  • The toolkit provides N-gram statistics for language modeling

Page 7: for  Rocling 2011

YASA

  • Automatically extracts frequent strings from an unlabeled corpus

Pattern: 自然科學

String      Frequency   Net Frequency
自然科學    4           4
自然科      4           0
自然        10          6
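The relation between the two columns can be illustrated with a minimal sketch. It assumes, for illustration only, that the net frequency of a string is its frequency minus the occurrences already accounted for by longer frequent strings that contain it. The helper `net_frequencies` is hypothetical and is not part of YASA.

```python
def net_frequencies(freq):
    """Net frequency under an illustrative assumption (not YASA's actual code):
    subtract the occurrences already covered by longer strings."""
    net = {}
    # Process longer strings first so that their net counts are known.
    for s in sorted(freq, key=len, reverse=True):
        covered = sum(
            net[t] * t.count(s) for t in net if len(t) > len(s) and s in t
        )
        net[s] = freq[s] - covered
    return net


# Frequencies taken from the slide.
freq = {"自然科學": 4, "自然科": 4, "自然": 10}
net = net_frequencies(freq)
print(net)
```

With the slide's frequencies, every occurrence of 自然科 lies inside 自然科學, so nothing is left for it, while 自然 keeps the occurrences that are not part of the longer string.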

Page 8: for  Rocling 2011

Flow chart

  • Unlabeled corpus
  • Automatically extracted patterns
  • Labeled training data
  • Chinese word segmentation model

Page 9: for  Rocling 2011

6-Tag

Character   Label
反          B1
而          E
會          S
欲          B1
速          B2
則          B3
不          M
達          E
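The labeling in the table can be reproduced with a short sketch. The function name `six_tag` is hypothetical, and the rule for words of three or four characters (B1 B2 E and B1 B2 B3 E) is the usual 6-tag convention, assumed here because the slide only shows words of length 1, 2, and 5.

```python
def six_tag(word):
    """Return one 6-tag label (B1, B2, B3, M, E, S) per character of a word."""
    n = len(word)
    if n == 0:
        raise ValueError("word must not be empty")
    if n == 1:
        return ["S"]
    # Up to three begin tags, then M for the remaining inner characters.
    head = ["B1", "B2", "B3"][: n - 1]
    middle = ["M"] * (n - 1 - len(head))
    return head + middle + ["E"]


# Segmentation shown on the slide: 反而 / 會 / 欲速則不達
words = ["反而", "會", "欲速則不達"]
labels = [(ch, tag) for w in words for ch, tag in zip(w, six_tag(w))]
for ch, tag in labels:
    print(ch, tag)
```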

Page 10: for  Rocling 2011

Extended Label

[0-9]+(B1|B2|B3|M|E|S), i.e. a numeric score followed by one of the six tags
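As a small illustration, an extended label can be split back into its score and tag with a regular expression. The names `EXTENDED_LABEL` and `parse_extended_label` are hypothetical; the marker `-1`, which later slides use for unseen characters, is deliberately not matched by this pattern.

```python
import re

# A numeric score followed by exactly one of the six tags.
EXTENDED_LABEL = re.compile(r"^([0-9]+)(B1|B2|B3|M|E|S)$")


def parse_extended_label(label):
    """Return (score, tag) for labels such as '3B1', or None if it does not match."""
    m = EXTENDED_LABEL.match(label)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


print(parse_extended_label("3B1"))
print(parse_extended_label("-1"))
```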

Page 11: for  Rocling 2011

Score

  • N-Gram score
  • Frequent string score
  • Accessor variety score

Page 12: for  Rocling 2011

Score

  • Converted from term frequency and N-gram frequency
  • Logarithm ranking mechanism

Page 13: for  Rocling 2011

Score

Pattern       Frequency   Logarithm ranking mechanism score (rounded down)
塑膠原料的    10          log2(10) = 3
塑膠原料      5           log2(5) = 2
原料的        3           log2(3) = 1
的生產        2           log2(2) = 1
塑膠          4           log2(4) = 2
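The scores in the table are consistent with taking the base-2 logarithm of the frequency and rounding down. A minimal sketch under that assumption (the function name `log_rank_score` is hypothetical):

```python
import math


def log_rank_score(frequency):
    """Integer score: base-2 logarithm of the frequency, rounded down."""
    if frequency < 1:
        raise ValueError("frequency must be at least 1")
    return int(math.floor(math.log2(frequency)))


# Frequencies taken from the slide.
frequencies = {"塑膠原料的": 10, "塑膠原料": 5, "原料的": 3, "的生產": 2, "塑膠": 4}
scores = {pattern: log_rank_score(f) for pattern, f in frequencies.items()}
print(scores)
```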

Page 14: for  Rocling 2011

Score

  • Considers the score of the outer pattern
  • Equation of AV:

$AV(s) = \min\{L_{av}(s), R_{av}(s)\}$

$L_{av}(s)$: the number of distinct predecessors of $s$
$R_{av}(s)$: the number of distinct successors of $s$

Page 15: for  Rocling 2011

Score

AV(開發與法制)

  • Extended to the left: AV(的開發與法制), AV(是開發與法制), AV(有開發與法制)
  • Extended to the right: AV(開發與法制的), AV(開發與法制是), AV(開發與法制為)
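A minimal sketch of the usual accessor variety computation follows: the score of a string is the smaller of its number of distinct left neighbours and its number of distinct right neighbours. The function `accessor_variety` and the three toy sentences are assumptions built from the characters on the slide, not the authors' implementation or data.

```python
def accessor_variety(s, sentences):
    """min(distinct predecessors, distinct successors) of s in the sentences."""
    left, right = set(), set()
    for sent in sentences:
        start = sent.find(s)
        while start != -1:
            end = start + len(s)
            if start > 0:
                left.add(sent[start - 1])   # character just before s
            if end < len(sent):
                right.add(sent[end])        # character just after s
            start = sent.find(s, start + 1)
    return min(len(left), len(right))


# Toy corpus (hypothetical) using the neighbours shown on the slide.
sentences = ["的開發與法制的", "是開發與法制是", "有開發與法制為"]
av = accessor_variety("開發與法制", sentences)
print(av)
```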

Page 16: for  Rocling 2011

Score

Pattern: 塑膠原料的
Logarithm ranking mechanism score: log2(10) = 3

Character   6-Tag label   Label with score
塑          B1            3B1
膠          B2            3B2
原          B3            3B3
料          M             3M
的          E             3E

  • Scores are also used for filtering overlapping patterns

Page 17: for  Rocling 2011


Overlapping and Non-overlapping

Page 18: for  Rocling 2011

Non-overlapping

Character   TCB Feature
塑          B1
膠          B2
原          B3
料          M
的          E
生          -1
產          -1

  • "塑膠原料的" (score 3) conflicts with "的生產" (score 1)
  • "的生產" is labeled as unseen

Page 19: for  Rocling 2011

Overlapping information?

Term   Label
反     5S 3B1 4B1
而     6S 3E 4B2
會     6S 4E
欲     4S
速     4S
則     6S 3B1
不     7S 3E
達     5S 3E

Page 20: for  Rocling 2011

Overlapping String

Unsupervised feature selection by string length:

Input   1 char   2 char   3 char   4 char   5 char
反      5S       3B1      4B1      0B1      0B1
而      6S       3E       4B2      0B2      0B2
會      6S       0E       4E       0B3      0B3
欲      4S       0E       0E       0E       0M
速      4S       0E       0E       0E       0E
則      6S       3B1      0E       0E       0E
不      7S       3E       0E       0E       0E
達      5S       3E       0E       0E       0E

Page 21: for  Rocling 2011

Overlapping String

Unsupervised feature selection by string length (the 2-char cell for 而 shows two labels):

Input   1 char   2 char     3 char   4 char   5 char
反      5S       3B1        4B1      0B1      0B1
而      6S       3E / 0B1   4B2      0B2      0B2
會      6S       0E         4E       0B3      0B3
欲      4S       0E         0E       0E       0M
速      4S       0E         0E       0E       0E
則      6S       3B1        0E       0E       0E
不      7S       3E         0E       0E       0E
達      5S       3E         0E       0E       0E

Page 23: for  Rocling 2011

Overlapping String

Unsupervised feature selection by string length (the 2-char cell for 不 shows two labels):

Input   1 char   2 char     3 char   4 char   5 char
反      5S       3B1        4B1      0B1      0B1
而      6S       3E         4B2      0B2      0B2
會      6S       0E         4E       0B3      0B3
欲      4S       0E         0E       0E       0M
速      4S       0E         0E       0E       0E
則      6S       3B1        0E       0E       0E
不      7S       3E / 3B1   0E       0E       0E
達      5S       3E         0E       0E       0E

Page 25: for  Rocling 2011

Character-based N-Gram (CNG)

  • Character-based N-grams extracted by SRILM
  • Keeps overlapping information

Unsupervised feature selection by string length:

Input   1 char   2 char   3 char   4 char   5 char
戲      3S       1E       2B1      1B1      1B1
劇      4S       2B1      2B2      1B2      1B2
性      6S       3B1      2E       1B3      1B3
的      1S       3E       1E       1E       1M
結      2S       2E       1E       1E       1E
果      5S       2B1      1E       1E       1E
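The raw statistics behind such features are counts of character N-grams up to length 5. The slides obtain them with SRILM; the sketch below is only a plain-Python stand-in for illustration, and both `char_ngram_counts` and the two toy sentences are hypothetical.

```python
from collections import Counter


def char_ngram_counts(sentences, max_n=5):
    """Count every character N-gram of length 1..max_n in the sentences."""
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return counts


# Toy corpus (hypothetical), reusing the characters from the slide.
counts = char_ngram_counts(["戲劇性的結果", "戲劇性"])
print(counts["戲劇性"], counts["的結果"])
```

The counts could then be converted to integer scores with the logarithm ranking mechanism before being attached to the labels.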

Page 26: for  Rocling 2011

Term Contributed Boundary (TCB)

  • Uses frequent strings from YASA
  • Strings are selected by the forward maximum matching algorithm

Character   TCB Feature
塑          B1
膠          B2
原          B3
料          M
的          E
生          -1
產          -1
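A minimal sketch of forward maximum matching that reproduces the table: at each position the longest frequent string is taken, its characters receive 6-tag labels, and characters not covered by any frequent string receive -1. The function `tcb_features`, the `max_len` limit of 5, and the frequent-string set (taken from the earlier score slide) are assumptions for illustration.

```python
def _tags(n):
    """6-tag labels for a matched string of n characters."""
    if n == 1:
        return ["S"]
    head = ["B1", "B2", "B3"][: n - 1]
    return head + ["M"] * (n - 1 - len(head)) + ["E"]


def tcb_features(sentence, frequent_strings, max_len=5):
    """Forward maximum matching: greedily take the longest match from the left."""
    feats, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + n] in frequent_strings:
                feats.extend(_tags(n))
                i += n
                break
        else:
            feats.append("-1")  # no frequent string starts here: unseen
            i += 1
    return feats


frequent = {"塑膠原料的", "塑膠原料", "原料的", "的生產", "塑膠"}
features = tcb_features("塑膠原料的生產", frequent)
print(features)
```

Because 塑膠原料的 is consumed first, 的生產 can no longer match, which is why 生 and 產 end up as unseen.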

Page 27: for  Rocling 2011

Term Contributed Frequency (TCF)

  • Uses frequent strings from YASA
  • Keeps overlapping information
  • Scores are converted from frequent string frequencies

Unsupervised feature selection by string length:

Input   1 char   2 char   3 char   4 char   5 char
欲      4S       0E       0E       0E       0M
速      4S       0E       0E       0E       0E
則      6S       3B1      0E       0E       0E
不      7S       3E       0E       0E       0E
達      5S       3E       0E       0E       0E

Page 28: for  Rocling 2011

Accessor Variety based String (AVS)

  • Uses SRILM to generate N-grams
  • Measures how likely a substring is to be a Chinese word
  • Uses the logarithm ranking mechanism

Unsupervised feature selection by string length:

Input   1 char   2 char   3 char   4 char   5 char
塑      3B1      2B1      -1       2B2      -1
膠      3B2      2B2      -1       2E       -1
原      3B3      2M       1B1      2B2      -1
料      3M       2E       1B2      2E       -1
的      3E       -1       1E       -1       3S
生      -1       -1       1B2      0E       -1
產      -1       -1       1E       0E       -1

Page 29: for  Rocling 2011

AVS+TCB and AVS+TCF

  • Compound of AVS and TCB/TCF

Input   1 char   2 char   3 char   4 char   5 char   TCB
塑      3B1      2B1      -1       2B2      -1       B1
膠      3B2      2B2      -1       2E       -1       B2
原      3B3      2M       1B1      2B2      -1       B3
料      3M       2E       1B2      2E       -1       M
的      3E       -1       1E       -1       3S       E
生      -1       -1       1B2      0E       -1       -1
產      -1       -1       1E       0E       -1       -1

Page 30: for  Rocling 2011

CRF Labeling Scheme   Overlapping   Labeled Score
6-Tag                 -             None
AVS                   V             AV score
CNG                   V             N-Gram score
TCB                   -             None
TCF                   V             Frequent String score
AVS+TCB               AVS           AV score
AVS+TCF               V             Frequent String score, AV score

Page 31: for  Rocling 2011

Flow chart

  • Unlabeled corpus
  • Automatically extracted patterns
  • Labeled training data
  • Chinese word segmentation model

Page 32: for  Rocling 2011

Conditional Random Fields

  • Undirected graphical models trained to maximize the conditional probability of the label variables Y given the observed variables X
  • Feature instances are generated from a template file

Page 33: for  Rocling 2011

Conditional Random Fields

  • Feature template

Feature        Function
C-1, C0, C1    Previous, current, or next token
C-1C0          Previous and current tokens
C0C1           Current and next tokens
C-1C1          Previous and next tokens

Page 34: for  Rocling 2011

Conditional Random Fields

  • Feature template

Feature        Function
C-1, C0, C1    Previous, current, or next token
C-1C0          Previous and current tokens
C0C1           Current and next tokens
C-1C1          Previous and next tokens

Example input: 欲速則不達
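How the template turns into feature instances for one position can be sketched as follows. The function `expand_template` and the padding symbol `_` for positions outside the sentence are assumptions for illustration; the slides do not say how boundaries are handled or which CRF toolkit reads the template file.

```python
def expand_template(tokens, i, pad="_"):
    """Instantiate the six template features for the token at position i."""
    def c(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else pad

    return {
        "C-1": c(-1),
        "C0": c(0),
        "C1": c(1),
        "C-1C0": c(-1) + c(0),
        "C0C1": c(0) + c(1),
        "C-1C1": c(-1) + c(1),
    }


# Features for 則, the third character of the example input.
feats = expand_template("欲速則不達", 2)
for name, value in feats.items():
    print(name, value)
```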

Page 37: for  Rocling 2011

Experiment

  • Data set
    ◦ Academia Sinica (AS)
    ◦ City University of Hong Kong (CityU)
    ◦ Microsoft Research (MSR)
    ◦ Peking University (PKU)

Page 38: for  Rocling 2011

Evaluation Metric

$$\text{Precision } (P) = \frac{\text{the number of words that are correctly segmented}}{\text{the number of words that are segmented}}$$

$$\text{Recall } (R) = \frac{\text{the number of words that are correctly segmented}}{\text{the number of words in the gold standard}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
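These metrics can be computed by comparing word spans. The sketch below assumes that a word counts as correctly segmented only when both of its boundaries match the gold standard; `to_spans`, `prf`, and the predicted segmentation are hypothetical illustrations, not results from the experiments.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Precision, recall, and F of a predicted segmentation against the gold one."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["反而", "會", "欲速則不達"]
pred = ["反而", "會", "欲速", "則", "不達"]  # hypothetical system output
p, r, f = prf(gold, pred)
print(p, r, f)
```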

Page 39: for  Rocling 2011

Evaluation Metric

$$R_{OOV} = \frac{\text{the number of OOV words that are correctly segmented}}{\text{the number of OOV words in the gold standard}}$$

$$\text{Rank Score} = \frac{1}{\text{Number of datasets}} \sum_{i=1}^{\text{Number of datasets}} \frac{1}{\text{rank}_i}$$
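The rank score is a mean reciprocal rank over the datasets. A minimal sketch follows; `rank_score` is a hypothetical name and the ranks in the example are made up for illustration, not taken from the reported results.

```python
def rank_score(ranks):
    """Mean reciprocal rank: average of 1 / rank_i over all datasets."""
    if not ranks:
        raise ValueError("at least one rank is required")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of one feature set on four datasets.
example = rank_score([1, 2, 1, 4])
print(example)
```

A method that ranks first on every dataset gets a score of 1, and the score shrinks as its ranks get worse.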

Page 40: for  Rocling 2011

F1 measure

[Figure: F1 (y-axis, 0.92 to 0.98) of 6-tag, CNG, AVS, TCB, TCF, AVS+TCB, and AVS+TCF on AS, CityU, MSR, and PKU]

Page 41: for  Rocling 2011

Rank score of F1 measure

[Figure: rank score of F1 (MRR, y-axis 0 to 1) for CNG, 6-tag, TCF, TCB, AVS+TCF, AVS+TCB, and AVS]

Page 42: for  Rocling 2011

Recall of Out-Of-Vocabulary

[Figure: recall of out-of-vocabulary words (y-axis, 0.66 to 0.80) of 6-tag, CNG, AVS, TCB, TCF, AVS+TCB, and AVS+TCF on AS, CityU, MSR, and PKU]

Page 43: for  Rocling 2011

Rank score of ROOV

[Figure: rank score of ROOV (MRR, y-axis 0 to 0.8) for 6-tag, CNG, AVS, TCB, TCF, AVS+TCB, and AVS+TCF]

Page 44: for  Rocling 2011

Conclusion

  • The feature collections that contain AVS obtain better F1
  • TCB/TCF enhance the 6-tag approach on the recall of out-of-vocabulary words
  • Overlapping labels keep useful information only when the features are of high quality

Page 45: for  Rocling 2011


Thanks for your attention