Unsupervised Overlapping Feature Selection for
Conditional Random Fields Learning in Chinese Word Segmentation
for Rocling 2011
Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu
Institute of Information Science, Academia Sinica
Department of Computer Science & Engineering, Yuan Ze University
Introduction
Term Contributed Boundary (TCB) feature using Conditional Random Fields (2010)
A unified view of several unsupervised feature selection methods based on frequent strings
Flow chart
Unlabeled corpus
Automatically extracted patterns
Labeled training data
Chinese word segmentation model
Toolkit
SRILM
YASA
SRILM
C++ libraries
The toolkit provides N-gram statistics for language modeling
YASA
Automatically extracts frequent strings from an unlabeled corpus
Pattern: 自然科學
String      Frequency  Net frequency
自然科學    4          4
自然科      4          0
自然        10         6
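The slides do not say how YASA derives net frequency. The sketch below is one assumed rule that happens to reproduce the example numbers: process strings from longest to shortest and subtract, from each string, the net occurrences of every longer string that contains it. The function name `net_frequencies` is hypothetical and is not part of YASA.

```python
def net_frequencies(freq):
    """Assumed rule: net(s) = freq(s) minus net occurrences of longer strings containing s."""
    net = {}
    # Longest strings first, so every longer container is already resolved.
    for s in sorted(freq, key=len, reverse=True):
        covered = sum(net[t] * t.count(s) for t in net if len(t) > len(s))
        net[s] = freq[s] - covered
    return net


example = {"自然科學": 4, "自然科": 4, "自然": 10}
print(net_frequencies(example))
# {'自然科學': 4, '自然科': 0, '自然': 6}
```

Under this rule, 自然科 never occurs outside 自然科學, so its net frequency drops to 0, while 自然 keeps the 6 occurrences that are not part of a longer frequent string.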
6-Tag
Character  Label
反  B1
而  E
會  S
欲  B1
速  B2
則  B3
不  M
達  E
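The table above can be reproduced with a small labeling routine. This is a minimal sketch: the slide only shows words of length 1, 2, and 5, so the tags for lengths 3 and 4 (B1 B2 E and B1 B2 B3 E) follow the usual 6-tag convention and are an assumption here. The function names are hypothetical.

```python
def six_tags(length):
    """Return the 6-tag label sequence for a word of the given length."""
    if length == 1:
        return ["S"]
    head = ["B1", "B2", "B3"][: length - 1]      # up to three begin tags
    middle = ["M"] * (length - 1 - len(head))    # any remaining inner characters
    return head + middle + ["E"]


def label_words(words):
    """Pair every character of a segmented sentence with its 6-tag label."""
    return [(ch, tag) for w in words for ch, tag in zip(w, six_tags(len(w)))]


print(label_words(["反而", "會", "欲速則不達"]))
```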
Extended Label
[0-9]+ [B1|B2|B3|M|E|S]
A numeric score followed by a 6-tag label
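As a rough illustration of the extended label format, the sketch below splits a label such as 3B1 into its score and its tag with Python's standard `re` module. The name `parse_label` is hypothetical. Labels that do not fit the pattern, such as the -1 used for unseen characters in the later tables, are returned as `None`.

```python
import re

# One or more digits (the score) followed by exactly one 6-tag label.
EXTENDED_LABEL = re.compile(r"([0-9]+)(B1|B2|B3|M|E|S)")


def parse_label(label):
    """Split an extended label into (score, tag); return None if it does not match."""
    match = EXTENDED_LABEL.fullmatch(label)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_label("3B1"))   # (3, 'B1')
print(parse_label("-1"))    # None
```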
Score
N-Gram score
Frequent string score
Accessor variety score
Score
Converted from term frequency and N-gram frequency
Logarithm ranking mechanism
Score
Pattern       Frequency  Score (logarithm ranking mechanism)
塑膠原料的    10         floor(log2(10)) = 3
塑膠原料      5          floor(log2(5)) = 2
原料的        3          floor(log2(3)) = 1
的生產        2          floor(log2(2)) = 1
塑膠          4          floor(log2(4)) = 2
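The five example rows are all consistent with taking the integer part of the base-2 logarithm of the frequency. A minimal sketch under that reading, with a hypothetical function name:

```python
def log_rank(frequency):
    """Integer part of log2(frequency) for a positive integer frequency."""
    if frequency < 1:
        raise ValueError("frequency must be a positive integer")
    # For positive integers, bit_length() - 1 equals floor(log2(n)) exactly,
    # which avoids floating-point rounding.
    return frequency.bit_length() - 1


for pattern, freq in [("塑膠原料的", 10), ("塑膠原料", 5), ("原料的", 3), ("的生產", 2), ("塑膠", 4)]:
    print(pattern, freq, log_rank(freq))
```

Ranking on a logarithmic scale keeps the number of distinct score values small, so each score can be attached to a label as a short prefix.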
Score
Consider the score of the outer pattern
Equation of AV
$L_{av}(s)$: the number of distinct predecessors of $s$
$R_{av}(s)$: the number of distinct successors of $s$
Score
AV(開發與法制)
AV(的開發與法制), AV(是開發與法制), AV(有開發與法制), AV(開發與法制的), AV(開發與法制是), AV(開發與法制為)
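The equation itself did not survive in the slide text. The sketch below assumes the usual accessor variety definition, AV(s) = min(L_av(s), R_av(s)), and counts distinct neighboring characters in a toy corpus built from the six outer patterns listed above. Sentence boundaries are simply ignored here, which is a simplification, and the function name is hypothetical.

```python
def accessor_variety(s, corpus):
    """min(number of distinct predecessors, number of distinct successors) of s."""
    predecessors, successors = set(), set()
    for sentence in corpus:
        start = sentence.find(s)
        while start != -1:
            end = start + len(s)
            if start > 0:
                predecessors.add(sentence[start - 1])
            if end < len(sentence):
                successors.add(sentence[end])
            start = sentence.find(s, start + 1)   # continue with the next occurrence
    return min(len(predecessors), len(successors))


# Toy corpus: predecessors 的, 是, 有 and successors 的, 是, 為.
toy_corpus = ["的開發與法制的", "是開發與法制是", "有開發與法制為"]
print(accessor_variety("開發與法制", toy_corpus))  # 3
```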
Score
Pattern: 塑膠原料的
Score (logarithm ranking mechanism): floor(log2(10)) = 3
6-Tag label:       塑 B1   膠 B2   原 B3   料 M   的 E
Label with score:  塑 3B1  膠 3B2  原 3B3  料 3M  的 3E
Scores are also used for filtering overlapping patterns
Overlapping and Non-overlapping
Non-overlapping
Character  TCB Feature
塑  B1
膠  B2
原  B3
料  M
的  E
生  -1
產  -1
“塑膠原料的” (score 3) conflicts with “的生產” (score 1)
“的生產” is labeled as unseen
Overlapping information?
Term  Label
反  5S 3B1 4B1
而  6S 3E 4B2
會  6S 4E
欲  4S
速  4S
則  6S 3B1
不  7S 3E
達  5S 3E
Overlapping String
Input  Unsupervised Feature Selection
       1 char  2 char  3 char  4 char  5 char
反     5S      3B1     4B1     0B1     0B1
而     6S      3E      4B2     0B2     0B2
會     6S      0E      4E      0B3     0B3
欲     4S      0E      0E      0E      0M
速     4S      0E      0E      0E      0E
則     6S      3B1     0E      0E      0E
不     7S      3E      0E      0E      0E
達     5S      3E      0E      0E      0E
In the 2 char column, 而 has two candidate labels: 3E and 0B1
In the 2 char column, 不 has two candidate labels: 3E and 3B1
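The slides do not spell out how each column is filled. The sketch below uses a rule inferred from the example table, so treat it as an assumption: for each pattern length n, slide over every n-gram from left to right, label its characters with the n-gram's score plus the 6-tag label, and overwrite an existing label only when the new score is strictly higher. N-grams missing from the score table are assumed to score 0. All names are hypothetical.

```python
def six_tags(length):
    """6-tag label sequence for a string of the given length."""
    if length == 1:
        return ["S"]
    head = ["B1", "B2", "B3"][: length - 1]
    return head + ["M"] * (length - 1 - len(head)) + ["E"]


def overlapping_column(sentence, n, scores):
    """Build the n-char feature column; a higher score wins, ties keep the earlier label."""
    column = [None] * len(sentence)
    tags = six_tags(n)
    for start in range(len(sentence) - n + 1):
        score = scores.get(sentence[start:start + n], 0)  # assumed default for unseen n-grams
        for offset, tag in enumerate(tags):
            i = start + offset
            if column[i] is None or score > column[i][0]:
                column[i] = (score, tag)
    return ["-1" if cell is None else f"{cell[0]}{cell[1]}" for cell in column]


SCORES = {"反": 5, "而": 6, "會": 6, "欲": 4, "速": 4, "則": 6, "不": 7, "達": 5,
          "反而": 3, "反而會": 4, "則不": 3, "不達": 3}
for n in range(1, 6):
    print(n, overlapping_column("反而會欲速則不達", n, SCORES))
```

With these scores the sketch yields the same five columns as the table, including 3E for 而 and for 不 in the 2 char column, where two overlapping bigrams compete.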
Character-based N-Gram (CNG)
Character-based N-grams extracted by SRILM
Keeps overlapping information
Input  Unsupervised Feature Selection
       1 char  2 char  3 char  4 char  5 char
戲     3S      1E      2B1     1B1     1B1
劇     4S      2B1     2B2     1B2     1B2
性     6S      3B1     2E      1B3     1B3
的     1S      3E      1E      1E      1M
結     2S      2E      1E      1E      1E
果     5S      2B1     1E      1E      1E
Term Contributed Boundary (TCB)
Uses frequent strings from YASA
Selected by the forward maximum matching algorithm
Character  TCB Feature
塑  B1
膠  B2
原  B3
料  M
的  E
生  -1
產  -1
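Forward maximum matching can be sketched as follows: at each position, take the longest frequent string that starts there, label it with 6-tag labels, and mark characters that start no frequent string as -1. The frequent string set is taken from the earlier score example; the function names are hypothetical, and the details of the original implementation are not given in the slides.

```python
def six_tags(length):
    """6-tag label sequence for a string of the given length."""
    if length == 1:
        return ["S"]
    head = ["B1", "B2", "B3"][: length - 1]
    return head + ["M"] * (length - 1 - len(head)) + ["E"]


def tcb_features(sentence, frequent_strings):
    """Label a sentence by forward maximum matching against a set of frequent strings."""
    max_len = max(map(len, frequent_strings), default=0)
    features, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + n] in frequent_strings:
                features.extend(six_tags(n))
                i += n
                break
        else:
            features.append("-1")  # no frequent string starts here: unseen
            i += 1
    return features


strings = {"塑膠原料的", "塑膠原料", "原料的", "的生產", "塑膠"}
print(tcb_features("塑膠原料的生產", strings))
```

Because 塑膠原料的 is matched first and consumes 的, the overlapping string 的生產 can no longer match, which leaves 生 and 產 labeled -1 as in the table.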
Term Contributed Frequency (TCF)
Uses frequent strings from YASA
Keeps overlapping information
Converts the score from frequent string frequency
Input  Unsupervised Feature Selection
       1 char  2 char  3 char  4 char  5 char
欲     4S      0E      0E      0E      0M
速     4S      0E      0E      0E      0E
則     6S      3B1     0E      0E      0E
不     7S      3E      0E      0E      0E
達     5S      3E      0E      0E      0E
Accessor Variety based String (AVS)
Uses SRILM to generate N-grams
Measures how likely a substring is to be a Chinese word
Uses the logarithm ranking mechanism
Input  Unsupervised Feature Selection
       1 char  2 char  3 char  4 char  5 char
塑     3B1     2B1     -1      2B2     -1
膠     3B2     2B2     -1      2E      -1
原     3B3     2M      1B1     2B2     -1
料     3M      2E      1B2     2E      -1
的     3E      -1      1E      -1      3S
生     -1      -1      1B2     0E      -1
產     -1      -1      1E      0E      -1
AVS+TCB and AVS+TCF
Combines AVS with TCB or TCF
Input  Unsupervised Feature Selection
       1 char  2 char  3 char  4 char  5 char  TCB
塑     3B1     2B1     -1      2B2     -1      B1
膠     3B2     2B2     -1      2E      -1      B2
原     3B3     2M      1B1     2B2     -1      B3
料     3M      2E      1B2     2E      -1      M
的     3E      -1      1E      -1      3S      E
生     -1      -1      1B2     0E      -1      -1
產     -1      -1      1E      0E      -1      -1
CRF Labeling Scheme  Overlapping  Labeled Score
6-Tag                No           None
AVS                  Yes          AV score
CNG                  Yes          N-Gram score
TCB                  No           None
TCF                  Yes          Frequent string score
AVS+TCB              AVS          AV score
AVS+TCF              Yes          Frequent string score, AV score
Conditional Random Fields
Undirected graphical models trained to maximize the conditional probability of the label sequence Y given the observation sequence X
Feature instances are generated from a template file
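For reference, the conditional probability that a CRF maximizes is commonly written in the linear-chain form below. The slides do not state the exact model structure, so the chain assumption and the symbols $\lambda_k$ (feature weights), $f_k$ (feature functions), and $Z(X)$ (normalization term) are the standard textbook notation rather than something taken from the presentation.

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Bigr),
\qquad
Z(X) = \sum_{Y'} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Bigr)
```

Each $f_k$ is one feature instance generated from the template file, evaluated at position $t$ of the character sequence.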
Conditional Random Fields
Feature template
Feature        Function
C-1, C0, C1    Previous, current, or next token
C-1C0          Previous and current tokens
C0C1           Current and next tokens
C-1C1          Previous and next tokens
Example sentence for the feature template: 欲速則不達
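The template can be read as a recipe for extracting context features at each character position. The sketch below applies it to the example sentence 欲速則不達. The padding symbol `<PAD>` for positions outside the sentence and the function name are assumptions made for illustration.

```python
def template_features(chars, i):
    """Expand the feature template at position i of a character sequence."""
    def c(offset):
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else "<PAD>"  # assumed boundary symbol

    return {
        "C-1": c(-1),              # previous token
        "C0": c(0),                # current token
        "C1": c(1),                # next token
        "C-1C0": c(-1) + c(0),     # previous and current tokens
        "C0C1": c(0) + c(1),       # current and next tokens
        "C-1C1": c(-1) + c(1),     # previous and next tokens
    }


print(template_features("欲速則不達", 2))
```

At position 2 (則) this gives 速, 則, 不 for the unigram features and 速則, 則不, 速不 for the combined ones.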
Experiment
Data set
◦ Academia Sinica (AS)
◦ City University of Hong Kong (CityU)
◦ Microsoft Research (MSR)
◦ Peking University (PKU)
Evaluation Metric
$$\text{Precision } (P) = \frac{\text{the number of words that are correctly segmented}}{\text{the number of words that are segmented}}$$
$$\text{Recall } (R) = \frac{\text{the number of words that are correctly segmented}}{\text{the number of words in the gold standard}}$$
$$F = \frac{2 \times P \times R}{P + R}$$
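A word counts as correctly segmented when both of its boundaries agree with the gold standard. The sketch below compares character spans to compute P, R, and F for one sentence; the function names and the example segmentation are made up for illustration.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    result, start = set(), 0
    for w in words:
        result.add((start, start + len(w)))
        start += len(w)
    return result


def precision_recall_f(gold_words, predicted_words):
    """Word-level precision, recall, and F for one segmented sentence."""
    gold, predicted = spans(gold_words), spans(predicted_words)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["反而", "會", "欲速則不達"]
predicted = ["反而", "會", "欲速", "則", "不達"]
print(precision_recall_f(gold, predicted))
```

Here two of the five predicted words match the three gold words, so P = 2/5, R = 2/3, and F = 0.5.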
Evaluation Metric
$$R_{OOV} = \frac{\text{the number of OOV words that are correctly segmented}}{\text{the number of OOV words in the gold standard}}$$
$$\text{RankScore} = \frac{1}{\text{Number of datasets}} \sum_{i=1}^{\text{Number of datasets}} \frac{1}{\text{rank}_i}$$
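The rank score averages the reciprocal of a method's rank over the data sets, so a method that ranks first everywhere scores 1. A minimal sketch follows; the ranks in the example are hypothetical and are not results from the experiments.

```python
def rank_score(ranks):
    """Mean reciprocal rank of one method across data sets (rank 1 is best)."""
    if not ranks:
        raise ValueError("at least one data set rank is required")
    return sum(1 / r for r in ranks) / len(ranks)


# Hypothetical ranks of a single method on AS, CityU, MSR, and PKU.
print(rank_score([1, 2, 1, 4]))  # 0.6875
```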
F1 measure
[Figure: bar chart of F1 (y-axis from 0.92 to 0.98) for 6-tag, CNG, AVS, TCB, TCF, AVS+TCB, and AVS+TCF on the AS, CityU, MSR, and PKU data sets]
Rank score of F1 measure
[Figure: bar chart of the rank score (MRR) of F1 (y-axis from 0 to 1); x-axis order: CNG, 6-tag, TCF, TCB, AVS+TCF, AVS+TCB, AVS]
Recall of Out-Of-Vocabulary
[Figure: bar chart of OOV recall (y-axis from 0.66 to 0.80) for 6-tag, CNG, AVS, TCB, TCF, AVS+TCB, and AVS+TCF on the AS, CityU, MSR, and PKU data sets]
Rank score of ROOV
[Figure: bar chart of the rank score (MRR) of ROOV (y-axis from 0 to 0.8); x-axis order: 6-tag, CNG, AVS, TCB, TCF, AVS+TCB, AVS+TCF]
Conclusion
Feature collections that contain AVS obtain better F1
TCB and TCF improve on the 6-tag approach in recall of out-of-vocabulary words
Overlapping labels retain useful information only when the features are of high quality
Thanks for your attention