The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition
Gina-Anne Levow
Fifth SIGHAN Workshop
July 22, 2006
Roadmap
- Bakeoff Task Motivation
- Bakeoff Structure: Materials and annotations; Tasks and conditions; Participants and timeline
- Results & Discussion: Word Segmentation; Named Entity Recognition
- Observations & Conclusions
- Thanks
Bakeoff Task Motivation
- Core enabling technologies for Chinese language processing
- Word Segmentation (WS)
  - Crucial tokenization in the absence of whitespace
  - Supports POS tagging, parsing, reference resolution, etc.
  - Fundamental challenges:
    - "Word" is not well or consistently defined; humans disagree
    - Unknown words impede performance
- Named Entity Recognition (NER)
  - Essential for reference resolution, IR, etc.
  - A common class of new, unknown words
Data Source Characterization
- Five corpora from five providers
- Annotation guidelines available, but varied across providers
- Both simplified and traditional characters
- Range of encodings; all corpora available in Unicode (UTF-8)
- Provided in a common XML format, converted to train/test form (LDC)
Tasks and Tracks
- Tasks:
  - Word Segmentation
    - Training and truth: whitespace-delimited
    - End-of-word tags replaced with a space; no other markup
  - Named Entity Recognition
    - Training and truth: two-column format, similar to CoNLL (see the reader sketch below)
    - NAMEX only: LOC, PER, ORG (LDC: + GPE)
- Tracks:
  - Closed: only the provided materials may be used
  - Open: any materials may be used, but they must be documented
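As a concrete illustration of the two data formats, here is a minimal sketch of readers for the whitespace-delimited segmentation files and the CoNLL-style two-column NER files. The UTF-8 encoding and character-level NER tokens are assumptions for illustration, not details taken from the bakeoff distribution.

def read_segmentation(path):
    """Word segmentation training/truth data: one sentence per line,
    words delimited by whitespace. Returns a list of word lists."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if words:
                sentences.append(words)
    return sentences


def read_ner(path):
    """NER training/truth data: CoNLL-style two columns (token, tag),
    with a blank line between sentences. Tags cover LOC, PER, ORG
    (plus GPE for the LDC corpus)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, tag = parts[0], parts[-1]
                current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences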
Structure: Participants & Timeline
- Participants:
  - 29 sites submitted runs for evaluation (36 initially registered)
  - 144 runs submitted: roughly 2/3 WS, 1/3 NER
  - Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan; 1 each from Singapore, Korea, Hong Kong, Canada
  - Mix of commercial sites (MSRA, Yahoo!, Alias-i, FR Telecom, etc.) and academic sites
- Timeline:
  - March 15: registration opened
  - April 17: training data released
  - May 15: test data released
  - May 17: results due
Word Segmentation: Results
- Contrasts: left-to-right maximal-match segmentation (a sketch follows the tables below)
  - Baseline: uses only the training vocabulary
  - Topline: uses only the testing vocabulary
- Roov/Riv: recall on out-of-vocabulary vs. in-vocabulary words

Baseline (maximal match, training vocabulary):
Source  Recall  Prec   F-score  OOV    Roov   Riv
CITYU   0.930   0.882  0.906    0.049  0.009  0.969
CKIP    0.915   0.870  0.892    0.042  0.030  0.954
MSRA    0.949   0.900  0.924    0.034  0.022  0.981
UPUC    0.869   0.790  0.828    0.088  0.011  0.951

Topline (maximal match, testing vocabulary):
Source  Recall  Prec   F-score  OOV    Roov   Riv
CITYU   0.982   0.985  0.984    0.040  0.993  0.981
CKIP    0.980   0.987  0.983    0.042  0.997  0.979
MSRA    0.991   0.993  0.992    0.034  0.999  0.991
UPUC    0.961   0.976  0.968    0.088  0.989  0.958
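The left-to-right maximal-match segmenter behind the baseline and topline greedily takes the longest dictionary match at each position, falling back to a single character. A minimal sketch, assuming the dictionary is simply the training vocabulary (baseline) or the test vocabulary (topline):

def max_match(sentence, vocab):
    """Greedy left-to-right maximal matching: at each position, emit the
    longest substring found in vocab, or a single character if none match."""
    max_len = max((len(w) for w in vocab), default=1)
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in vocab or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words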
Word Segmentation: CityU

CityU Closed:
Site  RunID  R      P      F      Roov   Riv
15    D      0.973  0.972  0.972  0.787  0.981
15    B      0.973  0.972  0.972  0.787  0.981
20           0.972  0.971  0.971  0.792  0.979
32           0.969  0.970  0.970  0.773  0.978

CityU Open:
Site  RunID  R      P      F      Roov   Riv
20           0.978  0.977  0.977  0.840  0.984
32           0.979  0.976  0.977  0.813  0.985
34           0.971  0.967  0.969  0.795  0.978
22           0.970  0.965  0.967  0.761  0.979
Word Segmentation: CKIP

CKIP Closed:
Site  RunID  R      P      F      Roov   Riv
20           0.961  0.955  0.958  0.702  0.972
15    A      0.961  0.953  0.957  0.658  0.974
15    B      0.961  0.952  0.956  0.656  0.974
32           0.958  0.948  0.953  0.646  0.972

CKIP Open:
Site  RunID  R      P      F      Roov   Riv
20           0.964  0.955  0.959  0.704  0.975
34           0.959  0.949  0.954  0.672  0.972
32           0.958  0.948  0.953  0.647  0.972
2     A      0.953  0.946  0.949  0.679  0.965
Word Segmentation: MSRA

MSRA Closed:
Site  RunID  R      P      F      Roov   Riv
32           0.964  0.961  0.963  0.612  0.976
26           0.961  0.953  0.957  0.499  0.977
9            0.959  0.955  0.957  0.494  0.975
1     A      0.955  0.956  0.956  0.650  0.966

MSRA Open:
Site  RunID  R      P      F      Roov   Riv
11    A      0.980  0.978  0.979  0.839  0.985
11    B      0.977  0.976  0.977  0.840  0.982
14           0.975  0.976  0.975  0.811  0.981
32           0.977  0.971  0.974  0.675  0.988
Word Segmentation: UPUC

UPUC Closed:
Site  RunID  R      P      F      Roov   Riv
20           0.940  0.926  0.933  0.707  0.963
32           0.936  0.923  0.930  0.683  0.961
1     A      0.940  0.914  0.927  0.634  0.969
26    A      0.936  0.917  0.926  0.617  0.966

UPUC Open:
Site  RunID  R      P      F      Roov   Riv
34           0.949  0.939  0.944  0.768  0.966
2            0.942  0.928  0.935  0.711  0.964
20           0.940  0.927  0.933  0.741  0.959
7            0.944  0.922  0.933  0.680  0.970
Word Segmentation: Overview
- F-scores ranged from 0.481 to 0.979 (word-level scoring is sketched below)
- Best score: MSRA Open track (FR Telecom)
- Best relative to topline: CityU Open, reaching >99% of the topline F-score
- Most frequent top rank: MSRA
- Both F-scores and OOV recall were higher in the Open track
- Overall good results: most systems outperformed the baseline
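For reference, the word-level recall, precision, F-score, and OOV/in-vocabulary recall splits reported in the tables above can be computed along the following lines. This is a sketch of standard word-level scoring, where a system word counts as correct if its character span matches a gold word's span; the official score script may differ in minor details.

def word_spans(words):
    """Map a word sequence to the set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def score(gold_sents, sys_sents, training_vocab):
    """Word-level R/P/F plus recall on OOV vs. in-vocabulary gold words."""
    correct = gold_total = sys_total = 0
    oov_hit = oov_total = iv_hit = iv_total = 0
    for gold, sys in zip(gold_sents, sys_sents):
        g_spans, s_spans = word_spans(gold), word_spans(sys)
        correct += len(g_spans & s_spans)
        gold_total += len(g_spans)
        sys_total += len(s_spans)
        pos = 0
        for w in gold:
            span = (pos, pos + len(w))
            pos += len(w)
            if w in training_vocab:
                iv_total += 1
                iv_hit += span in s_spans
            else:
                oov_total += 1
                oov_hit += span in s_spans
    r, p = correct / gold_total, correct / sys_total
    return {"R": r, "P": p, "F": 2 * p * r / (p + r),
            "Roov": oov_hit / oov_total if oov_total else 0.0,
            "Riv": iv_hit / iv_total if iv_total else 0.0}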
Word Segmentation: Discussion
- Continuing OOV challenges
- Highest F-scores on MSRA
  - Also the highest topline and baseline
  - Lowest OOV rate
- Lowest F-scores on UPUC
  - Also the lowest topline and baseline
  - Highest OOV rate (more than double that of any other corpus)
  - Smallest corpus (roughly 1/3 the size of MSRA)
- Best scores on the most consistent corpus: vocabulary and annotation
- UPUC also varies in genre: training is CTB; test is CTB, newswire (NW), and broadcast news (BN)
NER Results
- Contrast: the baseline labels a string as a named entity only if it has a unique tag in the training data (one possible implementation is sketched after the table)

Baseline:
Source  P      R      F      PER-F  ORG-F  LOC-F  GPE-F
CITYU   0.611  0.467  0.529  0.587  0.516  0.503  N/A
LDC     0.493  0.378  0.428  0.395  0.290  0.259  0.539
MSRA    0.590  0.488  0.534  0.614  0.469  0.531  N/A
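One way to realize the "unique tag in training" baseline described above: collect every entity string from the training data, keep only those that always received a single tag, and label test spans by greedy longest-match lookup. The greedy matching strategy and the ten-character cap are assumptions made for illustration; the official baseline may have been implemented differently.

from collections import defaultdict


def unique_tag_entities(train_entities):
    """train_entities: (string, tag) pairs extracted from training data.
    Keep only strings that always received one and the same tag."""
    tags = defaultdict(set)
    for text, tag in train_entities:
        tags[text].add(tag)
    return {text: next(iter(ts)) for text, ts in tags.items() if len(ts) == 1}


def baseline_tag(sentence, entity_dict, max_len=10):
    """Label the characters of a test sentence: a span gets an entity tag
    only if the exact string had a unique tag in training; otherwise 'O'."""
    labels, i = ["O"] * len(sentence), 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            tag = entity_dict.get(sentence[i:j])
            if tag is not None:
                labels[i:j] = [tag] * (j - i)
                i = j
                break
        else:
            i += 1
    return labels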
NER Results: CityU

CityU Closed:
Site  P      R      F      ORG-F  LOC-F  PER-F
3     0.914  0.867  0.890  0.805  0.921  0.909
19    0.920  0.854  0.886  0.805  0.925  0.887
21a   0.927  0.847  0.885  0.797  0.920  0.890
21b   0.924  0.849  0.885  0.798  0.924  0.892

CityU Open:
Site  P      R      F      ORG-F  LOC-F  PER-F
6     0.869  0.749  0.805  0.680  0.860  0.810
NER Results: LDC

LDC Closed:
Site       P       R      F      ORG-F  LOC-F  PER-F
7          0.7616  0.662  0.708  0.521  0.286  0.742
6-gpe-loc  0.672   0.655  0.664  0.455  0.708  0.742
6          0.306   0.298  0.302  0.455  0.037  0.742

LDC Open:
Site  P      R      F      ORG-F  LOC-F  PER-F
3     0.803  0.726  0.763  0.658  0.305  0.788
8     0.814  0.594  0.688  0.585  0.170  0.657
NER Results: MSRA

MSRA Closed:
Site  P      R      F      ORG-F  LOC-F  PER-F
14    0.889  0.842  0.865  0.831  0.854  0.901
21a   0.912  0.817  0.862  0.820  0.905  0.826
21b   0.884  0.829  0.856  0.770  0.901  0.849
3     0.881  0.823  0.851  0.815  0.906  0.794

MSRA Open:
Site  P      R      F      ORG-F  LOC-F  PER-F
10    0.922  0.902  0.912  0.859  0.903  0.960
14    0.908  0.892  0.899  0.840  0.910  0.926
11b   0.877  0.875  0.876  0.761  0.897  0.922
11a   0.864  0.840  0.852  0.694  0.874  0.920
NER: Overview
- Overall results:
  - Best F-score: MSRA Open track, 0.91
  - Strong overall performance: only two results fell below the baseline
- Direct comparison of Open vs. Closed NER is difficult: only two sites participated in both tracks
  - Only MSRA had a large number of runs
  - There, Open outperformed Closed: the top three Open runs beat the best Closed run
NER Observations
- Named Entity Recognition challenges: tagsets, variation, and corpus size
- Results on MSRA and CityU were much better than on LDC
  - The LDC corpus is substantially smaller
  - It also has a larger tagset, including GPE, which is easily confused with ORG or LOC
- NER results are sensitive to corpus size, tagset, and genre
Conclusions & Future Challenges
- Strong, diverse participation in WS & NER; many effective, competitive results
- Cross-task and cross-evaluation comparisons remain difficult
  - Scores are sensitive to corpus size, annotation consistency, tagset, genre, etc.
  - Need a corpus- and configuration-independent measure of progress
  - Encourage submissions that support comparisons
- Extrinsic, task-oriented evaluation of WS/NER
- Continuing challenges: OOV words, annotation consistency, encoding combinations and variation, code-switching
Thanks
- Data providers:
  - Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: Keh-Jiann Chen, Henning Chiu
  - City University of Hong Kong: Benjamin K. Tsou, Olivia Oi Yee Kwong
  - Linguistic Data Consortium: Stephanie Strassel
  - Microsoft Research Asia: Mu Li
  - University of Pennsylvania / University of Colorado: Martha Palmer, Nianwen Xue
- Workshop co-chairs: Hwee Tou Ng and Olivia Oi Yee Kwong
- All participants!