The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition
Gina-Anne Levow
Fifth SIGHAN Workshop
July 22, 2006
Roadmap
- Bakeoff Task Motivation
- Bakeoff Structure: Materials and annotations; Tasks and conditions; Participants and timeline
- Results & Discussion: Word Segmentation; Named Entity Recognition
- Observations & Conclusions
- Thanks
Bakeoff Task Motivation
- Core enabling technologies for Chinese language processing
- Word Segmentation (WS)
  - Crucial tokenization in the absence of whitespace
  - Supports POS tagging, parsing, reference resolution, etc.
  - Fundamental challenges:
    - "Word" is not well or consistently defined; humans disagree
    - Unknown words impede performance
- Named Entity Recognition (NER)
  - Essential for reference resolution, IR, etc.
  - A common class of new, unknown words
Data Source Characterization
- Five corpora from five providers
- Annotation guidelines available, but varied across providers
- Both simplified and traditional characters
- Range of encodings; all corpora available in Unicode (UTF-8)
- Provided in a common XML format, converted to train/test form (LDC)
Tasks and Tracks
- Tasks:
  - Word Segmentation
    - Training and truth: whitespace-delimited
    - End-of-word tags replaced with a space; no other markup
  - Named Entity Recognition
    - Training and truth: two-column format, similar to CoNLL (see the reader sketch below)
    - NAMEX only: LOC, PER, ORG (LDC: + GPE)
- Tracks:
  - Closed: only the provided materials may be used
  - Open: any materials may be used, but they must be documented
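As a concrete illustration of the two data formats, here is a minimal sketch of readers for the whitespace-delimited segmentation files and the CoNLL-style two-column NER files. The UTF-8 encoding and character-level NER tokens are assumptions for illustration, not details taken from the bakeoff distribution.

def read_segmentation(path):
    """Word segmentation training/truth data: one sentence per line,
    words delimited by whitespace. Returns a list of word lists."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if words:
                sentences.append(words)
    return sentences


def read_ner(path):
    """NER training/truth data: CoNLL-style two columns (token, tag),
    with a blank line between sentences. Tags cover LOC, PER, ORG
    (plus GPE for the LDC corpus)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, tag = parts[0], parts[-1]
                current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences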
Structure: Participants & Timeline
- Participants:
  - 29 sites submitted runs for evaluation (36 initially registered)
  - 144 runs submitted: roughly 2/3 WS, 1/3 NER
  - Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan; 1 each from Singapore, Korea, Hong Kong, Canada
  - Mix of commercial sites (MSRA, Yahoo!, Alias-i, FR Telecom, etc.) and academic sites
- Timeline:
  - March 15: registration opened
  - April 17: training data released
  - May 15: test data released
  - May 17: results due
Word Segmentation: Results
- Contrasts: left-to-right maximal-match segmentation (a sketch follows the tables below)
  - Baseline: uses only the training vocabulary
  - Topline: uses only the testing vocabulary
- Roov/Riv: recall on out-of-vocabulary vs. in-vocabulary words

Baseline (maximal match, training vocabulary):
Source  Recall  Prec   F-score  OOV    Roov   Riv
CITYU   0.930   0.882  0.906    0.049  0.009  0.969
CKIP    0.915   0.870  0.892    0.042  0.030  0.954
MSRA    0.949   0.900  0.924    0.034  0.022  0.981
UPUC    0.869   0.790  0.828    0.088  0.011  0.951

Topline (maximal match, testing vocabulary):
Source  Recall  Prec   F-score  OOV    Roov   Riv
CITYU   0.982   0.985  0.984    0.040  0.993  0.981
CKIP    0.980   0.987  0.983    0.042  0.997  0.979
MSRA    0.991   0.993  0.992    0.034  0.999  0.991
UPUC    0.961   0.976  0.968    0.088  0.989  0.958
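The left-to-right maximal-match segmenter behind the baseline and topline greedily takes the longest dictionary match at each position, falling back to a single character. A minimal sketch, assuming the dictionary is simply the training vocabulary (baseline) or the test vocabulary (topline):

def max_match(sentence, vocab):
    """Greedy left-to-right maximal matching: at each position, emit the
    longest substring found in vocab, or a single character if none match."""
    max_len = max((len(w) for w in vocab), default=1)
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in vocab or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words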
Word Segmentation: CityU

CityU Closed:
Site  RunID  R      P      F      Roov   Riv
15    D      0.973  0.972  0.972  0.787  0.981
15    B      0.973  0.972  0.972  0.787  0.981
20           0.972  0.971  0.971  0.792  0.979
32           0.969  0.970  0.970  0.773  0.978

CityU Open:
Site  RunID  R      P      F      Roov   Riv
20           0.978  0.977  0.977  0.840  0.984
32           0.979  0.976  0.977  0.813  0.985
34           0.971  0.967  0.969  0.795  0.978
22           0.970  0.965  0.967  0.761  0.979
Word Segmentation: CKIP

CKIP Closed:
Site  RunID  R      P      F      Roov   Riv
20           0.961  0.955  0.958  0.702  0.972
15    A      0.961  0.953  0.957  0.658  0.974
15    B      0.961  0.952  0.956  0.656  0.974
32           0.958  0.948  0.953  0.646  0.972

CKIP Open:
Site  RunID  R      P      F      Roov   Riv
20           0.964  0.955  0.959  0.704  0.975
34           0.959  0.949  0.954  0.672  0.972
32           0.958  0.948  0.953  0.647  0.972
2     A      0.953  0.946  0.949  0.679  0.965
Word Segmentation: MSRA

MSRA Closed:
Site  RunID  R      P      F      Roov   Riv
32           0.964  0.961  0.963  0.612  0.976
26           0.961  0.953  0.957  0.499  0.977
9            0.959  0.955  0.957  0.494  0.975
1     A      0.955  0.956  0.956  0.650  0.966

MSRA Open:
Site  RunID  R      P      F      Roov   Riv
11    A      0.980  0.978  0.979  0.839  0.985
11    B      0.977  0.976  0.977  0.840  0.982
14           0.975  0.976  0.975  0.811  0.981
32           0.977  0.971  0.974  0.675  0.988
Word Segmentation: UPUC

UPUC Closed:
Site  RunID  R      P      F      Roov   Riv
20           0.940  0.926  0.933  0.707  0.963
32           0.936  0.923  0.930  0.683  0.961
1     A      0.940  0.914  0.927  0.634  0.969
26    A      0.936  0.917  0.926  0.617  0.966

UPUC Open:
Site  RunID  R      P      F      Roov   Riv
34           0.949  0.939  0.944  0.768  0.966
2            0.942  0.928  0.935  0.711  0.964
20           0.940  0.927  0.933  0.741  0.959
7            0.944  0.922  0.933  0.680  0.970
Word Segmentation: Overview
- F-scores ranged from 0.481 to 0.979 (word-level scoring is sketched below)
- Best score: MSRA Open track (FR Telecom)
- Best relative to topline: CityU Open, reaching >99% of the topline F-score
- Most frequent top rank: MSRA
- Both F-scores and OOV recall were higher in the Open track
- Overall good results: most systems outperformed the baseline
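For reference, the word-level recall, precision, F-score, and OOV/in-vocabulary recall splits reported in the tables above can be computed along the following lines. This is a sketch of standard word-level scoring, where a system word counts as correct if its character span matches a gold word's span; the official score script may differ in minor details.

def word_spans(words):
    """Map a word sequence to the set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def score(gold_sents, sys_sents, training_vocab):
    """Word-level R/P/F plus recall on OOV vs. in-vocabulary gold words."""
    correct = gold_total = sys_total = 0
    oov_hit = oov_total = iv_hit = iv_total = 0
    for gold, sys in zip(gold_sents, sys_sents):
        g_spans, s_spans = word_spans(gold), word_spans(sys)
        correct += len(g_spans & s_spans)
        gold_total += len(g_spans)
        sys_total += len(s_spans)
        pos = 0
        for w in gold:
            span = (pos, pos + len(w))
            pos += len(w)
            if w in training_vocab:
                iv_total += 1
                iv_hit += span in s_spans
            else:
                oov_total += 1
                oov_hit += span in s_spans
    r, p = correct / gold_total, correct / sys_total
    return {"R": r, "P": p, "F": 2 * p * r / (p + r),
            "Roov": oov_hit / oov_total if oov_total else 0.0,
            "Riv": iv_hit / iv_total if iv_total else 0.0}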
Word Segmentation: Discussion
- Continuing OOV challenges
- Highest F-scores on MSRA
  - Also the highest topline and baseline
  - Lowest OOV rate
- Lowest F-scores on UPUC
  - Also the lowest topline and baseline
  - Highest OOV rate (more than double that of any other corpus)
  - Smallest corpus (roughly 1/3 the size of MSRA)
- Best scores on the most consistent corpus: vocabulary and annotation
- UPUC also varies in genre: training is CTB; test is CTB, newswire (NW), and broadcast news (BN)
NER Results
- Contrast: the baseline labels a string as a named entity only if it has a unique tag in the training data (one possible implementation is sketched after the table)

Baseline:
Source  P      R      F      PER-F  ORG-F  LOC-F  GPE-F
CITYU   0.611  0.467  0.529  0.587  0.516  0.503  N/A
LDC     0.493  0.378  0.428  0.395  0.290  0.259  0.539
MSRA    0.590  0.488  0.534  0.614  0.469  0.531  N/A
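One way to realize the "unique tag in training" baseline described above: collect every entity string from the training data, keep only those that always received a single tag, and label test spans by greedy longest-match lookup. The greedy matching strategy and the ten-character cap are assumptions made for illustration; the official baseline may have been implemented differently.

from collections import defaultdict


def unique_tag_entities(train_entities):
    """train_entities: (string, tag) pairs extracted from training data.
    Keep only strings that always received one and the same tag."""
    tags = defaultdict(set)
    for text, tag in train_entities:
        tags[text].add(tag)
    return {text: next(iter(ts)) for text, ts in tags.items() if len(ts) == 1}


def baseline_tag(sentence, entity_dict, max_len=10):
    """Label the characters of a test sentence: a span gets an entity tag
    only if the exact string had a unique tag in training; otherwise 'O'."""
    labels, i = ["O"] * len(sentence), 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            tag = entity_dict.get(sentence[i:j])
            if tag is not None:
                labels[i:j] = [tag] * (j - i)
                i = j
                break
        else:
            i += 1
    return labels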
NER Results: CityU

CityU Closed:
Site  P      R      F      ORG-F  LOC-F  PER-F
3     0.914  0.867  0.890  0.805  0.921  0.909
19    0.920  0.854  0.886  0.805  0.925  0.887
21a   0.927  0.847  0.885  0.797  0.920  0.890
21b   0.924  0.849  0.885  0.798  0.924  0.892

CityU Open:
Site  P      R      F      ORG-F  LOC-F  PER-F
6     0.869  0.749  0.805  0.680  0.860  0.810
NER Results: LDC

LDC Closed:
Site       P       R      F      ORG-F  LOC-F  PER-F
7          0.7616  0.662  0.708  0.521  0.286  0.742
6-gpe-loc  0.672   0.655  0.664  0.455  0.708  0.742
6          0.306   0.298  0.302  0.455  0.037  0.742

LDC Open:
Site  P      R      F      ORG-F  LOC-F  PER-F
3     0.803  0.726  0.763  0.658  0.305  0.788
8     0.814  0.594  0.688  0.585  0.170  0.657
NER Results: MSRA

MSRA Closed:
Site  P      R      F      ORG-F  LOC-F  PER-F
14    0.889  0.842  0.865  0.831  0.854  0.901
21a   0.912  0.817  0.862  0.820  0.905  0.826
21b   0.884  0.829  0.856  0.770  0.901  0.849
3     0.881  0.823  0.851  0.815  0.906  0.794

MSRA Open:
Site  P      R      F      ORG-F  LOC-F  PER-F
10    0.922  0.902  0.912  0.859  0.903  0.960
14    0.908  0.892  0.899  0.840  0.910  0.926
11b   0.877  0.875  0.876  0.761  0.897  0.922
11a   0.864  0.840  0.852  0.694  0.874  0.920
NER: Overview
- Overall results:
  - Best F-score: MSRA Open track, 0.91
  - Strong overall performance: only two results fell below the baseline
- Direct comparison of Open vs. Closed NER is difficult: only two sites participated in both tracks
  - Only MSRA had a large number of runs
  - There, Open outperformed Closed: the top three Open runs beat the best Closed run
NER Observations
- Named Entity Recognition challenges: tagsets, variation, and corpus size
- Results on MSRA and CityU were much better than on LDC
  - The LDC corpus is substantially smaller
  - It also has a larger tagset, including GPE, which is easily confused with ORG or LOC
- NER results are sensitive to corpus size, tagset, and genre
Conclusions & Future Challenges
- Strong, diverse participation in WS & NER; many effective, competitive results
- Cross-task and cross-evaluation comparisons remain difficult
  - Scores are sensitive to corpus size, annotation consistency, tagset, genre, etc.
  - Need a corpus- and configuration-independent measure of progress
  - Encourage submissions that support comparisons
- Extrinsic, task-oriented evaluation of WS/NER
- Continuing challenges: OOV words, annotation consistency, encoding combinations and variation, code-switching
Thanks
- Data providers:
  - Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: Keh-Jiann Chen, Henning Chiu
  - City University of Hong Kong: Benjamin K. Tsou, Olivia Oi Yee Kwong
  - Linguistic Data Consortium: Stephanie Strassel
  - Microsoft Research Asia: Mu Li
  - University of Pennsylvania / University of Colorado: Martha Palmer, Nianwen Xue
- Workshop co-chairs: Hwee Tou Ng and Olivia Oi Yee Kwong
- All participants!