A Web Corpus for eCare: Collection, Lay Annotation and Learning

Marina Santini (marina.santini@ri.se)

RISE SICS East, Sweden

3 September 2017, LTA'17 - FedCSIS 2017 - Prague



RISE & SICS

• RI.SE = Research Institutes of Sweden

• SICS = Swedish Institute of Computer Science

• SICS East Linköping

• Group: Language Technology and Intelligent Interaction Design

• Group Leader: Professor Arne Jönsson

Citation: Santini M., Jönsson A., Nyström M. and Alirezaie M. (2017) A Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.


Outline

• The eCare@home project

• The Language Technology contribution

• The eCare Swedish corpus

• Lay-Specialized Annotation

• The experiments: lay-specialized text classification

• Conclusions


eCare@home

• Website: http://ecareathome.se/

• Interdisciplinary project

• Funded by the Swedish Knowledge Foundation

1. Measure attributes about people and the environment.

2. Infer beyond that which we cannot measure.

3. Automatically configure devices and "things" to achieve a task.

4. Represent information in a human consumable way.


Language Technology and eCare@home


Generally speaking:

• The Internet of Things: sensor data = numbers

• Language Technology to present information in a “human consumable way”


How to Build a Medical Corpus for eCare?

Kardiologi (cardiology) vs hjärtspecialist (heart specialist): specialized vs lay


Starting point: a Web Corpus for eCare

• Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as:

  • the automatic extraction of lay synonyms (or paraphrases) of medical terms,

  • text simplification,

  • etc.


Web Corpus

• Which textual sources?


• Using medical journals?

• Crawling user-generated texts?

• Relying on a specific web genre like blogs?

• Downloading medical websites?

• Web corpus!


Designing a corpus for eCare


1. having a publicly-available medical corpus annotated with lay-specialized labels that can be easily shared;

2. having a corpus with a design and a structure that allow for expansion with additional documents over time;

3. accounting for very specific medical terms.


Potential Issues

1. expanding the corpus over time may cause scalability issues

2. the web is noisy, so the corpus will be noisy: noise disturbs LT applications

We made two assumptions:

1. Scalability: increasing the size of the corpus does not necessarily affect the performance of LT applications negatively

2. Noise: noisy texts do not necessarily affect the performance of LT applications negatively


In the next slides...


1. Lay vs Specialized sublanguage

2. The construction and annotation of the eCare Swedish web corpus

3. Experiments 1 and 2:

  • Experiment 1: robustness to scalability issues (increasing the size of the corpus does not necessarily affect the performance of LT applications negatively)

  • Experiment 2: resilience to noise (noisy texts do not necessarily affect the performance of LT applications negatively)


Lay vs Specialized Sublanguage

• Definition of sublanguage:

  • A language variety used by a specific user group in certain communicative/situational contexts or domains

  • Ex: the jargon used by the police, the military, politicians, etc.

• Medical domain: not only jargon but also medical terminology!

• Two user groups in contact:

  • patients: lay sublanguage (ex: heart specialist)

  • professional staff: specialized sublanguage (ex: cardiologist)

The Guardian, 2014


eCare Web Corpus: Construction

• Seed terms from SNOMED CT

  • Chronic diseases

• The corpus was bootstrapped with 228 seed terms, and 801 documents were downloaded with BootCaT (Baroni & Bernardini, 2004)

            Initial seeds   Retrieved seeds   Downloaded documents   Number of words
Unigrams    13              13                112                    91 118
Bigrams     215             142               689                    618 491
Total       228             155               801                    709 609

Example of long medical term (source: SNOMED CT)
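BootCaT-style bootstrapping typically turns seed terms into search queries by combining them into random tuples. The Python sketch below only illustrates that idea; it is not BootCaT's code, and the slides do not say whether the eCare seeds were queried singly or in tuples. The function name `make_query_tuples`, the tuple length and the example seed terms are all hypothetical.

```python
import itertools
import random


def make_query_tuples(seeds, tuple_len=2, n_tuples=5, rng_seed=0):
    """Return up to n_tuples distinct random combinations of seed terms.

    Hypothetical helper: it mimics the idea of combining seeds into
    search queries and is not part of BootCaT itself.
    """
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_len))
    rng = random.Random(rng_seed)  # fixed seed keeps the sample reproducible
    return rng.sample(combos, min(n_tuples, len(combos)))


# Illustrative Swedish seed terms, not taken from the actual seed list.
seeds = ["diabetes", "astma", "kronisk sjukdom", "hjärtsvikt"]
tuples = make_query_tuples(seeds, tuple_len=2, n_tuples=3)

# Quote each term so that multi-word seeds stay together in a query string.
queries = [" ".join(f'"{term}"' for term in combo) for combo in tuples]
```

Each query string would then be sent to a search engine and the returned pages downloaded; that step is left out because it depends on the search API in use.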


Small Data (vs Big Data)

• Small data: data that is small enough for human comprehension.

• Use: for many problems and questions, small data in itself is enough.

• Small data (Wikipedia) = a new buzzword


eCare Web Corpus: Lay Annotation

• Annotation by a native speaker who has little knowledge of medical terminology.

• The lay-specialized text classification experiments described later on are based on this lay annotation.


Inter-rater agreement: Lay vs Expert (sample)

• Inter-rater agreement

To be taken with a grain of salt, but normally:

• Poor agreement = 0.20 or less

• Fair agreement = 0.20 to 0.40

• Moderate agreement = 0.40 to 0.60

• Good agreement = 0.60 to 0.80

• Very good agreement = 0.80 to 1.00

User group bias: annotation may be biased by the annotator’s domain expertise.
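The bands above are the ones commonly quoted for a kappa statistic. As a minimal sketch, assuming the agreement measure is Cohen's kappa (the slide does not name it), agreement between a lay and an expert annotator could be computed as follows. The two label lists are invented for illustration and are not the project's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Map a kappa value to the rough bands listed on the slide."""
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"), (0.80, "good")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "very good"


# Invented labels for eight documents (illustration only).
lay = ["lay", "lay", "specialized", "lay",
       "specialized", "specialized", "lay", "lay"]
expert = ["lay", "specialized", "specialized", "lay",
          "specialized", "lay", "lay", "lay"]
kappa = cohen_kappa(lay, expert)
band = interpret(kappa)
```

Since the slide's bands share their boundary values, the sketch assigns a boundary value to the lower band, following "0.20 or less".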


Noise

• Apparently the web is full of automatically translated documents! Do we need to sort them out? It depends!

• Out of 801 web documents, 339 received comments from the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans".

• Is it important to remove this noise for some LT tasks? Maybe not always.

• Computationally, the presence of noise might be irrelevant, so it might not be worth investing resources in cleaning a corpus.


Lay/Specialized Text Classification


Based on the annotation made by the lay annotator

• Experiment 1: focus on scalability

• Experiment 2: focus on the impact of noise on performance


Features


No text pre-processing

The texts were defined as “string”

Filter converting strings to word vectors
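The string-to-word-vector step can be pictured as a plain bag-of-words conversion. The Python sketch below shows the idea only, assuming whitespace tokenisation and raw counts; the filter actually used in the experiments has its own tokeniser and options, which are not reproduced here. The function name and the two example texts are hypothetical.

```python
from collections import Counter


def to_word_vectors(texts):
    """Turn raw strings into word-count vectors over a shared vocabulary.

    Assumption: tokens are split on whitespace only, with no lowercasing,
    stemming or stop-word removal (in the spirit of "no pre-processing").
    """
    tokenised = [text.split() for text in texts]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        # One dimension per vocabulary word; absent words count as 0.
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors


# Two invented Swedish snippets (illustration only).
docs = ["kardiologi och hjärtsvikt", "ont i hjärtat och trött"]
vocabulary, vectors = to_word_vectors(docs)
```

Every document becomes a vector of the same length, which is the form a classifier such as an SVM expects as input.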


Experiment 1: Scalability


Experiment 2: Resilience to Noise


• Noisy vs Cleaned


New Experiment: Lay vs Specialized Annotation

• SVM (Weka implementation: SMO)

       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.49   78.6   0.78     0.78     0.78     0.74     0.78      0.29

801 documents labelled by the lay annotator

       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.54   79.5   0.79     0.79     0.79     0.77     0.79      0.25

778 documents labelled by the expert annotator
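To make the column headings concrete, the sketch below derives accuracy and class-weighted average precision, recall and F-measure from a two-class confusion matrix. It assumes the "Avg." columns are averages weighted by class size, as in Weka's summary output. The counts are invented and are not the confusion matrices behind the tables above.

```python
def weighted_scores(confusion):
    """Accuracy plus class-weighted precision, recall and F-measure.

    confusion[actual][predicted] holds document counts.
    """
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in classes)
    avg_p = avg_r = avg_f = 0.0
    for c in classes:
        actual = sum(confusion[c].values())
        predicted = sum(confusion[a][c] for a in classes)
        precision = confusion[c][c] / predicted if predicted else 0.0
        recall = confusion[c][c] / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        weight = actual / total  # each class weighted by its share of documents
        avg_p += weight * precision
        avg_r += weight * recall
        avg_f += weight * f_measure
    return {"accuracy": correct / total, "precision": avg_p,
            "recall": avg_r, "f": avg_f}


# Hypothetical counts (illustration only).
confusion = {
    "lay": {"lay": 40, "specialized": 10},
    "specialized": {"lay": 5, "specialized": 45},
}
scores = weighted_scores(confusion)
```

Class-weighted recall always equals accuracy, which is consistent with Acc. 78.6 alongside Avg. R 0.78 (and 79.5 alongside 0.79) in the tables.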


Conclusions: Findings

• We presented the construction of the eCare corpus

• We made two claims and we supported our claims with two experiments:

1. We can design an extensible/dynamic corpus without fearing scalability issues

2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance


Replicability


• We encourage the replication of the results presented in this paper, and welcome improvements and further discussion.

• Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.


Thanks for your attention!

Any questions?