Page 1: Comparable Corpora  BootCat (CCBC)

Comparable Corpora BootCaT (CCBC)

Adam Kilgarriff, Avinesh PVS
Lexical Computing Ltd


BootCaT

• Bootstrapping Corpora and Terms
• Translators
  – Know the language
  – Not domain experts
  – Can interpret domain terms but can’t guess them

• Instant domain corpus from the web
• Marco Baroni and Silvia Bernardini (2004)


BootCaT method

• Piggyback on a search engine
  – Google, Yahoo, Bing

• Set of seed terms
• Repeat
  – Take 3 random seeds
  – Send to search engine
  – Gather ‘search hits’ pages

• Remove duplicates, find terms
  – Can iterate
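The loop above can be sketched roughly as follows. This is a minimal illustration, not the BootCaT implementation: `search` and `fetch` are hypothetical callables that the caller supplies (a search-engine client returning URLs, and a page downloader returning text), duplicate removal is reduced to exact matching after whitespace and case normalisation, and the term-finding step is left out.

```python
import random


def make_queries(seeds, n_queries=10, tuple_size=3, rng=None):
    """Draw random seed tuples; each tuple becomes one search query."""
    rng = rng or random.Random()
    queries = set()
    # Cap the attempts so a short seed list cannot loop forever.
    for _ in range(n_queries * 20):
        if len(queries) >= n_queries:
            break
        queries.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return [" ".join(q) for q in sorted(queries)]


def bootcat_pass(seeds, search, fetch, n_queries=10, rng=None):
    """One pass: send queries, gather hit pages, drop exact duplicates.

    `search(query)` -> iterable of URLs and `fetch(url)` -> page text
    are assumptions of this sketch, injected by the caller.
    """
    seen_urls, seen_texts, corpus = set(), set(), []
    for query in make_queries(seeds, n_queries, rng=rng):
        for url in search(query):
            if url in seen_urls:
                continue
            seen_urls.add(url)
            text = fetch(url)
            # Normalise whitespace and case before comparing pages.
            key = " ".join(text.split()).lower()
            if key and key not in seen_texts:
                seen_texts.add(key)
                corpus.append(text)
    return corpus
```

To iterate, terms found in the returned corpus would be fed back in as the seed list for another pass.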


WebBootCaT

• Web interface
• Improved cleaning, duplicate removal
• Integrated with corpus tool (Sketch Engine)


Going multilingual

• Google-translate
  – English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
  – French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques

• And do the same thing for French
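One way to picture the multilingual step is sketched below. Everything here is an assumption for illustration: `translate` stands in for a machine-translation call and `build_corpus` for a single-language corpus builder, both supplied by the caller; neither is an actual BootCaT or Sketch Engine API.

```python
def build_comparable_corpora(seeds, translate, build_corpus,
                             source="en", targets=("fr",)):
    """Translate the seed list, then build one corpus per language.

    `translate(term, source, target)` -> str and `build_corpus(terms)`
    are hypothetical callables injected by the caller.
    """
    seed_sets = {source: list(seeds)}
    for lang in targets:
        # Translate term by term so quoted multiword seeds stay intact.
        seed_sets[lang] = [translate(term, source, lang) for term in seeds]
    # Run the same corpus-building procedure on every seed list.
    return {lang: build_corpus(terms) for lang, terms in seed_sets.items()}
```

Because both corpora grow from the same seed concepts, they cover the same domain without being translations of each other.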


• By July 2011
  – All steps integrated
  – Propose bilingual terminology