Transcript

HLT Industry in the Netherlands

Piek Vossen

Faculteit der Letteren, Vrije Universiteit Amsterdam

Irion Technologies, Delft

Workshop HLT Collaboration SA & Low Countries

24-26 November 2008, Cape Town

Overview of HLT-NL

50 companies investigated in the Netherlands (sources: NTU (64), Notas, PE)

http://taalunieversum.org/taal/technologie/ontwikkelaars.php

Search, no NLP (3): WiseGuys, IntelliGent, Ilse

NLT Speech (18): Philips, Dialogs Unlimited, DutchEar, Telecats, G2 Speech, Logica, Voice Data Bridge, Comsys, ... more

NLT Text (23): Collexis, Gridline, KnowledgeConcepts, Polderland, Q-go, TextKernel, Carp, Irion, ... many more

NLT Consult. (2): ViaTaal, inTaal

Semantic Web (2): LibRT, AskNow

Manual Analysis (3): Trendlight, Bureau Taal, Kieskompas

NLT Text (33)

• Thesaurus based text processing: Collexis, GridLine, Knowledge Concepts
• Text mining: Textkernel, Irion
• Spelling: Polderland, *TALO
• Search: Irion, WiseGuys, Ilse, Intelligent
• Classification: Irion, Collexis, Textkernel
• Summarization: Carp Technologies
• User profiling, data mining: AskNow, Sentient Machine Research

NLT Text (33)

• Dialogue/Q&A: AskNow, Elitech, Q-go, Irion
• Lexicons: Van Dale
• Translation tools: Lingvistica, Topterm, Linguistic Systems
• Document & knowledge management: Getronics, CIBIT, AI Engineering, ZyLAB Europe, Niceware, Sopheon, Human Inference, LibRT
• Manual text complexity: Bureau Taal
• Manual language analysis, trends and politics: Kieskompas, Trendlight
• Medical language tools: Lexima, ViaTaal, inTaal
• Semantic web: LibRT

NLT Speech (18)

• !Effective: ASR; telephone applications, stock market
• Comsys: ASR, TTS; telephone applications, call centres
• Dedicon: TTS; spoken documents for disabled
• Dialogues Unlimited: ASR; telephone applications
• DutchEar: ASR, TTS; telephone applications, self services, colleague connect, stock market, traffic support, helpdesk support, speaker identification
• Fluency: TTS; text to synthetic speech
• FORUS-P: ASR; database management
• G2 Speech: ASR; dictating, work flow management, medical domain, legal domain
• Group 2000: ASR; telephone applications

NLT Speech (18)

• Kompagne: ASR, TTS; medical domain
• Logica: ASR, TTS; contact & call centers
• ORCAvoice: ASR, TTS; telephone applications
• Philips: ASR, TTS; dictating, medical domain, legal domain
• Sound Intelligence: Sound; hearing aid, medical domain
• Telecats: ASR, TTS; telephone applications, information retrieval, messaging, routing, call handling and large platforms
• VoCognition: ASR, TTS; logistics of storage centres
• Voice Data Bridge: ASR, TTS; telephone applications, information retrieval, telecom operators
• YPCA: ASR; interactive services, automobile concepts

NLT-Text

Collexis:
• http://www.collexis.com/
• Technology:
– Fingerprints of documents using the knowledge residing in a thesaurus or multiple thesauri
– Fingerprints from existing results used to generate new results with higher precision
– Discovering the relationships between the elements of different content sources and uncovering unique information
• Application: Search, Knowledge management, Text mining
• Market: Government, Legal, Health science
• Projects & software
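One way to read "fingerprint" here is as a vector of thesaurus concepts weighted by how often a document mentions them. The sketch below illustrates that reading only and is not Collexis's actual method. The toy `THESAURUS`, the concept labels, and the helper names `fingerprint` and `similarity` are all assumptions made for the example.

```python
from collections import Counter
import math
import re

# Hypothetical thesaurus: surface term -> concept label (synonyms share a concept).
THESAURUS = {
    "heart attack": "MYOCARDIAL_INFARCTION",
    "myocardial infarction": "MYOCARDIAL_INFARCTION",
    "aspirin": "ASPIRIN",
    "acetylsalicylic acid": "ASPIRIN",
    "stroke": "STROKE",
}

def fingerprint(text, thesaurus=THESAURUS):
    """Count how often each thesaurus concept is mentioned in `text`."""
    lowered = text.lower()
    counts = Counter()
    for term, concept in thesaurus.items():
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if hits:
            counts[concept] += hits
    return counts

def similarity(fp_a, fp_b):
    """Cosine similarity between two concept fingerprints (0.0 if either is empty)."""
    dot = sum(fp_a[c] * fp_b[c] for c in fp_a.keys() & fp_b.keys())
    norm_a = math.sqrt(sum(v * v for v in fp_a.values()))
    norm_b = math.sqrt(sum(v * v for v in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because synonyms map to the same concept, two documents that share no words can still have near-identical fingerprints, which is the point of working at the thesaurus level rather than the word level.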

NLT-Text

Gridline:
• http://www.gridline.nl/
• Technology: Semi-automatic development of thesauri and ontologies
• Application: Search, Authoring
• Market: Government, Law firms
• Projects & software

NLT-Text

KnowledgeConcepts:
• http://www.knowledge-concepts.com/
• Technology:
– Relation detection in text through use of semantic networks, thesauri and taxonomies
– Part-Of-Speech taggers, lemmatisers, entity extractors, stopword lists, and language identifiers
• Application:
– multilingual search
– classification and analytical products
• Market:
– Government, Banks, PTT, Publishers

• Projects & software

NLT-Text

Polderland:
• http://www.polderland.biz/
• Technology:
– spelling suggestion/correction
– fuzzy matching
– semantic expansion
• Application:
– search
– content- and document management software
– authoring
– automatic classification and meta data extraction
• Market:
– CRM systems
– contact-center software
– publishing systems
– SharePoint and portal-server systems

• Projects & software
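The spelling-suggestion and fuzzy-matching technologies listed for Polderland can be illustrated with a minimal sketch. It is not Polderland's implementation. It relies on `difflib.get_close_matches` from the Python standard library, and the tiny `LEXICON` and the `suggest` helper are assumptions made for the example.

```python
import difflib

# Hypothetical mini-lexicon; a real speller would use a full word list.
LEXICON = ["thesaurus", "taxonomy", "ontology", "classification", "summarization"]

def suggest(word, lexicon=LEXICON, max_suggestions=3, cutoff=0.7):
    """Return lexicon entries that closely resemble `word`, best match first."""
    word = word.lower()
    if word in lexicon:
        return [word]  # correctly spelled: nothing to correct
    # get_close_matches ranks candidates by a character-sequence similarity ratio
    return difflib.get_close_matches(word, lexicon, n=max_suggestions, cutoff=cutoff)
```

Calling `suggest("thesarus")` ranks "thesaurus" first. The `cutoff` parameter trades recall for precision: lowering it returns more distant candidates.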

NLT-Text

Q-GO:
• http://www.q-go.nl/
• Technology:
– question analysis and normalization
– search
– dialogue, Q&A
• Application:
– Search through dialogue/Q&A
– Online customer support
• Market:
– Banks, Insurance companies

• Projects & software

NLT-Text

Elitech:
• http://www.elitech.nl/
• Technology:
– Question analysis, user profile
– question answer matching, answer database
• Application: multimodal Q&A, self-service
• Market: Railways, cities, Banks, Energy, Insurance, Travel agents, Telecom, Government
• Projects & software

NLT-Text

TextKernel:
• http://www.textkernel.com/
• Technology:
– memory based learning (analogical or similarity-based reasoning)
– text classification
– string extraction (names, numbers, formulations, zip codes)
– Hidden Markov Models, Decision Trees, Naive Bayes, SVMs, or Stochastic Grammars
• Application:
– Text classification, Information extraction
• Market:
– Recruiting, Tangram, WiseGuys
– Cooperates with system integrators (e.g. Capgemini, WCC Search & Match, Connexys)
• Projects & software
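Memory based learning, as named on this slide, classifies a new item by analogy with stored examples instead of building an abstract model. A minimal sketch of that idea is a k-nearest-neighbour classifier with a simple overlap distance. This is a generic illustration, not TextKernel's software. The feature layout, the labels, and the `MEMORY` examples are hypothetical.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(instance, memory, k=3):
    """Label `instance` by majority vote among its k nearest stored examples."""
    nearest = sorted(memory, key=lambda ex: overlap_distance(instance, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical memory: (previous token, token shape, next token) -> label
MEMORY = [
    (("mr", "Capitalized", "said"), "NAME"),
    (("dr", "Capitalized", "wrote"), "NAME"),
    (("mr", "Capitalized", "wrote"), "NAME"),
    (("code", "Digits", "in"), "ZIP"),
    (("code", "Digits", "near"), "ZIP"),
    (("the", "lowercase", "of"), "OTHER"),
]
```

An unseen context such as `("dr", "Capitalized", "said")` is labelled by the stored examples it most resembles, so adding training data only means appending to `MEMORY`.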

NLT-Text

Carp:
• http://www.carp-technologies.nl/nld/Home/
• Technology:
– Parsing
– Semantic network
• Application:
– summarizers
– search
– anonymizer
– text analysis
• Market:
– local governments (province cities)
– Department of Justice

• Projects & software

NLT-Text

Irion:
• http://www.irion.nl/
• Technology:
– statistical and phrase retrieval
– text classification
– language identification, taggers, grammars, WSD
– information extraction
– dialogue modelling
– multilingual semantic networks, thesauri
• Application:
– Text classification, text mining, cross-lingual retrieval, dialogue systems, language analysis for text complexity
• Market:
– Governments (local and national)
– Libraries
– Publishers

• Projects & software


Technologies

Speech-to-Text 15

Text-to-speech 8

Statistics 7

Relation detection through thesauri/ontologies/lexicons 5

Automatic Thesaurus/Ontology/Lexicon Learning 4

Dialogues 4

Manual text analysis 4

Tagging 4

Parsing 4

Text classification 3

Q&A 3

Stochastic NLP 2

Multilingual processing 2

User profiling 1

Spelling 1

Memory Based Learning 1

Ontologies & reasoning 1

Applications

Search 13

Knowledge management 5

Text mining 4

Meta data enrichment 4

Dialogue & Q&A, both text and speech 4

Authoring 3

User adaptation & analysis 2

Training & therapy 2

Political Position Mining 2

Summarization 1

Business rules 1

Text complexity 1

Trend Analysis 1

Market

Government (local, national) 9

Legal 5

Finance 5

Publishers 4

Police 3

Insurance 3

Aid for disabled 2

Telecom 2

Health science 1

Transport 1

Energy 1

Recruitment 1

System integrators 1

[Figure: companies positioned by technology level (Reason, Semantics, Syntax, Tagging, Lemmatize, Statistics; scale labelled Simple to Complex), by approach (Automatic, Manual, Co-training) and by application (Index, Search, Classification, Mining, Dialogue/Q&A, Analysis, Decide). Companies shown: Bureau Taal, TrendLight, KiesKompas, LibRT, Irion, Carp, Collexis, Q-go, KnowledgeConcepts, AskNow, Gridline, Polderland, TextKernel, WiseGuys, Autonomy, Fast, Endeca, Google, Microsoft, ViaTaal, inTaal.]

Discussion

• Is the technology mature enough?
• Long way from technology to software products > software development
• More money from investors required
• Small company syndrome: sales & marketing
• Need for commercial software developers (salaries)
• Need for NLP developers

Some cases of failure

• Government departments & university/education libraries
• VWS bought Verity
– cheap license but still expensive (100K and 30K maintenance per year)
– does not work for: diacritics, morphology, compounds, upper/lower case
– expensive IT consultancy to investigate a solution: alternative search was not an option (no money & people for another integration)
– classify text, index thesaurus labels, match queries to labels
• Many RFIs involving Autonomy, Verity, Fast and Irion
– best system
– too small to be trustworthy
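Two of the failure points listed for the Verity case, diacritics and upper/lower case, are commonly handled by normalizing both the query and the indexed text before matching. The sketch below shows that general technique with Python's standard `unicodedata` module. It is an illustration only, not the solution chosen in the case described, and the helper names `normalize` and `matches` are assumptions. Morphology and compounds need language-specific analysis and are not covered.

```python
import unicodedata

def normalize(text):
    """Fold case and strip diacritics, e.g. 'Coördinatie' -> 'coordinatie'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # After decomposition, accents are separate combining characters; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(query, document):
    """True if every normalized query term occurs among the normalized document terms."""
    doc_terms = set(normalize(document).split())
    return all(term in doc_terms for term in normalize(query).split())
```

With this normalization, a query typed as "ZORGVERZEKERING coordinatie" matches a document containing "Coördinatie van de zorgverzekering".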