
Contact details

Appen Pty Ltd

Level 6

9 Help Street

Chatswood, Sydney

NSW 2067 Australia

Enquiries:

Sydney office: +61-2-9468-6335

US sales enquiries: +1-315-335-4020

Europe: +31-622-799-535

Japan & Korea: +1-202-765-7106

China: +61-2-9468-6310

[email protected]

www.appen.com

Language Resources Catalog

Table of Contents

A global leader in linguistic technology solutions

Speech Databases - Summary

Speech Databases - Detailed

Lexica

Other Language Resources

Appen brings the forefront of speech and language technology to you. We deliver the highest quality in linguistic solutions to government agencies and the world’s largest corporations, with proven expertise in over 150 languages.

We understand the complex linguistic needs of today’s leading organizations. Our unparalleled range of resources and solutions gives you the edge in a wide array of applications, including:

• speech recognition

• text-to-speech synthesis

• speech analytics

• machine translation

• natural language processing

Appen’s reputation as a global leader guarantees you:

• flexibility and rapid response capability

• global coverage in over 150 languages

• highly qualified specialist personnel

• large, closely vetted crowds of in-country native speakers

• tight project management

• keen innovation and creativity

• strict client confidentiality

Appen remains fully independent of any systems provider, although we do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, in addition to supporting similar projects funded by DARPA and other US agencies.

Whatever speech and language data you need for your application, Appen will collect it for you.

Our end-to-end data collection service delivers efficiency and quality, even on multiple large-scale collections in parallel.

Available collection types include:

• telephony – fixed-line, mobile, in-car

• embedded device – in-car, desktop, smartphone, tablet

• single/multi-speaker – speakers selected by demographic or other requirements

• prompt variation – scripted, spontaneous, conversational (dialogue), meeting data

• modality – speech, text, handwriting, gesture, image and other acoustic data

• text corpora and other resources – email, SMS, named entity tags, POS tags

As part of a standard collection, we offer you the following:

• detailed linguistic and cultural research

• script preparation and localization

• crowdsourcing of native speakers

• local and remote speech recording

• transcription and annotation of collected data

• quality assurance and project management

• lexicon entries matching database contents

• packaging of database in a coherent format

A global leader in linguistic technology solutions

Data Collection

A global leader in linguistic technology solutions

Appen provides high quality speech and language technology products and services to technology developers and government organizations, and is recognized as a global leader in the quality and coverage of its products and services.

Our products and services cover a wide range of applications in speech recognition, text-to-speech synthesis, phonetic search, machine translation and text processing including Natural Language Processing (NLP).

Appen’s client base includes both government agencies and the world’s largest and most respected IT organizations. Our objective is to enhance our clients’ capabilities in the fields of speech and language technology by offering:

• fast-track production

• tight project management, working to strict timing, quality and productivity criteria

• specialist personnel including highly qualified linguists and computational linguists to support our customers’ internal resources, particularly in response to surge requirements

• flexibility and rapid response capability which may be difficult for larger organizations to achieve

• global coverage which may be difficult for smaller organizations to achieve

• large crowds of in-country native speakers that have passed our screening processes

• high levels of innovation

• strict client confidentiality

Appen remains fully independent of any systems provider, although we can and do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, such as SpeeCon (multiple projects), SALA II (multiple projects), LILA (multiple projects), OrienTel and LC-Star. Appen has also supported several DARPA and other US-funded consortium projects.

Appen Catalogue – Speech and Language Resources

Appen has a large number of licensable speech and language resources currently available and in development. Most of the 150+ languages that Appen has worked in are included in off-the-shelf offerings. Up-to-date catalogue information is available at appen.com.

Licensable materials cover:

• Fully transcribed speech databases for broadcast, embedded, in-car and telephony applications

• Pronunciation lexicons to provide both general and domain-specific coverage for a given language (specific categories include names, places, natural numbers)

• Part-of-Speech tagged lexicons and thesauri to support a wide range of Speech and Language Technology development activities

• Corpora annotated for Part-of-Speech, Morphological Information, Named Entities

• Parallel Corpora for use in the development of Machine Translation

Appen’s licensable Speech and Language resources offer wide coverage of less commonly taught languages, including languages and dialects of West and North Asia, the Middle East and Africa.

In many cases, licensable resources can be developed on request to meet a particular client’s requirements.


We use AppenScribe, our proprietary web-based transcription interface, to deliver high-volume, high-quality data transcription and annotation to you.

Whether you are working with speech, text, video or handwriting, AppenScribe supports a large number of languages in native orthography.

Our transcription and annotation services include:

• orthographic transcription

• acoustic event transcription

• phonetic and phonemic transcription

• semantic annotation and Named Entity tagging

• annotation of handwriting and other language data

• TTS evaluation through the provision of MOS scores

• time alignment of transcription and acoustic signal

While we have experience in processing millions of US English utterances in a matter of weeks, we are equally practiced in languages like Somali which lack a standardized written form.

We ensure the highest quality of work through:

• screening and training of in-country transcribers

• automated spelling checks

• rigorous post-processing by senior team members

If you need immediate access to a complete speech and language database, Appen has a long list of licensable resources available. See www.appen.com for our latest catalogue.

Our high-quality licensable materials cover:

• fully transcribed speech databases for broadcast, call center, in-car and telephony applications

• pronunciation lexicons, both general and domain-specific (e.g. names, places, natural numbers)

• POS-tagged lexicons and thesauri

• corpora annotated for POS, morphological information and named entities

• parallel corpora for use in the development of machine translation

Appen’s databases also cover less resourced languages, including dialects of West and North Asia, the Middle East and Africa.

Transcription and Annotation

We offer you the collective expertise of our premier network of freelance consultants around the globe, currently covering over 60 languages.

Appen’s team of over 1,000 highly qualified consultants includes:

• linguists, phoneticians and lexicographers

• language specialists with backgrounds in translation, localization, terminology, education and library sciences

• data annotators with experience in Internet research and search evaluation

Among the key benefits we offer to you:

• specialized resources for custom linguistic consulting

• resource pools of language specialists in over 60 countries

• large-scale recruiting and training for rapid market expansion

• on-demand staffing to respond to urgent project changes

Contact us directly for additional information and project-specific enquiries.

Appen’s highly trained evaluation teams maximize the relevance of your search engine in over twenty local markets around the world.

Our in-country search experts each review hundreds of queries daily, ranking results for relevance to user input.

Our teams are familiar with search trends, popular and obscure topics, and the linguistic nuances of your search engine’s target users.

In addition to general-purpose search, we also specialize in vertical categories, including:

• local

• news

• medical

• travel

• finance

• shopping

• social

We also provide you with valuable testing of search features, such as:

• spam filtering

• related query suggestion

• duplicate removal

• business listing verification

• caption generation

Search Relevance Evaluation

Human Resourcing and Crowdsourcing

• Afrikaans

• Arabic (15+ varieties)

• Assamese

• Bahasa Indonesia

• Bahasa Malaysia

• Bakhtiari (Iran)

• Basque

• Bengali

• Bulgarian

• Cantonese (China PRC, China Hong Kong)

• Catalan

• Croatian

• Czech

• Danish

• Dari

• Dutch (Netherlands, Belgium)

• English (10+ varieties)

• Estonian

• Farsi

• Finnish

• French (5 varieties)

• Gallego (Galician)

• German (Austrian, German, Luxembourg, Swiss)

• Greek

• Gujarati

• Haitian Creole

• Hausa

• Hebrew

• Hindi

• Hungarian

• Italian

• Japanese

• Kannada

• Kermanji (Iran)

• Korean (North, South)

• Kurdish (Sorani)

• Laki (Iran)

• Latvian

• Lithuanian

• Luri (Iran)

• Malayalam

• Malagasy

• Mandarin (China, Taiwan)

• Marathi

• Mazanderani (Iran)

• Min

• Norwegian (Nynorsk, Bokmal)

• Oriya

• Pashto

• Polish

• Portuguese (Brazilian, European)

• Romanian

• Russian

• Serbian

• Slovak

• Slovenian

• Somali

• Spanish (15+ varieties)

• Swedish

• Sylheti

• Tagalog

• Tamil

• Telugu

• Thai

• Turkish

• Ukrainian

• Urdu

• Vietnamese

• Wu

• Xiang

Languages covered

The list of languages above, in which Appen works, is continually expanding.

Capability for additional languages can, on request, be developed rapidly.

Database - Summary

Language | Name | Database Type | Speakers | Sampling (kHz) | Audio Hrs | Price
Arabic | CGA_ASR001 | Microphone, Scripted Speech | 150 | 16.00 | 345 | USD 20,000
Arabic (Eastern Algerian) | EAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 496 | 8.00 | 58 | USD 57,500
Arabic English | ENA_ASR001 | Conversational Telephony | 250 | 8.00 | 56 | USD 35,000
Arabic (MSA) | MSA_ASR001 | Microphone, Scripted Speech | 78 | 16.00 | 12 | EUR 3,600
Bahasa Indonesia | BAH_ASR001 | Telephony (cell and fixed), Conversational Speech | 1002 | 8.00 | 63 | USD 45,000
Bengali | BEN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 94 | USD 45,000
Bulgarian | BUL_ASR001 | Telephony (cell and fixed), Conversational Speech | 217 | 8.00 | 77 | USD 30,000
Bulgarian | BUL_ASR002 | Microphone, Scripted Speech | 77 | 16.00 | 22 | EUR 3,600
Croatian | CRO_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 79 | USD 30,000
Croatian | CRO_ASR002 | Microphone, Scripted Speech | 94 | 16.00 | 11 | EUR 3,600
Czech | CZE_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 31 | EUR 3,600
Dari | DAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 80 | USD 45,000
Dari | DAR_BRC001 | Broadcast Data | | 0.00 | 40 | USD 22,500
Dutch (Netherlands) | NLD_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 73 | USD 30,000
English (Australian) | AUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 94 | USD 20,000
English (Australian) | AUS_ASR002 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 120 | USD 31,500
English (Canadian) | ENC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 144 | USD 37,500
English (Indian) | ENI_ASR001 | Telephony (cell and fixed), Scripted Speech | 2358 | 8.00 | 225 | USD 45,000
Indian English | ENI_ASR002 | Conversational Telephony | 540 | 8.00 | 135 | USD 28,000
English (UK) | UKE_ASR001 | Telephony (cell and fixed), Conversational Speech | 1150 | 8.00 | 102 | USD 45,000
English (UK) | UKE_ASR002 | Voicemail Telephony, Spontaneous Speech | 592 | 8.00 | 69 | USD 37,500
English (US) | USE_ASR001 | Microphone, Scripted Speech | 200 | 48.00 | 124 | USD 15,000
English (US) | USE_ASR002 | Telephony (cell and fixed), Conversational Speech | 20 | 8.00 | 14 | USD 7,500
Farsi/Persian | FAR_ASR001 | Telephony (cell and fixed), Scripted Speech | 789 | 8.00 | 85 | USD 45,000
Farsi/Persian | FAR_ASR002 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 61 | USD 57,500
Filipino English | ENF_ASR001 | Conversational Telephony | 450 | 8.00 | 107 | USD 35,000
French (Canadian) | FRC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 131 | USD 37,500
French (Canadian) | FRC_ASR002 | Microphone, Scripted Speech | 120 | 16.00 | 46 | USD 22,500
French (Canadian) | FRC_ASR003 | Telephony (cell and fixed), Conversational Speech | 251 | 8.00 | 20 | USD 31,500
French (European) | FRF_ASR001 | Telephony (cell and fixed), Conversational Speech | 563 | 8.00 | 50 | USD 31,500
French (European) | FRF_ASR002 | Voicemail Telephony, Spontaneous Speech | 560 | 8.00 | 95 | USD 37,500
French (European) | FRF_ASR003 | Microphone, Scripted Speech | 98 | 16.00 | 26 | EUR 3,600
German | DEU_ASR001 | Microphone, Scripted Speech | 127 | 16.00 | 33 | USD 11,500
German | DEU_ASR002 | Voicemail Telephony, Spontaneous Speech | 890 | 8.00 | 65 | USD 37,500
German | DEU_ASR003 | Microphone, Scripted Speech | 77 | 16.00 | 25 | EUR 3,600
Hausa | HAU_ASR001 | Microphone, Scripted Speech | 103 | 16.00 | 20 | EUR 3,600
Hausa | HAU_ASR002 | Telephony (cell), Conversational Speech | 200 | 8.00 | 66 | USD 40,000
Hebrew | HEB_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 69 | USD 30,000
Hindi | HIN_ASR001 | Telephony (cell), Scripted Speech | 1920 | 8.00 | 224 | USD 45,000
Hindi | HIN_ASR002 | Telephony (cell and fixed), Conversational Speech | 996 | 8.00 | 65 | USD 45,000
Italian | ITA_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 177 | USD 12,500
Italian | ITA_ASR002 | Microphone, Scripted Speech | 103 | 48.00 | 189 | USD 19,500
Italian | ITA_ASR003 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Italian | ITA_ASR004 | Voicemail Telephony, Spontaneous Speech | 550 | 8.00 | 123 | USD 37,500
Italian | ITA_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 3 | USD 11,500
Japanese | JPN_ASR001 | Microphone, Scripted Speech | 144 | 16.00 | 33 | EUR 3,600
Kannada | KAN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Korean | KOR_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 20 | EUR 3,600
Mandarin | MAC_ASR001 | Telephony (cell), Mixed environments | 2000 | 8.00 | 115 | USD 45,000
Mandarin | MAC_ASR002 | Microphone, Scripted Speech | 132 | 16.00 | 26 | EUR 3,600
Marathi | MAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Pashto | PAS_ASR001 | Telephony (cell and fixed), Conversational Speech | 967 | 8.00 | 111 | USD 65,000
Pashto | PAS_ASR002 | Conversational microphone data | 40 | 16.00 | 80 | USD 75,000
Pashto | PAS_BRC001 | Broadcast Data | | 0.00 | 51 | USD 22,500
Polish | POL_ASR001 | Microphone, Scripted Speech | 99 | 16.00 | 25 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 26 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR002 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 66 | USD 35,000
Portuguese (European) | PTP_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Romanian | ROM_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR002 | Microphone, Scripted Speech | 115 | 16.00 | 31 | EUR 3,600
Somali | SOM_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 101 | USD 65,000
Sorani (Kurdish) | SOR_ASR001 | Telephony (cell and fixed), Conversational Speech | 170 | 8.00 | 11 | USD 30,000
Spanish (European) | ESP_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 159 | USD 12,500
Spanish (European) | ESP_ASR002 | Voicemail Telephony, Spontaneous Speech | 512 | 8.00 | 97 | USD 37,500
Spanish (European) | ESP_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 1 | USD 6,000
Spanish (Latin America) | ESL_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Swedish | SWE_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 30 | EUR 3,600
Thai | THA_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 35 | EUR 3,600
Turkish | TUR_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 83 | USD 30,000
Turkish | TUR_ASR002 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Urdu | URD_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 95 | USD 45,000
Vietnamese | VIE_ASR001 | Microphone, Scripted Speech | 129 | 16.00 | 47 | EUR 3,600

Databases - Detailed

Language Arabic

DB Name CGA_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 150

Prompts per speaker 280

Total utterances/Entries 42,000

Audio Hours 345

Sampling rate - kHz 16.00

Recording channels 4

List Price USD 20,000

Brief Description

• This is a 150-speaker microphone-recorded database

Language Materials

• Each script elicits approximately 30 minutes of recorded speech

• Each Script includes:

o 30 Person names (first name and family name) from a set of 150

o 10 single isolated digits 0-9

o 10 8-digit sequences (randomly generated)

o 200 Phonetically balanced sentences

o 30 10-word phonetically balanced word strings
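The script composition above can be cross-checked against the "Prompts per speaker" and "Total utterances/Entries" fields with simple arithmetic. The short Python sketch below is illustrative only: the dictionary keys and variable names are assumptions introduced here, while the figures are those quoted in this entry.

```python
# Per-script item counts as listed for CGA_ASR001.
# The key names are illustrative labels, not part of the database.
script_items = {
    "person_names": 30,
    "isolated_digits": 10,
    "eight_digit_sequences": 10,
    "phonetically_balanced_sentences": 200,
    "ten_word_strings": 30,
}

speakers = 150  # "Speakers" field of this entry

# Items in one script, and utterances across all speakers.
prompts_per_speaker = sum(script_items.values())
total_utterances = speakers * prompts_per_speaker

print(prompts_per_speaker)  # 280
print(total_utterances)     # 42000
```

Both results agree with the figures listed for this database (280 prompts per speaker, 42,000 utterances).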

Demographics

• 50% of speakers are from the United Arab Emirates

• 50% of speakers are from Saudi Arabia

Transcriptions

• Complete transcriptions of the content of the speech files at a word level

• All acoustic events have been tagged using conventions derived from the SpeechDAT model

• All transcriptions fully vowelized

Contact Appen for further information


Language Arabic (Eastern Algerian)

DB Name EAR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony


Speakers 496

Prompts per speaker

Total utterances/Entries

Audio Hours 58

Brief Description

• This is a 496-speaker conversational** telephony database

• Approximately 29 hours of conversation data (equivalent to 58 hours of single-channel audio)

• Broad distribution of age, gender and dialects (Algiers and Constantine)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed.
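The relationship between conversation time and single-channel audio time quoted above can be sketched as follows. This is a minimal illustration that assumes every call contributes two recorded channels (one per speaker); the function name is hypothetical, and because some calls here were collected one-sided, the catalogue figures should be read as approximate.

```python
def single_channel_hours(conversation_hours, channels=2):
    """Convert two-party conversation time to single-channel audio time.

    Assumes each conversation is stored as `channels` separate channels
    of equal length (an assumption made for this sketch).
    """
    return conversation_hours * channels


# Figures quoted for EAR_ASR001: about 29 hours of conversation.
print(single_channel_hours(29))  # 58
```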



Language Arabic English

DB Name ENA_ASR001

DB type 1 ASR


Environments Low background noise

Unique speakers 250

Average call length 10-15 minutes

Total utterances/Entries N/A

Audio Hours 56

Sampling rate – kHz 8.00

Brief Description

• 115 telephony conversations are recorded for this project

• Demographic information is as follows:

o Roughly equal distribution of male and female

o Broad range of ages from 18 years – 55 years

o Approximately 50% landline/50% mobile

o Speakers speak on a range of generic topics

o Roughly equal distribution of Levantine Arabic and Egyptian Arabic speakers


audio)





Language Arabic (MSA)

DB Name MSA_ASR001

DB type 1 ASR



Speakers 78

Prompts per speaker


Audio Hours 12



List Price EUR 3,600

Brief Description

• This is a 78-speaker microphone-recorded database, GlobalPhone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender: 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)



Language Bahasa Indonesia

DB Name BAH_ASR001

DB type 1 ASR



Speakers 1,002

Prompts per speaker


Audio Hours 63




Brief Description

• This is a 1,002-speaker conversational** telephony database


audio)



** For a large proportion of calls, only one half of the conversation was collected and transcribed



Language Bengali

DB Name BEN_ASR001

DB type 1 ASR



Speakers 1,000

Prompts per speaker


Audio Hours 94




Brief Description

• This is a 1,000-speaker conversational telephony database

• Approximately 47 hours of conversation data (equivalent to 94 hours of single-channel audio)





Language Bulgarian

DB Name BUL_ASR001

DB type 1 ASR



Speakers 217

Prompts per speaker


Audio Hours 77




Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers


channel audio)





Language Bulgarian

DB Name BUL_ASR002

DB type 1 ASR



Speakers 77

Prompts per speaker


Audio Hours 22




Brief Description


collaboration with the Karlsruhe Institute of Technology (KIT).








Romanized form (where applicable).



Language Croatian

DB Name CRO_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 79




Brief Description


• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls



channel audio)





Language Croatian

DB Name CRO_ASR002

DB type 1 ASR



Speakers 94

Prompts per speaker

Audio Hours 11





Brief Description













Language Czech

DB Name CZE_ASR001

DB type 1 ASR



Speakers 102

Prompts per speaker


Audio Hours 31




Brief Description













Language Dari

DB Name DAR_ASR001

DB type 1 ASR



Speakers 500

Prompts per speaker


Audio Hours 80




Brief Description



audio)



• Telephony Distribution

o Landline 13%

o Mobile 87%



Language Dari

DB Name DAR_BRC001

DB type 1 Broadcast

DB type 2 Broadcast Data

Environments Broadcast Data

Speakers

Prompts per speaker


Audio Hours 40




Brief Description

• Database contains 40 hours of Dari broadcast data

• Database is largely speech only and does not include music or advertisements

• Data types include:

o Talk shows

o Interviews

o News broadcasts (excluding news reading by anchors)




Language Dutch (Netherlands)

DB Name NLD_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 73




Brief Description





audio)





Language English (Australian)

DB Name AUS_ASR001

DB type 1 ASR

DB type 2 Telephony


Speakers 500



Audio Hours 94




Brief Description

• This is a 500-speaker telephony database

• 500 speakers, including some migrant representation: Asian (predominantly Chinese), Middle Eastern (predominantly Lebanese) and New Zealand-accented English

• 165 prompts (read speech) per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items (from a set of 215)

o Phonetically rich Sentences and Words

• Mobile 50%, fixed line 50%

• Age and Gender balanced

• Moderately quiet environments (home/office)

• Total audio length: 94 hours

• Fully transcribed to SpeechDAT type conventions




Language English (Australian)

DB Name AUS_ASR002

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000



Audio Hours 120




Brief Description

• This is a 1,000 speaker Australian English database

• 75 prompts per speaker, including:

o Digits

o Natural Numbers

o Letter strings



o Generic Command and Control items


• The prompts are a mixture of 'read' and 'elicited' items. 5 prompts per script are 'spontaneous free speech'

• Mixture of mobile and landline

• Age and Gender balanced






Language English (Canadian)

DB Name ENC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000



Audio Hours 144




Brief Description

• This is an extended SALA II database.

• 49 prompts per speaker are as specified by the SALA II consortium. An additional 50 prompts (similar content) were recorded by each speaker.


o Digits

o Natural Numbers

o Letter strings





• Mobile telephony recorded in a range of environments including in-car, home/office, roadside and other public places

• Fully transcribed to SALA II/SpeechDAT type conventions




Language English (Indian)

DB Name ENI_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 2,358



Audio Hours 225




Brief Description

• This is a 2,358-speaker Indian English mobile telephony speech database recorded on location in India

• Database Type

o Medium Level background noise - in-car, home/office, roadside and other public place type environments

o Total audio length - Approximately 225 hours

• Demographics

o 2,358 speakers recorded in India

o 50% male, 50% female

o Broad distribution of age groups (16-60 years) and dialects

• Language Materials

o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place, and Business Names; Confirmation items (yes, no + fuzzy); Generic Command and Control items and Phonetically rich Sentences and Words

• Transcription and Lexicon

o Fully transcribed to SpeechDAT type conventions

o Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words

o Lexicon - 10,128 unique headwords



Language Indian English

DB Name ENI_ASR002

DB type 1 ASR



Unique speakers 540



Audio Hours 135




Brief Description





o Dialect distribution:

Eastern India 10%

Northern India 35%

Pakistan 15%

Southern India 20%

Western India 19%




audio).





Language English (UK)

DB Name UKE_ASR001

DB type 1 ASR



Speakers 1,150

Prompts per speaker


Audio Hours 102




Brief Description


• Provides good coverage of key accents across the UK and Ireland


audio).



• Note: additional data available - please contact Appen for more details



Language English (UK)

DB Name UKE_ASR002

DB type 1 ASR

DB type 2 Voicemail Telephony


Speakers 592

Prompts per speaker


Audio Hours 69




Brief Description

• This is a 592-speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across the United Kingdom

• Approximately 69 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected from a pool of approx. 200 common voicemail scenarios (e.g. "Leave a message to tell your colleague that you are running late for a meeting")

• The database is fully transcribed and is accompanied by a pronunciation lexicon containing all transcribed words



Language English (US)

DB Name USE_ASR001

DB type 1 ASR

DB type 2 Studio/microphone recordings

Environments Studio

Speakers 200



Audio Hours 124




Brief Description

• This is a 200 speaker microphone recorded database

• Each speaker read 400 prompts including:

o Digits

o Natural Numbers

o Personal and City names

o Telephone Numbers



• All speakers were recorded in a studio type environment in the USA

• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all transcribed words



Language English (US)

DB Name USE_ASR002

DB type 1 ASR



Speakers 20

Prompts per speaker


Audio Hours 14




Brief Description


• Call-Centre style conversations

• Approximately 7 hours of conversation data in total





Language Farsi/Persian

DB Name FAR_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 789


Audio Hours 85





Brief Description

• This is a 789 speaker Farsi telephony speech database recorded on location in Iran.

• 50% male, 50% female

• Broad distribution of age groups (16-60 years) and dialects

• Medium Level background noise - in-vehicle, home/office, roadside and other public place type environments


o 48 prompts per speaker, including Digits; Natural Numbers; Letter strings; Personal, Place, and Business names; Confirmation items (yes and no); Generic Command and Control items and Phonetically Rich sentences and words

• Transcriptions

o Fully transcribed to OrienTel type conventions

• Lexicon


transcribed words

• Total audio length - Approximately 85 hours



Language Farsi/Persian

DB Name FAR_ASR002

DB type 1 ASR


Environments Mixed

Speakers 1,000

Prompts per speaker


Audio Hours 61




Brief Description



audio)

• Database is fully transcribed and time stamped




Language Filipino English

DB Name ENF_ASR001

DB type 1 ASR



Unique speakers 450



Audio Hours 107




Brief Description








audio).





Language French (Canadian)

DB Name FRC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000



Audio Hours 131




Brief Description

• This is an extended SALA II database

• 48 prompts per speaker are as specified by the SALA II consortium. An additional 52 prompts (similar content) were recorded by each speaker


o Digits

o Natural Numbers

o Letter strings





• Mobile telephony recorded in a range of environments including in-car, home/office, roadside and other public places


• Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words



Databases


DB Name FRC_ASR002

DB type 1 ASR

DB type 2 Microphone recordings


Speakers 120



Audio Hours 46




Brief Description


• Scripts include:

o Person names

o Digits

o Digit strings (randomly generated)

o Addresses

o Phonetically rich sentences

• Dialects

o 50% Quebecois – Montreal

o 50% Quebecois – Other






Databases


DB Name FRC_ASR003

DB type 1 ASR



Speakers 251

Prompts per speaker


Audio Hours 20




Brief Description



audio)



** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a small number of calls, only one half of the conversation was collected and transcribed




Databases

Language French (European)

DB Name FRF_ASR001

DB type 1 ASR



Speakers 563

Prompts per speaker


Audio Hours 50




Brief Description



audio).



** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed



Databases


DB Name FRF_ASR002

DB type 1 ASR



Speakers 560

Prompts per speaker


Audio Hours 95




Brief Description



• Provides good representation of key accents across France










Databases


DB Name FRF_ASR003

DB type 1 ASR



Speakers 98

Prompts per speaker


Audio Hours 26




Brief Description














Databases

Language German

DB Name DEU_ASR001

DB type 1 ASR


Environments Studio

Speakers 127



Audio Hours 33




Brief Description


• Each speaker read 100 prompts including:

o Digits

o Natural Numbers

o Personal and City names

o Telephone Numbers



• All speakers were recorded in a studio type environment in Germany

• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all transcribed words



Databases

Language German

DB Name DEU_ASR002

DB type 1 ASR



Speakers 890

Prompts per speaker


Audio Hours 65


Recording channels


Brief Description

• This is an 890 speaker voicemail telephony database


• Provides good representation of key accents across Germany










Databases

Language German

DB Name DEU_ASR003

DB type 1 ASR



Speakers 77

Prompts per speaker


Audio Hours 25




Brief Description


collaboration with the Karlsruhe Institute of Technology (KIT).

• Each speaker reads a number of phonetically rich sentences











Databases

Language Hausa

DB Name HAU_ASR001

DB type 1 ASR



Speakers 103

Prompts per speaker


Audio Hours 20




Brief Description



• Each speaker reads a number of phonetically rich sentences










Databases

Language Hausa

DB Name HAU_ASR002

DB type 1 ASR

DB type 2 Conversational telephony


Speakers 200

Prompts per speaker


Audio Hours 66




Brief Description



each, to a pool of 100 call receivers


• channel audio).





Databases

Language Hebrew

DB Name HEB_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 69




Brief Description





• channel audio).





Databases

Language Hindi

DB Name HIN_ASR001

DB type 1 ASR

DB type 2 Telephony


Speakers 1,920



Audio Hours 224




Brief Description

• This is a 1,920 speaker Hindi mobile telephony speech database. The database comprises 1,920 speakers who speak Hindi as a second language (i.e. native speakers of Telugu, Gujarati, etc. who use Hindi as a second language) recorded on location in India

• Database Type

o 1,920 speakers recorded in India

o 50% male, 50% female

o Broad distribution of age groups (16-60 years) and dialects

o Medium Level background noise - in-car, home/office, roadside and other public place type environments

o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place and Business names; Confirmation items (yes, no + fuzzy); Generic Command and Control items; Phonetically rich Sentences and Words; and Web addresses

• Transcriptions

o Fully transcribed to SpeechDAT type conventions

• Lexicon


transcribed words

o Lexicon - 9,853 unique headwords

o Total audio length - Approximately 224 hours



Databases

Language Hindi

DB Name HIN_ASR002

DB type 1 ASR


Environments Mixed

Speakers 996

Prompts per speaker


Audio Hours 65




Brief Description



channel audio)



** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed



Databases

Language Italian

DB Name ITA_ASR001

DB type 1 ASR


Environments Mixed

Speakers 200



Audio Hours 177




Brief Description


• Each speaker read 200 utterances:

o 100 - command and control type items

o 100 - phonetically rich sentences



• Lexicon - 7,316 unique headwords

• Total audio length - 177 hours



Databases

Language Italian

DB Name ITA_ASR002

DB type 1 ASR


Environments In-Car

Speakers 103


Audio Hours 189





Brief Description

• This is a 205 session In-Car database

• Each speaker recorded 1 or 2 sessions:

o Session 1 in a parked vehicle with the engine running

o Session 2 in a vehicle travelling at 60 mph (100 km/h).

• 350 prompts were read by each speaker (175 per session) including:

o Digits

o Street names








Databases

Language Italian

DB Name ITA_ASR003

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 72




Brief Description





audio)






Databases

Language Italian

DB Name ITA_ASR004

DB type 1 ASR



Speakers 550

Prompts per speaker


Audio Hours 123




Brief Description



• Provides good representation of key accents across Italy










Databases

Language Italian

DB Name ITA_TTS001

DB type 1 TTS


Environments Studio

Speakers 1

Prompts per speaker 3,300


Audio Hours 3




Brief Description

• This is a single speaker TTS speech database. The database comprises 3,300 phonetically rich sentences recorded by a male Italian speaker in a studio environment. The database is accompanied by a pronunciation lexicon containing an entry for each of the words spoken in the database



Databases

Language Japanese

DB Name JPN_ASR001

DB type 1 ASR



Speakers 144

Prompts per speaker


Audio Hours 33




Brief Description













Databases

Language Kannada

DB Name KAN_ASR001

DB type 1 ASR


Environments Mixed

Speakers 1,000

Prompts per speaker


Audio Hours 30




Brief Description



audio).





Databases

Language Korean

DB Name KOR_ASR001

DB type 1 ASR



Speakers 100

Prompts per speaker


Audio Hours 20




Brief Description













Databases

Language Mandarin

DB Name MAC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 2,000



Audio Hours 115




Brief Description

• This is a 2,000 speaker Mandarin mobile telephony speech data collection

• The database comprises 2,000 Mandarin speakers recorded on location in China

• 2,000 speakers recorded in China

• 50% male, 50% female

• 100% Mobile Telephony

• Broad distribution of age groups (16-60 years)

• Language Materials


o Digits

o Natural Numbers





• Transcriptions


• Lexicon

• Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words



Databases

Language Mandarin

DB Name MAC_ASR002

DB type 1 ASR



Speakers 132

Prompts per speaker


Audio Hours 26




Brief Description













Databases

Language Marathi

DB Name MAR_ASR001

DB type 1 ASR


Environments Mixed

Speakers 1,000

Prompts per speaker


Audio Hours 30




Brief Description



audio).





Databases

Language Pashto

DB Name PAS_ASR001

DB type 1 ASR



Speakers 967

Prompts per speaker


Audio Hours 111




Brief Description



audio).



• For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed



Databases

Language Pashto

DB Name PAS_ASR002

DB type 1 ASR

DB type 2 Conversational microphone data


Number of sessions 40

Average session length 120 minutes


Audio Hours 80



List Price USD 75,000

Brief Description

• Each recording consists of a number of TransTAC style dialogues (monolingual 2-way conversations). One speaker acts as an interviewer and the other as the interviewee

• The interviewer appears in more than one set of dialogues but the interviewee is unique for each set

• Data collection scenarios are similar to TransTAC style (e.g. civil affairs, checkpoints etc.)


o Roughly 25% female and 75% male speakers


o Broad distribution across two dialect regions in Afghanistan

• 40 hours of conversation data (equivalent to 80 hours of single channel audio)

• A full translation of the transcripts into French is also available as an optional additional purchase



Databases

Language Pashto

DB Name PAS_BRC001

DB type 1 Broadcast

DB type 2 Broadcast Data

Environments Broadcast Data

Speakers

Prompts per speaker


Audio Hours 51




Brief Description

• Database contains 50 hours of Pashto broadcast data

• Database is largely speech only and does not include music or advertisements

• Data types include:

o Talk shows

o Interviews

o News broadcasts (excluding news reading by anchors)




Databases

Language Polish

DB Name POL_ASR001

DB type 1 ASR



Speakers 99

Prompts per speaker


Audio Hours 25




Brief Description













Databases

Language Portuguese (Brazilian)

DB Name PTB_ASR001

DB type 1 ASR



Speakers 102

Prompts per speaker


Audio Hours 26




Brief Description













Databases

Language Portuguese (Brazilian)

DB Name PTB_ASR002

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 66




Brief Description

• This is a 300 speaker conversational telephony database. For this project, some speakers have participated in up to 2 calls


audio).





Databases

Language Portuguese (European)

DB Name PTP_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 72




Brief Description





audio).





Databases

Language Romanian

DB Name ROM_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 74




Brief Description


• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls



audio)





Databases

Language Russian

DB Name RUS_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 74




Brief Description





audio).





Databases

Language Russian

DB Name RUS_ASR002

DB type 1 ASR



Speakers 115

Prompts per speaker


Audio Hours 31




Brief Description













Databases

Language Somali

DB Name SOM_ASR001

DB type 1 ASR



Speakers 1,000

Prompts per speaker


Audio Hours 101




Brief Description



audio)





Databases

Language Sorani (Kurdish)

DB Name SOR_ASR001

DB type 1 ASR



Speakers 170

Prompts per speaker


Audio Hours 11




Brief Description



audio).



• For a large proportion of calls, only one half of the conversation was collected and transcribed



Databases

Language Spanish (European)

DB Name ESP_ASR001

DB type 1 ASR


Environments Mixed

Speakers 200



Audio Hours 159




Brief Description


• Each speaker read 200 utterances:

o 100 - command and control type items

o 100 - phonetically rich sentences



• Lexicon - 6,367 unique headwords

• Total audio length - 159 hours



Databases


DB Name ESP_ASR002

DB type 1 ASR



Speakers 512

Prompts per speaker


Audio Hours 97




Brief Description



• Provides good representation of key accents across Spain









Databases


DB Name ESP_TTS001

DB type 1 TTS


Environments Studio

Speakers 1

Prompts per speaker 1,787


Audio Hours 1




Brief Description

• This is a single speaker TTS speech database. The database comprises 1,786 phonetically rich sentences recorded by a male Spanish speaker in a studio environment. The database is accompanied by a pronunciation lexicon containing an entry for each of the words spoken in the database



Databases

Language Spanish (Latin America)

DB Name ESL_ASR001

DB type 1 ASR



Speakers 100

Prompts per speaker


Audio Hours 17




Brief Description













Databases

Language Swedish

DB Name SWE_ASR001

DB type 1 ASR



Speakers 98

Prompts per speaker


Audio Hours 30




Brief Description













Databases

Language Thai

DB Name THA_ASR001

DB type 1 ASR



Speakers 98

Prompts per speaker


Audio Hours 35




Brief Description













Databases

Language Turkish

DB Name TUR_ASR001

DB type 1 ASR



Speakers 200

Prompts per speaker


Audio Hours 83




Brief Description





audio).





Databases

Language Turkish

DB Name TUR_ASR002

DB type 1 ASR



Speakers 100

Prompts per speaker


Audio Hours 17




Brief Description













Databases

Language Urdu

DB Name URD_ASR001

DB type 1 ASR


Environments Mixed

Speakers 1,000

Prompts per speaker


Audio Hours 95




Brief Description

• This is a 1,000 speaker conversational telephony database recorded by native Urdu speakers in Pakistan (700 speakers) and India (300 speakers)


audio).





Databases

Language Vietnamese

DB Name VIE_ASR001

DB type 1 ASR



Speakers 129

Prompts per speaker


Audio Hours 47




Brief Description













Lexica

Overview

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

• Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

• Part-of-speech tagged Lexica providing grammatical and semantic labels

• Other reference text-based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalisation lists

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages (please see language list below).

Domain Coverage

Typical domains covered in our off-the-shelf holdings for a given language include:

• General Vocabulary

• Geographical Names, e.g. Place Names (City, State, Suburb)

• Numbers (0-10,000)

• Person Names (both Given and Family)

Lexica can be developed from a wordlist provided by the client or by Appen Butler Hill. If a client requires vocabulary of a specific nature or to cover a specific domain, this can typically be provided under the same license and pricing terms as our pre-existing (off-the-shelf) holdings.

Lexicon Structure

• Our Lexica are usually created using a SAMPA phone set for the language, which aligns SAMPA symbols with IPA equivalents. We can convert to most other machine-readable formats on request

• We also include documentation files which include phone set definitions and statistical notes about phone coverage within a given Lexicon, and which may include background information on data quality and validation

Lexica are typically delivered as text files consisting of three or four tab-delimited fields:

Field 1 - Headword

Field 2 - SAMPA pronunciation

Field 3 - Variant Rank (0 = preferred pronunciation; 1 = also heard, less common)

Field 4 - Label, e.g. FAMILY_NAME, GIVEN_NAME, COMMON_WORD, etc.

In addition to the phonemic mark-up, our Lexica are marked up for primary and secondary stress and for syllabification where applicable. They will also include pronunciation variants where relevant.
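As a rough illustration of the tab-delimited delivery format described above, the following Python sketch reads lexicon lines into records and picks out the preferred (rank 0) pronunciation for each headword. The names `LexiconEntry`, `parse_lexicon_line` and `preferred_pronunciations` are hypothetical, and the sample lines are invented for this example (they omit stress and syllable marks); an actual delivery should be checked against its accompanying documentation files.

```python
from typing import NamedTuple, Optional


class LexiconEntry(NamedTuple):
    headword: str
    sampa: str                # SAMPA pronunciation string, kept verbatim
    variant_rank: int         # 0 = preferred pronunciation, 1 = less common variant
    label: Optional[str]      # e.g. FAMILY_NAME; None when the fourth field is absent


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one tab-delimited lexicon line with three or four fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tab-delimited fields, got {len(fields)}")
    headword, sampa, rank = fields[0], fields[1], fields[2]
    label = fields[3] if len(fields) == 4 else None
    return LexiconEntry(headword, sampa, int(rank), label)


def preferred_pronunciations(lines):
    """Map each headword to its rank-0 (preferred) pronunciation."""
    preferred = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_lexicon_line(line)
        if entry.variant_rank == 0:
            preferred[entry.headword] = entry.sampa
    return preferred


# Invented sample lines, not taken from any Appen lexicon.
SAMPLE_LINES = [
    "Smith\ts m I T\t0\tFAMILY_NAME",
    "either\ti: D @\t0\tCOMMON_WORD",
    "either\taI D @\t1\tCOMMON_WORD",
]

print(preferred_pronunciations(SAMPLE_LINES))
```

Keeping the variant rank as an integer makes it straightforward to keep or discard the less common pronunciations, depending on the application.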

License Price per headword (USD), by Lexicon Category:

Category 1 - Most languages using Latin-based orthographies: USD 0.335

Category 2 - Languages requiring tone mark-up (e.g. Mandarin, Cantonese) and languages requiring multiple representational forms in the orthography (e.g. Japanese): USD 0.415

Category 3 - Languages requiring full diacritization/vowelization (e.g. Arabic): USD 0.460

Pricing for specialized Languages and Part-of-Speech Tagged Lexica can be provided on request.
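To make the per-headword pricing concrete, here is a minimal Python sketch that multiplies a headword count by the listed category price. It assumes the license price scales linearly with the number of headwords, which the catalog does not state explicitly; the function name `estimate_license_price` and the example order are hypothetical.

```python
from decimal import Decimal

# Per-headword license prices (USD) for the three lexicon categories listed above.
PRICE_PER_HEADWORD_USD = {
    1: Decimal("0.335"),
    2: Decimal("0.415"),
    3: Decimal("0.460"),
}


def estimate_license_price(category: int, headwords: int) -> Decimal:
    """Return headwords x per-headword price for the given lexicon category."""
    if headwords < 0:
        raise ValueError("headwords must be non-negative")
    return PRICE_PER_HEADWORD_USD[category] * headwords


# Hypothetical order: 20,000 headwords of a category 2 language.
print(estimate_license_price(2, 20_000))
```

`Decimal` is used instead of floating point so that prices with three decimal places are multiplied exactly.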


Number of headwords

New offerings are frequently added. For holdings information in a given language or to discuss any customized development efforts, please contact:

[email protected]

appen.com

[Bar chart: number of headwords per language, axis 0 to 50,000. Six bars run past the axis and are labelled >75,000, >55,000, >100,000, >110,000, >75,000 and >70,000; the languages they belong to are not recoverable here.]

Languages shown:

Arabic (Algerian)
Arabic (Egyptian)
Arabic (Gulf)
Arabic (Iraqi)
Arabic (Maghrebi)
Arabic (MSA)
Arabic (North Levantine)
Arabic (Palestinian)
Arabic (South Levantine)
Arabic (Syrian)
Arabic (UAE)
Assamese
Bahasa Indonesia
Bahasa Malay
Basque
Bengali
Bulgarian
Cantonese
Catalan
Croatian
Czech
Danish
Dari
Dutch
English (Australian)
English (Canadian)

Number of headwords (continued)

[Bar chart: number of headwords per language, axis 0 to 50,000. Ten bars run past the axis and are labelled >155,000, >85,000, >60,000, >110,000, >55,000, >190,000, >260,000, >100,000, >115,000 and >200,000; the languages they belong to are not recoverable here.]

Languages shown:

English (Indian)
English (New Zealand)
English (UK)
English (US)
Finnish
French (Belgian)
French (Canadian)
French (European)
French (Luxembourg)
French (Switzerland)
German
German (Austria)
German (Switzerland)
Greek
Hausa
Hebrew
Hindi
Hungarian
Italian
Japanese
Kannada
Korean
Malayalam
Mandarin
Marathi
Norwegian

Number of headwords (continued)

[Bar chart: number of headwords per language, axis 0 to 50,000. Nine bars run past the axis and are labelled >250,000, >100,000, >100,000, >90,000, >115,000, >100,000, >50,000, >100,000 and >100,000; the languages they belong to are not recoverable here.]

Languages shown:

Oriya
Pashto
Persian/Farsi
Polish
Portuguese (Brazil)
Portuguese (EU)
Romanian
Russian
Serbian
Somali
Sorani (Kurdish)
Spanish (All Latin America)
Spanish (American - US)
Spanish (EU - Castilian)
Spanish (Mexican)
Swahili (Kenya)
Swedish
Sylheti
Tagalog
Tamil
Telugu
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Wu
Xiang


Other Language Resources

Apart from speech databases and lexica, Appen also has a range of other language resources available for license, which can be found in this section. These resources include:

1. Text Corpora: We have a wide variety of text collections in different languages available for license. Apart from the Vowelized Arabic Corpus, Appen also has a range of Named Entity annotated texts. These are corpora of 500,000 words of news text that have been annotated for persons, titles, quantities, geopolitical entities, locations, facilities, etc.

2. Morphological Analyzers: Our morphological analyzers are designed to generate grammatically acceptable words using tagged stem dictionaries and information on inflectional affixes and their combinations. They can manipulate text from languages with non-Latin scripts and currently generate Urdu and Persian, including informal written variants of affixes.

3. Thesaurus: Appen can undertake thesaurus development in several ways: from first principles, as an extension to existing work, or as validation of an existing thesaurus, with consistency and coverage an important focus. Because each language is subtly different and requires deep grammatical analysis to produce a quality product, native speakers are always used to build a thesaurus. Appen can produce thesauri as a licensable database, supplied in a standard XML format or to client specifications.

4. Language Analysis Documentation: Appen can provide comprehensive language analysis documents under license for all languages of interest. These documents support system and application developers and include phonological features and processes, analysis of Romanization schemes (where applicable), regional and dialectal differences, and population statistics of speakers. Appen can also provide analysis and recommendations on specific collections for a nominated language.


Language Analysis Documents

List Price USD 2,500 (per language)

Brief Description

The key topics that are typically covered in the language analysis document include:

• General Information about the country

• General Information about the language

• Language classification of the language

• Other Languages spoken in the country

• History of the language (where relevant) including changes due to immigration etc.

• Dialects of the language

o maps indicating dialect regions

o discussion of dialects - distribution, features etc.

o recommendations on a dialect distribution that would be feasible to use in a speech data collection

• Sound System of the language

• Relevant Phonological Processes prevalent in the language/country

• Orthographic Conventions for the language

• Communications

Language DB Name

Arabic (Iraqi) ARB_LAN001

Arabic (North Levantine) ARB_LAN002

Bahasa Indonesia BAH_LAN001

Brazilian Portuguese PTB_LAN001

Croatian CRO_LAN001

Dari DAR_LAN001

English (US) ENG_LAN001

Farsi/Persian FAR_LAN001

French (Canadian) FRC_LAN001

German DEU_LAN001

Hebrew HEB_LAN001

Japanese JAP_LAN001

Korean KOR_LAN001

Mandarin MAC_LAN001

Pashto PAS_LAN001

Russian RUS_LAN001

Serbian SRB_LAN001

Sorani (Kurdish) SOR_LAN001

Thai THA_LAN001

Urdu URD_LAN001


NER Corpora

Words 500,000 (per language)

List Price USD 7,500 (per language)

Brief Description

Corpora containing text material collected from a variety of sources. Each Text Corpus contains approximately 500,000 words and is tagged for the following Named Entities:

- Person

- Organization

- Location

- Nationality

- Religion

- Facility

- Geo-Political Entity

- Titles

Language DB Name

Arabic ARB_NER001

English ENG_NER001

Farsi/Persian FAR_NER001

Japanese JPY_NER001

Korean KOR_NER001

Mandarin MAC_NER001

Russian RUS_NER001

Urdu URD_NER001


Text Corpora

Language Arabic (MSA)

DB Name ARB_THE001

DB type 2 Thesaurus

Words 28,000

List Price Provided on request

Brief Description:

• The thesaurus contains 28,000 headwords

• For each headword, the following information is provided:

o Detailed Part-Of-Speech information including Verb (Intransitive/Transitive), Adverb, Noun, Adjective

o A broad definition in English

o Synonyms

o Antonyms

o A broad definition of the antonym group linked to the sense group


Text Corpora

Language Arabic (MSA)

DB Name ARB_TXT001

DB type 2 Vowelized text corpus

Words 450,000

Brief Description:

• This vowelised corpus is made up of 450,000 words of Arabic news text

• The text has been 100% manually vowelised and checked


Text Corpora

Language Farsi/Persian

DB Name FAR_MOR001

DB type 2 Morphological Database

Words 0

Brief Description:

• The Farsi/Persian morphological database comprises six files in text format:

- a stems dictionary;

- a dictionary of inflectional prefixes;

- a dictionary of inflectional suffixes; and

- three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem).

• The format of the six files corresponds to the input format required by the Buckwalter AraGen generation program. This program uses the input files to output the complete set of potential words defined by the stem and affix dictionaries and compatibility tables

• All words and affixes in the six files are in a Romanized form (converted using an Appen conversion table). Each word and affix is shown with and without short vowels. The form with short vowels (the vowelized form) reflects the pronunciation of the word or affix

• SUMMARY OF CONTENTS

- Stems in stem dictionary - 18,364 (including stem alternations)

- Stems in stem dictionary - 16,492 (excluding stem alternations)

- Number of suffixes: 506 (including zero suffix and variants of suffixes with and without the zero width non-joiner character)

- Number of prefixes: 14 (including zero prefix)

- Number of unique words generated: 1,608,559
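The generation step described above (stem and affix dictionaries filtered by three compatibility tables) can be sketched in Python as follows. This is an illustration of the general idea only, not the AraGen program or its file format: it assumes each dictionary entry carries a category label and that each compatibility table is a set of allowed category pairs. The function name `generate_words`, the category labels, and the toy Romanized forms are all invented for the example.

```python
from itertools import product


def generate_words(prefixes, stems, suffixes, prefix_stem, prefix_suffix, stem_suffix):
    """Return the sorted set of surface forms licensed by the compatibility tables.

    prefixes, stems, suffixes: lists of (form, category) pairs.
    prefix_stem, prefix_suffix, stem_suffix: sets of compatible category pairs.
    """
    words = set()
    for (p, pc), (s, sc), (x, xc) in product(prefixes, stems, suffixes):
        # A combination is kept only if all three pairwise checks pass.
        if (pc, sc) in prefix_stem and (pc, xc) in prefix_suffix and (sc, xc) in stem_suffix:
            words.add(p + s + x)
    return sorted(words)


# Toy data, invented for illustration; the empty string stands for a zero affix.
PREFIXES = [("", "P0"), ("mi", "Pv")]
STEMS = [("ketab", "N"), ("rav", "V")]
SUFFIXES = [("", "S0"), ("ha", "Spl"), ("am", "S1sg")]

PREFIX_STEM = {("P0", "N"), ("P0", "V"), ("Pv", "V")}
PREFIX_SUFFIX = {("P0", "S0"), ("P0", "Spl"), ("P0", "S1sg"), ("Pv", "S1sg")}
STEM_SUFFIX = {("N", "S0"), ("N", "Spl"), ("V", "S1sg")}

WORDS = generate_words(PREFIXES, STEMS, SUFFIXES, PREFIX_STEM, PREFIX_SUFFIX, STEM_SUFFIX)
print(WORDS)
```

Because every prefix, stem and suffix combination is tested, the number of generated words can greatly exceed the number of stems, which is consistent with the counts in the summary above.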


Text Corpora

Language Urdu

DB Name URD_MOR001

DB type 2 Morphological Database

Words 0

Brief Description:

• The Urdu morphological database comprises six files in text format:

- a stems dictionary;

- a dictionary of inflectional prefixes;

- a dictionary of inflectional suffixes; and

- three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem)

• The format of the six files corresponds to the input format required by the Buckwalter AraGen generation program. This program uses the input files to output the complete set of potential words defined by the stem and affix dictionaries and compatibility tables.

• All words and affixes in the six files are in a Romanized form (converted using an Appen conversion table). Each word and affix is shown with and without short vowels. The form with short vowels (the vowelized form) reflects the pronunciation of the word or affix.

• SUMMARY OF CONTENTS

- Stems in stem dictionary - 13,267 (including stem alternations)

- Stems in stem dictionary - 13,116 (excluding stem alternations)

- Number of suffixes: 115 (including zero suffix)

- Number of prefixes: 1 (zero prefix)

- Number of unique words generated: 31,109
