Language Resources Catalog

Contact details - cdn.appenresources.com details Appen Pty Ltd Level 6 9 Help Street ... • Assamese • Bahasa Indonesia ... (Iran) • Korean (North, South)


Contact details

Appen Pty Ltd
Level 6
9 Help Street
Chatswood, Sydney
NSW 2067 Australia

Enquiries:

Sydney office: +61-2-9468-6335
US sales enquiries: +1-315-335-4020
Europe: +31-622-799-535
Japan & Korea: +1-202-765-7106
China: +61-2-9468-6310

[email protected]

www.appen.com


Table of Contents

A global leader in linguistic technology solutions

Speech Databases - Summary

Speech Databases - Detailed

Lexica

Other Language Resources


A global leader in linguistic technology solutions

Appen brings the forefront of speech and language technology to you. We deliver the highest quality in linguistic solutions to government agencies and the world’s largest corporations, with proven expertise in over 150 languages.

We understand the complex linguistic needs of today’s leading organizations. Our unparalleled range of resources and solutions gives you the edge in a wide array of applications, including:

• speech recognition
• text-to-speech synthesis
• speech analytics
• machine translation
• natural language processing

Appen’s reputation as a global leader guarantees you:

• flexibility and rapid response capability
• global coverage in over 150 languages
• highly qualified specialist personnel
• large, closely vetted crowds of in-country native speakers
• tight project management
• keen innovation and creativity
• strict client confidentiality

Appen remains fully independent of any systems provider, although we do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, and have supported similar projects funded by DARPA and other US agencies.

Data Collection

Whatever speech and language data you need for your application, Appen will collect it for you.

Our end-to-end data collection service delivers efficiency and quality, even on multiple large-scale collections in parallel.

Available collection types include:

• telephony – fixed-line, mobile, in-car
• embedded device – in-car, desktop, smartphone, tablet
• single/multi-speaker – speakers selected by demographic or other requirements
• prompt variation – scripted, spontaneous, conversational (dialogue), meeting data
• modality – speech, text, handwriting, gesture, image and other acoustic data
• text corpora and other resources – email, SMS, named entity tags, POS tags

As part of a standard collection, we offer you the following:

• detailed linguistic and cultural research
• script preparation and localization
• crowdsourcing of native speakers
• local and remote speech recording
• transcription and annotation of collected data
• quality assurance and project management
• lexicon entries matching database contents
• packaging of database in a coherent format

A global leader in linguistic technology solutions

Appen provides high quality speech and language technology products and services to technology developers and government organizations, and is recognized as a global leader in the quality and coverage of its products and services.

Our products and services cover a wide range of applications in speech recognition, text-to-speech synthesis, phonetic search, machine translation and text processing including Natural Language Processing (NLP).

Appen’s client base includes both government agencies and the world’s largest and most respected IT organizations. Our objective is to enhance our clients’ capabilities in the fields of speech and language technology by offering:

• fast-track production
• tight project management, working to strict timing, quality and productivity criteria
• specialist personnel including highly qualified linguists and computational linguists to support our customers’ internal resources, particularly in response to surge requirements
• flexibility and rapid response capability which may be difficult for larger organizations to achieve
• global coverage which may be difficult for smaller organizations to achieve
• large crowds of in-country native speakers that have passed our screening processes
• high levels of innovation
• strict client confidentiality

Appen remains fully independent of any systems provider, although we can and do enter into close strategic relationships with selected clients. We have been a principal sub-contractor on several European consortium projects, such as SpeeCon (multiple projects); SALA II (multiple projects); LILA (multiple projects); Orientel and LC-Star. Appen has also supported several DARPA and other US-funded consortium projects.

Appen Catalogue – Speech and Language Resources

Appen has a large number of licensable speech and language resources currently available and in development. Most of the 150+ languages that Appen has worked in are included in off-the-shelf offerings. Up-to-date catalogue information is available at appen.com

Licensable materials cover:

• Fully transcribed speech databases for broadcast, embedded, in-car and telephony applications
• Pronunciation lexicons to provide both general and domain specific coverage for a given language (specific categories include names, places, natural numbers)
• Part-Of-Speech tagged lexicons and Thesauri to support a wide range of Speech and Language Technology development activities
• Corpora annotated for Part-Of-Speech, Morphological Information, Named Entities
• Parallel Corpora for use in the development of Machine Translation

Appen’s licensable Speech and Language resources offer wide coverage of less commonly taught languages, including languages and dialects of West and North Asia, the Middle East and Africa.

In many cases, licensable resources can be developed on request to meet a particular client’s requirements.


Transcription and Annotation

We use AppenScribe, our proprietary web-based transcription interface, to deliver high-volume, high-quality data transcription and annotation to you.

Whether you are working with speech, text, video or handwriting, AppenScribe supports a large number of languages in native orthography.

Our transcription and annotation services include:

• orthographic transcription
• acoustic event transcription
• phonetic and phonemic transcription
• semantic annotation and Named Entity tagging
• annotation of handwriting and other language data
• TTS evaluation through the provision of MOS scores
• time alignment of transcription and acoustic signal

While we have experience in processing millions of US English utterances in a matter of weeks, we are equally practiced in languages such as Somali, which lack a standardized written form.

We ensure the highest quality of work through:

• screening and training of in-country transcribers
• automated spelling checks
• rigorous post-processing by senior team members

Appen Catalogue – Speech and Language Resources

If you need immediate access to a complete speech and language database, Appen has a long list of licensable resources available. See www.appen.com for our latest catalogue.

Our high-quality licensable materials cover:

• fully transcribed speech databases for broadcast, call center, in-car and telephony applications
• pronunciation lexicons, both general and domain-specific (e.g. names, places, natural numbers)
• POS-tagged lexicons and thesauri
• corpora annotated for POS, morphological information and named entities
• parallel corpora for use in the development of machine translation

Appen’s databases also cover less resourced languages, including dialects of West and North Asia, the Middle East and Africa.


Human Resourcing and Crowdsourcing

We offer you the collective expertise of our premier network of freelance consultants around the globe, currently covering over 60 languages.

Appen’s team of over 1,000 highly qualified consultants includes:

• linguists, phoneticians and lexicographers
• language specialists with backgrounds in translation, localization, terminology, education and library sciences
• data annotators with experience in Internet research and search evaluation

Among the key benefits we offer to you:

• specialized resources for custom linguistic consulting
• resource pools of language specialists in over 60 countries
• large-scale recruiting and training for rapid market expansion
• on-demand staffing to respond to urgent project changes

Contact us directly for additional information and project-specific enquiries.

Search Relevance Evaluation

Appen’s highly trained evaluation teams maximize the relevance of your search engine in over twenty local markets around the world.

Our in-country search experts each review hundreds of queries daily, ranking results for relevance to user input.

Our teams are familiar with search trends, popular and obscure topics, and the linguistic nuances of your search engine’s target users.

In addition to general-purpose search, we also specialize in vertical categories, including:

• local
• news
• medical
• travel
• finance
• shopping
• social

We also provide you with valuable testing of search features, such as:

• spam filtering
• related query suggestion
• duplicate removal
• business listing verification
• caption generation


Languages covered

The list of languages in which Appen works is continually expanding, and includes:

• Afrikaans
• Arabic (15+ varieties)
• Assamese
• Bahasa Indonesia
• Bahasa Malaysia
• Bakhtiari (Iran)
• Basque
• Bengali
• Bulgarian
• Cantonese (China PRC, China Hong Kong)
• Catalan
• Croatian
• Czech
• Danish
• Dari
• Dutch (Netherlands, Belgium)
• English (10+ varieties)
• Estonian
• Farsi
• Finnish
• French (5 varieties)
• Gallego (Galician)
• German (Austrian, German, Luxembourg, Swiss)
• Greek
• Gujarati
• Haitian Creole
• Hausa
• Hebrew
• Hindi
• Hungarian
• Italian
• Japanese
• Kannada
• Kermanji (Iran)
• Korean (North, South)
• Kurdish (Sorani)
• Laki (Iran)
• Latvian
• Lithuanian
• Luri (Iran)
• Malayalam
• Malagasy
• Mandarin (China, Taiwan)
• Marathi
• Mazanderani (Iran)
• Min
• Norwegian (Nynorsk, Bokmal)
• Oriya
• Pashto
• Polish
• Portuguese (Brazilian, European)
• Romanian
• Russian
• Serbian
• Slovak
• Slovenian
• Somali
• Spanish (15+ varieties)
• Swedish
• Sylheti
• Tagalog
• Tamil
• Telugu
• Thai
• Turkish
• Ukrainian
• Urdu
• Vietnamese
• Wu
• Xiang

Capability for additional languages can, on request, be developed rapidly.

Database - Summary

Language | DB Name | Database Type | Speakers | Sampling (kHz) | Audio Hrs | Price
Arabic | CGA_ASR001 | Microphone, Scripted Speech | 150 | 16.00 | 345 | USD 20,000
Arabic (Eastern Algerian) | EAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 496 | 8.00 | 58 | USD 57,500
Arabic English | ENA_ASR001 | Conversational Telephony | 250 | 8.00 | 56 | USD 35,000
Arabic (MSA) | MSA_ASR001 | Microphone, Scripted Speech | 78 | 16.00 | 12 | EUR 3,600
Bahasa Indonesia | BAH_ASR001 | Telephony (cell and fixed), Conversational Speech | 1002 | 8.00 | 63 | USD 45,000
Bengali | BEN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 94 | USD 45,000
Bulgarian | BUL_ASR001 | Telephony (cell and fixed), Conversational Speech | 217 | 8.00 | 77 | USD 30,000
Bulgarian | BUL_ASR002 | Microphone, Scripted Speech | 77 | 16.00 | 22 | EUR 3,600
Croatian | CRO_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 79 | USD 30,000
Croatian | CRO_ASR002 | Microphone, Scripted Speech | 94 | 16.00 | 11 | EUR 3,600
Czech | CZE_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 31 | EUR 3,600
Dari | DAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 80 | USD 45,000
Dari | DAR_BRC001 | Broadcast Data | | 0.00 | 40 | USD 22,500
Dutch (Netherlands) | NLD_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 73 | USD 30,000
English (Australian) | AUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 500 | 8.00 | 94 | USD 20,000
English (Australian) | AUS_ASR002 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 120 | USD 31,500
English (Canadian) | ENC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 144 | USD 37,500
English (Indian) | ENI_ASR001 | Telephony (cell and fixed), Scripted Speech | 2358 | 8.00 | 225 | USD 45,000
Indian English | ENI_ASR002 | Conversational Telephony | 540 | 8.00 | 135 | USD 28,000
English (UK) | UKE_ASR001 | Telephony (cell and fixed), Conversational Speech | 1150 | 8.00 | 102 | USD 45,000
English (UK) | UKE_ASR002 | Voicemail Telephony, Spontaneous Speech | 592 | 8.00 | 69 | USD 37,500
English (US) | USE_ASR001 | Microphone, Scripted Speech | 200 | 48.00 | 124 | USD 15,000
English (US) | USE_ASR002 | Telephony (cell and fixed), Conversational Speech | 20 | 8.00 | 14 | USD 7,500
Farsi/Persian | FAR_ASR001 | Telephony (cell and fixed), Scripted Speech | 789 | 8.00 | 85 | USD 45,000
Farsi/Persian | FAR_ASR002 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 61 | USD 57,500
Filipino English | ENF_ASR001 | Conversational Telephony | 450 | 8.00 | 107 | USD 35,000
French (Canadian) | FRC_ASR001 | Telephony (cell and fixed), Scripted Speech | 1000 | 8.00 | 131 | USD 37,500
French (Canadian) | FRC_ASR002 | Microphone, Scripted Speech | 120 | 16.00 | 46 | USD 22,500
French (Canadian) | FRC_ASR003 | Telephony (cell and fixed), Conversational Speech | 251 | 8.00 | 20 | USD 31,500
French (European) | FRF_ASR001 | Telephony (cell and fixed), Conversational Speech | 563 | 8.00 | 50 | USD 31,500
French (European) | FRF_ASR002 | Voicemail Telephony, Spontaneous Speech | 560 | 8.00 | 95 | USD 37,500
French (European) | FRF_ASR003 | Microphone, Scripted Speech | 98 | 16.00 | 26 | EUR 3,600
German | DEU_ASR001 | Microphone, Scripted Speech | 127 | 16.00 | 33 | USD 11,500
German | DEU_ASR002 | Voicemail Telephony, Spontaneous Speech | 890 | 8.00 | 65 | USD 37,500
German | DEU_ASR003 | Microphone, Scripted Speech | 77 | 16.00 | 25 | EUR 3,600
Hausa | HAU_ASR001 | Microphone, Scripted Speech | 103 | 16.00 | 20 | EUR 3,600
Hausa | HAU_ASR002 | Telephony (cell), Conversational Speech | 200 | 8.00 | 66 | USD 40,000
Hebrew | HEB_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 69 | USD 30,000
Hindi | HIN_ASR001 | Telephony (cell), Scripted Speech | 1920 | 8.00 | 224 | USD 45,000
Hindi | HIN_ASR002 | Telephony (cell and fixed), Conversational Speech | 996 | 8.00 | 65 | USD 45,000
Italian | ITA_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 177 | USD 12,500
Italian | ITA_ASR002 | Microphone, Scripted Speech | 103 | 48.00 | 189 | USD 19,500
Italian | ITA_ASR003 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Italian | ITA_ASR004 | Voicemail Telephony, Spontaneous Speech | 550 | 8.00 | 123 | USD 37,500
Italian | ITA_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 3 | USD 11,500
Japanese | JPN_ASR001 | Microphone, Scripted Speech | 144 | 16.00 | 33 | EUR 3,600
Kannada | KAN_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Korean | KOR_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 20 | EUR 3,600
Mandarin | MAC_ASR001 | Telephony (cell), Mixed environments | 2000 | 8.00 | 115 | USD 45,000
Mandarin | MAC_ASR002 | Microphone, Scripted Speech | 132 | 16.00 | 26 | EUR 3,600
Marathi | MAR_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 30 | USD 45,000
Pashto | PAS_ASR001 | Telephony (cell and fixed), Conversational Speech | 967 | 8.00 | 111 | USD 65,000
Pashto | PAS_ASR002 | Conversational microphone data | 40 | 16.00 | 80 | USD 75,000
Pashto | PAS_BRC001 | Broadcast Data | | 0.00 | 51 | USD 22,500
Polish | POL_ASR001 | Microphone, Scripted Speech | 99 | 16.00 | 25 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR001 | Microphone, Scripted Speech | 102 | 16.00 | 26 | EUR 3,600
Portuguese (Brazilian) | PTB_ASR002 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 66 | USD 35,000
Portuguese (European) | PTP_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 72 | USD 30,000
Romanian | ROM_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 74 | USD 30,000
Russian | RUS_ASR002 | Microphone, Scripted Speech | 115 | 16.00 | 31 | EUR 3,600
Somali | SOM_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 101 | USD 65,000
Sorani (Kurdish) | SOR_ASR001 | Telephony (cell and fixed), Conversational Speech | 170 | 8.00 | 11 | USD 30,000
Spanish (European) | ESP_ASR001 | Microphone, Scripted Speech | 200 | 22.05 | 159 | USD 12,500
Spanish (European) | ESP_ASR002 | Voicemail Telephony, Spontaneous Speech | 512 | 8.00 | 97 | USD 37,500
Spanish (European) | ESP_TTS001 | Microphone, Scripted Speech | 1 | 22.05 | 1 | USD 6,000
Spanish (Latin America) | ESL_ASR001 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Swedish | SWE_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 30 | EUR 3,600
Thai | THA_ASR001 | Microphone, Scripted Speech | 98 | 16.00 | 35 | EUR 3,600
Turkish | TUR_ASR001 | Telephony (cell and fixed), Conversational Speech | 200 | 8.00 | 83 | USD 30,000
Turkish | TUR_ASR002 | Microphone, Scripted Speech | 100 | 16.00 | 17 | EUR 3,600
Urdu | URD_ASR001 | Telephony (cell and fixed), Conversational Speech | 1000 | 8.00 | 95 | USD 45,000
Vietnamese | VIE_ASR001 | Microphone, Scripted Speech | 129 | 16.00 | 47 | EUR 3,600
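As a rough illustration of how the summary figures can be compared, the sketch below computes list price per audio hour for three rows copied from the summary above. The `Database` class and its `price_per_hour` method are hypothetical names introduced here for illustration only, and no currency conversion between USD and EUR is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    """One row of the summary table (field names are illustrative)."""
    name: str
    speakers: int
    sampling_khz: float
    audio_hours: int
    currency: str
    list_price: int

    def price_per_hour(self) -> float:
        # List price divided by the listed audio hours, in the row's own currency.
        return self.list_price / self.audio_hours


# Three rows copied from the summary table.
ROWS = [
    Database("CGA_ASR001", 150, 16.0, 345, "USD", 20_000),
    Database("EAR_ASR001", 496, 8.0, 58, "USD", 57_500),
    Database("MSA_ASR001", 78, 16.0, 12, "EUR", 3_600),
]

for db in ROWS:
    print(f"{db.name}: {db.currency} {db.price_per_hour():,.2f} per audio hour")
```

Prices in different currencies are reported separately on purpose; comparing them would need an exchange rate that the catalog does not provide.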


Databases

Language Arabic

DB Name CGA_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 150

Prompts per speaker 280

Total utterances/Entries 42,000

Audio Hours 345

Sampling rate - kHz 16.00

Recording channels 4

List Price USD 20,000

Brief Description

• This is a 150 speaker microphone recorded database

Language Materials

• Each script elicits approximately 30 minutes of recorded speech

• Each Script includes:

o 30 Person names (first name and family name) from a set of 150

o 10 single isolated digits 0-9

o 10 8-digit sequences (randomly generated)

o 200 Phonetically balanced sentences

o 30 10-word phonetically balanced word strings
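The per-script counts listed above can be cross-checked against the "Prompts per speaker" and "Total utterances/Entries" fields of this entry. A minimal sketch of that arithmetic follows; the dictionary keys are labels chosen here for illustration.

```python
# Prompt counts per script, as listed for CGA_ASR001.
script_items = {
    "person_names": 30,
    "isolated_digits": 10,
    "eight_digit_sequences": 10,
    "phonetically_balanced_sentences": 200,
    "ten_word_balanced_strings": 30,
}
speakers = 150

# Each speaker reads one script, so the prompts per speaker is the script total.
prompts_per_speaker = sum(script_items.values())
total_utterances = prompts_per_speaker * speakers

print(prompts_per_speaker)  # 280
print(total_utterances)     # 42000
```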

Demographics

• 50% of speakers are from the United Arab Emirates

• 50% of speakers are from Saudi Arabia

Transcriptions

• Complete transcriptions of the content of the speech files at a word level

• All acoustic events have been tagged using conventions derived from the SpeechDAT model

• All transcriptions fully vowelized

Contact Appen for further information


Language Arabic (Eastern Algerian)

DB Name EAR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Home/office

Speakers 496

Prompts per speaker

Total utterances/Entries

Audio Hours 58

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 57,500

Brief Description

• This is a 496 speaker conversational** telephony database

• Approximately 29 hours of conversation data (equivalent to 58 hours of single channel audio)

• Broad distribution of age, gender and dialects (Algiers and Constantine)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed
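The "single channel audio" figure quoted for two-channel telephony databases is roughly twice the conversation length, presumably because each side of a call sits on its own recording channel. The sketch below checks that relationship against figures quoted in this catalog. The helper name `single_channel_hours` is an assumption made for illustration, and the one-hour tolerance reflects that the catalog figures are approximate.

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Convert conversation length to equivalent single-channel audio hours."""
    return conversation_hours * channels


# database name -> (conversation hours, listed single-channel hours)
listed = {
    "EAR_ASR001": (29, 58),
    "ENA_ASR001": (28, 56),
    "BAH_ASR001": (31, 63),
    "BEN_ASR001": (47, 94),
}

for name, (conversation, single) in listed.items():
    estimate = single_channel_hours(conversation)
    # The listed figures are rounded, so allow a difference of up to one hour.
    assert abs(estimate - single) <= 1
    print(name, estimate, single)
```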

Contact Appen for further information


Language Arabic English

DB Name ENA_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Unique speakers 250

Average call length 10-15 minutes

Total utterances/Entries N/A

Audio Hours 56

Sampling rate – kHz 8.00

Recording channels 2

List Price USD 35,000

Brief Description

• 115 telephony conversations are recorded for this project

• Demographic information is as follows:

o Roughly equal distribution of male and female

o Broad range of ages from 18 years – 55 years

o Approximately 50% landline/50% mobile

o Speakers speak on a range of generic topics

o Roughly equal distribution of Levantine Arabic and Egyptian Arabic speakers

• Approximately 28 hours of conversation data (equivalent to 56 hours of single channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Arabic (MSA)

DB Name MSA_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Low background noise

Speakers 78

Prompts per speaker

Total utterances/Entries 4,908

Audio Hours 12

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 78 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)

Contact Appen for further information


Language Bahasa Indonesia

DB Name BAH_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 1,002

Prompts per speaker

Total utterances/Entries

Audio Hours 63

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 1,002 speaker conversational** telephony database

• Approximately 31 hours of conversation data (equivalent to 63 hours of single channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For a large proportion of calls, only one half of the conversation was collected and transcribed

Contact Appen for further information


Language Bengali

DB Name BEN_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 1,000

Prompts per speaker

Total utterances/Entries

Audio Hours 94

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 1,000 speaker conversational telephony database

• Approximately 47 hours of conversation data (equivalent to 94 hours of single channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Bulgarian

DB Name BUL_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Home/office

Speakers 217

Prompts per speaker

Total utterances/Entries

Audio Hours 77

Sampling rate - kHz 8.00

Recording channels 2

List Price EUR 3,600

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 38 hours of conversation data (equivalent to 77 hours of single channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Bulgarian

DB Name BUL_ASR002

DB type 1 ASR

DB type 2 Microphone

Environments Low background noise

Speakers 77

Prompts per speaker

Total utterances/Entries 8,674

Audio Hours 22

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 77 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)

Contact Appen for further information


Language Croatian

DB Name CRO_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Home/office

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 79

Sampling rate - kHz 8.00

Recording channels 2

List Price EUR 3,600

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 39 hours of conversation data (equivalent to 79 hours of single channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Croatian

DB Name CRO_ASR002

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 94

Prompts per speaker

Audio Hours 11

Total utterances/Entries 4,499

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 94 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)

Contact Appen for further information


Language Czech

DB Name CZE_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Low background noise

Speakers 102

Prompts per speaker

Total utterances/Entries 12,425

Audio Hours 31

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 102 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)

Contact Appen for further information


Language Dari

DB Name DAR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 500

Prompts per speaker

Total utterances/Entries

Audio Hours 80

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 500 speaker conversational telephony database

• Approximately 40 hours of conversation data (equivalent to 80 hours of single channel

audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

• Telephony Distribution

o Landline 13%

o Mobile 87%
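The "equivalent single channel" figure for the conversational databases appears to count each side of a two-channel call separately. A minimal sketch of that conversion, assuming both channels run for the full length of each conversation (the function name is illustrative):

```python
def single_channel_hours(conversation_hours, channels=2):
    """Convert hours of conversation into hours of single-channel audio.

    Assumes every channel spans the whole conversation.
    """
    return conversation_hours * channels


# DAR_ASR001: about 40 hours of conversation on 2 channels
print(single_channel_hours(40))
```

Other entries quote figures such as 36 and 73 hours; the small difference is consistent with the conversation hours being rounded.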

Contact Appen for further information

Databases

Language Dari

DB Name DAR_BRC001

DB type 1 Broadcast

DB type 2 Broadcast Data

Environments Broadcast Data

Speakers

Prompts per speaker

Total utterances/Entries

Audio Hours 40

Sampling rate - kHz 0.00

Recording channels 1

List Price USD 22,500

Brief Description

• Database contains 40 hours of Dari broadcast data

• Database is largely speech only and does not include music or advertisements

• Data types include:

o Talk shows

o Interviews

o News broadcasts (excluding news reading by anchors)

• Database is fully transcribed and timestamped

Contact Appen for further information

Databases

Language Dutch (Netherlands)

DB Name NLD_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 73

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 36 hours of conversation data (equivalent to 73 hours of single channel

audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words
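The call design described above can be checked with simple arithmetic. This sketch assumes that callers and call receivers are separate groups of people, which the 200 speaker total suggests but the catalog does not state outright; all names are illustrative.

```python
def call_design(callers, calls_per_caller, receivers):
    """Return (conversations, speakers) for a caller/receiver collection design.

    Assumes callers and receivers do not overlap.
    """
    conversations = callers * calls_per_caller
    speakers = callers + receivers
    return conversations, speakers


# NLD_ASR001: 100 callers, 2 calls each, pool of 100 call receivers
conversations, speakers = call_design(100, 2, 100)
print(conversations, speakers)
```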

Contact Appen for further information

Databases

Language English (Australian)

DB Name AUS_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Home/office

Speakers 500

Prompts per speaker 165

Total utterances/Entries 82,500

Audio Hours 94

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 20,000

Brief Description

• This is a 500 speaker telephony database

• 500 speakers (including some migrant representation: Asian (predominantly Chinese), Middle Eastern (predominantly Lebanese), and New Zealand accented English)

• 165 prompts (read speech) per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items (from a set of 215)

o Phonetically rich Sentences and Words

• Mobile 50%, fixed line 50%

• Age and Gender balanced

• Moderately quiet environments (home/office)

• Total audio length: 94 hours

• Fully transcribed to SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon containing all transcribed words
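For prompted databases such as this one, the total utterance count in the table is the number of speakers multiplied by the prompts per speaker. A small sketch of that check, with an illustrative function name and assuming every speaker completed the full script:

```python
def expected_utterances(speakers, prompts_per_speaker):
    """Total utterances if every speaker records every prompt."""
    return speakers * prompts_per_speaker


# AUS_ASR001: 500 speakers, 165 prompts per speaker
total = expected_utterances(500, 165)
print(total)
```

The same check can be applied to other prompted entries in the catalog; a mismatch would suggest incomplete sessions or a different counting basis.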

Contact Appen for further information

Databases

Language English (Australian)

DB Name AUS_ASR002

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker 75

Total utterances/Entries 75,000

Audio Hours 120

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 31,500

Brief Description

• This is a 1,000 speaker Australian English database

• 75 prompts per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items

o Phonetically rich Sentences and Words

• The prompts are a mixture of 'read' and 'elicited' items. 5 prompts per script are 'spontaneous free speech'

• Mixture of mobile and landline

• Age and Gender balanced

• Total audio length: 120 hours

• Fully transcribed to SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language English (Canadian)

DB Name ENC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker 99

Total utterances/Entries 99,000

Audio Hours 144

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 37,500

Brief Description

• This is an extended SALA II database.

• 49 prompts per speaker are as specified by the SALA II consortium. An additional 50

prompts (similar content) were recorded by each speaker.

• 99 prompts per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items

o Phonetically rich Sentences and Words

• Mobile telephony recorded in a range of environments including in-car, home/office,

roadside and other public place

• Fully transcribed to SALA II/SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language English (Indian)

DB Name ENI_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 2,358

Prompts per speaker 50

Total utterances/Entries 117,900

Audio Hours 225

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 45,000

Brief Description

• This is a 2,358 speaker Indian English mobile telephony speech database recorded on

location in India

• Database Type

o Medium Level background noise - in-car, home/office, roadside and other public

place type environments

o Total audio length - Approximately 225 hours

• Demographics

o 2,358 speakers recorded in India

o 50% male, 50% female

o Broad distribution of age groups (16-60 years) and dialects

• Language Materials

o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place, and

Business Names; Confirmation items (yes, no + fuzzy); Generic Command and

Control items and Phonetically rich Sentences and Words

• Transcription and Lexicon

o Fully transcribed to SpeechDAT type conventions.

o Database is accompanied by a pronunciation lexicon [SAMPA] containing all

transcribed words.

o Lexicon - 10,128 unique headwords

Contact Appen for further information

Databases

Language Indian English

DB Name ENI_ASR002

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Unique speakers 540

Average call length 10-15 minutes

Total utterances/Entries N/A

Audio Hours 135

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 28,000

Brief Description

• 271 telephony conversations are recorded for this project

• Demographic information is as follows:

o Roughly equal distribution of male and female

o Broad range of ages from 16 years – 60 years

o Dialect distribution:

Eastern India 10%

Northern India 35%

Pakistan 15%

Southern India 20%

Western India 19%

o Approximately 50% landline/50% mobile

o Speakers speak on a range of generic topics

• Approximately 67 hours of conversation data (equivalent to 135 hours of single channel

audio).

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language English (UK)

DB Name UKE_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 1,150

Prompts per speaker

Total utterances/Entries

Audio Hours 102

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 1,150 speaker conversational telephony database

• Provides good coverage of key accents across the UK and Ireland

• Approximately 51 hours of conversation data (equivalent to 102 hours of single channel

audio).

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

• Note: additional data available - please contact Appen for more details

Contact Appen for further information

Databases

Language English (UK)

DB Name UKE_ASR002

DB type 1 ASR

DB type 2 Voicemail Telephony

Environments Low background noise

Speakers 592

Prompts per speaker

Total utterances/Entries

Audio Hours 69

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 37,500

Brief Description

• This is a 592 speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across the United Kingdom

• Approximately 69 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected

from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your

colleague that you are running late for a meeting)

• The database is fully transcribed and is accompanied by a pronunciation lexicon

containing all transcribed words

Contact Appen for further information

Databases

Language English (US)

DB Name USE_ASR001

DB type 1 ASR

DB type 2 Studio/microphone recordings

Environments Studio

Speakers 200

Prompts per speaker 400

Total utterances/Entries 80,000

Audio Hours 124

Sampling rate - kHz 48.00

Recording channels 2

List Price USD 15,000

Brief Description

• This is a 200 speaker microphone recorded database

• Each speaker read 400 prompts including:

o Digits

o Natural Numbers

o Personal and City names

o Telephone Numbers

o Generic Command and Control items

o Phonetically rich Sentences and Words

• All speakers were recorded in a studio type environment in USA

• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all

transcribed words

Contact Appen for further information

Databases

Language English (US)

DB Name USE_ASR002

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 20

Prompts per speaker

Total utterances/Entries

Audio Hours 14

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 7,500

Brief Description

• This is a 20 speaker conversational telephony database

• Call-Centre style conversations

• Approximately 7 hours of conversation data in total

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language Farsi/Persian

DB Name FAR_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 789

Prompts per speaker 48

Audio Hours 85

Total utterances/Entries 38,400

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 45,000

Brief Description

• This is a 789 speaker Farsi telephony speech database recorded on location in Iran.

• 50% male, 50% female

• Broad distribution of age groups (16-60 years) and dialects

• Medium Level background noise - in-vehicle, home/office, roadside and other public place

type environments

• Language Materials

o 48 prompts per speaker, including Digits; Natural Numbers; Letter strings;

Personal, Place, and Business names; Confirmation items (yes and no); Generic

Command and Control items and Phonetically Rich sentences and words

• Transcriptions

o Fully transcribed to OrienTel type conventions

• Lexicon

o Database is accompanied by a pronunciation lexicon [SAMPA] containing all

transcribed words

• Total audio length - Approximately 85 hours

Contact Appen for further information

Databases

Language Farsi/Persian

DB Name FAR_ASR002

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker

Total utterances/Entries

Audio Hours 61

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 57,500

Brief Description

• This is a 1,000 speaker conversational telephony database

• Approximately 30 hours of conversation data (equivalent to 61 hours of single channel

audio)

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language Filipino English

DB Name ENF_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Unique speakers 450

Average call length 10-15 minutes

Total utterances/Entries N/A

Audio Hours 107

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 35,000

Brief Description

• 216 telephony conversations are recorded for this project

• Demographic information is as follows:

o Roughly equal distribution of male and female

o Broad range of ages from 18 years – 70 years

o Approximately 50% landline/50% mobile

o Speakers speak on a range of generic topics

• Approximately 53 hours of conversation data (equivalent to 107 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language French (Canadian)

DB Name FRC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker 100

Total utterances/Entries 100,000

Audio Hours 131

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 37,500

Brief Description

• This is an extended SALA II database

• 48 prompts per speaker are as specified by the SALA II consortium. An additional 52

prompts (similar content) were recorded by each speaker

• 100 prompts per speaker, including:

o Digits

o Natural Numbers

o Letter strings

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items

o Phonetically rich Sentences and Words

• Mobile telephony recorded in a range of environments including in-car, home/office,

roadside and other public place

• Total audio length: 131 hours

• Fully transcribed to SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon [SAMPA] containing all

transcribed words

Contact Appen for further information

Databases

Language French (Canadian)

DB Name FRC_ASR002

DB type 1 ASR

DB type 2 Microphone recordings

Environments Home/office

Speakers 120

Prompts per speaker 150

Total utterances/Entries 22,500

Audio Hours 46

Sampling rate - kHz 16.00

Recording channels 1

List Price USD 22,500

Brief Description

• This is a 120 speaker microphone recorded database

• Scripts include:

o Person names

o Digits

o Digit strings (randomly generated)

o Addresses

o Phonetically rich sentences

• Dialects

o 50% Quebecois – Montreal

o 50% Quebecois – Other

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language French (Canadian)

DB Name FRC_ASR003

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 251

Prompts per speaker

Total utterances/Entries

Audio Hours 20

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 31,500

Brief Description

• This is a 251 speaker conversational** telephony database

• Approximately 10 hours of conversation data (equivalent to 20 hours of single channel

audio)

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a

small number of calls, only one half of the conversation was collected and transcribed

Contact Appen for further information

Databases

Language French (European)

DB Name FRF_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 563

Prompts per speaker

Total utterances/Entries

Audio Hours 50

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 31,500

Brief Description

• This is a 563 speaker conversational** telephony database

• Approximately 25 hours of conversation data (equivalent to 50 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a

smaller number of calls, only one half of the conversation was collected and transcribed

Contact Appen for further information

Databases

Language French (European)

DB Name FRF_ASR002

DB type 1 ASR

DB type 2 Voicemail Telephony

Environments Low background noise

Speakers 560

Prompts per speaker

Total utterances/Entries

Audio Hours 95

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 37,500

Brief Description

• This is a 560 speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across France

• Approximately 47 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected

from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your

colleague that you are running late for a meeting)

• The database is fully transcribed and is accompanied by a pronunciation lexicon

containing all transcribed words

Contact Appen for further information

Databases

Language French (European)

DB Name FRF_ASR003

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 98

Prompts per speaker

Total utterances/Entries 10,273

Audio Hours 26

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 98 speaker microphone recorded database, GlobalPhone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information

Databases

Language German

DB Name DEU_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Studio

Speakers 127

Prompts per speaker 100

Total utterances/Entries 12,700

Audio Hours 33

Sampling rate - kHz 16.00

Recording channels 2

List Price USD 11,500

Brief Description

• This is a 127 speaker microphone recorded database

• Each speaker read 100 prompts including:

o Digits

o Natural Numbers

o Personal and City names

o Telephone Numbers

o Generic Command and Control items

o Phonetically rich Sentences and Words

• All speakers were recorded in a studio type environment in Germany

• Database is fully transcribed and is accompanied by a pronunciation lexicon containing all

transcribed words

Contact Appen for further information

Databases

Language German

DB Name DEU_ASR002

DB type 1 ASR

DB type 2 Voicemail Telephony

Environments Low background noise

Speakers 890

Prompts per speaker

Total utterances/Entries

Audio Hours 65

Sampling rate - kHz 8.00

Recording channels

List Price USD 37,500

Brief Description

• This is an 890 speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across Germany

• Approximately 65 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected

from a pool of approx. 50 common voicemail scenarios (e.g. Leave a message to tell your

colleague that you are running late for a meeting)

• The database is fully transcribed and is accompanied by a pronunciation lexicon

containing all transcribed words

Contact Appen for further information

Databases

Language German

DB Name DEU_ASR003

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 77

Prompts per speaker

Total utterances/Entries 10,085

Audio Hours 25

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 77 speaker microphone recorded database, GlobalPhone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information

Databases

Language Hausa

DB Name HAU_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 103

Prompts per speaker

Total utterances/Entries 7,895

Audio Hours 20

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 103 speaker microphone recorded database, GlobalPhone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information

Databases

Language Hausa

DB Name HAU_ASR002

DB type 1 ASR

DB type 2 Conversational telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 66

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 40,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each, to a pool of 100 call receivers

• Approximately 33 hours of conversation data (equivalent to 66 hours of single channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language Hebrew

DB Name HEB_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 69

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 34 hours of conversation data (equivalent to 69 hours of single channel audio)

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information

Databases

Language Hindi

DB Name HIN_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Low background noise

Speakers 1,920

Prompts per speaker 50

Total utterances/Entries 96,000

Audio Hours 224

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 45,000

Brief Description

• This is a 1,920 speaker Hindi mobile telephony speech database recorded on location in India. All 1,920 speakers use Hindi as a second language (i.e. they are native speakers of Telugu, Gujarati, etc.)

• Database Type

o 1,920 speakers recorded in India

o 50% male, 50% female

o Broad distribution of age groups (16-60 years) and dialects

o Medium Level background noise - in-car, home/office, roadside and other public

place type environments

• Language Materials

o 50 prompts per speaker, including Digits; Natural Numbers; Personal, Place and

Business names; Confirmation items (yes, no + fuzzy); Generic Command and

Control items; Phonetically rich Sentences and Words; and Web addresses

• Transcriptions

o Fully transcribed to SpeechDAT type conventions

• Lexicon

o Database is accompanied by a pronunciation lexicon [SAMPA] containing all

transcribed words

o Lexicon - 9,853 unique headwords

o Total audio length - Approximately 224 hours

Contact Appen for further information

Databases

Language Hindi

DB Name HIN_ASR002

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Mixed

Speakers 996

Prompts per speaker

Total utterances/Entries

Audio Hours 65

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 996 speaker conversational** telephony database

• Approximately 65 hours of conversation data (equivalent to 60 hours of single

channel audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed.

For a smaller number of calls, only one half of the conversation was collected and

transcribed

Contact Appen for further information

Databases

Language Italian

DB Name ITA_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Mixed

Speakers 200

Prompts per speaker 200

Total utterances/Entries 40,000

Audio Hours 177

Sampling rate - kHz 22.05

Recording channels 4

List Price USD 12,500

Brief Description

• This is a 200 speaker microphone recorded database

• Each speaker read 200 utterances:

o 100 - command and control type items

o 100 - phonetically rich sentences

• Fully transcribed to SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon containing all transcribed words

• Lexicon - 7,316 unique headwords

• Total audio length - 177 hours

Contact Appen for further information


Language Italian

DB Name ITA_ASR002

DB type 1 ASR

DB type 2 Microphone

Environments In-Car

Speakers 103

Prompts per speaker 350

Total utterances/Entries 35,875

Audio Hours 189

Sampling rate - kHz 48.00

Recording channels 4

List Price USD 19,500

Brief Description

• This is a 205 session In-Car database

• Each speaker recorded 1 or 2 sessions:

o Session 1 in a parked vehicle with the engine running

o Session 2 in a vehicle travelling at 60 mph (100 km/h).

• 350 prompts were read by each speaker (175 per session), including:

o Digits

o Street names

o Generic Command and Control items

o Phonetically rich Sentences and Words

• Fully transcribed to SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information
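The ITA_ASR002 figures above can be cross-checked with simple arithmetic. The sketch below assumes, as the description states, that every speaker recorded either one or two sessions of 175 prompts each. The function and variable names are illustrative and not part of the catalog.

```python
def session_breakdown(speakers: int, sessions: int) -> tuple:
    """Return (speakers with one session, speakers with two sessions).

    Assumes every speaker recorded exactly 1 or 2 sessions.
    """
    two_sessions = sessions - speakers
    one_session = speakers - two_sessions
    if one_session < 0 or two_sessions < 0:
        raise ValueError("counts are inconsistent with 1 or 2 sessions per speaker")
    return one_session, two_sessions


speakers = 103             # "Speakers" field
sessions = 205             # "205 session In-Car database"
prompts_per_session = 175  # "175 per session"

one_session, two_sessions = session_breakdown(speakers, sessions)
total_utterances = sessions * prompts_per_session

print(one_session, two_sessions)  # 1 102
print(total_utterances)           # 35875
```

Under this assumption, 205 sessions of 175 prompts give the listed total of 35,875 utterances, and the 350 prompts per speaker figure applies to speakers who completed both sessions.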


Language Italian

DB Name ITA_ASR003

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 72

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 36 hours of conversation data (equivalent to 72 hours of single channel

audio)

• Database is fully transcribed and timestamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information
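The relationship between conversation hours and single-channel hours quoted above follows from the two recording channels: each hour of conversation yields one hour of audio per channel. A minimal sketch using the ITA_ASR003 figures, with hypothetical function names:

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Hours of audio when each recording channel is counted separately."""
    return conversation_hours * channels


def average_call_minutes(conversation_hours: float, calls: int) -> float:
    """Mean call length in minutes, assuming the hours are spread over all calls."""
    return conversation_hours * 60 / calls


print(single_channel_hours(36))       # 72
print(average_call_minutes(36, 200))  # 10.8
```

Because the catalog's hour counts are approximate, the average call length is indicative only.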


Language Italian

DB Name ITA_ASR004

DB type 1 ASR

DB type 2 Voicemail Telephony

Environments Low background noise

Speakers 550

Prompts per speaker

Total utterances/Entries

Audio Hours 123

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 37,500

Brief Description

• This is a 550 speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across Italy

• Approximately 123 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected

from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your

colleague that you are running late for a meeting)

• The database is fully transcribed and is accompanied by a pronunciation lexicon

containing all transcribed words

Contact Appen for further information


Language Italian

DB Name ITA_TTS001

DB type 1 TTS

DB type 2 Microphone

Environments Studio

Speakers 1

Prompts per speaker 3,300

Total utterances/Entries 3,300

Audio Hours 3

Sampling rate - kHz 22.05

Recording channels 1

List Price USD 11,500

Brief Description

• This is a single speaker TTS speech database. The database comprises 3,300 phonetically

rich sentences recorded by a male Italian speaker in a studio environment. The database is

accompanied by a pronunciation lexicon containing an entry for each of the words spoken

in the database

Contact Appen for further information


Language Japanese

DB Name JPN_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 144

Prompts per speaker

Total utterances/Entries 13,067

Audio Hours 33

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 144 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Kannada

DB Name KAN_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker

Total utterances/Entries

Audio Hours 30

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 1,000 speaker conversational telephony database

• Approximately 30 hours of conversation data (equivalent to 60 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Korean

DB Name KOR_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 100

Prompts per speaker

Total utterances/Entries 8,107

Audio Hours 20

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 100 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Mandarin

DB Name MAC_ASR001

DB type 1 ASR

DB type 2 Telephony

Environments Mixed

Speakers 2,000

Prompts per speaker 100

Total utterances/Entries 200,000

Audio Hours 115

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 45,000

Brief Description

• This is a 2,000 speaker Mandarin mobile telephony speech data collection

• The database comprises 2,000 Mandarin speakers recorded on location in China

• 50% male, 50% female

• 100% Mobile Telephony

• Broad distribution of age groups (16-60 years)

Language Materials

• 100 prompts per speaker, including:

o Digits

o Natural Numbers

o Personal, place, and business names

o Confirmation items (yes, no + fuzzy)

o Generic Command and Control items

o Phonetically rich Sentences and Words

Transcriptions

• Fully transcribed to SpeechDAT type conventions

Lexicon

• Database is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed

words

Contact Appen for further information
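As a rough consistency check on the MAC_ASR001 figures, the listed audio hours and utterance count imply an average utterance duration. The sketch below uses only the catalog's own numbers. The function name is illustrative, and the result includes whatever silence or padding is counted in the 115 hours.

```python
def average_utterance_seconds(audio_hours: float, utterances: int) -> float:
    """Mean duration per utterance in seconds."""
    return audio_hours * 3600 / utterances


speakers = 2_000
prompts_per_speaker = 100
total_utterances = speakers * prompts_per_speaker

print(total_utterances)                                  # 200000
print(average_utterance_seconds(115, total_utterances))  # 2.07
```

An average of about two seconds per item is plausible for a prompt set dominated by digits, names and short commands, though the catalog does not state per-item durations.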


Language Mandarin

DB Name MAC_ASR002

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 132

Prompts per speaker

Total utterances/Entries 10,225

Audio Hours 26

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 132 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Marathi

DB Name MAR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker

Total utterances/Entries

Audio Hours 30

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 1,000 speaker conversational telephony database

• Approximately 15 hours of conversation data (equivalent to 30 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Pashto

DB Name PAS_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 967

Prompts per speaker

Total utterances/Entries

Audio Hours 111

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 65,000

Brief Description

• This is a 967 speaker conversational** telephony database

• Approximately 55 hours of conversation data (equivalent to 111 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For the majority of calls, both speakers (in-line/out-line) were collected and transcribed. For a smaller number of calls, only one half of the conversation was collected and transcribed

Contact Appen for further information


Language Pashto

DB Name PAS_ASR002

DB type 1 ASR

DB type 2 Conversational microphone data

Environments Low background noise

Number of sessions 40

Average session length 120 minutes

Total utterances/Entries N/A

Audio Hours 80

Sampling rate - kHz 16.00

Recording channels 2

List Price USD 75,000

Brief Description

• Each recording consists of a number of TransTAC style dialogues (monolingual 2-way

conversations). One speaker acts as an interviewer and the other as the interviewee

• The interviewer appears in more than one set of dialogues but the interviewee is unique for

each set

• Data collection scenarios are similar to TransTAC style (e.g. civil affairs, checkpoints etc.)

• Demographic information is as follows:

o Roughly 25% female and 75% male speakers

o Broad range of ages from 18 years – 55 years

o Broad distribution across two dialect regions in Afghanistan

• 40 hours of conversation data (equivalent to 80 hours of single channel audio)

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

• A full translation of the transcripts into French is also available as an optional additional

purchase

Contact Appen for further information


Language Pashto

DB Name PAS_BRC001

DB type 1 Broadcast

DB type 2 Broadcast Data

Environments Broadcast Data

Speakers

Prompts per speaker

Total utterances/Entries

Audio Hours 51

Sampling rate - kHz 0.00

Recording channels 1

List Price USD 22,500

Brief Description

• Database contains 50 hours of Pashto broadcast data

• Database is largely speech only and does not include music or advertisements

• Data types include:

o Talk shows

o Interviews

o News broadcasts (excluding news reading by anchors)

• Database is fully transcribed and timestamped

Contact Appen for further information


Language Polish

DB Name POL_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 99

Prompts per speaker

Total utterances/Entries 10,130

Audio Hours 25

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 99 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Portuguese (Brazilian)

DB Name PTB_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 102

Prompts per speaker

Total utterances/Entries 10,417

Audio Hours 26

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 102 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Portuguese (Brazilian)

DB Name PTB_ASR002

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 66

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 35,000

Brief Description

• This is a 200 speaker conversational telephony database (some speakers participated in up to 2 calls for this project)

• Approximately 33 hours of conversation data (equivalent to 66 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Portuguese (European)

DB Name PTP_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 72

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 36 hours of conversation data (equivalent to 72 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Romanian

DB Name ROM_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 74

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project – 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 37 hours of conversation data (equivalent to 74 hours of single channel

audio)

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Russian

DB Name RUS_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 74

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 37 hours of conversation data (equivalent to 74 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Russian

DB Name RUS_ASR002

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 115

Prompts per speaker

Total utterances/Entries 12,205

Audio Hours 31

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 115 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Somali

DB Name SOM_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 1,000

Prompts per speaker

Total utterances/Entries

Audio Hours 101

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 65,000

Brief Description

• This is a 1,000 speaker conversational telephony database

• Approximately 50 hours of conversation data (equivalent to 101 hours of single channel

audio)

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Sorani (Kurdish)

DB Name SOR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 170

Prompts per speaker

Total utterances/Entries

Audio Hours 11

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 170 speaker conversational** telephony database

• Approximately 5 hours of conversation data (equivalent to 11 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

** For a large proportion of calls, only one half of the conversation was collected and transcribed

Contact Appen for further information


Language Spanish (European)

DB Name ESP_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Mixed

Speakers 200

Prompts per speaker 200

Total utterances/Entries 40,000

Audio Hours 159

Sampling rate - kHz 22.05

Recording channels 4

List Price USD 12,500

Brief Description

• This is a 200 speaker microphone recorded database

• Each speaker read 200 utterances:

o 100 - command and control type items

o 100 - phonetically rich sentences

• Fully transcribed to SpeechDAT type conventions

• Database is accompanied by a pronunciation lexicon containing all transcribed words

• Lexicon - 6,367 unique headwords

• Total audio length - 159 hours

Contact Appen for further information


Language Spanish (European)

DB Name ESP_ASR002

DB type 1 ASR

DB type 2 Voicemail Telephony

Environments Low background noise

Speakers 512

Prompts per speaker

Total utterances/Entries

Audio Hours 97

Sampling rate - kHz 8.00

Recording channels 1

List Price USD 37,500

Brief Description

• This is a 512 speaker voicemail telephony database

• Broad distribution of age, gender and landline/mobile coverage

• Provides good representation of key accents across Spain

• Approximately 97 audio hours of voicemail data

• The database covers speakers providing spontaneous voicemail type responses selected

from a pool of approx. 200 common voicemail scenarios (e.g. Leave a message to tell your

colleague that you are running late for a meeting)

• The database is fully transcribed and is accompanied by a pronunciation lexicon

containing all transcribed words

Contact Appen for further information


Language Spanish (European)

DB Name ESP_TTS001

DB type 1 TTS

DB type 2 Microphone

Environments Studio

Speakers 1

Prompts per speaker 1,787

Total utterances/Entries 1,787

Audio Hours 1

Sampling rate - kHz 22.05

Recording channels 1

List Price USD 6,000

Brief Description

• This is a single speaker TTS speech database. The database comprises 1,787 phonetically

rich sentences recorded by a male Spanish speaker in a studio environment. The database

is accompanied by a pronunciation lexicon containing an entry for each of the words

spoken in the database

Contact Appen for further information


Language Spanish (Latin America)

DB Name ESL_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 100

Prompts per speaker

Total utterances/Entries 6,898

Audio Hours 17

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 100 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Swedish

DB Name SWE_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 98

Prompts per speaker

Total utterances/Entries 11,816

Audio Hours 30

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 98 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Thai

DB Name THA_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 98

Prompts per speaker

Total utterances/Entries 14,039

Audio Hours 35

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 98 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Turkish

DB Name TUR_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Low background noise

Speakers 200

Prompts per speaker

Total utterances/Entries

Audio Hours 83

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 30,000

Brief Description

• This is a 200 speaker conversational telephony database

• 200 telephony conversations are recorded for this project - 100 speakers make 2 calls

each (1 from landline, 1 from mobile), to a pool of 100 call receivers

• Approximately 41 hours of conversation data (equivalent to 83 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information


Language Turkish

DB Name TUR_ASR002

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 100

Prompts per speaker

Total utterances/Entries 6,950

Audio Hours 17

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 100 speaker microphone recorded database, Global Phone, developed in

collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to

cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in

Romanized form (where applicable)

Contact Appen for further information


Language Urdu

DB Name URD_ASR001

DB type 1 ASR

DB type 2 Conversational Telephony

Environments Mixed

Speakers 1,000

Prompts per speaker

Total utterances/Entries

Audio Hours 95

Sampling rate - kHz 8.00

Recording channels 2

List Price USD 45,000

Brief Description

• This is a 1,000 speaker conversational telephony database recorded by native Urdu

speakers in Pakistan (700 speakers) and India (300 speakers)

• Approximately 47 hours of conversation data (equivalent to 95 hours of single channel

audio).

• Database is fully transcribed and time stamped

• Database is accompanied by a pronunciation lexicon containing all transcribed words

Contact Appen for further information
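The two hour figures quoted for conversational telephony databases are related by simple arithmetic: each conversation is recorded on 2 channels, so the single channel total is roughly twice the conversation time. A minimal sketch of that conversion follows; the function name is illustrative and is not part of any Appen tooling.

```python
def single_channel_hours(conversation_hours: float, channels: int = 2) -> float:
    """Return the total single channel audio hours for multi-channel recordings."""
    if conversation_hours < 0 or channels < 1:
        raise ValueError("hours must be non-negative and channels at least 1")
    return conversation_hours * channels


if __name__ == "__main__":
    # "Approximately 47 hours" on 2 channels gives about 94 hours, close to the
    # 95 hours quoted; the small gap is presumably due to rounding of the
    # conversation total.
    print(single_channel_hours(47))
```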


Databases

Language Vietnamese

DB Name VIE_ASR001

DB type 1 ASR

DB type 2 Microphone

Environments Home/office

Speakers 129

Prompts per speaker

Total utterances/Entries 18,842

Audio Hours 47

Sampling rate - kHz 16.00

Recording channels 1

List Price EUR 3,600

Brief Description

• This is a 129 speaker microphone recorded database, Global Phone, developed in collaboration with the Karlsruhe Institute of Technology (KIT)

• Each speaker reads a number of phonetically rich sentences

• The read texts were selected from national newspaper articles available from the web to cover a wide domain with large vocabulary

• Gender 50% male, 50% female

• Broad distribution of speakers in the age group of 18-70 years

• All speakers were recorded in a home/office type environment

• Database is fully transcribed and the transcription is available both in original script and in Romanized form (where applicable)

Contact Appen for further information


Lexica

Overview

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

• Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

• Part-of-speech tagged Lexica providing grammatical and semantic labels

• Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalisation lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages (please see language list below).

Domain Coverage

Typical domains covered in our off-the-shelf holdings for a given language include:

• General Vocabulary

• Geographical Names e.g. Place Names (City, State, Suburb)

• Numbers (0-10,000)

• Person Names (both Given and Family)

Lexica can be developed from a wordlist provided by the client or by Appen Butler Hill. If a client requires vocabulary of a specific nature or to cover a specific domain, this can typically be provided under the same license and pricing terms as our pre-existing (off-the-shelf) holdings.

Lexicon Structure

• Our Lexica are usually created using a SAMPA phone set for the language which aligns SAMPA symbols with IPA equivalents. We can convert to most other machine readable formats on request

• We also include documentation files which include phone set definitions, statistical notes about phone coverage within a given Lexicon, and may include background information on data quality and validation.

Lexica are typically delivered as text files consisting of three or four tab-delimited fields:

Field 1 - Headword

Field 2 - SAMPA pronunciation

Field 3 - Variant Rank (0 = preferred pronunciation; 1 = also heard, less common)

Field 4 - Label e.g. (FAMILY_NAME, GIVEN_NAME, COMMON_WORD…etc.)

In addition to the phonemic mark-up, our Lexica are marked up for primary and secondary stress and for syllabification where applicable. They will also include pronunciation variants where relevant.
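As an illustration of the three or four field tab-delimited layout described above, the following minimal sketch reads such a file and picks out the preferred pronunciation (Variant Rank 0) for each headword. The class and function names, and the sample rows and their transcriptions, are made up for the example and are not taken from an Appen lexicon; the exact stress and syllable notation in a delivered file may differ.

```python
import io
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, TextIO


@dataclass(frozen=True)
class LexiconEntry:
    headword: str
    sampa: str
    variant_rank: int             # 0 = preferred, 1 = also heard, less common
    label: Optional[str] = None   # Field 4 is optional


def read_lexicon(handle: TextIO) -> Iterator[LexiconEntry]:
    """Yield one entry per non-empty line of a tab-delimited lexicon file."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        # Plain split keeps SAMPA symbols such as the '"' stress mark intact.
        fields = line.split("\t")
        if len(fields) not in (3, 4):
            raise ValueError(f"line {line_no}: expected 3 or 4 fields, got {len(fields)}")
        label = fields[3] if len(fields) == 4 else None
        yield LexiconEntry(fields[0], fields[1], int(fields[2]), label)


def preferred_pronunciations(entries: Iterable[LexiconEntry]) -> Dict[str, str]:
    """Map each headword to its rank 0 pronunciation."""
    return {e.headword: e.sampa for e in entries if e.variant_rank == 0}


# Hypothetical sample rows, for illustration only.
SAMPLE = (
    'Smith\tsmIT\t0\tFAMILY_NAME\n'
    'either\t"aI.D@\t0\tCOMMON_WORD\n'
    'either\t"i:.D@\t1\tCOMMON_WORD\n'
)

if __name__ == "__main__":
    entries = list(read_lexicon(io.StringIO(SAMPLE)))
    for headword, sampa in preferred_pronunciations(entries).items():
        print(headword, sampa)
```

Splitting on tabs directly, rather than using a CSV reader with default quoting, avoids misreading the double quote character that SAMPA uses for primary stress.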

Lexicon Category   Brief Description   License Price per headword (USD)

1   Most languages using Latin based orthographies   USD 0.335

2   Languages requiring tone mark-up (e.g. Mandarin, Cantonese) and languages requiring multiple representational forms in the orthography (e.g. Japanese)   USD 0.415

3   Languages requiring full diacritization/vowelization (e.g. Arabic)   USD 0.460

Pricing for specialized Languages and Part-of-Speech Tagged Lexica can be provided on request.
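Because the license price is quoted per headword, a list-price estimate is the headword count multiplied by the category rate. The sketch below shows that arithmetic using the three rates from the table; the function name is illustrative, and the result is a list-price estimate only, since specialized languages and part-of-speech tagged lexica are priced on request.

```python
from decimal import Decimal

# Per-headword list prices in USD, by lexicon category, as given in the table.
PRICE_PER_HEADWORD_USD = {
    1: Decimal("0.335"),  # most languages using Latin based orthographies
    2: Decimal("0.415"),  # tone mark-up or multiple representational forms
    3: Decimal("0.460"),  # full diacritization/vowelization
}


def list_price_usd(category: int, headwords: int) -> Decimal:
    """Return the list price for a lexicon of the given category and size."""
    if category not in PRICE_PER_HEADWORD_USD:
        raise ValueError(f"unknown lexicon category: {category}")
    if headwords < 0:
        raise ValueError("headword count must be non-negative")
    return PRICE_PER_HEADWORD_USD[category] * headwords


if __name__ == "__main__":
    # 50,000 headwords at USD 0.335 each comes to USD 16,750.
    print(list_price_usd(1, 50_000))
```

Decimal is used rather than float so that three-decimal rates multiply exactly.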



Number of headwords

New offerings are frequently added. For holdings information in a given language or to discuss any customized development efforts, please contact:

[email protected]

appen.com

[Bar chart: number of headwords per language, horizontal axis 0 to 50,000. Languages shown: Assamese, Arabic (Algerian), Arabic (Egyptian), Arabic (Gulf), Arabic (Iraqi), Arabic (Maghrebi), Arabic (MSA), Arabic (North Levantine), Arabic (Palestinian), Arabic (South Levantine), Arabic (Syrian), Arabic (UAE), Bahasa Indonesia, Bahasa Malay, Basque, Bengali, Bulgarian, Cantonese, Catalan, Croatian, Czech, Danish, Dari, Dutch, English (Australian), English (Canadian). Bars that run past the axis carry the labels >75,000, >55,000, >100,000, >110,000, >75,000 and >70,000.]


[Bar chart: number of headwords per language, horizontal axis 0 to 50,000. Languages shown: English (Indian), English (New Zealand), English (UK), English (US), Finnish, French (Belgian), French (Canadian), French (European), French (Luxembourg), French (Switzerland), German, German (Austria), German (Switzerland), Greek, Hausa, Hebrew, Hindi, Hungarian, Italian, Japanese, Kannada, Korean, Malayalam, Mandarin, Marathi, Norwegian. Bars that run past the axis carry the labels >155,000, >85,000, >60,000, >110,000, >55,000, >190,000, >260,000, >100,000, >115,000 and >200,000.]


[Bar chart: number of headwords per language, horizontal axis 0 to 50,000. Languages shown: Oriya, Pashto, Persian/Farsi, Polish, Portuguese (Brazil), Portuguese (EU), Romanian, Russian, Serbian, Somali, Sorani (Kurdish), Spanish (All Latin America), Spanish (American - US), Spanish (EU - Castilian), Spanish (Mexican), Swahili (Kenya), Swedish, Sylheti, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Wu, Xiang. Bars that run past the axis carry the labels >250,000, >100,000, >100,000, >90,000, >115,000, >100,000, >50,000, >100,000 and >100,000.]


Other Language Resources

Apart from speech databases and lexica, Appen also has a range of other language resources available for license, which can be found in this section. These resources include:

1. Text Corpora: We have a wide variety of text collections in different languages available for license. Apart from the Vowelized Arabic Corpus, Appen also has a range of Named Entity annotated texts. These are corpora of 500,000 words of news text that have been annotated for persons, titles, quantities, geopolitical entities, locations, facilities, etc.

2. Morphological Analyzers: Our morphological analyzers are designed to generate grammatically acceptable words using tagged stem dictionaries and information on inflectional affixes and their combinations. They can manipulate text from languages with non-Latin scripts and currently generate Urdu and Persian, including informal written variants of affixes.

3. Thesaurus: Appen can undertake thesaurus development in several ways: from first principles, as an extension to existing work or as validation of an existing thesaurus, with consistency and coverage an important focus. Because each language is subtly different and requires deep grammatical analysis to produce a quality product, native speakers are always used to build a thesaurus. Appen can produce thesauri as a licensable database, supplied in a standard XML format or to client specifications.

4. Language Analysis Documentation: Appen can provide comprehensive language analysis documents under license for all languages of interest. These documents support system and application developers and include phonological features and processes, analysis of Romanization schemes (where applicable), regional and dialectal differences and population statistics of speakers. Appen can also provide analysis and recommendations on specific collections for a nominated language.


Language Analysis Documents

List Price USD 2,500 (per language)

Brief Description

The key topics that are typically covered in the language analysis document include:

• General Information about the country

• General Information about the language

• Language classification of the language

• Other Languages spoken in the country

• History of the language (where relevant) including changes due to immigration etc

• Dialects of the language

- maps indicating dialect regions

- discussion of dialects: distribution, features etc.

- recommendations on a dialect distribution that would be feasible to use in a speech data collection

• Sound System of the language

• Relevant Phonological Processes prevalent in the language/country

• Orthographic Conventions for the language

• Communications

Language   DB Name

Arabic (Iraqi)   ARB_LAN001
Arabic (North Levantine)   ARB_LAN002
Bahasa Indonesia   BAH_LAN001
Brazilian Portuguese   PTB_LAN001
Croatian   CRO_LAN001
Dari   DAR_LAN001
English (US)   ENG_LAN001
Farsi/Persian   FAR_LAN001
French (Canadian)   FRC_LAN001
German   DEU_LAN001
Hebrew   HEB_LAN001
Japanese   JAP_LAN001
Korean   KOR_LAN001
Mandarin   MAC_LAN001
Pashto   PAS_LAN001
Russian   RUS_LAN001
Serbian   SRB_LAN001
Sorani (Kurdish)   SOR_LAN001
Thai   THA_LAN001
Urdu   URD_LAN001


NER Corpora

Words 500,000 (per language)

List Price USD 7,500 (per language)

Brief Description

Corpora containing text material collected from a variety of sources. Each Text Corpus contains approximately 500,000 words and is tagged for the following Named Entities:

- Person

- Organization

- Location

- Nationality

- Religion

- Facility

- Geo-Political Entity

- Titles

Language   DB Name

Arabic   ARB_NER001
English   ENG_NER001
Farsi/Persian   FAR_NER001
Japanese   JPY_NER001
Korean   KOR_NER001
Mandarin   MAC_NER001
Russian   RUS_NER001
Urdu   URD_NER001
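The catalog does not specify the annotation format of the NER corpora. As one way a developer might represent the eight entity categories listed above once the data is loaded, the sketch below assumes character-offset spans over the text; the span layout, class names and sample sentence are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class EntityType(Enum):
    """The eight Named Entity categories listed in the catalog."""
    PERSON = "Person"
    ORGANIZATION = "Organization"
    LOCATION = "Location"
    NATIONALITY = "Nationality"
    RELIGION = "Religion"
    FACILITY = "Facility"
    GEO_POLITICAL_ENTITY = "Geo-Political Entity"
    TITLE = "Titles"


@dataclass(frozen=True)
class EntitySpan:
    start: int          # character offset, inclusive (assumed layout)
    end: int            # character offset, exclusive (assumed layout)
    label: EntityType


def count_by_type(spans: Iterable[EntitySpan]) -> Counter:
    """Count how many spans carry each entity label."""
    return Counter(span.label for span in spans)


# Made-up example sentence and annotations, not taken from any corpus.
TEXT = "Dr Jane Doe visited Sydney."
SPANS = [
    EntitySpan(0, 2, EntityType.TITLE),
    EntitySpan(3, 11, EntityType.PERSON),
    EntitySpan(20, 26, EntityType.LOCATION),
]

if __name__ == "__main__":
    for span in SPANS:
        print(span.label.value, "->", TEXT[span.start:span.end])
    print(count_by_type(SPANS))
```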


Text Corpora

Language Arabic (MSA)

DB Name ARB_THE001

DB type 2 Thesaurus

Words 28,000

List Price Provided on request

Brief Description:

• The thesaurus contains 28,000 headwords

• For each headword, the following information is provided:

o Detailed Part-Of-Speech information including Verb (Intransitive/Transitive), Adverb, Noun, Adjective

o A broad definition in English

o Synonyms

o Antonyms

o A broad definition of the antonym group linked to the sense group


Text Corpora

Language Arabic (MSA)

DB Name ARB_TXT001

DB type 2 Vowelized text corpus

Words 450,000

List Price USD 9,500

Brief Description:

• This vowelised corpus is made up of 450,000 words of Arabic news text

• The text has been 100% manually vowelised and checked


Text Corpora

Language Farsi/Persian

DB Name FAR_MOR001

DB type 2 Morphological Database

Words 0

List Price USD 32,500

Brief Description:

• The Farsi/Persian morphological database comprises six files in text format:

- a stems dictionary;

- a dictionary of inflectional prefixes;

- a dictionary of inflectional suffixes; and

- three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem).

• The format of the six files corresponds to the input format required by the Buckwalter AraGen generation program. This program uses the input file to output the complete set of potential words defined by the stem and affix dictionaries and compatibility tables

• All words and affixes in the six files are in a Romanized form (converted using an Appen conversion table). Each word and affix is shown with and without short vowels. The form with short vowels (the vowelized form) reflects the pronunciation of the word or affix

• SUMMARY OF CONTENTS

- Stems in stem dictionary - 18,364 (including stem alternations)

- Stems in stem dictionary - 16,492 (excluding stem alternations)

- Number of suffixes: 506 (including zero suffix and variants of suffixes with and without the zero width non-joiner character)

- Number of prefixes: 14 (including zero prefix)

- Number of unique words generated: 1,608,559
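The generation step described above, combining a stems dictionary, two affix dictionaries and three compatibility tables, can be sketched in a few lines. This is a toy illustration of the general idea only: it does not reproduce the AraGen program or its input file format, and the category names and English-like sample entries are invented so that the result is easy to check by eye.

```python
from itertools import product
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def generate_words(
    prefixes: Dict[str, str],   # prefix form -> prefix category
    stems: Dict[str, str],      # stem form -> stem category
    suffixes: Dict[str, str],   # suffix form -> suffix category
    prefix_stem: Set[Pair],     # allowed (prefix category, stem category)
    stem_suffix: Set[Pair],     # allowed (stem category, suffix category)
    prefix_suffix: Set[Pair],   # allowed (prefix category, suffix category)
) -> Set[str]:
    """Return every prefix+stem+suffix string allowed by all three tables."""
    words = set()
    for (p, p_cat), (s, s_cat), (x, x_cat) in product(
        prefixes.items(), stems.items(), suffixes.items()
    ):
        if (
            (p_cat, s_cat) in prefix_stem
            and (s_cat, x_cat) in stem_suffix
            and (p_cat, x_cat) in prefix_suffix
        ):
            words.add(p + s + x)
    return words


# Invented toy data; the empty string plays the role of the zero prefix/suffix.
PREFIXES = {"": "P0", "un": "P_NEG"}
STEMS = {"lock": "V", "kind": "ADJ"}
SUFFIXES = {"": "S0", "ed": "S_PAST", "ness": "S_NOUN"}

PREFIX_STEM = {("P0", "V"), ("P0", "ADJ"), ("P_NEG", "V"), ("P_NEG", "ADJ")}
STEM_SUFFIX = {("V", "S0"), ("V", "S_PAST"), ("ADJ", "S0"), ("ADJ", "S_NOUN")}
PREFIX_SUFFIX = {
    (p, x) for p in ("P0", "P_NEG") for x in ("S0", "S_PAST", "S_NOUN")
}

if __name__ == "__main__":
    for word in sorted(
        generate_words(PREFIXES, STEMS, SUFFIXES,
                       PREFIX_STEM, STEM_SUFFIX, PREFIX_SUFFIX)
    ):
        print(word)
```

Collecting the output in a set mirrors the "unique words generated" count in the summary, since different affix combinations can yield the same surface form.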


Text Corpora

Language Urdu

DB Name URD_MOR001

DB type 2 Morphological Database

Words 0

List Price USD 32,500

Brief Description:

• The Urdu morphological database comprises six files in text format:

- a stems dictionary;

- a dictionary of inflectional prefixes;

- a dictionary of inflectional suffixes; and

- three compatibility tables, which define the grammatically acceptable combinations of stems, prefixes and suffixes for any given stem in the stems dictionary (prefix-suffix; prefix-stem; suffix-stem)

• The format of the six files corresponds to the input format required by the Buckwalter AraGen generation program. This program uses the input file to output the complete set of potential words defined by the stem and affix dictionaries and compatibility tables.

• All words and affixes in the six files are in a Romanized form (converted using an Appen conversion table). Each word and affix is shown with and without short vowels. The form with short vowels (the vowelized form) reflects the pronunciation of the word or affix.

• SUMMARY OF CONTENTS

- Stems in stem dictionary - 13,267 (including stem alternations)

- Stems in stem dictionary - 13,116 (excluding stem alternations)

- Number of suffixes: 115 (including zero suffix)

- Number of prefixes: 1 (zero prefix)

- Number of unique words generated: 31,109
