5aSC8. Transcription and forced alignment of the Digital Archive of Southern Speech

Margaret E. L. Renwick, Michael L. Olsen, Rachel Miller Olsen, Joseph A. Stanley

1. THE DIGITAL ARCHIVE OF SOUTHERN SPEECH (DASS)
- Subset of the Linguistic Atlas of the Gulf States (LAGS) [7]
- 64 speakers recorded in sociolinguistic interviews from 1968–1983 in 8 U.S. Gulf States
- 30 female, 34 male; born 1886–1965; mean age 61 years
- 4 speakers for each of 16 LAGS geographical sectors [6] (Fig 1)
- 1 African American (AA) speaker, and 3 European American (EA) speaker “Types”
- 372 hours of audio (2.5–10 hours per interview; µ = 5.75 hours)
- .wav files filtered in Praat [1] to remove artifactual noise above 17 kHz

Speaker Type      Description
1 “Folk”          Older, less educated, less connected
2 “Common”        Younger, better educated & connected
3 “Cultivated”    Most educated, culturally aware, connected to the community
AA                African American

Figure 1. DASS speakers by LAGS sector and type
(Map labels: Dallas, Austin, Houston, Little Rock, New Orleans, Shreveport, Jackson, Memphis, Nashville, Knoxville, Atlanta, Macon, Birmingham, Montgomery, Jacksonville, Orlando, Miami, Key West. Legend: Type 1, 2, 3, AA.)

2. MOTIVATION FOR TRANSCRIBING DASS
- Within the Linguistic Atlas Project [5], a limited number of target lexical items were impressionistically transcribed in LAGS Protocols (Fig 2), with no acoustic analysis
- Maximum of 1031 transcribed items per speaker; little intraspeaker variation represented
- Transcribing full DASS interviews is expected to yield a searchable corpus of 1.5 million words, time-aligned to the audio, with corresponding acoustic data.

Figure 2. Example of LAGS Speaker Protocol transcriptions

3. GOALS
- To offer methods for transcription and automatic phonetic analysis of a large speech corpus
- To automatically extract as much good acoustic data as possible from these legacy recordings
- To use this rich historical data to explore the sociophonetics of Southern speech

4. METHODS FOR LARGE-SCALE TRANSCRIPTION
- Transcriber software [2] is used for orthographic transcription (Fig 3)
- Facilitates user-friendly, precisely time-aligned, multi-tier transcription
- Approximately 40 undergraduates are assigned a single speaker each, and transcribe 1 reel at a time. Mean reel length: 54 minutes.

3-LISTEN SYSTEM:
- 1ST LISTEN: Orthographically record who said what, when.
- 2ND LISTEN: Correct spelling, ensure that transcription is properly time-aligned & in-house conventions are followed.
- 3RD LISTEN: 2-3 graduate students check all transcriptions to ensure consistency across the corpus.

CODING:
- Codes (below) are employed in curly brackets { } within Transcriber
- Each code is transcribed on its own line (Fig 3, top)

Code     Meaning
{D: }    Doubt: e.g. {D: doubted words}
{X}      Unintelligible
{B}      Beep: added to anonymize audio recordings available to the public
{C: }    Comment: e.g. {C: tape distortion}
{NW}     Non-word: e.g. laugh, cough
{NS}     Non-speech: e.g. dog barking, door closing

TEXTGRID OUTPUT:
- Time alignments from Transcriber are mapped to TextGrid intervals
- Each speaker (interviewee, interviewer) receives a separate tier
- Only interviewee tier is phonetically analyzed
- Intervals containing { } are excluded from phonetic analysis

ENSURING CONSISTENCY ACROSS TRANSCRIPTIONS:
- In-house transcription guidelines (e.g. spelling, punctuation)
- Dictionary of non-standard words (e.g. uh-huh, gonna)

Figure 3. Transcriber software graphical user interface

5. TRANSCRIPTION AND DATA PROCESSING WORKFLOW
- Transcription by undergraduate using Transcriber, including double-checking. 1 hour audio: 12.5 hours of work (2 listens)
- Spot-checked for consistency in Transcriber (3rd listen)
- File conversion via LaBB-CAT [3] scripts: .trs (.xml) → .txt; .trs → .TextGrid
- Automatic phonetic analysis!

6. FORCED ALIGNMENT AND FORMANT EXTRACTION
- TextGrids and .wav files submitted to Dartmouth Linguistic Automation (DARLA) [8] for forced alignment (Fig 4) and vowel extraction
- DARLA filters data, and by default does not return measurements for every token
- We are testing DARLA against three non-filtering formant extraction techniques (Fig 5)
  - In-house Praat script: extracts all data, but formant tracking is errorful for back vowels
  - FAVE [9]: extracts all tokens, but its Bayesian formant tracking algorithm is not specialized for Southern speech; training data come from many U.S. varieties
  - Modified FAVE: extracts all tokens; Bayesian algorithm trained on mean formant values from 4 fully-transcribed DASS interviews; requires an extra step for data extraction
- Modified FAVE appears to perform best: it provides a clean, well-separated vowel space similar to DARLA’s output, but without data loss due to filtering
- At present, 18 interviews are fully transcribed, and 20+ are in progress
- Use our QR code to interact with this dataset in your web browser!
- Visit poster 5aSC9 for further acoustic analysis methods and results!

Figure 4. Force-aligned TextGrid returned by DARLA
Phones: DH IY0 | K AE1 T S | M EY1 D | IH1 T | S AO1 R T AH0 | S K EH1 R IY0
Words:  the | cats | made | it | sorta | scary
Time (s): 0 to 1.801

Figure 5. Comparison of vowel formant extraction methods

7. REFERENCES & ACKNOWLEDGMENTS
This research is supported by: NSF BCS #1625680 to co-PIs Kretzschmar and Renwick, the UGA Graduate School, and the American Dialect Society.

[1] Boersma, P., and Weenink, D. (2015). Praat: Doing phonetics by computer [Computer program], Version 5.4.08. Retrieved from http://www.praat.org
[2] Boudahmane, K., Manta, M., Antoine, F., Galliano, S., and Barras, C. (1998). Transcriber [Computer program], Version 1.5.2. Retrieved from http://trans.sourceforge.net/
[3] Fromont, R., and Hay, J. (2012). “LaBB-CAT,” Proceedings of the Australasian Language Technology Workshop, 10, 113–117.
[4] Gorman, K., Howell, J., and Wagner, M. (2011). “Prosodylab-Aligner: A Tool for Forced Alignment of Laboratory Speech,” Canadian Acoustics, 39(3), 192–193.
[5] Kretzschmar, W. A. J. (2011). Linguistic Atlas Project, Linguistic Atlas Project. Retrieved from http://www.lap.uga.edu/
[6] Kretzschmar, W. A., Bounds, P., Hettel, J., Pederson, L., Juuso, I., Opas-Hänninen, L. L., and Seppänen, T. (2013). “The Digital Archive of Southern Speech (DASS),” Southern Journal of Linguistics, 37(2), 17–38.
[7] Pederson, L., McDaniel, S. L., and Adams, C. M. (Eds.) (1986). Linguistic Atlas of the Gulf States, University of Georgia Press, Athens, Georgia, Vols. 1-7.
[8] Reddy, S., and Stanford, J. N. (2015). “Toward completely automated vowel extraction: Introducing DARLA,” Linguistics Vanguard, 1(1), 15–28. doi:10.1515/lingvan-2015-0002
[9] Rosenfelder, I., Fruehwald, J., Evanini, K., and Yuan, J. (2011). FAVE (Forced Alignment and Vowel Extraction) program suite [Computer program]. Retrieved from http://fave.ling.upenn.edu
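
The TEXTGRID OUTPUT rules above (only the interviewee tier is analyzed, and any interval containing a curly-bracket code is excluded) can be illustrated with a small Python sketch. This is not the project's actual script. The `Interval` class, the tier names `"interviewee"` and `"interviewer"`, the dict-of-lists layout, and the choice to skip empty intervals are all assumptions made for illustration; a real pipeline would first read the intervals from a Praat TextGrid file.

```python
import re
from dataclasses import dataclass

# Matches any curly-bracket transcription code, e.g. {X}, {NW}, {D: doubted words}.
CODE_PATTERN = re.compile(r"\{[^{}]*\}")


@dataclass
class Interval:
    """Hypothetical stand-in for one TextGrid interval."""
    start: float  # seconds
    end: float    # seconds
    text: str     # orthographic transcription


def analyzable_intervals(tiers, tier_name="interviewee"):
    """Return intervals of one tier that contain no { } code.

    `tiers` maps a tier name to a list of Interval objects (assumed layout).
    Empty intervals are also skipped, which is an assumption of this sketch.
    """
    kept = []
    for interval in tiers.get(tier_name, []):
        text = interval.text.strip()
        if text and not CODE_PATTERN.search(text):
            kept.append(interval)
    return kept


# Toy data: the first utterance is borrowed from Figure 4; the rest is invented.
tiers = {
    "interviewee": [
        Interval(0.0, 1.801, "the cats made it sorta scary"),
        Interval(1.801, 2.4, "{NW}"),
        Interval(2.4, 3.9, "{D: doubted words}"),
        Interval(3.9, 4.2, ""),
    ],
    "interviewer": [Interval(4.2, 5.0, "uh-huh")],
}

kept = analyzable_intervals(tiers)
print([i.text for i in kept])  # ['the cats made it sorta scary']
```

Because each code is transcribed on its own line, dropping whole intervals that contain a code should lose little or no ordinary speech; if codes were mixed into running text, a finer-grained filter might be preferable.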

