Natural Language Processing
Politecnico di Milano, Polo di Como
Prof. Licia Sbattella
Student: Lorenzo Monni Sau, Matr. 771378, AA 2012/2013
Assignment: Text & Speech Analysis

Text and Speech Analysis




Table of Contents
1. Introduction: Goals of the Assignment and Tools Used
2. Choice of the dialogue and text-to-speech alignment with SPPAS
3. Editing the dialogue tiers in Praat and writing a Script for Processing
4. POS Tagging
5. Semantic Analysis with JWNL
6. Results and main statistics
7. Conclusions
8. Appendix: Lines of Code

1. Introduction: Goals of the Assignment and Tools Used

The objective of this work is to provide a complete analysis of a piece of conversation, covering the following aspects:

• phonologic features of the dialogue and a brief statistical analysis;
• a subdivision into dialogue acts using the DAMSL model;
• the POS tagging of the dialogue;
• a brief semantic analysis;
• a graphical representation of the results.

Given these goals, the first step was choosing the right dialogue for the purposes of the analysis. The audio file of the dialogue, together with its written transcription, was given as input to SPPAS (Automatic Phonetic Annotation of Speech), a tool that aligns audio with text and provides tokenization and phonetization features.

The output of the SPPAS analysis, the text aligned with the audio file, was then used as input to Praat, a tool that captures audio features of speech such as pitch, intensity and formants. The alignment was manually edited in Praat to provide the best match between transcription and audio, and a Praat script was then written to append some audio features and further annotations to the words in the .txt file.

The POS tagging part of the project was carried out using the Stanford University POS tagger. After this phase, the .txt file with the data looked like a table with audio, dialogue and syntactic features associated with each word of the conversation.

The last part of the project involved the semantic analysis of the dialogue, leveraging the JWNL Java library to query the WordNet lexical database. Graphical results were produced by importing the final .txt file into Microsoft Excel.
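As a sketch of what one row of that final table might look like (the field names, order and values below are assumptions for illustration, not the actual file layout), parsing such a row in Java is straightforward:

```java
public class RowSketch {
    public static void main(String[] args) {
        // Assumed row layout: token, mean pitch, mean intensity,
        // dialogue act, speaker, POS tag, semantic domain.
        String line = "friday 102.35 68.20 Answer Amanda NNP time_period";
        // Split on runs of whitespace to recover the individual fields
        String[] f = line.trim().split("\\s+");
        System.out.println("token=" + f[0] + " pitch=" + f[1] + " speaker=" + f[4]);
    }
}
```

Keeping the intermediate results in such a flat, whitespace-separated table is what makes the final import into Excel trivial.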


2. Choice of the dialogue and text to speech alignment with SPPAS

The choice of a suitable dialogue for the analysis was probably the hardest step in the assignment, due to the constraints imposed by SPPAS's limited processing capabilities. My first idea was to pick an artistically relevant dialogue, so I started with an excerpt from the film Eyes Wide Shut by Stanley Kubrick and tried to get the best possible alignment.

SPPAS (version 1.4.8) does not perform well with:

• audio files longer than two minutes;
• excerpts of films, which usually carry significant background noise;
• realistic, natural dialogues, because of overlapping voices, non-word phonemes and other imperfections.

The Bill and Victor dialogue had all three of these characteristics, so it was almost impossible to obtain an acceptable alignment, even as a starting point for further editing in Praat. I tried to remove some noise and emphasize only the speech parts of the audio file with a simple MATLAB script (see the appendix for the code), but it did not work.

The second attempt was a dialogue from the Italian film Il Divo by Paolo Sorrentino, in which the speech seemed clearer and more fluid than in the previous one; SPPAS also supports Italian-language dialogues. Unfortunately, this audio file showed the same drawbacks as the previous one, even though I also tried splitting the audio file into shorter fragments for processing, as can be seen in the folder.

The last attempt was a straightforward English educational dialogue between two girls, which worked really well with SPPAS. Despite its simplicity and linear dialogue interaction, it had a good level of emotional speech and was expressive enough for the purposes of the assignment.

To enable a correct alignment with SPPAS, I also put hash symbols in the .txt file to mark the moments of pause in the dialogue. This is another limitation of SPPAS: without the silences traced in the .txt file, it cannot produce a precise alignment. The resulting files are in the project folder “SPPAS Processing”.
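As an illustration (the lines below are invented, not the actual transcript), a transcript prepared this way interleaves hash symbols with the speech, one per pause:

```
Hi Karen how are you # I am fine thanks #
Are you free on friday # yes I think so #
```

Without these markers SPPAS has no anchor for the silent stretches, and the alignment drifts.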


3. Editing the dialogue tiers in Praat and writing a Script for Processing

Since the alignment process in SPPAS was not precise, further editing in Praat was needed, moving boundaries and tokens into the right positions where necessary. The results of this editing were saved in the TextGrid file “dialogue-flat-phon_palign”, in the folder “Editing in Praat”.

Two more tiers were added to the TextGrid file, indicating the class of dialogue act (using the dialogue act classification proposed by the DAMSL model) and the speaker.

The final TextGrid file featured the following tiers:

• PhonAlign Tier;

• PhnTokAlign Tier;

• TokensAlign Tier;

• DialogueAct Tier;

• Speaker.

In the next phase I moved from the Praat editor view to the Praat scripting language, to extract the required audio features associated with each word token in the dialogue. The Praat script “features.praat” takes the wave file and the TextGrid file as input and produces a .txt file listing:

• Word token;

• Mean Pitch of token;

• Mean Intensity of token;

• DialogueAct;

• Speaker.

The results were saved in the .txt file “conversation-audio” in the folder “Editing in Praat”.


4. POS Tagging

To obtain the part-of-speech tag of each word in the dialogue, the Stanford POS tagger (version 3.2.0) was used. The result of the tagging was stored in the file “conversation-tagged.txt”. A pretrained model was used to assign part-of-speech tags to the unlabeled text; the adopted model was “wsj-0-18-left3words-distsim”, included in the Stanford POS tagger package.

After the POS tagging I noticed some mistakes by the tagger, e.g. some nouns were tagged as verbs and vice versa, but the majority of words received the right tag.
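The tagger's plain-text output attaches each tag to its token with an underscore, which is the separator the later JWNL step splits on. A minimal sketch, with an invented sample sentence:

```java
public class TagSplit {
    public static void main(String[] args) {
        // Stanford POS tagger plain-text output: token_TAG pairs separated by spaces
        String tagged = "See_VB you_PRP on_IN friday_NNP";
        for (String pair : tagged.split("\\s+")) {
            // Split each pair into the surface token and its Penn Treebank tag
            String[] parts = pair.split("_");
            System.out.println(parts[0] + " -> " + parts[1]);
        }
    }
}
```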

5. Semantic Analysis with JWNL

JWNL is a Java API (Application Programming Interface) to access and query the WordNet database. In this context JWNL was used to find the domains of each word token. I used version 2.0 of WordNet, version 1.4 of JWNL, and Eclipse as IDE with the Java 1.7 SDK and JRE 7 (Java Runtime Environment).

To find the domains of each token I leveraged the CATEGORY pointer type, and when no related domains were found I used a function that recursively searches for the root hypernym. The Java project reads the .txt file “conversation-tagged” in the folder “POS tagging” as input, and writes the .txt file “dialogue-audio-pos-domains” as output.

One issue with this operation was that the CATEGORY pointer failed for many tokens, and the recursive search for hypernyms returned base classes like “entity” or “abstraction”, which are too general to serve as semantic domains.
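The limitation can be seen with a toy hypernym chain (the map below is invented and stands in for WordNet's HYPERNYM pointers, not the JWNL API): climbing the chain always terminates at a root that is too general to be a useful domain.

```java
import java.util.HashMap;
import java.util.Map;

public class HypernymSketch {
    // Toy hypernym chain (invented entries): each word maps to its direct hypernym
    static final Map<String, String> HYPERNYM = new HashMap<>();
    static {
        HYPERNYM.put("friday", "weekday");
        HYPERNYM.put("weekday", "day");
        HYPERNYM.put("day", "time_period");
        HYPERNYM.put("time_period", "abstraction");
        HYPERNYM.put("abstraction", "entity");
    }

    // Follow hypernym links until a synset with no hypernym is reached
    static String rootHypernym(String word) {
        String current = word;
        while (HYPERNYM.containsKey(current)) {
            current = HYPERNYM.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // The climb ends at "entity": correct, but too general to act as a domain
        System.out.println(rootHypernym("friday"));
    }
}
```

This is exactly why the recursive fallback, unlike the CATEGORY pointer, cannot by itself deliver a discriminative semantic domain.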

The final results of all the processing are stored in the Excel file “Dialogue Data” and in the flat .txt file “dialogue-audio-pos-domains-def”.

6. Results and main statistics

The data from the dialogue analysis were all imported into the Excel file “Dialogue Data”, which includes four sheets:

– General Data: table with all fields and values;

– Speaker Pitch-Intensity: Pitch & Intensity Data and graphics;

– Dialogue Acts: Analysis of Dialogue Acts;

– Domains: Analysis of Domains.


Non-word utterances were not taken into account in the analysis, since there is only one non-word token in the conversation.

[Chart: Pitch Trend by Speaker, showing pitch (Hz, 0 to 600) against token number (1 to 81), with one series per speaker (Amanda, Karen)]

[Chart: Intensity Trend by Speaker, showing intensity (dB, 0 to 90) against token number (1 to 81), with one series per speaker (Amanda, Karen)]


7. Conclusions

Due to the difficulties in SPPAS processing, the chosen dialogue is a very simple type of conversation, so the DAMSL analysis and the domain analysis did not yield particularly significant results. The topic of conversation is generic, so there is no particular trend in the semantic domains of the word tokens. The conversation is evenly distributed, with the two speakers having almost the same number of tokens. The conversation shows slight variations in pitch, and the fundamental frequency of Amanda's voice is quite different from Karen's, reflecting the different timbre of the two speakers, though both remain within the range typical of female voices. Among the average pitch results there is a significant outlier associated with Amanda's expression “on friday”: values of 97 and 107 Hz sound somewhat unrealistic for a female voice. The average intensity of the tokens shows that the volume of the dialogue remains constant throughout the conversation: there is no soft speaking, and the two speakers talk at the same volume (only 2 dB of difference).

The Praat analysis, together with the POS tagging, is probably the most reliable part, whereas the analysis carried out with JWNL shows evident limits in recognizing the correct domains of speech. Most of the domains found are clearly wrong for this kind of dialogue, and the reason is that knowledge of the context in which a word token occurs is necessary to reach the right semantic domain.

The conversation between Amanda and Karen is a Q&A conversation, so it is no surprise that a high percentage of the dialogue acts falls into the Answer and Info-Request types. More pleasant expressions seem to have higher levels of pitch and intensity, whereas action-directives, open-options and offers show a lower pitch and sometimes lower intensity, suggesting that when the speaker makes a proposal, they probably want to convey a feeling of modesty and avoid giving the impression of an imposition.


8. Appendix: Lines of Code.

MATLAB CODE

function [y_n] = remove_noise(y, win_len, mean_val, atten)
% This function performs a background noise attenuation, provided that the
% loudness difference between noise and original signal is high enough.
% y        = signal with noise
% win_len  = frame length to calculate noise impact
% mean_val = threshold which discriminates between noise and signal
% atten    = attenuation value to cut noise
for n = 1:(length(y) - win_len)
    if (sum(abs(y(n:(n + win_len - 1)))) < mean_val * win_len & max(abs(y(n:n + win_len - 1))) < mean_val)
        for m = n:n + win_len - 1
            y(m) = y(m) * atten;
        end
    end
end
y_n = y;
end

PRAAT CODE

##### Script to extract features for each token #####
## print columns of the table ##
echo Token MeanPitch Intens. DialogueAct Speaker
select all
# sound file & TextGrid file to be analyzed #
s = selected("Sound")
tg = selected("TextGrid")
select tg
numIntervals = Get number of intervals... 3
### calculate Pitch and Intensity of Speech ###
select s
To Pitch... 0.0 75 600
select s
To Intensity... 75 0.0
plus Pitch dialogue-flat
pitch = selected("Pitch")
intensity = selected("Intensity")
space$ = " "
for cont from 1 to numIntervals
    select TextGrid dialogue-flat-phon_palign
    token$ = Get label of interval... 3 cont
    tstart = Get starting point... 3 cont
    tend = Get end point... 3 cont
    dialogueActNum = Get interval at time... 4 tstart+0.01
    dialogueAct$ = Get label of interval... 4 dialogueActNum
    speakerNum = Get interval at time... 5 tstart+0.01
    speaker$ = Get label of interval... 5 speakerNum
    # for each non-silence token extract mean pitch & mean intensity #
    if !startsWith (token$, "#")
        select pitch
        pitchMean = Get mean... tstart tend Hertz
        select intensity
        intensityMean = Get mean... tstart tend dB
        ### configure layout: pad the token to a 15-character column ###
        lenStr = length(token$)
        spaceNum = 15 - lenStr
        print 'token$'
        for lung from 1 to spaceNum
            print 'space$'
        endfor
        print 'pitchMean:2' 'intensityMean:2'
        ### configure layout: pad the dialogue act to a 20-character column ###
        lenStr2 = length(dialogueAct$)
        spaceNum2 = 20 - lenStr2
        print 'dialogueAct$'
        for lung from 1 to spaceNum2
            print 'space$'
        endfor
        print 'speaker$'
        printline
    endif
endfor
### Save data in txt file ###
appendFile ("conversation-audio.txt", info$ ())

JWNL CODE

package wordnet;

import java.io.*;

// JWNL 1.4 imports (restored: the original listing only showed java.io.*)
import net.didion.jwnl.JWNL;
import net.didion.jwnl.JWNLException;
import net.didion.jwnl.JWNLRuntimeException;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.data.Pointer;
import net.didion.jwnl.data.PointerType;
import net.didion.jwnl.data.Synset;
import net.didion.jwnl.data.Word;
import net.didion.jwnl.dictionary.Dictionary;

public class WordSem {

    public static void main(String[] args) throws JWNLException, IOException, JWNLRuntimeException {
        // Initialize JWNL with the properties file that points to the dictionary files
        JWNL.initialize(new FileInputStream("file_properties.xml"));
        // After initialization, create a Dictionary object that can be queried
        Dictionary wordnet = Dictionary.getInstance();

        // Read the POS-tagged text file and extract the words to be searched on WordNet
        String read_path = "D:\\Ultimo semestre\\Natural Language Processing\\ASSIGNMENT\\conversation\\POS tagging\\conversation-tagged.txt";
        FileReader fr = new FileReader(read_path);
        BufferedReader br = new BufferedReader(fr);

        // Open the writer stream (will write a txt file with "Token POS Domain" lines)
        String write_path = "D:\\Ultimo semestre\\Natural Language Processing\\ASSIGNMENT\\conversation\\dialogue-audio-pos-domains.txt";
        File file = new File(write_path);
        FileWriter file_write = new FileWriter(file);

        String read_linea;  // line read from the source file
        String wordn;       // token word from the source file
        String word_POS;    // POS tag from the source file
        POS wnPOS;          // POS tag in WordNet format
        String strdomain;   // domain string related to the word token

        // While there are lines in the source file, take the word token and its POS tag
        while (true) {
            read_linea = br.readLine();
            if (read_linea == null)
                break;
            String[] splits = read_linea.split("_"); // separator between word and tag
            wordn = splits[0];
            System.out.println(wordn);
            word_POS = splits[1];
            System.out.println(word_POS);

            // begin writing the line for the output txt file
            StringBuilder write_appnd = new StringBuilder();
            write_appnd.append(wordn).append(" ").append(word_POS).append(" ");

            // translate from POS tag to WordNet word type
            wnPOS = getWordNetPOS(word_POS);

            // WordNet analysis: check for the word's domain and its hypernyms
            if (wnPOS != null && wordn != null) {
                // An IndexWord is a single word and part of speech; look up its synsets
                IndexWord w = wordnet.lookupIndexWord(wnPOS, wordn);
                if (w != null) {
                    Synset[] senses = w.getSenses();
                    Pointer[] domain;
                    for (int i = 0; i < senses.length; i++) {
                        // CATEGORY is the pointer type for the domains
                        domain = senses[i].getPointers(PointerType.CATEGORY);
                        for (int l = 0; l < domain.length; l++) {
                            // obtain the synset of the domain, then an associated word string
                            Synset syndomain = domain[l].getTargetSynset();
                            Word rootWord = syndomain.getWord(0);
                            strdomain = rootWord.getLemma();
                            // add it to the output txt file
                            write_appnd.append(strdomain);
                        }
                    }
                    // fall back to the root hypernym for nouns
                    if (wnPOS == POS.NOUN) {
                        strdomain = getRootHypernym(w);
                        write_appnd.append(strdomain);
                    }
                }
            }
            // finish the line, then skip to the next one
            write_appnd.append("\r\n");
            file_write.write(write_appnd.toString());
        }
        file_write.close();
        br.close();
    }

    // translate from POS tag to WordNet word type
    public static POS getWordNetPOS(String wPOS) {
        POS wordNetPos;
        switch (wPOS) {
            case "NN": case "NNS": case "NNP":
                wordNetPos = POS.NOUN;
                break;
            case "VB": case "VBD": case "VBG": case "VBN": case "VBP": case "VBZ":
                wordNetPos = POS.VERB;
                break;
            case "JJ": case "JJR": case "JJS":
                wordNetPos = POS.ADJECTIVE;
                break;
            case "RB": case "RBR": case "RBS":
                wordNetPos = POS.ADVERB;
                break;
            default:
                wordNetPos = null;
        }
        return wordNetPos;
    }

    // search for the root hypernym of the first sense
    public static String getRootHypernym(IndexWord synsetw) throws JWNLException {
        Synset[] senses = synsetw.getSenses();
        Synset syndomain = null;
        Pointer[] domain = senses[0].getPointers(PointerType.HYPERNYM);
        if (domain.length > 0) {
            syndomain = domain[0].getTargetSynset();
            // climb the hypernym chain until a synset with no hypernym is reached
            while (true) {
                domain = syndomain.getPointers(PointerType.HYPERNYM);
                if (domain.length > 0)
                    syndomain = domain[0].getTargetSynset();
                else
                    break;
            }
        }
        String stringdomain = syndomain.getWord(0).getLemma();
        System.out.println(stringdomain);
        return stringdomain;
    }
}