Extraction of Socio-Semantic Data from Chat Conversations in Collaborative Learning Communities

Maastricht - September, 18th, 2008


Traian Rebedea1, Stefan Trausan-Matu1,2, Costin Chiru1

1 “Politehnica” University of Bucharest, Department of Computer Science and Engineering

2 Research Institute for Artificial Intelligence of the Romanian Academy

{traian.rebedea, stefan.trausan, costin.chiru} @ cs.pub.ro


Overview
1. Introduction
2. Theoretical background
3. Implementation
  Detecting the conversation’s topics
  Assessing the learners’ competencies
  Discovering implicit voices
  Conversation graph
4. Conclusions



Context
Computer-assisted learning
  Developing tools to support the learning process
  Evaluation of these tools (and of the learning process)
  Determining the learners’ performance
Computer Supported Collaborative Learning (CSCL)
  Main idea: “rather than speaking about ‘acquisition of knowledge,’ many people prefer to view learning as becoming a participant in a certain discourse” (Sfard, 2000)
Focus on studying interactions between the participants in chat conversations in small groups



Objectives
Automatic extraction of useful social and semantic information from conversations
  Determining relationships between utterances
  Utterances that have influenced the further development of the conversation
  The performance / competency of each participant
Designing an interface for the visualisation of a conversation
Applicable to chats, discussion forums, etc., as well as to face-to-face discussions



Experiments
Languages: English (advantage: existing NLP tools) and Romanian
Computer Science: HCI, NLP and Algorithm Design courses at “Politehnica” University of Bucharest
Small groups of 4-5 students; all of the students must be graded (over 100 students / course)
The conversations have well-determined subjects
  Collaborative, team work
  Competitive
Also used chat transcripts from Virtual Math Teams, Drexel University, Philadelphia, US








Socio-cultural Paradigm
The role of socially established artefacts in communication and learning (Vygotsky)
Bakhtin focuses on the role of language and discourse, and especially of speech and dialog: “… Any true understanding is dialogic in nature.”
Lotman considers text as a “thinking device”



Bakhtin’s Dialogism
Bakhtin’s ideas
  Dialogism
  Polyphony
  Inter-animation of voices

Bakhtin: “The specific totality of ideas, thoughts and words is everywhere passed through several unmerged voices, taking on a different sound in each” – referring to Dostoevsky’s novels

Dual nature of voices: community and individuality



Voices in Chats
Utterances should be the units of analysis
An utterance contains at least one voice: that of the participant who issued it
Most utterances contain multiple voices
The inter-animation of the voices – discussion threads of the conversation



Discussion Threads








Foreword
Chat transcripts are read from HTML or XML files
ConcertChat environment (Fraunhofer)
  Advantages for collaborative work
  Enables the use of explicit references to previous utterances or to a whiteboard
Implementation in C#.NET



Techniques
Tokenization
Stop-words, emoticons and usual abbreviations ( :) , :D , brb, thx, …) are eliminated
WordNet for identifying synonyms
Misspellings are looked up using the Google API
The ontology can be extended with words discovered in the chat, specific to the conversation’s domain
Pattern analysis



Detecting the Topics
Each word in the chat becomes a candidate concept
  Synset list
  Frequency
Clustering algorithm for the concepts’ unification
  If the synsets of two concepts have a common word
    The two synset lists are merged
    The frequency of the resulting concept = sum of the frequencies of the unified concepts
The resulting concepts are the main topics of the conversation
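
The unification step can be illustrated with a minimal Python sketch. It is not the authors' C#.NET implementation: the synonym table TOY_SYNSETS is a hypothetical stand-in for WordNet synsets, and the function name detect_topics is an assumption made for the example.

```python
from collections import Counter

# Hypothetical toy synonym table standing in for WordNet synsets.
TOY_SYNSETS = {
    "forum": {"forum", "board"},
    "board": {"board", "forum"},
    "chat": {"chat", "conversation"},
    "conversation": {"conversation", "talk"},
}


def detect_topics(words, synsets=TOY_SYNSETS):
    """Merge candidate concepts whose synset lists share a word."""
    counts = Counter(words)
    # Each distinct word starts as its own candidate concept:
    # (synset list, frequency).
    concepts = [(set(synsets.get(w, ())) | {w}, n) for w, n in counts.items()]
    merged = True
    while merged:  # repeat until no two concepts overlap any more
        merged = False
        for i in range(len(concepts)):
            for j in range(i + 1, len(concepts)):
                if concepts[i][0] & concepts[j][0]:
                    # Unify: merge the synset lists, sum the frequencies.
                    concepts[i] = (concepts[i][0] | concepts[j][0],
                                   concepts[i][1] + concepts[j][1])
                    del concepts[j]
                    merged = True
                    break
            if merged:
                break
    # Most frequent concepts first: these are the main topics.
    return sorted(concepts, key=lambda c: -c[1])


topics = detect_topics(
    ["wiki", "forum", "board", "chat", "talk", "wiki", "conversation"])
```

In this toy input, "chat", "conversation" and "talk" collapse into one concept because their synset lists overlap pairwise, even though "chat" and "talk" share no word directly.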



Detecting the Topics (2)






Assessing the Competencies
Graphics – evaluates the competency of each participant, starting from the chat topics (concepts represented as synsets)
Uses other criteria like the nature of the utterances: questions, agreements, references, etc. are treated differently
Parameters:
  Factors for references
  Bonuses for agreements, penalties for disagreements
  A minimum value that is awarded to any line in the chat
  Penalties for (dis-)agreement, as they present less originality



Value of an Utterance
The value of each utterance is computed by comparing it to an abstract utterance
Abstract utterance – built from the most important concepts identified in the chat; only the concepts with a frequency greater than a given threshold are considered
Every utterance in the chat is scored in the interval 0 – 100, by comparison to the abstract utterance
Synsets are used for every word
An utterance with a score of 0 does not contain any concept from the abstract one, and an utterance with a score of 100 contains all the concepts from the abstract one
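
A possible reading of this scoring, as a Python sketch. The slides fix only the two end points (0 and 100); the linear scale in between, the function names and the sample data are assumptions made for illustration.

```python
def abstract_utterance(topics, threshold):
    """Keep the synsets of the concepts whose frequency exceeds the threshold."""
    return [synset for synset, freq in topics if freq > threshold]


def utterance_value(tokens, abstract):
    """Score in [0, 100]: share of the abstract concepts found in the tokens.

    A concept counts as found when any word of its synset occurs in the
    utterance. The linear scale is an assumption, not stated in the slides.
    """
    if not abstract:
        return 0.0
    tokens = set(tokens)
    hits = sum(1 for synset in abstract if synset & tokens)
    return 100.0 * hits / len(abstract)


# Hypothetical topics: (synset, frequency) pairs.
sample_topics = [({"chat", "talk"}, 3), ({"wiki"}, 2), ({"forum", "board"}, 1)]
abstract = abstract_utterance(sample_topics, threshold=1)
```

With a threshold of 1, the abstract utterance keeps two concepts, so an utterance mentioning only "wiki" scores 50 and one mentioning both "talk" and "wiki" scores 100.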



Computing the Competencies
At the start of the conversation, each participant has a null competency.
For each utterance in the chat, the competency values are modified accordingly:
  The participant who issued the current utterance receives its score, possibly reduced if it is a (dis-)agreement;
  All the participants who are literally mentioned in the current utterance are rewarded with a percentage of its value;
  The participant who issued the utterance referred to by the current one is rewarded for an agreement and penalized for a disagreement, with a constant value;
  If the current utterance is not a (dis-)agreement, the participant who issued the utterance it refers to is rewarded with a fraction of the value of this utterance;
  If the current utterance has a score of 0, the issuer receives a minimum score (for participation).
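
The update rules above can be sketched in Python as follows. All numeric parameter values are hypothetical (the slides name the parameters but give no values), "this utterance" is read here as the current utterance, and the Utterance structure is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Utterance:
    author: str
    value: float                 # 0-100 score against the abstract utterance
    kind: str = "plain"          # "plain", "agree" or "disagree"
    ref: Optional[int] = None    # index of the referred utterance, if any
    mentions: List[str] = field(default_factory=list)  # participants named


# Hypothetical parameter values, chosen only for illustration.
PARAMS = {
    "opinion_factor": 0.5,    # reduction for (dis-)agreements
    "mention_share": 0.1,     # share given to participants named in the text
    "agree_bonus": 5.0,       # constant reward for being agreed with
    "disagree_penalty": 5.0,  # constant penalty for being disagreed with
    "ref_share": 0.2,         # share given to the author of a referred utterance
    "min_score": 1.0,         # awarded for participation when the value is 0
}


def competency_history(chat, p=PARAMS):
    """Return the competency of every participant after each utterance."""
    score = {u.author: 0.0 for u in chat}  # null competency at the start
    history = []
    for u in chat:
        if u.value == 0:
            own = p["min_score"]
        elif u.kind in ("agree", "disagree"):
            own = u.value * p["opinion_factor"]
        else:
            own = u.value
        score[u.author] += own
        for name in u.mentions:
            score[name] = score.get(name, 0.0) + p["mention_share"] * u.value
        if u.ref is not None:
            target = chat[u.ref].author
            if u.kind == "agree":
                score[target] += p["agree_bonus"]
            elif u.kind == "disagree":
                score[target] -= p["disagree_penalty"]
            else:
                score[target] += p["ref_share"] * u.value
        history.append(dict(score))  # snapshot after this utterance
    return history


sample_chat = [
    Utterance("A", 80.0),
    Utterance("B", 0.0, kind="disagree", ref=0),
    Utterance("C", 40.0, ref=0, mentions=["B"]),
]
history = competency_history(sample_chat)
```

Keeping one snapshot per utterance gives exactly the series needed to plot competency against utterance number.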



Competencies’ Graphics
Oy axis – Value of competency
Ox axis – The number of the utterance








Discovering Implicit Voices
We have explicit references
We want to discover more references
Why? Haste and lack of attention
The method
  A list of patterns, each consisting of a set of words (expressions) and a local subject called the referred word
  If an utterance matches one of the patterns, we determine which word in the utterance is the referred word (e.g. “I don’t agree with your assessment”)
  We search for this word in a predetermined number of the most recent previous utterances
  If we find this word in one of these utterances, then we have discovered an implicit relationship between the two utterances, the current one referring to the identified one
  During the identification process, the synsets of the words are used
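
A minimal Python sketch of this method, under stated assumptions: the two regular expressions are hypothetical examples of patterns (the actual pattern list is not given in the slides), the synonyms argument stands in for the synset lookup, and the window size of 5 is arbitrary.

```python
import re

# Hypothetical patterns: each regex captures the referred word in group 1.
PATTERNS = [
    re.compile(r"\bagree with (?:your|the) (\w+)", re.IGNORECASE),
    re.compile(r"\bwhat do you mean by (\w+)", re.IGNORECASE),
]


def find_implicit_reference(utterances, current, window=5, synonyms=None):
    """Return the index of the utterance implicitly referred to, or None."""
    synonyms = synonyms or {}
    for pattern in PATTERNS:
        match = pattern.search(utterances[current])
        if not match:
            continue
        word = match.group(1).lower()
        # The referred word and its synonyms are all acceptable matches.
        candidates = {word} | set(synonyms.get(word, ()))
        # Scan the most recent previous utterances first.
        for i in range(current - 1, max(current - window, 0) - 1, -1):
            tokens = set(re.findall(r"\w+", utterances[i].lower()))
            if candidates & tokens:
                return i
    return None


sample = [
    "My assessment is that wikis scale better",
    "lol",
    "brb",
    "I don't agree with your assessment",
]
found = find_implicit_reference(sample, 3)
```

Here the last utterance matches the first pattern, "assessment" becomes the referred word, and the search walks back to the first utterance.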



Discovering Implicit Voices (2)
There are a number of empirical methods
Examples
  Short agreement / disagreement, then B refers to A
    A – I think wikis are the best
    B – I disagree
  REF A, REF B – explicit and B – short (dis)agreement, then C implicitly refers to A (transitivity)
    A – I think wikis are the best
    (…)
    B – I disagree REF A
    (…)
    C – Maybe we should talk about them anyway REF B
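
The two example heuristics can be sketched in Python as below. The list of short opinions and the function name are assumptions, and the first rule is read here as "a short (dis)agreement without an explicit reference refers to the immediately preceding utterance".

```python
# Hypothetical list of short (dis)agreement expressions.
SHORT_OPINIONS = {"i agree", "i disagree", "yes", "no", "ok"}


def heuristic_references(texts, explicit):
    """Return implicit references as a set of (source, target) index pairs.

    explicit maps an utterance index to the index it explicitly refers to.
    """
    refs = dict(explicit)
    implicit = set()

    def is_short_opinion(i):
        return texts[i].strip().lower().rstrip(".!") in SHORT_OPINIONS

    # Rule 1: a short (dis)agreement refers to the previous utterance.
    for i in range(1, len(texts)):
        if is_short_opinion(i) and i not in refs:
            refs[i] = i - 1
            implicit.add((i, i - 1))

    # Rule 2 (transitivity): a reference to a short (dis)agreement also
    # reaches the utterance that the (dis)agreement refers to.
    for source, target in list(refs.items()):
        if is_short_opinion(target) and target in refs:
            implicit.add((source, refs[target]))
    return implicit


texts = [
    "I think wikis are the best",
    "I disagree",
    "Maybe we should talk about them anyway",
]
links = heuristic_references(texts, explicit={1: 0, 2: 1})
```

In the sample, both references are explicit, so only the transitive link from the third utterance to the first one is reported.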








Conversation Graph
The conversation is a graph
  Vertices = utterances
  Edges = references between utterances
The graph is directed and acyclic – can be topologically sorted
Using the graph:
  Segmentation of the chat in discussion threads
  Determining the strength of an utterance
  Graphical representation of the conversation
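
One way to segment such a graph into discussion threads is to take its connected components while ignoring edge direction. The slides do not say how threads are derived, so this Python sketch is an assumption rather than the authors' method.

```python
from collections import defaultdict


def discussion_threads(n, refs):
    """Group n utterances into threads.

    refs is a list of (source, target) edges, each meaning that utterance
    `source` refers to utterance `target`. Threads are taken to be the
    connected components of the graph, ignoring edge direction.
    """
    neighbours = defaultdict(set)
    for source, target in refs:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, threads = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collecting one component.
        stack, thread = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            thread.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        threads.append(sorted(thread))
    return threads


threads = discussion_threads(6, [(1, 0), (2, 0), (4, 3), (5, 2)])
```

Because references always point to earlier utterances, the chronological order of the utterances is itself a valid (reversed) topological order, so no separate sort is needed in this sketch.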



Utterances’ Strength
The importance of an utterance in a conversation can be computed using:
  Length
  The importance of the words
Another approach: an utterance is important if it influences the further evolution of the conversation
An important utterance is referenced by many further utterances
Thus, the importance can be considered as a measure of the strength of the utterance
The utterance is strong if it influences the rest of the conversation (like breaking news on TV)
Computed recursively:
  Utterance strength = 1 + param1 * number of references + param2 * sum of the references’ strength
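
A Python sketch of the strength formula. It reads "references" as the later utterances that refer to the current one; the values of param1 and param2 are hypothetical, since the slides do not give them.

```python
def utterance_strengths(n, refs, param1=0.5, param2=0.25):
    """Compute the strength of each of n utterances.

    refs is a list of (source, target) edges, each meaning that utterance
    `source` refers to the earlier utterance `target`.
    strength(u) = 1 + param1 * (number of utterances referring to u)
                    + param2 * (sum of the strengths of those utterances)
    """
    referrers = [[] for _ in range(n)]
    for source, target in refs:
        if source <= target:
            raise ValueError("references must point to earlier utterances")
        referrers[target].append(source)

    strength = [0.0] * n
    # Referring utterances come later, so walking backwards guarantees
    # their strengths are known before they are needed.
    for u in range(n - 1, -1, -1):
        strength[u] = (1
                       + param1 * len(referrers[u])
                       + param2 * sum(strength[r] for r in referrers[u]))
    return strength


strengths = utterance_strengths(3, [(1, 0), (2, 1)])
```

For a chain of three utterances, the first one ends up strongest because it collects both its direct reference and a share of the strength of the utterance that refers to it.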



Visual Representation



Conclusions
Socio-semantic data extracted from conversations:
  Discovery and visualisation of the discourse
  Determining important utterances
  Assessing the competencies
  Searching for references between utterances
Successfully integrated ideas and techniques from:
  Socio-cultural and dialogic paradigm
  Classical cognitive paradigm – ontologies and knowledge-based processing
  Natural language processing



Conclusions (2)
Machine learning for the automatic discovery of the rules that define implicit references
A chat annotation tool has been built
Started creating an annotated chat corpus to be used as a gold standard
Improving the method used to compute the competencies – integrating SNA techniques
Use domain ontologies and/or pLSA
Current and further work is part of the LTfLL FP7 project



Thank You!
