10
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 85–94, San Diego, California, June 16, 2016. c 2016 Association for Computational Linguistics Generating Clinically Relevant Texts: A Case Study on Life-changing Events Mayuresh Oak 1 , Anil K. Behera 3 , Titus P. Thomas 1 , Cecilia Ovesdotter Alm 2 , Emily Prud’hommeaux 2 , Christopher Homan 3 , Raymond Ptucha 1 1 Kate Gleason College of Engineering 2 College of Liberal Arts 3 Golisano College of Computing and Information Sciences Rochester Institute of Technology Rochester, NY 14623, USA {mso4106 |akb2701 |tpt7797 |coagla |emilypx |cmh ξ |rwpeec @rit.edu ξ @cs.rit.edu Abstract The need to protect privacy poses unique challenges to behavioral research. For in- stance, researchers often can not use exam- ples drawn directly from such data to ex- plain or illustrate key findings. In this re- search, we use data-driven models to synthe- size realistic-looking data, focusing on dis- course produced by social-media participants announcing life-changing events. We com- paratively explore the performance of distinct techniques for generating synthetic linguistic data across different linguistic units and top- ics. Our approach offers utility not only for reporting on qualitative behavioral research on such data, where directly quoting a partici- pant’s content can unintentionally reveal sen- sitive information about the participant, but also for clinical computational system devel- opers, for whom access to realistic synthetic data may be sufficient for the software devel- opment process. Accordingly, the work also has implications for computational linguistics at large. 1 Introduction Behavioral research using personal data, such as that from social media or clinical studies, must continu- ally balance insights gained with respect for privacy. Ethical and legal demands also come into play. De- identification involves removing information such as named entities, address-specific information and so- cial security numbers. However, naive approaches are often prone to privacy attacks. Such de-identified data will often still contain information that, when combined with other data from different resources, can point to the individual who generated it. For example, if a de-identified dataset contains detailed demographic information, it could then be possible to extract a small list of people matching this infor- mation and to identify a specific person using other, publicly available data. One approach that strikes a good balance is to synthesize realistic-looking data with the same sta- tistical properties as actual data. Our contribution is to compare different techniques for synthesizing behavioral data. Specifically, we explore this prob- lem in a case study with social media texts that in- volve social media participants making announce- ments about life-changing events, which are per- sonal in nature and which also may affect, positively or negatively, a person’s well-being. Two immediate applications to clinical research that motivate this approach are: qualitative results reporting involving textual data and data access is- sues for software development purposes. Neither readers of scientific reports nor software developers need access to the original data as long as realistic looking synthetic data is available. 2 Related Work In the clinical setting, data privacy is important. Anonymization aims to ensure that data is untrace- able to an original user, whereas de-identification may allow the data to be traced back to a user with third-party information. Szarvas et al. (2007) developed a model for anonymizing personal health information (PHI) from discharge records. The model identifies PHIs 85

Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 85–94,San Diego, California, June 16, 2016. c©2016 Association for Computational Linguistics

Generating Clinically Relevant Texts:A Case Study on Life-changing Events

Mayuresh Oak1, Anil K. Behera3, Titus P. Thomas1, Cecilia Ovesdotter Alm2,Emily Prud’hommeaux2, Christopher Homan3, Raymond Ptucha1

1Kate Gleason College of Engineering2College of Liberal Arts

3Golisano College of Computing and Information SciencesRochester Institute of Technology

Rochester, NY 14623, USA{mso4106†|akb2701†|tpt7797†|coagla†|emilypx†|cmhξ|rwpeec†

†@rit.edu ξ@cs.rit.edu

Abstract

The need to protect privacy poses uniquechallenges to behavioral research. For in-stance, researchers often can not use exam-ples drawn directly from such data to ex-plain or illustrate key findings. In this re-search, we use data-driven models to synthe-size realistic-looking data, focusing on dis-course produced by social-media participantsannouncing life-changing events. We com-paratively explore the performance of distincttechniques for generating synthetic linguisticdata across different linguistic units and top-ics. Our approach offers utility not only forreporting on qualitative behavioral research onsuch data, where directly quoting a partici-pant’s content can unintentionally reveal sen-sitive information about the participant, butalso for clinical computational system devel-opers, for whom access to realistic syntheticdata may be sufficient for the software devel-opment process. Accordingly, the work alsohas implications for computational linguisticsat large.

1 Introduction

Behavioral research using personal data, such as thatfrom social media or clinical studies, must continu-ally balance insights gained with respect for privacy.Ethical and legal demands also come into play. De-identification involves removing information such asnamed entities, address-specific information and so-cial security numbers. However, naive approachesare often prone to privacy attacks. Such de-identifieddata will often still contain information that, when

combined with other data from different resources,can point to the individual who generated it. Forexample, if a de-identified dataset contains detaileddemographic information, it could then be possibleto extract a small list of people matching this infor-mation and to identify a specific person using other,publicly available data.

One approach that strikes a good balance is tosynthesize realistic-looking data with the same sta-tistical properties as actual data. Our contributionis to compare different techniques for synthesizingbehavioral data. Specifically, we explore this prob-lem in a case study with social media texts that in-volve social media participants making announce-ments about life-changing events, which are per-sonal in nature and which also may affect, positivelyor negatively, a person’s well-being.

Two immediate applications to clinical researchthat motivate this approach are: qualitative resultsreporting involving textual data and data access is-sues for software development purposes. Neitherreaders of scientific reports nor software developersneed access to the original data as long as realisticlooking synthetic data is available.

2 Related Work

In the clinical setting, data privacy is important.Anonymization aims to ensure that data is untrace-able to an original user, whereas de-identificationmay allow the data to be traced back to a user withthird-party information.

Szarvas et al. (2007) developed a model foranonymizing personal health information (PHI)from discharge records. The model identifies PHIs

85

Page 2: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

Figure 1: Top level view of the proposed anonymiza-tion system. Data is fed to a model which here is acharacter-based Long-Short Term Memory (LSTM).The LSTM generates new tweets based on the inputdata.

in several steps and labels all entities which can betagged from the text structure. It then queries foradditional PHI phrases in the text with help fromtagged PHI entities.

Bayardo and Agrawal (2005) present improved k-anonymity methods and provide efficient algorithmsfor data dimensionality reduction. However, evenif information such as names of people or providersor quasi-identifiers (QIs) are removed, there are stillways to compare the de-identified data with other

records having these identifiers.In contrast to traditional anonymization and de-

identification methods, generation of synthetic datacan handle various aspects of hiding individuals, byaggregating and severing data from individual users,yet maintaining the statistical properties of the dataused to train generation models. For this paper weexplore several forms of data generation, using so-cial media (Twitter) data about life-changing eventsas a case study. For example, Twitter data has beenused for studying important life-changing events(De Choudhry et al., 2013; Li et al., 2014). Otherstudies present methods for anonymizing Twitterdatasets. Terrovitis et al. (2008) model social me-dia as an undirected, unlabeled graph which does re-tain privacy of social media users. Daubert et al.(2014) discuss the different methods for anonymiza-tion of Twitter data. However, there is a lack of workthat addresses synthetic data creation using machinegeneration models.

This paper compares traditional statistical lan-guage models and Long Short Term Memory(LSTM) models to learn models from a training setof Twitter data to generate synthetic tweets. LSTMsare recurrent neural networks designed to learn bothlong and short term temporal sequences. These net-works were introduced by Hochreiter and Schmid-huber (1997), with several improvements over theyears, the most common of which include individualgating elements (Graves and Schmidhuber, 2005).LSTMs have been shown to perform at state-of-the-art levels for many tasks, including handwritingrecognition and generation, language modeling, andmachine translation (Greff et al., 2015).

3 Data

Twitter is a microblogging platform used by peo-ple to post about their lives. If harnessed properly,tweets can be used for analysis and research of be-havioral patterns as well as in studying health infor-mation.

We collected tweets using Twitter’s streamingAPI along with customized query strings. Thesequeries targeted the life-changing events of birth,death, marriage, and divorce. The tweet collec-tion process suggested that users were more likelyto share joyful news about marriage and birth, and

86

Page 3: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

Table 1: Birth patterns

birth of baby/brother/son/daughter/brother/sisterparents of baby/son/daughter/boy/girl/angelarrival of baby/brother/son/daughter/sister/angeljust gave birth to baby/son/daughter/boy/girlweigh/weighing #Number lbs/poundsits a boy/girlpregnant/c-section

Table 2: Marriage patterns

I’m/we are getting/sister/brother/mother marriedfriend/uncle/aunt is getting marriedI/we/sister/brother/friend/uncle/aunt got married

Table 3: Death patterns

RIP mom/mama/dad/father/grandmother/brother/RIP grandpa/grandfather/sister/friendhe/mom/mama/dad/father passed awaygrandfather/grandpa/grandma passed awaybrother/sister/friend passed away

less likely to share difficult news about death and di-vorce. Tweets on divorce were particularly scarce,so this event was ignored as the study continued.

The pool of tweets came from a collection oftweets from a mid-sized city in the US North East in2013 as well as streaming tweets irrespective of lo-cation from early 2016. Roughly 18 million tweetswere collected, including tweets for the three afore-mentioned categories of birth, death, and marriage.Only the text of the tweets was utilized for this study.

After inspecting the data, we formulated a set oflexical keywords, phrases and regular expressionsto collect tweets by category. These reflected top-ical patterns, such as announcements of marriage orbirth in the family, the weight of the newborn babyor whether it is a girl or a boy, or the passing of afriend or family member. Table 1 shows the patternsused to extract tweets about birth. Similarly, Table2 shows the patterns for marriage, and Table 3 fordeath. We attempted to remove tweets about celebri-

ties, TV shows, news stories, and jokes. After filter-ing, we selected and hand-annotated for each cate-gory a set of 2000 tweets. For comparison’s sake wealso chose randomly 2000 (unlabeled) tweets fromthe data, and call this the general category. Notethat any tweet could be present in this category, in-cluding those from the first three categories.

We replaced Twitter usernames with the token@USER, while URL links, retweets, and emoti-cons were replaced with the keywords URL, RT, andEMOT, respectively. We removed the pound signsfrom hashtags to make it look more like generalwritten language and to reduce the dictionary sizeof the word-based language models.

For the character-based models, we performed thefollowing further steps. We separated each characterin the input data by a space and replaced the usualspace characters with <space>. We considered thetags introduced in the earlier pre-processing phase(e.g. - @USER) to be unique characters. On output,we replaced all space characters with the null stringand replace the space tag <space> with the spacecharacter.

Tables 4 through 6 show samples of collectedtweets.

Table 4: Birth tweets

She gave birth to the baby aww congratsloulou @USERbirth of Baby Tyler (They picked my babyname suggestion )

Table 5: Death tweets

my grandpa passed away today All I hope isthat things get better@USER my grandma passed away

Table 6: Marriage tweets

me and @USER just got marriedwe getting married

87

Page 4: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

4 Methods

4.1 Long-Short Term MemoryRecurrent neural networks (RNN) are popular mod-els that have shown great potential in many naturallanguage processing (NLP) tasks. LSTMs (Hochre-iter and Schmidhuber, 1997; Graves and Schmidhu-ber, 2005) are a specific subset of RNNs that havebeen modified to be especially good at condition-ing on both long and short term temporal sequences.LSTMs modify the standard design of neural net-works in several ways: they eliminate the strict re-quirement that neurons only connect to other neu-rons in succeeding layers (adding recurrence), con-vert the standard neuron into a more complex mem-ory cell, and add non-linear gating units which serveto govern the information flowing out of and recur-sively flowing back into the cell (Greff et al., 2015).The memory cell differentiates itself from a simpleneuron by including the ability to remember its stateover time; this coupled with gating units gives theLSTM the ability to recognize important long-termdependencies while simultaneously forgetting unim-portant collocations.

Figure 2: A single LSTM memory block. The threegates govern the input node and memory cell to al-low long term memory. The function ϕ is the tanhfunction and the function σ is the sigmoid function.

The LSTM we use here, as implemented byKarpathy (2015) modifies the original architectureby removing peephole connections. The intuitiveunderstanding of the components in an LSTM mem-ory block can be summarized as:

1. Input node: Also known as input modulationgate or new memory gate, takes the input and

the past hidden state to summarize the new in-put in light of the past context from ht−1.

2. Input gate: Also known as write gate, takes theinput and the past hidden state to determine theimportance of the current input as it effects thecell.

3. Forget gate: Also known as reset gate, takesthe input and the past hidden state and givesthe provision for the hidden layer to discard orforget the historical data.

4. Output gate: Takes the input and the past hid-den state and determines what parts of the celloutput ct need to be present in the new hiddenstate ht for the next timestep.

5. Memory cell: Takes advice from the forgetgate and governed Input Node to determine theusefulness of the previous memory ct−1 to pro-duce the new memory ct.

The functionality above describes only how a sin-gle LSTM memory block works, analogous to a sin-gle neuron in a regular neural network. To create anLSTM which learns, hundreds of these blocks arecombined in a single layer (analogous to hundredsof nodes in a hidden layer), with the hidden output,ht,ct of one block feeding into the input of another.Further complexity (and learning power) is added byincluding multiple layers of LSTM memory blocks.The final output of LSTM memory blocks (or inputsfrom one layer to the next) are provided by calculat-ing yt = Wyf(ht), where Wy is an output weightmatrix to learn and f(·) is an activation functionwhich can vary depending on use case.

The input, xt, to an LSTM memory block differsdepending on implementation and use case. Whenusing LSTMs for NLP, the input can be word orcharacter-based. The LSTM used in this research(Karpathy, 2015), takes as input a vector represent-ing an individual data item (character/word) and pre-dicts the most probable data item given the currentdata item and the LSTM’s previous states. Training,therefore, is done by taking an example sequence ofdata items, predicting the next data item using thecurrent weights, calculating the difference betweenwhat was predicted and what should have been pre-dicted, and back propagating this difference to up-

88

Page 5: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

date the weights. All LSTM models were trained for500 epochs and sequence length of 50, where the se-quence length is the length of time the LSTM cell isunrolled per iteration. Two LSTM layers were usedto train the model on the input data. Each LSTMlayer had 512 hidden nodes. Language generationcan be performed after training, in which the LSTMis given either a starting sequence of data items (or itcalculates the most probable sequence to start with),and then generates new data items based on its ownpredictions in previous time steps.

4.2 Standard N-gram Language Models

In order to demonstrate the particular utility ofLSTMs for generating realistic tweets, the outputof our character- and word-based LSTM methodswas compared to that of standard n-gram backofflanguage models. Such models are widely used tomodel the probability of word sequences for manyNLP applications, including machine translation,automatic speech recognition, and part-of-speechtagging. The SRI Language Modeling Toolkit(SRILM) was used to build 4-gram word- andcharacter-based language models (Stolcke, 2002).Using these models, we then generate synthetictweets using the OpenGRM Ngram library (Roarket al., 2012).

4.3 Experimental Design

For each event category, we divided the dataset of2000 tweets into 1800 training and 200 testing in-stances. We used the machine translation qualitymetric BLEU (Papineni et al., 2002) to measure thesimilarity between machine generated tweets andthe held out tests sets. For each model, we gener-ated ten sets of 200 tweets. We calculated BLEUscores (without the brevity penalty) using the full200-tweet test set as the reference for each candidatetweet and report the average of the BLEU scores ofall ten sets of tweets generated by a given model.

To gain further insight into the effectiveness of themachine generated data, we asked human annota-tors to evaluate the generated tweets. We selected800 tweets by randomly sampling: 400 human gen-erated tweets (100 from each category), and 400 ma-chine generated tweets. The 400 machine gener-

Table 7: Mean BLEU scores and their standard de-viation over ten generated test sets of 200 tweets permodel, by topic, model, and linguistic unit.

Topic Model BLEU

Birth

LSTM char 34.61 ±2.53word 32.36 ±2.21

LM char 12.15 ±0.63word 32.01±0.96

Marriage

LSTM char 31.14±2.30word 26.22±0.77

LM char 12.54±1.08word 32.26±0.96

Death

LSTM char 20.16±2.78word 17.84±9.45

LM char 6.04±0.62word 16.93±0.67

General

LSTM char 40.55±4.27word 17.62±2.17

LM char 5.46±0.60word 44.74±1.33

ated tweets consisted of 25 tweets for each combi-nation of model (LM-char, LM-word, LSTM-char,LSTM-word) and category (birth, marriage, death,general). For each tweet, the annotators indicatedif they thought the tweet was generated by a humanor machine, and they rated the quality of the tweeton the basis of syntax and semantics. Also, they in-dicated which topic category they thought the tweetbelonged to.

5 Results

BLEU, a measure of n-gram precision widely usedto evaluated machine translation output, was usedto objectively evaluate the similarity between thehuman-generated tweets and the synthetic tweetsproduced by our models. Table 7 shows the BLEUscores for each combination of topic, model, andlinguistic unit. The character-based LSTM mod-els and the word-based LM models both performvery strongly, with each reporting the highest BLEUscore in two of the four topics. We further note thatthe character-based LSTM always outperforms theword-based LSTM. Although it might be surprisingthat a character-based model would produce highervalues for a word n-gram precision metric such as

89

Page 6: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

Table 8: The percent of instances where the four human annotators (A1 - A4) were deceived into thinking asynthetic tweet was human generated. The values in bold are the best performing models for each categoryby annotator. (B = birth, D = death , M = marriage , B = general ).

Model A1 A2 A3 A4B D M G B D M G B D M G B D M G

LM-char 14 0 0 0 14 0 20 0 40 52 28 16 16 8 8 12LM-word 18 8 21 10 18 8 50 40 80 80 88 72 32 32 48 44LSTM-char 45 25 44 0 36 25 56 22 60 72 68 64 40 44 40 44LSTM-word 33 38 30 11 44 25 40 11 76 72 48 44 36 36 24 28

BLEU, we suspect this is due to the fact that thelarge feature space of the word-based model in com-bination with the relatively small number of trainingtweets (roughly 1800) is not optimal for learning anLSTM model.

5.1 Human evaluation

A randomized set of 800 tweets, both real and syn-thetic, from all four topic categories was submittedto a panel of annotators (co-authors). Each annota-tor was asked to decide whether the tweet was real

Figure 3: Syntax score by annotator. Higher scoressuggest more satisfactory generation of syntacticstructures (median = red line, mean = dashed line).

Figure 4: Semantics score by annotator. Higherscores suggest more satisfactory generation of se-mantic contents (median = red line, mean = dashedline).

(i.e., produced by a human) or synthetic (i.e., gener-ated by one of the LSTM or n-gram language mod-els). Each tweet was also rated in terms of its syntaxand semantics on a five point Likert scale. In addi-tion, the annotators were asked to select the intendedtopic category (birth, death, marriage, or general) ofthe tweet.

Figure 5 shows the ability of human annotatorsto accurately identify a tweet’s topic. In general,the annotators were able to identify the topic of thehuman tweets, with the weakest performance in thegeneral category. Identifying the intended topic ofthe synthetic tweets was more challenging for the

90

Page 7: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

Figure 5: Ability of four human annotators (A1 - A4) to predict the topic category of the data from whicha tweet was generated, per model. The top left panel reflects results for human-composed tweets, whereasthe top right panel shows results across synthetic tweets, corresponding to the four models in subsequentpanels. (Dark blue = birth , light blue = death , green = marriage , yellow = general ).

annotators, but accuracy was quite high in all topicsother than general. We note that the general cat-egory was not filtered to remove tweets that couldhave belonged to the other topics, which could ex-plain this discrepancy.

Figures 3 and 4 show the distribution of eachannotator’s syntax and semantics scores for eachmodel. These boxplots show that there was signif-icant variance in the annotators’ evaluation of thesyntactic and semantic quality of the tweets. Wenote, however, that the models yielding the high-est BLEU scores, char-LSTM and word-LM, tendedto receive more favorable scores for syntactic andsemantic quality. The character-based LM model,whose BLEU scores were significantly lower thanother models, consistently received the most unsatis-factory evaluation of syntactic and semantic qualityby all four annotators. It also seems that the LSTMmodels produce output that is more consistent in its

semantic and syntactic quality, with smaller annota-tor to annotator variance than the LM models.

With regard to Figure 5, Annotators 1 and 2 rated283 (selected randomly) tweets, while Annotators3 and 4 rated all 800 tweets; and with regard toFigures 3, 4, and Table 8, all annotators rated 283tweets. Annotators 1 and 2 have an academic back-ground in linguistics, while the other two annota-tors do not have prior linguistic training, perhaps ex-plaining why annotators 1 and 2 generally were bet-ter able to identify the topic category. Annotators 1and 2 tended to have similar distributions of seman-tic and syntactic quality scores across models, whichagain is likely related to their previous training inlinguistics and linguistic annotation. Annotator 4may have been less forgiving about non-standardlanguage use in the human-composed tweets, whileannotator 3 was more tolerant of the syntax and se-mantics of machine-generated tweets.

91

Page 8: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

Table 9: Synthetic tweets marked as human gener-ated by all four annotators.

Congrats to @USER and her husband on thebirth of their son Welcome to the Cyclonefamily, Eally Kinglan URL URL (CharLSTM Generated)@USER congratulations on birth of yourson,20 days,ago,URL (Word LM Generated)@USER @USER @USER,looks like we’regetting hitched in June URL (Word LMGenerated)Im getting married in 17 days death (CharLSTM Generated)RT @USER rip grandma 2 8 16 (Word LMGenerated)

Table 10: Synthetic tweets marked as synthetic byall four annotators.

RT @USER The new part prigials give birthto bely son Junt and I’m delined a hapeproud (Char LSTM Generated)I’m so sorry for your loss and world harrygotting to my funeral it was without URL(Word LM Generated)

Table 8 shows the percent of instances a humanannotator marked a synthetic tweet as human gener-ated. Table 9 shows some of the tweets that weregenerated by language models but were identifiedby all four annotators as human generated. A fewexample tweets that were correctly identified by allfour annotators as synthetic tweets are displayed inTable 10.

6 Conclusion

We have discussed generating synthetic data in thecontext of readers of scientific reports or softwaredevelopers. In addition, one potential clinical appli-cation might be to apply this to patient transcriptsso that they could be shown to other patients suf-fering from similar problems, e.g., for anonymizedvirtual group therapy. Such an approach might beespecially useful in rural and developing regions,where clinical resources are sparse. Anonymiza-tion of data in research is often necessary to pro-tect patient or user identity. This research explores

data-driven models to generate realistic-looking dis-course with the same statistical properties as a train-ing corpus. Specifically, this research explores thesynthetic generation of tweets, contrasting LM andLSTM models, character-based and word-based lin-guistic units, and the topic categories of birth, death,and marriage. Based on the results from objectiveBLEU scores and subjective human evaluation, theword-based LM and char-based LSTM models per-formed well, deceiving annotators 41 and 43 per-cent of the time on average into thinking a synthetictweet was human generated. This research showspromising evidence that the synthetic generation ofuser data may be preferred to existing techniquesof naive anonymization which can potentially leadto user identification through combination of demo-graphic data mining and ancillary metadata.

ReferencesPuneet Agarwal, Rajgopal Vaithiyanathan, Saurabh

Sharma, and Gautam Shroff. 2012. {Catchingthe Long-Tail: Extracting Local News Events fromTwitter}. Proceedings of the Sixth International AAAIConference on Weblogs and Social Media.

Roberto J Bayardo and Rakesh Agrawal. 2005. Data pri-vacy through optimal k-anonymization. In Data Engi-neering, 2005. ICDE 2005. Proceedings. 21st Interna-tional Conference on, pages 217–228. IEEE.

Edward Benson, Aria Haghighi, and Regina Barzilay.2011. Event discovery in social media feeds. In Pro-ceedings of the 49th Annual Meeting of the Associa-tion for Computational Linguistics: Human LanguageTechnologies-Volume 1, pages 389–398. Associationfor Computational Linguistics.

Deepayan Chakrabarti and Kunal Punera. 2011. EventSummarization Using Tweets. ICWSM, 11:66–73.

J. Daubert, L. Bock, P. Kikirasy, M. Muhlhauser, andM. Fischer. 2014. Twitterize: Anonymous Micro-blogging. In Computer Systems and Applications(AICCSA), 2014 IEEE/ACS 11th International Confer-ence on, pages 817–823, Nov.

Munmun De Choudhry, Scott Counts, and Eric Horvitz.2013. Major life changes and behavioral markers insocial media: Case of childbirth. In Proceedings ofthe 16th ACM Conference on Computer Supported Co-operative Work (San Antonio, TX, USA, Feb 23-27,2013). CSCW 2013. ACM.

Munmun De Choudhury, Scott Counts, and Eric Horvitz.2013. Major life changes and behavioral markers insocial media: case of childbirth. In Proceedings of the

92

Page 9: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

2013 conference on Computer supported cooperativework, pages 1431–1442. ACM.

Barbara Di Eugenio, Nick Green, and Rajen Subba.2013. Detecting life events in feeds from twitter. In2013 IEEE Seventh International Conference on Se-mantic Computing, pages 274–277. Ieee.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadar-rama, Marcus Rohrbach, Subhashini Venugopalan,Kate Saenko, and Trevor Darrell. 2015. Long-termrecurrent convolutional networks for visual recogni-tion and description. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition,pages 2625–2634.

Lilian Edwards and Andrea M Matwyshyn. 2013. Twit-ter (R) evolution: privacy, free speech and disclosure.In Proceedings of the 22nd international conferenceon World Wide Web companion, pages 745–750. Inter-national World Wide Web Conferences Steering Com-mittee.

Rumi Ghosh, Tawan Surachawala, and KristinaLerman. 2011. Entropy-based classificationof’retweeting’activity on twitter. arXiv preprintarXiv:1106.0346.

Alex Graves and Jurgen Schmidhuber. 2005. Framewisephoneme classification with bidirectional LSTM andother neural network architectures. Neural Networks,18(5):602–610.

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutnık,Bas R Steunebrink, and Jurgen Schmidhuber. 2015.LSTM: A Search Space Odyssey. arXiv preprintarXiv:1503.04069.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Longshort-term memory. Neural computation, 9(8):1735–1780.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descrip-tions. In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition, pages 3128–3137.

Andrej Karpathy. 2015. Char-RNN: Multi-layer re-current neural networks (LSTM, GRU, RNN) forcharacter-level language models in torch. https://github.com/karpathy/char-rnn. Accessed:2015-07-17.

Ninghui Li, Tiancheng Li, and Suresh Venkatasubrama-nian. 2007. t-closeness: Privacy beyond k-anonymityand l-diversity. In Data Engineering, 2007. ICDE2007. IEEE 23rd International Conference on, pages106–115. IEEE.

Jiwei Li, Alan Ritter, Claire Cardie, and Eduard H Hovy.2014. Major Life Event Extraction from Twitterbased on Congratulations/Condolences Speech Acts.In EMNLP, pages 1997–2007.

Shirin Nilizadeh, Apu Kapadia, and Yong-Yeol Ahn.2014. Community-enhanced de-anonymization of on-line social networks. In Proceedings of the 2014 ACMSIGSAC Conference on Computer and Communica-tions Security, pages 537–548. ACM.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic eval-uation of machine translation. In Proceedings of the40th annual meeting on association for computationallinguistics, pages 311–318. Association for Computa-tional Linguistics.

Rene Pickhardt, Thomas Gottron, Martin Korner,Paul Georg Wagner, Till Speicher, and Steffen Staab.2014. A generalized language model as the combi-nation of skipped n-grams and modified kneser-neysmoothing. arXiv preprint arXiv:1404.3377.

Jacob Ratkiewicz, Michael Conover, Mark Meiss,Bruno Goncalves, Alessandro Flammini, and FilippoMenczer. 2011. Detecting and Tracking PoliticalAbuse in Social Media. In ICWSM.

Brian Roark, Richard Sproat, Cyril Allauzen, MichaelRiley, Jeffrey Sorensen, and Terry Tai. 2012. TheOpenGrm open-source finite-state grammar softwarelibraries. In Proceedings of the ACL 2012 SystemDemonstrations.

Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo.2010. Earthquake Shakes Twitter Users: Real-timeEvent Detection by Social Sensors. In Proceedings ofthe 19th International Conference on World Wide Web,WWW ’10, pages 851–860, New York, NY, USA.ACM.

Pierangela Samarati and Latanya Sweeney. 1998. Gen-eralizing data to provide anonymity when disclosinginformation. In PODS, volume 98, page 188.

Priya Sidhaye and Jackie Chi Kit Cheung. 2015. Indica-tive Tweet Generation: An Extractive SummarizationProblem? Proceedings of the 2015 Conference onEmpirical Methods in Natural Language Processing,pages 138–147.

Amardeep Singh, Divya Bansal, and Sanjeev Sofat.2014. An approach of privacy preserving based pub-lishing in twitter. In Proceedings of the 7th Interna-tional Conference on Security of Information and Net-works, page 39. ACM.

Richard Socher, Milad Mohammadi, and Rohit Mundra.Spring 2015. Cs 224d: Deep learning for NLP.http://cs224d.stanford.edu/lecture notes/LectureNotes4.pdf.

Andreas Stolcke. 2002. SRILM – An Extensible Lan-guage Modeling Toolkit. In Proceedings of the Inter-national Conference on Spoken Language Processing,volume 2, pages 901–904.

Gyorgy Szarvas, Richard Farkas, and Robert Busa-Fekete. 2007. State-of-the-art anonymization of med-

93

Page 10: Generating Clinically Relevant Texts: A Case Study on Life ...hollingk.github.io/CLPsych/pdf/CLPsych9.pdf · cial media (Twitter) data about life-changing events as a case study

ical records using an iterative machine learning frame-work. Journal of the American Medical InformaticsAssociation, 14(5):574–580.

Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis.2008. Privacy-preserving anonymization of set-valueddata. Proceedings of the VLDB Endowment, 1(1):115–125.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-HaoSu, David Vandyke, and Steve Young. 2015. Seman-tically conditioned lstm-based natural language gen-eration for spoken dialogue systems. arXiv preprintarXiv:1508.01745.

Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, BaileShi, and Ada Wai-Chee Fu. 2006. Utility-basedAnonymization Using Local Recoding. In Proceed-ings of the 12th ACM SIGKDD International Confer-ence on Knowledge Discovery and Data Mining, KDD’06, pages 785–790, New York, NY, USA. ACM.

94