
Conversation Modeling on Reddit Using a Graph-Structured LSTM

Victoria Zayats
Electrical Engineering Department
University of Washington
[email protected]

Mari Ostendorf
Electrical Engineering Department
University of Washington
[email protected]

Abstract

This paper presents a novel approach for modeling threaded discussions on social media using a graph-structured bidirectional LSTM (long short-term memory) which represents both hierarchical and temporal conversation structure. In experiments with a task of predicting popularity of comments in Reddit discussions, the proposed model outperforms a node-independent architecture for different sets of input features. Analyses show a benefit to the model over the full course of the discussion, improving detection in both early and late stages. Further, the use of language cues with the bidirectional tree state updates helps with identifying controversial comments.

1 Introduction

Social media provides a convenient and widely used platform for discussions among users. When the comment-response links are preserved, those conversations can be represented in a tree structure where comments represent nodes, the root is the original post, and each new reply to a previous comment is added as a child of that comment. Some examples of popular services with tree-like structures include Facebook, Reddit, Quora, and StackExchange. Figure 1 shows an example conversation on Reddit, where bigger nodes indicate higher upvoting of a comment.1

1 The tool https://whichlight.github.io/reddit-network-vis was used to obtain this visualization.

Figure 1: Visualization of a sample thread on Reddit.

In services like Twitter, tweets and their retweets can also be viewed as forming a tree structure. When time stamps are available with a contribution, the nodes of the tree can be ordered and annotated with that information. The tree structure is useful for seeing how a discussion unfolds into different subtopics and showing differences in the level of activity in different branches of the discussion.

Predicting popularity of comments in social media is a task of growing interest. Popularity has been defined in terms of the volume of the response, but when the social media platform has a mechanism for readers to like or dislike comments (or, upvote/downvote), then the difference in positive/negative votes provides a more informative score for popularity prediction.


Transactions of the Association for Computational Linguistics, vol. 6, pp. 121–132, 2018. Action Editor: Ani Nenkova. Submission batch: 11/2016; Revision batch: 3/2017; Published 2/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Figure 2: An example of model propagation in a graph-structured LSTM. Node names are shown in chronological order, e.g., comment t1 was made earlier than t2. (a) Propagation of the graph-structured LSTM in the forward direction: blue arrows represent hierarchical propagation, green arrows represent timing propagation. (b) Backward hierarchical (blue) and timing (green) propagation of the graph-LSTM.

This definition of popularity, which has also been called community endorsement (Fang et al., 2016), is the task of interest in our work on tree-structured modeling of discussions.

Previous studies found that the time when the comment/post was published has a big impact on its popularity (Lakkaraju et al., 2013). In addition, the number of immediate responses can be predictive of the popularity, but some comments with a high number of replies can be either controversial or have a highly negative score. Language should be extremely important for distinguishing these cases. Indeed, community style matching is shown to be correlated to comment popularity in Reddit (Tran and Ostendorf, 2016). However, learning useful language cues can be difficult due to the low frequency of these events and the dominance of time, topic and other factors. Thus, in several prior studies, authors constrained the problem to reduce the effect of those factors (Lakkaraju et al., 2013; Tan et al., 2014; Jaech et al., 2015). In this study, we have no such constraints, but attempt to use the tree structure to capture the flow of information in order to better model the context in which a comment is submitted, including both the history it responds to as well as the subsequent response to that comment.

To capture discussion dynamics, we introduce a novel approach to modeling the discussion using a bidirectional graph-structured LSTM, where each comment in the tree corresponds to a single LSTM unit. In one direction, we capture the prior history of contributions leading up to a node, and in the other, we characterize the response to that comment. Motivated by prior findings that both response structure and timing are important in predicting popularity (Fang et al., 2016), the LSTM units include both hierarchical and temporal components to the update, which distinguishes this work from prior tree-structured LSTM models. We assess the utility of the model in experiments on popularity prediction with Reddit discussions, comparing to a neural network baseline that treats comments independently but leverages information about the graph context and timing of the comment. We analyze the results to show that the graph LSTM provides a useful summary representation of the language context of the comment.

As in Fang et al. (2016), but unlike other work (He et al., 2016), our model makes use of the full discussion thread in predicting popularity. While knowledge of the full discussion is only useful for post-hoc analysis of past discussions, it is reasonable to consider initial responses to a comment, particularly given that many responses occur within minutes of someone posting a comment. Comments are often popular because of witty analogies made, which requires knowledge of the world beyond what is captured in current models. Responses to these comments, as well as to controversial comments, can improve popularity prediction. Responses of others clearly influence the likelihood of someone to like or dislike a comment, but also whether they even read a comment. By introducing a forward-backward tree-structured model, we provide a mechanism for leveraging early responses in predicting popularity, as well as a framework for better understanding the relative importance of these responses.

The main contributions of this paper include: a novel approach for representing tree-structured language processes (e.g., social media discussions) with LSTMs; evaluation of the model on the popularity prediction task using Reddit discussions; and analysis of the performance gains, particularly with respect to the role of language context.

2 Method

The proposed model is a bidirectional graph LSTM that characterizes a full threaded discussion, assuming a tree-structured response network and accounting for the relative order of the comments. Each comment in a conversation corresponds to a node in the tree, where its parent is the comment that it is responding to and its children are the responding comments that it spurs, ordered in time. Each node in the tree is represented with a single recurrent neural network (RNN) unit that outputs a vector (embedding) that characterizes the interim state of the discussion, analogous to the vector output of an RNN unit which characterizes the word history in a sentence. In the forward direction, the state vector can be thought of as a summary of the discussion pursued in a particular branch of the tree, while in the backward direction the state vector summarizes the full response subtree that followed a particular comment. The state vectors for the forward and backward directions are concatenated for the purpose of predicting comment karma. The RNN updates – both forward and backward – incorporate both temporal and hierarchical (tree-structured) dependencies, since commenters typically consider what has already been said in response to a parent comment. Hence, we refer to it as a graph-structured RNN rather than a tree-structured RNN. Figures 2(a) and 2(b) show an example of the state connections associated with hierarchical and timing structures for the forward and backward RNNs, respectively.

The supervision signal in training will impact the character of the state vector, and the forward and backward state sub-vectors are likely to capture different phenomena. Here, the objective is to predict quantized comment karma. We anticipate that the forward state will capture relevance and informativeness of the comment, and the backward process will capture sentiment and richness of the ensuing discussion.

The specific form of the RNN used in this work is an LSTM. The detailed implementation of the model is described in the sections to follow.

2.1 Graph-structured LSTM

Each node in the tree is associated with an LSTM unit. The input x_t is an embedding that can incorporate both comment text and local submission context features associated with thread structure and timing, described further in section 2.2. The node state vector h_t is generated using a modification of the standard LSTM equations to include both hierarchical and timing structures for each comment. Specifically, we use two forget gates: one for the previous (or subsequent) hierarchical layer, and one for the previous (or subsequent) timing layer.

In order to describe the update equations, we introduce notation for the hierarchical and timing structure. In Figure 2, the nodes in the tree are numbered in the order that the comments are contributed in time. To characterize graph structure, let π(t) denote the parent of t and κ(t) its first child. Time structure is represented only among a set of siblings: p(t) is the sibling predecessor in time, and s(t) is the sibling successor. The pointers κ(t), p(t) and s(t) are set to ∅ when t has no child, predecessor, or successor, respectively. For example, in Figure 2(a), the node t2 will have π(t2) = t1, κ(t2) = t4, p(t2) = ∅ and s(t2) = t3, and the node t3 will have π(t3) = t1, κ(t3) = ∅, p(t3) = t2 and s(t3) = t5.

Below we provide the update equations for the forward process, using the subscripts i, f, g, c, and o for the input gate, temporal forget gate, hierarchical forget gate, cell, and output, respectively. The vectors i_t, f_t, and g_t are the weights for new information, remembering old information from siblings, and remembering old information from the parent, respectively. σ is a sigmoid function, and ◦ indicates the Hadamard product. If p(t) = ∅, then h_{p(t)} and c_{p(t)} are set to the initial state value.

$$i_t = \sigma(W_i x_t + U_i h_{p(t)} + V_i h_{\pi(t)} + b_i)$$

$$f_t = \sigma(W_f x_t + U_f h_{p(t)} + V_f h_{\pi(t)} + b_f)$$

$$g_t = \sigma(W_g x_t + U_g h_{p(t)} + V_g h_{\pi(t)} + b_g)$$

$$\tilde{c}_t = W_c x_t + U_c h_{p(t)} + V_c h_{\pi(t)} + b_c$$

$$c_t = f_t \circ c_{p(t)} + g_t \circ c_{\pi(t)} + i_t \circ \tilde{c}_t$$

$$o_t = \sigma(W_o x_t + U_o h_{p(t)} + V_o h_{\pi(t)} + b_o)$$

$$h_t = o_t \circ \tanh(c_t)$$
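To make the recursion concrete, the following is a minimal NumPy sketch of one forward update, not the released Theano code; the parameter dictionary, function names, and the convention of passing the initial state when a predecessor or parent is missing are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_cell_update(x_t, h_prev, c_prev, h_par, c_par, params):
    """One forward graph-LSTM update for node t.

    h_prev, c_prev: state of the sibling predecessor p(t) (initial state if none).
    h_par,  c_par : state of the parent pi(t).
    params: dict with weight matrices W_*, U_*, V_* and biases b_* for
            gates i (input), f (temporal forget), g (hierarchical forget),
            o (output), and the candidate cell c.
    """
    def gate(name):
        return sigmoid(params['W_' + name] @ x_t +
                       params['U_' + name] @ h_prev +
                       params['V_' + name] @ h_par +
                       params['b_' + name])

    i_t = gate('i')   # how much new information to take in
    f_t = gate('f')   # how much to keep from the sibling predecessor
    g_t = gate('g')   # how much to keep from the parent
    o_t = gate('o')
    c_tilde = (params['W_c'] @ x_t + params['U_c'] @ h_prev +
               params['V_c'] @ h_par + params['b_c'])
    c_t = f_t * c_prev + g_t * c_par + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

The backward update described next has the same form with the pointers swapped and a separate set of parameters.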

When the whole tree structure is known, we can take advantage of the full response subtree to better represent the node state. To that end, we define a backward LSTM that has a similar set of update equations except that only the first child will pass the hidden state to its parent. Specifically, the update equations are the same except that π(t) is replaced with κ(t), p(t) is replaced with s(t), and a different set of weight matrices and bias vectors are learned.

Let + and − indicate forward and backward embeddings, respectively. On top of the LSTM unit, the forward and backward state vectors are concatenated and passed to a softmax layer to predict 8 quantized karma levels:

$$P(y_t = j \mid x, h) = \frac{\exp(W_s^j [h_t^+; h_t^-])}{\sum_{k=1}^{8} \exp(W_s^k [h_t^+; h_t^-])}$$

where x and h correspond to the set of input features and state vectors (respectively) for all nodes in the discussion.
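A minimal sketch of this prediction layer, assuming the softmax weights are stored as a matrix with one row per karma level; the names and shapes are illustrative, not taken from the released implementation.

```python
import numpy as np

def predict_karma_level(h_fwd, h_bwd, W_s):
    """Softmax over 8 quantized karma levels from the concatenated node states.

    h_fwd, h_bwd: forward and backward state vectors h_t^+ and h_t^- (length d each).
    W_s: softmax weights of shape (8, 2 * d), one row W_s^j per level.
    """
    h = np.concatenate([h_fwd, h_bwd])      # [h_t^+; h_t^-]
    scores = W_s @ h
    probs = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return probs                            # P(y_t = j | x, h) for j = 1..8
```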

2.2 Input Features

The full model includes two types of features in the input vector: non-textual features associated with the submission context, and textual features of the comment at that node.

The submission context features are extracted from the graph and metadata associated with the comment, motivated by prior work showing that context factors such as the forum, timing and author of a post are very useful in predicting popularity. The submission context features include:

• Timing: time since root, time since parent (in hours), number of later comments, and number of previous comments

• Author: a binary indicator as to whether the author is the original poster, and number of comments made by the author in the conversation

• Graph-location: depth of the comment (distance from the root), and number of siblings

• Graph-response: number of children (direct replies to the comment), height of the subtree rooted at the node, size of that subtree, number of children normalized for each thread (2 normalization techniques), subtree size normalized for each thread (2 normalization techniques).

Two methods are used to normalize the subtree size and number of children to compensate for variation associated with the size of the discussion, specifically: i) subtract the mean feature value in the thread, and ii) divide by the square root of the rank of the feature value in the thread.
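A small sketch of the two normalizations, assuming they are computed over all comments within a single thread and that rank 1 corresponds to the smallest value; the paper does not specify the rank direction or tie handling, so those details are assumptions.

```python
import numpy as np

def normalize_within_thread(values):
    """Two thread-level normalizations of a raw count feature (e.g. subtree size),
    computed over all comments in one thread.

    Returns (mean-centered values, values divided by sqrt of their rank in the thread).
    """
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    # rank 1 = smallest value in the thread (assumed); ties get an arbitrary order
    ranks = np.empty_like(values)
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    by_rank = values / np.sqrt(ranks)
    return centered, by_rank
```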

These features are a superset of those used in Fang et al. (2016). The subvector including all these features is denoted x_t^s.

The comment text features, denoted x_t^c, are generated using a simple average bag-of-words representation learned during training:

$$x_t^c = \frac{1}{N} \sum_{i=1}^{N} W_e^i$$

where W_e^i is the embedding of the i-th word in the comment, and N is the number of words in the comment. Comments longer than 100 words were truncated to reduce noise associated with long comments, assuming that the early portion carries the most information. The percentage of comments that exceed 100 words is around 11%–14% for the subreddits used in the study. In all experiments, the word embedding dimension is d = 100, and the vocabulary includes only words that occurred at least 10 times in the dataset.
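A minimal sketch of this averaging step; the tokenization, the zero vector for comments with no in-vocabulary words, and the function signature are assumptions rather than the authors' code.

```python
import numpy as np

def comment_embedding(tokens, word_emb, vocab, max_len=100, dim=100):
    """Average bag-of-words embedding x_t^c for one comment.

    word_emb: (|V|, dim) embedding matrix; vocab: word -> row index
    (words seen fewer than 10 times in training are simply absent from vocab).
    Comments are truncated to the first max_len words.
    """
    ids = [vocab[w] for w in tokens[:max_len] if w in vocab]
    if not ids:
        return np.zeros(dim)
    return word_emb[ids].mean(axis=0)
```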

The input vector x_t is set to either x_t^s or [x_t^s; x_t^c], depending on whether the experiment uses text.

2.3 Pruning

Often the number of comments in a single subtree can be large, which leads to high training costs. A large percentage of the comments are low karma and minimally relevant for predicting karma of neighbors, and many can be easily identified with simple graph and timing features (e.g. having no replies or contributed late in the discussion). Therefore, we introduce a preprocessing step that identifies comments that are highly likely to be low karma to decrease the computation cost. We then assign these nodes to be level 0 and prune them out of the tree, but retain a count of nodes pruned for use in a count-weighted bias term in the update to capture information about response volume.

For detecting low karma comments, we train a simple SVM classifier to identify comments at the 0 karma level based on the submission context features. If a pruned comment leads to a disconnected graph (e.g., an internal node is pruned but not its children), then the comment is retained in the tree. In testing, all pruned comments are given a predicted level of 0 and accounted for in the evaluation.
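The following sketch illustrates the pruning step with a linear SVM from scikit-learn; the tree representation, the choice of LinearSVC, and the fixed-point pass that keeps internal nodes whose children survive are assumptions about details the paper leaves open.

```python
from sklearn.svm import LinearSVC

def fit_pruner(context_features, karma_levels):
    """Train a binary classifier that flags karma-level-0 comments."""
    y = [1 if level == 0 else 0 for level in karma_levels]
    return LinearSVC().fit(context_features, y)

def prune(tree, pruner):
    """tree: dict node_id -> {'features': [...], 'children': [...]}.
    Returns the set of node ids to prune while keeping the tree connected."""
    to_prune = set()
    for node_id, node in tree.items():
        if pruner.predict([node['features']])[0] == 1:
            to_prune.add(node_id)
    # retain any pruned node with a surviving child, repeating until stable
    changed = True
    while changed:
        changed = False
        for node_id, node in tree.items():
            if node_id in to_prune and any(c not in to_prune for c in node['children']):
                to_prune.discard(node_id)
                changed = True
    return to_prune
```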

The state updates have an additional bias term for any nodes that have subsequent sibling or children comments pruned. For example, consider Figure 2: if nodes {t5, t6, t7, t9} are pruned, then t8 will have a modified forward update, and t3, t4 will have a modified backward update. At node t, define M^κ_t to be the number of levels pruned below it, M^p_t as the number of immediately preceding comments pruned in its subgroup (responding to the same parent), and M^s_t as the number of subsequent comments pruned in its subgroup plus the non-initial comments in the associated subtrees. In the example above, M^κ_3 = 1, M^s_3 = 2, M^s_4 = 1, M^p_8 = 1, and all other M^*_t = 0. The pointers are updated to reflect the structure of the pruned tree, so p(8) = 4, s(4) = 8, s(3) = ∅. The bias vectors r_κ, r_p and r_s are associated with the different sets of nodes pruned.

Let + and − indicate forward and backward embeddings, respectively. The forward update has an adjusted predecessor contribution (h^+_{p(t)} + M^p_t r_p). The backward update adds M^s_t r_s + M^κ_t r_κ to either h^-_{s(t)} or h^-_{κ(t)}, depending on whether it is a time or hierarchical update, respectively.
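A minimal sketch of how the pruned-count bias terms could enter the updates of Section 2.1; the function names are illustrative and the bias vectors are assumed to have the hidden-state dimension.

```python
def adjusted_forward_predecessor(h_prev_fwd, M_p, r_p):
    """Forward update: predecessor contribution becomes h^+_{p(t)} + M^p_t * r_p."""
    return h_prev_fwd + M_p * r_p

def adjusted_backward_input(h_in_bwd, M_s, r_s, M_k, r_k):
    """Backward update: add M^s_t * r_s + M^kappa_t * r_kappa to the incoming
    successor (time) or first-child (hierarchical) state, whichever applies."""
    return h_in_bwd + M_s * r_s + M_k * r_k
```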

2.4 Training

The objective function is minimum cross-entropy over the quantized levels. All model parameters are jointly trained using the adadelta optimization algorithm (Zeiler, 2012).

subreddit   comments   threads   vocab size
askwomen    0.8M       3.5K      32K
askmen      1.1M       4.5K      35K
politics    2.2M       4.9K      55K

Table 1: Data statistics.

subreddit   Prec   Rec    % pruned
askwomen    67.9   72.4   36.9
askmen      60.1   75.3   36.1
politics    49.6   60.3   47.5

Table 2: Precision and recall of the pruning classifier and percentage of comments pruned.

Word embeddings are initialized using word2vec skip-gram embeddings (Mikolov et al., 2013) trained on all comments from the corresponding subreddit. The code is implemented in Theano (The Theano Development Team, 2016) and is available at https://github.com/vickyzayats/graph-LSTM. We tune the model over different dimensions of the LSTM unit, and use the performance on the development set as a stopping criterion for training.
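A sketch of the embedding initialization, assuming the gensim 4.x Word2Vec API rather than the authors' original training script:

```python
from gensim.models import Word2Vec

def train_subreddit_embeddings(tokenized_comments):
    """tokenized_comments: list of token lists, one per comment in the subreddit.
    Returns skip-gram vectors of dimension 100, dropping words seen < 10 times."""
    return Word2Vec(sentences=tokenized_comments, vector_size=100,
                    sg=1, min_count=10, workers=4)
    # model.wv[word] then seeds the corresponding row of the embedding matrix
```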

3 Experiments

3.1 Data

Reddit2 is a popular discussion forum platform consisting of a large number of subreddits focusing on different topics and interests. In our study, we experimented with 3 subreddits: askwomen, askmen, and politics. All the data consists of discussions made in the period between January 1, 2014 and January 31, 2015. Table 1 shows the total amount of data used for each of the subreddits. For each subreddit, the threads were randomly distributed between training, development (dev) and test sets with the proportions of 6:2:2. The performance of the pruning classifier on the dev set is presented in Table 2.

3.2 Task and Evaluation Metrics

Reddit karma has a Zipfian distribution, highly skewed toward the low-karma comments.

2 https://reddit.com



Since the rare high karma comments are of greatest interest in popularity prediction, Fang et al. (2016) propose a task of predicting quantized karma (using a nonlinear head-tail break rule for binning), with evaluation using a macro average of the F1 scores for predicting whether a comment exceeds each different level. Experiments reported here use this framework.

Specifically, all the comments with karma lower than 1 are assigned to level 0, and each subsequent level corresponds to karma less than or equal to the median karma in the rest of the comments based on the training data statistics. Each subreddit has 8 quantized karma levels based on its karma distribution. There are 7 binary subtasks (does the comment have karma at level j or higher, for j = 1, ..., 7), and the scoring metric is the macro average of F1(j). For tuning hyperparameters and as a stopping criterion, we use a linearly weighted average of F1 scores to increase the weight on high karma comments, which gives slightly better performance for the high karma cases but has only a small effect on the macro average.
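A sketch of the binning and scoring, under the assumption that the head-tail break repeatedly splits the remaining comments at their median and that the level boundaries are inclusive; the exact thresholding in Fang et al. (2016) may differ in these details.

```python
import numpy as np
from sklearn.metrics import f1_score

def karma_bin_edges(train_karma, n_levels=8):
    """Head/tail-break style edges: level 0 is karma < 1, and each further edge is
    the median of the training comments above the previous edge (assumed rule)."""
    edges = [1]
    rest = np.asarray([k for k in train_karma if k >= 1])
    for _ in range(n_levels - 2):
        if rest.size == 0:
            break
        edges.append(np.median(rest))
        rest = rest[rest > edges[-1]]
    return edges

def quantize(karma, edges):
    """Level = number of edges the karma value reaches (0..7)."""
    return sum(karma >= e for e in edges)

def macro_f1(y_true_levels, y_pred_levels, n_levels=8):
    """Macro average over the 7 binary subtasks 'level >= j', j = 1..7."""
    scores = [f1_score([y >= j for y in y_true_levels],
                       [y >= j for y in y_pred_levels])
              for j in range(1, n_levels)]
    return float(np.mean(scores))
```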

3.3 Baseline and Contrast Systems

We compare the graph LSTM to a node-independent baseline, which is a feedforward neural network model consisting of input, hidden and softmax layers. This model is a simplification of the graph-LSTM model where there is no connection between nodes. The node-independent model characterizes a comment without reference to either the text of the comment that it is responding to or the comments reacting to it. However, the model does have information on the size of the response subtree via the submission context input features. Both node-independent and graph-structured models are trained with the same cost function and tuned over the same set of hidden layer dimensions.

We contrast performance of both architectures with and without using the text of the comment itself. As shown in Fang et al. (2016), simply using submission context features (graph, timing, author) gives a strong baseline. In order to evaluate the role of each direction (forward or backward) in the graph-structured model, we also present results using only the forward direction graph-LSTM for comparison to the bidirectional model.

Model      Text   askwomen   askmen   politics
indep      no     53.2       48.3     46.6
graph      no     54.6       52.1     47.9
indep      yes    52.8       50.7     47.4
interp     mix    54.7       52.1     48.2
graph(f)   yes    55.0       53.3     49.9
graph      yes    56.4       54.8     50.4

Table 3: Average F1 score of karma level prediction for node-independent (indep) vs. graph-structured (graph) models with and without text features; interp corresponds to an interpolation of the graph-structured model without text and the node-independent model with text; and graph(f) corresponds to a graph-structured model with the forward direction only.

In addition, in order to evaluate the importance of the language of the comment itself vs. the language used in the rest of the tree, we perform an interpolation between the graph-LSTM with no language features and the node-independent model with language features. The relative weight for the two models is tuned on the development set.

3.4 Karma Level Prediction

The results for the average F1 scores on the test set are presented in Table 3. In experiments for all the subreddits, graph-structured models outperform the corresponding node-independent models both with and without language features. Language features also give a greater performance gain when used in the graph-LSTM models. The fact that the forward graph improves over the interpolated models shows that it is not simply the information in the current node that matters for karma of that node. Finally, while the full model outperforms the forward-only version for all the subreddits, the gain is smaller than that obtained by the forward direction alone over the node-independent model, so the forward direction seems to be more important.

The karma prediction results (F1 score) at the different levels are shown in Figure 3. While in the askmen and askwomen subreddits the overall performance decreases for higher levels, the politics subreddit has the opposite trend. This may be due in part to the lower pruning recall in the politics subreddit, but Fang et al. (2016) also observe higher performance for high karma levels in the politics subreddit.



Figure 3: F1 scores as a function of the quantized levels for different model configurations.

4 Analysis

Here, we present analyses aimed at better understanding the behavior of the graph-structured model and the role of language in prediction. All analyses are performed on the development set. The analyses are motivated by considering possible scenarios that are exceptions to the easy cases, which are: i) comments that are contributed early in the discussion and spawn large subtrees, likely to have high karma, and ii) comments with small subtrees that typically have low karma. We hypothesized three scenarios where the bidirectional graph-LSTM with text might be useful. One case is controversial comments, which have large subtrees but do not have high karma because of downvotes; these tend to have overprediction of karma when using only submission context. The other two scenarios involve underprediction of karma when using only submission context. Early comments associated with few children and a more narrow subtree (see the downward chain in Figure 1) may spawn popular new threads and benefit from the popularity of other comments in the thread (more readers attracted), thus having higher popularity than the number of children suggests. Lastly, comments that are clever or humorous discussion endpoints might have high popularity but small subtrees. These two cases tend to differ in their relative timing in the discussion.

4.1 Karma Prediction vs. Time

The first study looked at where the graph-LSTM provides benefits in terms of timing. We plot the average F1 score as a function of the contribution time in Figure 4. As an approximation for time, we use the quantized number of comments made prior to the current comment. The plots show that the graph-structured model improves over the node-independent model throughout the discussion. Relative gains are larger towards the end of discussions where the node-independent performance is lower. A similar trend is observed when plotting average F1 as a function of depth in the discussion tree.

While the use of text in the graph-LSTM seems to help throughout the discussion, we hypothesized that there would be different cases where it might help, and these would occur at different times.



Figure 4: Average F1 scores as a function of time, approximated using the number of previous comments quantized in increments of 20.

Indeed, 93% of the comments that are overpredicted by more than 2 levels by the node-independent model without text (controversial comments) occur in the first 20% of the discussion. Comments that are underpredicted by more than 2 occur throughout the discussion and are roughly uniform (13-19%) over the first half of the discussion, but then quickly ramp down. High-karma comments are rare at the end of the discussion; less than 5% of the underpredicted comments are in the last 30%.

4.2 Importance of Responses

In order to see how the model benefits from using the language cues in underpredicted and overpredicted scenarios, we look at the size of errors made by the graph-LSTM model with and without text features. In Figure 5, the x-axis indicates the error between the actual karma level and the karma level predicted by the graph-LSTM using submission context features only. The negative errors represent the overpredicted comments, and the positive errors represent the underpredicted comments. The y-axis represents the average error between the actual karma level and the karma level predicted by the model using both submission context and language features. The x=y identity line corresponds to no benefit from language features. Results are presented for the politics subreddit; other subreddits have similar trends but smaller differences for the underpredicted cases.

We compare two models – the bidirectional and the forward direction graph-structured LSTM – in order to understand the role of the language of the replies vs. the comment and its history. We find that, for the bidirectional graph-LSTM model, language is helping identify overpredicted cases more than underpredicted ones. The forward direction model also outperforms the node-independent model, but has less benefit in overpredicted cases, consistent with our intuition that controversy is identifiable based on the responses. Although the comment text input is simply a bag of words, it can capture the mixed sentiment of the responses.

While it is not represented in the plot, larger errors are much less frequent. Looking at average F1 as a function of the number of children (direct responses), we found that the graph-LSTM mainly benefits nodes that have a small number of children, consistent with the two underprediction scenarios hypothesized. However, many underpredicted cases are not impacted, since errors due to pruning contribute to 15-40% of the underpredicted cases, depending on the subreddit (highest for politics). This explains the smaller gains for the positive side of Figure 5.

4.3 Language Use Analysis

To provide insights into what the model is learning about language, we looked at individual words associated with different categories of comments, as well as examples of the different error cases.

For the word level analysis, we classified words in two different ways, again using the politics subreddit. First, we associate words in comments with zero or positive karma. For each word in the vocabulary, we calculate the probability of a single-word comment being level zero using the trained model with a simplified graph structure (a post and a comment) where all the inputs were set to zero except the comment text. The positive-karma and zero-karma lists correspond to the 300 words associated with the lowest and highest probability of zero karma, respectively.



Figure 5: The error between the actual karma level and the karma level predicted by the model using both submission context and language features. Negative errors correspond to over-prediction; positive errors correspond to under-prediction.

We identified 300 positive-karma and zero-karma reply words in a similar fashion, using a simplified graph with individual words as inputs for the reply while predicting the comment karma.

Second, we identified words that may be indicative of comments that are over- and underpredicted by the graph-structured model without text and for which the graph-LSTM model with text reduced the error by more than 2 levels. Specifically, we choose those words w in comments having the highest ratio r = p(w|t)/p(w), where t indicates an over- or underpredicted comment, subject to minimum occurrence constraints (5 for overpredicted comments, 15 for underpredicted comments). The 50 words with the highest ratio were chosen for each case, and any words in both over- and underpredicted sets were eliminated, leaving 47 words. Again, this was repeated for words in replies to over- vs. underpredicted comments, but with a minimum count threshold of 20, resulting in 45 words.
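A small sketch of this ranking, assuming the probabilities are estimated from raw token counts and that the minimum-occurrence constraint applies to counts within the target set:

```python
from collections import Counter

def distinctive_words(target_comments, all_comments, min_count=15, top_k=50):
    """Words most associated with a target set (e.g. underpredicted comments),
    ranked by r = p(w | target) / p(w); comments are lists of tokens."""
    target_counts = Counter(w for c in target_comments for w in c)
    all_counts = Counter(w for c in all_comments for w in c)
    n_target = sum(target_counts.values())
    n_all = sum(all_counts.values())
    ratios = {w: (target_counts[w] / n_target) / (all_counts[w] / n_all)
              for w, cnt in target_counts.items() if cnt >= min_count}
    return sorted(ratios, key=ratios.get, reverse=True)[:top_k]
```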

The lists are noisy, similar to what is often found with topic models, and colored by the language of the subreddit community, but a few trends can be observed. Looking at the list of words associated with replies to positive-karma comments, we noticed words that indicate humor (“LOL”, “hilarious”), positive feedback (“Like”, “Right”), and emotion indicators (“!!”, swearing). Words in comments and replies associated with overpredicted (controversial) cases are related to controversial topics (sexual, regulate, liberals), named political parties, and mentions of downvoting or an indication that the comment has been edited with the word “Edit.”

Since the two sets of lists were generated separately, there are words in the over/under-predicted lists that overlap with the zero/non-zero karma lists (12 in the reply lists, 20 in the comment lists).

Figure 6: The mapping of the words in the comments to the shared space using t-SNE in the politics subreddit. Shown are the words that are highly associated with positive-karma, negative-karma, underpredicted and overpredicted comments.

The majority of the overlap (26/32 words) is consistent with the intuition that words on the underpredicted list should be associated with positive karma, and words on the overpredicted list might overlap with the zero-karma list.

Rather than providing word lists, many neural network studies illustrate trends using word embedding visualization. The embeddings of the words from the union of lists for positive-karma, zero-karma, underpredicted and overpredicted comments and replies were together used to learn a t-SNE mapping. The results are plotted for comments in Figure 6, which shows that the words that are associated with underpredicted comments (red) are aligned with positive-karma words (green) for both comment text and text in replies.



Words associated with overpredicted comments (blue) are more scattered, but they are somewhat more like the zero-karma words (yellow). The trends for words in replies are similar.
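A sketch of this projection using scikit-learn's TSNE; the PCA initialization and fixed random seed are assumptions, and the sketch presumes enough words to exceed the default perplexity.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_word_lists(word_emb, vocab, word_lists):
    """Fit a 2-D t-SNE mapping on the union of the word lists (e.g. positive-karma,
    zero-karma, underpredicted, overpredicted) so they share one space for plotting."""
    words = sorted(set(w for lst in word_lists for w in lst if w in vocab))
    vectors = np.stack([word_emb[vocab[w]] for w in words])
    coords = TSNE(n_components=2, init='pca', random_state=0).fit_transform(vectors)
    return dict(zip(words, coords))
```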

Table 4 lists examples of the different error scenarios with the reference karma and predictions of different models (node-independent without text, forward graph-LSTM with text, and the full biLSTM). The first two examples are overpredicted (controversial) cases, where ignoring text leads to a high karma prediction, but the reference is zero. In the first case, the forward model incorrectly predicts high karma because “Republican” tends to be associated with positive karma. The model leveraging reply text correctly predicts the low karma. In the second case, the forward model reduces the prediction, but again having the replies is more helpful. The next two cases are examples of underprediction due to small subtrees. Example 3 is incorrectly labeled as level 0 by the forward and no-text models, but because the responses mention “nice joke” and “accurate analogy,” the bidirectional model is able to identify it as level 7. Example 4 has only one child, but both models using language correctly predict level 7, probably because the model has learned that references to “Colbert” are popular. The next two examples are underpredicted cases from early in the discussion, many of which expressed an opinion that in some way provided multiple perspectives. Finally, the last two examples represent instances where neither model successfully identifies a high karma comment; these often involve analogies. Unlike the “titanic” analogy, these did not have sufficient cues in the replies.

5 Related Work

The problem of predicting popularity in social media platforms has been the subject of several studies. Popularity as defined in terms of volume of response has been explored for shares on Facebook (Cheng et al., 2014) and Twitter (Bandari et al., 2012) and Twitter retweets (Tan et al., 2014; Zhao et al., 2015; Bi and Cho, 2016). Studies on Reddit predict karma as popularity (Lakkaraju et al., 2013; Jaech et al., 2015; He et al., 2016) or as community endorsement (Fang et al., 2016). Popularity prediction is a difficult task where many factors can play a role, which is why most prior studies control for specific factors, including topic (Tan et al., 2014; Weninger et al., 2013), timing (Tan et al., 2014; Jaech et al., 2015), and/or comment content (Lakkaraju et al., 2013). Controlling for specific factors is useful in understanding the components of a successful post, but it does not reflect a realistic scenario. Studies that do not include such constraints have looked at Twitter retweets (Bi and Cho, 2016) and Reddit karma (He et al., 2016; Fang et al., 2016).

The work in (He et al., 2016) uses reinforcement learning to identify popular threads to track given the past comment history, so it is learning language cues relevant to high karma but it does not explicitly predict karma. In addition, it models relevance via an inner-product of past and new comment embeddings, and uses an LSTM to model inter-comment dependencies among a collection of comments irrespective of their sibling-parent relationship, whereas the LSTM in our work is over a graph that accounts for this relationship.

The work most closely related to our study is Fang et al. (2016). The node-independent baseline implemented in our study is equivalent to their feedforward network baseline, but the results are not directly comparable because of differences in training (we use more data) and input features. The most important difference in our approach is the representation of textual context using a bidirectional graph-LSTM, including the history behind and responses to a comment. Other differences are: i) Fang et al. (2016) use an LSTM to characterize comments, while our model uses a simple bag-of-words approach, and ii) they learn latent submission context models to determine the relative importance of textual cues, while our approach uses a submission context SVM to prune low karma comments (ignoring their text). Allowing for differences in baselines, we note that the absolute gain in performance from using text features is larger for our model, which represents language context.

Tree LSTMs are a modification of sequential LSTMs that have been proposed for a variety of sentence-level NLP tasks (Tai et al., 2015; Zhu et al., 2015; Zhang et al., 2016; Le and Zuidema, 2015). The architecture of tree LSTMs varies depending on the task.



Ex   Ref   No text   Graph(f)   Graph   Comment
1    0     7         7          0       Republicans are fundamentally dishonest. (politics, id:1x9pcx)
2    0     7         4          0       That is rape. She was drunk and could not consent. Period. Any of the supposed evidence otherwise is nothing but victim blaming. (askwomen, id:2h8pyh)
3    7     0         0          7       The liberals keep saying the titanic is sinking but my side is 500 feet in the air. (politics, id:1upfgl)
4    7     3         7          7       I miss your show, Stephen Colbert. (askmen, id:2qmpzm)
5    7     3         7          7       that is terrifying. they were given the orders to bust down the door without notice to the residents, thereby placing themselves in danger. and ultimately, placing the lives of the residents in danger (who would be acting out of fear and self-defense) (politics, id:1wzwg6)
6    7     0         5          6       It's something, and also would change the way that Police unions and State Prosecutors work. I don't fundamentally agree with the move, since it still necessitates abuse by the State, but it's something. (politics, id:27chxr)
7    6     0         0          0       Chickenhawks always talk a big game as long as someone else is doing the fighting. (politics, id:1wbgpd)
8    6     0         0          0       [They] use statistics in the same way that a drunk uses lampposts: for support, rather than illumination. -Andrew Lang. (politics, id:1yc2fj)

Table 4: Example comments and karma level predictions: reference (Ref), no text, graph(f), graph.

Some options include summarizing over the children, adding a separate forget gate for each child (Tai et al., 2015), recurrent propagation among siblings (Zhang et al., 2016), or use of stack LSTMs (Dyer et al., 2015). Our work differs from these studies in two respects: the tree structure here characterizes a discussion rather than a single sentence; and our architecture incorporates both hierarchical and temporal recursions in one LSTM unit.

6 Conclusion

In summary, this paper presents a novel approach for modeling threaded discussions on social media using a graph-structured bidirectional LSTM which represents both hierarchical and temporal conversation structure. The propagation of hidden state information in the graph provides a mechanism for representing contextual language, including the history that a comment is responding to as well as the ensuing discussion it spawns. Experiments on Reddit discussions show that the graph-structured LSTM leads to improved results in predicting comment popularity compared to a node-independent model. Analyses show that the model benefits prediction over the extent of the discussion, and that language cues are particularly important for distinguishing controversial comments from those that are very positively received. Responses from even a small number of comments seem to be useful, so it is likely that the bidirectional model would still be useful with a short-time lookahead for early prediction of popularity.

While we evaluate the model on predicting the popularity of comments in specific forums on Reddit, it can be applied to other social media platforms that maintain a threaded structure, or possibly to citation networks. In addition to popularity prediction, we expect the model would be useful for other tasks for which the responses to comments are informative, such as detecting topic or opinion shift, influence, or trolls. With the more fine-grained feedback increasingly available on social media platforms (e.g. laughter, love, anger, tears), it may be possible to distinguish different types of popularity as well as levels, e.g. shared sentiment vs. humor.

In this study, the model uses a simple bag-of-words representation of the text in a comment; more sophisticated attention-based models and/or feature engineering may improve performance. In addition, performance of the model on underpredicted comments appears to be limited by the pruning mechanism that we introduced. It would be useful to explore the tradeoffs of reducing the amount of pruning vs. using a more complex classifier for pruning. Finally, it would be useful to evaluate performance using a short window lookahead for responses, rather than the full discussion tree.



Acknowledgments

This paper is based on work supported by the DARPA DEFT Program. Views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. We thank the reviewers for their helpful feedback.

References

Roja Bandari, Sitaram Asur, and Bernardo Huberman. 2012. The pulse of news in social media: Forecasting popularity. In Proc. ICWSM.

Bin Bi and Junghoo Cho. 2016. Modeling a retweet network via an adaptive Bayesian approach. In Proc. WWW.

Justin Cheng, Lada Adamic, P. Alex Dow, Jon Michael Kleinberg, and Jure Leskovec. 2014. Can cascades be predicted? In Proc. WWW.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2015. Recurrent neural network grammars. In Proc. NAACL.

Hao Fang, Hao Cheng, and Mari Ostendorf. 2016. Learning latent local conversation modes for predicting community endorsement in online discussions. In Proc. SocialNLP.

Ji He, Mari Ostendorf, Xiaodong He, Jianshu Chen, Jianfeng Gao, Lihong Li, and Li Deng. 2016. Deep reinforcement learning with a combinatorial action space for predicting popular Reddit threads. In Proc. EMNLP.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, and Hannaneh Hajishirzi. 2015. Talking to the crowd: What do people react to in online discussions? In Proc. EMNLP.

Himabindu Lakkaraju, Julian J. McAuley, and Jure Leskovec. 2013. What's in a name? Understanding the interplay between titles, content, and communities in social media. In Proc. ICWSM.

Phong Le and Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. In Proc. *SEM.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. ICLR.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proc. ACL.

Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proc. ACL.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frederic Bastien, Justin Bayer, Anatoly Belikov, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

Trang Tran and Mari Ostendorf. 2016. Characterizing the language of online communities and its relation to community recognition. In Proc. EMNLP.

Tim Weninger, Xihao Avi Zhu, and Jiawei Han. 2013. An exploration of discussion threads in social news sites: A case study of the Reddit community. In Proc. ASONAM.

Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016. Top-down tree long short-term memory networks. In Proc. NAACL.

Qingyuan Zhao, Murat A. Erdogdu, Hera Y. He, Anand Rajaraman, and Jure Leskovec. 2015. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proc. SIGKDD.

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In Proc. ICML.
