Explain like I am a Scientist: The Linguistic Barriers of ...taugust/files/August_CHI2020.pdfExplain like I am a Scientist: The Linguistic Barriers of Entry to r/science Tal August

Explain like I am a ScientistThe Linguistic Barriers of Entry to rscience

Tal August1 Dallas Card2 Gary Hsieh3 Noah A Smith14 Katharina Reinecke1

1Paul G Allen School of CSE University of Washington 2Stanford University 3Human Centered Designand Engineering University of Washington 4Allen Institute for Artificial Intelligence

taugust nasmith reineckecsuwedu dcardstanfordedu garyhsuwedu

ABSTRACTAs an online community for discussing research findingsrscience has the potential to contribute to science outreachand communication with a broad audience Yet previous worksuggests that most of the active contributors on rscience arescience-educated people rather than a lay general public Onepotential reason is that rscience contributors might use adifferent more specialized language than used in other subred-dits To investigate this possibility we analyzed the languageused in more than 68 million posts and comments from 12subreddits from 2018 We show that rscience uses a special-ized language that is distinct from other subreddits Transient(newer) authors of posts and comments on rscience use lessspecialized language than more frequent authors and thosethat leave the community use less specialized language thanthose that stay even when comparing their first commentsThese findings suggest that the specialized language used inrscience has a gatekeeping effect preventing participationby people whose language does not align with that used inrscience By characterizing rsciencersquos specialized languagewe contribute guidelines and tools for increasing the numberof contributors in rscience

Author KeywordsScience communication online communities language style

CCS ConceptsbullHuman-centered computing rarr Empirical studies in col-laborative and social computing Empirical studies in HCI

INTRODUCTIONScience communication is important for both scientists andthe public as it allows communicating and discussing researchfindings with a broad audience [36] While scientific findingshave traditionally been curated by journalists science com-munication has become more scalable and democratized withthe advent of the Internet which has enabled sharing of sci-entific findings in blog posts social networks or other onlinecommunities

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page Copyrights for components of this work owned by others than ACMmust be honored Abstracting with credit is permitted To copy otherwise or republishto post on servers or to redistribute to lists requires prior specific permission andor afee Request permissions from permissionsacmorgCHI rsquo20 April 25ndash30 2020 Honolulu HI USA2020 Association of Computing MachineryACM ISBN 978-1-4503-6708-02004 $1500httpdxdoiorg10114533138313376524

rscience a sub forum (ldquosubredditrdquo) on the social news aggre-gation platform Reddit and shown in Figure 1 has emerged asone of the largest platforms for disseminating and discussingscientific findings outside of the social circles of the scientiststhemselves However past work has shown that active con-tributors in rscience are largely already involved in scientificactivity and broader public engagement is lacking [17]

A possible reason why rscience does not attract broader con-tribution is that the community might have developed spe-cific norms that deter some users from actively participatingClashes between user values and community norms or usersfailing to adopt community norms can lead to negative com-munity feedback and reduced engagement [10 28] One suchnorm is the specialized language members might use to con-tribute to an online community [8] Commonly used insiderwords [8] politeness or formality [2 27] and even pronounusage [35] are all examples of linguistic behaviors developedwithin a community that characterize its language Just as withother normative behaviors the language is important for newusers to adopt users who donrsquot adopt it receive fewer and lesssupportive responses [16 32] and are more likely to leave anonline community [8] Given that science communication isoften hindered by specialized language (eg jargon) [29] it ispossible that the rscience community has developed languagenorms that prevent a diverse public from engaging

To explore whether rscience contains specialized languagethat may present a barrier to entry we analyzed differencesin the language of 68560317 publicly available posts andcomments on Reddit We compared the language used inrscience to the language used by contributors of 11 othersubreddits using language models a commonly used techniquefor measuring language differences [8]

We found our hypothesis to be true the language used inrscience posts and comments differs from the language usedin other large or topically related subreddits Many of the dis-tinctions in language in rscience reflect rsciencersquos rules forlanguage norms such as hedging (qualifying statements withwords such as lsquosuggestrsquo and lsquopossiblyrsquo) impersonal languageand scientific terminology Our results show that transientcontributors (ie those who only post once) on rscience failto adapt to this specialized language more often than more ex-perienced authors suggesting that rsciencersquos language normsdo not always come naturally to people

Figure 1 The main page of rscience with top posts on the left and adescription of the community and its rules on the right

Our findings indicate that rsciencersquos guidelines and commu-nity norms while useful to maintain a high standard of rigorand discourse have the side effect of limiting contributionsfrom a broader audience by enforcing a specialized languagethat people wanting to post and comment first have to learnWhile our analysis was on rscience our methodology can helpresearchers identify if there are similar gatekeeping effects ofspecialized language in other scientific communication suchas blogs and social media platforms Our work also providesseveral design implications such as how to guide newcomersin contributing and following language norms and ways torelax some rules without sacrificing scientific rigor

RELATED WORKRelated to the current paper are (1) research on science commu-nication (2) work on online community norms and guidelinesand (3) studies of specialized language in online communities

Science CommunicationScience communication is the use of appropriate skills mediaactivities and dialogue to produce one or more of the fol-lowing personal responses to science awareness enjoymentinterest opinions and understanding [3] Over the past twentyyears the focus of science communication has shifted fromdissemination to dialog and participation [14] The goal isno longer simply to provide the public with more information

to overcome any information deficits but to increase publicengagement through two-way interactions

Advances in social media and social technologies are offeringnovel and hybrid forums to support interactions between sci-entists and different publics Not only are people turning toblogs and online-only media sources for science informationthe social and interactive features of Web 20 are also allowingpeople to produce and discuss science online [1] One suchpopular example is rscience a subreddit on Reddit A study ofrsciencersquos 2016 posts suggests that it is a vibrant communityattracting a variety of science information and discussions [17]The most frequent comments posted include questions aboutthe research or related work extending applying or reasoningabout the research personal questions or stories or responsesto such and offering an educational response

However one concern identified in the study of rscienceis that despite its potential for supporting dialogue it mayonly attract and sustain participation from an interested andknowledgeable public [17] A noted tension was betweenfacilitating broad science dialogues and ensuring high qualityinformation rscience ensures high quality information witha list of relatively strict guidelines on their front page (seebottom right corner in Figure 1) such as requiring all posts tolink to peer-reviewed research and disallowing sensationalizedpost headlines This is especially evident when an rsciencepost becomes popular enough to reach rall (a default pageshowing a digest of the most popular posts from a userrsquossubscriptions) During these events Reddit users who arenot familiar with rscience norms can become exposed torscience content However when these new rscience visitorspost comments their comments are often not aligned withrsciencersquos norms resulting in many of the comments gettingdeleted[17] This could be discouraging for newcomers andthe lay public and inhibit them from further participation ndash ahypothesis that we seek to investigate in this paper

Online Community Norms and GuidelinesOnline communities develop norms and guidelines that usersfollow to be productive members of a community Theseguidelines are often decided as a rough consensus of membersaround what behavior is or is not acceptable in the commu-nity [18] There are some community standards governing theentire Reddit platform (ldquoreddiquetterdquo) but it is common forindividual subreddits to have their own set of specialized rulesand norms [9] Chandrasekharan et al [5] characterized Red-dit community rules into three levels Macro (ie rules sharedacross all of Reddit) Meso (ie rules shared across groups ofsubreddits) and Micro (ie subreddit specific rules) Morepopular subreddits usually have more structured norms due tohaving to handle and socialize large influxes of new memberswithout losing community values [9] Moderators also playa strong role in developing and enforcing community normswhich can range from being highly active in a communitysuch as posting content often to only being involved when amember seriously violates a rule [31]

Although community norms are often publicly displayed new-comers can be overwhelmed by these rules risking communityrejection by unknowingly violating norms [12 18] or turned

off from the community by guidelines clashing with their ownvalues [28] Research suggests that Wikipediarsquos sharp declinein retention of desirable new editors (ie not vandals) fromaround 40 in 2003 to less than 10 in 2010 can partly beattributed to inflexible rules [11] Community guidelines canalso deter some users from joining a community due to an un-intended clash in values such as StackOverflowrsquos rule of ldquoNothank yoursquosrdquo that conflicts with many usersrsquo beliefs aroundhealthy community support [28]

Publicly displaying community norms supports new users ininteracting with the norms of the community which can helpincrease normative behavior in newcomers [22] Cialdini etal [7] characterized two ways norms influence behavior in-junctively where norms prescribe acceptable behaviours in thegroup and descriptively where member behaviour providesexamples of norms Morgan and Filippova [24] character-ized these injunctive and descriptive norms in Wikipedia sub-communities identifying injunctive norms as posted guide-lines and descriptive norms as active community threads Bothinjunctive and descriptive norms can increase normative be-haviour in online communities especially when injunctivenorms are reinforced with descriptive norms or vice versa

While many of the guidelines and moderation in online com-munities focus on acceptable behaviors like how peopleshould treat other members or what they can post about com-munities also develop unique language norms such as specificwords members use [8] that are important for new membersto follow in order to receive support from the community [816 32] In this paper we extend prior work by exploring thelanguage norms that rscience has developed and showing howthese norms can act as an additional gatekeeper

Community LanguageStudying community language norms has a long history insociolinguistic research Labov [20] studied fine-grained pho-netic differences in New York City showing that differentsocio-economic classes and even small peer groups withinthese classes use significantly different vocalizations suchas a different pronunciation of the r sound [19 20] Milroyand Milroy [23] showed in their seminal work exploring ver-nacular English in Belfast Northern Ireland that communitynetworks such as familial ties were a strong influence on anindividualrsquos linguistic variation

Research has drawn comparable findings for linguistic varia-tion in online communities [8 35 27] Cassell and Tversky [4]explored the formation of a new online community comprisedof children from around the world finding that these childrenfrom diverse cultural economic and geographic backgroundsconverged on a shared language style such as speaking inthe collective voice and topics of conversation Danescu-Niculescu-Mizil et al [8] had similar findings in online beerenthusiast communities showing that members adopt cer-tain words (eg ldquoaromardquo or fruit-related words) that becomewidespread in the community Tran and Ostendorf [35] charac-terized the language style of 8 subreddit communities showingthat stylistic features such as community-specific jargon andsentence structure led to close to 90 accuracy in identify-ing a community Zhang et al [37] built on this research

Table 1 Number of posts comments and subscribers of each subredditSubreddit Posts Comments Subscribersrscience 22157 604267 21015665rnews 254201 5878711 17972696rpolitics 211866 14754150 4925536rpics 101624 3806068 21313781rfunny 114272 4246111 23759930raskreddit 1096947 35665797 22153598raskhistorians 41203 72056 998325reverythingscience 6772 36901 165852rfuturology 20127 799176 14037974rtruereddit 4193 198108 439334rdataisbeautiful 8437 420762 13608622raskscience 12162 184249 17901914

exploring a larger subset of 300 subreddits and showing thatfrequently used words within a subreddit were also useful incharacterizing how the community was distinctive (differentfrom other communities) and dynamic (how quickly the com-munity shifted to new topics) They found that distinctive anddynamic communities are more likely to retain users

Adopting the language style of a community is importantfor being accepted by the community [2 8 16] For exam-ple users who did not adopt the specific words common inthe beer enthusiast communities mentioned above were morelikely to leave compared to members who did adopt thesenew words [8] Members of breast cancer online supportgroups use much more community-specific jargon and infor-mal language the longer they remain part of the communityindicating that language style is a reflection of communitysocialization [27] In addition members of mental health sup-port groups on Reddit are more likely to receive supportiveresponses if their language matches the style of the commu-nity [32] and Twitter users who match the language style oftheir followers receive more retweets [34] We build on thiswork by exploring the language norms of rscience and hownew members in rscience adopt or donrsquot adopt these norms

METHODWe conducted linguistic and quantitative analyses of 12 largesubreddits to answer our overarching question of whetherrsciencersquos potentially specialized language represents a bar-rier of entry More precisely our research questions are

RQ1 Do posts and comments on rscience contain special-ized language compared to other large or topically relatedsubreddits If so what are the characteristics of this spe-cialized language

RQ2 How does the language of users who stay and leavediffer in their posts and comments If there is a defineddifference this would suggest that new contributors experi-ence a language barrier and those who join rscience with astrongly differing language are less likely to stay than thosewhose language more closely aligns with rsciencersquos

DataIn addition to rscience we selected 11 subreddits withthe goal of comparing the language used in rscience to

the language used in other large or topically related com-munities rnews rpolitics rpics rfunny raskhistoriansrfuturology rtruereddit rdataisbeautiful rasksciencereverythingscience and raskreddit While there is no suchthing as ldquothe average Reddit userrdquo that we could compareagainst our list includes some of the largest subreddits thatshare a similar subscriber count to rscience (rnews rpoliticsrpics rfunny and raskreddit) and those with the largestoverlap in user participation with rscience (ie users com-menting and posting in rscience and the other subreddit)raskhistorians rfuturology rtruereddit rdataisbeautifulraskscience and reverythingscience) [21] All of these aresubreddits with community members that rscience shouldhave an interest in engaging which is why our aim was tocharacterize how different the language experience is for auser in these subreddits compared to rscience

For each subreddit we collected all comments and posts fromJan 2018 to Dec 2018 using Googlersquos Bigquery1 We ignoredcomments and posts that had been deleted by their originalauthor by moderators or that were shorter than 10 words fora total of 1893961 posts and 66666356 comments Table 1provides details of our dataset

AnalysisRQ1 rsciencersquos specialized languageWe analyzed whether rscience uses specialized language byconstructing language models trained on its posts and com-ments A language model is a conditional probability distri-bution over each word in a vocabulary given preceding words(left context) the distributions are estimated using a trainingcorpus of texts which we denote D A language model canbe used to assign probability mass to a sequence of words bytaking the product of conditional probabilities for individualwords given their left contexts (ie applying the chain rule ofprobability) The probability that a language model trainedon text dataset D which we denote LM(D) assigns to a newpiece of text depends very heavily on the training data D Forexample training a language model on judicial decisions willlikely lead to very low probability assignment for a Reddit postabout astronomy but higher probability for a similar-lengthlegal brief

Given a language model LM(D) we can calculate how differ-ent a text (word sequence ~w = 〈w1w2 wN〉) is from thetraining data of LM by calculating ~wrsquos cross entropy underLM

CE(~wLM(D)) =minus 1N

sumi=1

log pLM(D)(wi | wiminus1) (1)

where pLM(D)(wi | wiminus1) is the ldquobigramrdquo probability of wordwi given the preceding word wiminus1 according to the languagemodel LM(D) and w0 and wN are special ldquostartrdquo and ldquostoprdquotokens included by convention2 The higher the cross entropythe more divergent ~w is from LM(D)rsquos training data the less

1httpscloudgooglecombigquery2While bigram models are a relatively simple choice they suffice forour purposes and are relatively robust against overfitting for datasetsof the size we consider here

likely it is under pLM(D) and hence the more ldquosurprisingrdquo it isunder the distribution of language model LM(D)

Our language models are Katz-backoff bigram models withGood-Turing smoothing [6]mdasha commonly used technique toimprove probability estimates that past work in this spacehas employed [37]mdashtrained on posts and comments of activeauthors in rscience All language models used as a vocabularythe words seen during training We built our language modelsusing the SRILM toolkit [33]

For comments our language models were trained on 1000comments 5 comments sampled from 200 experienced com-menters for each month We defined experienced commentersas those who had commented at least 5 times in a given monthfollowing [37] For posts the language models were trained on1000 posts (5 posts sampled from 200 experienced posters forthe entire year) We treated experienced posters as those whohad posted at least 5 times in the year This provided roughlythe same percentage of users per year that our definition ofexperienced commenters did for a month We constructed100 language models for each month of comments and 100language models total for the yearrsquos posts resampling authorsand their comments and posts each time We used all languagemodels in each cross entropy calculation for a post or com-ment averaging all post or comment cross entropy within alanguage model

Past work has identified that longer posts and comments tendto exhibit higher cross entropy possibly due to the higher prob-ability of esoteric language in longer posts or comments [837] To avoid these length effects we followed past work [37]and only used the first ten words of each comment or post fortraining and cross entropy calculations inserting stop tokensat the end of these first ten words3

With language models trained on rscience posts and com-ments we analyzed whether text from other subreddits di-verged from the text of rscience by calculating the cross en-tropy of text sampled from other subreddits and comparing itto text sampled from rscience In particular we calculated thecross entropy of 50 comments 5 unique comments sampledfrom 10 experienced authors (different from those used to trainthe models) for each month from each subreddit using lan-guage models trained on rscience comments for that monthWe then conducted an ANOVA and Tukey post-hoc tests tocompare the cross entropy of rscience comments comparedto the comments of other subreddits averaged over all monthsWe conducted the same analysis for rscience posts samplingover the entire year 12 times to match the number of monthsamples

One limitation of using language models trained on rscienceis that they will overfit to rscience posts and comments Thisleads us to expect that they will find posts and comments ofother subreddits surprising (ie have higher cross entropy)All a higher cross entropy means is that something in the textof rscience is different from other subreddits but it doesnrsquot

3We obtained qualitatively similar results using the entire span ofeach post or comment for training

signal what that something is To characterize the actual dif-ferences between the language used in rscience we thereforeadditionally analyzed individual word frequencies using uni-and bigram language models for both posts and comments(separately) To do this we first estimated an empirical back-ground frequency for each token (counted as a unigram orbigram) using all other subreddits

pw = (cw +α)

sumVi=1(ci +α) (2)

where cw is the observed count of vocabulary token with indexw V is the size of vocabulary and α is a smoothing constantwhich we set to 01

To determine which tokens are unusually common or rare inrscience we used a χ2 test of independence with a Bonferronicorrection Among those that were significantly different weidentified the tokens that had the most improbably high or lowobserved counts by modeling them with independent binomialdistributions for each token

)= Binomial

(N(s) pw

where c(s)w is the observed count of token w in rscience andN(s) is the total number of tokens (uni- or bigrams) in rscience(posts or comments)

To summarize these findings we report the words whose ob-served counts had the lowest probability according to thesemodels separating them into those that were used more fre-quently and less frequently than would be expected We alsoused a similar analysis to compare the frequency of words usedby transient contributors those who only contributed once andexperienced contributors within rscience comments

To further evaluate how the language of rscience differs fromthe language of other subreddits we built a text classifier usingonly uni- and bigram features to classify posts and commentsas either in rscience or not In contrast to the language modelswhich show how surprising the language of other subreddits isto rscience a classifier will show how easily distinguishablerscience language is compared to those of other subredditsOur classifier is a support vector machine (SVM) using uni-and bigram features We use a term frequency inverse doc-ument frequency (tf-idf) transform to account for commonwords throughout all posts and comments We trained twoSVMs one for posts and one for comments We used oneversus many classification meaning the classifiers classifyposts and comments as either in rscience or not We reportthe F1 score for both classifiers Because the number of non-rscience posts and comments vastly outweigh the number ofrscience posts and comments they are sampled equally tohave a balanced training and test set

RQ2 Language used by transient vs returning authorsPast work has shown that new authors commenting or postingfor the first time do not necessarily adopt the language normsof the community immediately [8] and are more likely to leavethe community This is know as an acculturation gap [37]ie the difference in language between users who only everposted or commented once (transient users) and returning

authors and is predictive of long term engagement [8] To an-alyze whether there is a defined acculturation gap for rscience(which would indicate new usersrsquo difficulty in adapting tothe specialized language of the community) we calculatedthe difference in the cross entropy of posts and comments bytransient contributors (ie those who only contribute once) vsexperienced contributors Following Zhang et al [37] we useexperienced contributors as a proxy for community languageIt is also important to note that transient users might havecontributed to other subreddits or even read rscience postshowever they had not posted or commented in rscience Wecalculated the cross entropy of experienced contributors bysampling 5 comments for 50 experienced contributors fora total of 250 comments We then sampled 250 commentsfrom transient contributors The acculturation gap was thedifference between these two cross entropies We resampledexperienced and transient contributorsrsquo comments for eachmonth and 12 times on the entire year for posts We followedthis analysis with a deeper look into the common and rarewords used by transient users compared to experienced usersin rscience This allowed us a more nuanced perspective onnot only whether the language differed between transient andreturning users but also how it differed

While the acculturation gap is useful in identifying the distinc-tiveness of rsciencersquos language to first time users we soughtto further investigate whether language might act as a barrierto new users by examining if those who ultimately stayed inrscience matched the language of the community more closelyin their first contribution than those who only contributed onceand then left If so this would suggest that language is a factorfor deterring users from engaging with rscience beyond onepost or comment Because we only looked at data for a singleyear 2018 we were unable to determine whether users con-tributed before this cutoff We therefore ignored commentsand posts from January and February for this analysis Con-sidering that over 70 of users contribute for only one monthin rscience it is unlikely that users who did not post in thefirst two months of 2018 were active before that

We calculated the cross entropy of the 250 first posts or com-ments from 250 experienced authors comparing this to 250posts or comments from authors who only ever commented orposted once in the subreddit We resampled for each month ofcomments and 12 times on the entire year for posts

RESULTS

RQ1 The rscience community uses specialized languagecompared to other subredditsThe cross entropy of comments and posts significantly differedacross subreddits (posts F1114388) = 3995661 p lt 0001comments F(1114388) = 1615158) see Figure 2 rsciencehas the lowest cross entropy for both posts and commentssuggesting that there are unique language characteristics (egwords and phrases) in rsciencersquos posts and comments that donot occur in other subreddits The difference holds even forthose subreddits that are topically related (eg raskscienceand reverythingscience)

Figure 2 Cross entropies of posts and comments from experienced contributors from each subreddit calculated using rscience language models Barsshow bootstrapped 95 confidence intervals Asterisks denote t-test significance compared to rscience Similar results were found using Mann-WhitneyU tests for non-normal distributions All results are corrected for multiple hypothesis testing using Bonferroni-Holm lowastp lt 05 lowastlowastp lt 001

Table 2 Summary of the most unusually common and rare words in rscience posts and comments calculated with χ2 test of independence based onthe frequency of the words in rscience relative to other subreddit posts and comments respectively

Especially Common Especially RareTerminology (cells brain cancer disease) Pronouns (you your it youve my i he me)

Posts Reporting (study researchers scientists new) Questions (what whats if why how who)Hedge words (may according suggests likely) Opinions (serious best worst like favorite)Terminology (cells cancer species energy) Pronouns (he i his she my her him me)

Comments Reporting (study studies science research) Politics (trump mueller clinton hillary)Analysis (factors correlation likely link) Profanity

While it is not surprising that rscience has the lowest crossentropy since the language models were trained on it it is in-teresting to see how other subreddits relate to rscience in postand comments For example post cross entropies are moredifferent across subreddits than are comments This is due toposts containing more topic words (eg Trump cells) thancomment text Interestingly rpics posts have a similar crossentropy to rscience posts This may be because rpics alsohas stringent guidelines for post titles that discourage personalwords in the title (eg no memorial posts no posts askingfor assistance and no personal information) While rpoliticsalso has stringent post title guidelines the strong topical dif-ferences between it and rscience most likely contributed toits higher cross entropy

Table 2 summarizes the words and bigrams that comprisethe most improbably common and rare terms in rsciencecompared to the other subreddits for both posts and comments(ignoring conjunctions and prepositions) Looking at thesewords we can see that in rscience posts scientific terminologyreferences to scientific studies and hedge words (eg maysuggests likely) are all extremely common relative to othersubreddits By contrast personal pronouns question wordsand expressions of opinions are extremely uncommon Similarpatterns hold for rscience comments but we did not observesuch an extreme use of hedge words and the most notablyunderused words (besides personal pronouns) are profanityand terms related to politics (which have a high backgroundfrequency in comments in other subreddits) Looking at the

bigrams reveals similar findings science posts and commentscontain more references to scientific studies (eg (researchershave) (study finds)) and hedge phrases (eg (according to)(likely to))) and fewer questions and personal references (eg(what is) (do you) (when i) For a list of the top 50 commonand rare uni- and bigrams in rscience see the appendix Whilethere are topical differences in some word usage comparison(eg Trump as a common word outside of rscience) there arealso many examples of stylistic differences (eg hedging andimpersonal language) in words and phrases This indicatesthat rscience differs in style and topic from other subredditsespecially with hedging and impersonal speech

The classifiers achieved mixed results for identifying posts andcomments as from rscience or not The comment classifierobtained a test F1-score of 73 while the post classifier reached84 Similar to the cross entropy findings the classifier scoressuggest that posts are easier to differentiate across subredditsthan comments Considering the minimal features used to trainboth classifiers (simple uni and bigram features with a tf-idftransform) the scores reflect the distinctiveness of rscienceposts scoring well above random chance

RQ2 rscience users that donrsquot match the communityrsquoslanguage are more likely to leaveThe majority (57) of users in rscience only post or commentonce and never return We found a pronounced and significantdifference in cross entropy of these transient authors versusexperienced authors in rscience for comments (mean = 737

Table 3 Posts and comments from transient and experienced users in rscience Sampled from all contributionsTransient Contributors Experienced ContributorsThe definition of the kilogram might be about to change forthe better

Sulfur isotope has helped reveal surprising information aboutboth the origins of life on Earth

Posts How Reddit (and the rest of the internet) is good (and bad) foryou

Negative experiences on social media carry more weight thanpositive interactions []

Same with adderall in my case Whenever Irsquom on it im nolonger constantly hungry

Yeah I would have wanted a control group just to confirm howfmri changes when you were just exposed to it

Comments Irsquom convinced that any mouse with a strong background inscience could make itself immortal

Well no this would be enough to be revolutionary if you couldbuild say MRI machines with it Itrsquos much cheaper to run afridge than to keep something chilled with liquid helium

Figure 3 Sampled distribution of cross entropies between transientand experienced contributors (all posts and comments) Difference incomment cross-entropy is significant (p lt 001) though not for posts (in-dependent samples t-test corrected for mutliple hypothesis testing us-ing Bonferroni-Holm) Considering that posting is speaking to all ofrscience this is most likely due to people following rsciencersquos postingrules more closely when they first post than when they first comment

sd = 013 vs mean=716 sd = 014) (t1199 = 4085 p lt0001 d = 167) and posts (mean = 810 sd = 015 vsmean = 805 sd = 016) (t1199 = 807 p lt 0001 d = 033)Figure 3 plots the distributions of the cross entropies for postsand comments

Looking at common words and phrases in these transient usercomments we found that personal words (eg i my feel) aresignificantly more common in transient user comments thanexperienced user comments of rscience and words discussingscientific findings (eg abstract journal evidence) are signif-icantly rarer The common and rare bigrams for transient usersreflect similar differences with personal and anecdotal phrases(eg (i was) (when i)) common while references to scientificfindings (eg (linked academic) (press release)) rare Theseresults mirror those found between rscience and other sub-reddits (see Table 2) suggesting that users from other popularsubreddits while possibly matching the language in these sub-reddits are faced with a more pronounced language barrier in

Figure 4 Sampled distribution of cross entropies between first timecomments of users who leave and who stay Difference is significant(p lt 001)

rscience Table 3 provides examples of posts and commentsfrom transient and experienced contributors on rscience

rscience also stands out as having lower new user retentionthan the majority of other subreddits (see Figure 5) withan average of 10 (sd = 328) of new users returningafter posting or commenting for the first time in the previousmonth (F(11120) = 13818 p lt 0001) Interestingly many ofthe subreddits related to rscience such as raskscience andreverythingscience have similarly low retention

Our methods for measuring language differences did not allowquantitative comparisons of differences between transient andreturning authors in rscience compared to other subredditssince this would have involve comparing significance valuesand cross entropies calculated from different language modelsboth of which are improper comparisons We instead ran ourword frequency tests on comments of transient users fromsubreddits topically related to rscience and with similarlylow user retention raskscience and reverythingscience Ifthe common and rare words fell into similar categories asin rscience (eg personal words and scientific findings) fortransient contributors this would suggest that the low retentionin these communities is related new users failing to adapt tothe same specialized language

Common words for transient users in raskscience andreverythingscience included more personal words words (egi my your) while scientific words (eg species particlesgenetic) were rare for transient contributors However therewere also noticeable differences in these words compared torscience raskscience contained many more question words

(eg what please thank) which makes sense considering thepurpose of the subreddit is to ask questions These differencesfall along the differences in the purposes of the subredditswhile the similarities between these subreddits (a focus onscientific terminology and away from personal words) sug-gests that this type of language is more difficult for the generalReddit user to adapt to possibly contributing to the lower userretention they share

These differences suggest that transient users in rsciencethose who only post or comment once use significantly differ-ent language than those who are returning contributors in thecommunity To delve deeper into this difference we exploredhow the language of the first post or comment of contribu-tors in rscience who would return differed from the posts orcomments made by transient users

Users who end up becoming experienced contributors ofrscience matched the language in rscience in their first com-ment more closely than those who only contributed onceand then left (mean = 730 sd = 13 for experienced usersvs mean = 740 sd =11 for transient users) (t1199 = 1903p lt 0001 d = 078) Figure 4 plots this difference showingsimilar distributions as found in Figure 3 for comments Wedid not find this to be the case with posts the cross entropyof experienced usersrsquo first posts was not significantly lowerthan the cross entropy for transient usersrsquo posts Consideringthat posting is speaking to all of rscience this is probablydue to people focusing more on what they are writing ndash andfollowing rsciencersquos posting rules more closely ndash when theyfirst post compared to when they first comment These resultsshow that users who leave after commenting once divergefrom the language of rscience significantly more than userswho ultimately stay suggesting that language is a factor fordeterring some users from commenting in rscience

Past work has shown that users who are more likely to stay inan online community write differently such as using more col-lective identity words than those who are just passing througheven in their first post or comment [13] However other workhas also shown that users begin writing differently such asusing more collective identity words the longer they stay ina community [27] This highlights a debate in social comput-ing on whether highly active users in a community are bornmeaning there is something inherently different about theseusers or made meaning users are slowly drawn into a com-munity [15] While Figure 4 suggests that returning users inrscience are born rather than made when commenting wedecided to further explore this concept of born versus made byanalyzing how comment cross entropy changed as a functionof how long the user had been part of rscience To do thiswe sampled 1000 experienced users and numbered all theircomments from 1 (the userrsquos first comment in our data) ton (their last) and calculated cross entropy for all commentsWe used comment number rather than actual timestamp asa our measure of time in the community because each usercontributes to the community at a different rate making physi-cal time an inaccurate measure for observing change acrossusers [8] We ignored comments numbered greater than 50

Figure 5 New user retention across all subreddits lowastp lt 05 independentsamples t-test compared to rscience Corrected for mutliple hypothesistesting using Bonferroni-Holm

Figure 6 Cross entropy of contributors as they contribute more torscience Cross entropies are binned into 10 equal groups of com-ments Correlation is significant using Spearmanrsquos rank-order corre-lation (ρ =minus0027 p lt 01)

since this represented a tiny minority of contributors (less thanhalf of 1) and would encourage overfitting on a small subset

We found that over time users will match the language ofrscience more closely Figure 6 plots cross entropy as a func-tion of comment number for contributors in rscience As thefigure shows there is a slight negative correlation betweencomment number and cross entropy (ρ = minus027 p lt 01using Spearmanrsquos rank-order correlation) showing that thelonger a user contributes to rscience the closer their languagematches the language of other experienced contributors in thecommunity This is in line with prior work on socialization inonline communities [8 27] and suggests that although experi-enced users may begin by matching rsciencersquos language morethan transient users (supporting a born hypothesis) they willalso match the language of rscience more as they continue tocomment (supporting a made hypothesis) [15]

DISCUSSIONScience communication can help stimulate awareness andimprove understanding of science creating a society that ap-preciates and supports science and science literacy [3] Inan ideal world people of various walks of life would cometogether and talk about science Online forums such as the

science communication subreddit rscience have the potentialto make this happen In theory anyone with Internet accesscan participate on rscience and engage in science discussionsHowever our results suggest that language use can act as abarrier for participation We found that those who activelycontribute on rscience frequently use scientific terminologyimpersonal speech and hedge words such as may suggestsor likely The language used on rscience is distinctively dif-ferent from other popular and topically related subreddits asevidenced by our analyses Our results support past researchon specialized language as a hurdle to science communicationin general [29] and extend it by showing how specialized lan-guage gatekeeps in social discussions of science This worksprovides an exciting step towards exploring the gatekeeping ef-fect of language in scientific communication both on rscienceand in other media such as blogs and news articles making itmore equitable and accessible

While the specialized language on rscience may be a natu-ral extension of the scientific expertise community membershave [17] our findings suggest that not everyone is able orwilling to learn rsciencersquos specialized language Those whocontribute with strongly diverging language from rsciencemight still passively consume the content or contribute throughupvotes and downvotes of content but ultimately refrain fromposting or commenting again The result is especially notewor-thy because it highlights a lost potential in encouraging thosewho posted or commented once to become active communitymembers If peoplersquos language used in a post and comment isindicative of diversity (which prior work suggests [26]) thenthis also signifies a lost opportunity to involve a broad rangeof people with varying scientific familiarity

Interestingly rsciencersquos community rules on the front page(Figure 1) reflect much of this specialized language in explicitlanguage norms For example rsciencersquos rules against sensa-tionalized titles and personal anecdotes4 have clear parallelsto the abundance of hedge words and rarity of personal andsubjective words In addition existing posts and comments actas a descriptive influence of norms [7] by providing languageexamples that have been accepted by the community

One interpretation of our results is that the community rulesare effective and functioning as intended First rsciencersquosspecialized language is clearly geared toward maintainingrigorous scientific discussion in the community [17] Secondit could be that the language of rscience promotes stability inreadership and trust among passive users by having a high barfor who gets to talk people feel most of what they hear is worthlistening to This can encourage a readership that lurks butdoes not contribute trusting in those who speak the languageto contribute to the discussion We see some evidence of thisin the massive number of subscribers in rscience (over 21million fifth overall on Reddit at the time of writing) versusan order of magnitude fewer posts and comments compared toother subreddits with similar subscriber counts

However another perspective is that there are opportunities tohelp broaden participation and socialize newcomers While it4httpswwwredditcomrsciencewikiruleswiki_submission_rules

may not be necessary for everyone to have the same knowledgeabout or interest in science science communication should beaccessible to everyone [3] Our findings indicate that simplyposting community rules on the front page has not been suf-ficient in guiding first time and transient contributors to usethe specialized language expected on rscience The rsciencecommunity can use these findings and our methodology toexplore ways of socializing newcomers without sacrificingcommunity norms such as by identifying and defining spe-cialized terms in the guidelines or moderator posts

We noticed qualitative evidence of this same tension betweenmaintaining rigorous standards and broader participation inrscience comments during our period of analysis (Januaryto December 2018) When searching for comments that con-tained lsquothis subredditrsquo in rscience comments we found thatthere was a small number of comments on both sides of thisdebate While some comments bemoaned the increase inclickbait titles and pop media others criticized the practice ofdeleting less serious contributions and those from a broader au-dience We look forward to presenting our results to rsciencemoderators as a way of furthering this discussion and identify-ing possible ways to support new users without sacrificing thestandards of rscience Our findings and methods provide theexciting first step for tools to support users in their first contri-butions to scientific communication communities and ways tolower the barrier of participation in the first place making sci-entific communication more equitable and accessible Belowwe discuss opportunities to make this happen

Design ImplicationsAcculturating new users to the language norms of rscienceOne way of encouraging broader participation while maintain-ing rsciencersquos specialized language is a more effective way ofwelcoming newcomers to the community Posting guidelinescan be helpful in enforcing norms but if norms are difficult fornewcomers to understand or follow they will leave Provid-ing feedback on first contributions can help new users followcommunity norms more effectively [10] This feedback haspreviously been given by veteran users willing to take time andhelp new contributors but our findings suggest that automaticmeasures can capture and convey language norms in a com-munity Our approach in this paper of using language modelsand word frequency distributions provides an effective wayof identifying diverging language and highlighting commonand rare words for the community For example we identifiedcommon hedging words that users who are more likely to stayin rscience will use Language models like ours can provideautomatic just in time feedback for posts and comments andoffer a scalable way of helping new users contribute

Reduce or explain scientific jargonWhile jargon is important in communicating to other re-searchers focused on the same problem it can overwhelmusers with less domain experience Scientific blogs for read-ers outside of the research community often take great careto reduce or explain scientific jargon [30] which can keepthem faithful to the research without overwhelming their audi-ence Our methods of identifying common and rare words canhelp rscience moderators and scientific writers identify jargon

that can act as a linguistic barrier to new users construct-ing a guide to new users on commonly used scientific wordsand their plain English explanations Taking inspiration fromtools that encourage simpler language (eg xkcdrsquos SimpleWriter [25]) language models like ours can identify unusualwords to reduce jargon when users are posting or comment-ing lowering the linguistic barrier of entry while maintainingaccuracy in rscience Going further language-aware writingtechnologies can support writers more than just highlightingwords potentially surfacing automated questions or edits forimproving clarity or audience reception

Scientific discourse with less strict language normsOne other possibility to encourage people with diverging lan-guage from actively contributing over a longer period of timecould be to provide opportunities for reading and contribut-ing posts and comments that are less strict in their languagenorms rscience could offer an extra space for such discoursewhere moderators could follow up on posts that compromiserigor While this would essentially create a subcommunityan advantage of this approach is that people might not feelalienated by incomprehensible content rscience does have asister subreddit reverythingscience which is ostensibly moreopen to broader discussion however the posted guidelines aresimilar to rsciencersquos with one main difference being the postdoes not need to reference a peer-reviewed paper from the last6 months This suggests that the language barriers betweenthe two might not be so different reverythingscience couldexperiment with relaxing some language norms to encouragebroader participation

Rather than offering a separate subcommunity rscience couldalso offer different interface views One view could show allposts and comments together but color code those that wereintended for a broader audience Users could choose whetherthey prefer to see only posts and comments that strictly followrscience guidelines versus a mixed view

LIMITATIONS amp FUTURE WORKWe focused only on first time authors as newcomers howeverwork has shown that users often lurk in a community observ-ing acceptable behaviour and norms before contributing [10]Lurkers might only contribute occasionally (eg once in ayear) including them in our classification of transient contrib-utors while in actuality they may be fully acculturated to thecommunity The fact that we found differences in languagebetween transient and returning members suggests that lurk-ers who post occasionally form a minority of transient usersor this initial post is a catalyst for more active contributionWe also aggregated across subreddits to study how languagediffered at the community level treating each subreddit sepa-rately An exciting extension of this work is to explore howthe same contributorrsquos language differs depending on whatcommunity they are in

Most automatic measures of language style including thelanguage models we used count relative frequency of wordsmaking it difficult to disentangle style from topic [35 37]or what people are talking about versus how they are talkingabout it An exciting future direction of this work is to developmeasures of language style robust to topical variation

CONCLUSIONThis paper contributes a comparison of the language used inrscience a subforum on Reddit that is focused on sciencecommunication to 11 other large or topically related subred-dits We showed that rscience uses a specialized languagedistinct from the other subreddits and that transient contribu-tors to rscience (those that contribute only once) do not matchthis specialized language compared to contributors who ulti-mately stay even when comparing their first comments Theseresults indicate that rsciencersquos specialized language acts as agatekeeper to the community discouraging a broader audiencefrom contributing Our findings add to the ongoing discussionof how to involve a diverse public in scientific discourse byproviding methods for reducing linguistic barriers of entry toone of the largest forums of public scientific discussion

DATASET AND TOOLSWe make available our datasets and Python code used foranalysis at httpsgithubcomtalaugustrscience_language

ACKNOWLEDGMENTSThis work was partially funded by NSF award 1651487 Wethank the ARK lab and Justine Zhang for providing valuablefeedback and support We also thank the reviewers for theirtime and input

REFERENCES1 Dominique Brossard and Dietram A Scheufele 2013

Science new media and the public Science 339 6115(2013) 40ndash41

2 Moira Burke and Robert Kraut 2008 Mind your Ps andQs the impact of politeness and rudeness in onlinecommunities In Proceedings of the 2008 ACMConference on Computer supported Cooperative WorkACM 281ndash284

3 Terry W Burns D John OrsquoConnor and Susan MStocklmayer 2003 Science communication acontemporary definition Public Understanding ofScience 12 2 (2003) 183ndash202

4 Justine Cassell and Dona Tversky 2005 The language ofonline intercultural community formation Journal ofComputer-Mediated Communication 10 2 (2005)

5 Eshwar Chandrasekharan Mattia Samory Shagun JhaverHunter Charvat Amy Bruckman Cliff Lampe JacobEisenstein and Eric Gilbert 2018 The Internetrsquos HiddenRules An Empirical Study of Reddit Norm Violations atMicro Meso and Macro Scales In Proceedings of theACM on Human-Computer Interaction

6 Stanley F Chen and Joshua Goodman 1999 Anempirical study of smoothing techniques for languagemodeling Computer Speech amp Language 13 4 (1999)359ndash394

7 Robert B Cialdini Raymond R Reno and Carl AKallgren 1990 A focus theory of normative conductrecycling the concept of norms to reduce littering inpublic places Journal of Personality and SocialPsychology 58 6 (1990)

8 Cristian Danescu-Niculescu-Mizil Robert West DanJurafsky Jure Leskovec and Christopher Potts 2013 Nocountry for old members User lifecycle and linguisticchange in online communities In Proceedings of the22nd International Conference on World Wide Web307ndash318

9 Casey Fiesler Jialun ldquoAaronrdquo Jiang Joshua McCannKyle Frye and Jed R Brubaker 2018 Reddit rulesCharacterizing an ecosystem of governance InProceedings of International AAAI Conference on Weband Social Media (ICWSM)

10 Denae Ford Kristina Lustig Jeremy Banks and ChrisParnin 2018 We donrsquot do that here How collaborativeediting with mentors improves engagement in socialQampA communities In Proceedings of the 2018 CHIConference on Human Factors in Computing Systems

11 Aaron Halfaker R Stuart Geiger Jonathan T Morganand John Riedl 2013 The rise and decline of an opencollaboration system How Wikipediarsquos reaction topopularity is causing its decline American BehavioralScientist 57 5 (2013) 664ndash688

12 Aaron Halfaker Aniket Kittur Robert Kraut and JohnRiedl 2009 A jury of your peers quality experience andownership in Wikipedia In Proceedings of the 5thInternational Symposium on Wikis and OpenCollaboration ACM

13 William L Hamilton Justine Zhang CristianDanescu-Niculescu-Mizil Dan Jurafsky and JureLeskovec 2017 Loyalty in online communities InEleventh International AAAI Conference on Web andSocial Media

14 Per Hetland 2014 Models in science communicationformatting public engagement and expertise NordicJournal of Science and Technology Studies 2 2 (2014)5ndash17

15 Shih-Wen Huang Minhyang Mia Suh Benjamin MakoHill and Gary Hsieh 2015 How activists are both bornand made An analysis of users on change org InProceedings of the 33rd Annual ACM Conference onHuman Factors in Computing Systems 211ndash220

16 Aaron Jaech Victoria Zayats Hao Fang Mari Ostendorfand Hannaneh Hajishirzi 2015 Talking to the crowdWhat do people react to in online discussions InProceedings of the 2015 Conference on EmpiricalMethods in Natural Language Processing 2026ndash2031

17 Ridley Jones Lucas Colusso Katharina Reinecke andGary Hsieh 2019 rscience Challenges andopportunities for online science communication (2019)

18 Robert E Kraut and Paul Resnick 2012 BuildingSuccessful Online Communities Evidence-based SocialDesign MIT Press

19 William Labov 1973 The linguistic consequences ofbeing a lame Language in Society 2 1 (1973) 81ndash115

20 William Labov 2006 The Social Stratification of Englishin New York City Cambridge University Press

21 Trevor Martin 2017 community2vec Vectorrepresentations of online communities encode semanticrelationships In Proceedings of the Second Workshop onNLP and Computational Social Science 27ndash31

22 J Nathan Matias and Merry Mou 2018 CivilServantCommunity-led experiments in platform governance InProceedings of the 2018 CHI Conference on HumanFactors in Computing Systems ACM

23 James Milroy and Lesley Milroy 1978 Belfast Changeand variation in an urban vernacular SociolinguisticPatterns in British English 19 (1978) 19ndash36

24 Jonathan T Morgan and Anna Filippova 2018lsquoWelcomersquo changes Descriptive and injunctive norms ina Wikipedia sub-community Proceedings of the ACM onHuman-Computer Interaction 52 (2018)

25 Randall Munroe 2015 A Thing Explainer word checker(2015) httpsblogxkcdcom20150922a-thing-explainer-word-checker

26 Dong Nguyen A Seza Dogruoumlz Carolyn P Roseacute andFranciska de Jong 2016 Computational sociolinguisticsA survey Computational Linguistics 42 3 (2016)537ndash593

27 Dong Nguyen and Carolyn P Roseacute 2011 Language useas a reflection of socialization in online communities InProceedings of the Workshop on Languages in SocialMedia 76ndash85

28 Nigini Oliveira Michael Muller Nazareno Andrade andKatharina Reinecke 2018 The exchange inStackExchange Divergences between Stack Overflowand its culturally diverse participants Proceedings of theACM on Human-Computer Interaction (2018)

29 Pontus Plaveacuten-Sigray Granville James MathesonBjoumlrn Christian Schiffler and William Hedley Thompson2017 The readability of scientific texts is decreasing overtime eLife 6 (2017)

30 Mathieu Ranger and Karen Bultitude 2016 lsquoThe kind ofmildly curious sort of science interested person like mersquoScience bloggersrsquo practices relating to audiencerecruitment Public Understanding of Science 25 3(2016) 361ndash378

31 Joseph Seering Tony Wang Jina Yoon and GeoffKaufman 2019 Moderator engagement and communitydevelopment in the age of algorithms New Media ampSociety (2019)

32 Eva Sharma and Munmun De Choudhury 2018 Mentalhealth support and its relationship to linguisticaccommodation in online communities In Proceedings ofthe 2018 CHI Conference on Human Factors inComputing Systems 641

33 Andreas Stolcke 2002 SRILM-an extensible languagemodeling toolkit In Proceedings of the SeventhInternational Conference on Spoken LanguageProcessing

34 Chenhao Tan Lillian Lee and Bo Pang 2014 The effectof wording on message propagation Topic-andauthor-controlled natural experiments on Twitter InProceedings of the 52nd Annual Meeting of theAssociation for Computational Linguistics

35 Trang Tran and Mari Ostendorf 2016 Characterizing thelanguage of online communities and its relation tocommunity reception In Proceedings of the 2016Conference on Empirical Methods in Natural LanguageProcessing 1030ndash1035

36 Debbie Treise and Michael F Weigold 2002 Advancingscience communication A survey of sciencecommunicators Science Communication 23 3 (2002)310ndash322

37 Justine Zhang William L Hamilton CristianDanescu-Niculescu-Mizil Dan Jurafsky and JureLeskovec 2017 Community identity and userengagement in a multi-community landscape InProceedings of the International AAAI Conference onWeblogs and Social Media

Introduction

Related Work

Science Communication

Online Community Norms and Guidelines

Community Language

Method

Data

Analysis

Results

Discussion

Design Implications

Limitations amp Future Work

Conclusion

Dataset and Tools

Acknowledgments

References