
Proceedings of Recent Advances in Natural Language Processing, pages 371–378, Hissar, Bulgaria, Sep 7–9 2015.

Predicting the Level of Text Standardness in User-generated Content

Nikola Ljubešić*‡  Darja Fišer†  Tomaž Erjavec*  Jaka Čibej†

Dafne Marko†  Senja Pollak*  Iza Škrjanec†

* Dept. of Knowledge Technologies, Jožef Stefan Institute, [email protected]

† Dept. of Translation Studies, Faculty of Arts, University of Ljubljana, [email protected]

‡ Dept. of Inf. Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

Abstract

Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required to perform good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.

1 Introduction

User-generated content (UGC) is becoming an increasingly frequent and important source of human knowledge and people's opinions (Crystal, 2011). Language use in social media is characterised by special technical and social circumstances, and as such deviates from the norm of traditional text production. Researching the language of social media is not only of great value to (socio)linguists, but also beneficial for improving automatic processing of UGC, which has proven to be quite difficult (Sproat, 2001). Consistent decreases in performance on noisy texts have been recorded in the entire text processing chain, from PoS-tagging, where the state-of-the-art Stanford tagger achieves 97% accuracy on Wall Street Journal texts, but only 85% accuracy on Twitter data (Gimpel et al., 2011), to parsing, where double-digit decreases in accuracy have been recorded for four state-of-the-art parsers on social media texts (Petrov and McDonald, 2012).

Non-standard linguistic features have been analysed both qualitatively and quantitatively (Eisenstein, 2013; Hu et al., 2013; Baldwin et al., 2013), and they have been taken into account in automatic text processing applications, which either strive to normalise non-standard features before submitting them to standard text processing tools (Han et al., 2012), adapt standard processing tools to work on non-standard data (Gimpel et al., 2011), or, in task-oriented applications, use a series of simple pre-processing steps to tackle the most frequent UGC-specific phenomena (Foster et al., 2011).

However, to the best of our knowledge, the level of (non-)standardness of UGC has not yet been measured to improve the corpus pre-processing pipeline or added to the corpus as an annotation layer in comprehensive corpus-linguistic analyses. In this paper, we present an experiment in which we manually annotated and analysed the (non-)standardness level of Slovene tweets, forum messages and news comments. The findings were then used to train a regression model that automatically predicts the level of text standardness in the entire corpus. We believe this information will be highly useful in linguistic analyses as well as in all stages of text processing, from more accurate sampling to build representative corpora to choosing the best tools for processing the collected documents, either with tools trained on standard language or with tools specially adapted for non-standard language varieties.

The paper is organised as follows. Section 2 presents the dataset. Section 3 introduces the features used in subsequent experiments, while Section 4 describes the actual experiments and their results, with an emphasis on feature evaluation, the gain when using external resources, and an analysis of the performance on specific subcorpora. The paper concludes with a discussion of the results and plans for future work.

2 The Dataset

This section presents the dataset used in subsequent experiments, starting with our corpus of user-generated Slovene and the sampling used to extract the dataset for manual annotation. We then explain the motivation behind having two dimensions of standardness, and describe the process of manual dataset annotation.

2.1 The Corpus of User-generated Slovene

The dataset for the reported experiments is taken from our corpus of user-generated Slovene, which currently contains three types of text: tweets, forum posts, and news site comments. The complete corpus contains just over 120 million tokens.

Tweets were collected with the TweetCaT tool (Ljubešić et al., 2014b), which was constructed specifically for compiling Twitter corpora of smaller languages. The tool uses the Twitter API and a small lexicon of language-specific Slovene words to first identify the users that predominantly tweet in Slovene, as well as their friends and followers. TweetCaT continuously collected the users' tweets for a period of almost two years, also updating the list of users. This resulted in the Slovene tweet subcorpus, which contains 61 million tokens. Most of the collected tweets were written between 2013 and 2014. It should be noted that the majority of these tweets did not turn out to be user-generated content, but rather news feeds, advertisements, and similar material produced by professional authors.

For forum posts and news site comments, six popular Slovene sources were chosen, as they were the most widely used and contained the most text. The selected forums focus on the topics of motoring, health, and science, respectively. The selected news sites pertain to the national Slovene broadcaster RTV Slovenija, and the most popular left-wing and right-wing weekly magazines. Because the crawled pages differ in terms of structure, separate text extractors were developed using the Beautiful Soup[1] module, which enables writing targeted structure extractors for HTML documents. This allowed us to avoid compromising corpus content with the large amounts of noise typically present in these types of sources, e.g. adverts and irrelevant links. It also enabled us to structure the texts and extract relevant metadata from them.

[1] http://www.crummy.com/software/BeautifulSoup/
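To illustrate the kind of targeted extraction described above, the following is a minimal Beautiful Soup sketch; the CSS selectors (div.comment, div.comment-body, span.author) are invented for illustration and do not correspond to the actual sites crawled for the corpus.

```python
from bs4 import BeautifulSoup

def extract_comments(html):
    """Return comment texts and minimal metadata from one crawled page."""
    soup = BeautifulSoup(html, "html.parser")
    comments = []
    for node in soup.select("div.comment"):          # hypothetical selector
        body = node.select_one("div.comment-body")   # hypothetical selector
        author = node.select_one("span.author")      # hypothetical selector
        if body is None:
            continue
        comments.append({
            "author": author.get_text(strip=True) if author else None,
            "text": body.get_text(" ", strip=True),
        })
    return comments
```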

The forum posts contribute 47 million tokens to the corpus, while the news site comments amount to 15 million tokens. As with tweets, the majority of the collected comments were posted between 2013 and 2014. The forum posts cover a wider time span, with similar portions of text coming from each of the years between 2006 and 2014.

The corpus is also automatically annotated. The texts were first tokenised and the word tokens normalised (standardised) using the method of Ljubešić et al. (2014a), which employs character-based statistical machine translation (CSMT). The CSMT translation model was trained on 1000 keywords taken from the Slovene tweet corpus (compared to a corpus of standard Slovene) and their manually determined standard equivalents. Then, using the models for standard Slovene, the standardised word tokens were PoS-tagged and lemmatised.

2.2 Samples for Manual Annotation

For the experiments reported in this paper, we constructed a dataset containing individual texts that were semi-randomly sampled from the corpus of tweets, forum posts and comments. The dataset was then manually annotated.

To guarantee a balanced dataset, we selected equal proportions (one third) of texts for each text type. For forum posts and comments, we included equal proportions of each of their six sources. In order to obtain a balanced dataset in terms of language (non-)standardness from a corpus heavily skewed towards standard language, we used a heuristic to roughly estimate the degree of text (non-)standardness, which makes use of the corpus normalisation procedure. For each text, we computed the ratio between the number of word tokens that were changed by the automatic normalisation and its overall length in words. If this ratio was 0.1 or less, the text was considered standard, otherwise it was considered non-standard. The dataset was then constructed so that it contained an equal number of "standard" and "non-standard" texts. It should be noted that this is only an estimate, and that the presented method does not depend on an exact balance. Different rough measures of standardness could also be taken, e.g. a simple ratio of out-of-vocabulary words to all words, given a lexicon of standard word forms.
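A minimal sketch of this sampling heuristic, assuming the original and automatically normalised token sequences are available as parallel lists (the function name and the handling of empty texts are ours):

```python
def rough_standardness(original_tokens, normalised_tokens, threshold=0.1):
    """Sampling heuristic from Section 2.2: a text counts as 'standard' if at
    most 10% of its word tokens were changed by automatic normalisation."""
    changed = sum(o != n for o, n in zip(original_tokens, normalised_tokens))
    ratio = changed / max(len(original_tokens), 1)
    return "standard" if ratio <= threshold else "non-standard"
```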

2.3 Dimensions of Standardness

It is far from easy to tell how "standard" a certain text is. While it could be regarded as a single dimension of a text (as is usually the case with e.g. sentiment annotation), standardness turns out to comprise a very disparate set of features. For example, some authors use standard spelling, but no capital letters. Others make many typos, while some will typeset their text in a standard fashion, but use colloquial or dialectal lexis and morphology.

To strike a balance between the adequacy and the complexity of the annotation, we decided to use two dimensions of standardness: technical and linguistic. The score for technical text standardness focuses on word capitalisation, the use of punctuation, and the presence of typos or repeated characters in the words. The score for linguistic standardness, on the other hand, takes into account the knowledge of the language by the authors and their more or less conscious decisions to use non-standard language, involving spelling, lexis, morphology, and word order.

These two dimensions are meant to be straightforward enough to be applied by the annotators, informative enough for NLP tools to appropriately apply possibly different normalisation methods, and relevant enough to linguists for filtering relevant texts when researching non-standard language.

2.4 Manual Annotation and Resulting Dataset

The annotators, who were postgraduate students of linguistics, were presented with annotation guidelines and criteria for annotating the two dimensions of non-standardness. Each given text was to be annotated in terms of both dimensions using a score between 1 (standard) and 3 (very non-standard), with 2 marking slightly non-standard texts. We used a three-score system as the task is not (and can hardly be) very precisely defined. Using this scale would also allow us to better observe inter-annotator agreement. In addition, for learning standardness models, we used a regression model, which returns a degree of standardness, rather than a classification one. In this particular case, a slightly more fine-grained scoring is beneficial.

To give an idea of the types of features taken into account for each dimension, two examples of short texts are presented below:

• T=1 / L=3
  Original: Ma men se zdi tole s poimenovanji oz s poslovenjenjem imen mest čist mem.
  Standardised: Meni se zdi to s poimenovanji oz. s poslovenjenjem imen mest čisto mimo.
  English: To-me Refl. it-seems this with naming i.e. with making-into-Slovene names of-cities completely wrong.
  Differences: colloquial particle ("ma"), colloquial form of pronoun ("tole" vs. "to"), phonetic transcription of dialectal word forms ("men" vs. "meni", "čist" vs. "čisto", "mem" vs. "mimo")

• T=3 / L=1
  Original: se pravi,da predvidevaš razveljavitev
  Standardised: Se pravi, da predvidevaš razveljavitev?
  English: Refl. this-means, that you-foresee annulling?
  Differences: no capital letter at the start of the sentence, no space after the comma, no sentence-final punctuation.

The annotators were told to mark with 0 those texts that were out of scope for the experiment, e.g. if they were written in a foreign language, automatically generated (such as news or advert lead-ins in tweets), or if they contained no linguistic material (e.g. only URLs, hashtags, and emoticons). These texts were then not included in the manually annotated dataset.

After a training session in which a small set of texts was annotated and discussed by all annotators, the experimental data was annotated in two campaigns. A first batch of 904 text instances was annotated, each by a single annotator, and was subsequently used as the development data in our experiments. For the second batch, each of 402 text instances was annotated by two annotators. In 8 of these instances, the difference between the two annotations in at least one dimension was two, i.e. one annotator marked the text as standard while the other marked it as very non-standard. These data points were removed from the dataset, leaving 394 instances that constituted the testing set for the experiments. The response variables for the experiments were computed as the average of the values given by the two annotators.
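The test-set construction described above can be sketched as follows; the input format (a text plus one score pair per annotator) is an assumption made for the example:

```python
def build_test_set(instances):
    """instances: iterable of (text, (tech1, ling1), (tech2, ling2)) triples
    with the two annotators' scores. Instances where the annotators differ by
    2 on any dimension are dropped; the rest get per-dimension averages."""
    kept = []
    for text, (tech1, ling1), (tech2, ling2) in instances:
        if abs(tech1 - tech2) == 2 or abs(ling1 - ling2) == 2:
            continue  # standard vs. very non-standard: contradictory annotation
        kept.append((text, (tech1 + tech2) / 2, (ling1 + ling2) / 2))
    return kept
```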

3 The Feature Space

We defined 29 features to describe the technical and linguistic text properties. The features can be grouped into two main categories. Character-based features (listed in Table 1 and described in Section 3.1) concern the incorrect use of punctuation and spaces, character repetition, the ratio of alphabetic vs. non-alphabetic characters, vowels vs. consonants, etc. Token-based features (listed in Table 2 and described in Section 3.2) describe word properties. Some are very general, e.g. proportions of very short words, capitalised words, etc., while others depend on the use of language-specific lexicons and mostly compute the proportion of words not included in these lexicons.

3.1 Character-based Features

This category contains features dealing either with the use of punctuation and brackets or with the use of alphanumeric characters.

In terms of punctuation and brackets, we calculate the ratio of punctuation compared to all characters, the ratio of paragraphs ending with an end-of-sentence punctuation sign, and the ratio of spaces preceding or not following a punctuation sign.

Name                    Description
punc_space_ratio        ratio of punctuation marks followed by a space
space_punc_ratio        ratio of punctuation marks following a space
ucase_char_ratio        ratio of upper-case characters
punc_ratio              ratio of punctuation characters
sentpunc_ucase_ratio    ratio of sentence endings followed by an upper-case character
parstart_ucase_ratio    ratio of paragraphs beginning with an upper-case character
parend_sentpunc_ratio   ratio of paragraphs ending with a punctuation mark
alpha_ratio             ratio of letter characters
weirdbracket_ratio      ratio of brackets with unexpected spaces
weirdquote_ratio        ratio of quotes with unexpected spaces
char_repeat_ratio       ratio of character repetitions of n={2,3}
alpha_repeat_ratio      ratio of letter repetitions of n={2,3}
char_length             text length in characters
cons_alpha_ratio        ratio of consonants among letters
vow_cons_ratio          ratio of vowels to consonants
alphabet_ratio          ratio of Slovene alphabet characters

Table 1: Overview of character-based features

Similarly, we calculate the ratio of opening or closing brackets that are preceded and followed by spaces.

For the alphanumeric characters, we calculate the ratios of alphabetic and alphanumeric characters in the text, the ratio of uppercase letters, and the ratios of sentences and paragraphs starting with an uppercase letter. One feature is based on the ratio between vowels and consonants in the text, while another encodes the ratio of characters from the Slovene alphabet. Two other features are based on repeating characters, one covering any character, the other focusing on alphabetic characters only.
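For concreteness, a rough re-implementation of a few of the character-based features from Table 1 is sketched below; the exact definitions used in the paper may differ, e.g. in how repetitions are counted.

```python
import re
import string

VOWELS = set("aeiou")
SLOVENE_LETTERS = set("abcčdefghijklmnoprsštuvzž")

def char_features(text):
    """Rough versions of a few character-based features from Table 1."""
    n = max(len(text), 1)
    letters = [c.lower() for c in text if c.isalpha()]
    vowels = sum(c in VOWELS for c in letters)
    consonants = max(len(letters) - vowels, 1)
    return {
        "char_length": len(text),
        "punc_ratio": sum(c in string.punctuation for c in text) / n,
        "ucase_char_ratio": sum(c.isupper() for c in text) / n,
        "alpha_ratio": len(letters) / n,
        "alphabet_ratio": sum(c in SLOVENE_LETTERS for c in letters) / max(len(letters), 1),
        "vow_cons_ratio": vowels / consonants,
        # runs of 2 or 3 identical characters, relative to text length
        "char_repeat_ratio": len(re.findall(r"(.)\1{1,2}", text)) / n,
    }
```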


Name                     Description
alphanum_token_ratio     ratio of tokens consisting of alphanumeric characters
token_rep_ratio          ratio of token repetitions
ucase_token_ratio        ratio of upper-case tokens
tcase_token_ratio        ratio of title-case tokens
short_token_ratio        ratio of short tokens (up to 3 characters)
oov_ratio                ratio of OOVs given a lexical resource
short_oov_ratio          ratio of OOVs among short tokens (up to 4 characters)
lowercased_names_ratio   ratio of names written in lower-case

Table 2: Overview of token-based features

3.2 Token-based Features

In this category, we discriminate between string-based and lexicon-based features, the latter being dependent on external data sources.

In terms of string-based features, we compute the ratio of title-case and upper-case words, as well as word repetitions. Another feature is the ratio of words composed only of consonants. We also consider the ratio of very short words.

A large part of the lexicon-based features uses the Sloleks lexicon[2] (Krek and Erjavec, 2009), consisting of Slovene words with all their word forms. The lexicon contains 961,040 forms, since Slovene is a morphologically rich language and each lexeme has many possible word forms.

[2] Sloleks is available under the CC BY-NC-SA license at http://www.slovenscina.eu/.

The features based on this resource are the ratio of out-of-vocabulary (OOV) words (sloleks), the ratio of words that are OOVs but are missing a vowel character (sloleks_vowel), the ratio of short words that are OOVs (sloleks_short), and the number of lower-case forms covered only by a title- or upper-case entry in the lexicon (sloleks_names).
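A sketch of how two of these lexicon-based features can be computed, assuming the lexicon is loaded as a set of lower-cased word forms (this is our simplification, not the paper's exact implementation):

```python
def sloleks_features(tokens, sloleks_forms):
    """Two of the Sloleks-based features; 'sloleks_forms' is assumed to be a
    set of lower-cased standard word forms."""
    words = [t for t in tokens if t.isalpha()]
    short = [w for w in words if len(w) <= 4]
    oov = [w for w in words if w.lower() not in sloleks_forms]
    return {
        "oov_ratio": len(oov) / max(len(words), 1),
        "short_oov_ratio": sum(w.lower() not in sloleks_forms
                               for w in short) / max(len(short), 1),
    }
```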

We experimented with another source of lexical information – the KRES balanced corpus of standard Slovene (Logar Berginc et al., 2012). We produced two lexicons from the corpus, one consisting of all letter-only tokens occurring at least ten times (70,249 entries, kresleks_10), and the other with a frequency threshold of 100 (4,339 entries, kresleks_100). We used both resources to calculate OOV ratios.

Finally, we used a very small lexical resource of the 195 most frequent non-standard forms of Slovene (nonstdlex). We produced this resource by calculating the log-likelihood statistic of each token from our corpus with respect to its frequency in the KRES corpus. We manually inspected the 250 highest-ranked tokens, cleaning out 55 irrelevant entries.
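The keyword statistic used here can be computed as below; this is the standard Dunning log-likelihood (G2) formulation and is only assumed to match the authors' implementation.

```python
import math

def log_likelihood(freq_ugc, freq_ref, size_ugc, size_ref):
    """Dunning's G2 statistic for one token, comparing its frequency in the
    UGC corpus with its frequency in the reference (KRES) corpus."""
    expected_ugc = size_ugc * (freq_ugc + freq_ref) / (size_ugc + size_ref)
    expected_ref = size_ref * (freq_ugc + freq_ref) / (size_ugc + size_ref)
    g2 = 0.0
    if freq_ugc > 0:
        g2 += freq_ugc * math.log(freq_ugc / expected_ugc)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2

# Tokens are ranked by G2 and the top of the list is inspected manually.
```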

4 Experiments and Results

In this section, we describe our regressor optimisation and evaluation, the analysis of feature coefficients, the dependence on external information sources, the learning curve of the problem, and the independence from the text genre.

4.1 Regressor Optimisation

In the first set of experiments, we used the development set to perform grid-search hyperparameter optimisation via 10-fold cross-validation on the SVR regressor using an RBF kernel. As our scoring function throughout the paper, we use the mean absolute error, as it is more resistant to outliers than the mean squared error and is also easier to interpret.
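A sketch of this optimisation using scikit-learn; the hyperparameter grid and the variable names X_dev and y_dev are illustrative, since the paper does not list the exact grid.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# RBF-kernel SVR with grid-search hyperparameter optimisation,
# 10-fold cross-validation, scored by mean absolute error.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {
    "svr__C": [0.1, 1, 10, 100],      # illustrative grid
    "svr__gamma": [0.01, 0.1, 1],
    "svr__epsilon": [0.01, 0.1, 0.5],
}
search = GridSearchCV(pipeline, param_grid, cv=10,
                      scoring="neg_mean_absolute_error")
search.fit(X_dev, y_dev)              # X_dev: 29 features, y_dev: 1-3 scores
print("best MAE:", -search.best_score_)
```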

The results obtained from the optimised regressor, presented in Table 3, showed that the task of predicting technical standardness is simpler than that of predicting linguistic standardness, which was expected.

Dimension     Mean absolute error
technical     0.451 ± 0.033
linguistic    0.544 ± 0.033

Table 3: Results obtained from the development set

4.2 Test Set Evaluation

Once we had optimised our hyperparameters on the development set, we performed an evaluation of the system on our test set. Given that the test set was double-annotated, with opposite-label instances removed and neighbouring labels averaged, we expected a lower level of error compared to the development data.

In Table 4 we compare our system with multiple baselines. The first two baselines (baseline_linear and baseline_SVR) are supervised equivalents of what researchers mostly use in practice – the OOV-ratio heuristic. These baselines use only one feature – the OOV ratio computed on the Sloleks lexicon (sloleks). The first baseline (baseline_linear) is algorithmically simpler as it uses linear regression, thereby linearly mapping the [0, 1] range of the OOV-ratio heuristic to the expected [1, 3] range of our response variables. The second baseline (baseline_SVR) uses SVR with an RBF kernel.
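The two supervised baselines can be reproduced roughly as follows; oov_dev, oov_test, y_dev and y_test are placeholder names for the Sloleks OOV-ratio feature and the response variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

# Both baselines see a single feature: the Sloleks OOV ratio of each text.
X_base_dev = np.asarray(oov_dev).reshape(-1, 1)
X_base_test = np.asarray(oov_test).reshape(-1, 1)

baseline_linear = LinearRegression().fit(X_base_dev, y_dev)
baseline_svr = SVR(kernel="rbf").fit(X_base_dev, y_dev)

print(mean_absolute_error(y_test, baseline_linear.predict(X_base_test)))
print(mean_absolute_error(y_test, baseline_svr.predict(X_base_test)))
```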

The last two baselines are random baselines that produce random numbers in the [1, 3] range. The baseline_test was evaluated on our test set, and the baseline_theoretical was evaluated on another randomly generated sequence of values in the [1, 3] range. Both baselines were evaluated on drastically longer test sets (either by repeating our test set or by generating longer sequences of random numbers) to produce accurate estimates.

We can observe that the mean absolute error of our final system did, as expected, go down in comparison to the 10-fold cross-validation result obtained on the development data (Table 3). There are two reasons for this: 1) most incorrect annotations in the testing data were removed, and 2) the 3-level scale was transformed into a 5-level scale by averaging the two annotations. The error level on the technical dimension is still lower than on the linguistic dimension, although the distance between the two dimensions has shrunk from 0.09 points to 0.05 points.

Comparing our system to baseline_linear on the linguistic dimension shows that using more variables than just the OOV ratio and training a non-linear regressor does produce a much better system, with an error reduction of more than 0.17 points. When using a non-linear regressor as a baseline (baseline_SVR), the error difference falls to 0.12 points, which argues for using non-linear regressors on this problem.

For the technical dimension, as expected, the OOV-ratio heuristic is not optimal: unlike the full feature set, it produces similar or worse results than it does on the linguistic dimension.

The two random baselines show that both our system and the OOV baselines are a safe distance away from these weak baselines.

Besides the fact that using multiple features enhances our results, we want to stress that using a supervised system should not be questioned at all, since the output of heuristics such as the OOV ratio is very hard to interpret for the final corpus user, in contrast to the [1, 3] range defined in this paper.

                        Technical    Linguistic
final system            0.377        0.424
baseline_linear         0.594        0.597
baseline_SVR            0.584        0.548
baseline_test           0.713        0.749
baseline_theoretical    0.889        0.889

Table 4: Final evaluation and comparison with the baselines via mean absolute error

4.3 Feature Coefficients

Our next experiment focused on the usefulness of specific features by training a linear-kernel SVR on standardised data and analysing its coefficients. We thereby inspected which variables demonstrate the highest prediction strength for each of our two dimensions.
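A sketch of this coefficient analysis; feature_names, X_dev and y_dev are placeholders for the 29 feature names, the feature matrix, and the development scores.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Train a linear-kernel SVR on standardised features and rank the features
# by the absolute value of their coefficients.
X_std = StandardScaler().fit_transform(X_dev)
svr = SVR(kernel="linear").fit(X_std, y_dev)
coefs = svr.coef_.ravel()  # one coefficient per feature
for name, coef in sorted(zip(feature_names, coefs), key=lambda p: -abs(p[1]))[:5]:
    print(f"{name}\t{coef:+.3f}")
```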

For the technical dimension, the most prominent features are the ratio of alphabetic characters, the number of character repetitions, the ratio of upper-case characters, and the ratio of spaces after punctuation.

On the other hand, for the linguistic dimension, the most prominent features are the OOV rate given a standard lexicon (sloleks), the OOV rate given a lexicon of non-standard forms (nonstdlex), and the ratio of short tokens.

As expected, for the technical dimension, character-based features are of greater importance. As for the linguistic dimension, token-based features, especially the lexicon-based ones, carry more weight.

4.4 External Information Sources

Our next experiment looked into how much information is obtained from the external information sources used in lexicon-based features. With this experiment, we wanted to measure our dependence on these resources and the level of prediction quality one can expect if some of those resources are not available.

The results obtained with the information sources that influence prediction quality the most are given in Table 5. While technical prediction in general does not suffer much when external information sources are removed, linguistic prediction does suffer a 0.11-point increase in error when all the resource-dependent features are removed (none). The most informative resource is the lexicon of standard language (sloleks), yielding a 0.07-out-of-0.11-point error reduction. Producing a lexicon from a corpus of standard language (kresleks_100) does not come close to that, producing just a 0.02-point error reduction. While the small lexicon of non-standard forms (nonstdlex) does reduce the error by 0.06 points, using all standard-lexicon-related features (sloleks_all) comes 0.03 points closer to the final result obtained with all features (all).

Information source    Technical    Linguistic
all                   0.377        0.424
none                  0.384        0.537
kresleks_100          0.385        0.514
sloleks               0.379        0.461
sloleks_vowel         0.378        0.488
sloleks_all           0.380        0.445
nonstdlex             0.379        0.476

Table 5: Dependence of prediction quality on external information sources

4.5 Learning Curve

This set of experiments focused on the impact of the amount of data available for training on our prediction quality. We compared three predictors: the technical one, the linguistic one, and the linguistic one without any external information sources. The three learning curves are depicted in Figure 1. They show that useful results can be obtained with just a few hundred annotated instances. The technical and linguistic dimensions seem to be equally hard to learn when up to 100 instances are available for training, with the technical dimension taking off after that. The linguistic dimension is obviously harder to learn when external information sources are not used. Both learning curves seem to be parallel, showing that a larger amount of training data, at least for the features defined, cannot compensate for the lack of external knowledge.

Figure 1: Mean absolute error as a function of training data size (curves: technical, linguistic, and linguistic without external sources)
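The learning curves can be approximated with a simple loop over growing training-set sizes; the subset sizes and the model variable (the tuned SVR pipeline) are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

# Evaluate the regressor on growing subsets of the development data.
X_shuf, y_shuf = shuffle(X_dev, y_dev, random_state=0)
for n in (50, 100, 200, 400, 600, 800):
    scores = cross_val_score(model, X_shuf[:n], y_shuf[:n], cv=10,
                             scoring="neg_mean_absolute_error")
    print(n, round(-np.mean(scores), 3))
```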

4.6 Genre Dependence

Our final experiment focused on the dependence of the results on the genre of the training and testing data. We took into consideration the three genres present in the corpus: tweets, news comments, and forum posts. On each side, training and testing, we experimented with using either data from just one genre or from all genres together. On the training side, we made sure to use the same number of instances in each experiment. The results of the genre-dependence experiment are presented in Table 6.

Technical
Training \ Testing    Tweet    Comment    Forum    All
Tweet                 0.384    0.451      0.485    0.440
Comment               0.519    0.389      0.400    0.437
Forum                 0.514    0.408      0.370    0.431
All                   0.426    0.382      0.417    0.409

Linguistic
Training \ Testing    Tweet    Comment    Forum    All
Tweet                 0.410    0.452      0.503    0.455
Comment               0.453    0.429      0.510    0.465
Forum                 0.444    0.458      0.500    0.467
All                   0.395    0.439      0.507    0.448

Table 6: Impact of the genre of the training data (rows) and testing data (columns)

In the technical dimension, we observe the best results when the training and testing data come from the same genre. There are no significant differences between the genres.

In the linguistic dimension, Twitter data proves to be the easiest to perform prediction on, and forum data the most complicated. Interestingly, the best predictions are made not when the training data comes from the same genre, but when all genres are combined.


5 Conclusion

In this paper, we presented a supervised-learning approach to predicting the text standardness level. While we differentiated between two dimensions of standardness, the technical and the linguistic one, we explained both with the same 29 features, most of which are independent of external information sources.

We showed that we outperform the supervised baselines that rely only on the traditionally used OOV ratio. We outperformed those baselines even when not using any external information sources, which makes our predictor highly language-independent.

Both predictors outperformed the supervised baselines when only a few tens of instances were available for training. Most of the learning is performed on the first 500 instances. This makes building the training set for a language an easy task that can be completed in a day or two.

While the single most informative external information source was the lexicon of standard language, adding information from very small lexicons of frequent non-standard forms or from automatically transformed lexicons of standard language significantly improved the results.

Finally, we showed that the predictors are, in general, genre-independent. The technical dimension is slightly more genre-dependent than the linguistic one. While predicting linguistic standardness on tweets is the simplest task, predicting the same on forums proves to be much more difficult.

Future work includes applying more transformations to the lexicon of standard language than just vowel dropping, inspecting the language independence of the features without relying on manually annotated data in the target language, and using lexical information from the training data to improve prediction.

Acknowledgments

The work described in this paper was funded by the Slovenian Research Agency, project J6-6842, and by the European Fund for Regional Development 2007–2013, grant RC.2.2.08-0050 (Project RAPUT).

References

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How Noisy Social Media Text, How Diffrnt Social Media Sources. In Sixth Intl. Joint Conference on NLP, pages 356–364.

David Crystal. 2011. Internet Linguistics: A Student Guide. Routledge, New York, NY, 1st edition.

Jacob Eisenstein. 2013. What to Do About Bad Language on the Internet. In NAACL-HLT, pages 359–369. ACL, June.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Joseph Le Roux, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0. In IJCNLP, pages 893–901, Chiang Mai, Thailand. Asian Federation of NLP.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In ACL (Short Papers), pages 42–47. ACL.

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically Constructing a Normalisation Dictionary for Microblogs. In EMNLP-CoNLL, pages 421–432, Jeju Island, Korea.

Yuheng Hu, Kartik Talamadupula, and Subbarao Kambhampati. 2013. Dude, srsly?: The Surprisingly Formal Nature of Twitter's Language. In ICWSM.

Simon Krek and Tomaž Erjavec. 2009. Standardised Encoding of Morphological Lexica for Slavic Languages. In MONDILEX Second Open Workshop, pages 24–29, Kyiv, Ukraine. National Academy of Sciences of Ukraine.

Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2014a. Standardizing Tweets with Character-Level Machine Translation. In CICLing, Lecture Notes in Computer Science, pages 164–175. Springer.

Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2014b. TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages. In Ninth LREC, Reykjavik. ELRA.

Nataša Logar Berginc, Miha Grčar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt, and Simon Krek. 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba [The Gigafida, KRES, ccGigafida and ccKRES corpora of Slovene: construction, content, use]. Zbirka Sporazumevanje. Trojina, zavod za uporabno slovenistiko, and Fakulteta za družbene vede, Ljubljana.

Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 Shared Task on Parsing the Web. First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), 59.

Richard Sproat. 2001. Normalization of Non-Standard Words. Computer Speech & Language, 15(3):287–333, July.
