Introducing NLP with R. Charlie Redmon | SupStat Analytics. 10/6/14, 19:37. http://docs.supstat.com/NLPwithR/#1. Copyright Supstat Inc. All Rights Reserved.

Introducing natural language processing (NLP) with R


  • 1. Introducing NLP with R. Charlie Redmon | SupStat Analytics. Copyright Supstat Inc. All Rights Reserved.
  • 2. Outline
      - Introduction to NLP
      - Foundational Frameworks
      - Working with text in R
      - Regular Expressions
          - As pattern matching device
          - Theoretical connection with finite state automaton
          - Application in morphological analysis
      - N-gram models
          - Recognizing language
          - Generating language
      - Further reading
  • 3. What is NLP?
      Natural Language Processing
      - Briefly: building models to facilitate human-computer interaction through language
      - We say "natural language" here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
      Major sub-disciplines:
      - Speech Recognition/Synthesis
      - Computational Morphology (word structure)
      - Lexical Semantics (word meaning)
      - Computational Syntax (phrase/sentence structure)
      - Compositional Semantics (phrase/sentence meaning)
      - Information Retrieval
  • 4. Why R?
      - R has powerful text processing capabilities
      - Many useful NLP-related packages
      - Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
  • 5. Foundational NLP Frameworks
      - Turing: Turing Machine; Finite State Automaton, Finite State Transducer
      - Kleene: Regular Expressions
      - Chomsky: Regular Languages and their relation to natural languages
      - Markov:
          - N-gram models
          - HMMs
      - Shannon:
          - Information Theory
          - Noisy Channel, Entropy models
  • 6. The Workflow
      1. Import and manipulate text in R
      2. Create data structures facilitating NLP operations
      3. Model implementation:
          - Morphological parsing
          - N-gram parsing
          - N-gram language generation
          - ...
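
The three workflow steps can be sketched end to end on a toy input. This is only an illustration under stated assumptions: the sample sentence and the names `text`, `tokens`, `bigrams`, and `counts` are hypothetical, and a literal string stands in for the import step.

```r
# Step 1 (stand-in for importing): a hypothetical one-line "corpus".
text <- "the knights who say ni say ni to the knights"

# Step 2: build a data structure for NLP operations, here a vector of word tokens.
tokens <- strsplit(text, " +")[[1]]

# Step 3: a very small n-gram model. Pair each token with its successor
# to form bigrams, then count how often each bigram occurs.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
counts <- sort(table(bigrams), decreasing = TRUE)

length(tokens)           # 10 tokens
length(bigrams)          # 9 bigrams
counts[["say ni"]]       # 2
counts[["the knights"]]  # 2
```

Ten tokens always yield nine bigrams, since every token except the last starts one.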
  • 7. Importing text into R
      Primary importing functions: scan(), readLines()

          # sep="" splits on any whitespace, so each token becomes one element
          monty_text = scan('data/grail.txt', what="character", sep="", quote="")
          monty_text[1:6]
          [1] "SCENE"  "1:"     "[wind]" "[clop"  "clop"   "clop]"

          malayalam_text = scan('data/mathrubhumi_2014-10_full.txt', what="character", sep="", quote="")
          malayalam_text[15:20]
          [1] "#Date:"                                    "01-10-2014"
          [3] "#----------------------------------------" "kt"
          [5] "++n"                                       "+n"

      Why might this data structure be a problem for many natural language structures?
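
To compare the two importing functions without the slide's data files, the sketch below writes two lines to a temporary file and reads them back both ways. The sample text and the names `tmp`, `by_token`, and `by_line` are assumptions made for illustration.

```r
# Write a small sample to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("SCENE 1: [wind] [clop clop clop]",
             "KING ARTHUR: Whoa there!"), tmp)

# scan() with sep="" splits on any whitespace: one element per token.
by_token <- scan(tmp, what = "character", sep = "", quote = "", quiet = TRUE)

# readLines() keeps each line of the file as one element.
by_line <- readLines(tmp)

length(by_token)  # 10
length(by_line)   # 2
by_token[1:3]     # "SCENE" "1:" "[wind]"

unlink(tmp)       # remove the temporary file
```

With `scan()`, line boundaries disappear; with `readLines()`, they define the elements. Which is preferable depends on the unit of analysis needed later.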
  • 8. Condensing to single text stream

          # collapse=" " joins every token into one space-separated string
          monty_text = paste(monty_text, collapse=" ")
          malayalam_text = paste(malayalam_text, collapse=" ")
          length(monty_text); length(malayalam_text)
          [1] 1
          [1] 1

          substr(monty_text, 1, 70)
          [1] "SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop clop c"

          substr(malayalam_text, 304, 400)
          [1] "4 cc dt cn .... +n .. D. D"
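
One plausible reason for condensing the text is that a pattern spanning several words cannot match inside a vector that holds one token per element. The sketch below shows this on a toy vector; `tokens` and `text` are hypothetical stand-ins for the slide's variables.

```r
# Toy token vector standing in for the output of scan() (hypothetical data).
tokens <- c("SCENE", "1:", "[wind]", "[clop", "clop", "clop]")

# collapse=" " joins all elements into a single string.
text <- paste(tokens, collapse = " ")

length(text)          # 1: a single-element character vector
nchar(text)           # 32 characters in the stream
substr(text, 1, 15)   # "SCENE 1: [wind]"

# A two-word pattern never matches within any single token...
any(grepl("clop clop", tokens))  # FALSE
# ...but it does match in the collapsed stream.
grepl("clop clop", text)         # TRUE
```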
  • 9. Regular Expressions

      SYMBOL   MEANING                             EXAMPLE
      []       Disjunction (set)                   /[Gg]oogle/ = Google, google
      ?        0 or 1 of the preceding character   /savou?r/ = savor, savour
      *        0 or more of the preceding char.    /hey!*/ = hey, hey!, hey!!, ...
      \        Escape character                    /hey\?/ = hey?
      +        1 or more of the preceding char.    /a+h/ = ah, aah, aaah, ...
      {n,m}    n to m repetitions                  /a{1,4}h{1,3}/ = aahh, ahhh, ...
      .        Wildcard (any character)            /#.*/ = #rstats, #uofl, ...
      ()       Conjunction (grouping)              /(ha)+/ = ha, haha, hahaha, ...
      [^ ]     NOT (negates bracketed chars)       /[^#.*]/ = any single character except #, ., or *
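
The patterns in this table can be tried directly with base R's `grepl()`. The sketch below makes a few assumptions: the `words` vector is invented, and each pattern is wrapped in the anchors `^` and `$` (covered on the next slide) so that it must match the whole string. Note that a repetition range is written with a comma, `{1,4}`, and that the escaping backslash must be doubled inside an R string literal.

```r
# Hypothetical test strings, one per pattern of interest.
words <- c("Google", "google", "savor", "savour", "hey", "hey!!", "hey?",
           "ah", "aaah", "haha", "#rstats")

# grepl() returns TRUE for each element that contains a match;
# indexing with it keeps only the matching strings.
words[grepl("^[Gg]oogle$", words)]      # "Google" "google"
words[grepl("^savou?r$", words)]        # "savor" "savour"
words[grepl("^hey!*$", words)]          # "hey" "hey!!"
words[grepl("^hey\\?$", words)]         # "hey?"  (backslash doubled in R)
words[grepl("^a{1,4}h{1,3}$", words)]   # "ah" "aaah"
words[grepl("^(ha)+$", words)]          # "haha"
words[grepl("^#.*", words)]             # "#rstats"
```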
  • 10. Regular Expressions

      SYMBOL   MEANING                             EXAMPLE
      [x-y]    Match characters from 'x' to 'y'    /[A-Z][1-9]/ = A1, Q8, X5, ...
      \w       Word character (alphanumeric)       /\w's/ = that's, Jerry's, ...
      \W       Non-word character
      \d       Digit character (0-9)               /\d{3}/ = 137, 254, ...
      \D       Non-digit character
      \s       Whitespace                          /\w+\s+\w+/ = I am, I   am, ...
      \S       Non-whitespace
      \b       Word boundary                       /\bthe\b/ = the, not then
      \B       Non-word boundary
      ^        Beginning of line                   /^[a-z]/ = non-capitalized beginning of line
      $        End of line                         /#.*$/ = hashtags at end of line
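
In R string literals the backslash of classes such as `\w`, `\d`, `\s`, and `\b` must be doubled. The sketch below tries several of the patterns from this slide on an invented vector `x`; passing `perl = TRUE` is a choice made here for predictable handling of these classes, not something the slides specify.

```r
# Hypothetical test strings.
x <- c("the", "then", "A1", "a1", "137", "12", "I am", "I love #rstats")

x[grepl("\\bthe\\b", x, perl = TRUE)]      # "the"  (not "then")
x[grepl("^[A-Z][1-9]$", x, perl = TRUE)]   # "A1"
x[grepl("\\d{3}", x, perl = TRUE)]         # "137"
x[grepl("\\w+\\s+\\w+", x, perl = TRUE)]   # "I am" "I love #rstats"
x[grepl("^[a-z]", x, perl = TRUE)]         # "the" "then" "a1"

# regexpr() locates the first match; regmatches() extracts the matched text.
regmatches(x, regexpr("#.*$", x, perl = TRUE))  # "#rstats"
```

`regmatches()` returns only the matched substrings, so it is the natural choice when the goal is to pull hashtags out of the text, not just flag the lines that contain them.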
  • 11. Manual segmentation
      The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.

          #sentence level
          pattern = "(?