140
Basic Text Processing Regular Expressions Word Tokeniza5on Word Normaliza5on Sentence Segmenta5on Many slides adapted from slides by Dan Jurafsky Thursday, September 1, 16

Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic Text Processing

Regular  ExpressionsWord  Tokeniza5onWord  Normaliza5on

Sentence  Segmenta5onMany  slides  adapted  from  slides  by  Dan  Jurafsky

Thursday, September 1, 16

Page 2: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic Text Processing

Regular  Expressions

Thursday, September 1, 16

Page 3: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  expressions• A  formal  language  for  specifying  text  strings• How  can  we  search  for  any  of  these?

• woodchuck• woodchucks• Woodchuck• Woodchucks

Thursday, September 1, 16

Page 4: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Disjunc5ons

Thursday, September 1, 16

Page 5: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Disjunc5ons

• LeIers  inside  square  brackets  []PaIern Matches

[wW]oodchuck Woodchuck,  woodchuck

[1234567890]! Any  digit

Thursday, September 1, 16

Page 6: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Disjunc5ons

• LeIers  inside  square  brackets  []PaIern Matches

[wW]oodchuck Woodchuck,  woodchuck

[1234567890]! Any  digit

Thursday, September 1, 16

Page 7: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Disjunc5ons

• LeIers  inside  square  brackets  []PaIern Matches

[wW]oodchuck Woodchuck,  woodchuck

[1234567890]! Any  digit

Thursday, September 1, 16

Page 8: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Disjunc5ons

• LeIers  inside  square  brackets  []PaIern Matches

[wW]oodchuck Woodchuck,  woodchuck

[1234567890]! Any  digit

Thursday, September 1, 16

Page 9: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Disjunc5ons

• LeIers  inside  square  brackets  []

• Ranges  [A-Z]

PaIern Matches

[wW]oodchuck Woodchuck,  woodchuck

[1234567890]! Any  digit

PaIern Matches the  First  Match  in  an  example

[A-Z] An  upper  case  leIer Drenched Blossoms

[a-z] A  lower  case  leIer my beans were impatient

[0-9] A  single  digit Chapter 1: Down the Rabbit Hole

Thursday, September 1, 16

Page 10: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Nega5on  in  Disjunc5on

• Nega5ons [^Ss]• Carat  means  nega5on  only  when  first  in  []

PaIern Matches

[^A-Z] Not  an  upper  case  leIer Oyfn pripetchik

[^Ss]! Neither  ‘S’  nor  ‘s’ I have no exquisite reason”

[^e^] Neither  e  nor  ^ Look here

a^b The  paIern  a  carat  b Look up a^b now

Thursday, September 1, 16

Page 11: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  More  Disjunc5on

• Woodchucks  is  another  name  for  groundhog!• The  pipe  |  for  disjunc5on

PaIern Matches

groundhog|woodchuck

yours|mine yours mine

a|b|c|ab abc

[gG]roundhog|[Ww]oodchuck Photo  D.  Fletcher

Thursday, September 1, 16

Page 12: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  ?        * + .

Stephen  C  Kleene

PaIern Matches

colou?r 0  or  1  of  previous  char

color colour

oo*h! 0  or  more  of  previous  char

oh! ooh! oooh! ooooh!

o+h! 1  or  more  of  previous  char

oh! ooh! oooh! ooooh!

baa+ baa baaa baaaa baaaaa

beg.n any  char begin begun begun beg3nKleene  *,      Kleene  +      

Thursday, September 1, 16

Page 13: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Regular  Expressions:  Anchors    ^      $

PaIern Matches

^[A-Z] Palo Alto

^[^A-Za-z] 1 “Hello”

\.$ The end.

.$ The end? The end!

Thursday, September 1, 16

Page 14: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Example

Thursday, September 1, 16

Page 15: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Example

• Find  me  all  instances  of  the  word  “the”  in  a  text.the                                                                                                Misses  capitalized  examples

[tT]he                                                                                Incorrectly  returns  other  or  theology

[^a-zA-Z][tT]he[^a-zA-Z]                                                                                    

Thursday, September 1, 16

Page 16: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Errors

• The  process  we  just  went  through  was  based  on  fixing  two  kinds  of  errors• Matching  strings  that  we  should  not  have  matched  (there,  then,  other)• False  posi5ves  (Type  I)

• Not  matching  things  that  we  should  have  matched  (The)• False  nega5ves  (Type  II)

Thursday, September 1, 16

Page 17: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Errors  cont.

• In  NLP  we  are  always  dealing  with  these  kinds  of  errors.• Reducing  the  error  rate  for  an  applica5on  oden  involves  two  antagonis5c  efforts:  • Increasing  accuracy  or  precision  (minimizing  false  posi5ves)• Increasing  coverage  or  recall  (minimizing  false  nega5ves).

Thursday, September 1, 16

Page 18: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Summary

• Regular  expressions  play  a  surprisingly  large  role• Sophis5cated  sequences  of  regular  expressions  are  oden  the  first  model  for  any  text  processing  task

• For  many  hard  tasks,  we  use  machine  learning  classifiers• But  regular  expressions  are  used  as  features  in  the  classifiers• Can  be  very  useful  in  capturing  generaliza5ons

12

Thursday, September 1, 16

Page 19: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic Text Processing

Regular  Expressions

Thursday, September 1, 16

Page 20: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic  Text  Processing

Word  tokeniza5on

Thursday, September 1, 16

Page 21: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Text  Normaliza5on

Thursday, September 1, 16

Page 22: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Text  Normaliza5on

• Every  NLP  task  needs  to  do  text  normaliza5on:  1. Segmen5ng/tokenizing  words  in  running  text2. Normalizing  word  formats3. Segmen5ng  sentences  in  running  text

Thursday, September 1, 16

Page 23: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

• I  do  uh  main-­‐  mainly  business  data  processing• Fragments,  filled  pauses

Thursday, September 1, 16

Page 24: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

• I  do  uh  main-­‐  mainly  business  data  processing• Fragments,  filled  pauses

• Seuss’s  cat  in  the  hat  is  different  from  other  cats!  • Lemma:  same  stem,  part  of  speech,  rough  word  sense• cat  and  cats  =  same  lemma

• Wordform:  the  full  inflected  surface  form• cat  and  cats  =  different  wordforms

Thursday, September 1, 16

Page 25: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

they  lay  back  on  the  San  Francisco  grass  and  looked  at  the  stars  and  their

Thursday, September 1, 16

Page 26: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

they  lay  back  on  the  San  Francisco  grass  and  looked  at  the  stars  and  their

• Type:  an  element  of  the  vocabulary.

Thursday, September 1, 16

Page 27: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

they  lay  back  on  the  San  Francisco  grass  and  looked  at  the  stars  and  their

• Type:  an  element  of  the  vocabulary.• Token:  an  instance  of  that  type  in  running  text.

Thursday, September 1, 16

Page 28: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

they  lay  back  on  the  San  Francisco  grass  and  looked  at  the  stars  and  their

• Type:  an  element  of  the  vocabulary.• Token:  an  instance  of  that  type  in  running  text.• How  many?

• 15  tokens  (or  14)• 13  types  (or  12)  (or  11?)

Thursday, September 1, 16

Page 29: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

Thursday, September 1, 16

Page 30: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

N  =  number  of  tokens

Thursday, September 1, 16

Page 31: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

N  =  number  of  tokensV  =  vocabulary  =  set  of  types

|V|  is  the  size  of  the  vocabulary

Thursday, September 1, 16

Page 32: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

N  =  number  of  tokensV  =  vocabulary  =  set  of  types

|V|  is  the  size  of  the  vocabulary

Tokens  =  N Types  =  |V|

Switchboard  phone  conversa5ons

2.4  million 20  thousand

Shakespeare 884,000 31  thousand

Google  N-­‐grams 1  trillion 13  million

Thursday, September 1, 16

Page 33: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

How  many  words?

N  =  number  of  tokensV  =  vocabulary  =  set  of  types

|V|  is  the  size  of  the  vocabulary

Tokens  =  N Types  =  |V|

Switchboard  phone  conversa5ons

2.4  million 20  thousand

Shakespeare 884,000 31  thousand

Google  N-­‐grams 1  trillion 13  million

Church  and  Gale  (1990):  |V|  >  O(N½)

Thursday, September 1, 16

Page 34: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 35: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Change all non-alpha to newlines

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 36: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Change all non-alpha to newlines

Sort in alphabetical order

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 37: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt Change all non-alpha to newlines

Sort in alphabetical order

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 38: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 39: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 40: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 41: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 42: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 43: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 44: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 45: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 46: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 47: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

25 Aaron 6 Abate 1 Abates 5 Abbess 6 Abbey 3 Abbot

....      …

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  laugh.

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 48: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 49: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Change all non-alpha to newlines

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 50: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Change all non-alpha to newlines

Sort in alphabetical order

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 51: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt Change all non-alpha to newlines

Sort in alphabetical order

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 52: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 53: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 54: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 55: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 56: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 57: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 58: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 59: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 60: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 61: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Simple  Tokeniza5on  in  UNIX

• (Inspired  by  Ken  Church’s  UNIX  for  Poets.)• Given  a  text  file,  output  the  word  tokens  and  their  frequenciestr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | uniq –c

1945 A

72 AARON 19 ABBESS

5 ABBOT ... ...

25 Aaron 6 Abate 1 Abates 5 Abbess 6 Abbey 3 Abbot

....      …

Change all non-alpha to newlines

Sort in alphabetical order

Merge and count each type

Will  likes  to  eat.        Will  likes  to  babble.

1  babble1  eat2  likes  2  to2  Will

tr:  translate,  -­‐s:  squeeze,  -­‐c:  complement

Thursday, September 1, 16

Page 62: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

The  first  step:  tokenizing

tr -sc ’A-Za-z’ ’\n’ < shakes.txt | head

(head: will print the first lines (10 by default) of its input. head -n NUM input)

THESONNETS

byWilliamShakespeare

Fromfairestcreatures

Thursday, September 1, 16

Page 63: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

The  second  step:  sor5ng

tr -sc ’A-Za-z’ ’\n’ < shakes.txt | sort | head

A

AAA

AAA

AA

...

Thursday, September 1, 16

Page 64: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng

Thursday, September 1, 16

Page 65: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng• Merging  upper  and  lower  case

Thursday, September 1, 16

Page 66: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng• Merging  upper  and  lower  casetr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c

Thursday, September 1, 16

Page 67: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng• Merging  upper  and  lower  casetr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c

• Sor5ng  the  counts  (-­‐n:  numerical  value,  -­‐k:  column,  -­‐r:  reverse)

Thursday, September 1, 16

Page 68: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng• Merging  upper  and  lower  casetr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c

• Sor5ng  the  counts  (-­‐n:  numerical  value,  -­‐k:  column,  -­‐r:  reverse)tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r

Thursday, September 1, 16

Page 69: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng• Merging  upper  and  lower  casetr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c

• Sor5ng  the  counts  (-­‐n:  numerical  value,  -­‐k:  column,  -­‐r:  reverse)tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r

23243 the22225 i

18618 and16339 to15687 of12780 a

12163 you10839 my10005 in8954 d

Thursday, September 1, 16

Page 70: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  coun5ng• Merging  upper  and  lower  casetr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c

• Sor5ng  the  counts  (-­‐n:  numerical  value,  -­‐k:  column,  -­‐r:  reverse)tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r

23243 the22225 i

18618 and16339 to15687 of12780 a

12163 you10839 my10005 in8954 d

What happened here?

Thursday, September 1, 16

Page 71: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Issues  in  Tokeniza5on

• Finland’s capital → Finland Finlands Finland’s  ?• what’re, I’m, isn’t → What are, I am, is not

• Hewlett-Packard → Hewlett Packard ?• state-of-the-art → state of the art ?• Lowercase!! → lower-case lowercase lower case ?• San Francisco! → one  token  or  two?• m.p.h.,  PhD.     → ??

Thursday, September 1, 16

Page 72: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Tokeniza5on:  language  issues

• French• L'ensemble  → one  token  or  two?• L  ?  L’  ?  Le  ?• Want  l’ensemble  to  match  with  un  ensemble

• German  noun  compounds  are  not  segmented• Lebensversicherungsgesellscha5sangestellter• ‘life  insurance  company  employee’• German  informa5on  retrieval  needs  compound  spli5er

Thursday, September 1, 16

Page 73: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Tokeniza5on:  language  issues

Thursday, September 1, 16

Page 74: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Tokeniza5on:  language  issues• Chinese  and  Japanese  no  spaces  between  words:

• 莎拉波娃现在居住在美国东南部的佛罗里达。• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达

• Sharapova  now          lives  in              US              southeastern          Florida

Thursday, September 1, 16

Page 75: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Tokeniza5on:  language  issues• Chinese  and  Japanese  no  spaces  between  words:

• 莎拉波娃现在居住在美国东南部的佛罗里达。• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达

• Sharapova  now          lives  in              US              southeastern          Florida

• Further  complicated  in  Japanese,  with  mul5ple  alphabets  intermingled• Dates/amounts  in  mul5ple  formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji RomajiEnd-­‐user  can  express  query  en5rely  in  hiragana!

Thursday, September 1, 16

Page 76: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Word  Tokeniza5on  in  Chinese

• Also  called  Word  Segmenta9on• Chinese  words  are  composed  of  characters

• Characters  are  generally  1  syllable  and  1  morpheme.• Average  word  is  2.4  characters  long.

• Standard  baseline  segmenta5on  algorithm:  • Maximum  Matching    (also  called  Greedy)

Thursday, September 1, 16

Page 77: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Maximum  MatchingWord  Segmenta5on  Algorithm

• Given  a  wordlist  of  Chinese,  and  a  string.1) Start  a  pointer  at  the  beginning  of  the  string2) Find  the  longest  word  in  dic5onary  that  matches  the  string  

star5ng  at  pointer3) Move  the  pointer  over  the  word  in  string4) Go  to  2

Thursday, September 1, 16

Page 78: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

Thursday, September 1, 16

Page 79: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat

Thursday, September 1, 16

Page 80: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat the cat in the hat

Thursday, September 1, 16

Page 81: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere

the cat in the hat

Thursday, September 1, 16

Page 82: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere the table down there

the cat in the hat

Thursday, September 1, 16

Page 83: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere the table down there

the cat in the hat

theta bled own there

Thursday, September 1, 16

Page 84: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere the table down there

the cat in the hat

theta bled own there

Thursday, September 1, 16

Page 85: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere

• Doesn’t  generally  work  in  English!

the table down there

the cat in the hat

theta bled own there

Thursday, September 1, 16

Page 86: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere

• Doesn’t  generally  work  in  English!

the table down there

the cat in the hat

theta bled own there

Thursday, September 1, 16

Page 87: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere

• Doesn’t  generally  work  in  English!

• But  works  astonishingly  well  in  Chinese• 莎拉波娃现在居住在美国东南部的佛罗里达。• 莎拉波娃    现在      居住      在    美国      东南部          的    佛罗里达

the table down there

the cat in the hat

theta bled own there

Thursday, September 1, 16

Page 88: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Max-­‐match  segmenta5on  illustra5on

• Theca5nthehat• Thetabledownthere

• Doesn’t  generally  work  in  English!

• But  works  astonishingly  well  in  Chinese• 莎拉波娃现在居住在美国东南部的佛罗里达。• 莎拉波娃    现在      居住      在    美国      东南部          的    佛罗里达

• Modern  probabilis5c  segmenta5on  algorithms  even  beIer

the table down there

the cat in the hat

theta bled own there

Thursday, September 1, 16

Page 89: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic  Text  Processing

Word  tokeniza5on

Thursday, September 1, 16

Page 90: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic  Text  Processing

Word  Normaliza5on  and  Stemming

Thursday, September 1, 16

Page 91: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Normaliza5on

• Need  to  “normalize”  terms  • Informa5on  Retrieval:  indexed  text  &  query  terms  must  have  same  form.

• We  want  to  match  U.S.A.  and  USA

Thursday, September 1, 16

Page 92: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Normaliza5on

• Need  to  “normalize”  terms  • Informa5on  Retrieval:  indexed  text  &  query  terms  must  have  same  form.

• We  want  to  match  U.S.A.  and  USA

• We  implicitly  define  equivalence  classes  of  terms• e.g.,  dele5ng  periods  in  a  term

Thursday, September 1, 16

Page 93: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Normaliza5on

• Need  to  “normalize”  terms  • Informa5on  Retrieval:  indexed  text  &  query  terms  must  have  same  form.

• We  want  to  match  U.S.A.  and  USA

• We  implicitly  define  equivalence  classes  of  terms• e.g.,  dele5ng  periods  in  a  term

• Alterna5ve:  asymmetric  expansion:• Enter:  window   Search:  window,  windows• Enter:  windows   Search:  Windows,  windows,  window• Enter:  Windows   Search:  Windows

Thursday, September 1, 16

Page 94: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Normaliza5on

• Need  to  “normalize”  terms  • Informa5on  Retrieval:  indexed  text  &  query  terms  must  have  same  form.

• We  want  to  match  U.S.A.  and  USA

• We  implicitly  define  equivalence  classes  of  terms• e.g.,  dele5ng  periods  in  a  term

• Alterna5ve:  asymmetric  expansion:• Enter:  window   Search:  window,  windows• Enter:  windows   Search:  Windows,  windows,  window• Enter:  Windows   Search:  Windows

• Poten5ally  more  powerful,  but  less  efficientThursday, September 1, 16

Page 95: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Case  folding

• Applica5ons  like  IR:  reduce  all  leIers  to  lower  case• Since  users  tend  to  use  lower  case• Possible  excep5on:  upper  case  in  mid-­‐sentence?• e.g.,  General  Motors• Fed  vs.  fed• SAIL  vs.  sail

Thursday, September 1, 16

Page 96: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Case  folding

• Applica5ons  like  IR:  reduce  all  leIers  to  lower  case• Since  users  tend  to  use  lower  case• Possible  excep5on:  upper  case  in  mid-­‐sentence?• e.g.,  General  Motors• Fed  vs.  fed• SAIL  vs.  sail

• For  sen5ment  analysis,  MT,  Informa5on  extrac5on• Case  is  helpful  (US  versus  us  is  important)

Thursday, September 1, 16

Page 97: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Lemma5za5on

• Reduce  inflec5ons  or  variant  forms  to  base  form• am,  are,  is  →  be• car,  cars,  car's,  cars'  →  car

Context  dependent.  for  instance:in  our  last  mee5ng  (noun,  mee5ng).We’re  mee5ng  (verb,  meet)  tomorrow.

Thursday, September 1, 16

Page 98: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Lemma5za5on

• Reduce  inflec5ons  or  variant  forms  to  base  form• am,  are,  is  →  be• car,  cars,  car's,  cars'  →  car

• the  boy's  cars  are  different  colors  →  the  boy  car  be  different  color

Context  dependent.  for  instance:in  our  last  mee5ng  (noun,  mee5ng).We’re  mee5ng  (verb,  meet)  tomorrow.

Thursday, September 1, 16

Page 99: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Lemma5za5on

• Reduce  inflec5ons  or  variant  forms  to  base  form• am,  are,  is  →  be• car,  cars,  car's,  cars'  →  car

• the  boy's  cars  are  different  colors  →  the  boy  car  be  different  color• Lemma5za5on:  have  to  find  correct  dic5onary  headword  form

Context  dependent.  for  instance:in  our  last  mee5ng  (noun,  mee5ng).We’re  mee5ng  (verb,  meet)  tomorrow.

Thursday, September 1, 16

Page 100: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Lemma5za5on

• Reduce  inflec5ons  or  variant  forms  to  base  form• am,  are,  is  →  be• car,  cars,  car's,  cars'  →  car

• the  boy's  cars  are  different  colors  →  the  boy  car  be  different  color• Lemma5za5on:  have  to  find  correct  dic5onary  headword  form• Machine  transla5on

• Spanish  quiero  (‘I  want’),  quieres  (‘you  want’)  same  lemma  as  querer  ‘want’

Context  dependent.  for  instance:in  our  last  mee5ng  (noun,  mee5ng).We’re  mee5ng  (verb,  meet)  tomorrow.

Thursday, September 1, 16

Page 101: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Morphology

Thursday, September 1, 16

Page 102: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Morphology

• Morphemes:• The  small  meaningful  units  that  make  up  words• Stems:  The  core  meaning-­‐bearing  units• Affixes:  Bits  and  pieces  that  adhere  to  stems• Oden  with  gramma5cal  func5ons

Thursday, September 1, 16

Page 103: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Stemming

• Reduce  terms  to  their  stems  in  informa5on  retrievalcontext  independent

Thursday, September 1, 16

Page 104: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Stemming

• Reduce  terms  to  their  stems  in  informa5on  retrieval• Stemming  is  crude  chopping  of  affixes

• language  dependent• e.g.,  automate(s),  automaGc,  automaGon  all  reduced  to  automat.

context  independent

Thursday, September 1, 16

Page 105: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Stemming

• Reduce  terms  to  their  stems  in  informa5on  retrieval• Stemming  is  crude  chopping  of  affixes

• language  dependent• e.g.,  automate(s),  automaGc,  automaGon  all  reduced  to  automat.

for  example  compressed  and  compression  are  both  accepted  as  equivalent  to  compress.

context  independent

Thursday, September 1, 16

Page 106: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Stemming

• Reduce  terms  to  their  stems  in  informa5on  retrieval• Stemming  is  crude  chopping  of  affixes

• language  dependent• e.g.,  automate(s),  automaGc,  automaGon  all  reduced  to  automat.

for  example  compressed  and  compression  are  both  accepted  as  equivalent  to  compress.

for  exampl  compress  andcompress  ar  both  acceptas  equival  to  compress

context  independent

Thursday, September 1, 16

Page 107: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Porter’s  algorithmThe  most  common  English  stemmer

hIps://tartarus.org/mar5n/PorterStemmer/

Text

fixed  rules  put  in  groups,  applied  in  order.

Thursday, September 1, 16

Page 108: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Porter’s  algorithmThe  most  common  English  stemmer

     Step  1asses → ss! caresses → caress

ies → i! ponies → poniss → ss! caress → caress

s → ø                  cats → cat

hIps://tartarus.org/mar5n/PorterStemmer/

Text

fixed  rules  put  in  groups,  applied  in  order.

Thursday, September 1, 16

Page 109: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Porter’s  algorithmThe  most  common  English  stemmer

     Step  1asses → ss! caresses → caress

ies → i! ponies → poniss → ss! caress → caress

s → ø                  cats → cat

   Step  1b(*v*)ing → ø        walking → walk

sing → sing

(*v*)ed → ø        plastered → plaster

hIps://tartarus.org/mar5n/PorterStemmer/

Text

fixed  rules  put  in  groups,  applied  in  order.

Thursday, September 1, 16

Page 110: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Porter’s  algorithmThe  most  common  English  stemmer

     Step  1asses → ss! caresses → caress

ies → i! ponies → poniss → ss! caress → caress

s → ø                  cats → cat

   Step  1b(*v*)ing → ø        walking → walk

sing → sing

(*v*)ed → ø        plastered → plaster

     Step  2  (for  long  stems)ational→ ate relational→ relate

izer→ ize! digitizer → digitizeator→ ate! operator → operate

hIps://tartarus.org/mar5n/PorterStemmer/

Text

fixed  rules  put  in  groups,  applied  in  order.

Thursday, September 1, 16

Page 111: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Porter’s  algorithmThe  most  common  English  stemmer

     Step  1asses → ss! caresses → caress

ies → i! ponies → poniss → ss! caress → caress

s → ø                  cats → cat

   Step  1b(*v*)ing → ø        walking → walk

sing → sing

(*v*)ed → ø        plastered → plaster

     Step  2  (for  long  stems)ational→ ate relational→ relate

izer→ ize! digitizer → digitizeator→ ate! operator → operate

       Step  3  (for  longer  stems)al → ø            revival → revivable → ø            adjustable → adjust

ate → ø activate → activ

hIps://tartarus.org/mar5n/PorterStemmer/

Text

fixed  rules  put  in  groups,  applied  in  order.

Thursday, September 1, 16

Page 112: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?

38

Thursday, September 1, 16

Page 113: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?

(*v*)ing → ø        walking → walk

sing → sing

38

Thursday, September 1, 16

Page 114: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?

39

Thursday, September 1, 16

Page 115: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?(*v*)ing → ø        walking → walk

sing → sing

39

Thursday, September 1, 16

Page 116: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?(*v*)ing → ø        walking → walk

sing → sing

39

tr -sc 'A-Za-z' '\n' < shakes.txt | grep ’ing$' | sort | uniq -c | sort –nr

Thursday, September 1, 16

Page 117: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?(*v*)ing → ø        walking → walk

sing → sing

39

tr -sc 'A-Za-z' '\n' < shakes.txt | grep ’ing$' | sort | uniq -c | sort –nr

1312 King 548 being

541 nothing 388 king 375 bring 358 thing 307 ring

152 something 145 coming 130 morning

Thursday, September 1, 16

Page 118: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?(*v*)ing → ø        walking → walk

sing → sing

39

tr -sc 'A-Za-z' '\n' < shakes.txt | grep ’ing$' | sort | uniq -c | sort –nr

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort –nr

1312 King 548 being

541 nothing 388 king 375 bring 358 thing 307 ring

152 something 145 coming 130 morning

Thursday, September 1, 16

Page 119: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Viewing  morphology  in  a  corpusWhy  only  strip  –ing  if  there  is  a  vowel?(*v*)ing → ø        walking → walk

sing → sing

39

tr -sc 'A-Za-z' '\n' < shakes.txt | grep ’ing$' | sort | uniq -c | sort –nr

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort –nr

548 being541 nothing

152 something145 coming130 morning122 having120 living117 loving116 Being102 going

1312 King 548 being

541 nothing 388 king 375 bring 358 thing 307 ring

152 something 145 coming 130 morning

Thursday, September 1, 16

Page 120: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Dealing  with  complex  morphology  is  some5mes  necessary

• Some  languages  requires  complex  morpheme  segmenta5on• Turkish• Uygarlas5ramadiklarimizdanmissinizcasina• `(behaving)  as  if  you  are  among  those  whom  we  could  not  civilize’• Uygar  `civilized’  +  las  `become’  

+  5r  `cause’  +  ama  `not  able’  +  dik  `past’  +  lar  ‘plural’+  imiz  ‘p1pl’  +  dan  ‘abl’  +  mis  ‘past’  +  siniz  ‘2pl’  +  casina  ‘as  if’  

Thursday, September 1, 16

Page 121: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic  Text  Processing

Word  Normaliza5on  and  Stemming

Thursday, September 1, 16

Page 122: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic  Text  Processing

Sentence  Segmenta5on  and  Decision  Trees

Thursday, September 1, 16

Page 123: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Sentence  Segmenta5on

Thursday, September 1, 16

Page 124: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Sentence  Segmenta5on

• !,  ?  are  rela5vely  unambiguous

Thursday, September 1, 16

Page 125: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Sentence  Segmenta5on

• !,  ?  are  rela5vely  unambiguous• Period  “.”  is  quite  ambiguous

• Sentence  boundary• Abbrevia5ons  like  Inc.  or  Dr.• Numbers  like  .02%  or  4.3

Thursday, September 1, 16

Page 126: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Sentence  Segmenta5on

• !,  ?  are  rela5vely  unambiguous• Period  “.”  is  quite  ambiguous

• Sentence  boundary• Abbrevia5ons  like  Inc.  or  Dr.• Numbers  like  .02%  or  4.3

• Build  a  binary  classifier• Looks  at  a  “.”• Decides  EndOfSentence/NotEndOfSentence• Classifiers:  hand-­‐wriIen  rules,  regular  expressions,  or  machine-­‐learning

Thursday, September 1, 16

Page 127: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Determining  if  a  word  is  end-­‐of-­‐sentence:  a  Decision  Tree

Thursday, September 1, 16

Page 128: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  sophis5cated  decision  tree  features

Thursday, September 1, 16

Page 129: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  sophis5cated  decision  tree  features

• Case  of  word  with  “.”:  Upper,  Lower,  Cap,  Number

Thursday, September 1, 16

Page 130: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  sophis5cated  decision  tree  features

• Case  of  word  with  “.”:  Upper,  Lower,  Cap,  Number• Case  of  word  ader  “.”:  Upper,  Lower,  Cap,  Number

Thursday, September 1, 16

Page 131: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  sophis5cated  decision  tree  features

• Case  of  word  with  “.”:  Upper,  Lower,  Cap,  Number• Case  of  word  ader  “.”:  Upper,  Lower,  Cap,  Number

Thursday, September 1, 16

Page 132: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

More  sophis5cated  decision  tree  features

• Case  of  word  with  “.”:  Upper,  Lower,  Cap,  Number• Case  of  word  ader  “.”:  Upper,  Lower,  Cap,  Number

• Numeric  features• Length  of  word  with  “.”• Probability(word  with  “.”  occurs  at  end-­‐of-­‐s)• Probability(word  ader  “.”  occurs  at  beginning-­‐of-­‐s)

Thursday, September 1, 16

Page 133: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Implemen5ng  Decision  Trees

• A  decision  tree  is  just  an  if-­‐then-­‐else  statement

Thursday, September 1, 16

Page 134: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Implemen5ng  Decision  Trees

• A  decision  tree  is  just  an  if-­‐then-­‐else  statement• The  interes5ng  research  is  choosing  the  features

Thursday, September 1, 16

Page 135: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Implemen5ng  Decision  Trees

• A  decision  tree  is  just  an  if-­‐then-­‐else  statement• The  interes5ng  research  is  choosing  the  features• Se�ng  up  the  structure  is  oden  too  hard  to  do  by  hand

• Hand-­‐building  only  possible  for  very  simple  features,  domains• For  numeric  features,  it’s  too  hard  to  pick  each  threshold

• Instead,  structure  usually  learned  by  machine  learning  from  a  training  corpus

Thursday, September 1, 16

Page 136: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Decision  Trees  and  other  classifiers

Thursday, September 1, 16

Page 137: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Decision  Trees  and  other  classifiers

• We  can  think  of  the  ques5ons  in  a  decision  tree

Thursday, September 1, 16

Page 138: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Decision  Trees  and  other  classifiers

• We  can  think  of  the  ques5ons  in  a  decision  tree• As  features  that  could  be  exploited  by  any  kind  of  classifier• Logis5c  regression• SVM• Neural  Nets• etc.

Thursday, September 1, 16

Page 139: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Sentence  Spli5ers

• Stanford  coreNLP:  (determinis5c)• hIp://stanfordnlp.github.io/CoreNLP/

• UIUC  sentence  spliIer:  (determinis5c)• hIps://cogcomp.cs.illinois.edu/page/tools_view/2

48

Thursday, September 1, 16

Page 140: Basic Text Processing - faculty.cse.tamu.edufaculty.cse.tamu.edu › huangrh › Fall16 › l2_preProcessing.pdf| sort | uniq –c 1945 A 72 AARON 19 ABBESS Change all non-alpha to

Basic  Text  Processing

Sentence  Segmenta5on  and  Decision  Trees

Thursday, September 1, 16