
10/26/13

15383: txt proc

Intro to Text Processing
Lecture 2

Behrang Mohit

Today: Text

• Different types of text
• Corpus
• Levels of text processing
  – Corpus
  – Document
  – Sentence
  – Tokens
  – Characters and Encoding

So Much Text!

• Goal: Organize and process

Different forms of text

• Structured text
  – Databases
  – Knowledge-base
• Semi-structured
  – Wikipedia info-boxes
  – Excel sheets, tables in documents
  – XML documents
• Unstructured
  – Plain text: News, Webpages, Emails, SMS, tweets, …

From text to knowledge


Organizing the knowledge

• Text: non-structured information

  Charlie was born in London in 1889 and ….
  Born in 1889, Chaplin spent his youth in ….
  Chaplin (1889-1977) was a true genius of cinema ….
  He was born 6 years before the birth of cinema!

• Knowledge-base: structured information

  Person            Birth date
  Charlie Chaplin   1889
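The step from the example sentences to the Person / Birth date row can be pictured as a pattern match. The sketch below is only illustrative: the two regular expressions and the `extract_birth_year` helper are hypothetical and tuned to these sentences; real information extraction systems are far more involved.

```python
import re

# Example sentences from the slide (elisions kept as they appear).
SENTENCES = [
    "Charlie was born in London in 1889 and ….",
    "Born in 1889, Chaplin spent his youth in ….",
    "Chaplin (1889-1977) was a true genius of cinema ….",
]

# Hypothetical patterns: "born ... in YYYY" or a "(YYYY-YYYY)" lifespan.
BORN_IN = re.compile(r"\b[Bb]orn\b[^.]*?\bin (\d{4})\b")
LIFESPAN = re.compile(r"\((\d{4})-\d{4}\)")


def extract_birth_year(sentence):
    """Return the birth year found in the sentence, or None."""
    for pattern in (BORN_IN, LIFESPAN):
        match = pattern.search(sentence)
        if match:
            return int(match.group(1))
    return None


years = [extract_birth_year(s) for s in SENTENCES]
# One structured row built from the unstructured text.
row = {"Person": "Charlie Chaplin", "Birth date": years[0]}
print(row)
```

All three sentences yield the same year even though each phrases it differently, which is the point of moving from text to a knowledge-base entry.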


Need for textual data

• All text processing systems need data:
  – Translation
  – Question Answering
  – Summarization
  – …

Corpus/Corpora

• No access to all texts
  – We use a representative subset
• Corpus: A collection of documents
  – Used for linguistic analysis
  – Used for building a text processing system
    • Human-labeled data for statistical systems
  – Used for benchmarking/testing a system
    • Performance of a search engine

Characteristics of a corpus

• Written, spoken
• General (balanced), domain specific (e.g. news)
• Monolingual, Bilingual, Comparable
• Size of corpus
  – Web as a corpus > 2,000 billion words
• Copyright issues and license

Annotated Corpus

• Linguistic annotation: adding some linguistic knowledge to the plain text
• Need for annotated data
  – System that separates sentences
  – System that finds names
  – System that answers questions
  – System that translates
  – Search engine
  – …

Examples of corpora

• British National Corpus
  – 90% written, 10% spoken
  – 100M words, British English
  – Plain text with some linguistic annotations
• Brown Corpus
  – English, balanced, written American, 1M words
  – Annotated with syntax and semantics information

Brown corpus: Text + syntax

Named Entity Corpus

Parallel Corpus

Linguistic Annotation

• Creating gold-standard data
  – Used for building and training a system
  – Used for evaluating a system
• Requirements:
  – Robust annotation software
  – Clear guidelines
  – Quality control

Desktop and web-based annotation

• Text and GUI-based desktop annotation
• New web-based annotations
  – Work any time, anywhere
  – Direct storage
  – Browser and bandwidth limitations

Web-based annotation

CMUQ’s QALB Project

• QALB: Qatar Arabic Language Bank
  – Automatic correction of Arabic text
    • Spelling, grammar, word choice
    • Text: Natives (news comments), Learners (essays), MT
    • Target: A corpus of 2 million words
    • 10 annotators
  – Demo: The annotation interface

Quality Control

• Corpus needs to be an accurate language sample
• Quality of annotated corpus:
  – Human annotators' agreement → high quality corpus
• Framework:
  – Randomly assign parts of the task to multiple annotators
  – Measure their agreement
    • Agreement measures
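The slides do not name a specific agreement measure. As one common choice (an assumption here, not necessarily the measure used in the course), Cohen's kappa compares the observed agreement of two annotators with the agreement expected by chance. A minimal sketch, with made-up labels:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four tokens by two annotators.
annotator_1 = ["NAME", "NAME", "OTHER", "OTHER"]
annotator_2 = ["NAME", "OTHER", "OTHER", "OTHER"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5 for this toy data
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; the annotators above agree on 3 of 4 items, but half of that agreement is expected by chance.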

New Models: Crowd Sourcing

• Large scale, simple and cheap annotations
  – Quick and mass annotation
• Amazon Mechanical Turk
  – Cost: 1-10 cents/task
  – Frameworks to assess the annotators
    • Before and during the work
  – Limits: Quality and the range of expertise

Document structure

• Plain text
• Markups
  – SGML: Standard Generalized Markup Language
  – HTML: HyperText Markup Language
  – XML: Extensible Markup Language
  – Wikipedia
• Prints
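To make the difference between marked-up and plain text concrete, here is a small sketch that pulls the plain text out of an XML document with Python's standard `xml.etree.ElementTree` module. The document and its tag names (`doc`, `title`, `body`, `p`) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tag names are made up for this example.
XML_DOC = """<doc id="1">
  <title>Sample document</title>
  <body>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</doc>"""

root = ET.fromstring(XML_DOC)

# The markup tells us which part is the title and which parts are paragraphs.
title = root.findtext("title")
paragraphs = [p.text for p in root.iter("p")]

# Dropping the markup leaves plain text.
plain_text = "\n".join([title] + paragraphs)
print(plain_text)
```

The structure (title versus paragraph, the `id` attribute) is available to a program only while the markup is present; the plain-text version keeps the words but loses that information.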

Plain Text

SGML

XML

Today: Text

✓ Different types of text
✓ Corpus
• Levels of text processing
  ✓ Corpus
  ✓ Document
  – Sentence
  – Tokens
  – Characters and Encoding

Sentences

• Group of words which …
  – Usually ends with a full stop, exclamation mark, or question mark
  – Many sentences have a verb
  – Some languages have a preferred word order:
    • English: Subject-Verb-Object (SVO)
    • Arabic: Verb-Subject-Object (VSO), SVO

Sentence splitting

• Sentence level processing → Sentence boundary detection
• Baseline: split at ./!/?
• But:
  – The class was taught by Dr. Ali Eqbal and Prof. J. Patrick Jr.
  – P.O.Box, I.R. Pakistan, U.S.A.
• Use rules with a list of acronyms
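The baseline and the rule-based fix can be compared side by side. This is a minimal sketch under stated assumptions: the abbreviation list is a tiny hypothetical one, and the "single capital letter plus period" rule for initials is an extra heuristic not taken from the slides.

```python
import re

# Hypothetical abbreviation list; a real system would use a much longer one.
ABBREVIATIONS = {"dr.", "prof.", "jr.", "p.o.", "i.r.", "u.s.a."}


def baseline_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def rule_based_split(text):
    """Like the baseline, but never split after a listed abbreviation or an initial."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        is_candidate = token[-1] in ".!?"
        is_abbreviation = token.lower() in ABBREVIATIONS
        is_initial = len(token) == 2 and token[0].isupper() and token[1] == "."
        if is_candidate and not (is_abbreviation or is_initial):
            sentences.append(" ".join(current))
            current = []
    if current:  # text that does not end in sentence punctuation
        sentences.append(" ".join(current))
    return sentences


TEXT = "The class was taught by Dr. Ali Eqbal and Prof. J. Patrick. Was it good? Yes!"
print(baseline_split(TEXT))    # 6 pieces: wrongly splits after "Dr.", "Prof." and "J."
print(rule_based_split(TEXT))  # 3 sentences
```

The rules have their own blind spot: when an abbreviation such as "Jr." really does end a sentence, `rule_based_split` will not split there. Cases like that are one reason to collect richer evidence about each candidate boundary.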

Statistical Sentence Splitting

• Collect different information (features) from the surrounding words:
  – Lexical category of the preceding/following words
  – Case of the preceding/following words
  – Is the preceding word an abbreviation?
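As a rough illustration of what such features might look like in code, the sketch below builds a feature dictionary for each candidate boundary. The function name, feature names, and abbreviation list are all hypothetical; the lexical category feature is left out because it would need a part-of-speech tagger, and no classifier is trained here.

```python
# Hypothetical abbreviation list, stored without the trailing period.
ABBREVIATIONS = {"dr", "prof", "jr", "mr", "mrs"}


def boundary_features(tokens, i):
    """Features for the candidate boundary after tokens[i], which ends in . ! or ?"""
    prev_word = tokens[i].rstrip(".!?")
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_is_abbreviation": prev_word.lower() in ABBREVIATIONS,
        "prev_is_capitalized": prev_word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
        "prev_length": len(prev_word),
        "is_end_of_text": next_word == "",
    }


tokens = "It was taught by Dr. Ali Eqbal. He left.".split()
# Every token ending in sentence punctuation is a candidate boundary.
candidates = [i for i, tok in enumerate(tokens) if tok[-1] in ".!?"]
for i in candidates:
    print(tokens[i], boundary_features(tokens, i))
```

A statistical splitter would feed feature dictionaries like these, together with gold-standard boundary labels from an annotated corpus, to a classifier that decides for each candidate whether it ends a sentence.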

Sentence alignment

• Parallel corpus is ideal
  – Reality: Comparable corpus
    • Translators reorganize the text, creating new sentences.
  – Extremely non-aligned:
    • News articles in different languages
    • Wikipedia pages in different languages

Today: Text

✓ Different types of text
✓ Corpus
• Levels of text processing
  ✓ Corpus
  ✓ Document
  ✓ Sentence
  – Tokens
  – Characters and Encoding

Words vs. tokens

• Difference mainly in languages where white space does not necessarily mean …
• Graphical words: token
  – What we see on the screen or in the book

  Her email is [email protected]
  This is a half-cooked idea.

• Underlying word
  – Deeper linguistic
    • Half, cooked

Types

• How many tokens in a corpus?
  – The “wc” command
• How many token types in a corpus?
  – Number of unique tokens
  – Capitalization is not relevant (The = the)

Counting words and types

• The chair of the session starts to chair the session
  – Tokens?
  – Types?
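The two counts for this sentence can be checked with a few lines of Python. This sketch assumes whitespace-separated tokens (the same notion of "word" that `wc -w` uses) and shows the type count both with and without case folding.

```python
sentence = "The chair of the session starts to chair the session"

# Tokens: every running word, repeats included.
tokens = sentence.split()

# Types: unique tokens, first keeping case, then treating "The" and "the" as one.
types_case_sensitive = set(tokens)
types_case_folded = {tok.lower() for tok in tokens}

print(len(tokens))                # 10
print(len(types_case_sensitive))  # 7
print(len(types_case_folded))     # 6
```

Note that both uses of "chair" (the noun and the verb) count as a single type here, because types are defined on surface forms only.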

Tokenization/Segmentation

“In Baghdad, Kirkuk (Kurdistan) and, Basra you hear different views.”

“ In Baghdad , Kirkuk ( Kurdistan ) and , Basra you hear different views . ”

• An important step for accurate text processing
  – Search Engine: looking for “Baghdad”
    • Not “Baghdad,”
  – Translation
  – …
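The example above can be reproduced with a very small regular-expression tokenizer. This is a sketch of one simple approach (runs of word characters, or single punctuation marks), not the tokenizer used in the course; the pattern and the `tokenize` name are assumptions.

```python
import re

# A token is either a run of word characters or one non-space, non-word symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Separate punctuation from words."""
    return TOKEN_PATTERN.findall(text)


text = "“In Baghdad, Kirkuk (Kurdistan) and, Basra you hear different views.”"
tokens = tokenize(text)
print(" ".join(tokens))

# A search for the token "Baghdad" now succeeds; the raw string "Baghdad," is gone.
print("Baghdad" in tokens, "Baghdad," in tokens)
```

The same pattern is too aggressive for other inputs: it breaks "half-cooked" into three tokens and would cut an email address or "U.S.A." into pieces, so practical tokenizers add rules for such cases.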

Tokenization

• Finding the word boundaries
  – Simple for languages written in Roman script
    • White space and some rules
• Statistical/probabilistic approaches
• Not always easy for Asian languages

Word vs. Lemma

• Lemma is the canonical (dictionary) form of a word
  – Goes, went, gone → go
  – Relies, relied, relying → rely
  – Boxes → box
• Lemmatization helps improve some text processing applications
  – Search
  – Machine Translation
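To show why lemmas help search, here is a toy sketch in which a query matches a document whenever they share a lemma. The lookup table is hypothetical and covers only the examples on this slide; real lemmatizers rely on large lexicons and morphological rules.

```python
# Hypothetical toy lexicon covering only the slide's examples.
LEMMA_TABLE = {
    "goes": "go", "went": "go", "gone": "go",
    "relies": "rely", "relied": "rely", "relying": "rely",
    "boxes": "box",
}


def lemmatize(word):
    """Return the lemma if the word is in the table, else the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


def matches(query, document):
    """True if some token of the document has the same lemma as the query."""
    query_lemma = lemmatize(query)
    return any(lemmatize(tok) == query_lemma for tok in document.split())


documents = ["She goes home", "He relied on boxes", "They have gone"]
hits = [doc for doc in documents if matches("went", doc)]
print(hits)  # ['She goes home', 'They have gone']
```

Without lemmatization, the query "went" would match none of these documents, since the exact string does not occur in any of them.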

Counting words and types

• The chair of the session starts to chair the session
  – Tokens?
  – Types?

Stop words

• Tokens with little stand-alone information
  – The, a, an, there, here, is, am, …
  – Punctuation
• Roughly the function words
  – As opposed to content words
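Filtering stop words is a simple lookup. The sketch below uses only the words listed on the slide plus punctuation; the slide's list is open-ended ("…"), so the set here is an incomplete stand-in, and the function names are made up.

```python
import string

# Only the stop words listed on the slide; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "there", "here", "is", "am"}


def is_punctuation(token):
    """True if the token consists solely of ASCII punctuation characters."""
    return token != "" and all(ch in string.punctuation for ch in token)


def remove_stop_words(tokens):
    """Keep only tokens that are neither stop words nor punctuation."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and not is_punctuation(t)]


tokens = ["The", "chair", "of", "the", "session", "is", "here", "."]
content = remove_stop_words(tokens)
print(content)  # ['chair', 'of', 'session']
```

"of" survives only because it is not in this short list; a fuller stop word list would remove it as well.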

Character Encoding

• ASCII: American Standard Code for Information Interchange
  – 2^7 = 128 characters (stored in 1 byte; the eighth bit is left over, e.g. as a parity bit)
  – 95 printable characters (digits, punctuation, letters, space); the other 33 are control characters
    • Mainly English
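Python exposes character codes directly, which makes the ASCII range easy to explore. A short sketch (the `fits_in_ascii` helper is a made-up name):

```python
text = "Text"

# Each ASCII character maps to a number below 2**7 = 128.
codes = [ord(ch) for ch in text]   # [84, 101, 120, 116]
encoded = text.encode("ascii")     # one byte per character

# Of the 128 ASCII codes, 95 are printable; the rest are control characters.
printable = [chr(c) for c in range(128) if chr(c).isprintable()]


def fits_in_ascii(s):
    """True if every character of s can be encoded as ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(codes, len(encoded), len(printable))
print(fits_in_ascii("Text"), fits_in_ascii("نص"))  # True False
```

The second call fails because Arabic letters have no ASCII codes, which is the limitation that motivates a larger character set.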

ASCII characters

Unicode

• A character encoding standard developed by the Unicode Consortium
• Provides a single representation for all the world's writing systems
• "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." – http://www.unicode.org

Unicode encodings

• Version 6.2 (2012) has codes for 110,182 characters
  – A full 32-bit (4-byte) code can represent 2^32 = 4,294,967,296 characters!
    • In reality, only 21 bits are needed
• Unicode has three encoding forms
  – UTF-32 (32-bit/4-byte code units): direct representation
  – UTF-16 (16-bit/2-byte code units): 2^16 = 65,536 possibilities per unit
  – UTF-8 (8-bit/1-byte code units): 2^8 = 256 possibilities per unit
• Why UTF-16 and UTF-8?
  – They are more compact (for certain languages, e.g., English)
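The compactness claim can be checked by encoding a few characters and counting bytes. In this sketch the sample characters are arbitrary choices, and the little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that Python does not prepend a byte-order mark to the output.

```python
# Arbitrary sample characters from different parts of Unicode.
SAMPLES = ["A", "ب", "€", "😀"]


def encoded_sizes(ch):
    """Number of bytes needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The English letter takes 1 byte in UTF-8 but 4 in UTF-32, while the emoji takes 4 bytes in all three forms; for mostly-English text UTF-8 is therefore about four times smaller than UTF-32.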

UTF-8

• Unicode has three encoding forms
  – UTF-32 (32-bit/4-byte code units): direct representation
  – UTF-16 (16-bit/2-byte code units): 2^16 = 65,536 possibilities per unit
  – UTF-8 (8-bit/1-byte code units): 2^8 = 256 possibilities per unit
• But how do you represent all of 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
  – You don't.
  – In reality, only 21 bits are ever utilized for 110K characters.
  – UTF-8 and UTF-16 use a variable-width encoding.
    • UTF-8: 1 to 4 bytes per character; UTF-16: 2 or 4 bytes per character
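How a variable-width encoding fits up to 21 bits into 1 to 4 bytes can be seen by writing the UTF-8 bit layout out by hand. The function below is a teaching sketch (the name `utf8_encode` is made up); in practice you would simply call `str.encode("utf-8")`, which the sketch is compared against.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes, following the bit layout."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ["A", "ب", "€", "😀"]:
    manual = utf8_encode(ord(ch))
    print(ch, manual.hex(), manual == ch.encode("utf-8"))
```

The leading bits of the first byte tell a decoder how many bytes the character occupies, which is what lets short and long characters share one byte stream.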

Today: Text

✓ Different types of text
✓ Corpus
• Levels of text processing
  ✓ Corpus
  ✓ Document
  ✓ Sentence
  ✓ Tokens
  ✓ Characters and Encoding

Python Lab

• Text processing warm up with standard Python and NLTK
• Lab-1 is available on course webpage:
  – http://www.qatar.cmu.edu/~behrang/15383
  – Task-1: Reading a text corpus with Python and basic processing to find frequent word types.
  – Task-2: Repeating task-1, using some of NLTK utilities.
  – Start now and submit before 6 pm on Thursday
