53
Get the data! Data sets are typically compressed in large batches of files. The files are: Encrypted with gpg Compressed .gz/.bz/.xz files Hadoop sequence .sc files Thrift/JSON/XML/CSV

Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 2: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 3: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 4: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 5: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 6: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 7: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 8: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Get the data!• Data sets are typically compressed in

large batches of files.

• The files are:

• Encrypted with gpg

• Compressed .gz/.bz/.xz files

• Hadoop sequence .sc files

• Thrift/JSON/XML/CSV

Page 9: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Sentence Segmentation

• Split a textual document into sentences.

• Was that an abbreviation?

• Was that inside a quote?

She stopped. She said, "Hello there," and then went on.^ ^ ^He's vanished! What will we do? It's up to us.^ ^ ^ ^Please add 1.5 liters to the tank.^

sentence = tokenize.sent_tokenize(text)

Page 10: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Word Tokenization

• Split a sentence into tokens

• text.split(“ “) is not always enough

• What about apostrophe, abbreviations, misspellings, URIs, different languages?

In Düsseldorf I took my hat off. But I can't put it back on.

Page 11: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Part-of-Speech Tagging (POS)

• Classifying word tokens into parts of speech

Page 12: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Part-of-Speech Tagging (POS)

• Classifying word tokens into parts of speech

Page 13: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Part-of-Speech Tagging (POS)

• Classifying word tokens into parts of speech

Page 14: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Named Entity Recognition (NER)

• Identify the tokens in a sentence that correspond to a Entity.

Type Tag Sample Categories

People PER Individuals, fictional characters, small groups

Organization ORG Companies, agencies, political parties, religious groups, sports teams

Location LOC Physical extents, mountains, lakes seasGeo-Political

EntityGPE Countries states, provinces, counties

Facility FAC Bridges, buildings, airports

Vehicles VEH Planes, trains, and automobiles

Page 15: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Named Entity Recognition (NER)

• Identify the tokens in a sentence that correspond to a Entity.

Type Tag Sample Categories

People PER Individuals, fictional characters, small groups

Organization ORG Companies, agencies, political parties, religious groups, sports teams

Location LOC Physical extents, mountains, lakes seasGeo-Political

EntityGPE Countries states, provinces, counties

Facility FAC Bridges, buildings, airports

Vehicles VEH Planes, trains, and automobiles

Page 16: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Named Entity Recognition (NER)

• Identify the tokens in a sentence that correspond to a Entity.

Type Tag Sample Categories

People PER Individuals, fictional characters, small groups

Organization ORG Companies, agencies, political parties, religious groups, sports teams

Location LOC Physical extents, mountains, lakes seasGeo-Political

EntityGPE Countries states, provinces, counties

Facility FAC Bridges, buildings, airports

Vehicles VEH Planes, trains, and automobiles

Page 17: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking• Identify sequences of non-overlapping

labels

Page 18: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking• Identify sequences of non-overlapping

labels

Page 19: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking• Identify sequences of non-overlapping

labels

Page 20: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking• Identify sequences of non-overlapping

labels

NP - Chunking

Page 21: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking• Identify sequences of non-overlapping

labels

NP - Chunking

Page 22: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking• Identify sequences of non-overlapping

labels

NP - Chunking

Page 23: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Chunking

• IOB Representation

• Every token is In a chunk or Out of a chunk.

• Distinguish the Beginnings of chunks.

Page 24: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

MADden

Page 25: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

MADden

Page 26: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

MADden

Page 27: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

• A graph depicting the relationship between a word (head) and its dependents.

• Starts with a verb and finds the related subject and object.

• Useful in understanding phrases

• Similar to chunking

• Very close to semantic relationships

• Link grammar is the most notable implementation (in AbiWord)

Page 28: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 29: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 30: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 31: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 32: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 33: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 34: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Dependency Parsing

Page 35: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Word-Sense Disambiguation

• Classifying the meaning of a word among many possible interpretations.

• Classification can be done in a myriad of ways.

• Still an open NLP problem

• I like bass!

Word

Page 36: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Co-Reference Resolution• Determining the mentions in a document

that correspond to the same entity.

Page 37: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Co-Reference Resolution• Determining the mentions in a document

that correspond to the same entity.

Page 38: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Co-Reference Resolution• Determining the mentions in a document

that correspond to the same entity.

Entity/Co-reference Chains

Page 39: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Cross-Document Entity Resolution

• Take coreference chains from across documents and match the ones that correspond to the same real world entity.

• A type of clustering problem.

• Use the features from the document.

Page 40: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution Model

• Entity Resolution/Coreference Resolution is an ubiquitous problem.

Page 41: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution Model

• Entity Resolution/Coreference Resolution is an ubiquitous problem.

• Many models and many domains.

Page 42: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution Model

• Entity Resolution/Coreference Resolution is an ubiquitous problem.

• Many models and many domains.

• We use the McCallum method (McCallum, Wellner 2004)

Page 43: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution Model

• Entity Resolution/Coreference Resolution is an ubiquitous problem.

• Many models and many domains.

• We use the McCallum method (McCallum, Wellner 2004)

• Statistically sound — Based on conditional random fields (CRF).

Page 44: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution Model

• Entity Resolution/Coreference Resolution is an ubiquitous problem.

• Many models and many domains.

• We use the McCallum method (McCallum, Wellner 2004)

• Statistically sound — Based on conditional random fields (CRF).

• Relational — Does not assume independence.

Page 45: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution Example

Page 46: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

Page 47: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

Page 48: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

...Late-night host and comic Jimmy Fallon was born on...

Page 49: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

...Late-night host and comic Jimmy Fallon was born on...

...the Kanye West appearance on Jimmy Kimmel Live last...

Page 50: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

...Late-night host and comic Jimmy Fallon was born on...

...the Kanye West appearance on Jimmy Kimmel Live last...

...you work at Jimmy John’s sammich shops, where...

Page 51: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

...Late-night host and comic Jimmy Fallon was born on...

...the Kanye West appearance on Jimmy Kimmel Live last...

...you work at Jimmy John’s sammich shops, where...

Jimmy Kimmel Shares His "Only Complaint" About Jimmy Fallon

Page 52: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

...Late-night host and comic Jimmy Fallon was born on...

...the Kanye West appearance on Jimmy Kimmel Live last...

...you work at Jimmy John’s sammich shops, where...

Jimmy Kimmel Shares His "Only Complaint" About Jimmy Fallon

Page 53: Get the data!Get the data! • Data sets are typically compressed in large batches of files. • The files are: • Encrypted with gpg • Compressed .gz/.bz/.xz files • Hadoop

Entity Resolution ExampleWe extract set of noun phrases from text documents using named entity recognition.

...Late-night host and comic Jimmy Fallon was born on...

...the Kanye West appearance on Jimmy Kimmel Live last...

...you work at Jimmy John’s sammich shops, where...

Jimmy Kimmel Shares His "Only Complaint" About Jimmy Fallon