[ (‘We’, ‘PRP’), (‘<3’, ‘VBP’), (‘NLTK’, ‘NNP’) ]
Dhiana Deva | Gabriel Fonseca
Data Matching @ UFRJ

We love NLTK

DESCRIPTION

NLTK + Data Matching? Yep!

Page 11: We love NLTK

“NLTK” == “Natural Language ToolKit”

+ Python library for NLP
+ Created in 2001 at the University of Pennsylvania
+ Very extensive
+ Many examples
+ Built-in support for 84 datasets (today!)
+ Great documentation
+ Open source ;)
+ Active community

Page 12: We love NLTK

Lots of modules!

corpus: standardized interfaces to corpora and lexicons
tokenize: tokenizers!
stem: stemmers!
collocation: t-test, chi-squared, point-wise mutual information
classify: decision tree, maximum entropy, naive Bayes
cluster: EM, k-means
chunk: regular expression, n-gram, named-entity
metrics: distances, precision, recall, agreement coefficients
probability: frequency distributions, smoothed probability distributions
parse: chart, feature-based, unification, probabilistic, dependency
tag: part-of-speech tagging, n-gram, backoff, Brill, HMM, TnT
...

Page 13: We love NLTK

Can I haz Data Matching?

☑ Accuracy score

☑ Precision score

☑ Recall score

☑ F-measure score

☐ Reduction ratio

☑ Stop-words (11 languages)

★ Punkt sentence tokenizer

★ Punkt word tokenizer

☑ N-gram (words and chars)

☑ Tf-idf

☑ Levenshtein distance

☑ Damerau-Levenshtein distance

☑ Binary distance... Durr!

★ Krippendorff's distance

★ Masi distance

☑ Jaccard distance

☐ Jaro distance

☐ Jaro-Winkler distance

☐ Monge-Elkan distance

☐ Soundex

☐ Phonex

☐ NYSIIS

☐ ONCA

☐ Double-Metaphone

☐ Fuzzy Soundex

☑ Decision tree

☑ SVM

☑ Naive Bayes

★ MaxEnt
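The score items at the top of this checklist can be sketched in a few lines. NLTK ships precision, recall and F-measure in nltk.metrics; the version below is plain Python so it runs without installing anything, and it also covers the reduction ratio, the one score the checklist marks as missing. The function names and the toy pairs are illustrative assumptions, not code from the talk.

```python
def pair_scores(true_pairs, predicted_pairs):
    """Precision, recall and F-measure over sets of matched record pairs."""
    # A pair (a, b) is the same match as (b, a), so compare unordered pairs.
    truth = {frozenset(pair) for pair in true_pairs}
    predicted = {frozenset(pair) for pair in predicted_pairs}
    hits = len(truth & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def reduction_ratio(n_records, n_candidate_pairs):
    """Share of the n*(n-1)/2 possible comparisons that were skipped."""
    all_pairs = n_records * (n_records - 1) // 2
    return 1.0 - n_candidate_pairs / all_pairs if all_pairs else 0.0


# Toy data (made up): records 1..5, two true duplicate pairs.
truth = {(1, 2), (3, 4)}
guessed = {(2, 1), (3, 5)}
print(pair_scores(truth, guessed))
print(reduction_ratio(n_records=5, n_candidate_pairs=4))
```

Scoring on unordered pairs is a design choice: a matcher should not be penalized for reporting a duplicate pair in the opposite order.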

Page 19: We love NLTK

Fun fun fun!

Sentiment analysis
Spelling correction
Spam detection
Topic modeling
Recommender systems
Data deduplication

Page 20: We love NLTK

Why not song matching?!

Grooveshark: online music streaming service
Songs uploaded by record labels, independent artists and users
Lots of duplicates!
Tinysong: Grooveshark’s open RESTful API
Our goal: No repeated songs!
(Remixes and live versions are okay!)

Page 22: We love NLTK

Bohemian Rhapsody by Qween-?!

{
    "Url": "http:\/\/tinysong.com\/PBCJ",
    "SongID": 33834073,
    "SongName": "Bohemian Rhapsody",
    "ArtistID": 2324,
    "ArtistName": "Queen",
    "AlbumID": 1071492,
    "AlbumName": "Greatest Hits"
},
...
{
    "Url": "http:\/\/tinysong.com\/CYxG",
    "SongID": 28835215,
    "SongName": "Bohemian Rhapsody",
    "ArtistID": 1731732,
    "ArtistName": "Qween -",
    "AlbumID": 2364353,
    "AlbumName": "A Night at the Opera"
}
...
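One way the two records above could be flagged as the same song is sketched below, using two items from the checklist: Levenshtein distance and Jaccard distance over character bigrams. NLTK provides both (edit_distance and jaccard_distance in nltk.metrics); this version is plain Python so it runs on its own. Comparing only SongName and ArtistName, the "one edit per five characters" budget, and the live record are assumptions made for illustration, not the approach from the talk.

```python
import re


def normalize(text):
    """Lowercase and keep only letters, digits and single spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def char_ngrams(text, n=2):
    """Set of character n-grams (bigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(a, b):
    """1 minus the overlap of two sets relative to their union."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def close_enough(a, b, chars_per_edit=5):
    """Assumed rule: allow one edit per `chars_per_edit` characters."""
    budget = max(len(a), len(b)) // chars_per_edit
    return levenshtein(a, b) <= budget


def same_song(x, y):
    """Duplicate if both title and artist are within the edit budget."""
    return all(
        close_enough(normalize(x[field]), normalize(y[field]))
        for field in ("SongName", "ArtistName")
    )


first = {"SongID": 33834073, "SongName": "Bohemian Rhapsody", "ArtistName": "Queen"}
second = {"SongID": 28835215, "SongName": "Bohemian Rhapsody", "ArtistName": "Qween -"}
# Hypothetical record, not from the slides: a live version should stay.
live = dict(first, SongName="Bohemian Rhapsody (Live)")

print(levenshtein("queen", "qween"))
print(jaccard_distance(char_ngrams("queen"), char_ngrams("qween")))
print(same_song(first, second), same_song(first, live))
```

The album fields are left out on purpose: the same recording shows up on both "Greatest Hits" and "A Night at the Opera", so album names would split true duplicates.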

Page 26: We love NLTK

Next steps

Other textual data
Machine learning
Acoustic features: loudness, BPM, liveness
Acoustic fingerprinting for supervised learning
Yes, songs have fingerprints too!

Page 27: We love NLTK

Our “sentiment”

+ Quick and easy!
+ Exteeeeeeeeeeeeeeeeensive!
+ Docs & community!
+ Internationalization
- Time performance
- Memory usage
- No online or active learning

Page 28: We love NLTK

Want more?!

+ jellyfish: Jaro-Winkler, Hamming, Soundex, Metaphone, NYSIIS, …
+ nltk-trainer: Command-line NLTK classifiers!
+ scikit-learn: More machine learning! Memory efficient!
+ pattern: Web mining. Out-of-the-box!
+ gensim: Topic modeling. Out-of-the-box!
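To give a feel for what the phonetic encoders in jellyfish do, here is a minimal plain-Python sketch of American Soundex. It is an illustration only, not jellyfish's own code, and in practice the library's soundex function is the thing to call. Under this encoding "Queen" and "Qween" collapse to the same code.

```python
# Consonant groups of American Soundex; vowels, y, h and w carry no digit.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name):
    """Four-character Soundex code: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets this, so repeated digits can recur
    return (code + "000")[:4]  # pad with zeros, truncate to four characters


print(soundex("Queen"), soundex("Qween -"))
print(soundex("Robert"), soundex("Rupert"))
```

A phonetic key like this is cheap to compute, which makes it a handy blocking key before running a more expensive edit-distance comparison.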

Page 31: We love NLTK

Thanks! ;)