Data-X NLP Module: Feature Engineering & Text Representation


NLP Module: Feature Engineering & Text Representation

NLP Process

Text Processing

Clean up the text to make it easier to use and more consistent to increase prediction accuracy later on
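As a rough illustration (not from the original slides, and the exact steps depend on the task), a cleaning function might look like:

import re

def clean_text(text):
    text = text.lower()                         # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # strip punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(clean_text("This is the FIRST document!!"))   # this is the first document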

Feature Engineering & Text Representation

Learn how to extract information from text and represent it numerically

Learning Models

Use learning models to identify parts of speech, entities, sentiment, and other aspects of the text.
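A minimal sketch of this step, assuming NLTK (the slides do not name a library):

import nltk

nltk.download('punkt')                          # tokenizer model (one-time download)
nltk.download('averaged_perceptron_tagger')     # POS tagger model (one-time download)

tokens = nltk.word_tokenize("This is the first document.")
print(nltk.pos_tag(tokens))                     # list of (word, part-of-speech tag) pairs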

Feature Engineering & Text Representation

One-Hot Encoding

Transforms a categorical feature into many binary features

Bag of Words Model Using CountVectorizer

Generalization of one-hot encoding for a string of text

N-Gram Encoding

Captures word order in a vector model

TFIDF

Converts a collection of raw documents to a matrix of TFIDF features

What is Feature Engineering & What is Text Representation?

Feature engineering is the process of transforming the raw data to improve the accuracy of models by creating new features from existing data

Text Representation is numerically representing text to make it mathematically computable

One Hot Encoding

Encodes categorical data as real numbers such that the magnitude of each dimension is meaningful

For each distinct possible value, a new binary feature is created

Example:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# df is assumed to be a DataFrame with categorical 'name' and 'kind' columns, e.g.:
df = pd.DataFrame({'name': ['goat', 'pig', 'dog'], 'kind': ['farm', 'farm', 'pet']})

oh_enc = OneHotEncoder()
oh_enc.fit(df[['name', 'kind']])                    # learn the distinct values of each column
oh_enc.transform(df[['name', 'kind']]).todense()    # one binary column per distinct value

One Hot Encoding in Scikit-Learn

One Hot Encoding in Pandas

pd.get_dummies(df[['name', 'kind']])
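Both produce the same binary columns: pd.get_dummies returns a labeled DataFrame that is easy to inspect, while OneHotEncoder remembers the categories it was fit on and can consistently transform new data inside a scikit-learn pipeline.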

Bag of Words Model

Extracts features from text

Stop words can be excluded, word order is lost, and the encoding is sparse

Example:

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")   # drop common English stop words
vectorizer.fit(sample_text)

# to see which words were kept in the vocabulary
print("Words:", list(enumerate(vectorizer.get_feature_names_out())))

Bag of Words Model using Scikit-Learn
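Continuing the sketch above (the transform step is not shown on the original slide), the fitted vectorizer maps each document to a vector of word counts:

counts = vectorizer.transform(sample_text)
print(counts.toarray())   # one row per document, one column per kept word

The same transform step applies to the n-gram and TFIDF vectorizers below.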

N-gram Encoding

Extracts features from text while capturing local word order by defining counts over sliding windows

Example (n = 2):

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']

from sklearn.feature_extraction.text import CountVectorizer

bigram = CountVectorizer(ngram_range=(1, 2))   # count unigrams and bigrams
bigram.fit(sample_text)

# to see which unigrams and bigrams were kept
print("Words:", list(enumerate(bigram.get_feature_names_out())))

N-gram Encoding using Scikit-Learn

TFIDF Vectorizer

Converts a collection of raw documents to a matrix of TFIDF features

What are TFIDF Features?

TFIDF stands for term frequency-inverse document frequency; it represents text data by indicating how important a word is to a document relative to the rest of the collection

2 Parts:

TF = (# of times term t appears in the document) / (total # of terms in the document)

IDF = log10((total # of documents) / (# of documents containing term t))

TFIDF Vectorizer

TFIDF = TF*IDF

TF represents how frequently the word shows up in the document

IDF represents how rare the word is across the collection, weighting rare words as more informative
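As a worked example with illustrative numbers (not from the slides): if a 100-word document contains term t 5 times, TF = 5/100 = 0.05; if 10 of the 1,000 documents in the collection contain t, IDF = log10(1000/10) = 2; so TFIDF = 0.05 * 2 = 0.1.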

TFIDF Vectorizer (cont.)

from sklearn.feature_extraction.text import TfidfVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sample_text)   # sparse matrix of TFIDF weights

# to see which words were kept in the vocabulary
print(vectorizer.get_feature_names_out())

TFIDF Vectorizer Encoding using Scikit-Learn
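To inspect the result (a continuation sketch, not on the original slide):

print(X.shape)        # (number of documents, number of terms)
print(X.toarray())    # TFIDF weight of each term in each document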
