Sifting Social Data: Word Sense Disambiguation Using Machine Learning

Dr. Stuart Shulman

Founder & CEO, Texifter

“…a wealth of information creates a poverty of attention.” - Herbert Simon, 1971

Pronounced “tech-sifter”; the metaphor is of a sifter

Text Classification

A 2,500-year-old problem

Plato argued it would be frustrating and it still is…

Grimmer & Stewart “Text as Data” Political Analysis (2013)

Volume is a problem for scholars

Coders are expensive

Groups struggle to accurately label text at scale

Validation of both humans and machines is “essential”

Some models are easier to validate than others

All models are wrong

Automated models enhance/amplify, but don’t replace humans

There is no one right way to do this

“Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”

Our free, open-source, web-based text analytics toolkit

The original software kernel: tools for measurement

A mission to avoid tennis elbow

Items load to the screen and the coder hits the keystroke

Keystroke human coding: alone or in groups

Codes

Metadata Data

Human coding can be distributed to individuals, groups & crowds

Computer science & NSF influences: measure everything

How fast?

How reliable?

How accurate?

Inter-rater reliability is one critical measurement
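One common way to quantify inter-rater reliability between two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which statistic the toolkit reports, so the sketch below is an illustration under that assumption; the function name `cohens_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both coders chose the same code.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both coders pick the same code independently,
    # given how often each coder used each code.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same six items.
coder_1 = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
coder_2 = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant", "relevant"]
kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Here the coders agree on four of six items, but half of that agreement would be expected by chance, so kappa is well below the raw agreement rate.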


Plugged in to APIs & Government

Import data directly via APIs or from your desktop


Full historical Twitter access


PowerTrack operators for more precise queries


Store social data with survey responses and other data


Private, 3rd party & free (rate limited) social data sources


Unlimited “fire hose” premium data sources


The Five Pillars of Text Analytics

Search

Filtering

De-duplication and Clustering

Human Coding

Machine-Learning

Pillar #1: Search


Pillar #1: Defined multi-term search


Pillar #2: Filters



Pillar #3: Deduplication & clustering
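Near-duplicate grouping can be sketched with word shingles and Jaccard similarity. This is a generic technique chosen for illustration, not a description of the toolkit's own method; the function names, the shingle size, and the 0.6 threshold are all assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in a text."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_duplicates(texts, threshold=0.6):
    """Greedily assign each text to the first group whose representative is similar enough."""
    groups = []  # list of (representative shingle set, member indices)
    for index, text in enumerate(texts):
        current = shingles(text)
        for representative, members in groups:
            if jaccard(representative, current) >= threshold:
                members.append(index)
                break
        else:
            groups.append((current, [index]))
    return [members for _, members in groups]


# Hypothetical posts: a retweet should land in the same group as its original.
posts = [
    "Apple releases a new phone today",
    "RT Apple releases a new phone today",
    "I baked an apple pie for dessert",
]
groups = group_near_duplicates(posts)
print(groups)
```

Grouping retweets and lightly edited copies this way means a coder only needs to label one representative per group.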



Pillar #4: Human coding (a.k.a. labeling or tagging)


Pillar #4: Human coding


Pillar #4: Human coding (adjudication)


Pillar #5: Machine-learning


Our ActiveLearning engine and coding tools combine…

what humans do best… with what computers do best

Humans and machines learning together

Keep humans “in-the-loop” for more accurate results and better insights
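One generic way to keep humans in the loop is uncertainty sampling: the machine scores the unlabeled items, and the items it is least sure about go to human coders first. The slides do not describe how the ActiveLearning engine selects items, so this is only a sketch of the general idea; `most_uncertain` and the sample scores are hypothetical.

```python
def most_uncertain(probabilities, batch_size=2):
    """Pick the items whose predicted P(relevant) is closest to 0.5.

    probabilities: dict mapping an item id to the classifier's P(relevant).
    """
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:batch_size]


# Hypothetical classifier scores for five unlabeled items.
scores = {"t1": 0.97, "t2": 0.52, "t3": 0.10, "t4": 0.41, "t5": 0.85}
queue = most_uncertain(scores)
print(queue)
```

The human codes for the queued items are then added to the training set and the classifier is retrained, which is the loop the slide describes.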


Word sense disambiguation (relevance)







Human coding can be converted into machine classifiers

Accumulated human coding becomes training data via machine-learning
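As a rough illustration of how human codes can become a relevance classifier, the sketch below trains a small multinomial Naive Bayes model on coded items to separate the company sense of "apple" from the fruit sense. Naive Bayes is an assumption made for the example, not the toolkit's documented algorithm, and the class name and training items are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of coded items
        self.vocabulary = set()

    def train(self, coded_items):
        """coded_items: iterable of (text, label) pairs produced by human coders."""
        for text, label in coded_items:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_items = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total_items)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                smoothed = (self.word_counts[label][word] + 1) / (total_words + vocab_size)
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical human-coded items: is the post about Apple the company?
human_coded = [
    ("apple unveils new iphone at launch event", "relevant"),
    ("apple stock rises after earnings report", "relevant"),
    ("new macbook from apple ships next week", "relevant"),
    ("baked an apple pie with cinnamon", "irrelevant"),
    ("picking apple and pear at the orchard", "irrelevant"),
    ("apple cider recipe for autumn", "irrelevant"),
]
model = NaiveBayes()
model.train(human_coded)
print(model.predict("apple announces iphone earnings"))
print(model.predict("apple pie recipe with cinnamon"))
```

With only six training items this is a toy; in practice the classifier should be validated against held-out human codes before its output is trusted.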

Users can drill into interactive reporting displays

Use metadata to examine sub-sets of responses and create reports.

Slicing big piles of text into smaller, more focused sets is key

Ultimately all text analytics are filtering techniques

Crowdsourcing accelerates the insight generation process through machine-learning

Distributed for synchronous & asynchronous collaboration

CoderRank (patent pending) for enhanced machine-learning is our key innovation

For more information visit the Texifter table or discovertext.com

@discovertext

Thank you for listening!