Sifting Social Data: Word Sense Disambiguation Using Machine Learning
Dr. Stuart Shulman
Founder & CEO, Texifter
“…a wealth of information creates a poverty of attention.” - Herbert Simon, 1971
Pronounced “tech-sifter”; the metaphor is of a sifter
Text Classification
A 2,500-year-old problem
Plato argued it would be frustrating and it still is…
Grimmer & Stewart, “Text as Data,” Political Analysis (2013)
Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is “essential”
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”
“What should be avoided then, is the blind use of any method without a validation step.”
Our free, open-source, web-based text analytics toolkit
The original software kernel: tools for measurement
A mission to avoid tennis elbow
Items load to the screen and the coder hits the keystroke
Keystroke human coding: alone or in groups
Codes
Metadata Data
Human coding can be distributed to individuals, groups & crowds
Computer science & NSF influences: measure everything
How fast?
How reliable?
How accurate?
Stuart Shulman – Texifter
Inter-rater reliability is one critical measurement
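One common way to quantify inter-rater reliability is Cohen's kappa, which corrects the raw agreement between two coders for the agreement expected by chance. The slides do not say which statistic the toolkit reports, so the sketch below is an illustration only; the function name and the sample codes are assumptions.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(codes_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement, from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two coders on four items.
coder_1 = ["relevant", "relevant", "irrelevant", "irrelevant"]
coder_2 = ["relevant", "irrelevant", "irrelevant", "irrelevant"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2))
```

Here the coders agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.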
Plugged in to APIs & Government
Import data directly via APIs or from your desktop
Full historical Twitter access
PowerTrack operators for more precise queries
Store social data with survey responses and other data
Private, 3rd party & free (rate limited) social data sources
Unlimited “fire hose” premium data sources
The Five Pillars of Text Analytics
Search
Filtering
De-duplication and Clustering
Human Coding
Machine-Learning
Pillar #1: Search
Pillar #1: Defined multi-term search
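A defined multi-term search can be pictured as matching each item against a list of terms, keeping items that contain any term or, optionally, all of them. This is a minimal sketch, not the toolkit's search engine; `multi_term_search`, its parameters, and the sample items are hypothetical.

```python
import re

def multi_term_search(items, terms, match_all=False):
    """Return items containing any (or all) of the terms as whole tokens, ignoring case."""
    # Lookarounds instead of \b so terms such as "#nlp" also match as whole tokens.
    patterns = [
        re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in terms
    ]
    combine = all if match_all else any
    return [text for text in items if combine(p.search(text) for p in patterns)]

# Hypothetical items.
items = [
    "Bank fees are rising",
    "River bank erosion",
    "Interest rates and bank fees",
]
any_hits = multi_term_search(items, ["fees", "rates"])
all_hits = multi_term_search(items, ["fees", "rates"], match_all=True)
print(any_hits)
print(all_hits)
```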
Pillar #2: Filters
Pillar #3: Deduplication & clustering
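Near-duplicate detection is often done by comparing sets of overlapping word sequences (shingles) with Jaccard similarity. The slides do not describe the method the toolkit uses, so the following is a sketch of one common technique; the function names, the 0.6 threshold, and the sample tweets are assumptions.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_near_duplicates(texts, threshold=0.6):
    """Greedy single pass: each text joins the first cluster whose seed text it resembles."""
    clusters = []  # list of (seed_shingles, member_indices)
    for i, text in enumerate(texts):
        sh = shingles(text)
        for seed, members in clusters:
            if jaccard(sh, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sh, [i]))  # no similar seed found: start a new cluster
    return [members for _, members in clusters]

# Hypothetical tweets: a retweet, its original, and an unrelated item.
tweets = [
    "RT breaking news the bank raised interest rates today",
    "breaking news the bank raised interest rates today",
    "we sat on the river bank and watched the sunset",
]
groups = cluster_near_duplicates(tweets)
print(groups)
```

The retweet and its original share 6 of 7 distinct shingles, so they fall into one cluster; the third tweet shares none and stays alone.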
Pillar #4: Human coding (a.k.a. labeling or tagging)
Pillar #4: Human coding
Pillar #4: Human coding (adjudication)
Pillar #5: Machine-learning
Our ActiveLearning engine and coding tools combine…
what humans do best… with what computers do best
Humans and machines learning together
Keep humans “in-the-loop” for more accurate results and better insights
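One common way to keep humans in the loop is uncertainty sampling: the machine scores every uncoded item, and the items it is least sure about go to human coders first. The slides do not say how the ActiveLearning engine selects items, so this is a sketch of the general idea only; `most_uncertain` and the probabilities are hypothetical.

```python
def most_uncertain(probabilities, batch_size=2):
    """Return the ids whose predicted probability of 'relevant' is closest to 0.5."""
    ranked = sorted(probabilities, key=lambda item_id: abs(probabilities[item_id] - 0.5))
    return ranked[:batch_size]

# Hypothetical classifier scores for four uncoded items.
predicted = {"tweet_1": 0.97, "tweet_2": 0.52, "tweet_3": 0.08, "tweet_4": 0.41}
to_code_next = most_uncertain(predicted)
print(to_code_next)
```

Items scored near 0 or 1 are left to the machine, while the borderline ones (here `tweet_2` and `tweet_4`) are routed to people, whose codes then become new training data.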
Word sense disambiguation (relevance)
Human coding can be converted into machine classifiers
Accumulated human coding becomes training data via machine-learning
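To illustrate how human codes can become a classifier for word sense disambiguation, the sketch below trains a small multinomial naive Bayes model on coded items. Naive Bayes is a stand-in chosen for brevity; the slides do not name the algorithm the toolkit uses. The "jaguar" example, the codes, and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(coded_items):
    """coded_items: iterable of (text, code) pairs produced by human coders."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in coded_items:
        class_counts[code] += 1
        for word in tokenize(text):
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(model, text):
    """Return the code with the highest log-probability under naive Bayes."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for code, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denom = sum(word_counts[code].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[code][word] + 1) / denom)
        scores[code] = score
    return max(scores, key=scores.get)

# Hypothetical human coding for the query term "jaguar" (the car brand is the target sense).
coded = [
    ("test drove the new jaguar sedan today", "relevant"),
    ("jaguar dealership has a sale on cars", "relevant"),
    ("a jaguar was spotted in the jungle", "irrelevant"),
    ("the zoo welcomed a jaguar cub", "irrelevant"),
]
model = train(coded)
print(classify(model, "saw a jaguar cub at the zoo"))
print(classify(model, "jaguar sedan sale at the dealership"))
```

Every item contains "jaguar", so the surrounding words carry the sense: "cub" and "zoo" pull toward the animal, "sedan" and "dealership" toward the car.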
Users can drill into interactive reporting displays
Use metadata to examine sub-sets of responses and create reports.
Slicing big piles of text into smaller, more focused sets is key
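Slicing by metadata can be sketched as filtering items on key-value criteria and then tallying codes within the slice. The record layout, field names, and helper functions below are hypothetical and do not reflect the toolkit's data model.

```python
from collections import Counter

def slice_by_metadata(items, **criteria):
    """Keep items whose metadata matches every key=value criterion."""
    return [
        item for item in items
        if all(item["meta"].get(key) == value for key, value in criteria.items())
    ]

def code_report(items):
    """Count how many items carry each code."""
    return Counter(item["code"] for item in items)

# Hypothetical coded items with metadata.
items = [
    {"text": "new jaguar sedan", "code": "relevant",
     "meta": {"lang": "en", "source": "twitter"}},
    {"text": "jaguar cub at the zoo", "code": "irrelevant",
     "meta": {"lang": "en", "source": "twitter"}},
    {"text": "jaguar en la selva", "code": "irrelevant",
     "meta": {"lang": "es", "source": "twitter"}},
    {"text": "I would buy a jaguar again", "code": "relevant",
     "meta": {"lang": "en", "source": "survey"}},
]
english_tweets = slice_by_metadata(items, lang="en", source="twitter")
report = code_report(english_tweets)
print(report)
```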
Ultimately all text analytics are filtering techniques
Crowdsourcing accelerates the insight generation process through machine-learning
Distributed for synchronous & asynchronous collaboration
CoderRank (patent pending) for enhanced machine-learning is our key innovation
For more information visit the Texifter table or discovertext.com
@discovertext
Thank you for listening!