Deep Distillation from Text Naveen Ashish University of Southern California & Cognie Inc., March 18th 2014
This is about ….. § “DEEP TEXT DISTILLATION” § The hard nut of having computers “understand” natural language (text) …. § Pushing the boundaries of what we can achieve ….
“It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it.” - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language…. To really master natural language … that’s the key to the Turing Test–to a human requires the full scope of human intelligence. … So the point is that natural language is a very profound domain to do artificial intelligence in. - Ray Kurzweil (2013)
Why ….
§ the problem is far from solved ….. !!!!
search
text analytics
big data analytics
health informatics
social-media intelligence
Introduction
§ About myself § Associate Professor (Informatics), Keck School of Medicine, University of Southern California
§ Cognie Inc.
§ Work leverages § Information extraction work and systems developed at UC Irvine § XAR, UCI-PEP
§ Advisory consulting engagements with several companies and start-ups
Outline
§ Deep distillation: What it is and why § State-of-the-art § Fundamentals § Approach § Details § Expressions, Entities, Sentiment
§ Case studies § Retail, Health, Risk assessment
§ Conclusions
What is “Deep” text distillation ?
Deep Distillation
§ The abstract, not explicitly mentioned ! § What falls in this category § Expressions § Contextual sentiment § Aspect classification
I think you need better chefs → SUGGESTION
The mocha is too sweet → NEGATIVE
I used to take Lipitor for … → PERSONAL EXPERIENCE
The dim lights have a cozy effect …. → AMBIENCE
A Common Intersection
§ Distill at sentence level § Aggregate to entire feedback, post, comment or thread
§ Three primary elements § Expression/Intent § Entities/Aspects (and Classes) § Sentiment
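A minimal Python sketch of this idea, assuming each sentence has already been distilled into the three primary elements. The `Distilled` record, the `aggregate` helper and the hand-assigned labels are illustrative assumptions, not part of the COGNIE platform.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Distilled:
    """Sentence-level result holding the three primary elements."""
    sentence: str
    expression: str  # e.g. "SUGGESTION"
    aspect: str      # e.g. "Food"
    sentiment: str   # e.g. "NEGATIVE"


def aggregate(records):
    """Roll sentence-level results up to one feedback item, post or thread."""
    return {
        "expressions": Counter(r.expression for r in records),
        "aspects": Counter(r.aspect for r in records),
        "sentiments": Counter(r.sentiment for r in records),
    }


# Labels are hand-assigned for illustration; a real pipeline would predict them.
feedback = [
    Distilled("I think you need better chefs", "SUGGESTION", "Food", "NEGATIVE"),
    Distilled("The mocha is too sweet", "COMPLAINT", "Food", "NEGATIVE"),
    Distilled("The dim lights have a cozy effect", "OTHER", "Ambience", "POSITIVE"),
]
summary = aggregate(feedback)
print(summary["aspects"].most_common(1))
```

Counting per element at the aggregate level is what makes statements such as "complaints comprise X% of the feedback" possible.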
Why Deeper ?
§ Goal: Get actionable insights from data ! § Hypothesis: Deeper extraction → Better insights !
The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee
Context
§ COGNIE TM: A PLATFORM for text analytics
COGNIE TM
XAR UCI-PEP
SHIP SURVEY ANALYTICS
RETAIL ANALYTICS
RISK ASSESSMENT
Expressions
§ Beyond entities and sentiment : EXPRESSIONS § EXPRESSIONS § Introduced in [Ashish et al, 2011]
Expressions
You should try Vitamin E oil … → ADVICE
..I have had arthritis since 1991… → EXPERIENCE
HEALTH
..for me lipitor worked like a charm… → OUTCOME
Expressions
…showers had no hot water !… → COMPLAINT
..you should have more veggie options… → SUGGESTION
RETAIL/ENTERPRISE
..meats on special this weekend… → ANNOUNCEMENT
..this is the best store on the west side… → ADVOCACY
There is hardly any evidence to suggest a link between salt and diabetes → -
These results confirm that high intake of salt leads to increase in BP → +
RISK ASSESSMENT
The Landscape
Text Analytics Spectrum
§ Wide offering of § Text analytics engines § Text analysis tools – many open-source
§ Largely still for “spotting things” § entities, concepts, sentiment, topics, emotions ….
§ Going deeper § Luminoso § Attensity (Intents)
§ Deep Learning for Sentiment § Stanford § Recursive Neural Networks
Approach
natural language processing
machine learning
semantics
Architecture: COGNIE TM Platform
Segmentation
POS Tagging
Entity extraction
Anaphora
Parsing
Gram analysis
Existing (DMOZ, SNOMED,UMLS)
Creation
Declarative
Naïve-Bayes
MaxEnt
TFIDF
CRF
RNN Deep Learning
ENSEMBLE
NLP
Machine Learning
Knowledge Engineering
The Indicators: “Give Aways”
§ A combination of multiple types of elements !
…showers had no hot water !… COMPLAINT
(You) should have more veggie options… SUGGESTION
..i have been on lipitor… EXPERIENCE
..this is the best store on the west side… ADVOCACY
Approach: Given Indicators
§ NLP § Identification of individual elements § Unsupervised
§ Relationships between elements
§ Semantics § Identification of individual elements § Knowledge driven
§ Machine Learning Classification § Combine elements → classify
Natural Language Processing
§ UIMA and GATE § Stanford NLP Tools § POS tagging § Parsing § NE Recognizer § Geo-tagger § ….
Natural Language Processing
§ Text Segmentation § In many cases the “unit” of distillation is a sentence
§ Segmentation § UIMA (or GATE) § Custom
§ Complex sentence segmentation § Breakup into individual clauses
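The two segmentation steps can be sketched with plain regular expressions. This is a rough stand-in for the UIMA/GATE or custom segmenters named above, not their API; the splitting rules (sentence-final punctuation, commas, semicolons, "but", "however") are assumptions chosen for illustration.

```python
import re

# Split after sentence-final punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Split a complex sentence on a few clause boundaries (assumed list).
_CLAUSE_SPLIT = re.compile(r"\s*(?:,|;|\bbut\b|\bhowever\b)\s*", re.IGNORECASE)


def split_sentences(text):
    """Break raw text into sentences, the usual unit of distillation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Break a complex sentence into individual clauses."""
    parts = (c.strip(" .!?") for c in _CLAUSE_SPLIT.split(sentence))
    return [c for c in parts if c]


text = "The food was awesome, service needs improvement. You need to be open longer!"
sentences = split_sentences(text)
clauses = split_clauses(sentences[0])
print(sentences)
print(clauses)
```

Clause-level splitting matters for feedback like the first sentence here, where one clause praises the food and the other criticises the service.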
NLP
§ Part-of-speech tags are key indicators § Expression distillation
§ Entity extraction § Names, Locations, Organizations
§ Parsing § If required
§ Anaphora
NGram Analysis
§ Unigram and Bigram analysis § Obtain § Grams § Frequency § Entropy
§ Grams of tokens as well as POS Patterns § VB VBD
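A small sketch of gram analysis in Python. The slides do not say which entropy is computed; this example assumes the Shannon entropy of each gram's distribution over expression labels, so that a low value marks a gram that points to one label. The function names and the third training sentence are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All contiguous n-grams of a token (or POS tag) sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gram_stats(labeled_sentences, n):
    """Map each gram to (frequency, entropy in bits of its label distribution)."""
    by_gram = defaultdict(Counter)
    for tokens, label in labeled_sentences:
        for gram in ngrams(tokens, n):
            by_gram[gram][label] += 1
    stats = {}
    for gram, label_counts in by_gram.items():
        total = sum(label_counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in label_counts.values())
        stats[gram] = (total, entropy)
    return stats


data = [
    ("you should try vitamin e oil".split(), "ADVICE"),
    ("you should have more veggie options".split(), "SUGGESTION"),
    ("you should see a doctor".split(), "ADVICE"),  # assumed extra example
]
bigram_stats = gram_stats(data, 2)
print(bigram_stats[("you", "should")])
```

The same functions apply unchanged to sequences of POS tags instead of tokens, which gives statistics for POS patterns such as `VB VBD`.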
Before Automated Classification: Manual Patterns
§ SoL: Sequences of Labels § Labels § LEX-FOODADJ § spicy
§ LEX-EXCESS § too, very
§ ONT-FOOD § POS-NOUN
§ Sequences (Patterns) § ANY LEX-EXCESS LEX-FOODADJ ANY → § POS-VB POS-MD ….
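One way to read a Sequence-of-Labels pattern is as a small matcher over per-token labels. The sketch below assumes that `ANY` matches zero or more tokens and that a lexicon label matches a token found in its word list; both are interpretations, since the slides do not define the semantics. Only "spicy", "too" and "very" come from the slide; the other lexicon entries are made up for the example. The class a pattern maps to is not filled in here.

```python
LEXICONS = {
    "LEX-FOODADJ": {"spicy", "sweet", "salty"},  # "spicy" is from the slide
    "LEX-EXCESS": {"too", "very"},
}


def labels_for(token):
    """Set of lexicon labels that apply to a single token."""
    return {name for name, words in LEXICONS.items() if token.lower() in words}


def matches(pattern, tokens):
    """True if the tokens match the label pattern; ANY spans zero or more tokens."""
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    return bool(tokens) and head in labels_for(tokens[0]) and matches(rest, tokens[1:])


PATTERN = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]
hit = matches(PATTERN, "the mocha is too sweet".split())
miss = matches(PATTERN, "the mocha is sweet".split())
print(hit, miss)
```

Ontology labels (`ONT-FOOD`) and POS labels (`POS-NOUN`) would need an ontology lookup and a tagger respectively, which are left out of this sketch.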
Classification: Machine Learning
§ Classification tasks § Expression § (Contextual) Sentiment § Aspect category
§ Frameworks § Weka § Mallet
Baseline Classifiers
§ Mallet and Weka § NaiveBayes § MaxEnt § CRF
§ Gram-based § Uni, Bi and Trigram features
§ Baseline § ~ 10% accuracy
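For readers without Mallet or Weka at hand, a gram-based baseline can be sketched in plain Python. This is a hand-rolled multinomial Naive Bayes with add-one smoothing over uni-, bi- and trigram features, written as a stand-in for the toolkits named above and not as their API. The training sentences marked "assumed" are invented, and the toy data says nothing about real accuracy.

```python
import math
from collections import Counter, defaultdict


def gram_features(sentence, max_n=3):
    """Uni-, bi- and trigram features of a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            for f in gram_features(sentence):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        feats = [f for f in gram_features(sentence) if f in self.vocab]
        scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            scores[label] = math.log(count / total) + sum(
                math.log((self.feature_counts[label][f] + 1) / denom) for f in feats)
        return max(scores, key=scores.get)


train = [
    ("showers had no hot water", "COMPLAINT"),
    ("the room had no towels", "COMPLAINT"),                 # assumed
    ("you should have more veggie options", "SUGGESTION"),
    ("you should try longer opening hours", "SUGGESTION"),   # assumed
]
model = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])
print(model.predict("you should have more parking"))
```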
Expression Classification: Features
§ Features § Polar words § Punctuations § Ngrams § POS patterns § Length ! § Beginning § Ontology § …
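A few of these features are simple enough to sketch directly. The function below covers polar words, punctuation, length and the beginning of the sentence; n-gram, POS-pattern and ontology features are omitted. The `POLAR_WORDS` lexicon and the feature names are assumptions made for the example.

```python
import string

POLAR_WORDS = {"best", "awesome", "worst", "terrible"}  # assumed tiny lexicon


def expression_features(sentence):
    """Return a dict of hand-built features for one sentence."""
    words = [t.strip(string.punctuation) for t in sentence.lower().split()]
    words = [w for w in words if w]
    return {
        "n_polar": sum(w in POLAR_WORDS for w in words),   # polar words
        "has_exclamation": "!" in sentence,                # punctuation
        "has_question": "?" in sentence,
        "length": len(words),                              # length in words
        "first_word": words[0] if words else "",           # beginning
        "first_bigram": " ".join(words[:2]),
    }


feats = expression_features("You should have more veggie options!")
print(feats)
```

A dict of named features like this can be fed to most classifier toolkits after one-hot encoding the string-valued entries.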
Classifiers
§ Trees § Decision Tree (J48)
§ Functions § Logistic Regression § SVM
§ Sequence Tagging § CRF: Conditional Random Fields
Expression Classification: Results
§ Have achieved 75% precision and recall for all expressions considered
§ Factors § Feature engineering § Classifier selection § Knowledge engineering
Contextual Sentiment
§ (Just) polar words can be misleading ! § Polar words may not be present at all ! § Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
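The four sentences above contain no conventional polar words, which a toy comparison makes concrete. The sketch below contrasts a polar-word lookup with a rule that combines elements (an excess marker, or an aspect term together with an aspect-specific cue). All three lexicons are assumptions, and the rules are deliberately crude; the slides do not describe the actual method at this level.

```python
POLAR_WORDS = {"awful", "terrible", "great", "awesome"}  # assumed generic lexicon
EXCESS_MARKERS = {"too", "over"}                          # assumed
ASPECT_NEGATIVES = {"service": {"slow"}}                  # assumed aspect-specific cues


def tokenize(sentence):
    return sentence.lower().rstrip(".!? ").split()


def polar_only(sentence):
    """Baseline: report polarity only if a generic polar word is present."""
    return "POLAR" if POLAR_WORDS & set(tokenize(sentence)) else "NEUTRAL"


def contextual(sentence):
    """Combine elements: excess marker + following word, or aspect + cue."""
    tokens = tokenize(sentence)
    if any(t in EXCESS_MARKERS for t in tokens[:-1]):
        return "NEGATIVE"
    for aspect, cues in ASPECT_NEGATIVES.items():
        if aspect in tokens and cues & set(tokens):
            return "NEGATIVE"
    return "NEUTRAL"


examples = [
    "The mocha is too sweet",
    "Wait time is over an hour",
    "Aisles are too narrow",
    "Service is slow",
]
results = {s: (polar_only(s), contextual(s)) for s in examples}
for sentence, pair in results.items():
    print(sentence, pair)
```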
Semantics: Ontologies § Health § Drugs § Conditions § Procedures § Symptoms § …
§ Retail (Dining) § Food/Entrees § Service § Ambience § ….
Leverage Existing Knowledge Sources
§ Health informatics § UMLS § NCI Thesaurus
§ SNOMED § Retail § DMOZ
§ Many other § Freebase § Wikipedia, DBPedia
§ OpenData § data.gov
Knowledge Engineering Tools
§ “Mini” ontology creation § API access § Freebase § BioPortal
§ Wrappers § DMOZ, ….
Practical Requirements
§ Confidence Measures § Below threshold routed to manual transcription teams
§ Polarity § Snippets
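The confidence requirement amounts to a routing step after classification. A minimal sketch, assuming each prediction carries a `confidence` score in [0, 1] along with its snippet and polarity; the field names, the example scores and the 0.7 threshold are all hypothetical.

```python
def route(predictions, threshold=0.7):
    """Split predictions into automatically accepted and manually reviewed."""
    accepted, manual = [], []
    for p in predictions:
        (accepted if p["confidence"] >= threshold else manual).append(p)
    return accepted, manual


predictions = [
    {"snippet": "showers had no hot water", "label": "COMPLAINT",
     "polarity": "NEGATIVE", "confidence": 0.92},
    {"snippet": "meats on special this weekend", "label": "ANNOUNCEMENT",
     "polarity": "NEUTRAL", "confidence": 0.41},
]
accepted, manual = route(predictions)
print(len(accepted), len(manual))
```

In practice the threshold would be tuned per domain against the cost of manual review.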
Open-Source Leverage
COGNIE TM : Open Source Tools § Framework § UIMA
§ Classification § Weka § Mallet
§ NLP § Stanford tools
§ Indexing § Lucene
§ Databases § MySQL, MongoDB
§ Knowledge Engineering § Protégé
Select Case Studies
Case Study: Health Informatics
Insights from, for, by Patients
Distillation
Case Study: Retail & Survey Analytics
§ Feedback § Direct, device collected § Social-media
§ Typically short, few sentences § Strong requirement for aspect classification § [Food,Service,Ambience,Pricing,Other]
§ Negative : “Immediate” vs “Long Term” classification
…food was awesome, service needs improvement ….
you need to be open longer !
Case Study: Risk Assessment
§ Biomedical Literature Abstracts § Correlation direction (+ -) § Subject § Article type
§ Features § Clauses § Negation and Triggers § Semantic Heterogeneity
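How negation and trigger features might combine into a correlation direction can be shown with a toy rule: a sentence with a trigger word is "+" unless a negation word is also present, in which case it is "-". The word lists and the rule itself are assumptions for illustration; the slides name the features but not how they are combined.

```python
NEGATIONS = {"no", "not", "hardly", "never"}            # assumed negation cues
TRIGGERS = {"link", "leads", "confirm", "associated"}   # assumed trigger words


def correlation_direction(sentence):
    """Return '+', '-' or None (no trigger found) for one sentence."""
    tokens = [t.strip(".,;") for t in sentence.lower().split()]
    if not any(t in TRIGGERS for t in tokens):
        return None
    return "-" if any(t in NEGATIONS for t in tokens) else "+"


negative = correlation_direction(
    "There is hardly any evidence to suggest a link between salt and diabetes")
positive = correlation_direction(
    "These results confirm that high intake of salt leads to increase in BP")
print(negative, positive)
```

A real system would work at clause level and handle negation scope, which this word-presence check ignores.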
Performance
MapReduce
§ Throughput can be an issue § Complex language processing algorithms § Large ontologies in some cases
§ Hadoop MapReduce § [Kahn and Ashish, 2014]
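The reason MapReduce fits here is that distillation is independent per sentence, so the expensive step can run in the map phase. The sketch below simulates the map, shuffle and reduce phases in a single Python process; it is not Hadoop code, and `distill` is a placeholder for the real per-sentence pipeline.

```python
from collections import defaultdict


def distill(sentence):
    """Placeholder for the costly per-sentence distillation step (assumed)."""
    return "SUGGESTION" if "should" in sentence.lower() else "OTHER"


def mapper(sentence):
    """Emit (expression label, 1) for one sentence."""
    yield distill(sentence), 1


def reducer(key, values):
    """Sum the counts collected for one expression label."""
    return key, sum(values)


def run_job(sentences):
    groups = defaultdict(list)
    for sentence in sentences:          # map phase: independent per sentence
        for key, value in mapper(sentence):
            groups[key].append(value)   # shuffle: group values by key
    return dict(reducer(k, v) for k, v in groups.items())


counts = run_job([
    "you should have more veggie options",
    "you should try Vitamin E oil",
    "meats on special this weekend",
])
print(counts)
```

On a cluster, large ontologies used inside the map step would need to be shipped to or shared among the workers, which is one of the throughput concerns listed above.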
Conclusions
§ Deeper distillation from text is important § Can be achieved by § Detecting and combining multiple elements in text § Feature engineering § Knowledge engineering § Classifier selection
§ Does not have to be perfect § Every domain, dataset has its nuances
thank you !