Perfect Text Analytics
Seth Redmore
VP, Product Management
All rights reserved © 2010 Lexalytics Inc.
Perfect
per·fect
[adj., n. pur-fikt; v. per-fekt]
1. conforming absolutely to the description or definition of an ideal type: a perfect sphere; a perfect gentleman.
2. excellent or complete beyond practical or theoretical improvement: There is no perfect legal code. The proportions of this temple are almost perfect.
Text Analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)
In other words, enhancing the value of text content by extracting entities, features, context, relationships and emotion.
Perfect is Fast
Average human reading speed: 250 wpm
Conservative computer reading speed: 6000 wpm/core (our speed on a moderate single core)
Each core is equivalent to the reading bandwidth of 12 people.
Modern machines have 8 cores. That’s just about 100 people in a box. Nice.
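The "people in a box" arithmetic can be sketched as below. This is only an illustration; the function and constant names are made up for this example. Note that dividing the two quoted speeds gives 24 readers per core, so the slide's figure of 12 per core (96 across 8 cores, "just about 100") is the more cautious of the two.

```python
# Figures quoted on the slide.
HUMAN_WPM = 250               # average human reading speed
MACHINE_WPM_PER_CORE = 6000   # conservative machine reading speed per core
PEOPLE_PER_CORE = 12          # equivalence stated on the slide
CORES = 8                     # cores in a "modern machine"


def readers_per_core(machine_wpm: float, human_wpm: float) -> float:
    """Raw ratio of machine reading speed to human reading speed."""
    return machine_wpm / human_wpm


def people_in_a_box(people_per_core: float, cores: int) -> float:
    """Human-reader equivalents for a whole machine."""
    return people_per_core * cores


if __name__ == "__main__":
    # Raw ratio from the two quoted speeds.
    print(readers_per_core(MACHINE_WPM_PER_CORE, HUMAN_WPM))
    # The slide's own, more conservative, per-core figure scaled to 8 cores.
    print(people_in_a_box(PEOPLE_PER_CORE, CORES))
```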
Perfect is Useable
“I don’t like the results” is not the same as “the results are incorrect”
Understanding the behavior is key to usefulness
Can you make better decisions?
Can you make more money or save money?
What is the most controversial area of text analytics?
Thomson Reuters trading w/Sentiment Analysis increased Alpha (profit over market) by 80 basis points
Useable: How much can you differ?
“In my shop, that up until now has relied exclusively on human coding, we consider anything below 90% to be unacceptably inaccurate…. There is no doubt that automated sentiment is getting much much better, but to suggest that people should be okay with 20% of their data being wrong is just absurd.” Katie Delahaye Paine
Why is 10% “wrong” so much less absurd than 20% “wrong”?
[Figure: two side-by-side graphics labeled “20% Error” and “10% Error”]
Perfect is Consistent
Same results for same content, every time
University of Pittsburgh “Multi-Perspective Question Answering” Corpus: 535 documents, 11k+ sentences.
40 hours of training for each rater
~80% inter-rater agreement
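To make "inter-rater agreement" concrete, here is a minimal sketch of how agreement between two human raters could be measured. The labels below are invented for illustration and are not from the MPQA corpus. The slide does not say which agreement measure the ~80% refers to; simple percent agreement is shown, with Cohen's kappa added as a common chance-corrected companion.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters assigned the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical sentence-level sentiment labels from two trained raters.
    rater_a = ["pos", "pos", "neg", "neu", "neu", "neg", "pos", "neu", "neg", "neu"]
    rater_b = ["pos", "neu", "neg", "neu", "neu", "neg", "pos", "neu", "pos", "neu"]
    print(percent_agreement(rater_a, rater_b))
    print(cohens_kappa(rater_a, rater_b))
```

If trained humans agree with each other only about 80% of the time, a machine that agrees with any one human about 80% of the time is performing at the level of a second human rater, and it gives the same answer every time.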
Perfect is (new) Knowledge
Discover the stuff you don’t know
Text Analytics is really, really great at telling you the who, the what, and the where. Sometimes the “how”
You have to supply the “why”, but that question is way easier to answer when you know the other “w’s and the h”
Perfect Includes Everything
Running our top of the line software flat out across one year will cost you about $.002/document analyzed (news article sized content) (assuming 3 docs/core-second, 8 core machine)
The more data you have, the better, and the more your text analytics is worth
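The throughput assumption behind the per-document figure can be worked through as below. This is a rough sketch: the function names are illustrative, and the total annual cost is left as an input because the slide does not break it down.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # running flat out, no downtime


def docs_per_year(docs_per_core_second: float, cores: int) -> float:
    """Documents one machine can analyze in a year at full load."""
    return docs_per_core_second * cores * SECONDS_PER_YEAR


def cost_per_document(annual_cost: float, docs_per_core_second: float, cores: int) -> float:
    """Per-document cost given a total annual cost (hypothetical input)."""
    return annual_cost / docs_per_year(docs_per_core_second, cores)


if __name__ == "__main__":
    # Slide assumptions: 3 docs/core-second on an 8 core machine.
    print(docs_per_year(3, 8))
```

Because the per-document cost is the annual cost divided by the yearly volume, any idle time raises the effective cost per document, which is why the slide specifies running "flat out".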
Perfect is Trainable
Can you solve YOUR business problem with it?
Can you optimize to suit different kinds of content and roll those results up into a single reporting system?
Perfect Text Analytics
Fast
Useable
Consistent
Knowledge
(that is) Inclusive
Trainable
Customer Snapshots (or, “rubber, meet road”)
Reputation Management
Politics
Market Intelligence
[Diagram labels: Client Employee; Client Company; Web 2.0 Collaboration; Firewall; crawl, FTP or CD; SinglePoint Integrated Index; External Content Providers; MI Analyst; Text Analytics; Single Sign-on; Trashcan; Internal research; Optional Document Repository; Internal Document Repository; Content Processing; NL Search Engine; Search Results; User Authentication; Custom Web Crawls & Gov. Databases; Secondary Research Suppliers; News & Journals; Financial analyst reports]
Hospitality
Financial Services
Turns News into numbers for automatic trading systems
Company stocks + Commodities
Resilient server product
Ultimate customers are financial institutions
QED (Quantitative and Event-Driven Trading)
Banks, hedge funds.
JPMorgan, SocGen, Alpha Equities…and others
[Diagram labels: RNSE Server; Financial data; Indicators; Algorithmic Trading (QED firm); Buy/Sell]
ROI – Retrieving Organized Information
RTI CONSULTING SERVICES
REPEATABLE EVOLVING DESIGNS
BALANCED METHODOLOGY
  Business Assessment
  User Interviews
  Taxonomy Design and Recommendation
  Content Governance / Analysis
DEPLOYMENT / SUPPORT
  Solution Alternatives
  Integration & Deployment
  Testing, Tuning, and Evaluation
THOUGHT LEADERSHIP
  Strategy Consultation
  Roadmaps – Evolution and Growth
PROF. TED SULLIVAN
Pharma
The Next Year…
Opinion Mining
Who said what about whom?
Clinton: N. Korea must face consequences over sinking
U.S. Secretary of State Hillary Clinton warned Friday that North Korea must face consequences over the alleged sinking of a South Korean warship which has stoked tensions in the divided peninsular.
A South Korean military report published this week claimed that the sinking of the Cheonan was caused by a North Korean torpedo attack.
Pyongyang denies that claim and said Friday that it could back out of a nonaggression pact between the neighbors if Seoul attempted to punish it over the sinking.
North Korea and South Korea have remained officially at war since an armistice in 1953 brought their three-year Cold War conflict to an end.
"I think it's important to send a clear message to North Korea that provocative actions have consequences," Clinton said Friday as she began a week-long Asian tour in Tokyo, Japan.
She said she was consulting with international allies to find the appropriate reaction.
Speaker          Topic                Sentiment
Pyongyang        Seoul                0
                 nonaggression pact   0
Mike Mullen      North Korea          0
                 present situation    0.021728
                 normal state         0
                 South Korea          -0.478279
Hillary Clinton  North Korea          0
                 provocative actions  0
                 Hillary Clinton      0
                 clear message        0.6
                 North Korea          0.6
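One way to hold "who said what about whom" results is as speaker/topic/sentiment triples. The sketch below uses a handful of rows from the slide's table and assumes that a row without a speaker belongs to the speaker listed above it. The `Opinion` class and helper functions are hypothetical names for this example, not part of any Lexalytics API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    speaker: str      # who said it
    topic: str        # about whom or what
    sentiment: float  # negative < 0 < positive


# A subset of the rows shown on the slide.
OPINIONS = [
    Opinion("Pyongyang", "Seoul", 0.0),
    Opinion("Mike Mullen", "South Korea", -0.478279),
    Opinion("Mike Mullen", "present situation", 0.021728),
    Opinion("Hillary Clinton", "clear message", 0.6),
    Opinion("Hillary Clinton", "North Korea", 0.6),
]


def topics_by_speaker(opinions):
    """Group (topic, sentiment) pairs under each speaker."""
    grouped = defaultdict(list)
    for o in opinions:
        grouped[o.speaker].append((o.topic, o.sentiment))
    return dict(grouped)


def strongest_opinion(opinions):
    """Opinion with the largest absolute sentiment (first one wins ties)."""
    return max(opinions, key=lambda o: abs(o.sentiment))


if __name__ == "__main__":
    for speaker, items in topics_by_speaker(OPINIONS).items():
        print(speaker, items)
    print(strongest_opinion(OPINIONS))
```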
Sarcasm, Twitter
Model trained to detect sarcasm
Once detected, you can decide what to do with it, because actually determining the sentiment is going to be unreliable
New model trained on Twitter content
Moving towards a concept of text analytics driven by business logic
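The "detect it, then decide what to do with it" idea might look like the following. This is a sketch under assumptions: the sarcasm probability and sentiment score are presumed to come from separate models, and the threshold and policy names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredText:
    text: str
    sentiment: float     # output of a sentiment model (assumed)
    sarcasm_prob: float  # output of a sarcasm detector (assumed)


def apply_sarcasm_policy(item: ScoredText, threshold: float = 0.5,
                         policy: str = "discard") -> Optional[float]:
    """Return the sentiment to report, or None if the item is dropped.

    The policy is the business-logic decision; the detector only flags.
    """
    if item.sarcasm_prob < threshold:
        return item.sentiment  # not flagged: trust the sentiment score
    if policy == "discard":
        return None            # leave flagged content out of reports
    if policy == "neutral":
        return 0.0             # count the mention, ignore its polarity
    if policy == "keep":
        return item.sentiment  # report anyway, e.g. for manual review
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    tweet = ScoredText("Great, another outage. Love it.", sentiment=0.7, sarcasm_prob=0.9)
    print(apply_sarcasm_policy(tweet, policy="discard"))
    print(apply_sarcasm_policy(tweet, policy="neutral"))
```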
Thesaurus-based Theme Rollup
Machine generated conceptual taxonomy
Gas/Electric Hybrid and EV might roll up to EV
Fewer themes, but very useful to detect patterns across content
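A toy version of theme rollup, using the slide's EV example: the thesaurus here is a hand-written dictionary standing in for the machine generated taxonomy, and the extra theme is invented for illustration.

```python
from collections import Counter

# Hypothetical thesaurus: lower-cased theme -> broader concept.
THESAURUS = {
    "gas/electric hybrid": "EV",
    "ev": "EV",
}


def roll_up(themes, thesaurus):
    """Count themes after mapping each one to its broader concept.

    Themes with no thesaurus entry are kept unchanged.
    """
    return Counter(thesaurus.get(theme.lower(), theme) for theme in themes)


if __name__ == "__main__":
    extracted = ["Gas/Electric Hybrid", "EV", "EV", "charging station"]
    print(roll_up(extracted, THESAURUS))
```

Rolling up trades detail for signal: three separate low-count themes become one higher-count concept that is easier to track across a large body of content.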
Foreign Language Support
French is first, followed by other Romance languages
New stemmer
New summarization algorithm
New part-of-speech tagger
Automatic language detection
New sentiment/entity extraction algorithms
Also applicable to vertical specific content
Confidence scoring by algorithm
Use business logic to meld the results
Trainable Entity Sentiment
New technique for entity sentiment
Initial results from testing in English extremely promising
Average human scoring overlap of >> 90% for scored sentences
Initially used only for French

P(Human | Computer) (precision)
                  Human Tagged
Computer Tagged   Negative   Neutral   Positive   Grand Total
Negative          100.00%    0.00%     0.00%      100.00%
Neutral           0.64%      98.29%    1.07%      100.00%
Positive          0.00%      6.67%     93.33%     100.00%
Grand Total       5.70%      88.2%     6.27%      100.00%
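A table of P(Human | Computer) is a confusion matrix normalized by row, so each diagonal cell is the precision for that class. The sketch below shows the calculation. The counts are illustrative only: they were chosen so the Neutral and Positive rows reproduce the slide's percentages, and the real sample sizes are not given in the deck.

```python
LABELS = ("Negative", "Neutral", "Positive")


def precision_table(counts):
    """Row-normalize counts[computer_tag][human_tag] into P(human | computer)."""
    table = {}
    for computer_tag, row in counts.items():
        total = sum(row.values())
        table[computer_tag] = {
            human_tag: (row.get(human_tag, 0) / total if total else 0.0)
            for human_tag in LABELS
        }
    return table


# Illustrative counts (NOT from the source): sentences per tag combination.
COUNTS = {
    "Negative": {"Negative": 30, "Neutral": 0, "Positive": 0},
    "Neutral": {"Negative": 3, "Neutral": 459, "Positive": 5},
    "Positive": {"Negative": 0, "Neutral": 1, "Positive": 14},
}

if __name__ == "__main__":
    for computer_tag, row in precision_table(COUNTS).items():
        cells = "  ".join(f"{row[h] * 100:6.2f}%" for h in LABELS)
        print(f"{computer_tag:<10}{cells}")
```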
Tool Enhancements
Entity Management Toolkit
  Part of Speech Tagset training
  Using it to train Salience on French
Sentiment Toolkit
  Build your own entity sentiment models:
    French (first)
  Eventually use on English content:
    Twitter
    Customer Satisfaction
    Others…
New EMT helps us build a new French PoS tagger
New Sentiment Toolkit + Maximum Entropy model builder allows new Entity and Sentiment modules
[Diagram labels: Doc; POS Tagger; Fully Tagged Document; Themes & Summaries; Entity Extraction & Sentiment Models]
Business Logic + TA Algorithms
[Diagram: Content for Entity: Cisco is routed to TA algorithms A, B, C, D. Other labels: Finance $, Sports, Unknown?]
Route On: Source, Search, Business Logic, Other TA System, Sarcasm
Probability Scores (one row per score box in the diagram):
POS  NEG  NEU  MIX
25   25   25   25
60   10   20   10
80   05   05   10
50   20   30   0
Result: Cisco : Positive
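One simple way business logic could meld several algorithms' probability scores is a weighted average followed by picking the top class. The deck does not specify the combination rule, so the averaging and the weights below are assumptions. The four score sets are the ones shown in the diagram, and with equal weights the result agrees with the slide's "Cisco : Positive".

```python
CLASSES = ("POS", "NEG", "NEU", "MIX")

# The four probability score boxes from the diagram (percentages).
SCORES = [
    {"POS": 25, "NEG": 25, "NEU": 25, "MIX": 25},
    {"POS": 60, "NEG": 10, "NEU": 20, "MIX": 10},
    {"POS": 80, "NEG": 5, "NEU": 5, "MIX": 10},
    {"POS": 50, "NEG": 20, "NEU": 30, "MIX": 0},
]


def meld(score_sets, weights=None):
    """Weighted average of per-class scores (assumed combination rule)."""
    if weights is None:
        weights = [1.0] * len(score_sets)  # treat every algorithm equally
    if len(weights) != len(score_sets) or sum(weights) <= 0:
        raise ValueError("need one weight per score set, with a positive sum")
    total = sum(weights)
    return {
        c: sum(w * s[c] for w, s in zip(weights, score_sets)) / total
        for c in CLASSES
    }


def decide(melded):
    """Pick the class with the highest melded score."""
    return max(CLASSES, key=lambda c: melded[c])


if __name__ == "__main__":
    melded = meld(SCORES)
    print(melded)
    print("Cisco :", decide(melded))
```

The weights are where routing rules would plug in: for content from a finance source, for example, a finance-tuned algorithm could be given a larger weight than a general-purpose one.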
Summary
Lots of people making money with text analytics
In lots of different verticals
Next 12 months brings online a whole host of features to make our software even more flexible
Check out tas.lexalytics.com
Check out www.lexalytics.com/lexascope