Upload
summer
View
51
Download
1
Tags:
Embed Size (px)
DESCRIPTION
Semantic Infrastructure Workshop Development. Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com. Agenda. Text Analytics – Foundation Features and Capabilities Evaluation of Text Analytics Start with Self-Knowledge - PowerPoint PPT Presentation
Citation preview
Semantic Infrastructure Workshop Development
Tom ReamyChief Knowledge Architect
KAPS GroupKnowledge Architecture Professional Services
http://www.kapsgroup.com
2
Agenda Text Analytics – Foundation
– Features and Capabilities Evaluation of Text Analytics
– Start with Self-Knowledge – Features and Capabilities – Filter, Proof of Concept / Pilot
Text Analytics Development– Progressive Refinement– Categorization, Extraction, Sentiment– Case Studies – Best Practices
3
Semantic Infrastructure - FoundationText Analytics Features Noun Phrase Extraction
– Catalogs with variants, rule based dynamic– Multiple types, custom classes – entities, concepts, events– Feeds facets
Summarization– Customizable rules, map to different content
Fact Extraction– Relationships of entities – people-organizations-activities– Ontologies – triples, RDF, etc.
Sentiment Analysis– Rules – Objects and phrases – positive and negative
4
Semantic Infrastructure - Foundation Text Analytics Features Auto-categorization
– Training sets – Bayesian, Vector space– Terms – literal strings, stemming, dictionary of related terms– Rules – simple – position in text (Title, body, url)– Semantic Network – Predefined relationships, sets of rules– Boolean– Full search syntax – AND, OR, NOT– Advanced – NEAR (#), PARAGRAPH, SENTENCE
This is the most difficult to develop Build on a Taxonomy Combine with Extraction
– If any of list of entities and other words
5
6
7
8
9
10
11
12
13
Semantic Infrastructure - Foundation Vendors of Taxonomy/ Text Analytics Software
– Attensity– SAP - Business Objects –
Inxight– Clarabridge– ClearForest– Concept Searching– Data Harmony / Access
Innovations– Expert Systems– GATE (Open Source)– IBM Content Analyst
– Lexalytics– Multi-Tes– Nstein– SAS - Teragram– SchemaLogic– Smart Logic– Synaptica– Ontology Vendors
14
15
Semantic Infrastructure - Foundation Varieties of Taxonomy/ Text Analytics Software Taxonomy Management
– Synaptica, SchemaLogic Full Platform
– SAP-Inxight, Clear Forest, SAS- Teragram, Data Harmony, Concept Searching, IBM
Content Management– Nstein, Interwoven, Documentum, etc.
Embedded – Search– FAST, Autonomy, Endeca, Exalead, etc.
Specialty– Sentiment Analysis - Lexalytics
16
Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge Strategic and Business Context Info Problems – what, how severe Strategic Questions – why, what value from the taxonomy/text
analytics, how are you going to use it Formal Process - KA audit – content, users, technology, business
and information behaviors, applications - Or informal for smaller organization,
Text Analytics Strategy/Model – forms, technology, people– Existing taxonomic resources, software
Need this foundation to evaluate and to develop
17
Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge Do you need it – and what blend if so? Taxonomy Management Stand alone
– Multiple taxonomies, languages, authors-editors Technology Environment – ECM, Enterprise Search – where is it
embedded Publishing Process – where and how is metadata being added –
now and projected future– Can it utilize auto-categorization, entity extraction, summarization
Is the current search adequate – can it utilize text analytics? Applications – text mining, BI, CI, Alerts?
Semantic Infrastructure - Foundation Design of the Text Analytics Selection Team Interdisciplinary Team, led by Information Professionals
– IT – software experience, budget, support tests– Business – understand business and requirements– Library – understand information structure, understanding of
search semantics and functionality Much more likely to make a good decision
– This is not a traditional IT software evaluation – semantics Create the foundation for implementation
18
Semantic Infrastructure - Foundation Evaluating Text Analytics Software – Process
Start with Self Knowledge Eliminate the unfit
– Filter One - Ask Experts - reputation, research – Gartner, etc.• Market strength of vendor, platforms, etc.• Feature scorecard – minimum, must have, filter to top 3-4
– Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
Filter Three – Focus Group one day visit – 3-4 vendors Deep pilot (2) / POC – advanced, integration, semantics
– Two Questions – who is better, can it be done, for how much Focus on working relationship with vendor.
19
Semantic Infrastructure - Foundation Evaluating Taxonomy Software - POC
Quality of results is the essential factor 6 weeks POC – bake off / or short pilot Real life scenarios, categorization with your content Preparation:
– Preliminary analysis of content and users information needs– Set up software in lab – relatively easy– Train taxonomist(s) on software(s)– Develop taxonomy if none available
Six week POC – 3 rounds of development, test, refine / Not OOB Need SME’s as test evaluators – also to do an initial categorization of
content
20
Semantic Infrastructure - Foundation Evaluating Taxonomy Software - POC Scenarios – categorization, extraction, summarization, etc. Majority of time is on auto-categorization Need to balance uniformity of results with vendor unique capabilities –
have to determine at POC time Elements:
– Content– Search terms / search scenarios– Training and test sets
Taxonomy Developers – expert consultants plus internal taxonomists Evaluate usability in action by taxonomists
21
Semantic Infrastructure - Foundation Evaluating Taxonomy Software – POC Issues Quality of content Quality of initial human categorization Normalize among different test evaluators Quality of taxonomists – experience with text analytics software and/or
experience with content and information needs and behaviors Quality of taxonomy
– General issues – structure (too flat or too deep)– Overlapping categories – Differences in use – browse, index, categorize
Categorization essential issue is complexity of language Entity Extraction essential issue is scale, disambiguation
22
Semantic Infrastructure - Foundation Case Study: Telecom Service Company History, Reputation Full Platform –Categorization,
Extraction, Sentiment Integration – java, API-SDK,
Linux Multiple languages Scale – millions of docs a day Total Cost of Ownership Ease of Development - new Vendor Relationship – OEM,
etc.
Expert Systems IBM SAS - Teragram Smart Logic
Option – Multiple vendors – Sentiment & Platform
IBM and SAS – finalists
23
24
Semantic Infrastructure - Foundation POC Design Discussion: Evaluation Criteria Basic Test Design – categorize test set
– Score – by file name, human testers Categorization – Call Motivation
– Accuracy Level – 80-90%– Effort Level per accuracy level
Sentiment Analysis– Accuracy Level – 80-90%– Effort Level per accuracy level
Quantify development time – main elements Comparison of two vendors – how score?
– Combination of scores and report
Text Analytics POC OutcomesCategorization Results
SAS IBM
Recall-Motivation 92.6 90.7
Recall-Actions 93.8 88.3
Precision – Mot. 84.3
Precision-Act 100
Uncategorized 87.5
Raw Precision 73 46
25
Text Analytics POC OutcomesVendor Comparisons Categorization Results – both good, edge to SAS on precision
– Use of Relevancy to set thresholds Development Environment
– IBM as toolkit provides more flexibility but it also increases development effort
Methodology – IBM enforces good method, but takes more time– SAS can be used in exactly the same way
SAS has a much more complete set of operators – NOT, DIST, START
26
Text Analytics POC OutcomesVendor Comparisons - Functionality Sentiment Analysis – SAS has workbench, IBM would require more
development– SAS also has statistical modeling capabilities
Entity and Fact extraction – seems basically the same– SAS and use operators for improved disambiguation –
Summarization – SAS has built-in– IBM could develop using categorization rules – but not clear that
would be as effective without operators Conclusion: Both can do the job, edge to SAS Now the fun begins - development
27
28
Text Analytics Development: Foundation
Articulated Information Management Strategy (K Map)– Content and Structures and Metadata– Search, ECM, applications - and how used in Enterprise– Community information needs and Text Analytics Team
POC establishes the preliminary foundation– Need to expand and deepen– Content – full range, basis for rules-training– Additional SME’s – content selection, refinement
Taxonomy – starting point for categorization / suitable? Databases – starting point for entity catalogs
29
Text Analytics DevelopmentEnterprise Environment – Case Studies A Tale of Two Taxonomies
– It was the best of times, it was the worst of times Basic Approach
– Initial meetings – project planning– High level K map – content, people, technology– Contextual and Information Interviews– Content Analysis– Draft Taxonomy – validation interviews, refine– Integration and Governance Plans
30
Text Analytics Development Enterprise Environment – Case One – Taxonomy, 7 facets Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins Facets:
– Organization > Division > Group– Clients > Federal > EPA– Instruments > Environmental Testing > Ocean Analysis > Vehicle– Facilities > Division > Location > Building X– Methods > Social > Population Study– Materials > Compounds > Chemicals– Content Type – Knowledge Asset > Proposals
31
Text Analytics Development Enterprise Environment – Case One – Taxonomy, 7 facets Project Owner – KM department – included RM, business
process Involvement of library - critical Realistic budget, flexible project plan Successful interviews – build on context
– Overall information strategy – where taxonomy fits Good Draft taxonomy and extended refinement
– Software, process, team – train library staff– Good selection and number of facets
Final plans and hand off to client
32
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Taxonomy of Subjects / Disciplines:
– Geology > Petrology Facets:
– Organization > Division > Group– Process > Drill a Well > File Test Plan– Assets > Platforms > Platform A– Content Type > Communication > Presentations
33
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Environment Issues
– Value of taxonomy understood, but not the complexity and scope
– Under budget, under staffed– Location – not KM – tied to RM and software
• Solution looking for the right problem– Importance of an internal library staff– Difficulty of merging internal expertise and taxonomy
34
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Project Issues
– Project mind set – not infrastructure– Wrong kind of project management
• Special needs of a taxonomy project• Importance of integration – with team, company
– Project plan more important than results• Rushing to meet deadlines doesn’t work with semantics as
well as software
35
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Research Issues
– Not enough research – and wrong people– Interference of non-taxonomy – communication– Misunderstanding of research – wanted tinker toy connections
• Interview 1 implies conclusion A Design Issues
– Not enough facets– Wrong set of facets – business not information– Ill-defined facets – too complex internal structure
36
Text Analytics Development Conclusion: Risk Factors Political-Cultural-Semantic Environment
– Not simple resistance - more subtle • – re-interpretation of specific conclusions and sequence of
conclusions / Relative importance of specific recommendations Understanding project scope Access to content and people
– Enthusiastic access Importance of a unified project team
– Working communication as well as weekly meetings
37
Text Analytics DevelopmentCase Study 2 – POC – Telecom Client Demo of SAS – Teragram / Enterprise Content
Categorization
38
Text Analytics Development Best Practices - Principles Importance of ongoing maintenance and refinement Need dedicated taxonomy team working with SME’s Work with application developers to incorporate text
analytics into new applications Importance of metrics and feedback
– Software and social Questions:
– What are important subjects (and changes)– What information do they need?– How is their information related to other silos?
39
Text Analytics Development Best Practices - Principles Process
– Realistic Budget – not a nice to have add on– Flexible Project plan - semantics are complex and messy
• Time estimates are difficult, object success measures are too– Transition from development to maintenance is fluid
Resources– Interdisciplinary Team is essential– Importance of communication – languages– Merging internal and external expertise
40
Text Analytics Development Best Practices - Principles Categorization taxonomy structure
– Tradeoff of depth and complexity of rules– Multiple avenues – facets, terms, rules, etc.
• No right balance– Recall-precision balance is application specific– Training sets of starting points, rules rule– Need for custom development
Technology– Basic integration – XML– Advanced –combine unstructured and structured in new ways
41
Text Analytics Development Best Practices – Risk Factors Value understood, but not the complexity and scope Project mindset – software project and then done Not enough research on user information needs, behaviors
– Talking to the right people and asking the right questions– Getting beyond “All of the Above” surveys
Not enough resources, wrong resources Enthusiastic access to content and people Bad design – starting with the wrong type of taxonomy Categorization is not library science
– More like cognitive anthropology
42
Semantic Infrastructure Development Conclusion Text Analytics is the Foundation for Semantic infrastructure Evaluation of Text Analytics – different than IT software
– POC – essential, foundation of development– Difference of taxonomy and categorization
• Concepts vs. text in documents Enterprise Context – strategic, self-knowledge
– Infrastructure resource, not a project– Interdisciplinary Team and applications
Integration with other initiatives and technologies– Text Mining, Data Mining, Sentiment & beyond, Everything!
Questions? Tom Reamy
[email protected] Group
Knowledge Architecture Professional Serviceshttp://www.kapsgroup.com