Semantic Infrastructure Workshop Development

Semantic Infrastructure Workshop Development

Tom ReamyChief Knowledge Architect

KAPS GroupKnowledge Architecture Professional Services

http://www.kapsgroup.com

2

Agenda Text Analytics – Foundation

– Features and Capabilities Evaluation of Text Analytics

– Start with Self-Knowledge – Features and Capabilities – Filter, Proof of Concept / Pilot

Text Analytics Development– Progressive Refinement– Categorization, Extraction, Sentiment– Case Studies – Best Practices

3

Semantic Infrastructure - FoundationText Analytics Features Noun Phrase Extraction

– Catalogs with variants, rule based dynamic– Multiple types, custom classes – entities, concepts, events– Feeds facets

Summarization– Customizable rules, map to different content

Fact Extraction– Relationships of entities – people-organizations-activities– Ontologies – triples, RDF, etc.

Sentiment Analysis– Rules – Objects and phrases – positive and negative

4

Semantic Infrastructure - Foundation Text Analytics Features Auto-categorization

– Training sets – Bayesian, Vector space– Terms – literal strings, stemming, dictionary of related terms– Rules – simple – position in text (Title, body, url)– Semantic Network – Predefined relationships, sets of rules– Boolean– Full search syntax – AND, OR, NOT– Advanced – NEAR (#), PARAGRAPH, SENTENCE

This is the most difficult to develop Build on a Taxonomy Combine with Extraction

– If any of list of entities and other words

5

6

7

8

9

10

11

12

13

Semantic Infrastructure - Foundation Vendors of Taxonomy/ Text Analytics Software

– Attensity– SAP - Business Objects –

Inxight– Clarabridge– ClearForest– Concept Searching– Data Harmony / Access

Innovations– Expert Systems– GATE (Open Source)– IBM Content Analyst

– Lexalytics– Multi-Tes– Nstein– SAS - Teragram– SchemaLogic– Smart Logic– Synaptica– Ontology Vendors

14

15

Semantic Infrastructure - Foundation Varieties of Taxonomy/ Text Analytics Software Taxonomy Management

– Synaptica, SchemaLogic Full Platform

– SAP-Inxight, Clear Forest, SAS- Teragram, Data Harmony, Concept Searching, IBM

Content Management– Nstein, Interwoven, Documentum, etc.

Embedded – Search– FAST, Autonomy, Endeca, Exalead, etc.

Specialty– Sentiment Analysis - Lexalytics

16

Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge Strategic and Business Context Info Problems – what, how severe Strategic Questions – why, what value from the taxonomy/text

analytics, how are you going to use it Formal Process - KA audit – content, users, technology, business

and information behaviors, applications - Or informal for smaller organization,

Text Analytics Strategy/Model – forms, technology, people– Existing taxonomic resources, software

Need this foundation to evaluate and to develop

17

Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge Do you need it – and what blend if so? Taxonomy Management Stand alone

– Multiple taxonomies, languages, authors-editors Technology Environment – ECM, Enterprise Search – where is it

embedded Publishing Process – where and how is metadata being added –

now and projected future– Can it utilize auto-categorization, entity extraction, summarization

Is the current search adequate – can it utilize text analytics? Applications – text mining, BI, CI, Alerts?

Semantic Infrastructure - Foundation Design of the Text Analytics Selection Team Interdisciplinary Team, led by Information Professionals

– IT – software experience, budget, support tests– Business – understand business and requirements– Library – understand information structure, understanding of

search semantics and functionality Much more likely to make a good decision

– This is not a traditional IT software evaluation – semantics Create the foundation for implementation

18

Semantic Infrastructure - Foundation Evaluating Text Analytics Software – Process

Start with Self Knowledge Eliminate the unfit

– Filter One - Ask Experts - reputation, research – Gartner, etc.• Market strength of vendor, platforms, etc.• Feature scorecard – minimum, must have, filter to top 3-4

– Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus

Filter Three – Focus Group one day visit – 3-4 vendors Deep pilot (2) / POC – advanced, integration, semantics

– Two Questions – who is better, can it be done, for how much Focus on working relationship with vendor.

19

Semantic Infrastructure - Foundation Evaluating Taxonomy Software - POC

Quality of results is the essential factor 6 weeks POC – bake off / or short pilot Real life scenarios, categorization with your content Preparation:

– Preliminary analysis of content and users information needs– Set up software in lab – relatively easy– Train taxonomist(s) on software(s)– Develop taxonomy if none available

Six week POC – 3 rounds of development, test, refine / Not OOB Need SME’s as test evaluators – also to do an initial categorization of

content

20

Semantic Infrastructure - Foundation Evaluating Taxonomy Software - POC Scenarios – categorization, extraction, summarization, etc. Majority of time is on auto-categorization Need to balance uniformity of results with vendor unique capabilities –

have to determine at POC time Elements:

– Content– Search terms / search scenarios– Training and test sets

Taxonomy Developers – expert consultants plus internal taxonomists Evaluate usability in action by taxonomists

21

Semantic Infrastructure - Foundation Evaluating Taxonomy Software – POC Issues Quality of content Quality of initial human categorization Normalize among different test evaluators Quality of taxonomists – experience with text analytics software and/or

experience with content and information needs and behaviors Quality of taxonomy

– General issues – structure (too flat or too deep)– Overlapping categories – Differences in use – browse, index, categorize

Categorization essential issue is complexity of language Entity Extraction essential issue is scale, disambiguation

22

Semantic Infrastructure - Foundation Case Study: Telecom Service Company History, Reputation Full Platform –Categorization,

Extraction, Sentiment Integration – java, API-SDK,

Linux Multiple languages Scale – millions of docs a day Total Cost of Ownership Ease of Development - new Vendor Relationship – OEM,

etc.

Expert Systems IBM SAS - Teragram Smart Logic

Option – Multiple vendors – Sentiment & Platform

IBM and SAS – finalists

23

24

Semantic Infrastructure - Foundation POC Design Discussion: Evaluation Criteria Basic Test Design – categorize test set

– Score – by file name, human testers Categorization – Call Motivation

– Accuracy Level – 80-90%– Effort Level per accuracy level

Sentiment Analysis– Accuracy Level – 80-90%– Effort Level per accuracy level

Quantify development time – main elements Comparison of two vendors – how score?

– Combination of scores and report

Text Analytics POC OutcomesCategorization Results

SAS IBM

Recall-Motivation 92.6 90.7

Recall-Actions 93.8 88.3

Precision – Mot. 84.3

Precision-Act 100

Uncategorized 87.5

Raw Precision 73 46

25

Text Analytics POC OutcomesVendor Comparisons Categorization Results – both good, edge to SAS on precision

– Use of Relevancy to set thresholds Development Environment

– IBM as toolkit provides more flexibility but it also increases development effort

Methodology – IBM enforces good method, but takes more time– SAS can be used in exactly the same way

SAS has a much more complete set of operators – NOT, DIST, START

26

Text Analytics POC OutcomesVendor Comparisons - Functionality Sentiment Analysis – SAS has workbench, IBM would require more

development– SAS also has statistical modeling capabilities

Entity and Fact extraction – seems basically the same– SAS and use operators for improved disambiguation –

Summarization – SAS has built-in– IBM could develop using categorization rules – but not clear that

would be as effective without operators Conclusion: Both can do the job, edge to SAS Now the fun begins - development

27

28

Text Analytics Development: Foundation

Articulated Information Management Strategy (K Map)– Content and Structures and Metadata– Search, ECM, applications - and how used in Enterprise– Community information needs and Text Analytics Team

POC establishes the preliminary foundation– Need to expand and deepen– Content – full range, basis for rules-training– Additional SME’s – content selection, refinement

Taxonomy – starting point for categorization / suitable? Databases – starting point for entity catalogs

29

Text Analytics DevelopmentEnterprise Environment – Case Studies A Tale of Two Taxonomies

– It was the best of times, it was the worst of times Basic Approach

– Initial meetings – project planning– High level K map – content, people, technology– Contextual and Information Interviews– Content Analysis– Draft Taxonomy – validation interviews, refine– Integration and Governance Plans

30

Text Analytics Development Enterprise Environment – Case One – Taxonomy, 7 facets Taxonomy of Subjects / Disciplines:

– Science > Marine Science > Marine microbiology > Marine toxins Facets:

– Organization > Division > Group– Clients > Federal > EPA– Instruments > Environmental Testing > Ocean Analysis > Vehicle– Facilities > Division > Location > Building X– Methods > Social > Population Study– Materials > Compounds > Chemicals– Content Type – Knowledge Asset > Proposals

31

Text Analytics Development Enterprise Environment – Case One – Taxonomy, 7 facets Project Owner – KM department – included RM, business

process Involvement of library - critical Realistic budget, flexible project plan Successful interviews – build on context

– Overall information strategy – where taxonomy fits Good Draft taxonomy and extended refinement

– Software, process, team – train library staff– Good selection and number of facets

Final plans and hand off to client

32

Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Taxonomy of Subjects / Disciplines:

– Geology > Petrology Facets:

– Organization > Division > Group– Process > Drill a Well > File Test Plan– Assets > Platforms > Platform A– Content Type > Communication > Presentations

33

Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Environment Issues

– Value of taxonomy understood, but not the complexity and scope

– Under budget, under staffed– Location – not KM – tied to RM and software

• Solution looking for the right problem– Importance of an internal library staff– Difficulty of merging internal expertise and taxonomy

34

Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Project Issues

– Project mind set – not infrastructure– Wrong kind of project management

• Special needs of a taxonomy project• Importance of integration – with team, company

– Project plan more important than results• Rushing to meet deadlines doesn’t work with semantics as

well as software

35

Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets Research Issues

– Not enough research – and wrong people– Interference of non-taxonomy – communication– Misunderstanding of research – wanted tinker toy connections

• Interview 1 implies conclusion A Design Issues

– Not enough facets– Wrong set of facets – business not information– Ill-defined facets – too complex internal structure

36

Text Analytics Development Conclusion: Risk Factors Political-Cultural-Semantic Environment

– Not simple resistance - more subtle • – re-interpretation of specific conclusions and sequence of

conclusions / Relative importance of specific recommendations Understanding project scope Access to content and people

– Enthusiastic access Importance of a unified project team

– Working communication as well as weekly meetings

37

Text Analytics DevelopmentCase Study 2 – POC – Telecom Client Demo of SAS – Teragram / Enterprise Content

Categorization

38

Text Analytics Development Best Practices - Principles Importance of ongoing maintenance and refinement Need dedicated taxonomy team working with SME’s Work with application developers to incorporate text

analytics into new applications Importance of metrics and feedback

– Software and social Questions:

– What are important subjects (and changes)– What information do they need?– How is their information related to other silos?

39

Text Analytics Development Best Practices - Principles Process

– Realistic Budget – not a nice to have add on– Flexible Project plan - semantics are complex and messy

• Time estimates are difficult, object success measures are too– Transition from development to maintenance is fluid

Resources– Interdisciplinary Team is essential– Importance of communication – languages– Merging internal and external expertise

40

Text Analytics Development Best Practices - Principles Categorization taxonomy structure

– Tradeoff of depth and complexity of rules– Multiple avenues – facets, terms, rules, etc.

• No right balance– Recall-precision balance is application specific– Training sets of starting points, rules rule– Need for custom development

Technology– Basic integration – XML– Advanced –combine unstructured and structured in new ways

41

Text Analytics Development Best Practices – Risk Factors Value understood, but not the complexity and scope Project mindset – software project and then done Not enough research on user information needs, behaviors

– Talking to the right people and asking the right questions– Getting beyond “All of the Above” surveys

Not enough resources, wrong resources Enthusiastic access to content and people Bad design – starting with the wrong type of taxonomy Categorization is not library science

– More like cognitive anthropology

42

Semantic Infrastructure Development Conclusion Text Analytics is the Foundation for Semantic infrastructure Evaluation of Text Analytics – different than IT software

– POC – essential, foundation of development– Difference of taxonomy and categorization

• Concepts vs. text in documents Enterprise Context – strategic, self-knowledge

– Infrastructure resource, not a project– Interdisciplinary Team and applications

Integration with other initiatives and technologies– Text Mining, Data Mining, Sentiment & beyond, Everything!

Questions? Tom Reamy

[email protected] Group

Knowledge Architecture Professional Serviceshttp://www.kapsgroup.com

Documents

Semantic Infrastructure Workshop Development