Content Classification – Where’s My Stuff?

IBM Confidential

Agenda

– Why Classify?
– How Does Classification Work?
– Content Classification Features
– Taxonomy Basics
– Starting a Classification Project
– Best Practices for Classification
– Real World Example
– Closing Thoughts


Why Classify?

Content that is not properly classified is not accessible
– 1 in 2 business leaders don’t have access to the information they need to do their jobs

Quality of decision-making suffers when content is not accurate
– 1 in 3 business leaders frequently make business decisions based on information they lack or don’t trust

Companies face difficulty in deriving full visibility and insight into the breadth and depth of unstructured content
– 77% of CEOs don’t have immediate information to make key business decisions

Sources: IBM 2010 CEO & CFO Studies, IBM 2010 Break Away With Business Analytics and Optimization Study

Why Classify?

What if you walked into a library and there was no Dewey Decimal System?

What about the hardware store, the grocery store, the clothing store? Do you park your car in the living room and place your sofa in the garage?

You have:
– Millions of pieces of content
– Hundreds of repositories
– Thousands of workers

You need to:
– Find relevant content, quickly
– Accurately, consistently categorize content
– Gather meaning and understanding from the content

Everything in our life is categorized and classified in some way

Why Classify?

You have been storing content for many years, but…
– can you find it when you need it?
– can you produce it for audits and litigation?
– can you gain insight from it?

How does your organization go from this… to this?

Why Classify?

Can you find relevant content, quickly?
– “Search, Refine, Repeat” is no longer acceptable
– Image Capture, Content Collection, Enterprise Search

Are you uncovering business insight from your content?
– Organized content produces better insight
– Content Analytics

Is the right content available at the right time?
– Business processes require timely access to content
– Business Process Management, Case Management

Are you complying with legal and business mandates?
– Content has a compliance lifecycle that must be enforced
– Content Collection, Enterprise Records, eDiscovery

Accessibility, Usability, Compliance, Analytics

Automated Classification makes information accessible, leaving your workers to focus on important business tasks rather than searching, over and over, for relevant content

Classification provides enhanced content usability by automating routing decisions based on the meaning of the text in your content

Advanced Classification, combined with collection and records, enables your company to comply with business and legal mandates

Classification augments Content Analytics by providing extended facet navigation and content clustering, delivering added analysis and insight


How does Classification work?

CLASSIFICATION AS A FACTORY WORKER

Think of a worker at the end of an assembly line
Task is to sort items coming down the line into the correct containers
Four possible item types on the line:
– Can
– Box
– Bottle
– Jar

How do you tell the factory worker which is which?
Start with the item to the right as a ‘can’ reference model
– 6.5” high
– Red with blue & white lettering
– 3.5” diameter
– Opened with a tab
– Contains liquid

How does Classification work?

Based on initial assumptions, which of these are “cans”?

What are our identification parameters?

─ Shape?
─ Capacity/size?
─ Contents (liquid vs. solid)?
─ Method of opening?
─ Construction material?

How does Classification work?

Based on the original reference model, which of these is a can?
─ 6.5” high
─ Red with blue & white lettering
─ 3.5” diameter
─ Opened with a tab
─ Contains liquid

The analogy is very relevant to category definition & corpus selection
Document classification involves the same problems
– What is an “Accounting and Finance” document?
  • How can we differentiate it from a “Legal” document?
  • How about “Regulatory”?
– How do humans tell which is which?
  • Keywords
  • Phrases
  • Intent

Some distinctions are clear…
– Legal vs. Engineering
– Personnel vs. Operations
– Manufacturing vs. Advertising

Others are not…
– Legal vs. Regulatory

Classification effort depends on your environment
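The reference-model analogy can be made concrete with a small sketch. It is illustrative only: the attribute names (`height_in`, `opening`, and so on) and the exact-match scoring are assumptions made for this example, not a description of how Content Classification works internally.

```python
# Hypothetical reference model for a "can", using the attributes listed on the slide.
CAN_REFERENCE = {
    "height_in": 6.5,
    "diameter_in": 3.5,
    "opening": "tab",
    "contents": "liquid",
}


def match_score(item, reference):
    """Return the fraction of reference attributes that the item matches exactly."""
    hits = sum(1 for key, value in reference.items() if item.get(key) == value)
    return hits / len(reference)


# A soda can identical to the reference, and a soup can that most people would still call a can.
soda_can = dict(CAN_REFERENCE)
soup_can = {"height_in": 4.0, "diameter_in": 2.6, "opening": "can opener", "contents": "liquid"}

print(match_score(soda_can, CAN_REFERENCE))  # 1.0
print(match_score(soup_can, CAN_REFERENCE))  # 0.25
```

The soup can scores poorly because one reference item defines the category too narrowly, which is the same problem a category faces when its sample corpus is not representative.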


How does Classification work?

Context-Based Classification

Business Information (text snippets, each labeled with its assigned category):
– A: “Intellectual Property is essential”
– A: “Legal is changing the time frame for contract approval”
– A: “Legal is currently requiring full approval”
– B: “Engineering drafts require approval”
– B: “Engineering requires skilled software staff”
– B: “Engineering requires clear requirements”
– C: “Strategy should look out over 36 months”
– C: “Strategy is important to the marketing team”
– ? becomes A: “The core market for this new product has been defined as such by IBM”

Categories: ‘A’ Marketing, ‘B’ Engineering, ‘C’ Strategy

How does Classification work?

Content Classification combines multiple categorization technologies to deliver automatic classification
– Uses natural language processing and semantic analysis
– Uses rules based on metadata or confidence score
– Can be used in tandem or separately depending on requirements

To: Bob Smith <[email protected]>
From: Bill Roker <[email protected]>
Subject: Contract?

Bob,

Hope you’re doing well.

A quick note to see if the payment came through, as prescribed by the contract? It would be terrible to have the firm sued over such a simple financial matter. No one wants this project to be derailed.

Regards,
Bill

Bill Roker
212-555-1234
Financial Advisors, Inc.


Does the email contain the word “contract”?

Does the sender belong to the broker email group?

Does the email have anything that matches the pattern “XXX-YY-ZZZZ”?

Natural Language Processing + Semantic Analysis + Targeted Rules = Comprehensive Content Classification
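The three rule questions above can be sketched as simple checks. This is a minimal illustration, not the product's rule syntax: the sender address and the `BROKER_GROUP` set are hypothetical (the addresses in the sample email are redacted), and the function and key names are invented for the example.

```python
import re

# Hypothetical group membership; a real system would look this up in a directory.
BROKER_GROUP = {"bill.roker@example.com"}

# Digits shaped like XXX-YY-ZZZZ, for example a US Social Security number.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTRACT_WORD = re.compile(r"\bcontract\b", re.IGNORECASE)


def evaluate_rules(sender, text):
    """Answer the three rule questions for one email."""
    return {
        "mentions_contract": CONTRACT_WORD.search(text) is not None,
        "sender_in_broker_group": sender.lower() in BROKER_GROUP,
        "has_id_pattern": ID_PATTERN.search(text) is not None,
    }


email_text = (
    "Subject: Contract? A quick note to see if the payment came through, "
    "as prescribed by the contract? Bill Roker 212-555-1234"
)
result = evaluate_rules("bill.roker@example.com", email_text)
print(result)
```

Note that the phone number in the signature has the shape XXX-YYY-ZZZZ, so the third rule does not fire for this email. Rules like these catch exact patterns; the language analysis is what picks up the legal and financial intent of the message.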


Content Classification Features

1. Automatic Categorization of documents and emails
– Analyzes the content of documents and emails in order to categorize them
– Uses natural language processing and semantic analysis
– Handles imperfect language (misspellings, abbreviations, poor grammar)
– Assigns a confidence score to each category suggestion (0 – 100)
– Learns from examples or keywords
  • Creates a profile for each category by analyzing sample texts
  • Categories can also be defined by keywords

2. Combines classification methods using text analysis and rules processing
– Rules based on metadata can be defined in combination with classification based on confidence score
– Language identification capability can be used in tandem with rules
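To make "a profile for each category" and "a confidence score (0 – 100)" more tangible, here is a rough sketch. It assumes a plain bag-of-words profile and cosine similarity as a stand-in; the product's actual statistical and semantic analysis is not described in this deck, and all function names are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(samples):
    """Build one word-count profile per category from its sample texts."""
    return {
        category: Counter(word for text in texts for word in tokenize(text))
        for category, texts in samples.items()
    }


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest(text, profiles):
    """Return (category, confidence 0-100) pairs, best match first."""
    doc = Counter(tokenize(text))
    scored = [(category, round(100 * cosine(doc, profile))) for category, profile in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


profiles = build_profiles({
    "Legal": ["the contract requires approval by counsel"],
    "Engineering": ["the drafts require software requirements"],
})
suggestions = suggest("contract approval pending", profiles)
print(suggestions)
```

A rule can then act on the score, for example auto-filing above a chosen threshold and sending everything else to review.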


3. Learns in real-time
– Can adapt based on feedback from end users or administrators
– Feedback is incorporated into analysis on-the-fly for immediate adaptation

4. Classification Workbench configuration tool
– Enables the creation and maintenance of Knowledge Bases and Decision Plans
– Facilitates classification tune-up and reporting

5. Integrated with IBM ECM offerings
– Application for bulk classification of content upon ingestion into the repository, and for bulk classification and reclassification of content already under management
– Integrated with Datacap, Content Collector, Enterprise Records, Analytics, etc.

6. Taxonomy Creation Assistance
– Suggests new taxonomies for organizations that do not have them
– Suggests new elements for existing taxonomies

Content Classification Features – Knowledge Base

A knowledge base contains learned information that Classification needs to perform matching, training, and online learning

It is filled with relevant statistical and semantic information derived from sample texts

Statistical entities consist of words, number of occurrences, hints about the text, and distance between words

A knowledge base is created & maintained through the Workbench application
1. Collect and organize sample content
2. Create, analyze, and learn
3. Assess performance, review reports
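Two of the statistical entities named above, number of occurrences and distance between words, can be illustrated with a few lines of code. This is a simplified sketch with invented function names; it does not reflect the knowledge base's real storage format.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def min_distance(tokens, first, second):
    """Smallest gap, in token positions, between two words; None if either is absent."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(a - b) for a in first_positions for b in second_positions)


sample = "The contract payment is due. The payment was prescribed by the contract."
tokens = tokenize(sample)
counts = Counter(tokens)

print(counts["payment"])                            # 2
print(min_distance(tokens, "contract", "payment"))  # 1
```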

Content Classification Features – Decision Plan

A Decision Plan is a collection of rules that you configure to determine how content is classified

A Decision Plan is developed by configuring one or more rules based on content or metadata

Each rule consists of one trigger and one or more actions
– Example: Trigger: “If Title contains ‘Contract’”, then Action: “Assign to Contracts Category” & “Move to Contracts folder”

Rules can use strings, word distance, regular expressions, pattern extraction, and Boolean expressions

Actions include set properties, invoke analysis, move to folder, declare record, custom actions, and more

Decision Plans can be used with or without a Knowledge Base
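The "one trigger, one or more actions" structure can be sketched as a small data model. The `Rule` class, the action helpers, and the document dictionary are hypothetical and exist only to show the shape of the idea; real Decision Plans are configured in the Workbench.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    trigger: Callable[[dict], bool]        # one trigger per rule
    actions: List[Callable[[dict], None]]  # one or more actions per rule


def assign_category(category):
    def action(doc):
        doc["category"] = category
    return action


def move_to_folder(folder):
    def action(doc):
        doc["folder"] = folder
    return action


def run_decision_plan(rules, doc):
    """Apply every rule whose trigger fires; return the names of the rules that fired."""
    fired = []
    for rule in rules:
        if rule.trigger(doc):
            for action in rule.actions:
                action(doc)
            fired.append(rule.name)
    return fired


plan = [
    Rule(
        name="contracts",
        trigger=lambda doc: "contract" in doc.get("title", "").lower(),
        actions=[assign_category("Contracts"), move_to_folder("Contracts")],
    ),
]

document = {"title": "Master Services Contract"}
fired_rules = run_decision_plan(plan, document)
print(fired_rules, document)
```

This version uses metadata only. A trigger could just as well test a confidence score returned from a Knowledge Base, which is how the two pieces work in tandem.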


Content Classification – Taxonomy Basics

Taxonomy
1. The science or technique of classification.
2. A classification into ordered categories.
3. The science dealing with the description, identification, naming, and classification of organisms.

Business Taxonomy
1. Usually follows a line-of-business hierarchy
2. Logical grouping of content for business, repository, or compliance purposes
3. Generally “flattened” for better control and management

7 levels vs. 3-4 levels

Content Classification – Taxonomy Basics: The Goldilocks Zone

“Too Many Categories”: 1000 categories is probably too many

“Too Few Categories”: 10 categories is probably too few

“Just Right”: somewhere around 100 categories is probably just right

Content Classification – Taxonomy Basics

Taxonomies are important, but…

They do not have to be complex or unwieldy

They need to be acceptable to different organization areas
─ Finance, Legal, HR, IT

Your organization may have a formal, internal taxonomy
─ If so, start there, but it may have to be flattened

Your organization may have a de facto taxonomy
─ ECM document classes, folders, file system structures, and departmental structures may be enough to start

Publicly available or 3rd-party taxonomies may be used
─ Again, may have to be flattened

How are humans classifying today?
─ Are workers filing paper in folders, drawers, cabinets?
─ Are workers putting content in ECM, file systems, folders?
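"Flattening" an existing taxonomy can be as simple as cutting every category path at a maximum depth and merging the duplicates. The sketch below assumes categories are stored as lists of path segments; the sample paths are made up for illustration.

```python
def flatten(paths, max_depth):
    """Truncate each category path to max_depth levels and drop duplicates, keeping order."""
    flattened = []
    for path in paths:
        short = path[:max_depth]
        if short not in flattened:
            flattened.append(short)
    return flattened


deep_taxonomy = [
    ["Finance", "Accounts Payable", "Invoices", "Domestic", "2010"],
    ["Finance", "Accounts Payable", "Invoices", "International", "2010"],
    ["Legal", "Contracts", "Vendor"],
]

flat_taxonomy = flatten(deep_taxonomy, max_depth=3)
for category in flat_taxonomy:
    print(" / ".join(category))
```

In practice the cut is a business decision rather than a fixed depth, but the effect is the same: fewer, broader categories that are easier to train and maintain.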


Starting a Classification Project

Approaches
– Taxonomy Proposal through Content Clustering
– Taxonomy Creation through “Seeded” Keywords
– Taxonomy Creation through Manual Content Gathering
– Knowledge Base Creation through Content Extraction

Starting a Classification Project

Taxonomy Proposal through Content Clustering
─ We don’t know what we don’t know
─ Starting from a blank sheet

[Diagram labels: gather, crawl, cluster (A, B, C, D), categorize, evaluate, create]
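As a rough picture of what content clustering does when there is no taxonomy yet, the sketch below groups documents by word overlap. The greedy single-pass approach, the Jaccard measure, and the 0.3 threshold are assumptions chosen for brevity; they are not the algorithm the product uses.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    clusters = []  # list of (seed_tokens, member_documents)
    for document in documents:
        words = tokens(document)
        for seed, members in clusters:
            if jaccard(words, seed) >= threshold:
                members.append(document)
                break
        else:
            clusters.append((words, [document]))
    return [members for _, members in clusters]


crawled = [
    "invoice payment due",
    "payment due invoice overdue",
    "engineering drafts require approval",
    "engineering drafts approval",
]
proposed = cluster(crawled)
print(proposed)
```

Each resulting group is only a candidate category; a person still has to evaluate it, name it, and decide whether it belongs in the taxonomy.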

Starting a Classification Project

Taxonomy Creation through “Seeded” Keywords
─ We know what we don’t know
─ Starting from a blank sheet

[Diagram labels: gather, crawl, keywords, keyword-seeded taxonomy, keyword-based content set, review, Workbench, Knowledge Base creation, evaluate & tune]
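The keyword-seeded approach can be pictured as follows: pick a few seed words per candidate category, then pull matching documents from the crawled content into a starter set for review. The seed words, sample documents, and function name below are hypothetical.

```python
import re

# Hypothetical seed keywords for two candidate categories.
SEEDS = {
    "Contracts": {"contract", "agreement"},
    "Invoices": {"invoice", "payment"},
}


def seed_content_set(documents, seeds):
    """Group documents under every category whose seed keywords they mention."""
    content_set = {category: [] for category in seeds}
    for document in documents:
        words = set(re.findall(r"[a-z]+", document.lower()))
        for category, keywords in seeds.items():
            if words & keywords:
                content_set[category].append(document)
    return content_set


crawled = [
    "Signed vendor agreement for 2010",
    "Invoice 1142 is overdue",
    "Payment terms under the contract",
    "Team lunch on Friday",
]
content_set = seed_content_set(crawled, SEEDS)
for category, documents in content_set.items():
    print(category, documents)
```

A document that lands in more than one category, like the third one here, is exactly what the review step should look at before the set is used to create a Knowledge Base.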

Starting a Classification Project

Taxonomy Creation through Manual Content Gathering
─ We know what we don’t know
─ Starting with known content

[Diagram labels: strawman taxonomy (A, B, C), manual content gathering, manually gathered content set (A, B, C, D), Knowledge Base creation, evaluate & tune]

Starting a Classification Project

Knowledge Base Creation through Content Extraction
─ We know what we know
─ Starting with known content and taxonomy

[Diagram labels: established ECM repository, content extraction, extracted content set (A, B, C, D), Knowledge Base creation, evaluate & tune]


Best Practices for Classification (or All I Really Need to Know about Classification, I Learned in Kindergarten)

Look, Listen, Learn

Look
─ In order to properly classify, you need to know your content
─ Understand how your content is created and by whom
─ Understand how content is used in your business
─ Understand the meaning and purpose of content
─ Set realistic expectations
  ─ 100% automation with 100% accuracy is rare
  ─ Balance automation expectations with accuracy requirements

─ This is a resume
─ It is used by Human Resources, Hiring Managers
─ It is a text document
─ The purpose is to aid the hiring process
─ The document may have compliance value

Listen
─ All content owners and users have a stake in proper classification
─ Gather input and consider all aspects of content, users and organizations
─ Define categories based on business use
  • Categories should represent organizational content, not organizational structure
  • Taxonomies are less hierarchical and flatter than “standard” taxonomies

[Diagram: Hierarchical vs. Flat]

Learn
─ Training is iterative; it improves and learns over time
─ Training sets must contain “high value” examples
─ Number of training documents varies by organization (~20 to ~50, rule of thumb)
  ─ Hundreds of documents are less useful than 20 well-selected documents
  ─ More is not better, it’s just more
─ Addition of new categories affects existing categories
─ Some categories may perform well immediately, others may require additional effort
─ Categories may “drift” over time (content intent, phrases, business changes, etc.)
─ Learning requires the active use of feedback capabilities

Remember what Grover taught us… “Three of these things belong together...”

Classification systems have to learn…
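The role of feedback in learning can be sketched with a toy learner that updates its category profiles every time a user confirms or corrects a classification. The `FeedbackLearner` class and its word-count scoring are invented for this illustration and are far simpler than the product's online learning.

```python
import re
from collections import Counter, defaultdict


class FeedbackLearner:
    """Toy classifier whose category profiles grow with every piece of feedback."""

    def __init__(self):
        self.profiles = defaultdict(Counter)

    def learn(self, text, category):
        # Feedback: a user states that this text belongs in this category.
        self.profiles[category].update(re.findall(r"[a-z]+", text.lower()))

    def suggest(self, text):
        words = re.findall(r"[a-z]+", text.lower())
        scores = {
            category: sum(profile[word] for word in words)
            for category, profile in self.profiles.items()
        }
        if not scores or max(scores.values()) == 0:
            return None  # nothing learned so far matches this text
        return max(scores, key=scores.get)


learner = FeedbackLearner()
learner.learn("vendor contract approval", "Legal")
learner.learn("engineering drafts", "Engineering")

before = learner.suggest("new regulatory filing")
learner.learn("regulatory filing deadline for the agency", "Regulatory")  # user correction
after = learner.suggest("new regulatory filing")
print(before, after)
```

Before the correction the learner has no suggestion; afterwards it recognizes the new category. Without someone actively supplying that feedback, the profiles never change and drift goes unnoticed.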

Best Practices for Classification – Summary

Categories
─ Should be content driven and represent organizational content, not the organization chart

Taxonomies
─ Less hierarchical, generally flatter and less formal than “standard” taxonomies

Training Sets
─ Training sets should be consistent with actual content and represent “high-value” content
─ Clear delineation of content between various categories

Ongoing monitoring and training
─ Training is iterative; similar to business process optimization, it improves over time

Set realistic expectations with business users
─ Balance automation expectations with accuracy requirements

Engage competent and experienced service providers to assist with the initial classification project


Real World Example: Image Capture and Classification

Integration between Datacap Taskmaster and Content Classification brings the power of image capture and automated classification together

Content Classification uses text analytics and statistical probability to add another recognition approach to Taskmaster’s vast array of methods

Classification Challenges

What type of document is this?
– to vary processing by type

What pages contain the data I need?
– to extract or key in the proper fields

Do the documents contain the correct pages?
– to ensure that the documents are “in good order” and not missing information

What is the business meaning of this document?
– to get the document to the right person or process with the right priority

The Separation Challenge
Where does one document end and the next begin?

[Figure: candidate document boundaries, each marked “Here?”]

Traditional Methods
– Patch & Barcoded Separator Sheets
– Barcode Labels and Documents
– Manual Identification
– Paper Sorting

Shortcomings
– Labor-intensive
– Relies on worker knowledge to correctly identify and sort the documents
– Externally generated documents cannot be barcoded

Datacap Taskmaster & Classification for Separation & Page Identification

Taskmaster examines each page using multiple methods
– The fastest methods are done first: barcode, pattern match, & fingerprint
– The slower methods that require OCR follow: text analytics and keywords
– Rules examine the context to determine if any remaining pages can be identified based on the surrounding pages
– Taskmaster calls Content Classification to help identify pages
– Taskmaster separates and assembles the pages into documents

Content Classification analyzes the text content
– Statistical analysis of the text on a page compared to a knowledge base to find the closest match
– Assigns a confidence score to each category suggestion (0 – 100)
– Returns the classification results to Taskmaster
– Classification feedback loop improves future results by providing feedback to the classification engine

Exceptions and low-confidence results are reviewed and classified by users
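The "fast methods first, slow methods next, low confidence to a person" flow described above can be sketched as a cascade. Everything here is hypothetical: the page dictionary, the two identifier functions, the page types, and the threshold of 70 are placeholders, not Taskmaster's rule language or API.

```python
def by_barcode(page):
    """Fast check: a barcode names the page type outright."""
    code = page.get("barcode")
    return (code, 100) if code else None


def by_text(page):
    """Slower check: look at OCR text (a stand-in for text-analytics classification)."""
    text = page.get("ocr_text", "").lower()
    if "promissory note" in text:
        return ("Note", 80)
    if "appraisal" in text:
        return ("Appraisal", 55)
    return None


def identify(page, identifiers, threshold=70):
    """Try identifiers in order; accept the first result at or above the threshold."""
    for identifier in identifiers:
        result = identifier(page)
        if result is not None and result[1] >= threshold:
            return result
    return ("NEEDS_REVIEW", 0)  # exceptions and low-confidence pages go to a user


cascade = [by_barcode, by_text]  # cheapest method first
pages = [
    {"barcode": "Deed"},
    {"ocr_text": "This Promissory Note is made on..."},
    {"ocr_text": "Summary of the appraisal"},
    {},
]
results = [identify(page, cascade) for page in pages]
print(results)
```

Ordering the cascade by cost means OCR-based analysis only runs on pages the cheap methods could not settle, and the review queue only receives what the automated methods could not identify with confidence.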

Bank specializing in mortgage loan servicing

Slashing costs with IBM Production Imaging Edition and IBM Content Classification

The solution is targeted to reduce costs by automating the classifying, keying and filing of millions of pages of loan documentation per day.

The need
• Reduce paper document scanning and processing costs
• Reduce loan servicing customer service costs
• Processing volumes can exceed 100 million scanned pages per month

The solution
The company contracted with IBM partner Imagine Solutions to implement IBM Production Imaging Edition (PIE) and IBM Classification Module software
• PIE - Datacap Taskmaster scans and imports paper documents
• PIE - Datacap Taskmaster rules classify documents to the page level using barcodes, image fingerprint pattern matching, regular expressions, and text analytic classification
• IBM Classification Module classifies pages using text analytics
• Taskmaster extracts text and data fields using optical character recognition (OCR)
• Data collection, statistical reporting, and feedback loops improve accuracy and configuration tuning
• PIE - FileNet Content Manager securely stores the documents
• Acquisition and servicing processes are automated through web-based document access and PIE business process capabilities

Projected benefits
• Save millions of dollars of staff time by automating document classification, reducing data entry, and providing direct access to the loan documents with improved speed, accuracy, and granularity
• Save millions of dollars in per-page licensing fees associated with the competitively replaced Kofax KTM system
• Provide a platform that can be rapidly ramped up to handle high loads associated with portfolio acquisitions


Closing Thoughts: How can classification help my business?

Improve teaching programs and student learning
─ Classify educational content through analysis of lesson plan text

Automatically code medical bills
─ Interpret doctors’ notes and apply industry-standard codes (ICD-9, ICD-10)

Reduce manual, human intervention
─ Automatically evaluate email service requests and establish responses

Shorten process cycle time
─ Distinguish mortgage, auto, personal, credit card loan applications
─ Route content to appropriate worker or process step

Automatically identify Personally Identifiable Information (PII) and Protected Health Information (PHI) in unstructured content
─ Take actions such as file, record, route, redact

Closing Thoughts

Classification is a powerful solution to automate the categorization of text-based content

Properly categorized content provides better accessibility, usability, compliance and analytics

Many factors lead to high-quality classification – consider and understand all of them

The keys to success are planning, preparation and persistence
─ Is there any project that does not require these?

Automated classification allows you to cut costs associated with content capture, collection, archiving, retention, analysis and more


“Anything worth doing, is worth doing right.” – Hunter S. Thompson
