duane-lambert-patrick
Agenda
– Why Classify?
– How Does Classification Work?
– Content Classification Features
– Taxonomy Basics
– Starting a Classification Project
– Best Practices for Classification
– Real World Example
– Closing Thoughts
IBM Confidential
Why Classify?
Content that is not properly classified is not accessible
– 1 in 2 business leaders don’t have access to the information they need to do their jobs
Quality of decision-making suffers when content is not accurate
– 1 in 3 business leaders frequently make business decisions based on information they lack or don’t trust
Companies face difficulty in deriving full visibility and insight into breadth and depth of unstructured content
– 77% of CEOs don’t have immediate information to make key business decisions
Sources: IBM 2010 CEO & CFO Studies, IBM 2010 Break Away With Business Analytics and Optimization Study
Why Classify?
What if you walked into the Library of Congress and found no classification system?
What about the hardware store, the grocery store, the clothing store? Do you park your car in the living room and place your sofa in the garage?
You have:
– Millions of pieces of content
– Hundreds of repositories
– Thousands of workers
You need to:
– Find relevant content, quickly
– Accurately, consistently categorize content
– Gather meaning and understanding from the content
Everything in our life is categorized and classified in some way
Why Classify?
You have been storing content for many years, but…
– can you find it when you need it?
– can you produce it for audits and litigation?
– can you gain insight from it?
How does your organization go from this…. to this?
Why Classify?
Can you find relevant content, quickly?
– “Search, Refine, Repeat” is no longer acceptable
– Image Capture, Content Collection, Enterprise Search
Are you uncovering business insight from your content?
– Organized content produces better insight
– Content Analytics
Is the right content available at the right time?
– Business processes require timely access to content
– Business Process Management, Case Management
Are you complying with Legal and Business mandates?
– Content has a compliance lifecycle that must be enforced
– Content Collection, Enterprise Records, eDiscovery
Accessibility, Usability, Compliance, Analytics
Automated Classification makes information accessible, leaving your workers to focus on important business tasks rather than searching, over and over, for relevant content
Classification provides enhanced content usability by automating routing decisions based on the meaning of the text in your content
Advanced Classification, combined with collection and records, enables your company to comply with business and legal mandates
Classification augments Content Analytics by providing extended facet navigation and content clustering, delivering added analysis and insight
Why Classify?
How does Classification work?
CLASSIFICATION AS A FACTORY WORKER
Think of a worker at the end of an assembly line
Task is to sort items coming down the line into correct containers
Four possible item types on the line:
– Can
– Box
– Bottle
– Jar
How do you tell the factory worker which is which?
Start with the item to the right as a ‘can’ reference model
– 6.5” high
– Red with blue & white lettering
– 3.5” diameter
– Opened with a tab
– Contains liquid
How does Classification work?
Based on initial assumptions, which of these are “cans”?
What are our identification parameters?
─ Shape?
─ Capacity/size?
─ Contents (liquid vs. solid)?
─ Method of opening?
─ Construction material?
Based on the original reference model, which of these is a can?
─ 6.5” high
─ Red with blue & white lettering
─ 3.5” diameter
─ Opened with a tab
─ Contains liquid
How does Classification work?
Analogy is very relevant to category definition & corpus selection
Document classification involves the same problems
– What is an “Accounting and Finance” document?
• How can we differentiate it from a “Legal” document?
• How about “Regulatory”?
– How do humans tell which is which?
• Keywords
• Phrases
• Intent
Some distinctions are clear…
– Legal vs. Engineering
– Personnel vs. Operations
– Manufacturing vs. Advertising
Others are not…
– Legal vs. Regulatory
Classification effort depends on your environment
Context-Based Classification
[Figure: Business Information statements, each tagged with a category letter: Category ‘A’ Marketing, Category ‘B’ Engineering, Category ‘C’ Strategy]
A: Intellectual Property is essential
A: Legal is changing the timeframe for contract approval
A: Legal is currently requiring full approval
B: Engineering drafts require approval
B: Engineering requires skilled software staff
B: Engineering requires clear requirements
C: Strategy should look out over 36 months
C: Strategy is important to the marketing team
?: The core market for this new product has been defined as such by IBM (tagged ‘A’ once classified by context)
How does Classification work?
Content Classification combines multiple categorization technologies to deliver automatic classification
– Uses natural language processing and semantic analysis
– Uses rules based on metadata or confidence score
– Can be used in tandem or separately depending on requirements
To: Bob Smith <[email protected]>
From: Bill Roker <[email protected]>
Subject: Contract?
Bob,
Hope you’re doing well.
A quick note to see if the payment came through, as prescribed by the contract? It would be terrible to have the firm sued over such a simple financial matter. No one wants this project to be derailed.
Regards,
Bill
Bill Roker
212-555-1234
Financial Advisors, Inc.
Does the email contain the word “contract”?
Does the sender belong to the broker email group?
Does the email have anything that matches the pattern “XXX-YY-ZZZZ”?
Natural Language Processing + Semantic Analysis + Targeted Rules = Comprehensive Content Classification
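The three targeted rules above can be sketched in a few lines of Python. This is only an illustration, not the product's rule syntax: the `Email` type, the `BROKER_GROUP` set, the `example.com` address, and the reading of “XXX-YY-ZZZZ” as a 3-2-4 digit pattern are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical group membership; a real system would query a directory service.
BROKER_GROUP = {"bill.roker@example.com"}

# Assumption: the slide's "XXX-YY-ZZZZ" is read as a 3-2-4 digit pattern.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def apply_rules(email: Email) -> dict[str, bool]:
    """Evaluate the three targeted rules against one email."""
    text = f"{email.subject}\n{email.body}"
    return {
        "mentions_contract": re.search(r"\bcontract\b", text, re.IGNORECASE) is not None,
        "sender_in_broker_group": email.sender.lower() in BROKER_GROUP,
        "has_id_pattern": ID_PATTERN.search(text) is not None,
    }


msg = Email(
    sender="bill.roker@example.com",
    subject="Contract?",
    body="A quick note to see if the payment came through, as prescribed "
         "by the contract? Call me at 212-555-1234.",
)
print(apply_rules(msg))
```

The phone number in the signature has a 3-3-4 shape, so it does not trip the 3-2-4 pattern rule; only the keyword and sender rules fire for this message.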
Content Classification Features
1. Automatic Categorization of documents and emails
– Analyzes the content of documents and emails in order to categorize them
– Uses natural language processing and semantic analysis
– Handles imperfect language (misspellings, abbreviations, poor grammar)
– Assigns confidence score to each category suggestion (0 – 100)
– Learns from examples or keywords
• Creates a profile for each category by analyzing sample texts
• Categories can also be defined by keywords
2. Combines classification methods using text analysis and rules processing
– Rules based on metadata can be defined in combination with classification based on confidence score
– Language identification capability can be used in tandem with rules
3. Learns in real time
– Can adapt based on feedback from end users or administrators
– Feedback is incorporated into analysis on the fly for immediate adaptation
4. Classification Workbench configuration tool
– Supports the creation and maintenance of Knowledge Bases and Decision Plans
– Facilitates classification tune-up and reporting
5. Integrated with IBM ECM offerings
– Application for bulk classification of content upon ingestion into a repository, and for bulk classification and reclassification of content already under management
– Integrated with Datacap, Content Collector, Enterprise Records, Analytics, etc.
6. Taxonomy Creation Assistance
– Suggests new taxonomies for organizations that do not have them
– Suggests new elements for existing taxonomies
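To make features 1 and 3 concrete, here is a minimal sketch of a classifier that learns from examples, returns a 0 to 100 confidence score per category, and accepts feedback at any time. It uses a plain multinomial naive Bayes model as a stand-in; the slides do not say which algorithm the product uses, and the class and method names (`TinyClassifier`, `learn`, `suggest`) are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


class TinyClassifier:
    """Multinomial naive Bayes; a stand-in, not the product's actual engine."""

    def __init__(self) -> None:
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def learn(self, text: str, category: str) -> None:
        """Add one labeled example; may be called at any time (feedback)."""
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def suggest(self, text: str) -> list[tuple[str, float]]:
        """Return (category, confidence 0-100) pairs, best first."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # Laplace smoothing
            log_scores[cat] = score
        # Normalize log scores into percentages that sum to 100.
        peak = max(log_scores.values())
        weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(weights.values())
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        return [(c, round(100 * w / norm, 1)) for c, w in ranked]


clf = TinyClassifier()
clf.learn("invoice payment ledger budget", "Finance")
clf.learn("contract lawsuit liability counsel", "Legal")
print(clf.suggest("payment due on the invoice"))
```

Because `learn` only updates counts, a correction from a user can be fed straight back in and the next `suggest` call already reflects it, which is the spirit of the on-the-fly feedback described above.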
Content Classification Features
A knowledge base contains learned information that Classification needs to perform matching, training, and online learning
It is filled with relevant statistical and semantic information derived from sample texts
Statistical entities consist of words, number of occurrences, hints about the text, and distance between words
A knowledge base is created & maintained through the Workbench application
1. Collect and organize sample content
2. Create, analyze, and learn
3. Assess performance, review reports
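As a rough illustration of the statistical entities mentioned above (words, number of occurrences, distance between words), the sketch below extracts word counts and the smallest gap between nearby word pairs from a sample text. The function name, the `window` parameter, and the choice of "minimum gap" as the distance measure are assumptions for the example; the slides do not describe the knowledge base's internal format.

```python
import re
from collections import Counter


def text_statistics(text: str, window: int = 4):
    """Return word counts and the smallest gap seen between nearby word pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    min_distance = {}
    for i in range(len(tokens)):
        # Only look at words within `window` positions of each other.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] == tokens[j]:
                continue
            pair = tuple(sorted((tokens[i], tokens[j])))
            gap = j - i
            min_distance[pair] = min(gap, min_distance.get(pair, gap))
    return counts, min_distance


counts, distances = text_statistics("the payment prescribed by the contract")
print(counts.most_common(2))
print(distances[("contract", "the")])
```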
Content Classification Features – Knowledge Base
A Decision Plan is a collection of rules that you configure to determine how content is classified
A Decision Plan is developed by configuring one or more rules based on content or metadata.
Each rule consists of one trigger and one or more actions
– Example: Trigger: “If Title contains ‘Contract’”, then Action: “Assign to Contracts Category” & “Move to Contracts folder”
Rules can use strings, word distance, regular expressions, pattern extraction, Boolean expressions
Actions include set properties, invoke analysis, move to folder, declare record, custom actions, and more
Decision Plans can be used with or without a Knowledge Base
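The trigger-and-actions structure of a Decision Plan rule can be sketched as follows. This is not the Workbench's rule format; `Rule`, `DecisionPlan`, `assign_category`, and `move_to_folder` are hypothetical names, and a document is modeled as a plain dictionary of properties.

```python
from dataclasses import dataclass, field
from typing import Callable

Document = dict  # hypothetical: a document is a dict of properties


@dataclass
class Rule:
    name: str
    trigger: Callable[[Document], bool]          # exactly one trigger
    actions: list[Callable[[Document], None]]    # one or more actions


@dataclass
class DecisionPlan:
    rules: list[Rule] = field(default_factory=list)

    def run(self, doc: Document) -> list[str]:
        """Apply every rule whose trigger matches; return the names that fired."""
        fired = []
        for rule in self.rules:
            if rule.trigger(doc):
                for action in rule.actions:
                    action(doc)
                fired.append(rule.name)
        return fired


def assign_category(category: str):
    def action(doc: Document) -> None:
        doc.setdefault("categories", []).append(category)
    return action


def move_to_folder(folder: str):
    def action(doc: Document) -> None:
        doc["folder"] = folder
    return action


plan = DecisionPlan([
    Rule(
        name="contracts-by-title",
        trigger=lambda d: "contract" in d.get("title", "").lower(),
        actions=[assign_category("Contracts"), move_to_folder("Contracts")],
    ),
])

doc = {"title": "Master Services Contract 2011"}
print(plan.run(doc), doc)
```

The trigger here looks only at metadata, so this plan runs without any knowledge base, matching the last point above.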
Content Classification Features – Decision Plan
Content Classification – Taxonomy Basics
Taxonomy
1. The science or technique of classification.
2. A classification into ordered categories.
3. The science dealing with the description, identification, naming, and classification of organisms.
Business Taxonomy
1. Usually follows a line of business hierarchy
2. Logical grouping of content for business, repositories or compliance purposes.
3. Generally “flattened” for better control and management
[Figure labels: 7 levels (Taxonomy), 3-4 levels (Business Taxonomy)]
Content Classification – Taxonomy Basics: The Goldilocks Zone
“Too Many Categories”: 1000 categories is probably too many
“Too Few Categories”: 10 categories is probably too few
“Just Right”: Somewhere around 100 categories is probably just right
Taxonomies are important, but…
They do not have to be complex or unwieldy
Need to be acceptable to different organization areas
─ Finance, Legal, HR, IT
Your organization may have a formal, internal taxonomy
─ If so, start there, but it may have to be flattened
Your organization may have a de facto taxonomy
─ ECM document classes, folders, File System structures, Departmental structures, may be enough to start
Publicly available or 3rd-party taxonomies may be used
─ Again, may have to be flattened
How are humans classifying today?
─ Are workers filing paper in folders, drawers, cabinets?
─ Are workers putting content in ECM, File Systems, Folders?
Content Classification – Taxonomy Basics
Starting a Classification Project
Approaches
– Taxonomy Proposal through Content Clustering
– Taxonomy Creation through “Seeded” Keywords
– Taxonomy Creation through Manual Content Gathering
– Knowledge Base Creation through Content Extraction
Taxonomy Proposal through Content Clustering
─ We don’t know what we don’t know
─ Starting from a blank sheet
[Diagram labels: gather, crawl, cluster (A, B, C, D), evaluate, categorize, create]
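The clustering approach can be illustrated with a very small sketch: group documents whose vocabularies overlap, then review each group as a candidate category. The greedy single-pass algorithm, the Jaccard similarity measure, the stopword list, and the 0.2 threshold are all assumptions chosen for brevity; the slides do not say how the product clusters content.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "on", "is"}


def terms(text: str) -> set[str]:
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs: list[str], threshold: float = 0.2) -> list[list[int]]:
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    clusters: list[list[int]] = []
    seeds: list[set[str]] = []
    for idx, doc in enumerate(docs):
        doc_terms = terms(doc)
        for members, seed in zip(clusters, seeds):
            if jaccard(doc_terms, seed) >= threshold:
                members.append(idx)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([idx])
            seeds.append(doc_terms)
    return clusters


docs = [
    "Invoice for consulting services payment due",
    "Employment contract and non-compete agreement",
    "Payment reminder for overdue invoice",
    "Contract amendment agreement signed",
]
print(cluster(docs))
```

Each resulting group of document indexes is a proposal only; a person still evaluates the clusters and names the categories, as the diagram labels suggest.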
Taxonomy Creation through “Seeded” Keywords
─ We know what we don’t know
─ Starting from a blank sheet
[Diagram labels: gather, crawl, keywords, keyword-seeded taxonomy, keyword-based content set, review, Workbench, Knowledge Base creation, evaluate & tune]
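A keyword-seeded content set can be sketched as below: each proposed category gets a few seed keywords, and documents that mention them are gathered as candidate training content for review. The `SEEDS` dictionary, the function name, and the `min_hits` parameter are hypothetical and exist only for this illustration.

```python
import re

SEEDS = {  # hypothetical seed keywords per proposed category
    "Legal": {"contract", "litigation", "counsel"},
    "Finance": {"invoice", "payment", "budget"},
}


def seed_content_sets(docs, seeds=SEEDS, min_hits=1):
    """Group documents under each category whose seed keywords they mention."""
    content_sets = {category: [] for category in seeds}
    unmatched = []
    for doc in docs:
        words = set(re.findall(r"[a-z]+", doc.lower()))
        matched = [c for c, kws in seeds.items() if len(words & kws) >= min_hits]
        for category in matched:
            content_sets[category].append(doc)
        if not matched:
            unmatched.append(doc)
    return content_sets, unmatched


sets_, leftover = seed_content_sets([
    "Payment received for invoice 42",
    "Outside counsel reviewed the contract",
    "Lunch menu for Friday",
])
print(sets_)
print(leftover)
```

Documents that match no seed are returned separately, since they often point to categories or keywords the initial taxonomy missed.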
Taxonomy Creation through Manual Content Gathering
─ We know what we don’t know
─ Starting with known content
[Diagram labels: strawman taxonomy (A, B, C), manual content gathering, manually gathered content set (A, B, C, D), Knowledge Base creation, evaluate & tune]
Knowledge Base Creation through Content Extraction
─ We know what we know
─ Starting with known content and taxonomy
[Diagram labels: established ECM repository, content extraction, extracted content set (A, B, C, D), Knowledge Base creation, evaluate & tune]
Look Listen Learn
Best Practices for Classification (or All I Really Need to Know about Classification, I Learned in Kindergarten)
Look
─ In order to properly classify, you need to know your content
─ Understand how your content is created and by whom
─ Understand how content is used in your business
─ Understand the meaning and purpose of content
─ Set realistic expectations
─ 100% automation with 100% accuracy is rare
─ Balance automation expectations with accuracy requirements
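One way to reason about the balance between automation and accuracy is to sweep a confidence threshold over a reviewed sample and see how much is classified automatically at each level, and how accurate that automated share is. The sketch below assumes you have pairs of (confidence, was the top suggestion correct); the function name and the sample numbers are made up for illustration.

```python
def automation_tradeoff(results, thresholds=(50, 70, 90)):
    """results: list of (confidence 0-100, was_correct) for the top suggestion.

    For each threshold, report the share of items classified automatically
    and the accuracy within that automated share.
    """
    report = []
    for t in thresholds:
        automated = [ok for conf, ok in results if conf >= t]
        rate = len(automated) / len(results) if results else 0.0
        accuracy = sum(automated) / len(automated) if automated else None
        report.append((t, rate, accuracy))
    return report


# Illustrative sample only, not measured data.
sample = [(95, True), (88, True), (72, False), (65, True), (40, False)]
for threshold, rate, accuracy in automation_tradeoff(sample):
    print(threshold, rate, accuracy)
```

Raising the threshold shrinks the automated share but raises its accuracy; items below the threshold go to manual review.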
─ This is a resume
─ It is used by Human Resources, Hiring Managers
─ It is a text document
─ The purpose is to aid the hiring process
─ The document may have compliance value
Listen
─ All content owners and users have a stake in proper classification
─ Gather input and consider all aspects of content, users and organizations
─ Define categories based on business use
• Categories should represent organizational content, not organizational structure
• Taxonomies are less hierarchical and flatter than “standard” taxonomies
Hierarchical Flat
Learn
─ Training is iterative; the system improves and learns over time
─ Training sets must contain “high value” examples
─ Number of training documents varies by organization (~20 to ~50, rule of thumb)
─ Hundreds of documents are less useful than 20 well-selected documents
─ More is not better, it’s just more
─ Addition of new categories affects existing categories
─ Some categories may perform well immediately, others may require additional effort
─ Categories may “drift” over time (content intent, phrases, business changes, etc.)
─ Learning requires the active use of feedback capabilities
Remember what Grover taught us…“Three of these things belong together...”
Classification systems have to learn…….
Best Practices for Classification – Summary
Categories
─ Should be content driven and represent organizational content, not the organization chart
Taxonomies
─ Less hierarchical, generally flatter and less formal than “standard” taxonomies
Training Sets
─ Training sets should be consistent with actual content and represent “high-value” content
─ Clear delineation of content between various categories
Ongoing monitoring and training
─ Training is iterative, similar to business process optimization; it improves over time
Set realistic expectations with business users
─ Balance automation expectations with accuracy requirements
Engage competent and experienced service providers to assist with initial classification project
Real World Example: Image Capture and Classification
Integration between Datacap Taskmaster and Content Classification brings the power of image capture and automated classification together
Content Classification uses text analytics and statistical probability to add another recognition approach to Taskmaster’s vast array of methods
Classification Challenges
What type of document is this?
– to vary processing by type
What pages contain the data I need?
– to extract or key in the proper fields
Do the documents contain the correct pages?
– to ensure that the documents are “in good order” and not missing information
What is the business meaning of this document?
– to get the document to the right person or process with the right priority
The Separation Challenge
Where does one document end and the next begin?
[Figure: candidate separation points between pages, each marked “Here?”]
Traditional Methods
– Patch & Barcoded Separator Sheets
– Barcode Labels and Documents
– Manual Identification
– Paper Sorting
Shortcomings
– Labor-intensive
– Relies on worker knowledge to correctly identify and sort out the documents
– Externally generated documents cannot be barcoded
Datacap Taskmaster & Classification for Separation & Page Identification
Taskmaster examines each page using multiple methods
– The fastest methods are done first: barcode, pattern match, & fingerprint
– The slower methods that require OCR follow: text analytics and keywords
– Rules examine the context to determine if any remaining pages can be identified based on the surrounding pages
– Taskmaster calls Content Classification to help identify pages
– Taskmaster separates and assembles the pages into documents
Content Classification analyzes the text content
– Statistical analysis of the text on a page compared to a knowledge base to find the closest match
– Assigns confidence score to each category suggestion (0 – 100)
– Returns the Classification results to Taskmaster
– Classification feedback loop improves future results by providing feedback to the classification engine
Exceptions and low-confidence results are reviewed and classified by users
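The "fast methods first, slower methods next, weak results to a person" flow can be sketched as a simple cascade. This is an illustration only and does not use any Taskmaster or Content Classification API: the `Page` dictionary, the two identifier functions, the page types, and the 70-point review threshold are assumptions, and the context rules based on surrounding pages are left out.

```python
from typing import Callable, Optional

Page = dict  # hypothetical page record, e.g. {"barcode": ..., "text": ...}
Identifier = Callable[[Page], Optional[tuple[str, float]]]


def by_barcode(page: Page):
    """Cheap check: a barcode, if present, names the page type outright."""
    code = page.get("barcode")
    return (code, 100.0) if code else None


def by_text(page: Page):
    """Stand-in for OCR plus a call to a text classifier."""
    text = page.get("text", "").lower()
    if "promissory note" in text:
        return ("Note", 85.0)
    if "appraisal" in text:
        return ("Appraisal", 55.0)
    return None


def identify(page: Page, methods: list[Identifier], min_confidence: float = 70.0):
    """Try methods in order (cheapest first); send weak results to review."""
    for method in methods:
        result = method(page)
        if result is not None:
            page_type, confidence = result
            if confidence >= min_confidence:
                return page_type, confidence, "auto"
            return page_type, confidence, "review"
    return None, 0.0, "review"


pipeline = [by_barcode, by_text]
print(identify({"barcode": "Deed"}, pipeline))
print(identify({"text": "Uniform Residential Appraisal Report"}, pipeline))
```

Ordering the list from cheapest to most expensive means OCR-based classification runs only for pages the quick checks could not identify.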
Bank specializing in mortgage loan servicing
Slashing costs with IBM Production Imaging Edition and IBM Content Classification
The solution is targeted to reduce costs by automating the classifying, keying and filing of millions of pages of loan documentation per day.
The need
• Reduce paper document scanning and processing costs
• Reduce loan servicing customer service costs
• Processing volumes can exceed 100 million scanned pages per month
The solution
The company contracted with IBM partner Imagine Solutions to implement IBM Production Imaging Edition (PIE) and IBM Classification Module software
• PIE - Datacap Taskmaster scans and imports paper documents
• PIE - Datacap Taskmaster rules classify documents to the page level using barcodes, image fingerprint pattern matching, regular expressions, and text analytic classification
• IBM Classification Module classifies pages using text analytics
• Taskmaster extracts text and data fields using optical character recognition (OCR)
• Data collection, statistical reporting, and feedback loops improve accuracy and configuration tuning
• PIE - FileNet Content Manager securely stores the documents
• Acquisition and servicing processes are automated through web-based document access and PIE business process capabilities.
Projected benefits
• Save millions of dollars of staff time by automating document classification, reducing data entry, and providing direct access to the loan documents with improved speed, accuracy, and granularity.
• Save millions of dollars in per-page licensing fees associated with the competitively replaced Kofax KTM system
• Provide a platform that can be rapidly ramped up to handle high loads associated with portfolio acquisitions
Closing Thoughts
How can classification help my business?
Improve teaching programs and student learning
─ Classify educational content through analysis of lesson plan text
Automatically code medical bills
─ Interpret doctors’ notes and apply industry standard codes (ICD-9, ICD-10)
Reduce manual, human intervention
─ Automatically evaluate email service requests and establish responses
Shorten process cycle time
─ Distinguish mortgage, auto, personal, credit card loan applications
─ Route content to appropriate worker or process step
Automatically understand Personally Identifiable Information (PII), Personal Health Information (PHI) in unstructured content
─ Take actions such as file, record, route, redact
Closing Thoughts
Classification is a powerful solution to automate the categorization of text-based content
Properly categorized content provides better accessibility, usability, compliance and analytics
Many factors lead to high-quality classification – consider and understand all of them
The keys to success are planning, preparation and persistence
─ Is there any project that does not require these?
Automated classification allows you to cut costs associated with content capture, collection, archiving, retention, analysis and more
“Anything worth doing, is worth doing right.” – Hunter S. Thompson