*Gartner Innovation Insight: File Analysis Innovation Delivers an Understanding of Unstructured Dark Data, Analyst Alan Dayley, March 28, 2014
Unified Information Governance
Stratas Data Forge - Create intelligence.
Gartner found that 80% of all corporate data is unstructured and will grow by 800% in the next five years*, meaning the bulk of a
company’s information backbone is not easily accessed, understood or utilised.
The majority of this data is deemed ROT (“Redundant, Obsolete or Trivial”) and the cost of managing and storing this information is
considerable. This “dark data” also represents a source of risk to the business, for the contents of these documents remain largely
unknown.
Remediating ROT will typically remove 70% of the unstructured data in the company, leaving the remainder of the data in a position
where it can be catalogued, classified and mined, creating actionable business intelligence and supporting regulatory compliance.
The Stratas Data Forge is a revolutionary platform to discover, classify, search, store and control any document which may exist in
your organisation, be it physical or electronic, with speed, accuracy and functionality never before achievable.
Optical Character Recognition as a tool for processing data can be slow and limiting - the more data, the slower the process. Stratas
uses the Data Forge to break this convention through the application of scientifically-proven statistical tools which visually group and
classify enterprise data with content analytics and defensible remediation that delivers unified Information Governance.
Machine Learning, multi-purpose deep hierarchical architecture and statistics-based algorithms create scalable language-agnostic
supervised and unsupervised learning methods for automated document identification, classification, remediation and retention.
The Data Forge platform performs a number of key functions:
1. Unstructured Data Analysis. Data Forge allows faceted search and cross section analysis of unstructured enterprise data,
clustering search results using extractive summaries and key phrases.
2. Structured clustering. Data Forge can automatically group large volumes of documentation based upon similarity,
regardless of how the data has been stored. This step can be independent of any content recognition or classification of
the underlying documents. Using scientific text and image tools, the documents can be intelligently classified for routing
into workflow or inclusion into an Electronic Document Repository. By turning unstructured data into business intelligence,
you become fully aware of all the documents contained in the organisation – knowing what you have and where it is
enables informed decision-making and accurate remediation of enterprise wide unstructured data.
3. Intelligent Document Classification. With Data Forge, data extraction is made simple using context-based methodologies
for semantic recognition. Features include Logical Document Boundary Determination, scalable Near Duplicate
identification, enriched metadata and genre-based clustering. The platform is unique, powerful and intuitive.
Data Forge is more than the revolution of Document Management. It is a unified Information Governance platform for unstructured
data and content intelligence, which provides a systemic way to find, classify and manage compliance documents across
organisations. Our approach to document classification, defensible remediation and retention goes beyond the remit of eDiscovery,
fundamentally changing the economics of enterprise Information Governance and considerably improving business operations.
Stratas is the first and, to date, the only unified Information Governance solutions provider that can effectively address the need
for a single, highly flexible and integrated data platform. Designed specifically around the customer, our holistic approach considers
the people, process, technology and culture of every business. The resulting solutions are faster, more intuitive, more accurate and
more feature-rich than any comparable technology or manual process. We deliver unparalleled process improvement, user adoption,
compliance and cost benefit to companies of any size.
Proof of Value (“POV”) Information sheet
We run two types of POV with Data Forge. The simpler of the two is an “eDiscovery” approach, which does not look to arrange the
data in accordance with a predetermined business process or need, but simply to identify and cluster the information.
The Crawler is “pointed” at a set of unstructured data and set to Discovery mode. The documents are segmented and the ROT
(Redundant, Obsolete and Trivial data) is identified, which includes duplicates and system files. The remaining data is then clustered
into groups based upon visual similarity.
In essence, this is a digital version of what the manual process for sorting a pile of unidentified documents would be. Spread them
out on the table, get rid of the ones you cannot read or do anything with, then group the rest together into piles based upon what
they look like e.g. invoices, contracts, purchase orders, CVs, pictures, presentations etc.
Once the data is clustered, you are in a position to do something meaningful with that information.
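The ROT step of Discovery mode described above can be illustrated with a minimal sketch. It is not the Data Forge implementation: the `triage` function and the `SYSTEM_EXTENSIONS` list are hypothetical names chosen for illustration, and the sketch covers only exact duplicates (by SHA-1 content hash), empty files and system files. Visual clustering of the remaining documents is not shown.

```python
import hashlib
from pathlib import Path

# Hypothetical list of extensions treated as system files; adjust per data set.
SYSTEM_EXTENSIONS = {".dll", ".sys", ".tmp", ".ini", ".lnk"}


def sha1_of(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(root: Path):
    """Split the files under root into (kept, rot) lists.

    A file counts as ROT here if it is empty, has a system extension,
    or has the same content hash as a file that was already kept.
    """
    kept, rot, seen = [], [], set()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.suffix.lower() in SYSTEM_EXTENSIONS or path.stat().st_size == 0:
            rot.append(path)
            continue
        file_hash = sha1_of(path)
        if file_hash in seen:
            rot.append(path)  # exact duplicate of a kept file
        else:
            seen.add(file_hash)
            kept.append(path)
    return kept, rot
```

Calling `triage(Path("/some/share"))` on a hypothetical file share returns the files that would go forward to clustering and the files set aside as ROT.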
Data Forge is a collection of specialised scientific tools which are assembled, as defined by a project scope, to address a specific need
within a customer environment. As such, the second type of POV, which is aligned to a specific challenge or process, requires more
detailed upfront initialisation than the “eDiscovery” approach. It requires a defined objective so that the tools can be assembled
correctly and the data interrogated with that goal in mind. The process is more easily explained by the graphic below:
Firstly, we build the scope for the POV – this tells the system what we want to achieve. It comprises obtaining document exemplars
from the data set, defining critical rules (e.g. document specific content sought), defining the customer specific taxonomies and
stating any rules (e.g. retention policies). This creates the Controlled Vocabulary for the POV. The tools required for the project are
assembled into the Data Forge platform and we are ready to initiate the process.
The data set is run and the ROT (Redundant, Obsolete and Trivial documents) is again excluded, together with System Files,
duplicates and multiple versions.
The Classification Strategy is applied to the resulting data set and refined according to the required output. Filters are applied and
then the Classification Decision is applied to the data via a number of techniques or tools, depending on the ultimate requirement.
The output is vetted via Quality Control measures and then presented to a database or ECM.
[Figure: Data Forge Workflow: Data Analysis and Classification]
This targeted and scientific approach is more than the revolution of Document Management. It is the creation of a unified Information
Governance platform for unstructured data and content intelligence, which provides a systemic way to find, classify and manage
compliance across an organisation's unstructured data. The Stratas approach to document classification, defensible remediation and
retention goes beyond the remit of eDiscovery and Enterprise Content Management. It fundamentally changes the economics of
enterprise Information Governance, considerably improving business operations and creating a far greater business value from your
unstructured data.
System requirements
We can provide the POV using either a Cloud-based service or via an appliance behind your firewall. A typical data set for the POV is
between 20 GB and 50 GB. The specification below provides the optimum host machine configuration for an appliance-based POV,
showing how the platform is associated with the target repositories and limitations as to the target environment.
Hardware
3 servers: 2 x 6 core Xeon, 96 GB RAM; and
1 Gb Ethernet.
Data assumptions
About 30-40% of the data are non-records
System configuration
index nodes = 30
DB nodes = 3
crawlers/pre-processors = 30
classifiers = 30
The platform scales linearly with hardware, so processing the same data set in half the time (60 hours instead of 120) would require
6 servers rather than 3.
Speed of operation
The platform can support large (>50 TB) data sets and processing speeds from 1 MB/sec to 25 MB/sec or faster, depending upon
hardware and type of data.
Performance scales linearly with hardware. For example, the hardware configuration below will process 10 TB in 120 hours:
3 servers: 2 x 6 core Xeon, 96 GB RAM
1 Gb Ethernet
Or 1,000,000 pages could be processed using:
3 processing servers: 2 x 6 core Xeon, 64 GB RAM
1 network attached storage: 10 TB, 4 core, 24 GB RAM
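The linear scaling figures above lend themselves to a quick worked calculation. The sketch below assumes strictly linear scaling from the stated reference point (10 TB in 120 hours on 3 servers) and decimal units (1 TB = 1,000,000 MB). The function names are illustrative only, and real throughput will vary with hardware and data type as noted above.

```python
def hours_to_process(data_tb: float, servers: int,
                     ref_tb: float = 10.0, ref_hours: float = 120.0,
                     ref_servers: int = 3) -> float:
    """Estimate processing hours under linear scaling from the reference figures."""
    if servers <= 0:
        raise ValueError("servers must be positive")
    tb_per_server_hour = ref_tb / (ref_hours * ref_servers)
    return data_tb / (tb_per_server_hour * servers)


def throughput_mb_per_sec(data_tb: float, hours: float) -> float:
    """Aggregate throughput in MB/sec, using decimal units (1 TB = 1,000,000 MB)."""
    return data_tb * 1_000_000 / (hours * 3600)


if __name__ == "__main__":
    print(round(hours_to_process(10, 3)))             # reference case: 120 hours
    print(round(hours_to_process(10, 6)))             # doubling the servers: 60 hours
    print(round(throughput_mb_per_sec(10, 120), 1))   # about 23.1 MB/sec in aggregate
```

The aggregate rate of roughly 23 MB/sec for the 3-server reference case sits near the upper end of the 1 MB/sec to 25 MB/sec range quoted above.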
Platform Capabilities
The tables below highlight some of the capabilities the Data Forge platform exhibits and how they may be applied to varying
business processes and challenges within any business.
Document Coding
Capability | Description
Tagging/Coding Documents | Input defining content affected by preservation holds; use of Fuzzy Pattern Matching Framework for required data points extraction
Predictive Coding - Custom Control Sets | Use of machine learning to code data by applying matter-specific control sets
Subject Matter Expertise
Capability | Description
Custom Training Model Development | Build relevant custom document exemplar-based training models based on specific client requirements using in-house SME
Turnkey Service Delivery | Provision of certified labour resources (engagement managers, project management and data analysts) required to deliver classification results to client-desired quality level
Pre-Built Knowledge Models | Pre-built models to auto-classify data "out of the box", sorting the data based on business function, security, product development, audit and fraud categories
Data Processing
Capability | Description
File Types Identification and Text Extraction | 500 unstructured data file types using Oracle Software Development Kits (SDKs), Notes and Exchange email, SharePoint
Culling | Trash file identification and de-NIST'ing using Oracle SDKs or equivalent
Email Processing | Thread detection; each message in the thread is classified separately against the model; the median score of the thread and the median score of all attachments are calculated, and the higher of the two medians is taken as the category of the thread
Distributed Architecture | Grid architecture for processing large data volumes (100s of TB/PB); hardware-determinative (specifications)
OCR Engine | For processing TIFF and PDF images to create text files for classification/legal hold (coding). Fully integrated solution to process scanned images: image pre-processing, OCR, post-processing
OCR Text QC Filter | Filter on the amount of text and the presence of garbage text, to separate such files from higher quality files
Native File Viewing | Using Oracle SDKs or equivalent
Clustering and classification of scanned images and Logical Boundary Determination of scanned multi-document images (PDF, TIFF) | Scanned multi-document images: clustering (visual, text-based), classification (visual, text-based); data points extraction
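The Email Processing entry in the table above describes a scoring rule that can be sketched in a few lines. This is one possible reading of that rule, not the platform's code: we assume the classifier returns a score per category for every message and attachment, take the median per category over messages and over attachments, and assign the thread to the category with the highest of those medians. The function name and the dictionary input format are hypothetical.

```python
from statistics import median


def thread_category(message_scores, attachment_scores):
    """Pick a thread category from per-item classifier scores.

    Each argument is a list of dicts mapping category name -> score,
    one dict per message or per attachment in the thread.
    Returns (category, score), or (None, -inf) if there are no scores.
    """
    categories = set()
    for scores in message_scores + attachment_scores:
        categories.update(scores)

    best_category, best_score = None, float("-inf")
    for category in sorted(categories):
        candidates = []
        if message_scores:
            candidates.append(median(s.get(category, 0.0) for s in message_scores))
        if attachment_scores:
            candidates.append(median(s.get(category, 0.0) for s in attachment_scores))
        score = max(candidates)  # the higher of the two medians
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score
```

Using the median rather than the mean keeps a single outlying message from deciding the category of a long thread, while taking the maximum lets a strongly classified set of attachments (for example, a contract) override neutral covering emails.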
Analytics
Capability | Description
Duplicate Detection | Using SHA-1 hash or equivalent
Near-Duplicate Detection | Detecting document versions (image, text) and comparison of colour-coded text differences (similar to the Delta View process) between selected text documents
Data Profiling | Ability to search, filter and facet results by file type, extension, domain, path, date or full-text search; clustering can also be applied to the search results
Modelling of Data (what approach) | Supervised learning/example-based training for auto-classification into deep multi-purpose category hierarchies
Information Extraction | Fuzzy Pattern Matching Framework: context-based fuzzy pattern matching rules combined with a set of dictionaries (gazetteers) for named entity extraction (persons, companies, addresses/locations), context-based information extraction and table support, and identification of PII
Clustering | Clustering of search results prior to getting into rules and queries. Sorting of the data into logical pools of data with semantic nearness; machine-generated cluster labels; ability to facet clusters
Email Thread Detection and Classification | Perform analytics on the email based on data point extractions; correlation and cross-reference (QC and reporting functions)
Search Engine | Search traditionally or by facets, with in-context query completion
Saved Queries | Filter for specific key words/phrases, which can be saved and used as an additional facet for data review, or used as rules for classification (selected via drop-down menu)
Extractive Summaries | Machine-generated list of the most important sentences and key phrases, requiring no user input
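To give a feel for the Near-Duplicate Detection capability listed in the Analytics table, the sketch below compares extracted text using Python's standard `difflib` module. This is a stand-in technique chosen for illustration, not the method used by Data Forge; the function names and the 0.8 threshold are assumptions, and the all-pairs comparison is only practical for small document sets.

```python
from difflib import SequenceMatcher, unified_diff


def similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 similarity ratio over whitespace-separated words."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def near_duplicates(documents: dict, threshold: float = 0.8):
    """Return (name_a, name_b, ratio) for every pair at or above threshold.

    documents maps a document name to its extracted text.
    """
    names = sorted(documents)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            ratio = similarity(documents[name_a], documents[name_b])
            if ratio >= threshold:
                pairs.append((name_a, name_b, ratio))
    return pairs


def text_differences(text_a: str, text_b: str):
    """List the changed lines between two versions as unified-diff lines."""
    return list(unified_diff(text_a.splitlines(), text_b.splitlines(), lineterm=""))
```

The output of `text_differences` marks removed lines with `-` and added lines with `+`, which a viewer could render as the colour-coded comparison described in the table.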
System Training and Quality Assurance
Capability | Description
Random Sample Generator | QA process for filtering data and retraining the system
Iterative Learning Environment | Presentation of sampled data and drag-and-drop retraining; ability to filter samples with facets and saved queries or ad hoc searches
Discriminative Measure | System feedback on the discriminative gain associated with a potential training candidate: "Is it worth training on this document?" (Green = add, this has value. Red = already duplicated, no value for the training set.)
Novelty Detection | Important feature in machine learning environments: detects novel documents and treats them as such, reducing the number of false positives during classification
Faceted Data Review | Separation of data by classification category, file type, age, saved queries, ad hoc queries
“More Like This” | Retrieval of data with similarity to source; a feature of Solr
5-Fold Cross Validation | Automated performance validation using control sets, measuring precision and recall and calculating the F-score
Accuracy Level Attainment | Ability to provide client-defined accuracy levels, using system tools and statistically valid protocols; audit trail proof of attainment
Multi-Value Tagging and Classification | Supports multi-value tagging and an indefinite number of classification models, including manual assignment to more than one category
Workflow Flexibility | Determine the landscape of the data at the outset of the case or classification process; the spread of categories and the time to ramp up (training) generally depend on hardware, the composition of the data volume (email: longer, OCR: longer, native files: quicker) and the depth of the file plan
Manual Assignment | Search results from a saved query can be manually assigned to certain business groups without requiring training of the system or their inclusion as training exemplars
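The 5-Fold Cross Validation entry above refers to standard measures that are easy to show in code. The sketch below is a generic illustration rather than the platform's validation routine: `train_and_predict` is a hypothetical caller-supplied function standing in for whatever classifier is being validated, and the folds are simple contiguous slices of the control set.

```python
def precision_recall_f1(actual, predicted, positive):
    """Compute precision, recall and F1 for one category of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(labels, train_and_predict, positive, k=5):
    """Average precision, recall and F1 over k folds.

    train_and_predict(train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns predicted labels for test_idx.
    """
    totals = [0.0, 0.0, 0.0]
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predicted = train_and_predict(train_idx, test_idx)
        actual = [labels[i] for i in test_idx]
        for j, value in enumerate(precision_recall_f1(actual, predicted, positive)):
            totals[j] += value
    return tuple(total / k for total in totals)
```

Precision answers "of the documents placed in this category, how many belong there?", recall answers "of the documents that belong there, how many were found?", and the F-score is the harmonic mean of the two.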
Data Management
Capability | Description
Preservation Filtering for Disposition Eligibility | Isolating content affected by one or more holds
Disposition Eligibility | Calculation of eligibility based on the oldest of the file creation date, file modified date or embedded document date, and a time-based retention rule; caveat that a hot or extremely sensitive document may be of value for training regardless of disposition
Duplicate and Near-Duplicate Management | Identify opportunities to cleanse data of these duplicates: reporting of items; file path locations; version distance measurement and correlation; works best within an extracted-text environment
Master Database Creation | Ability to aggregate and ingest content from multiple repositories with its associated metadata, and to perform cross-queries and correlation across the multiple repositories. Ability to support engineering associated with supporting or developing APIs into other databases
Scaling | Ability to support large (>50 TB) data sets and processing speeds from 1 MB/sec to 25 MB/sec or faster, depending upon hardware and type of data
Reporting Capabilities | Inclusive of file statistics details, duplicate and near-duplicate reports, classification, PII, custom coding and other custom reports
Security Protocols | Ability to function behind the firewall or in the cloud, meeting client requirements for dedicated hardware, access protocols and other security requirements
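The Disposition Eligibility calculation in the Data Management table can be sketched as simple date arithmetic. The sketch rests on assumptions that should be checked against the actual retention policy: it takes the oldest available date as the start of the retention period, treats any document on hold as ineligible, and keeps documents with no date evidence. The function and parameter names are hypothetical.

```python
from datetime import date


def disposition_eligible(created, modified, embedded, retention_years,
                         on_hold=False, today=None):
    """Return True if a document may be disposed of under a time-based rule.

    Dates may be None when unknown. The oldest known date is used as the
    retention start, which is one reading of the rule in the table above.
    """
    if on_hold:
        return False  # preservation holds always override retention expiry
    known = [d for d in (created, modified, embedded) if d is not None]
    if not known:
        return False  # no date evidence: keep the document for manual review
    start = min(known)
    today = today or date.today()
    try:
        expiry = start.replace(year=start.year + retention_years)
    except ValueError:
        # start is 29 February and the expiry year is not a leap year
        expiry = start.replace(year=start.year + retention_years, day=28)
    return today >= expiry
```

For example, a document created on 1 March 2005 under a seven-year rule becomes eligible on 1 March 2012, unless a hold applies.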
Custom Solutions
Capability | Description
Poor Quality Documents | Poor quality documents with OCR text that is not searchable are resolved using a soft-dictionary approach to identify and extract titles, where document titles are used to classify and index the documents
What's in the Box | Identifying relevant boxes, and folders within the boxes, for scanning and coding based on their short descriptions and provided title taxonomy
PDF Splitting | Reconstructing document collections using automated logical breaks and classification
Copyright 2014 Stratas Business Solutions LLP Monday, 08 December 2014
Company Proprietary & Confidential. Non-Disclosure & Teaming Agreement