
Practical Advice About Unstructured Data

By Neil Raden, [email protected]
Principal Analyst, Hired Brains Research, LLC
July 2015

Dealing with unstructured data requires a knowledge integration process as opposed to a data integration process and is excruciatingly difficult without a model-based approach

Analytics and “Used Data”

Anyone tasked with analyzing data to understand past events and/or predict the future knows that the data assembled is always "used." It's secondhand. Data is captured digitally almost exclusively for purposes other than analysis. Operational systems automate processes and capture data to record transactions. Document data is stored in presentation formats and written in flowing prose without obvious structure; it is written to be read (or simply to record), not to be mined for later analysis. Clickstream data in web applications is captured and stored as a verbose stream that has to be reassembled before it makes sense.

Another fact that analysts know is that a single source of data may be useful, but it becomes exponentially more useful when it can be combined with other sources, a process called integration or blending. Integrating data from internal, structured sources, such as ERP and CRM systems, is difficult enough, as we learned from data warehousing, but best practices and useful tools emerged over the past two decades. Getting around semantic mismatch in different sources is still an issue and cannot be solved merely by pointing and clicking at column names. However, current practices are still too time-consuming and rigid: different applications and/or areas of an organization may integrate the same data from the same sources, yet their efforts are duplicative, incompatible and inflexible. No matter how well data warehouses, BI and ETL have served (and continue to serve) organizations, business requirements today demand a better solution, one that handles:

- Multiple Files
- Multiple File Formats
- Multiple Intake Sources
- Multiple Languages
- Multiple Entities and Concepts


Learning to capture data from logs and streaming APIs

Integrating external data sources, especially those that we call "unstructured,"1 is so challenging that the time it takes to process them with structured data integration techniques and manual inspection often exceeds the window in which the analysis is useful.

It would seem that pulling these sources of information together to draw insight is a nearly impossible job, and in most cases it is, without the proper tools. Those tasked with the job are talented people who more or less straddle the boundary between IT and business domain experts, including those identified as "data scientists." But it is well documented that these people spend an inordinate amount of time on data preparation and data integration tasks, often duplicating the efforts of others around them, which robs them of valuable time that should be devoted to discovering insight and proposing action.

What Actually Is Unstructured Data?

Before presenting some solutions to these hurdles, some precision on the terms “structured” and “unstructured” is helpful.

Structured data is data stored in a fixed schema, such as relational database tables or other fixed-format files based on a data model. Computer software developed both by in-house IT organizations and by software vendors has, for a few decades, relied on structured data for consistency, reliability and so on. However, even structured data can be difficult to mine, because the application logic and/or the humans who drive the systems can do all sorts of things to degrade the semantic consistency of the files (the most widespread ERP system disables referential integrity in many of its tables for performance reasons and handles it in the application logic, making it difficult to extract its data without invoking its APIs). The problem becomes even harder when extracting from more than one system, because semantic consistency does not exist across the systems. This was a lesson learned in data warehousing and should be a warning to those expanding their analytic databases: it takes time.

The naïve definition of unstructured data is anything that isn't structured, but that isn't quite right. "Unstructured" is really more of a fuzzy class than a precise one. Some may suggest that there is no unstructured data, only data whose structure we cannot elucidate. For example, a 100-page, nicely structured paper report placed on top of a stick of dynamite and ignited would fall to earth in a million shards of paper fragments. Surely this would meet some criteria of unstructured-ness. But if we could precisely model the interaction of heat, light, the angle of the sun, ambient temperature, wind velocity and direction, and probably a thousand other variables, we could find the structure.


Of course, we can't in this case. And that is the essence of unstructured data: finding the structure is more troublesome than applying some other technique to extract whatever latent value is in the data itself.

Data such as streaming telemetry, web logs and many other sources are also lumped into the unstructured category, even if they have a defined record structure. However, they are not based on a defined data model; their structure, such as it is, can change dynamically, so extracting information from them depends on making assumptions about the size and shape of the records, which is not a reliable method. Twitter is a good example. Records from the Twitter API used to have forty-five fields (the actual text message being only one of them), but the format expands without notice. So the first definition, that unstructured data is all data that isn't structured, may be the best one.

How Hard Are We Working?

The rapid rise of interest in "big data" has spawned a variety of technology approaches to solve, or at least ease, this problem, such as text analytics, sentiment analysis, NLP and bespoke applications of AI algorithms. They work. They perform functions that are too time-consuming to do manually, but they are incomplete because each one is too narrow, aimed at only a single domain or document type, or too specific in its operation. They mostly defy the practice of agile reuse because each new source, or even each new special extraction for a new purpose, has to start from scratch.

The only workable solution is one that provides all of the tools to build and enhance a "Smart Data Platform" seamlessly, relieving analysts and data scientists of the tedious effort of integrating the inputs and outputs of many operations. Data scientists and business analysts spend 50-80% of their time2 preparing and organizing their data (due in large measure to the burden of sorting through unstructured data and linked data, both structured and unstructured) and only 20% of their time analyzing it. Furthermore, unstructured data is an untapped wealth of information: it is estimated that more than 80% of all data is unstructured3. Obviously, whatever the proportion, you simply must have an unstructured data strategy.

Consider that much of the useful analysis you will do must break through the artificial barriers between structured data and unstructured data, barriers maintained by old tools, old ways of working and the crush of managing from scarcity. What can you do to create interactive data exploration with no boundaries?


Knowledge Integration Solution

The first step in getting control of and value from this disjoint and vast collection of data is a universal way to represent it and its meaning regardless of the source or format. Instead of data integration, one has to invest in the concept of knowledge integration and knowledge extraction. A “Smart Data” platform is needed with a minimum of:

- Advanced Text Analytics
- Annotation, Harmonization and Canonicalization
- Dynamic Ontology Mapping
- Auto-generated conceptual models
- Semantic querying and data enrichment
- Fully customizable dashboards
- Full data provenance adhering to IT standards

A complete solution should be based on open standards and a semantic approach from the beginning. In addition, it should incorporate a very rich tool set that allows easy inclusion of third-party applications that operate seamlessly within the Smart Data Platform. This is central to the ability to move from data integration and data extraction to more advanced knowledge integration and knowledge extraction, without which it is impossible to fuel solutions in the areas of investigatory analytics, Customer 360, competitive intelligence, insider trading surveillance, and risk and compliance, as well as feeding existing BI applications (a requirement that is not going away anytime soon). It all works because it is based on a dynamic, model-based approach.

What is a model-based approach?

At the heart of the model-based approach is integrating all forms of data in semantic technology. Though descriptions of semantic technology are often complicated, the concept itself is actually very simple (a brief code sketch follows the list):

- It supplies meaning to data that travels with the data
- The model of the data is updated on the fly as new data enters
- It is a single, universal way to represent data from any source
- The model also captures and understands the relationships between things, from which it can actually do a certain level of reasoning without programming
- Information from many sources can be linked, not through views or indexes, but through explicit and implicit relationships that are native to the model
- The "model" is based on an ontology
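To make that idea concrete, here is a minimal sketch, using the open-source rdflib library in Python, of how facts and their relationships might be expressed as triples over a small ontology. The namespace, class names and example data are hypothetical and not tied to any particular product.

```python
# Minimal sketch of representing data plus its meaning as triples (RDF),
# using the rdflib library. The namespace and terms are illustrative only.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/smartdata#")

g = Graph()
g.bind("ex", EX)

# Ontology: a Patient is a kind of Person; reportedSymptom relates a Person to a Symptom.
g.add((EX.Patient, RDFS.subClassOf, EX.Person))
g.add((EX.reportedSymptom, RDFS.domain, EX.Person))
g.add((EX.reportedSymptom, RDFS.range, EX.Symptom))

# Instance data: a tweet author linked to a symptom and a geocode.
g.add((EX.user_123, RDF.type, EX.Person))
g.add((EX.user_123, EX.reportedSymptom, EX.ChestPain))
g.add((EX.user_123, EX.hasGeocode, Literal("35.08,-106.65")))

# New facts can be added at any time without changing a fixed schema,
# and a SPARQL query simply traverses the relationships.
results = g.query("""
    PREFIX ex: <http://example.org/smartdata#>
    SELECT ?person ?symptom WHERE { ?person ex:reportedSymptom ?symptom . }
""")
for person, symptom in results:
    print(person, symptom)
```

Because the relationships live in the model rather than in application code, adding a new source or a new kind of fact is a matter of adding triples, not redesigning a schema.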


Suppose you wish to predict how many people will come to your emergency room in the next month. You've done some preliminary research and found a high correlation between people complaining on Twitter about certain symptoms and the likelihood that they will visit an ER (of course this is an oversimplified description; you would likely combine the Twitter data with other elements to expand its value and the strength of its prediction). You choose to query and extract "tweets" from the Twitter API and begin to evaluate your data. Unlike what you see on Twitter, a tweet from the API is a lengthy record with dozens of fields, including the sending server, ID and so on. Only one field contains the text.

Why is this complicated? Consumers of tweets should tolerate the addition of new fields and variance in the ordering of fields with ease. Not all fields appear in all contexts. It is generally safe to treat a nulled field as an empty set, and the absence of a field as the same thing. Twitter data is actually some of the more logical and standardized "unstructured" data, but even Twitter data is a challenge. How do you actually get the data you are looking for? Even more important, how will you extract it repeatedly from subsequent draws without doing it manually?
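As a concrete illustration, here is a minimal sketch in Python of that defensive style of parsing. The field names follow the classic Twitter REST payload (id_str, created_at, text, coordinates), but treat them as assumptions; the point is only that missing and null fields are handled identically and unknown fields are ignored.

```python
import json

def parse_tweet(raw_json: str) -> dict:
    """Defensively extract a few fields from a tweet-like JSON record.

    Missing and null fields are treated the same way, and unknown fields
    are ignored, so the parser keeps working when the provider adds
    fields or changes their order.
    """
    record = json.loads(raw_json)

    # dict.get returns None for absent keys; "or" also covers explicit nulls.
    coords = record.get("coordinates") or {}
    user = record.get("user") or {}

    return {
        "id": record.get("id_str"),
        "created_at": record.get("created_at"),
        "text": record.get("text") or "",      # the one field holding the message
        "geocode": coords.get("coordinates"),  # [lon, lat] when present, else None
        "user_location": user.get("location"),
    }

# A trimmed, hypothetical record: note the explicit null and the absent "user" field.
sample = ('{"id_str": "1", "created_at": "Wed Jul 01 12:00:00 +0000 2015", '
          '"text": "In the waiting rm for 4 hrs", "coordinates": null}')
print(parse_tweet(sample))
```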

The short answer is to develop a "model" for that particular data source and apply it (modifying it as things change, which they do quickly in the big data world) to extract and integrate the data for your ongoing analysis. The model should also help not just with the problem at hand, but with quick mash-ups as other ideas and issues arise. It makes it possible to combine the data from Twitter with any other data in the model, at will, with no need for new design, new code and new testing.

That is the essence of the model-based approach. The knowledge base not only provides all of the usual metadata one would find in a well-designed data warehouse or MDM (Master Data Management) environment, it also provides the "meaning" of the data (as well as models, etc.). What do we mean by meaning? Consider this:


What is the definition of Neil Armstrong's walk on the moon? It happened sometime in July of 1969, at some location on the moon, for a duration of a certain number of minutes, and so on. But the meaning of that walk was that, for the first time, a human being stepped onto another world and survived. The true meaning is how it affected people, how it ended the space race, and how it has shaped technology and civilization since. In other words, meaning is how something relates to other people and things. Meaning is found in context, and context is the set of relations between things.

Linking

As mentioned earlier in this paper, linking data from multiple sources has an exponential effect in value and usefulness. For example, a tweet may contain a geocode for the writer, but if your model already contains extensive geographic, economic and even psychographic information for that geocode, each tweet essentially inherits all of that information for no cost or effort. The value of this is almost impossible to measure.
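A small sketch of what that inheritance looks like in practice: once the model already holds reference attributes keyed by geocode, enriching a tweet is a lookup rather than a project. The reference table, its attribute names and the use of a ZIP-like code as the key are all invented for illustration.

```python
# Sketch of "linking": a tweet inherits everything the model already knows
# about its geocode. The reference data and attribute names are invented.
reference_by_geocode = {
    "87106": {"median_income": 41000, "population": 28000, "nearest_er": "Hospital A"},
    "87110": {"median_income": 52000, "population": 36000, "nearest_er": "Hospital B"},
}

def enrich(tweet: dict) -> dict:
    """Attach the reference attributes for the tweet's geocode, if any."""
    extra = reference_by_geocode.get(tweet.get("geocode"), {})
    return {**tweet, **extra}

tweet = {"text": "4 hrs in the waiting rm", "geocode": "87106"}
print(enrich(tweet))
# The tweet now carries income, population and nearest-ER context "for free."
```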

Big data is often conflated with social media, like the Twitter example above, but not all unstructured data is external to the enterprise. Commercial aircraft generate mountains of streaming telemetry, and real-time analysis combines it with data from other devices, manufacturer specs and so on; medical devices such as chemotherapy infusion machines do the same. This data is captured by the manufacturers and used for preventive maintenance and for detecting and alerting on abnormal readings. Charity:Water monitors streaming data from its water projects around the developing world and combines it with local weather data and even third-party risk assessments of troop and militia movements for threat analysis.


The economic realities of big data make integrating and analyzing this data feasible, but first you have to marshal all this unruly data to make it usable.

If you believed that you could develop a workable model for predicting visits to your ER by mining text from the Twitter API, how would you go about it? There are, classically, many algorithms for predicting flow into and through queues, but you need to know who shows up at the ER door. Let's assume that you have a working model for internal scheduling in the ER, but you need to create, as input, the flow of patients appearing at the door. The Twitter feed can provide the time of day and the geographic location of the sending server (as well as a mountain of other "metadata"). Your job is to extract the attributes that you think are significant predictors of the likelihood of a visit. You will also want to combine this data with the historical trend data you have on visits, and then start to build and test some predictive models.
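One way this might look, sketched in Python with pandas and scikit-learn (the column names, hourly aggregation and choice of a plain linear regression are assumptions for illustration, not a recommendation): aggregate the tweet-derived attributes by hour, join them to historical visit counts, and fit a first model.

```python
# Sketch: hourly tweet-derived features joined to historical ER visit counts,
# then a simple regression. All column names and values are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

features = pd.DataFrame({
    "hour": [8, 12, 18, 22],
    "symptom_tweets": [3, 7, 12, 5],           # tweets mentioning target symptoms
    "negative_sentiment": [0.2, 0.4, 0.7, 0.3],
})

visits = pd.DataFrame({"hour": [8, 12, 18, 22], "er_visits": [14, 22, 35, 18]})

data = features.merge(visits, on="hour")
X, y = data[["symptom_tweets", "negative_sentiment"]], data["er_visits"]

model = LinearRegression().fit(X, y)

# Predict expected visits for a new hour's tweet-derived features.
new_hour = pd.DataFrame({"symptom_tweets": [9], "negative_sentiment": [0.5]})
print(model.predict(new_hour))
```

In practice you would test several model families against held-out history; the point here is only that once the model-based extraction is in place, the analytical step reduces to a join and a fit.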

Much of the data in a Twitter feed is fairly easy to understand, but the "tweet," the text message itself, is where the real nugget of insight comes from. Because of the 140-character limit, the messages are often difficult for a machine to parse:

Amber and I have been In the waiting rm for 4 hrs. Never again do we casually stroll into the er. Next time we shoot each other too.

Spacing, punctuation and capitalization are all pretty informal. And, of course, the last sentence is sarcasm. Just picking up keywords, which is what most text analysis does, clearly will not suffice. True NLP (Natural Language Processing) is needed to make sense of this. In fact, one might pick up a few more tweets from the writer so the sentiment analyzer can get a sense of his or her style.
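A small sketch of the difference, using NLTK's VADER sentiment analyzer as a stand-in (the paper does not prescribe a specific NLP tool; this is simply one widely available option):

```python
# Sketch: naive keyword spotting vs. a sentiment model on the example tweet.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

tweet = ("Amber and I have been In the waiting rm for 4 hrs. "
         "Never again do we casually stroll into the er. "
         "Next time we shoot each other too.")

# Keyword spotting finds the mention of the ER, but says nothing about tone.
tokens = set(tweet.lower().replace(".", " ").split())
print("keyword hit:", bool({"er", "waiting"} & tokens))

# A sentiment model at least recovers the strongly negative tone, though the
# sarcasm in the last sentence still calls for deeper NLP.
print(SentimentIntensityAnalyzer().polarity_scores(tweet))
```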

Data from Twitter is only one of a myriad of available unstructured data sources, but it is a useful one: 300 million monthly active users send 500 million tweets per day. In the big data world, though, that is just one specialized source among hundreds of thousands.

Conclusion

Making sense of unstructured data takes discipline, because a one-off approach will drain your best resources of time and patience. A model-based approach, complete with a suite of NLP, AI, graph-based models and semantics, is the sensible approach.

The whole extended fabric of a complete solution, and its ability to plug in third-party capabilities, collapses the many layers of logical and physical models in traditional data warehousing/business intelligence architectures into a single model.


With a model-based approach, useful benefits accrue:

- Widespread understanding of the model across many domains in the organization
- Rapid implementation of new studies and applications by expanding the model, not redesigning it (even small adjustments to relational databases involve development at the logical, physical and downstream models, with time-consuming testing)
- Application of Solution Accelerators that provide bundled models by industry/application type that can be modified for your specific needs

The use of ontologies was hampered a decade ago by poor performance, but the appearance of powerful graph databases and economical distributed computing (Hadoop) now makes them an attractive solution.


1 Actually, a great deal of unstructured data is not external at all. Documents, reports, spreadsheets, email, audio, video and picture data can all be found within an organization.
2 http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
3 Various reports use a figure of 80-85%, such as those from IBM and Merrill Lynch, but it is impossible to be precise and, in reality, it does not matter. What matters is how much of it is relevant to you; in practice, it is vast.