Analytics of Textual Big Data: Text Exploration of the Big Untapped Data Source
Introduction – Analyzing Textual Big Data
Big Data for Enriching Analytical Capabilities – Big data is revolutionizing the world of business intelligence and analytics. Gartner predicts that big data will drive $232 billion in spending through 2016; Wikibon claims that by 2017 big data revenue will have grown to $47.8 billion; and McKinsey Global Institute estimates that big data could increase the value of the US health care industry by $300 billion and the industry value of Europe's public sector administration by €250 billion.
The big data breakthrough comes from innovative big data analytics. For some companies, the primary challenge is analyzing massive amounts of structured, mostly numerical data. Credit card companies, for example, sift through millions of cardholders and billions of transactions looking for fraud patterns. Analyzing structured data at this scale may require new software strategies and technologies, but it is generally straightforward and readily achievable.
Not all big data is structured, however; big data comes in all shapes and sizes. The greatest big data challenge is that a large portion of it takes the form of unstructured text. Consider all the data used or created in a typical business: emails, documents, voice transcripts from customer calls, meeting notes and more. Most of this data is unstructured text. Even in an industry dominated by numerical data, text abounds. In commercial banking, for example, financial statements and loan activity are well-structured data, but to understand a loan you have to read its file, which is full of correspondence, written assessments and notes from every phone call and meeting. To really understand the risk in a lending portfolio, you would need to read and understand every loan file.
In a medical environment, many structured data sources exist, such as coded fields and test results over time. However, some of the most valuable data is found within a clinician's textual notes: their impressions, what they learned from conversing with the patient, why they reached a diagnosis or ordered a test, what they concluded from various test results, and much more. In most large clinical settings these invaluable notes comprise very large data sets but, while they are increasingly digitized, they are rarely analyzed.
Analyzing Textual Data – Advanced analytical capabilities have long been available for analyzing non-textual data. Almost every organization knows how to turn the structured data its business processes have collected over the years into valuable business insights, and countless reporting and analytical tools are available to assist them. Admittedly, these tools and algorithms may have to be adapted somewhat to run fast on big data (for example, by using in-memory techniques and dedicated hardware), but the algorithms themselves stay the same and are well-known.
But what about all the textual data gathered in emails, document management systems, call center log files, instant messaging transcripts and voice transcripts from customer calls? And what about all the external textual data, such as blogs, tweets, Facebook messages and informational websites? A wealth of information is hidden in the vast amounts of textual data being created every day. The challenge for every organization is to extract from this mountain of data the business insights that allow it to, for example, optimize its business processes, improve the level of customer care it offers, personalize products, and improve product development.
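To make the gap between structured and textual data concrete, the sketch below shows about the simplest possible form of text analytics: counting the most frequent terms in a handful of hypothetical customer messages. The sample messages, stopword list and `top_terms` function are all illustrative assumptions, not part of any product described in this paper; real textual analysis involves far more than keyword counting.

```python
from collections import Counter
import re

# Hypothetical sample messages, standing in for the emails and
# call center transcripts described above.
messages = [
    "The mobile app crashes every time I try to pay my bill.",
    "Great service, but the app crashes on the payment screen.",
    "I could not pay my bill online; the page kept timing out.",
]

# A tiny, illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "i", "my", "to", "on", "but", "every", "time", "a", "of"}

def top_terms(texts, n=5):
    """Tokenize the texts, drop stopwords, and return the n most common terms."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z']+", text.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(n)

print(top_terms(messages))
```

Even this naive count surfaces a recurring theme ("app", "crashes", "pay", "bill"), hinting at the insights buried in text; technologies like the one discussed in this paper aim to extract meaning far beyond simple term frequencies.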
This paper will outline the benefits and challenges of analyzing textual big data. It will also discuss InterSystems iKnow™ technology, which offers an easier, less time-consuming way to unlock the information contained in textual data.