Three Drivers Propelling the Database Revolution: The Case for a Data Platform
First in a two-part series
By Philip Howard, Bloor Research International Ltd.
Business users want it all. They want to take any sort of data, from both internal and external sources, and blend it together, analyze it, and use the results not just for decision-making, but also to inform operational processes. Any data, any time, anywhere. This is the first driver underpinning the current database evolution.
That’s not all. Organizations are realizing they can get significant cost and agility savings by putting at least some of their corporate data into the cloud. Usually this will mean a hybrid deployment model whereby some data is held on premises and other data is stored in the cloud. And that means data processing has to support a distributed model. This is the second driver.
The third driver is scale. By 2020 the digital universe – the data we create and copy annually – will reach 44 zettabytes, or 44 trillion gigabytes, according to IDC estimates. Of course, scalability has always been an issue for database technology, but historically the data growth that needed support was incremental; today growth is geometric, compounding from one period to the next rather than adding a roughly fixed amount, and that creates different sorts of scalability issues from those we have previously known. You also need to be able to scale very rapidly and in an elastic fashion: sometimes you need a lot more capacity, sometimes you don’t.
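To make the scale argument concrete, the short Python sketch below checks the zettabyte-to-gigabyte conversion and contrasts incremental growth with geometric growth. The function names, the starting volume, the yearly increment, and the 40% growth rate are illustrative assumptions, not IDC figures.

```python
# Illustrative sketch: starting size, increment, and growth rate are assumptions.
ZETTABYTE = 10**21  # bytes, decimal (SI) units
GIGABYTE = 10**9    # bytes, decimal (SI) units


def zettabytes_to_gigabytes(zb):
    """Convert whole zettabytes to gigabytes using decimal units."""
    return zb * ZETTABYTE // GIGABYTE


def incremental_growth(start, step, years):
    """Volume per year when a fixed amount is added each year."""
    return [start + step * y for y in range(years + 1)]


def geometric_growth(start, rate, years):
    """Volume per year when the volume is multiplied by (1 + rate) each year."""
    return [start * (1 + rate) ** y for y in range(years + 1)]


if __name__ == "__main__":
    # 44 zettabytes expressed in gigabytes
    print(f"{zettabytes_to_gigabytes(44):,} GB")

    # Hypothetical 100 TB estate: +20 TB per year versus +40% per year
    for year, (inc, geo) in enumerate(
        zip(incremental_growth(100, 20, 10), geometric_growth(100, 0.40, 10))
    ):
        print(f"year {year:2d}: incremental {inc:7.1f} TB, geometric {geo:7.1f} TB")
```

With incremental growth, the capacity added each year is constant and easy to plan for; with geometric growth, each year's addition is larger than the last, which is why elastic scaling matters.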
These three drivers are pushing the development of database technology and will force database providers to evolve into what we are calling a “data platform.” In this blog we will describe what a data platform looks like, what is needed to make it work, and why you might want one.
What’s a data platform?
Organizations need a single, unifying data platform that:
1. Supports any kind of data, regardless of source. It therefore includes transactional data, reference data, social media data, time series data, documents and emails, relationship-based data, information derived from sensors and (smart) meters, clickstream data, log files, video, photos, audio, X-rays, Doppler radar: whatever.
2. Supports distributed, cloud and hybrid environments. This is partly because of the economics of cloud deployment, but also because of privacy and compliance reasons where different geographies have different rules about the security of personally identifiable information. Hybrid cloud deployments are likely to become the norm, so distributed environments need to be supported.
3. Can scale to thousands of nodes and handle exabytes of data quickly and easily, since that is the volume of data that may be involved. In some instances customers may need to scale extremely quickly.
4. Has the performance and architecture sufficient to process all of the supported data, in whatever format, and to analyze that data in an appropriate fashion. To a certain extent we can take performance as a given, at least at a certain level of scale, but the key issue is the ability to maintain that performance as scale increases.
5. Supports the ability to “do something” with the data: typically to enable decision making (automated or otherwise) or to embed the results of any calculations or analyses into business processes.
6. Provides the other capabilities one would expect from a mission-critical environment, such as ease of use, management and monitoring capabilities, reliability, security, and continuous availability.
The trend towards a multi-model data platform
In its 2014 Magic Quadrant for Operational Database Management Systems, Gartner states that “by 2017, the ‘NoSQL’ label will cease to distinguish DBMSs, which will reduce its value and result in its falling out of use. By 2017, all leading operational DBMSs will offer multiple data models, relational and NoSQL, in a single platform.”
It is nice to see our competitors agreeing with us. In 2012 at IBM’s Information on Demand conference in Las Vegas, this author publicly predicted that this multi-model approach would become the norm across all database environments, both data warehousing and operational, by the end of this decade. Representatives from both IDC and Forrester Research supported this view. We do, however, agree that this rationalization will first take place in the operational space. Data warehousing will be folded in subsequently.
However, it is worth discussing how the market in general gets to this position. There are two possible approaches. One is to use a database technology that is intrinsically multi-model, and the other is to build multiple storage engines under the same hood (for example, DB2 has a separate storage engine for XML). The challenge for most of the merchant database vendors is that they essentially have an inflexible underlying physical data model, and their only practical option is to build alternative storage models to support other types of data, such as graphs, JSON, key-value pairs, or documents.
While theoretically viable, such an approach is in its infancy and will result in much more complicated environments than are available from multi-model vendors.
On the other hand, there are suppliers that have a multi-model underpinning. This doesn’t mean that they necessarily have everything in place to provide a complete data platform, but they do start from a much better place because their underlying model can support relational environments or XML or JSON, for example, without change. It is not hard to imagine one of these vendors adding a distributed file system option that looks like Hadoop, but this is likely to be further along their evolutionary path. As we mentioned previously, it is in operational environments that we are moving towards the use of a data platform, and some companies can justifiably argue they are already with us – even if they don’t support every single type of data (for example, Doppler radar) just yet.
Of course, there is a further requirement, apart from scalability, performance and so forth. This is the ability to process and/or analyze the disparate sources and types of data mentioned. It is all very well being able to store and manage this data, but if you do not have appropriate tools to query across these types of data, then storing it will not be of much use.
In operational environments you want the ability to take transactional or operational data and perform analytics against that data in order to inform current business processes and to support real-time (or appropriately timed) decision making. In addition, you may also wish to be able to combine historical and operational data of various types for analysis purposes. With respect to timeliness, this means being able to take the data, process it, and return results within whatever time scales are appropriate for the business to run its day-to-day operations. This is sometimes referred to as HTAP (hybrid transactional/analytical processing), but applied across all forms of data and not just relational data, so that “analytics” in this context applies as much to text as it does to analyzing structured data.
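As a rough illustration of the HTAP idea, the Python sketch below writes transactional records and then runs analytics over the same live data, covering both structured values and free text. It is a toy under stated assumptions: SQLite (via Python's standard sqlite3 module) stands in for a real data platform, and the table names, function names, and sample notes are all hypothetical.

```python
import json
import sqlite3


def build_store():
    """Create one in-memory store holding relational rows and JSON documents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # Documents are kept as JSON text alongside the relational data.
    conn.execute("CREATE TABLE notes (order_id INTEGER, body TEXT)")
    return conn


def record_order(conn, customer, amount, note):
    """Transactional write: the order row and its document commit together."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", (customer, amount)
        )
        conn.execute(
            "INSERT INTO notes (order_id, body) VALUES (?, ?)",
            (cur.lastrowid, json.dumps(note)),
        )
    return cur.lastrowid


def spend_by_customer(conn):
    """Analytical read over the same operational data, with no copy or export."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return dict(rows.fetchall())


def orders_mentioning(conn, word):
    """Simple text analysis: ids of orders whose note text mentions a word."""
    hits = []
    query = "SELECT order_id, body FROM notes ORDER BY order_id"
    for order_id, body in conn.execute(query):
        if word.lower() in json.loads(body).get("text", "").lower():
            hits.append(order_id)
    return hits


if __name__ == "__main__":
    store = build_store()
    record_order(store, "acme", 20.0, {"text": "Late delivery"})
    record_order(store, "acme", 5.5, {"text": "ok"})
    record_order(store, "zen", 3.0, {"text": "delivery was fine"})
    print(spend_by_customer(store))
    print(orders_mentioning(store, "delivery"))
```

The point of the sketch is only that the analytical queries run against the data as it is written, rather than against a separately loaded warehouse; a production platform would add the distribution, scale, and richer text analytics discussed above.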
In the next installment of this blog series, the author will make the case for why you should care about a data platform.
About the Blogger
Philip Howard is Research Director at Bloor Research International Ltd., a London-based independent IT research, analysis and consultancy firm founded in 1989.
InterSystems blogs are authored by members of the InterSystems team as well as guest bloggers. Our blogs will provide a range of opinions that we hope you will find useful, engaging, informative – and fun to read.