
Introduction

To overcome the performance limitations of traditional relational databases, applications – ranging from those running on a single machine to large, interconnected grids – often use in-memory databases to accelerate data access. While in-memory databases and caching products increase throughput, they suffer from a number of limitations including lack of support for large data sets, excessive hardware requirements, and limits on scalability.

InterSystems Caché® is a high-performance object database with a unique architecture that makes it suitable for applications that typically use in-memory databases. Caché’s performance is comparable to that of in-memory databases, but Caché also provides:

  • Persistence – data is not lost when a machine is turned off or crashes
  • Rapid access to very large data sets
  • The ability to scale to hundreds of computers and tens of thousands of users
  • Simultaneous data access via SQL and objects: Java, C++, .NET, etc.

This paper explains why Caché is an attractive alternative to in-memory databases for companies that need high-speed access to large amounts of data.

Unique data engine enables persistence and high performance

Caché is a persistent database, which means that data maintained in RAM is written to disk by background processes. So how can Caché provide performance that is comparable to in-memory databases, which only periodically write data to some permanent data store?

Part of the answer lies in Caché’s unique architecture. Instead of the rows and columns of a traditional database, Caché uses multidimensional arrays, the structure of which is based on object definitions. Data is stored the way the architect designs it, and the same structures used for the in-memory cache are used on disk. Data that should be stored together is stored together. As a result, Caché can access data on disk very quickly.
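
As a rough illustration of the "data that should be stored together is stored together" idea, the following Python sketch models a sparse multidimensional array as subscript tuples kept in sorted order. It is not Caché's storage engine or API: the class name SparseArray, its methods, and the sample order data are all hypothetical, chosen only to show how everything under a common subscript prefix can be read in one contiguous scan.

```python
import bisect


class SparseArray:
    """Toy sparse multidimensional array (hypothetical; not Caché's API)."""

    def __init__(self):
        self._keys = []    # subscript tuples, kept in sorted order
        self._values = {}  # subscript tuple -> stored value

    def set(self, subscripts, value):
        key = tuple(subscripts)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, subscripts):
        return self._values.get(tuple(subscripts))

    def subtree(self, prefix):
        """Return every (subscripts, value) pair stored under a prefix."""
        prefix = tuple(prefix)
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break  # keys are sorted, so the matching run has ended
            result.append((key, self._values[key]))
        return result


# Assumed sample data: an order and its line items share a prefix.
orders = SparseArray()
orders.set(("Order", 1001, "customer"), "Acme")
orders.set(("Order", 1001, "item", 1), "widget")
orders.set(("Order", 1001, "item", 2), "gasket")
orders.set(("Order", 1002, "customer"), "Globex")

order_1001 = orders.subtree(("Order", 1001))
```

Because the keys are sorted, the customer and both line items of order 1001 sit next to each other, so reading the whole order is a single sequential pass rather than a join across separate tables.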

The requirement that multiple in-memory caches be synchronized when data is updated also reduces the performance of many distributed cache products. With Caché, the updating of data and the distribution of data to caches are logically separate. This gives Caché a much simpler workflow, which allows for superior performance.

Caché also provides “in-process bindings” to C++ and Java that allow applications written in those languages to directly populate Caché’s internal data structures.

The benefits of persistence

Given that Caché provides comparable performance, its ability to access data on disk confers some significant advantages compared to in-memory databases. The most obvious is that there is no need for a separate permanent data store. Caché is the permanent store, and it is always current. Data is not lost when a machine is turned off or crashes.

Another advantage is that, with Caché, the size of data sets is not limited by the amount of available RAM. If data is not in a local cache, it is obtained seamlessly from either a remote cache or disk. Because it is not RAM-limited, a Caché-based system can handle petabytes of data; in-memory databases cannot.

Adding RAM to a system in an attempt to increase capacity is more expensive than adding disk storage. (A terabyte of disk storage is cheaper than a terabyte of RAM.) Plus, many in-memory systems need to keep redundant copies of data on separate machines to safeguard against the effects of having a computer crash. Operating distributed cache systems with a persistent database like Caché often results in reduced hardware costs.

Seamless SQL and object data access

One problem shared by most in-memory databases is that, because their data structures are optimized for high-speed processing, the data is usually not readily accessible via SQL. In order to be compatible with most analysis and reporting tools, the data must first be “mapped” into relational tables. This is usually done when data is transferred from the in-memory database to the permanent data store and typically involves an ETL (extract, transform, and load) process. (The processing overhead and additional time required for mapping is the main reason relational databases are not fast enough for extremely high-speed distributed applications, and why in-memory databases are often used instead.)

A few in-memory databases are based on the relational model, and offer SQL data access. Such systems suffer from the opposite problem, in that data is not readily accessible to the object-oriented technologies that are typically used for application development. In addition, most relational in-memory databases are not designed for multi-computer configurations. They run on only one machine, and are RAM-limited.

Caché is different, because the multidimensional arrays it uses can be exposed simultaneously as relational tables and as objects. Caché’s Unified Data Architecture maintains both object and relational views of data at all times – without mapping.

Figure 1: Caché’s Unified Data Architecture Enables Multiple Ways To Access Data

Caché’s SQL access is compatible with both ODBC and JDBC. On the object side, Caché provides bindings to any number of object-oriented languages including Java, .NET, and C++. Caché’s object representation is full-featured and supports object-oriented concepts like inheritance, polymorphism, and encapsulation.
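
To make the idea of one stored structure with two simultaneous views more concrete, here is a minimal Python sketch. It is purely illustrative: the storage dictionary, the Person class, and the helper functions open_object and select_rows are hypothetical and do not reflect Caché's actual classes, SQL engine, or language bindings. Both views read the same underlying entries, so no copy or mapping step sits between them.

```python
from dataclasses import dataclass

# Single underlying store (assumed layout): (class, object id, property) -> value.
storage = {
    ("Person", 1, "name"): "Ada",
    ("Person", 1, "city"): "London",
    ("Person", 2, "name"): "Grace",
    ("Person", 2, "city"): "Arlington",
}


@dataclass
class Person:
    id: int
    name: str
    city: str


def open_object(object_id):
    """Object view: build a Person directly from the stored entries."""
    return Person(
        id=object_id,
        name=storage[("Person", object_id, "name")],
        city=storage[("Person", object_id, "city")],
    )


def select_rows(columns, where=lambda row: True):
    """Relational view: project the same entries as rows (dicts)."""
    ids = sorted({oid for (cls, oid, _prop) in storage if cls == "Person"})
    rows = []
    for oid in ids:
        row = {"id": oid}
        for (cls, other_id, prop), value in storage.items():
            if cls == "Person" and other_id == oid:
                row[prop] = value
        if where(row):
            rows.append({column: row[column] for column in columns})
    return rows


# A change to the shared store is immediately visible through both views.
storage[("Person", 2, "city")] = "New York"
as_object = open_object(2)
as_rows = select_rows(["name", "city"], where=lambda r: r["city"] == "New York")
```

The point of the sketch is only that neither view owns the data: the object and the row are both projections of the same entries, which is why an update never has to be propagated from one representation to the other.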

Enterprise Cache Protocol

In multi-computer applications Caché automatically maintains caches by use of its Enterprise Cache Protocol (ECP).

With ECP, Caché instances can be configured as data servers and/or application servers. Each piece of data is owned by a data server. Application servers understand where data is located and keep local caches of recently used data. If an application server cannot satisfy requests from its local cache it will request the necessary data from a remote data server. ECP automatically manages cache consistency.

ECP requires no application changes – applications simply treat the entire database as if it were local. This is a major distinction from some distributed cache systems, where each client needs to specify what subset of data it is interested in before any queries are performed.
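
The division of labor between data servers and application servers described above can be sketched as a toy read-through cache in Python. This is not the ECP wire protocol or any InterSystems API: the DataServer and ApplicationServer classes are hypothetical, and invalidating cached copies on write is an assumed consistency strategy used only for illustration (the text says only that ECP manages cache consistency automatically).

```python
class DataServer:
    """Owns the data and remembers which application servers cache each key."""

    def __init__(self):
        self._data = {}
        self._subscribers = {}  # key -> set of application servers caching it

    def read(self, key, requester):
        self._subscribers.setdefault(key, set()).add(requester)
        return self._data.get(key)

    def write(self, key, value):
        self._data[key] = value
        # Assumed strategy: drop stale cached copies on every write.
        for app_server in self._subscribers.pop(key, set()):
            app_server.invalidate(key)


class ApplicationServer:
    """Answers reads from a local cache, falling back to the data server."""

    def __init__(self, data_server):
        self._data_server = data_server
        self._cache = {}
        self.remote_reads = 0  # counts round trips to the data server

    def get(self, key):
        if key not in self._cache:
            self.remote_reads += 1
            self._cache[key] = self._data_server.read(key, self)
        return self._cache[key]

    def set(self, key, value):
        self._data_server.write(key, value)

    def invalidate(self, key):
        self._cache.pop(key, None)


data_server = DataServer()
app_a = ApplicationServer(data_server)
app_b = ApplicationServer(data_server)

app_a.set("price:XYZ", 100)
first = app_b.get("price:XYZ")   # cache miss: fetched from the data server
second = app_b.get("price:XYZ")  # cache hit: served locally
app_a.set("price:XYZ", 101)      # invalidates app_b's cached copy
third = app_b.get("price:XYZ")   # miss again: refetched with the new value
```

Note that the calling code never says where the data lives or which subset it wants in advance; it simply calls get and set, and the local cache fills itself as requests arrive.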

One machine, one cache

Another key difference between Caché and other distributed cache products is that most other products maintain a separate cache for each process running on a machine. For example, if a single machine has eight clients then eight individual caches will be maintained on that machine.

In contrast, Caché maintains its cache in shared memory and provides bindings to allow processes running in their own memory address space to access the data. Data can be simultaneously accessed through TCP-based protocols like JDBC, through language bindings, and also – for exceptionally high performance – through bindings that allow applications to directly manipulate the cache.

Allowing multiple clients to share a single cache provides a number of benefits. One is that a shared-cache system has reduced memory requirements. When, as is often the case, individual clients require access to overlapping data, other distributed cache products maintain multiple copies of the data. With Caché only a single copy of the data needs to be maintained for each machine.

Having one cache per machine also results in reduced network I/O. In high-performance applications the network traffic associated with cache maintenance can be a major issue. However, with a single cache per machine, only that cache needs to be updated as the underlying data changes, rather than making overlapping updates to multiple caches.

Even with multi-core processors, a Caché-based system only uses one shared cache per machine, resulting in superior scalability compared with other distributed cache products. For example, in a Caché-based system of 250 machines, each with 8 cores, only 250 caches need to communicate with each other in order to maintain cache coherence. But systems that require a separate cache for each core would need to coordinate 2000 caches. As modern computers may have eight, sixteen, or even more cores, the scalability advantage of Caché becomes increasingly important.
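
The arithmetic behind this example can be written out in a few lines of Python. The function name caches_to_coordinate is hypothetical; the figures of 250 machines and 8 cores come from the example above, and the 16-core case is added only to show how the gap widens.

```python
def caches_to_coordinate(machines, cores_per_machine, shared_cache_per_machine):
    """Number of caches that must stay coherent across the grid."""
    if shared_cache_per_machine:
        return machines  # one shared cache per machine
    return machines * cores_per_machine  # one cache per core


shared = caches_to_coordinate(250, 8, shared_cache_per_machine=True)
per_core_8 = caches_to_coordinate(250, 8, shared_cache_per_machine=False)
per_core_16 = caches_to_coordinate(250, 16, shared_cache_per_machine=False)

print(shared, per_core_8, per_core_16)  # prints: 250 2000 4000
```

With a shared cache the count depends only on the number of machines, so adding cores leaves it at 250, while the per-core count doubles each time the core count doubles.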

Figure 2a: Cache Coherency without InterSystems’ Enterprise Cache Protocol
Figure 2b: Cache Coherency in a Caché-Based System

Populating the cache

In many distributed cache applications, pre-loading the cache can be a lengthy process. This may be due to the sheer amount of data, and/or because of the time required to “map” data from a relational store into the object-oriented structures used by the application. For some data-intensive applications, more time is spent populating in-memory caches than actually running calculations against them.

Not so with Caché. Caché’s exceptional SQL capabilities allow it to easily pull data from relational primary data sources. And of course, as a persistent database, Caché may be the primary source. In that case, there is no need to pre-load caches at all. Local caches will automatically load the data they need as queries are run.

Another consideration is how many machines are involved in the task of populating caches. With Caché, primary ownership of the data is held by a small percentage of the computers in a distributed grid environment. Populating that environment only requires access to the ECP data servers, and they can be loaded in the background while the other computers are used for other tasks. When the application servers come online, their caches are repopulated automatically as data is requested.

In contrast, when data is loaded in most in-memory products, it is partitioned to be spread across the distributed cache so that all, or virtually all, data is in the memory of at least one machine. As a result, it is often not feasible to do data loads with a small subset of the computers while bringing the rest online as needed.

Conclusion

The primary reason for using in-memory databases is speed. But although they are fast, in-memory databases often suffer from poor scalability, lack of SQL support, excessive hardware requirements, and the risk of losing data due to unplanned outages.

Caché is the only persistent database that provides performance equal to that of in-memory databases. It also supports extremely large data sets, seamlessly allows data access via both SQL and objects, enables distributed systems of hundreds of machines, and is highly reliable.

All of this makes Caché an attractive alternative for applications that must process very high volumes of data at very high speed.