After years of waiting, Cassandra joins its open source counterparts MySQL, MariaDB, PostgreSQL, and MongoDB in being offered as a cloud Database-as-a-Service (DBaaS). As reported by Asha Barbaschow, Amazon Keyspaces for Apache Cassandra is hitting general release. It sets the stage for real differentiation in what was once a market gap. Where there were once no managed Cassandra cloud services, there will now be at least two. With DataStax already having a beta service based on a completely different design, the market has a real choice.

As Barbaschow noted, Keyspaces has been designed to facilitate the migration of on-premises workloads to the cloud. But the real question is, will this make Cassandra, a database that has never been known for its ease of use, accessible to a wider audience?

A diamond in the rough

It is ironic. Apache Cassandra was arguably the first NoSQL platform to introduce a truly distributed operational database in the wild. But it is also one of the last to get its own managed DBaaS cloud service, which is why the offerings from AWS (and DataStax) have been in high demand. Both have been running preview services over the past few months, and now AWS has readied its service for release. AWS offers a native, optimized implementation of Cassandra that it calls a “serverless Apache Cassandra-compatible service.”

Cassandra’s strength has always been its scale and performance as one of the first truly distributed databases to support multi-master operation. Its main challenge has been that the database can be very complex to implement.

For example, tasks such as configuration, backups, garbage collection, and compaction (key to maintaining data consistency in a distributed database) required sophisticated skills because the tooling was so low level. Part of the challenge is the inherent complexity of designing widely distributed databases. And then there is the challenge of how to model the data. In the relational world, you would design tables based on expected queries, with indexes as shortcuts for queries that would otherwise need many joins. In Cassandra, best practice is also to lay out the data based on expected read and write patterns, but there are some important differences. You can index data in Cassandra, but in reality, denormalizing the data across the cluster (or multiple clusters) is best practice. As with any distributed and denormalized system, there is the question of properly balancing the workload.
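To make the query-first, denormalized modeling idea concrete, here is a minimal sketch in plain Python. It does not use Cassandra or any driver; dictionaries stand in for tables, and the sensor data, field names, and function names are all hypothetical. The point is only that the same record is written once per expected query, so each read touches a single partition.

```python
from collections import defaultdict

# Hypothetical sensor readings; all names here are illustrative only.
READINGS = [
    {"sensor_id": "s1", "day": "2020-04-01", "ts": 2, "temp": 21.0},
    {"sensor_id": "s1", "day": "2020-04-01", "ts": 1, "temp": 20.5},
    {"sensor_id": "s2", "day": "2020-04-01", "ts": 1, "temp": 18.0},
    {"sensor_id": "s1", "day": "2020-04-02", "ts": 3, "temp": 22.5},
]

# One "table" per expected query, each keyed by its own partition key.
readings_by_sensor = defaultdict(list)  # query: all readings for a sensor
readings_by_day = defaultdict(list)     # query: all readings on a given day


def write_reading(reading):
    """Denormalized write: store the same reading once per query pattern."""
    readings_by_sensor[reading["sensor_id"]].append(reading)
    readings_by_day[reading["day"]].append(reading)


def temps_for_sensor(sensor_id):
    """Read one partition, ordered by ts (standing in for a clustering key)."""
    rows = sorted(readings_by_sensor[sensor_id], key=lambda r: r["ts"])
    return [r["temp"] for r in rows]


def sensors_on_day(day):
    """Read one partition of the second table; no join or index is needed."""
    return sorted({r["sensor_id"] for r in readings_by_day[day]})


for reading in READINGS:
    write_reading(reading)
```

The trade-off is the usual one for denormalized designs: writes and storage are duplicated in exchange for reads that never have to combine data from several places.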

What about DataStax?

All of this comes as DataStax is in the early stages of readying its Astra managed Cassandra cloud service, which we expect will likely debut on Google Cloud. In the meantime, it will continue to be business as usual for DataStax Enterprise on AWS, which will remain EC2-based.

We expect DataStax to initially offer a pure Apache Cassandra deployment (rather than DataStax Enterprise) designed for multiple public clouds once it is released from beta. While the AWS implementation will differ, AWS is exploring opportunities where it can contribute new features to the open source community.

Amazon approach

AWS seeks to simplify matters by offering Keyspaces as a serverless offering. To do this, it took a page from DynamoDB, which is also serverless. As a managed service for an open source database, AWS is taking an approach that comes directly from its Amazon Aurora and DocumentDB playbooks: deploying an open source database on a cloud-native architecture that separates compute from storage, with specific functions optimized for AWS storage engines. The Keyspaces name refers to the top-level database container that controls the replication of database objects in Apache Cassandra.

Being serverless, it makes your life easier by doing away with the tasks of provisioning, patching, and managing servers. It also eliminates the need to perform compactions manually because it has its own storage optimization that dispenses with the Apache Cassandra tombstone mechanism for marking deleted data; this optimization eliminates the need to provision more storage to keep up with the overhead of deleted data. Being serverless, Keyspaces will support automatic scaling of compute that is priced either by the number of reads and writes, or by service level (e.g., the ability to handle 50,000 reads or writes per second).
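The choice between the two pricing approaches comes down to simple arithmetic on expected traffic. The sketch below illustrates that reasoning only: the two unit prices are made-up placeholders, not published AWS rates, and the function names are hypothetical.

```python
# All prices below are made-up placeholders, NOT published AWS rates.
PRICE_PER_MILLION_REQUESTS = 1.00  # hypothetical pay-per-request price (USD)
PRICE_PER_UNIT_HOUR = 0.0005       # hypothetical price of 1 request/sec of
                                   # reserved capacity for one hour (USD)


def pay_per_request_cost(avg_requests_per_second, hours):
    """Cost when billed for the reads and writes actually performed."""
    total_requests = avg_requests_per_second * 3600 * hours
    return total_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS


def provisioned_cost(provisioned_requests_per_second, hours):
    """Cost when billed for a reserved service level, used or not."""
    return provisioned_requests_per_second * hours * PRICE_PER_UNIT_HOUR


def cheaper_mode(avg_requests_per_second, provisioned_requests_per_second, hours):
    """Return which hypothetical billing mode costs less for this workload."""
    on_demand = pay_per_request_cost(avg_requests_per_second, hours)
    reserved = provisioned_cost(provisioned_requests_per_second, hours)
    return "pay-per-request" if on_demand < reserved else "provisioned"
```

With these placeholder prices, a workload that averages 1,000 requests per second but reserves capacity for 50,000 would be cheaper on pay-per-request billing, while one that averages 40,000 would be cheaper on the reserved service level. Real figures would have to come from the AWS price list.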

As part of the AWS portfolio, Keyspaces will be integrated with its core security, identity, and compliance services, such as AWS Identity and Access Management (IAM) for access management; Key Management Service (KMS) for encryption at rest; and Amazon CloudWatch for monitoring.

Like DynamoDB, all data at rest will be encrypted. And, like DynamoDB, Aurora, and DocumentDB, Keyspaces will automatically support three replicas that can be distributed across different availability zones (AZs) within a region for durability and performance purposes. But there is a subtle difference, as Keyspaces also has the multi-master capability of Apache Cassandra, a feature not available in Aurora or DocumentDB. While DynamoDB already has a multi-region, multi-master capability called Global Tables, Keyspaces will have no cross-region support at launch. But it would not surprise us if a feature like Global Tables materializes for Keyspaces along the way.

Let’s take a look at how the new AWS service compares with Apache Cassandra and with the data platform that Cassandra itself is often compared to: DynamoDB.

Comparisons with Apache Cassandra

Because Keyspaces is an AWS implementation of Cassandra, there are some differences with the Apache platform. For example, Apache Cassandra can write transactions to any node, regardless of where it is located, while, for now, Keyspaces can only write to nodes in the same region. Another difference is that, at launch, Keyspaces will not have support for all CQL (Cassandra Query Language) functions; AWS states that it omitted CQL functions that would not be compatible with serverless operation along with others that it considered “experimental.”

There are other subtle differences in keyspace and table management, system tables, storage and load balancing, and range deletes, along with differences in best practices for tuning queries, CQL, and partition sizing. For example, in Apache Cassandra, the best practice for sizing partitions is to keep the number of values below 100,000 items and the disk size below 100 Mbytes; by contrast, Keyspaces will have no such partition limits. However, AWS does enforce a limit that caps rows at a maximum of 1 Mbyte.
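The sizing guidance above can be expressed as a simple check. This is a minimal sketch that uses only the thresholds quoted in the text; the function names are hypothetical, and treating 1 Mbyte as 1024 * 1024 bytes is an assumption.

```python
# Thresholds quoted in the text. Assumption: 1 Mbyte = 1024 * 1024 bytes.
MBYTE = 1024 * 1024
MAX_PARTITION_VALUES = 100_000       # Apache Cassandra best-practice guidance
MAX_PARTITION_BYTES = 100 * MBYTE    # Apache Cassandra best-practice guidance
MAX_KEYSPACES_ROW_BYTES = 1 * MBYTE  # row limit cited for Keyspaces


def cassandra_partition_warnings(num_values, size_bytes):
    """Return warnings for a partition that reaches or exceeds the guidance."""
    warnings = []
    if num_values >= MAX_PARTITION_VALUES:
        warnings.append("partition holds %d values; keep it below %d"
                        % (num_values, MAX_PARTITION_VALUES))
    if size_bytes >= MAX_PARTITION_BYTES:
        warnings.append("partition is %d bytes; keep it below %d"
                        % (size_bytes, MAX_PARTITION_BYTES))
    return warnings


def keyspaces_row_fits(row_bytes):
    """Check a single row against the 1 Mbyte maximum cited for Keyspaces."""
    return row_bytes <= MAX_KEYSPACES_ROW_BYTES
```

Note the difference in kind: the Cassandra numbers are guidance that a cluster will tolerate being exceeded, whereas the Keyspaces row size is described as an enforced limit.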

Comparisons with DynamoDB

Under the covers, the two databases are very different. DynamoDB follows a simpler key-value schema, while Cassandra implements a wide-column model that is more complex and handles partitions differently. As we noted in our commentary after AWS announced Keyspaces at re:Invent, the use cases of both databases (as distributed, operational platforms) are similar, but the main difference would probably come down to developer preference.

Originally, DynamoDB was the recommended destination in AWS for distributed NoSQL databases, as it was positioned as a platform that could manage key-value and document data. In fact, Cassandra and DynamoDB have a shared lineage in that Apache Cassandra’s designers applied various principles from Amazon’s original Dynamo research paper; Amazon’s Dynamo and SimpleDB databases were the ancestors of DynamoDB. Since then, AWS has significantly diversified its NoSQL database portfolio with DocumentDB, Neptune, Timestream, ElastiCache, and others to target different use cases and data types.

But Cassandra continued to stand out as a multi-master distributed database, meaning it could accept writes at instances spread across different data centers. While AWS claims Cassandra was not the model, a few years ago DynamoDB customers demanded multi-region replication, which was how Global Tables originated.

In developing Keyspaces, AWS took some lessons from DynamoDB; in addition to serverless operation, it adapted automated partition management for balancing read and write loads to the new service. There are some features, such as the plug-in for authenticating with short-term credentials, that AWS has already open sourced on GitHub. AWS might contribute a comparable server-side component to the Apache Cassandra project to allow customers running the database on EC2 to similarly manage access to their clusters.

A bigger stage for Cassandra?

With Keyspaces, Cassandra becomes the latest open source database for which AWS offers a managed service. Despite its barriers to entry, Apache Cassandra has become one of the most popular databases out there, ranked eleventh by DB-Engines. A managed cloud service is needed to broaden that audience.

But if it is so popular, why did it take so long to put Cassandra in the cloud? Look no further than the top five databases on DB-Engines; apart from Oracle and SQL Server, the open source databases MySQL, PostgreSQL, and MongoDB complete the top five. First things first.

Beyond that, the answer to why it took so long is also the answer to why a managed cloud service is badly needed: the complexity of the platform and the lack of decent tools (we’ll probably get complaints about that, but the tools available are not very intuitive). The good news is that the introduction of a managed cloud service will address half of the problem. But the database designer still has to define the data model, something a managed service cannot automate on its own. There are some good white papers available, and AWS has a NoSQL Workbench tool for DynamoDB that could be adapted to Cassandra. Ultimately, we would like to see a visual tool that offers a guided approach to developing the schema. This is the missing link. Hopefully AWS or DataStax, or preferably both, will step up to the plate there.