24 Jun
As we all know, the GDPR will be in force in May 2018. After that, users of software products and services will have the right to be forgotten (cool, right? Finally I can rest assured that my browsing history will not be read aloud at my funeral). In other words, if a user from the EU asks a service provider to delete their data, the provider will have to delete all of that user's data or face very serious consequences.
But it is unclear what it actually means to delete a user's data. I guess the only way to find out is when an audit occurs.
A user's data is simultaneously both deleted and not deleted until observed at the time of audit.
This post is the introduction to a series of blog posts about GDPR and Apache Cassandra databases.
As Apache Cassandra consultants, our main concern is: what does it mean to delete data from Cassandra's point of view? And what can we do to be as sure as possible that a user's data will stay deleted? As we know, when Cassandra deletes data, it just marks it as deleted. The actual "deletion" occurs during the compaction process.
When the data is marked as deleted, Cassandra writes a tombstone: a marker that tells reads to ignore the shadowed data and tells compaction to eventually purge it.
Once again: Cassandra, like many other systems, does not actually delete data when it deletes data. But this is in line with the definition of the verb delete from the Oxford dictionary:
"remove or obliterate (written or printed matter), especially by drawing a line through it"
Interestingly, a similar thing happens in the underlying OS (e.g. Linux): when a file is deleted, it is just marked as deleted, and you can actually recover deleted files with specialized forensic tools.
Okay, so the actual, irreversible deletion of data does not usually happen in software engineering. But we would love to do as much as we can to make sure that the data is not accessible from Cassandra or any Cassandra tooling (like sstabledump or sstable2json). OS and file system engineers should do their part by doing the same at the OS level (if they think that's necessary).
Another problem in Cassandra is that it is hard to filter on fields that are not part of the primary key. So, if some of the user's data is held in a table whose partition key is something like deviceId, we would have to search the records of every deviceId and remove the corresponding user's data. That does not scale.
As already said, even after a delete statement is issued, it is not guaranteed that the data is ever physically removed. Furthermore, if the data model is not well designed, the deleted data might never get evicted. Cassandra 3.10 improved this behavior: a single-SSTable compaction is triggered when a certain percentage of an SSTable consists of expired tombstones (read more about it here), and the third-party deleting compaction strategy looks like it could solve this problem (note that the strategy is not an official part of Apache Cassandra). Also, I'm quite sure that I saw a JIRA issue on the Apache Cassandra project about some other kind of deleting compaction strategy, one that should guarantee to actually delete the data rather than only mark it as deleted, but I can't find it now. That would be cool.
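For illustration, the tombstone-related compaction knobs can be tuned per table. The sketch below uses a table name borrowed from the example later in this post; tombstone_threshold and tombstone_compaction_interval are shown at their documented defaults, while unchecked_tombstone_compaction is switched on:

```sql
-- Allow an SSTable to be compacted on its own once an estimated
-- 20% of it is droppable tombstones, checking at most once a day,
-- even when no similarly-sized SSTables are available to merge with:
ALTER TABLE device_measurement
WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.2',
    'tombstone_compaction_interval': '86400',
    'unchecked_tombstone_compaction': 'true'
};
```

Keep in mind that tombstones still cannot be purged before gc_grace_seconds has elapsed, so these options only make eviction more eager, not immediate.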
Speaking of compaction strategies, SizeTieredCompactionStrategy can be tricky: once you end up with one huge SSTable file, you need SSTables of a similar size in order to compact it, which means that the tombstones can stay in that huge SSTable for a very long time, maybe forever. The situation is similar to the one occurring in the 2048 game:
Tile 2048 will not be merged anytime soon.
The main takeaway is: be aware of how different compaction strategies work and know your system behavior. If you have a problem with tombstone eviction, it might be a good idea to change your compaction strategy and/or to redesign your tables.
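For example, moving a table from size-tiered to leveled compaction is a single schema change. This is just a sketch (the table name is illustrative, and whether LeveledCompactionStrategy actually fits your workload is a separate question):

```sql
-- Switch to leveled compaction, which keeps SSTables at a bounded
-- size and tends to rewrite data (and thus evict tombstones) more
-- predictably than size-tiered compaction:
ALTER TABLE device_measurement
WITH compaction = { 'class': 'LeveledCompactionStrategy' };
```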
Unlike in relational databases, in Apache Cassandra data is stored in denormalized form. Thus, it is not possible to (easily) filter on fields that are not part of the partition key. So, if we have the following table:
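The exact schema is not shown here, so the following is a plausible sketch; only the table name and the user_id column come from the surrounding examples, and the remaining columns are assumptions:

```sql
-- Illustrative sketch: measurements are partitioned by device,
-- so user_id is just a regular (non-key) column:
CREATE TABLE device_measurement (
    device_id   uuid,
    measured_at timestamp,
    user_id     uuid,
    value       double,
    PRIMARY KEY (device_id, measured_at)
);
```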
This means that we cannot just: DELETE FROM device_measurement WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b. And since, unlike SELECT, DELETE does not accept ALLOW FILTERING, the best we can do is scan for the matching rows with a filtered SELECT and delete them one by one, which might ruin the performance of the entire cluster.
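Either way, filtering on user_id boils down to a full scan. A rough CQL sketch of the scan-and-delete approach (note that ALLOW FILTERING belongs to SELECT; column names other than user_id are assumptions):

```sql
-- 1. Full scan to find the primary keys of the affected rows
--    (expensive: touches every partition in the cluster):
SELECT device_id, measured_at FROM device_measurement
WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b
ALLOW FILTERING;

-- 2. Delete each returned row by its full primary key:
DELETE FROM device_measurement
WHERE device_id = ? AND measured_at = ?;
```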
Therefore, we should think about the user’s data in advance when designing the tables.
Solution 1: design tables in such a way that a user's data can be easily deleted from all the tables (with user_id as part of the primary key). This will obviously have an impact on the design process, both in green-field projects and when redesigning existing databases.
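A sketch of Solution 1, with the measurements repartitioned by user; the table and column names are illustrative:

```sql
-- Same data, but partitioned by user, so everything we hold
-- for one user lives in a single partition:
CREATE TABLE measurement_by_user (
    user_id     uuid,
    device_id   uuid,
    measured_at timestamp,
    value       double,
    PRIMARY KEY (user_id, device_id, measured_at)
);

-- The right-to-be-forgotten delete becomes a cheap,
-- single-partition operation:
DELETE FROM measurement_by_user
WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;
```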
Solution 2: embrace encryption. Okay, this is not really a solution; it's more of an idea we're currently playing with at SmartCat: encrypt the stored user's data (using order-preserving encryption where the ordering of clustering columns must survive), and when the data needs to be deleted, just delete the key. If you have any thoughts on this or experience to share, we would love to hear from you.
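One way to sketch the idea (often called crypto-shredding): keep a per-user encryption key in its own table, store only client-side-encrypted values elsewhere, and "forget" the user by destroying the key. The table and column names below are hypothetical:

```sql
-- Per-user encryption key, stored separately from the data;
-- all of this user's values elsewhere are encrypted with it:
CREATE TABLE user_encryption_key (
    user_id uuid PRIMARY KEY,
    enc_key blob
);

-- To "forget" the user, destroy the key. The ciphertext left
-- behind in other tables becomes unreadable immediately, without
-- waiting for tombstones or compaction:
DELETE FROM user_encryption_key
WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;
```

The obvious caveat is that the key table itself is still subject to Cassandra's mark-then-compact deletion, so the key material should be small and its table tuned for aggressive tombstone eviction.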
Embrace privacy by design. The idea behind the GDPR is a good thing from a consumer perspective. A user's data will be seen as a liability for companies, not as an asset, which means that companies will, hopefully, be very careful when storing users' data. This is also a good opportunity for new players in the database-as-a-service (DaaS) market, or some derivative of the concept, because it seems easier to build new systems with privacy in mind from scratch than to refactor existing ones. What I would like to see is a database (as a service) that would allow me to issue a delete for a userId and then, as a programmer/user of the database, stop worrying about it. The DaaS provider would be responsible for the rest.
The article was first published here