Duplicate Data in Microservices

So You Are Uncomfortable with Duplicate Data? 

If, like me, you’ve spent a reasonable amount of your career working with relational databases, where data is normalised to avoid duplication, the idea of duplicating data across microservices is probably anathema to you.

Even if you’ve worked with a NoSQL database like MongoDB, where data is often duplicated across documents, you probably still struggle with the idea of one service keeping a copy of data owned by another service.

Discomfort with duplication doesn’t need to come from databases. The Don't Repeat Yourself (DRY) principle of software engineering states that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system". Even the process of Test Driven Development (TDD) includes a step for refactoring to remove duplication as part of the cycle.

As software developers we are programmed to detest duplication in all its forms.

It’s OK. I have felt your pain, and as soon as you come to terms with the idea that duplicating data across microservices is the best way to make your microservices more robust, independent and decoupled from other services, your pain will go - forever.

Making Microservices More Robust and More Independent

A microservices architecture consists of a collection of services, each providing a well-defined, loosely coupled and independent business function. You can find out more at https://microservices.io/. Let’s have a look at an example from part of a previous project of mine: Find My Tea, an app for finding local cafes that serve great tea.


Find My Tea had a handful of microservices, but two of the most important were the Location service, which managed the locations (cafes, restaurants, etc.) where great tea could be found, and the Brand service, which managed the brands of tea served by those locations. Managing locations and brands consists of creating, reading, updating and deleting (CRUD) them. The Location service was the Single Source of Truth for locations and the Brand service for brands.

In the first iteration, when the app requested a location, the Location service looked up the location in its database, found the IDs of the brands served by the location and then requested the details of those brands from the Brand service. The Brand service looked up the brands in its own database and passed them back to the Location service. The Location service then enriched the location with the brand details and passed it back to the app.
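
To make that concrete, here’s a minimal sketch in Java of that first iteration. The types, endpoint and names are illustrative assumptions, not code from the actual project:

```java
// First iteration: the Location service must call the Brand service
// synchronously to enrich each location. All names, types and the
// endpoint are illustrative, not from the original project.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

record Brand(String id, String name) {}
record Location(String id, String name, List<String> brandIds) {}
record EnrichedLocation(Location location, List<Brand> brands) {}

class LocationLookup {
    private final HttpClient http = HttpClient.newHttpClient();

    EnrichedLocation findLocation(String locationId) {
        Location location = loadLocationFromOwnDatabase(locationId);
        // Hard dependency: if the Brand service is down or unreachable,
        // this enrichment step fails and the request degrades or errors.
        List<Brand> brands = location.brandIds().stream()
                .map(this::fetchBrandFromBrandService)
                .toList();
        return new EnrichedLocation(location, brands);
    }

    private Brand fetchBrandFromBrandService(String brandId) {
        try {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://brand-service/brands/" + brandId)).build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            return parseBrand(response.body());
        } catch (Exception e) {
            throw new IllegalStateException("Brand service unavailable", e);
        }
    }

    private Location loadLocationFromOwnDatabase(String id) { /* elided */ return null; }
    private Brand parseBrand(String json) { /* elided */ return null; }
}
```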

It is arguable that the more independent a microservice is, the more robust it becomes. Here we have a very clear dependency of the Location service on the Brand service when performing a location lookup. If the Brand service is down or otherwise unavailable, either the brands are not returned with the location or, depending on how the Location service handles partial failure, the entire request fails. There is also the potential latency of inter-service communication and two database lookups, although in a simple request such as this it is likely to be negligible.

To remove this dependency the Location service can keep a copy of the data maintained by the Brand service up-to-date in its own database, so that when a location is requested, no call to the Brand service is necessary. This does not make the Brand service redundant: it is still required to maintain the brand data and remains the Single Source of Truth for brands. The copy of the brand data kept in the Location service, although updatable from the Brand service, is effectively read-only. The advantage is that if the Brand service is unavailable the location request will still succeed and the brand data will be present in the response.
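
The lookup then becomes a purely local read. Again, a sketch with illustrative types rather than the project’s real code:

```java
// Second iteration: the Location service reads its own copy of the
// brand data, so the lookup never leaves its database. Types are
// illustrative assumptions.
import java.util.List;

record Brand(String id, String name) {}
record Location(String id, String name, List<String> brandIds) {}
record EnrichedLocation(Location location, List<Brand> brands) {}

class LocalLocationLookup {
    EnrichedLocation findLocation(String locationId) {
        Location location = loadLocationFromOwnDatabase(locationId);
        // Local read of the read-only brand copies: no network call,
        // so the request succeeds even when the Brand service is down.
        List<Brand> brands = location.brandIds().stream()
                .map(this::loadBrandCopyFromOwnDatabase)
                .toList();
        return new EnrichedLocation(location, brands);
    }

    private Location loadLocationFromOwnDatabase(String id) { /* elided */ return null; }
    private Brand loadBrandCopyFromOwnDatabase(String id) { /* elided */ return null; }
}
```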

Distribute the Data

However, this begs the question of how to keep the brand data up-to-date in the Location service. Could this be a source of coupling? I’ve seen it done in two ways, but there are other approaches too.

One way is to have the Location service poll the Brand service every so often to fetch the brand data and update its own database. There are a number of drawbacks to this approach. We all know polling is evil. Where there are multiple instances of the polling service, you have to nominate one instance to do the polling, or all of the instances end up polling for the same data unnecessarily, and all at the same time. The data in the Brand service may be updated in between polls, meaning the Location service holds stale data for a period of time - although, whatever method of data synchronisation you use, there will always be an element of Eventual Consistency. You either need to devise a clever mechanism for determining which brands have been updated since the last poll or always fetch all of the data, resulting in potentially large responses. Polling also requires the Location service to know where to find the Brand service, creating an unnecessary dependency, and to handle the errors caused by the Brand service being down or unreachable. For all of these reasons, the polling approach doesn’t tend to scale very well.
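
Here’s a naive sketch of what such a poller might look like, just to make the drawbacks concrete; the endpoint, schedule and names are all assumptions:

```java
// A naive polling synchroniser. Illustrates the drawbacks described
// above; the endpoint, interval and names are illustrative.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class BrandPoller {
    private final HttpClient http = HttpClient.newHttpClient();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Polls every five minutes; brand edits made between polls are
        // invisible until the next run (eventual consistency at best).
        scheduler.scheduleAtFixedRate(this::poll, 0, 5, TimeUnit.MINUTES);
    }

    private void poll() {
        try {
            // The poller must know where the Brand service lives: a dependency.
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://brand-service/brands")).build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            // Without change-tracking, the whole data set comes back each time.
            replaceLocalBrandCopies(response.body());
        } catch (Exception e) {
            // The poller also owns the failure handling when the Brand
            // service is down or unreachable.
        }
    }

    private void replaceLocalBrandCopies(String json) { /* elided */ }
}
```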

The approach I favour is to use a message broker. When a brand is created, updated or deleted, the Brand service puts a message onto a topic with only the details of that change. The Location service listens to a queue, which is subscribed to the topic, and updates its database with the single brand each time a message is received. No polling is necessary. Message brokers are usually very fast, so the amount of time the Location service is out of date is likely to be negligible. The Location service only needs to know where the queue it listens to is. When there are multiple instances of the Location service, the queue can be configured to deliver each message to only the first instance that requests it. An added advantage is that the Brand service only needs to know where to find the topic. It doesn’t need to know anything about the Location service, or any other service, that may want to consume the messages via a queue subscribed to the topic. Of course, both the sending and receiving services are still slightly coupled by the format of the message, and the data contract potentially becomes as important as any API contract.
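
Here’s a sketch of both sides using the JMS API, assuming a provider such as ActiveMQ; the destination names, the message format and the broker configuration that subscribes the queue to the topic are all illustrative:

```java
// Brand service side: one small message per create/update/delete.
// Location service side: a queue, subscribed to the topic via broker
// configuration (elided), delivers each message to exactly one instance.
import jakarta.jms.Connection;
import jakarta.jms.ConnectionFactory;
import jakarta.jms.MessageConsumer;
import jakarta.jms.MessageProducer;
import jakarta.jms.Queue;
import jakarta.jms.Session;
import jakarta.jms.TextMessage;
import jakarta.jms.Topic;

class BrandEventPublisher {
    private final ConnectionFactory factory;

    BrandEventPublisher(ConnectionFactory factory) { this.factory = factory; }

    // Called whenever a brand is created, updated or deleted.
    void publishBrandChanged(String brandEventJson) throws Exception {
        try (Connection connection = factory.createConnection()) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            // The publisher knows only the topic, not its consumers.
            Topic topic = session.createTopic("brand-events");
            MessageProducer producer = session.createProducer(topic);
            producer.send(session.createTextMessage(brandEventJson));
        }
    }
}

class BrandEventListener {
    void listen(ConnectionFactory factory) throws Exception {
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        // The consumer knows only its own queue.
        Queue queue = session.createQueue("location-service.brand-events");
        MessageConsumer consumer = session.createConsumer(queue);
        consumer.setMessageListener(message -> {
            try {
                String brandEventJson = ((TextMessage) message).getText();
                upsertLocalBrandCopy(brandEventJson); // apply the single change
            } catch (Exception e) {
                // retry / dead-letter handling elided
            }
        });
        connection.start();
    }

    private void upsertLocalBrandCopy(String json) { /* elided */ }
}
```

Note that plain JMS doesn’t define topic-to-queue subscription itself; brokers provide it through features such as ActiveMQ’s virtual topics.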

More Robust, Independent and Loosely Coupled

It can be as simple as that. Maintaining a copy of data in one service that is owned by another can make the services more robust and independent. Distributing that data via a message broker keeps it up-to-date, keeps the messages small and leaves the services loosely coupled, although not entirely decoupled.

As with most things in software development, maintaining a local copy of data which is managed elsewhere is a trade-off. As a software developer you must consider, for example, the security concerns which come with duplicating data, especially if it is personal data. You must also consider the complexity of keeping the data up-to-date. When does the data expire or become invalid? Does the data need to be versioned? Does the order in which updates are applied need to be taken into account?
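
For the versioning and ordering questions, one common guard is to carry a version on each event and apply only those newer than the copy you hold. A minimal sketch, with an assumed in-memory store standing in for the real database:

```java
// Guarding against out-of-order or replayed messages: every brand event
// carries a version, and only newer events are applied. The event shape
// and in-memory store are illustrative assumptions.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

record BrandEvent(String brandId, long version, String payloadJson) {}

class VersionedBrandStore {
    private final Map<String, BrandEvent> copies = new ConcurrentHashMap<>();

    // Applies the event only if it is newer than the copy we hold,
    // so stale or duplicated messages are ignored.
    void apply(BrandEvent event) {
        copies.merge(event.brandId(), event,
                (current, incoming) ->
                        incoming.version() > current.version() ? incoming : current);
    }
}
```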

I hope it goes without saying that in almost every other context duplication should still be avoided, detested and possibly even hated. However, it should also be clear that the trade-off of duplicating data across microservices can make for better microservices.

Much of my early understanding of microservices, including the advantages of sharing data and some of the possible ways to do it, came from Microservices Patterns by Chris Richardson (ISBN-13: 978-1617294549). If you’re interested in learning more about microservices, I would strongly recommend giving it a read. The rest has come from trial and error, failure, eventual success and quite a lot of arguing with colleagues.

