
Greenback backup


Why

When Naked Element was still a thing, we used DigitalOcean almost exclusively for our clients’ hosting. For the sorts of projects we were doing it was the most straightforward and cost-effective solution. DigitalOcean provided managed databases, but there was no facility to back them up automatically. This led us to develop a Python-based program which was triggered once a day to perform the backup, push it to AWS S3 and send a confirmation or failure email.

We used Python due to familiarity, ease of use and low installation dependencies. I’ll demonstrate this later on in the Dockerfile. S3 was used for storage as DigitalOcean did not have their equivalent, ‘Spaces’, available in their UK data centre. The closest is in Amsterdam, but our clients preferred to have their data in the UK. 

Fast forward to May 2020 and I’m working on a personal project which uses a PostgreSQL database. I tried to use a combination of AWS and Terraform for the project’s infrastructure (as this is what I am using for my day job), but it just became too much effort to bend AWS to my will, and it’s also quite expensive. I decided to move back to DigitalOcean and got the equivalent setup sorted in a day. I could have taken advantage of AWS’ free tier for the database for 12 months, but AWS backup storage is not free and I wanted to keep as much as possible with one provider and within the same virtual private cloud (VPC).

I was back to needing my own backup solution. The new project I am working on uses Docker to run the main service. My Droplet (that’s what DigitalOcean calls its Linux server instances) setup is minimal: a non-root user, firewall configuration and a Docker install. The DigitalOcean Marketplace includes a Docker image, so most of that is done for me with a few clicks. I could also have installed Python and configured a backup program to run each evening, but I’d have to install the right version of the PostgreSQL client, which isn’t currently in the default Ubuntu repositories, so that is a little involved. As I was already using Docker, it made sense to create a new Docker image to install everything and run a Python program to schedule and perform the backups. Of course, some might argue that a whole Ubuntu install and configuration in a Docker image is a bit much for one backup scheduler, but once it’s done it’s done, and it can easily be installed and run elsewhere as many times as needed.

There are two more decisions to note. My new backup solution uses DigitalOcean Spaces, as I’m not bothered about my data being in Amsterdam, and as I haven’t implemented an email server yet there are no notification emails. This resulted in me jumping out of bed as soon as I woke each morning to check Spaces to see if the backup had worked, rather than just checking for an email. It took two days to get it all working correctly!

What

I reached for Naked Element’s trusty Python backup program, affectionately named Greenback after the arch enemy of Danger Mouse (Green-back up, get it? No, me neither…), but discovered it was too specific and would need some work. It would, however, serve as a great template to start from.

It’s worth noting that I am a long way from a Python expert. I’m in the ‘reasonable working knowledge with lots of help from Google’ category. The first thing I needed the program to do was create the backup. At this point I was working locally, where I had the correct PostgreSQL client installed, db_backup.py:

import os
import subprocess
from datetime import datetime

db_connection_string = os.environ['DATABASE_URL']

class GreenBack:
    def backup(self):
        # Name the backup file after the current date and time
        datestr = datetime.now().strftime("%d_%m_%Y_%H_%M_%S")
        backup_suffix = ".sql"
        backup_prefix = "backup_"

        destination = backup_prefix + datestr + backup_suffix
        # Delegate the actual dump to an external shell script
        backup_command = 'sh backup_command.sh ' + db_connection_string + ' ' + destination
        subprocess.check_output(backup_command.split(' '))
        return destination

I want to keep anything sensitive out of the code and out of source control, so I’ve brought in the connection string from an environment variable. The method constructs a filename based on the current date and time and calls an external bash script to perform the backup:

# $1 - connection string
# $2 - destination file
pg_dump "$1" > "$2"

and returns the backup file name. Of course for Ubuntu I had to make the bash script executable. Next I needed to push the backup file to Spaces, which means more environment variables:

# Spaces configuration; the region is part of the endpoint URL, so it is left blank here
region=''
access_key=os.environ['SPACES_KEY']
secret_access_key=os.environ['SPACES_SECRET']
bucket_url=os.environ['SPACES_URL']
backup_folder='dbbackups'
bucket_name='findmytea'
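
A missing variable otherwise only shows up as a KeyError when the program starts, so a small fail-fast check can be worth adding. This is a minimal sketch of my own and not part of Greenback; the REQUIRED_VARS list is an assumption:

import os
import sys

# Hypothetical start-up check: stop with a clear message if configuration is missing
REQUIRED_VARS = ['DATABASE_URL', 'SPACES_KEY', 'SPACES_SECRET', 'SPACES_URL']

missing = [name for name in REQUIRED_VARS if name not in os.environ]
if missing:
    sys.exit('Missing environment variables: ' + ', '.join(missing))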

Those settings give the program access to Spaces. Another method then performs the upload:

class GreenBack:
    ...
    def archive(self, destination):
        # Spaces speaks the S3 protocol, so boto3's standard S3 client works unchanged
        session = boto3.session.Session()
        client = session.client('s3',
                                region_name=region,
                                endpoint_url=bucket_url,
                                aws_access_key_id=access_key,
                                aws_secret_access_key=secret_access_key)

        client.upload_file(destination, bucket_name, backup_folder + '/' + destination)
        # Remove the local copy once it has been uploaded
        os.remove(destination)

It’s worth noting that DigitalOcean implemented the Spaces API to match the AWS S3 API so that the same tools can be used. The archive method creates a session and pushes the backup file to Spaces and then deletes it from the local file system. This is for reasons of disk space and security. A future enhancement to Greenback would be to automatically remove old backups from Spaces after a period of time.
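
As a rough illustration of that enhancement, pruning could reuse the same Spaces client to list the backup folder and delete anything older than a cut-off. This is a minimal sketch of my own; the prune method and RETENTION_DAYS are assumptions, not part of Greenback:

from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # assumed retention period

class GreenBack:
    ...
    def prune(self):
        # Delete backups in Spaces older than RETENTION_DAYS
        session = boto3.session.Session()
        client = session.client('s3',
                                region_name=region,
                                endpoint_url=bucket_url,
                                aws_access_key_id=access_key,
                                aws_secret_access_key=secret_access_key)

        cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
        # list_objects_v2 returns up to 1,000 keys per call, plenty for daily backups
        response = client.list_objects_v2(Bucket=bucket_name, Prefix=backup_folder + '/')
        for obj in response.get('Contents', []):
            if obj['LastModified'] < cutoff:
                client.delete_object(Bucket=bucket_name, Key=obj['Key'])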

The last thing the Python program needs to do is schedule the backups. A bit of Googling revealed an event loop which can be used to do this:

class GreenBack:
    last_backup_date = ""

    def callback(self, n, loop):
        today = datetime.now().strftime("%Y-%m-%d")
        if self.last_backup_date != today:
            logging.info('Backup started')
            destination = self.backup()
            self.archive(destination)
            
            self.last_backup_date = today
            logging.info('Backup finished')
        loop.call_at(loop.time() + n, self.callback, n, loop)
...

event_loop = asyncio.get_event_loop()
try:
    bk = GreenBack()
    bk.callback(60, event_loop)
    event_loop.run_forever()
finally:
    logging.info('closing event loop')
    event_loop.close()

On startup, callback is executed. It checks last_backup_date against the current date and, if they don’t match, runs the backup and updates last_backup_date. Whether or not a backup ran, callback then re-schedules itself on the event loop with a one minute delay. Because the dates are compared as day strings, the first check after midnight starts a new backup, which is why the backups happen just after midnight. Calling event_loop.run_forever after the initial callback call means the program will wait forever and the process continues.
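
The snippets above call logging.info but don’t show the logging setup. A minimal configuration, which is my assumption of what the full program does rather than something shown in the excerpts, would look like this:

import logging

# Assumed logging setup: timestamped INFO-level messages to standard output
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')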

Now that I had a Python backup program, I needed a Dockerfile that would be used to create a Docker image to set up the environment and start the program:

FROM ubuntu:xenial as ubuntu-env
WORKDIR /greenback

RUN apt update
RUN apt -y install python3 wget gnupg sysstat python3-pip

RUN pip3 install --upgrade pip
RUN pip3 install boto3 --upgrade
RUN pip3 install asyncio --upgrade

RUN echo 'deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main' > /etc/apt/sources.list.d/pgdg.list
RUN wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
RUN apt-key add ACCC4CF8.asc

RUN apt update
RUN apt -y install postgresql-client-12

COPY db_backup.py ./
COPY backup_command.sh ./

ENTRYPOINT ["python3", "db_backup.py"]

The Dockerfile starts with an Ubuntu image. This is a bare-bones but fully functioning Ubuntu operating system. The Dockerfile then installs Python, its dependencies and the Greenback dependencies. Then it installs the PostgreSQL client, including adding the necessary repositories. Following that it copies the required Greenback files into the image and tells it how to run Greenback.

I like to automate as much as possible, so while I did plenty of manual Docker image building, tagging and pushing to the repository during development, I also created a BitBucket Pipeline which does the same on every check-in:

image: python:3.7.3

pipelines:
  default:
    - step:
          services:
            - docker
          script:
            - IMAGE="findmytea/greenback"
            - TAG=latest
            - docker login --username $DOCKER_USERNAME --password $DOCKER_PASSWORD
            - docker build -t $IMAGE:$TAG .
            - docker push $IMAGE:$TAG

Pipelines, BitBucket’s cloud-based Continuous Integration and Continuous Deployment feature, is familiar with Python and Docker, so it was quite simple to make it log in to Docker Hub, build, tag and push the image. To enable the pipeline all I had to do was add the bitbucket-pipelines.yml file to the root of the repository, check it in, follow the BitBucket pipeline process in the UI to enable it and then add the build environment variables so the pipeline could log into Docker Hub. I’d already created the image repository in Docker Hub.

The Greenback image shouldn’t change very often and there isn’t a straightforward way of automating the updating of Docker images from Docker Hub, so I wrote a bash script to do it, deploy_greenback:

sudo docker pull findmytea/greenback
sudo docker kill greenback
sudo docker rm greenback
sudo docker run -d --name greenback --restart always --env-file=.env findmytea/greenback:latest
sudo docker ps
sudo docker logs -f greenback

Now, with a single command, I can fetch the latest Greenback image, stop and remove the currently running container, start a new container from the latest image, list the running containers to reassure myself the new instance is running and follow the Greenback logs. When the latest image is run, the container is named for easy identification, configured to restart when the Docker service is restarted and told where to read the environment variables from. The environment variables are in a local file called .env:

DATABASE_URL=...
SPACES_KEY=...
SPACES_SECRET=...
SPACES_URL=https://ams3.digitaloceanspaces.com

And that’s it! Greenback is now running in a Docker container on the application server and backs up the database to Spaces just after midnight every night.

Finally

While Greenback isn’t a perfect solution, it works, is configurable, provides a good platform for future enhancements and should require minimal configuration to be used with other projects in the future.
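
One concrete example of that remaining configuration: bucket_name and backup_folder are currently hard-coded in db_backup.py. A likely first step for reuse, and this is my suggestion rather than anything Greenback does today, would be to read them from the environment with the current values as defaults:

import os

# Hypothetical tweak: make the bucket and folder configurable per project,
# falling back to the existing hard-coded values
backup_folder = os.environ.get('SPACES_FOLDER', 'dbbackups')
bucket_name = os.environ.get('SPACES_BUCKET', 'findmytea')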

Greenback is checked into a public BitBucket repository and the full code can be found here:

https://bitbucket.org/findmytea/greenback/

The Greenback Docker image is in a public repository on Docker Hub and can be pulled with Docker:

docker pull findmytea/greenback
