Why
When Naked Element was still a thing, we used DigitalOcean almost exclusively for our clients’ hosting. For the sorts of projects we were doing it was the most straightforward and cost-effective solution. DigitalOcean provided managed databases, but no facility to back them up automatically. This led us to develop a Python-based program, triggered once a day, which performed the backup, pushed it to AWS S3 and sent a confirmation or failure email.
We used Python for its familiarity, ease of use and low installation dependencies, as I’ll demonstrate later on in the Dockerfile. S3 was used for storage because DigitalOcean’s equivalent, ‘Spaces’, was not available in its UK data centre. The closest was Amsterdam, but our clients preferred to have their data in the UK.
Fast forward to May 2020 and I’m working on a personal project which uses a PostgreSQL database. I tried to use a combination of AWS and Terraform for the project’s infrastructure (as this is what I am using for my day job), but it became too much effort to bend AWS to my will, and it’s also quite expensive. I decided to move back to DigitalOcean and got the equivalent setup sorted in a day. I could have taken advantage of AWS’ free tier for the database for 12 months, but AWS backup storage is not free and I wanted as much as possible with one provider and within the same virtual private cloud (VPC).
I was back to needing my own backup solution. The new project I am working on uses Docker to run the main service. My Droplet (that’s what DigitalOcean calls its Linux server instances) setup is minimal: a non-root user, firewall configuration and a Docker install. The DigitalOcean Marketplace includes a Docker image, so most of that is done for me with a few clicks. I could have also installed Python and configured a backup program to run each evening, but I’d also have had to install the right version of the PostgreSQL client, which isn’t currently in the default Ubuntu repositories, so it is a little involved. As I was already using Docker, it made sense to create a new Docker image to install everything and run a Python program to schedule and perform the backups. Of course, some might argue that a whole Ubuntu install and configure in a Docker image is a bit much for one backup scheduler, but once it’s done it’s done, and it can easily be installed and run elsewhere as many times as needed.
There are two more decisions to note. My new backup solution uses DigitalOcean Spaces, as I’m not bothered about my data being in Amsterdam, and as I haven’t implemented an email server yet there are no notification emails. This resulted in me jumping out of bed as soon as I woke each morning to check Spaces to see whether the backup had worked, rather than just checking for an email. It took two days to get it all working correctly!
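When notification emails do eventually get added, Python’s standard library would cover most of it. Here is a minimal sketch, assuming placeholder addresses and an SMTP server on localhost; none of this exists in Greenback yet:

```python
import smtplib
from email.message import EmailMessage

def build_notification(succeeded, detail=""):
    # Addresses are placeholders, not real Greenback configuration
    msg = EmailMessage()
    msg["Subject"] = "Backup succeeded" if succeeded else "Backup FAILED"
    msg["From"] = "greenback@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(detail or msg["Subject"])
    return msg

def send_notification(msg, host="localhost"):
    # Would only work once an SMTP server is actually available
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

With something like this in place, the morning check becomes reading an email rather than logging into Spaces.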
What
I reached for Naked Element’s trusty Python backup program affectionately named Greenback after the arch enemy of Danger Mouse (Green-back up, get it? No, me neither…) but discovered it was too specific and would need some work, but would serve as a great template to start with.
It’s worth noting that I am a long way from a Python expert; I’m in the ‘reasonable working knowledge with lots of help from Google’ category. The first thing I needed the program to do was create the backup. At this point I was working locally, where I had the correct PostgreSQL client installed. db_backup.py:
import os
import subprocess
from datetime import datetime

db_connection_string = os.environ['DATABASE_URL']

class GreenBack:
    def backup(self):
        datestr = datetime.now().strftime("%d_%m_%Y_%H_%M_%S")
        backup_suffix = ".sql"
        backup_prefix = "backup_"
        destination = backup_prefix + datestr + backup_suffix
        backup_command = 'sh backup_command.sh ' + db_connection_string + ' ' + destination
        subprocess.check_output(backup_command.split(' '))
        return destination
I want to keep anything sensitive out of the code and out of source control, so I’ve brought in the connection string from an environment variable. The method constructs a filename based on the current date and time, calls an external bash script to perform the backup:
# $1 - connection string
# $2 - destination file
pg_dump $1 > $2
and returns the backup file name. Of course for Ubuntu I had to make the bash script executable. Next I needed to push the backup file to Spaces, which means more environment variables:
region=''
access_key=os.environ['SPACES_KEY']
secret_access_key=os.environ['SPACES_SECRET']
bucket_url=os.environ['SPACES_URL']
backup_folder='dbbackups'
bucket_name='findmytea'
These let the program access Spaces. Then comes another method:
import boto3

class GreenBack:
    ...
    def archive(self, destination):
        session = boto3.session.Session()
        client = session.client('s3',
                                region_name=region,
                                endpoint_url=bucket_url,
                                aws_access_key_id=access_key,
                                aws_secret_access_key=secret_access_key)
        client.upload_file(destination, bucket_name, backup_folder + '/' + destination)
        os.remove(destination)
It’s worth noting that DigitalOcean implemented the Spaces API to match the AWS S3 API so that the same tools can be used. The archive method creates a session and pushes the backup file to Spaces and then deletes it from the local file system. This is for reasons of disk space and security. A future enhancement to Greenback would be to automatically remove old backups from Spaces after a period of time.
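That enhancement could use the same boto3 client that archive already creates. The following is a sketch under my own assumptions (the retention period and the standalone function shape are not part of Greenback):

```python
from datetime import datetime, timedelta, timezone

bucket_name = 'findmytea'
backup_folder = 'dbbackups'
RETENTION_DAYS = 30  # assumption: keep a month of backups

def prune_old_backups(client, retention_days=RETENTION_DAYS):
    # Delete backups in Spaces older than the retention period
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    listing = client.list_objects_v2(Bucket=bucket_name,
                                     Prefix=backup_folder + '/')
    deleted = []
    for obj in listing.get('Contents', []):
        if obj['LastModified'] < cutoff:
            client.delete_object(Bucket=bucket_name, Key=obj['Key'])
            deleted.append(obj['Key'])
    return deleted
```

It could be called at the end of archive, after each successful upload, reusing the session client created there.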
The last thing the Python program needs to do is schedule the backups. A bit of Googling revealed an event loop which can be used to do this:
import asyncio
import logging

class GreenBack:
    last_backup_date = ""
    ...
    def callback(self, n, loop):
        today = datetime.now().strftime("%Y-%m-%d")
        if self.last_backup_date != today:
            logging.info('Backup started')
            destination = self.backup()
            self.archive(destination)
            self.last_backup_date = today
            logging.info('Backup finished')
        loop.call_at(loop.time() + n, self.callback, n, loop)
...
event_loop = asyncio.get_event_loop()
try:
    bk = GreenBack()
    bk.callback(60, event_loop)
    event_loop.run_forever()
finally:
    logging.info('closing event loop')
    event_loop.close()
On startup, callback is executed. It checks last_backup_date against the current date and, if they don’t match, runs the backup and updates last_backup_date. Whether or not a backup was run, callback then re-adds itself to the event loop with a one-minute delay. Calling event_loop.run_forever after the initial callback call means the program waits forever and the process continues.
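The once-a-day decision in callback boils down to a date string comparison, which could be pulled out into a small pure function to make it easy to test on its own. This refactoring is my sketch, not part of the original program:

```python
from datetime import datetime

def should_backup(last_backup_date, now=None):
    # A backup is due when today's date has not been recorded yet
    today = (now or datetime.now()).strftime("%Y-%m-%d")
    return last_backup_date != today
```

callback would then read `if should_backup(self.last_backup_date):` before running the backup and updating the recorded date.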
Now that I had a Python backup program, I needed to create a Dockerfile that would be used to build a Docker image to set up the environment and start the program:
FROM ubuntu:xenial as ubuntu-env
WORKDIR /greenback
RUN apt update
RUN apt -y install python3 wget gnupg sysstat python3-pip
RUN pip3 install --upgrade pip
RUN pip3 install boto3 --upgrade
RUN pip3 install asyncio --upgrade
RUN echo 'deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main' > /etc/apt/sources.list.d/pgdg.list
RUN wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
RUN apt-key add ACCC4CF8.asc
RUN apt update
RUN apt -y install postgresql-client-12
COPY db_backup.py ./
COPY backup_command.sh ./
ENTRYPOINT ["python3", "db_backup.py"]
The Dockerfile starts with an Ubuntu image. This is a bare bones, but fully functioning Ubuntu operating system. The Dockerfile then installs Python, its dependencies and the Greenback dependencies. Then it installs the PostgreSQL client, including adding the necessary repositories. Following that it copies the required Greenback files into the image and tells it how to run Greenback.
I like to automate as much as possible so while I did plenty of manual Docker image building, tagging and pushing to the repository during development, I also created a BitBucket Pipeline, which would do the same on every check in:
image: python:3.7.3
pipelines:
default:
- step:
services:
- docker
script:
- IMAGE="findmytea/greenback"
- TAG=latest
- docker login --username $DOCKER_USERNAME --password $DOCKER_PASSWORD
- docker build -t $IMAGE:$TAG .
- docker push $IMAGE:$TAG
Pipelines, BitBucket’s cloud-based Continuous Integration and Continuous Deployment feature, is familiar with Python and Docker, so it was quite simple to make it log in to Docker Hub, build, tag and push the image. To enable the pipeline, all I had to do was add the bitbucket-pipelines.yml file to the root of the repository, check it in, follow the BitBucket pipeline process in the UI to enable it and then add the build environment variables so the pipeline could log in to Docker Hub. I’d already created the image repository in Docker Hub.
The Greenback image shouldn’t change very often and there isn’t a straightforward way of automating the updating of Docker images from Docker Hub, so I wrote a bash script to do it, deploy_greenback:
sudo docker pull findmytea/greenback
sudo docker kill greenback
sudo docker rm greenback
sudo docker run -d --name greenback --restart always --env-file=.env findmytea/greenback:latest
sudo docker ps
sudo docker logs -f greenback
Now, with a single command I can fetch the latest Greenback image, stop and remove the currently running image instance, install the new image, list the running images to reassure myself the new instance is running and follow the Greenback logs. When the latest image is run, it is named for easy identification, configured to restart when the Docker service is restarted and told where to read the environment variables from. The environment variables are in a local file called .env:
DATABASE_URL=...
SPACES_KEY=...
SPACES_SECRET=...
SPACES_URL=https://ams3.digitaloceanspaces.com
And that’s it! Greenback is now running in a Docker image instance on the application server and backs up the database to Spaces just after midnight every night.
Finally
While Greenback isn’t a perfect solution, it works, is configurable, is a good platform for future enhancements and should require minimal configuration to be used with other projects in the future. Greenback is checked into a public BitBucket repository and the full code can be found here:
https://bitbucket.org/findmytea/greenback/
The Greenback Docker image is in a public repository on Docker Hub and can be pulled with Docker:
docker pull findmytea/greenback