Getting Hands-On with Apache Iceberg: From Docker to AWS in a Day

On Tuesday this week, I was back at AWS on Holborn Viaduct learning about Apache Iceberg with a colleague. Iceberg is an open table format for managing large-scale datasets and their metadata. The AWS-managed approach is to store the data in S3, meaning you only pay for storage and bandwidth. Other tools, such as Apache Spark, can then be used on top to build a data warehouse or, in this case, a lakehouse. AWS also supports self-hosting, but this requires you to configure and manage aspects such as maintenance yourself.

One of Iceberg’s major advantages is its support for true ACID transactions, and because it is a table format standard, you don’t have to run a database engine. Anything which wants to access the data can go straight to S3. As a software engineer with a background in object orientation, this feels potentially brittle and a coupling nightmare, but I am sure it works in practice.

The pre-workshop presentations made Iceberg feel simple and easy to use, with lots of ways to get data in and out, such as SQL, zero-ETL, REST, batch and streaming. It can be easily integrated with RDS (Postgres, etc.), AWS Lambda, DynamoDB, Redshift and more, in near real time, with services such as Amazon Data Firehose.

Something which stood out for me is that every time a table is updated, Iceberg commits a new snapshot, and the data files behind older snapshots are kept rather than deleted. This can happen up to 60 times an hour. That’s a lot of data. This is why it’s important to have maintenance configured, such as expiring old snapshots, so that storage doesn’t become unmanageable, or worse, expensive.
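To see why maintenance matters, here is a toy Python model of the idea. It is a sketch only, not Iceberg's actual implementation: the `ToyTable` class, the file names, the one-commit-per-minute rate and the "keep the last 5 snapshots" policy are all assumptions made up for illustration. Each commit records a new snapshot, superseded files stay in storage because older snapshots still point at them, and expiring snapshots is what finally lets those files be deleted.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class ToyTable:
    """Toy stand-in for a table: each snapshot is the set of data files it references."""

    snapshots: list = field(default_factory=list)
    stored_files: set = field(default_factory=set)

    def commit(self, added: Iterable[str], removed: Iterable[str] = ()) -> None:
        """Record a new snapshot. Files dropped from it are NOT deleted from storage."""
        current = self.snapshots[-1] if self.snapshots else frozenset()
        self.snapshots.append(frozenset((current - set(removed)) | set(added)))
        self.stored_files |= set(added)

    def expire_snapshots(self, retain_last: int) -> set:
        """Keep only the newest snapshots and delete files none of them reference."""
        if retain_last < 1:
            raise ValueError("retain_last must be at least 1")
        self.snapshots = self.snapshots[-retain_last:]
        live = set().union(*self.snapshots)
        deleted = self.stored_files - live
        self.stored_files = live
        return deleted


# Hypothetical workload: one update a minute for an hour, each rewriting one file.
table = ToyTable()
table.commit(added=["data-0.parquet"])
for minute in range(1, 60):
    table.commit(
        added=[f"data-{minute}.parquet"],
        removed=[f"data-{minute - 1}.parquet"],
    )

files_before = len(table.stored_files)
deleted = table.expire_snapshots(retain_last=5)

print(f"files stored before maintenance: {files_before}")
print(f"files deleted by expiry:         {len(deleted)}")
print(f"files stored after maintenance:  {len(table.stored_files)}")
```

In this toy run the current table only ever needs one file, yet 60 are stored after an hour, and expiry brings that down to 5. In Spark, Iceberg exposes this kind of clean-up as the `expire_snapshots` procedure.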

The AWS presenters were great and really easy to listen to, as was the customer panel. However, I really couldn’t wait for the afternoon's workshop, so I started experimenting on my own while I listened.

In about 20 minutes, with a little help from Claude Code, I had Apache Iceberg up and running in Docker containers on my Linux laptop, and was able to create tables, and insert and query data with Spark. I was keen to prove I could access the data programmatically, so with more help from Claude I created a RESTful catalog and was able to use curl to query pages of data.
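As a rough illustration of that kind of programmatic access, the sketch below pages through the table listing of an Iceberg REST catalog using only the Python standard library. It is not the setup described here: the catalog address `http://localhost:8181`, the `demo` namespace and the helper names are assumptions, it assumes a catalog with no URL prefix and a single-level namespace, and it returns catalog metadata (table names) rather than rows. The `pageSize` and `pageToken` parameters and the `next-page-token` field come from the REST catalog specification, but a given server may ignore them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

CATALOG_URL = "http://localhost:8181"  # assumption: a local REST catalog container


def list_tables_url(base_url, namespace, page_size=None, page_token=None):
    """Build the 'list tables' URL for a single-level namespace."""
    path = f"{base_url.rstrip('/')}/v1/namespaces/{quote(namespace, safe='')}/tables"
    params = {}
    if page_size is not None:
        params["pageSize"] = page_size
    if page_token:
        params["pageToken"] = page_token
    return f"{path}?{urlencode(params)}" if params else path


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def iter_tables(base_url, namespace, page_size=50, fetch=fetch_json):
    """Yield 'namespace.table' names, following next-page-token until it runs out."""
    token = None
    while True:
        body = fetch(list_tables_url(base_url, namespace, page_size, token))
        for ident in body.get("identifiers", []):
            yield ".".join([*ident["namespace"], ident["name"]])
        token = body.get("next-page-token")
        if not token:
            break
```

With a catalog running, `for name in iter_tables(CATALOG_URL, "demo"): print(name)` would print one table per line. The `fetch` parameter exists only so the paging logic can be exercised without a live server.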

I’m a massive DigitalOcean fan, so initially I had Claude create Terraform to run Iceberg on a Droplet with its S3-compatible storage. However, this would mean paying about $40 a month just to run the Droplet, so I asked Claude to generate the Terraform for AWS instead. This brought the cost down to next to nothing, for small payloads at least.

The afternoon workshop came quite late, mostly due to the knock-on effect of a late start caused by train strikes, plus questions asked in the earlier sessions. I love AWS workshops; they’re one of the highlights of the AWS Summit for me. However, there is often a lot of “copy this here”, “configure this service like this”, “push this button and hope it works”. Unfortunately, it didn’t work for me on this occasion. What would be great to see instead is how to write the Terraform to create, configure and integrate Apache Iceberg in AWS, but I know most people point and click in the AWS console.

However, that didn’t detract from what was a really interesting, useful day where I learnt a lot.

Paul Grenyer
