Getting Hands-On with Apache Iceberg: From Docker to AWS in a Day

 
On Tuesday this week, I was back at AWS on Holborn Viaduct learning about Apache Iceberg with a colleague. Iceberg is an open table format for managing large-scale datasets and their metadata. The AWS-managed approach is to store the data in S3, meaning you only pay for storage and bandwidth. Other tools, such as Apache Spark, can then be used on top to build a data warehouse - or in this case, a lakehouse. AWS also supports self-hosting, but this requires you to configure and manage aspects such as maintenance yourself.

One of the major advantages Iceberg has over other approaches to data warehousing is its support for true ACID transactions, and because it is a table format standard you don’t have to run a database engine. Anything which wants to access the data can go straight to S3. As a software engineer with a background in object orientation, this feels potentially brittle and a coupling nightmare, but I am sure it works in practice.

 

The pre-workshop presentations made Iceberg feel simple and easy to use, with lots of ways to get data in and out, such as SQL, zero-ETL, REST, batch and streaming. It can be easily integrated with RDS (Postgres, etc.), AWS Lambda, DynamoDB, Redshift and more, in near real time, with services such as Amazon Data Firehose.

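Something which stood out for me is that every time a table is updated, a new snapshot of it is committed and stored alongside the old ones. As I understand it, this is not a full copy of the data: only new metadata and the new or rewritten data files are added, but nothing old is deleted until snapshots are expired. This can happen up to 60 times an hour. That’s a lot of data. This is why it’s important to have maintenance configured so that storage doesn’t become unmanageable, or worse, expensive.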
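
To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes every update produces exactly one stored snapshot of the table, and it uses a hypothetical five-day retention window. The function names are mine and are not part of Iceberg or AWS. It counts snapshots rather than bytes, because how much storage each one adds depends on the workload.

```python
# Back-of-the-envelope snapshot arithmetic (illustrative only).
# Assumption: every update commits exactly one new snapshot.

HOURS_PER_DAY = 24


def snapshots_created(commits_per_hour: int, days: int) -> int:
    """Snapshots written over a period if nothing is ever expired."""
    return commits_per_hour * HOURS_PER_DAY * days


def snapshots_retained(commits_per_hour: int, elapsed_days: int,
                       retention_days: int) -> int:
    """Snapshots still held if anything older than the window is expired."""
    return commits_per_hour * HOURS_PER_DAY * min(elapsed_days, retention_days)


if __name__ == "__main__":
    rate = 60  # the "up to 60 times an hour" figure from the talk
    print("per day:", snapshots_created(rate, 1))
    print("30 days, no maintenance:", snapshots_created(rate, 30))
    # The 5-day window is a hypothetical setting chosen for illustration.
    print("30 days, 5-day retention:", snapshots_retained(rate, 30, 5))
```

At 60 commits an hour that is 1,440 snapshots a day and 43,200 over 30 days with no maintenance, against 7,200 if everything older than five days is expired.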

The AWS presenters were great and really easy to listen to, as was the customer panel. However, I really couldn’t wait for the afternoon's workshop, so I started playing with it myself while I was listening.

In about 20 minutes, with a little help from Claude Code, I had Apache Iceberg up and running in Docker containers on my Linux laptop, and was able to create tables and insert and query data with Spark. I was keen to prove I could access the data programmatically, so with more help from Claude I created a RESTful catalog and was able to use curl to query pages of data.
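
For anyone who wants to try the programmatic side, here is a minimal Python sketch of paging through a REST catalog. It is not the exact setup I ran. It assumes a catalog that follows the Iceberg REST catalog specification, with no URL prefix and no authentication, and note that such a catalog serves table metadata (namespaces and table identifiers), not rows. The base URL `http://localhost:8181` and the namespace `demo` are hypothetical, and the helper names are mine.

```python
import json
from typing import List, Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def list_tables_url(base_url: str, namespace: str,
                    page_size: Optional[int] = None,
                    page_token: Optional[str] = None) -> str:
    """Build the URL that lists the tables in a single-level namespace."""
    url = f"{base_url.rstrip('/')}/v1/namespaces/{quote(namespace, safe='')}/tables"
    params = {}
    if page_size is not None:
        params["pageSize"] = page_size
    if page_token is not None:
        params["pageToken"] = page_token
    return f"{url}?{urlencode(params)}" if params else url


def next_page_token(body: dict) -> Optional[str]:
    """Return the token for the next page, or None on the last page."""
    return body.get("next-page-token")


def fetch_all_tables(base_url: str, namespace: str, page_size: int = 50) -> List[dict]:
    """Follow next-page-token until the catalog stops returning one."""
    tables: List[dict] = []
    # As I read the spec, an empty pageToken asks the server to start paging.
    token: Optional[str] = ""
    while True:
        with urlopen(list_tables_url(base_url, namespace, page_size, token)) as resp:
            body = json.load(resp)
        tables.extend(body.get("identifiers", []))
        token = next_page_token(body)
        if not token:
            return tables
```

Calling `fetch_all_tables("http://localhost:8181", "demo")` would walk every page. The first request is roughly what `curl "http://localhost:8181/v1/namespaces/demo/tables?pageSize=50&pageToken="` does by hand.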

 

I’m a massive DigitalOcean fan, so initially I had Claude create Terraform to run Iceberg on a Droplet with DigitalOcean’s S3-compatible storage. However, this would mean paying about $40 a month just to run the Droplet, so I asked Claude to generate the Terraform for AWS instead. This brought the cost down to next to nothing, for small payloads at least.

The afternoon workshop came quite late, mostly due to the knock-on effect of a late start caused by train strikes, plus questions asked in the earlier sessions. I love AWS workshops; they’re one of the highlights of the AWS Summit for me. However, there is often a lot of “copy this here”, “configure this service like this”, “push this button and hope it works”. Unfortunately, it didn’t work for me on this occasion. What would be great to see instead is how to write the Terraform to create, configure and integrate Apache Iceberg in AWS, but I know most people point and click in the AWS console.

However, that didn’t detract from what was a really interesting, useful day where I learnt a lot.


 

 
