How I Reduced a $10,000 Monthly AWS Glue Bill to $400 Using Airflow
Yes, that's a 96% reduction!
During my time as a DevOps Engineer at Vance, we were running around 80 ETL pipelines on AWS Glue, but as our workloads scaled, so did our costs, hitting a staggering $10,000 per month. This wasn't sustainable. After analyzing our pipelines, we realized that AWS Glue's serverless pricing had us paying heavily for idle time and unnecessary compute.
To fix this, I migrated our ETL workloads to Apache Airflow, running on EC2 instances with ECS, and orchestrated everything using Terraform. The result? A 96% cost reduction, bringing our bill down to just $400 per month, without compromising on performance.
While Airflow is a great alternative to Glue, there's little documentation on setting it up properly with Terraform and the Celery Executor, especially for cost optimization. This blog walks you through how we did it, the challenges we faced, and how you can do the same to slash your AWS Glue costs.
Needless to say, this was indeed a war. Credits to my manager Rishabh Lakhotia, who went through this circus with me; you are indeed a god, sir.
Intro
I am Akash Singh, a Backend Developer and Open Source Contributor.
Here is my LinkedIn, GitHub and Twitter
I go by the name SkySingh04 online.
The three parts of the pain
Migrating from AWS Glue to Apache Airflow involves setting up three core components:
1. Webserver — The UI for managing DAGs (Directed Acyclic Graphs) and monitoring job execution.
2. Scheduler — Responsible for triggering and scheduling DAG runs.
3. Workers — Execute the actual tasks in the DAGs.
Using Terraform, we provisioned ECS to run all three in parallel and enable them to communicate with each other, which we will get to next.
Once Airflow is up and running, the next step is to migrate our ETL workflows. We will convert Glue jobs into Airflow DAGs and then nuke the Glue jobs, marking the final step in cutting down our AWS costs.
The Magical Dockerfile
You can use the following Dockerfile, push it to ECR, and reference it in the upcoming configs:
```dockerfile
FROM apache/airflow:latest-python3.9
USER root
RUN apt-get update && \
apt-get install -y --no-install-recommends \
git \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN mkdir -p /opt/airflow/dags /opt/airflow/logs && \
chown -R airflow:root /opt/airflow && \
chmod -R 755 /opt/airflow/logs
USER airflow
RUN pip install --no-cache-dir \
apache-airflow-providers-github \
apache-airflow-providers-amazon \
apache-airflow-providers-mysql \
apache-airflow-providers-mongo \
apache-airflow[celery,redis] \
pandas
COPY --chown=airflow:root dags/* /opt/airflow/dags/
ENV AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs \
AIRFLOW__LOGGING__WORKER_LOG_SERVER_PORT=8793 \
AIRFLOW__LOGGING__LOGGING_LEVEL=INFO \
AIRFLOW__LOGGING__LOG_FORMAT='[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s' \
AIRFLOW__LOGGING__SIMPLE_LOG_FORMAT='%(asctime)s %(levelname)s - %(message)s' \
AIRFLOW__LOGGING__DAG_PROCESSOR_LOG_TARGET=file \
AIRFLOW__LOGGING__TASK_LOG_READER=task \
AIRFLOW__LOGGING__DAG_FILE_PROCESSOR_LOG_TARGET=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log \
AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
ENV AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
RUN mkdir -p /opt/airflow/logs/scheduler \
/opt/airflow/logs/web \
/opt/airflow/logs/worker \
/opt/airflow/logs/dag_processor_manager \
/opt/airflow/logs/task_logs
USER root
RUN chown -R airflow:root /opt/airflow && \
chmod -R 755 /opt/airflow
USER airflow
```
This Dockerfile will be used for all three of our components and sets up logging nicely as well. The DAGs are baked directly into the Docker image; we will get to that in a bit. Build the image, tag it, push it to ECR, and move to the next step!
Airflow Web Server
A Terraform script can be written to set up Apache Airflow on AWS using ECS (Elastic Container Service) with the EC2 launch type (a minimal sketch of the webserver task definition and service follows the list). We need to make sure to add:
1. CloudWatch Logging:
— Creates a log group (`/ecs/airflow`) with a retention of 3 days.
2. Security Groups:
— Allows inbound HTTP (port 80) and HTTPS (port 443) traffic for the Application Load Balancer (ALB).
— Enables unrestricted outbound traffic.
3. TLS/SSL with ACM & Route 53:
— Provisions an ACM (AWS Certificate Manager) certificate for airflow.internal.example.com using DNS validation.
— Configures Route 53 DNS records to resolve the Airflow URL to the ALB.
4. Application Load Balancer (ALB):
— Creates an internal ALB for the Airflow webserver, supporting IPv4 & IPv6 (`dualstack`).
— Configures an HTTP listener (port 80) to redirect traffic to HTTPS (port 443).
— Sets up an HTTPS listener (port 443) to forward requests to the ECS target group.
5. ECS Task Definition for Airflow Webserver:
— Defines an ECS task for the Airflow webserver running on an EC2-backed ECS cluster.
— Uses a Docker image stored in AWS ECR (`aws_ecr_repository.airflow.repository_url:latest`).
— Allocates 2GB memory (`2048MB`).
— Maps container port 8080 to the host for web access.
— Defines a health check (`http://localhost:8080/health`).
6. ECS Service for Airflow:
— Creates an ECS service named "airflow-webserver" with 1 desired task.
— Associates the ECS service with the ALB target group for load balancing.
— Enables `execute-command` to allow debugging via AWS SSM.
— Uses a capacity provider strategy for ECS resource management.
7. DNS Configuration:
— Configures a Route 53 A record (`airflow.internal.example.com`) pointing to the ALB.
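To give you a concrete starting point, here's a minimal sketch of what the webserver task definition and service can look like. Treat it as a skeleton: the IAM roles, log group, cluster, target group, capacity provider, and the `local.airflow_env` map are placeholders for resources defined elsewhere in your Terraform, not the exact names we used.
```HCL
resource "aws_ecs_task_definition" "airflow_webserver" {
  family                   = "airflow-webserver"
  requires_compatibilities = ["EC2"]
  network_mode             = "bridge"
  execution_role_arn       = aws_iam_role.airflow_task_execution.arn # placeholder role
  task_role_arn            = aws_iam_role.airflow_task.arn           # placeholder role

  container_definitions = jsonencode([
    {
      name      = "airflow-webserver"
      image     = "${aws_ecr_repository.airflow.repository_url}:latest"
      memory    = 2048
      essential = true
      command   = ["webserver"]
      portMappings = [
        { containerPort = 8080, hostPort = 8080, protocol = "tcp" }
      ]
      # local.airflow_env is the map of Airflow settings (sketched later in this post)
      environment = [for k, v in local.airflow_env : { name = k, value = v }]
      healthCheck = {
        command  = ["CMD-SHELL", "curl --fail http://localhost:8080/health"]
        interval = 30
        timeout  = 5
        retries  = 3
      }
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.airflow.name
          "awslogs-region"        = local.region
          "awslogs-stream-prefix" = "webserver"
        }
      }
    }
  ])
}

resource "aws_ecs_service" "airflow_webserver" {
  name                   = "airflow-webserver"
  cluster                = aws_ecs_cluster.airflow.id
  task_definition        = aws_ecs_task_definition.airflow_webserver.arn
  desired_count          = 1
  enable_execute_command = true # allows debugging via AWS SSM

  load_balancer {
    target_group_arn = aws_lb_target_group.airflow_webserver.arn
    container_name   = "airflow-webserver"
    container_port   = 8080
  }

  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.airflow.name
    weight            = 1
  }
}
```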
The Terraform script includes several environment variables in the ECS task definition (the KMS secrets wiring is sketched after this list):
1. Database Connection (`AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`):
— Specifies the PostgreSQL database connection string for Airflow’s metadata database.
— Uses AWS KMS-encrypted secrets to securely store the database password.
2. User Management:
— `_AIRFLOW_WWW_USER_CREATE`: Ensures the default Airflow web user is created.
— `_AIRFLOW_WWW_USER_USERNAME`: Sets the username (default: `airflow`).
— `_AIRFLOW_WWW_USER_PASSWORD`: Stores the password securely via AWS KMS secrets.
3. Security & Web Configuration:
— `AIRFLOW__WEBSERVER__EXPOSE_CONFIG`: Enables exposing Airflow configuration via the web UI.
— `AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK`: Enables a built-in scheduler health check.
4. Database Migrations & Initialization:
— `_AIRFLOW_DB_MIGRATE`: Ensures Airflow runs necessary database migrations on startup.
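For reference, the KMS-encrypted secrets mentioned above can be pulled in with the `aws_kms_secrets` data source and interpolated into the connection string. A minimal sketch, where the ciphertext payloads are placeholders for your own encrypted values and `airflow_db_conn` is just an illustrative local name:
```HCL
data "aws_kms_secrets" "airflow" {
  secret {
    name    = "db_password"
    payload = "AQICAH...base64-kms-ciphertext..." # placeholder
  }
  secret {
    name    = "github_token"
    payload = "AQICAH...base64-kms-ciphertext..." # placeholder
  }
}

locals {
  # Airflow metadata DB connection string built from the decrypted password
  airflow_db_conn = "postgresql+psycopg2://airflow:${data.aws_kms_secrets.airflow.plaintext["db_password"]}@${aws_db_instance.airflow.endpoint}/airflow"
}
```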
Now go ahead and run `terraform plan` and `terraform apply`, and you should see a lot of resources being created. If everything went correctly, you will see the Airflow UI at the URL you specified.
Airflow Scheduler
The Airflow Scheduler is responsible for orchestrating DAG executions and ensuring scheduled tasks run at the correct time. A Terraform script can be written to provision the scheduler as an ECS service, configure CloudWatch logging, and enable auto-scaling to manage resource usage effectively.
While most of this is similar to the webserver, in summary we need to add (a sketch follows this list):
- Logs Scheduler Execution in CloudWatch (`/ecs/airflow-scheduler/`).
- Monitors Performance via StatsD Metrics (`airflow-metrics` namespace).
- Runs in an ECS Cluster with Auto Scaling, ensuring efficient resource allocation.
- Uses CloudWatch Agent for Monitoring, helping analyze task execution times.
- Secured by a Restricted Security Group, blocking unwanted traffic.
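Here's a rough sketch of the scheduler-specific bits, assuming the same image, roles, log-driver options, and `local.airflow_env` map as the webserver. The CloudWatch agent sidecar that receives the StatsD metrics on port 8125 and publishes them to the `airflow-metrics` namespace is omitted for brevity:
```HCL
resource "aws_cloudwatch_log_group" "airflow_scheduler" {
  name              = "/ecs/airflow-scheduler/"
  retention_in_days = 3
}

resource "aws_ecs_task_definition" "airflow_scheduler" {
  family                   = "airflow-scheduler"
  requires_compatibilities = ["EC2"]
  network_mode             = "bridge"
  execution_role_arn       = aws_iam_role.airflow_task_execution.arn # placeholder role
  task_role_arn            = aws_iam_role.airflow_task.arn           # placeholder role

  container_definitions = jsonencode([
    {
      name      = "airflow-scheduler"
      image     = "${aws_ecr_repository.airflow.repository_url}:latest"
      memory    = 2048
      essential = true
      command   = ["scheduler"]
      # Shared Airflow settings plus StatsD emission towards the (omitted) CW agent sidecar
      environment = concat(
        [for k, v in local.airflow_env : { name = k, value = v }],
        [
          { name = "AIRFLOW__METRICS__STATSD_ON", value = "True" },
          { name = "AIRFLOW__METRICS__STATSD_HOST", value = "localhost" },
          { name = "AIRFLOW__METRICS__STATSD_PORT", value = "8125" },
        ]
      )
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.airflow_scheduler.name
          "awslogs-region"        = local.region
          "awslogs-stream-prefix" = "scheduler"
        }
      }
    }
  ])
}

resource "aws_ecs_service" "airflow_scheduler" {
  name            = "airflow-scheduler"
  cluster         = aws_ecs_cluster.airflow.id
  task_definition = aws_ecs_task_definition.airflow_scheduler.arn
  desired_count   = 1
}
```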
Now, go ahead and run `terraform plan` and `terraform apply`, and the Airflow Scheduler will be provisioned successfully! 🚀
Airflow Worker
The Airflow worker service is deployed as an ECS service on EC2 instances with auto-scaling based on memory utilization. It runs the Celery workers, which execute tasks from the DAGs, and will require Redis, which we’ll set up next.
The important things to note are (an autoscaling sketch follows this list):
- Uses CeleryExecutor, meaning tasks are distributed among workers.
- Logs are sent to CloudWatch for monitoring.
- Workers scale dynamically between 0 and 5 based on memory utilization.
- Each worker runs as a container inside ECS, managed by an autoscaling policy with a target of 60% memory utilization.
- `DUMB_INIT_SETSID=0` is set to handle proper signal propagation for Celery shutdown.
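For the autoscaling piece specifically, this is roughly what the Application Auto Scaling resources look like, assuming a worker service named `aws_ecs_service.airflow_worker` on the same cluster (names are illustrative):
```HCL
resource "aws_appautoscaling_target" "airflow_worker" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.airflow.name}/${aws_ecs_service.airflow_worker.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = 5
}

resource "aws_appautoscaling_policy" "airflow_worker_memory" {
  name               = "airflow-worker-memory-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.airflow_worker.service_namespace
  resource_id        = aws_appautoscaling_target.airflow_worker.resource_id
  scalable_dimension = aws_appautoscaling_target.airflow_worker.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 60 # 60% average memory utilization
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
  }
}
```
Scaling in all the way to 0 workers when the queue is empty is a big part of why the bill dropped so far compared to Glue's always-billed job runs.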
This entire setup made me cry because debugging autoscaling, log management, and task execution in ECS is a nightmare. Also, Redis isn’t even here yet, so the pain is far from over.
Redis and RDS
Setting up Redis isn't that bad; you can use the following Terraform file:
```HCL
resource "aws_elasticache_subnet_group" "airflow" {
  name       = "airflow-redis-subnet-group"
  subnet_ids = aws_subnet.airflow[*].id
  tags = merge(
    {
      name = "airflow-redis-subnet-group"
    },
    local.common_tags
  )
}

resource "aws_security_group" "airflow_redis" {
  name_prefix = "airflow-redis"
  vpc_id      = data.aws_vpc.this.id
  tags = merge(
    {
      Name = "airflow-redis"
    },
    local.common_tags
  )
}

resource "aws_security_group_rule" "airflow_redis_inbound" {
  type              = "ingress"
  from_port         = 6379
  to_port           = 6379
  protocol          = "tcp"
  cidr_blocks       = [data.aws_vpc.this.cidr_block]
  security_group_id = aws_security_group.airflow_redis.id
  description       = "Allow Redis from internal network"
}

resource "aws_elasticache_cluster" "airflow" {
  cluster_id           = "airflow"
  engine               = "redis"
  node_type            = "cache.t4g.small"
  num_cache_nodes      = 1
  parameter_group_name = "default.redis5.0"
  engine_version       = "5.0.6"
  port                 = 6379
  subnet_group_name    = aws_elasticache_subnet_group.airflow.name
  security_group_ids   = [aws_security_group.airflow_redis.id]
  tags = merge(
    {
      name = "airflow-redis-server"
    },
    local.common_tags
  )
}

resource "aws_security_group_rule" "airflow_redis_outbound" {
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.airflow_redis.id
}
```
And similarly, we will set up RDS for Airflow as well:
```HCL
# Security Groups
resource "aws_security_group" "airflow_rds" {
  lifecycle {
    create_before_destroy = true
  }
  name_prefix = "airflow-rds-default-"
  description = "Allow TLS inbound traffic and all outbound traffic for airflow"
  vpc_id      = data.aws_vpc.this.id
  tags = {
    Name = "airflow-rds-default"
  }
}

resource "aws_security_group_rule" "airflow_rds_inbound" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = [data.aws_vpc.this.cidr_block]
  security_group_id = aws_security_group.airflow_rds.id
  description       = "Allow all from internal network"
}

resource "aws_db_subnet_group" "airflow" {
  name       = "postgres-airflow"
  subnet_ids = aws_subnet.airflow[*].id
}

resource "aws_db_instance" "airflow" {
  db_name                    = "any db name"
  apply_immediately          = true
  allocated_storage          = "100"
  storage_type               = "gp3"
  engine                     = "postgres"
  engine_version             = "17.2"
  auto_minor_version_upgrade = true
  instance_class             = "db.t4g.micro"
  username                   = "airflow"
  password                   = data.aws_kms_secrets.airflow.plaintext["db_password"]
  multi_az                   = false
  publicly_accessible        = false
  deletion_protection        = false
  skip_final_snapshot        = true
  identifier                 = "airflow"
  vpc_security_group_ids     = [aws_security_group.airflow_rds.id]
  db_subnet_group_name       = aws_db_subnet_group.airflow.name
}
```
Go ahead and create all of these resources using Terraform as well!
The ENV Configurations to make all of this work
To make Airflow work properly in ECS with CeleryExecutor, several environment variables are required for logging, task execution, database connections, Redis as the message broker, and external integrations. These are defined in Terraform locals and passed into the Airflow containers; a consolidated sketch follows the breakdown below.
1️⃣ Core Airflow Configuration
- Instance Name:
— `"AIRFLOW__WEBSERVER__INSTANCE_NAME" = "airflow-webserver"`
— Helps identify the webserver instance.
- Executor:
— `"AIRFLOW__CORE__EXECUTOR" = "CeleryExecutor"`
— Uses CeleryExecutor to distribute tasks across multiple workers instead of running them sequentially in a single instance.
- Database Connection:
— `"AIRFLOW__CORE__SQL_ALCHEMY_CONN"`
— Connects to PostgreSQL, using credentials stored in AWS KMS secrets.
- Load Examples:
— `"AIRFLOW__CORE__LOAD_EXAMPLES" = "True"`
— Controls whether example DAGs should be loaded.
2️⃣ Logging Configuration (AWS CloudWatch & S3)
- Log Level:
— `"AIRFLOW__LOGGING__LOGGING_LEVEL" = "DEBUG"`
— Enables verbose logging for debugging.
- Remote Logging to CloudWatch:
— `"AIRFLOW__LOGGING__REMOTE_LOGGING" = "True"`
— `"AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID" = "aws_conn"`
— `"AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER" = "s3://abc"`
— Stores logs in S3 and CloudWatch, making them accessible even if containers restart.
3️⃣ Celery & Redis Configuration (Message Queue & Task Result Storage)
- Message Queue (Redis):
— `"AIRFLOW__CELERY__BROKER_URL" = "redis://${aws_elasticache_cluster.airflow.cache_nodes[0].address}:6379/0"`
— Celery uses Redis for task queuing (yet to be set up, another source of pain).
- Task Result Storage (PostgreSQL):
— `"AIRFLOW__CELERY__RESULT_BACKEND" = "db+postgresql://airflow:${data.aws_kms_secrets.airflow.plaintext["db_password"]}@${aws_db_instance.airflow.endpoint}/airflow"`
— Task execution results are stored in PostgreSQL, ensuring persistence.
- Celery Transport Options:
— `"AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT" = "1800"`
— Ensures tasks are not marked as failed too soon.
4️⃣ SMTP (Email Alerts for DAG Failures & Notifications)
- SMTP Configuration:
— `"AIRFLOW__SMTP__SMTP_HOST" = "d"`
— `"AIRFLOW__SMTP__SMTP_MAIL_FROM" = "abc@email.com"`
— `"AIRFLOW__SMTP__SMTP_PORT" = "587"`
— `"AIRFLOW__SMTP__SMTP_SSL" = "True"`
— Used for sending failure notifications via email.
5️⃣ AWS-Specific Configurations
- Region Setting:
— `"AWS_DEFAULT_REGION" = local.region`
— Ensures Terraform and Airflow components run in the correct AWS region.
- Fluent Bit Logging for Observability:
— Uses Fluent Bit (`aws-for-fluent-bit:stable`) for log collection.
6️⃣ External Integrations (GitHub & AWS Secrets Manager)
- GitHub Connection (Airflow Providers):
— `"AIRFLOW__PROVIDERS__GITHUB__GITHUB_CONN_ID" = "github_default"`
— `"AIRFLOW__PROVIDERS__GITHUB__ACCESS_TOKEN" = data.aws_kms_secrets.airflow.plaintext["github_token"]`
— Enables Airflow DAGs to interact with GitHub APIs.
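Putting it all together, here's a condensed sketch of the locals block that feeds these values into the containers. The S3 bucket, SMTP host, and connection IDs are the same placeholders as above; adapt them to your environment:
```HCL
locals {
  airflow_env = {
    "AIRFLOW__WEBSERVER__INSTANCE_NAME" = "airflow-webserver"
    "AIRFLOW__CORE__EXECUTOR"           = "CeleryExecutor"
    "AIRFLOW__CORE__LOAD_EXAMPLES"      = "True"
    "AIRFLOW__CORE__SQL_ALCHEMY_CONN"   = "postgresql+psycopg2://airflow:${data.aws_kms_secrets.airflow.plaintext["db_password"]}@${aws_db_instance.airflow.endpoint}/airflow"

    "AIRFLOW__LOGGING__LOGGING_LEVEL"          = "DEBUG"
    "AIRFLOW__LOGGING__REMOTE_LOGGING"         = "True"
    "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID"     = "aws_conn"
    "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER" = "s3://abc"

    "AIRFLOW__CELERY__BROKER_URL"     = "redis://${aws_elasticache_cluster.airflow.cache_nodes[0].address}:6379/0"
    "AIRFLOW__CELERY__RESULT_BACKEND" = "db+postgresql://airflow:${data.aws_kms_secrets.airflow.plaintext["db_password"]}@${aws_db_instance.airflow.endpoint}/airflow"

    "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT" = "1800"

    "AIRFLOW__SMTP__SMTP_HOST"      = "d"
    "AIRFLOW__SMTP__SMTP_MAIL_FROM" = "abc@email.com"
    "AIRFLOW__SMTP__SMTP_PORT"      = "587"
    "AIRFLOW__SMTP__SMTP_SSL"       = "True"

    "AWS_DEFAULT_REGION" = local.region

    "AIRFLOW__PROVIDERS__GITHUB__GITHUB_CONN_ID" = "github_default"
    "AIRFLOW__PROVIDERS__GITHUB__ACCESS_TOKEN"   = data.aws_kms_secrets.airflow.plaintext["github_token"]
  }

  # ECS container definitions expect a list of {name, value} objects:
  airflow_env_list = [for k, v in local.airflow_env : { name = k, value = v }]
}
```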
Pain Points 😭
- Setting up Redis for Celery is a huge pain because of networking and IAM role issues.
- Debugging log storage in S3 & CloudWatch while handling permissions is frustrating.
- Managing AWS Secrets Manager & KMS decryption for credentials adds complexity.
- Auto-scaling workers based on Redis queue depth & CPU/memory usage needs fine-tuning.
Okay, yes, let's finally move the DAGs now!
Now that the Airflow infrastructure is mostly set up (minus the Redis pain 😭), it’s time to move our DAGs. Instead of mounting them dynamically, we are baking them directly into the Docker image. This ensures that every container running the Airflow scheduler or worker has the DAGs preloaded without relying on external storage.
1️⃣ How We Bake DAGs into the Docker Image
In our Dockerfile (which we wrote earlier), we add the DAGs by copying them into the `/opt/airflow/dags` directory inside the container.
2️⃣ Why This Approach?
- ✅ No need for external DAG storage (like S3, EFS, or Git sync).
- ✅ Ensures version control — DAGs are part of the Docker build process, so each deployment gets a known DAG version.
- ✅ Simplifies deployments — no extra steps to copy DAGs at runtime.
3️⃣ Building & Pushing the Docker Image
Once the DAGs are added, we build and push the image:
```sh
docker build -t airflow-custom:latest .
docker tag airflow-custom:latest <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/airflow:latest
docker push <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/airflow:latest
```
4️⃣ Updating ECS to Use the New Image
Since we bake the DAGs into the image, we just update ECS to pull the latest image, and the DAGs will be there.
```sh
aws ecs update-service \
  --cluster airflow-cluster \
  --service airflow-scheduler \
  --force-new-deployment
```
This triggers a rolling restart of the scheduler, ensuring that the new DAGs are loaded; do the same for the worker (and webserver) services so every component runs the same image.
Pain & Next Steps 😭
- DAG Debugging: If a DAG has syntax errors, ECS will restart the scheduler in a loop until it's fixed.
- Hot Reloading? Baking DAGs means redeploying on every DAG update — fine for now, but we might add a mounted volume or Git sync later.
- Testing DAGs Before Baking: To avoid bad deployments, we should test DAGs locally before adding them to the image.
Final Push: DAGs Are Moving In! 🏠
DAGs are now inside the container, meaning no runtime copying, no missing DAG issues, and one less thing to worry about — until the next fire starts. 🔥
And Now I Rest on a Pile of Blood and Bodies ⚰️
It took two brutal days, but we slowly closed every Glue job, one by one, like a sniper picking off targets. Each shutdown was met with anticipation — would Airflow take over without issues, or would we be diving into yet another debugging nightmare?
With each transition, we watched the DAGs spin up, monitored task executions, and prayed that Celery wouldn’t betray us. Logs were combed through, retries were tweaked, and countless cups of coffee were consumed.
And finally, at the end of this war, the numbers spoke for themselves:
🚀 96% cost reduction in Glue job expenses.
🔥 Airflow fully operational, with tasks running efficiently across ECS workers.
💀 Redis survived (barely), after nearly making us lose our minds.
This migration wasn’t just a deployment; it was a battle of wills, and somehow, against all odds, we came out victorious. Now, as the dust settles, I take a breath and rest — not because the war is over, but because the next battle is just around the corner.