
If you’ve ever tried running Nextflow workflows in the cloud, you know the struggle is real. Local machines? Too slow. Manual scaling? A nightmare. That’s why I decided to dive headfirst into running Nextflow on Amazon EKS (Elastic Kubernetes Service).
In this guide, I’ll walk you through exactly how I set up a Nextflow environment on EKS, complete with spot instances for cost savings and secure S3 access using Pod Identity.
Before we dive into the “why EKS?” question, let’s talk about what Nextflow actually is.
Nextflow is a workflow orchestration engine designed for data-intensive computational pipelines. Think of it as a sophisticated task manager that can:
- chain hundreds of interdependent tasks into a single pipeline
- run those tasks in parallel across whatever compute you point it at
- resume from the last successful step when something fails, instead of starting over
Originally built for genomics (where a single analysis might have 20+ steps and run for days), Nextflow is now popular in data science, ML pipelines, and any field dealing with complex data processing.
The key feature? Nextflow utilises containers (such as Docker) for each task, ensuring your analysis is reproducible; it’ll run the same way on a laptop as it does on a cloud cluster.
Why EKS for Nextflow?
Before we jump in, let’s talk about why this setup makes sense.
Nextflow is fantastic for orchestrating complex computational workflows, but it needs resources, and lots of them. Running everything locally or on a single EC2 instance quickly becomes a bottleneck. Kubernetes, on the other hand, gives us:
- elastic scaling, so worker pods spin up and down with demand
- efficient packing of tasks onto nodes
- self-healing when a pod or node falls over
EKS takes all of this and makes it managed, which means less time wrestling with control planes and more time getting actual work done.
Here’s what we’re building:
- an EKS cluster backed by a managed spot node group
- EKS Pod Identity for keyless S3 access from the pods
- a Kubernetes Job running the Nextflow head pod, which clones the pipeline from GitHub and spawns a worker pod per task
- an S3 bucket holding inputs, intermediate work files, and results
Let’s break it down phase by phase.
Before we dive into building the cluster, we need to make sure our local environment is set up properly. You’ll need two essential tools: the AWS CLI and kubectl.
If you haven’t already, install and configure the AWS CLI. This is how you’ll interact with AWS services from your terminal.
Install the AWS CLI (if needed):
# For Linux
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# For macOS
brew install awscli
Now configure it with your credentials:
aws configure
You’ll be prompted for your access key ID, secret access key, default region (use us-east-1 for this guide), and output format (json works great).
Pro tip: If you’re working with multiple AWS accounts or profiles, use aws configure --profile <profile-name> to keep things organized.
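To confirm the credentials actually work, ask AWS who you are:
aws sts get-caller-identity
If this prints your account ID and ARN, you’re set.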
kubectl is your command-line tool for interacting with Kubernetes clusters. You absolutely need this.
For Linux:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
For Mac:
brew install kubectl
Verify the installation:
kubectl version --client
While we’re at it, let’s install eksctl too. We’ll need it for creating the cluster:
# For Linux
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
# For macOS
brew install eksctl
Alright, tools ready!
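One last sanity check that everything landed on your PATH:
aws --version
eksctl version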
Setting Up the EKS Cluster
First, we need a cluster. I’m using eksctl because it handles a lot of the heavy lifting for us.
eksctl create cluster \
--name nextflow-eks-cluster \
--region us-east-1 \
--nodegroup-name spot-nodes \
--instance-types t3.medium \
--managed \
--spot \
--nodes 1 \
--with-oidc \
--vpc-nat-mode Disable
A quick note on that last flag: I disabled NAT Gateway here to keep costs down during testing. In production, you’ll want to enable it for better security and connectivity.
The --spot flag is doing the real magic here. It tells EKS to use spot instances, which can be dramatically cheaper than on-demand (AWS advertises discounts of up to 90%). Perfect for batch workloads where a little interruption risk is acceptable.
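Once eksctl finishes, it updates your kubeconfig automatically, so you can immediately confirm the node registered and really is a spot instance (EKS managed node groups set a capacityType label on each node):
kubectl get nodes --label-columns=eks.amazonaws.com/capacityType
The CAPACITYTYPE column should read SPOT.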
Installing Pod Identity Agent
Now here’s where things get interesting. Instead of creating IAM users and storing access keys, we’re using EKS Pod Identity. This lets our pods assume IAM roles directly.
eksctl create addon \
--cluster nextflow-eks-cluster \
--name eks-pod-identity-agent \
--region us-east-1
This add-on runs as a DaemonSet on your cluster and handles all the credential magic behind the scenes.
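To confirm the agent is actually running, look for its DaemonSet in kube-system (the name below matches the add-on’s default):
kubectl get daemonset eks-pod-identity-agent -n kube-system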
S3 Bucket Setup
Head over to the S3 console and create a bucket. Inside it, create a folder called inputs. This is where we’ll store our input data.
For this example, I’m using a FASTQ file (bioinformatics folks will know what this is). You can grab a sample dataset from public genomics repositories if you want to follow along.
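If you’d rather stay in the terminal, the same setup looks like this; the bucket name is a placeholder, and since S3 has no real folders, uploading to the inputs/ prefix creates it implicitly:
aws s3 mb s3://<your-bucket-name> --region us-east-1
aws s3 cp dataset.fastq s3://<your-bucket-name>/inputs/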
IAM Permissions
We need to create a role that allows our Nextflow pods to read input data from S3 and write intermediate work files and results back to the bucket. A Pod Identity association ties that role to the Kubernetes service account we’ll create in a moment:
eksctl create podidentityassociation \
--cluster nextflow-eks-cluster \
--namespace nextflow \
--service-account-name nextflow-sa \
--role-name nextflow-pod-role \
--permission-policy-arns arn:aws:iam::aws:policy/AmazonS3FullAccess \
--region us-east-1
You can also do this through the AWS console. (AmazonS3FullAccess is convenient for a demo; in production, scope the policy down to just your bucket.)
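To double-check the association exists:
aws eks list-pod-identity-associations --cluster-name nextflow-eks-cluster --region us-east-1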
Now that our infrastructure is ready, let’s configure Kubernetes to handle our Nextflow workloads.
Creating the Namespace and Service Account
Namespaces keep things organized. Let’s create one specifically for Nextflow:
kubectl create namespace nextflow
kubectl create serviceaccount nextflow-sa --namespace nextflow
RBAC: Giving Permissions
RBAC (Role-Based Access Control) is how we tell Kubernetes what our Nextflow head pod is allowed to do. We need it to create worker pods, check their status, and clean them up when done.
Save this as rbac.yaml:
# Define what the role can do
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nextflow-role
  namespace: nextflow
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "pods/status"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
# Bind the role to our service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nextflow-rolebinding
  namespace: nextflow
subjects:
  - kind: ServiceAccount
    name: nextflow-sa
    namespace: nextflow
roleRef:
  kind: Role
  name: nextflow-role
  apiGroup: rbac.authorization.k8s.io
Apply it:
kubectl apply -f rbac.yaml
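A quick way to verify the binding took effect is to ask Kubernetes what the service account is allowed to do, impersonating it with kubectl auth can-i:
kubectl auth can-i create pods --as=system:serviceaccount:nextflow:nextflow-sa -n nextflow
You should see a simple yes.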
SSH Keys for GitHub Access
Since we’re pulling our Nextflow pipeline from GitHub, we need authentication. SSH keys are the way to go.
Spoiler alert: If your GitHub repo is public, you can skip this entire SSH key setup and use HTTPS URLs to clone. But for private repos, SSH keys are essential.
On your local machine, generate a new key pair:
ssh-keygen -t ed25519 -C "nextflow-eks-pipeline"
When prompted, save it to a specific location, like ~/.ssh/nextflow_eks_key (or just hit Enter to use the default location). Don't set a passphrase; Kubernetes won't be able to use a password-protected key.
Now store the private key as a Kubernetes secret:
kubectl create secret generic git-ssh-key \
--from-file=ssh-privatekey=/path/to/your/private/key \
-n nextflow
Take the public key you just generated and add it to your GitHub repository:
- In your repo, go to Settings → Deploy keys → Add deploy key
- Paste the contents of the public key file (cat ~/.ssh/nextflow_eks_key.pub)
- Leave “Allow write access” unchecked; read-only is enough for cloning
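With the deploy key saved, you can sanity-check it from your machine (using whatever key path you chose earlier):
ssh -i ~/.ssh/nextflow_eks_key -T git@github.com
GitHub will refuse shell access, but a successful-authentication greeting means the key works.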
This is where it all comes together. We need two things: our Nextflow pipeline code and a Kubernetes Job to run it.
The Nextflow Pipeline
For this example, we’re running a bioinformatics quality control workflow on genomic sequencing data. Specifically, two classic QC tools:
- FASTQC inspects each FASTQ file and produces a per-file quality report
- MultiQC aggregates all those FASTQC reports into a single summary
Think of it this way: FASTQC is your detailed inspector, checking each piece individually, and MultiQC is the manager who summarises everything into a single executive report.
main.nf, your pipeline definition:
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

params.reads = "s3://<your-bucket-name>/inputs/dataset.fastq"

process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'
    publishDir "${params.outdir}/fastqc", mode: 'copy'

    input:
    path reads

    output:
    tuple path("*.html"), path("*_fastqc.zip")

    script:
    """
    echo "Running FASTQC on ${reads}"
    fastqc ${reads}
    echo "FASTQC completed for ${reads}"
    """
}

process MULTIQC {
    container 'multiqc/multiqc:dev'
    publishDir "${params.outdir}/multiqc", mode: 'copy'

    input:
    path fastqc_reports

    output:
    path "multiqc_report.html"
    path "multiqc_data"

    script:
    """
    echo "Aggregating FASTQC reports with MultiQC"
    multiqc .
    echo "MultiQC report generated successfully"
    """
}

workflow {
    reads_ch = Channel.fromPath(params.reads)
    fastqc_reports_ch = FASTQC(reads_ch)
    MULTIQC(fastqc_reports_ch.collect())
}
Important notes:
- Replace <your-bucket-name> to match your S3 bucket
nextflow.config, which tells Nextflow how to run on Kubernetes:
plugins {
    id 'nf-wave'
    id 'nf-amazon'
}

wave.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true

process {
    executor = 'k8s'
    cpus = 1
    memory = '2 GB'
    maxForks = 1

    withName: 'FASTQC' {
        container = 'biocontainers/fastqc:v0.11.9_cv8'
    }
    withName: 'MULTIQC' {
        container = 'multiqc/multiqc:dev'
    }
}

k8s {
    namespace = 'nextflow'
    serviceAccount = 'nextflow-sa'
}
Push this to your GitHub repo.
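If you have Nextflow installed locally (and a reasonably recent release, since -preview is a newer flag), you can catch syntax mistakes before pushing:
nextflow run main.nf -c nextflow.config -preview
The -preview flag builds and validates the workflow without actually executing any processes.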
The Kubernetes Job
Now for the piece that ties everything together. But first, let’s talk about why we’re using a Kubernetes Job instead of a Deployment or a plain Pod.
Why a Job?
Kubernetes gives us several ways to run workloads, and choosing the right one matters:
- A Deployment keeps pods running indefinitely and restarts them whenever they exit. Great for services, wrong for a pipeline that should finish.
- A plain Pod runs once, but gives you no retry semantics and no automatic cleanup.
- A Job runs to completion, can retry on failure (via backoffLimit), and can automatically clean itself up when done (via ttlSecondsAfterFinished).
For Nextflow pipelines, we want something that runs once, finishes cleanly, and tidies up after itself.
That’s exactly what a Job gives us.
Save this as nextflow-job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  namespace: nextflow
  generateName: "nextflow-job-"
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: nextflow-sa
      restartPolicy: Never
      initContainers:
        - name: fetch-pipeline
          image: alpine/git:latest
          env:
            - name: BRANCH
              value: "main" # Change to your branch name
          command:
            - sh
            - -c
            - |
              set -e
              mkdir -p /root/.ssh
              cp /var/ssh/ssh-privatekey /root/.ssh/id_rsa
              chmod 600 /root/.ssh/id_rsa
              ssh-keyscan github.com >> /root/.ssh/known_hosts
              git clone -b "${BRANCH}" git@github.com:<your-username>/<your-repo>.git /workspace
          volumeMounts:
            - name: workspace
              mountPath: /workspace
            - name: git-ssh-key
              mountPath: /var/ssh
              readOnly: true
      containers:
        - name: nextflow
          image: nextflow/nextflow:25.12.0-edge
          env:
            - name: NXF_PLUGINS_DEFAULT
              value: "nf-amazon,nf-wave,nf-k8s"
            - name: JOB_NAME
              value: "fastqc-analysis"
          command:
            - sh
            - -c
            - |
              set -e
              cd /workspace
              WORK_DIR="s3://<your-bucket-name>/work/${JOB_NAME}"
              RESULTS_DIR="s3://<your-bucket-name>/results/${JOB_NAME}"
              echo "Work directory: $WORK_DIR"
              echo "Results directory: $RESULTS_DIR"
              exec nextflow run main.nf \
                -c nextflow.config \
                --outdir $RESULTS_DIR \
                -work-dir "$WORK_DIR" \
                -resume
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          emptyDir: {}
        - name: git-ssh-key
          secret:
            secretName: git-ssh-key
            defaultMode: 0400
Important notes:
- Replace <your-username>/<your-repo>.git to match your GitHub repo
- Replace <your-bucket-name> to match your S3 bucket
- The generateName field means each run creates a unique job
Running the Pipeline
Here’s the fun part. Launch your pipeline:
kubectl create -f nextflow-job.yaml
Note: We use kubectl create instead of kubectl apply because generateName creates a new job each time.
Monitoring Your Workflow
Watch the job spin up:
kubectl get jobs -n nextflow
Check the logs in real-time:
kubectl logs job/<job-name> -n nextflow -f
You’ll see Nextflow do its thing, spawning worker pods, running tasks, and saving results to S3.
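If you want to see those worker pods come and go yourself, watch the pod list:
kubectl get pods -n nextflow -w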
Checking Your Results
Once everything completes, head to S3. You should see your FASTQC HTML reports and MultiQC summary. Beautiful!
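Or list the outputs straight from the terminal (same bucket placeholder as before):
aws s3 ls s3://<your-bucket-name>/results/ --recursive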
Cleanup
The ttlSecondsAfterFinished: 300 setting in our Job spec means Kubernetes automatically deletes completed jobs after 5 minutes. This keeps your cluster clean.
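When you’re done experimenting, remember to tear down the cluster itself too, so you aren’t paying for idle nodes:
eksctl delete cluster --name nextflow-eks-cluster --region us-east-1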
Running Nextflow on EKS gives you the best of both worlds: the power and flexibility of Nextflow with the scalability and reliability of Kubernetes. Sure, there’s a learning curve, but once you have this foundation in place, you can run virtually any computational workflow efficiently and cost-effectively.
If you found this helpful, consider giving it a clap or sharing it with your DevOps and bioinformatics friends. And if you run into issues or have questions, drop a comment below. I’m always happy to help troubleshoot.
Happy Hosting!