January 29, 2026

8 min read

How We Built Scalable Bioinformatics Pipelines on AWS EKS Using Nextflow

If you’ve ever tried running Nextflow workflows in the cloud, you know the struggle is real. Local machines? Too slow. Manual scaling? A nightmare. That’s why I decided to dive headfirst into running Nextflow on Amazon EKS (Elastic Kubernetes Service).

In this guide, I’ll walk you through exactly how I set up a Nextflow environment on EKS, complete with spot instances for cost savings and secure S3 access using Pod Identity.

What is Nextflow?

Before we dive into the “why EKS” question, let’s talk about what Nextflow actually is.

Nextflow is a workflow orchestration engine designed for data-intensive computational pipelines. Think of it as a sophisticated task manager that can:

  • Chain together multiple steps: Take raw data, run it through analysis tools, and generate reports in a defined sequence
  • Handle dependencies automatically: Task B won’t run until Task A finishes successfully
  • Parallelize work: If you have 100 samples to analyze, Nextflow can process them all at once (resource permitting)
  • Resume from failures: If your pipeline crashes at step 5 out of 10, you can restart from step 5 instead of starting over

It was originally built for bioinformatics (genomics workflows can have dozens of steps and take days to run), but it’s now used in machine learning, data science, and anywhere you have complex, multi-step computational workflows.

The key feature? Nextflow uses containers (like Docker) for each task, which means your analysis is reproducible: it’ll run the same way on my laptop as it does on your cloud cluster.
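
If you want to try this out before touching the cloud, here’s a minimal local test, assuming you already have Nextflow (and Java) installed. The nextflow-io/hello project is the public demo pipeline hosted on GitHub:

# Pull the demo pipeline straight from GitHub and run it
nextflow run nextflow-io/hello
# Run it again with -resume: tasks that already completed are taken from the cache
nextflow run nextflow-io/hello -resume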

Why EKS for Nextflow?

Before we jump in, let’s talk about why this setup makes sense.

Nextflow is fantastic for orchestrating complex computational workflows, but it needs resources, and lots of them. Running everything locally or on a single EC2 instance quickly becomes a bottleneck. Kubernetes, on the other hand, gives us:

  • Auto-scaling: Spin up pods as needed, shut them down when done
  • Cost optimization: Use spot instances to cut costs by up to 90%
  • Flexibility: Run multiple workflows concurrently without stepping on each other’s toes
  • Reliability: Built-in retry logic and fault tolerance

EKS takes all of this and makes it managed, which means less time wrestling with control planes and more time getting actual work done.

The Game Plan

Here’s what we’re building:

  1. An EKS cluster optimized for batch workloads
  2. Secure S3 access using Pod Identity
  3. Kubernetes RBAC so our Nextflow head pod can manage worker pods
  4. GitHub integration using SSH keys for pipeline code management

Let’s break it down phase by phase.

Prerequisites: Getting Your Tools Ready

Before we dive into building the cluster, we need to make sure our local environment is set up properly. You’ll need two essential tools: the AWS CLI and kubectl.

Configure AWS CLI

If you haven’t already, install and configure the AWS CLI. This is how you’ll interact with AWS services from your terminal.

Install the AWS CLI (if needed):

# For Linux
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# For macOS
brew install awscli

Now configure it with your credentials:

aws configure

You’ll be prompted for:

  • AWS Access Key ID: Your IAM user access key
  • AWS Secret Access Key: Your secret key
  • Default region name: I’m using us-east-1 for this guide
  • Default output format: json works great

Pro tip: If you’re working with multiple AWS accounts or profiles, use aws configure --profile <profile-name> to keep things organized.
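
For example (the profile name below is just a placeholder, use whatever fits your setup):

# Create a named profile and confirm which identity it resolves to
aws configure --profile nextflow-demo
aws sts get-caller-identity --profile nextflow-demo
# Optionally make it the default for the current shell session
export AWS_PROFILE=nextflow-demo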

Install kubectl

kubectl is your command-line tool for interacting with Kubernetes clusters. You absolutely need this.

For Linux:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

For macOS:

brew install kubectl

Verify the installation:

kubectl version --client

Install eksctl

While we’re at it, let’s install eksctl too. We'll need it for creating the cluster:

# For Linux
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
# For macOS
brew install eksctl
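
Quick sanity check that all three tools are on your PATH (exact version numbers will vary):

aws --version
kubectl version --client
eksctl version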

Alright, tools ready!

Phase 1: Building Our Infrastructure

Setting Up the EKS Cluster

First, we need a cluster. I’m using eksctl because it handles a lot of the heavy lifting for us.

eksctl create cluster \
  --name nextflow-eks-cluster \
  --region us-east-1 \
  --nodegroup-name spot-nodes \
  --instance-types t3.medium \
  --managed \
  --spot \
  --nodes 1 \
  --with-oidc \
  --vpc-nat-mode Disable

A quick note on that last flag: I disabled the NAT gateway here to keep costs down during testing. In production, you’ll want one so worker nodes can sit in private subnets and still reach the internet securely.

The --spot flag is doing the real magic here. It tells EKS to use spot instances, which are way cheaper than on-demand instances. Perfect for batch workloads where a little interruption risk is acceptable.
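
Once the cluster is up, it’s worth confirming that kubectl is pointed at it and that the nodes really are spot capacity. eksctl normally writes your kubeconfig for you, but re-running update-kubeconfig is harmless:

# Point kubectl at the new cluster
aws eks update-kubeconfig --name nextflow-eks-cluster --region us-east-1
# The capacityType label should read SPOT for our managed node group
kubectl get nodes -L eks.amazonaws.com/capacityType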

Installing Pod Identity Agent

Now here’s where things get interesting. Instead of creating IAM users and storing access keys, we’re using EKS Pod Identity. This lets our pods assume IAM roles directly.

eksctl create addon \
  --cluster nextflow-eks-cluster \
  --name eks-pod-identity-agent \
  --region us-east-1

This add-on runs as a DaemonSet on your cluster and handles all the credential magic behind the scenes.
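
A quick way to confirm the agent is actually running:

# The add-on should show up in the list for the cluster
eksctl get addon --cluster nextflow-eks-cluster --region us-east-1
# Its pods run as a DaemonSet in kube-system
kubectl get daemonset -n kube-system | grep -i pod-identity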

S3 Bucket Setup

Head over to the S3 console and create a bucket. Inside it, create a folder called inputs. This is where we'll store our input data.

For this example, I’m using a FASTQ file (bioinformatics folks will know what this is). You can grab a sample dataset from public genomics repositories if you want to follow along.
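
If you prefer the CLI over the console, something like this does the job (bucket names are globally unique, so substitute your own; dataset.fastq matches the filename referenced in the pipeline later on):

# Create the bucket and upload an input file under the inputs/ prefix
aws s3 mb s3://<your-bucket-name> --region us-east-1
aws s3 cp dataset.fastq s3://<your-bucket-name>/inputs/dataset.fastq
# Verify the upload
aws s3 ls s3://<your-bucket-name>/inputs/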

IAM Permissions

We need to create a role that allows our Nextflow pods to:

  • Access S3 buckets
  • Create and manage other pods in the cluster

eksctl create podidentityassociation \
  --cluster nextflow-eks-cluster \
  --namespace nextflow \
  --serviceaccount nextflow-sa \
  --role-name nextflow-pod-role \
  --permission-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --region us-east-1

You can also do this through the AWS console.
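
Either way, you can confirm the association exists with eksctl:

# Lists the namespace/service-account to IAM role mappings for the cluster
eksctl get podidentityassociation --cluster nextflow-eks-cluster --region us-east-1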

Phase 2: Kubernetes Configuration

Now that our infrastructure is ready, let’s configure Kubernetes to handle our Nextflow workloads.

Creating the Namespace and Service Account

Namespaces keep things organized. Let’s create one specifically for Nextflow:

kubectl create namespace nextflow
kubectl create serviceaccount nextflow-sa --namespace nextflow

RBAC: Giving Permissions

RBAC (Role-Based Access Control) is how we tell Kubernetes what our Nextflow head pod is allowed to do. We need it to create worker pods, check their status, and clean them up when done.

Save this as rbac.yaml:

# Define what the role can do
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nextflow-role
  namespace: nextflow
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "pods/status"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]

---
# Bind the role to our service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nextflow-rolebinding
  namespace: nextflow
subjects:
  - kind: ServiceAccount
    name: nextflow-sa
    namespace: nextflow
roleRef:
  kind: Role
  name: nextflow-role
  apiGroup: rbac.authorization.k8s.io

Apply it:

kubectl apply -f rbac.yaml
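
To make sure the binding took effect, kubectl can impersonate the service account and check its permissions. Both checks should answer "yes":

kubectl auth can-i create pods \
  --as=system:serviceaccount:nextflow:nextflow-sa -n nextflow
kubectl auth can-i delete jobs \
  --as=system:serviceaccount:nextflow:nextflow-sa -n nextflow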

SSH Keys for GitHub Access

Since we’re pulling our Nextflow pipeline from GitHub, we need authentication. SSH keys are the way to go.

Spoiler alert: If your GitHub repo is public, you can skip this entire SSH key setup and use HTTPS URLs to clone. But for private repos, SSH keys are essential.

On your local machine, generate a new key pair:

ssh-keygen -t ed25519 -C "nextflow-eks-pipeline"

When prompted, save it to a specific location, like ~/.ssh/nextflow_eks_key (or just hit Enter to use the default location). Don't set a passphrase; Kubernetes won't be able to use a password-protected key.

Now store the private key as a Kubernetes secret:

kubectl create secret generic git-ssh-key \
  --from-file=ssh-privatekey=/path/to/your/private/key \
  -n nextflow
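
Double-check that the secret landed where the Job will expect it:

# Should show a single data key named ssh-privatekey
kubectl describe secret git-ssh-key -n nextflow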

Phase 3: GitHub Configuration

Take the public key you just generated and add it to your GitHub repository:

  1. Go to your repo’s Settings → Deploy keys
  2. Click Add deploy key
  3. Paste your public key
  4. Give it a descriptive name like “EKS Nextflow Runner.”
  5. Check “Allow write access” if your workflow needs to push results back (optional)
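
If you need the public key contents for step 3, print the .pub file, and once the deploy key is saved you can sanity-check authentication from your machine (this assumes you saved the key to ~/.ssh/nextflow_eks_key):

# Copy this output into the Deploy keys form
cat ~/.ssh/nextflow_eks_key.pub
# GitHub should reply that you've successfully authenticated (no shell is opened)
ssh -i ~/.ssh/nextflow_eks_key -T git@github.com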

Phase 4: The Implementation

This is where it all comes together. We need two things: our Nextflow pipeline code and a Kubernetes Job to run it.

The Nextflow Pipeline

For this example, we’re running a bioinformatics quality control workflow on genomic sequencing data. Specifically:

What we’re doing:
  • FASTQC: Analyzes our raw sequencing data (FASTQ file) and generates quality control reports. It checks things like read quality scores, sequence duplication levels, GC content, and potential contamination. Each dataset gets its own HTML report with pretty graphs.
  • MultiQC: Takes all those individual FASTQC reports and aggregates them into a single, beautiful summary report. Instead of opening 10 different HTML files, you get a comprehensive view of your data quality in one place.

Think of it this way: FASTQC is your detailed inspector, checking each piece individually, and MultiQC is the manager who summarizes everything into a single executive report.

main.nf: This is your pipeline definition:

#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.reads = "s3://<your-bucket-name>/inputs/dataset.fastq"

process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'
    publishDir "${params.outdir}/fastqc", mode: 'copy'
    
    input:
    path reads
    
    output:
    tuple path("*.html"), path("*_fastqc.zip")
    
    script:
    """
    echo "Running FASTQC on ${reads}"
    fastqc ${reads}
    echo "FASTQC completed for ${reads}"
    """
}

process MULTIQC {
    container 'multiqc/multiqc:dev'
    publishDir "${params.outdir}/multiqc", mode: 'copy'
    
    input:
    path fastqc_reports
    
    output:
    path "multiqc_report.html"
    path "multiqc_data"
    
    script:
    """
    echo "Aggregating FASTQC reports with MultiQC"
    multiqc .
    echo "MultiQC report generated successfully"
    """
}

workflow {
    reads_ch = Channel.fromPath(params.reads)
    fastqc_reports_ch = FASTQC(reads_ch)
    MULTIQC(fastqc_reports_ch.collect())
}

Important notes:

  • Update <your-bucket-name> to match your S3 bucket

nextflow.config: This tells Nextflow how to run on Kubernetes:

plugins {
    id 'nf-wave'
    id 'nf-amazon'
}

wave.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true 

process {
    executor = 'k8s'
    cpus = 1
    memory = '2 GB'
    maxForks = 1

    withName: 'FASTQC' {
        container = 'biocontainers/fastqc:v0.11.9_cv8'
    }
    withName: 'MULTIQC' {
        container = 'multiqc/multiqc:dev'
    }
}

k8s {
    namespace = 'nextflow'
    serviceAccount = 'nextflow-sa'
}

Push this to your GitHub repo.
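
From a local clone of the repo, that’s the usual routine (branch name assumed to be main, matching the Job spec below):

git add main.nf nextflow.config
git commit -m "Add FASTQC/MultiQC pipeline for EKS"
git push origin main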

The Kubernetes Job

Now for the piece that ties everything together. But first, let’s talk about why we’re using a Kubernetes Job instead of a Deployment or a plain Pod.

Why a Job?

Kubernetes gives us several ways to run workloads, and choosing the right one matters:

  • Pod: Runs once, but if it fails, it’s done. No retry logic, no cleanup guarantees. Not ideal for batch workloads.
  • Deployment: Designed for long-running services that should stay up. If a pod dies, it gets restarted automatically. Great for web servers, terrible for batch jobs that need to run once and finish.
  • Job: Perfect for batch workloads! It runs to completion, has built-in retry logic (controlled by backoffLimit), and can automatically clean itself up when done (via ttlSecondsAfterFinished).

For Nextflow pipelines, we want:

  • Run once and finish
  • Automatic cleanup after completion
  • Clear success/failure status
  • No accidental restarts

That’s exactly what a Job gives us.

Save this as nextflow-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  namespace: nextflow
  generateName: "nextflow-job-"
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: nextflow-sa
      restartPolicy: Never
      
      initContainers:
        - name: fetch-pipeline
          image: alpine/git:latest
          env:
            - name: BRANCH
              value: "main"  # Change to your branch name
          command:
            - sh
            - -c
            - |
              set -e
              mkdir -p /root/.ssh
              cp /var/ssh/ssh-privatekey /root/.ssh/id_rsa
              chmod 600 /root/.ssh/id_rsa
              ssh-keyscan github.com >> /root/.ssh/known_hosts
              git clone -b "${BRANCH}" git@github.com:<your-username>/<your-repo>.git /workspace
          volumeMounts:
            - name: workspace
              mountPath: /workspace
            - name: git-ssh-key
              mountPath: /var/ssh
              readOnly: true
      
      containers:
        - name: nextflow
          image: nextflow/nextflow:25.12.0-edge
          env:
            - name: NXF_PLUGINS_DEFAULT
              value: "nf-amazon,nf-wave,nf-k8s"
            - name: JOB_NAME
              value: "fastqc-analysis"
          command:
            - sh
            - -c
            - |
              set -e
              cd /workspace
              WORK_DIR="s3://<your-bucket-name>/work/${JOB_NAME}"
              RESULTS_DIR="s3://<your-bucket-name>/results/${JOB_NAME}"
              echo "Work directory: $WORK_DIR"
              echo "Results directory: $RESULTS_DIR"
              exec nextflow run main.nf \
                -c nextflow.config \
                --outdir $RESULTS_DIR \
                -work-dir "$WORK_DIR" \
                -resume
          volumeMounts:
            - name: workspace
              mountPath: /workspace

      volumes:
        - name: workspace
          emptyDir: {}
        - name: git-ssh-key
          secret:
            secretName: git-ssh-key
            defaultMode: 0400

Important notes:

  • Update <your-username>/<your-repo>.git to match your GitHub repo
  • Update <your-bucket-name> to match your S3 bucket
  • The generateName field means each run creates a unique job

Running the Pipeline

Here’s the fun part. Launch your pipeline:

kubectl create -f nextflow-job.yaml

Note: We use kubectl create instead of kubectl apply because generateName creates a new job each time.

Monitoring Your Workflow

Watch the job spin up:

kubectl get jobs -n nextflow

Check the logs in real-time:

kubectl logs job/<job-name> -n nextflow -f

You’ll see Nextflow do its thing, spawning worker pods, running tasks, and saving results to S3.
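
If you want to watch those worker pods come and go, keep an eye on the namespace directly:

# Worker pods appear, run to completion, and get cleaned up
kubectl get pods -n nextflow --watch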

Checking Your Results

Once everything completes, head to S3. You should see your FASTQC HTML reports and MultiQC summary. Beautiful!
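
From the CLI, listing the results prefix confirms everything landed where we expect (the paths follow the JOB_NAME and outdir values used in the Job spec):

# List everything the pipeline published
aws s3 ls s3://<your-bucket-name>/results/fastqc-analysis/ --recursive
# Pull the MultiQC summary down for a look
aws s3 cp s3://<your-bucket-name>/results/fastqc-analysis/multiqc/multiqc_report.html .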

Cleanup

The ttlSecondsAfterFinished: 300 setting in our Job spec means Kubernetes automatically deletes completed jobs after 5 minutes. This keeps your cluster clean.
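
If you don’t want to wait for the TTL, or you’re done experimenting entirely, you can also clean up by hand. Deleting the cluster removes the node group and everything running on it, so save that for when you’re truly finished:

# Remove a specific job (and its pods) right away
kubectl delete job <job-name> -n nextflow
# Tear down the whole cluster
eksctl delete cluster --name nextflow-eks-cluster --region us-east-1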

Conclusion

Running Nextflow on EKS gives you the best of both worlds: the power and flexibility of Nextflow with the scalability and reliability of Kubernetes. Sure, there’s a learning curve, but once you have this foundation in place, you can run virtually any computational workflow efficiently and cost-effectively.

If you found this helpful, consider giving it a clap or sharing it with your DevOps and bioinformatics friends. And if you run into issues or have questions, drop a comment below. I’m always happy to help troubleshoot.

Happy Hosting!