How to deploy production-grade bioinformatics pipelines using AWS HealthOmics?

What is AWS HealthOmics?

AWS HealthOmics is a HIPAA-eligible service that accelerates clinical diagnostic testing, drug discovery, and agriculture research by fully managing the complex infrastructure behind your bioinformatics workflows. HealthOmics supports industry-standard workflow languages (WDL, Nextflow, CWL) and seamlessly scales bioinformatics infrastructure to support data from tens of thousands of tests per day — all with predictable cost per-sample. HealthOmics handles the technical complexities like managing compute resources and maintaining workflow engines so you can focus entirely on scientific breakthroughs.

Primary use cases include:

Clinical Diagnostics
Drug Discovery
Agricultural Research

Key benefits are:

Scale without complexity
Focus on science, not infrastructure
Built-in compliance features

Pre-requisites for running the bioinformatics workflow on AWS HealthOmics.

There are a total of three prerequisites for deploying the workflow on AWS HealthOmics:

Create an S3 bucket to store reads and to collect the outputs.
Create an ECR repo to hold the Docker images of the tools that would be used in the workflow.
Create an IAM service role for HealthOmics that would allow it to access S3 and ECR.

You can refer to this demo Nextflow pipeline or you can use your own.

How to deploy a Nextflow Pipeline on AWS HealthOmics?

Step 1: Go to the HealthOmics homepage on AWS

Search for HealthOmics in the search bar and navigate to the Private workflows section on the left-hand side.

Note: AWS HealthOmics also provides Ready2Run workflows, which you can run directly on your data. The costing for each of those workflows is provided along with them.

Also, under the Storage section, you can see: Reference Stores and Sequence Stores. You can use these stores (i.e., databases) to store your sequences and references. However, for this tutorial, we will stick to S3 buckets.

Step 2: Create a workflow

Scroll down and simply click on the Create Workflow button.

Step 3: Creating the workflow

Creating a bioinformatics workflow in AWS HealthOmics is a straightforward four-step process.

Note: All the fields marked as optional can be skipped. They are as follows:‍‍

Step I: Define the workflow

Provide a name for your workflow and a description (optional).

Next, choose the Workflow language. Although this is optional (as HealthOmics would automatically identify it), it is better to select it to reduce latency.

Then, we have the option to select the workflow definition source. It can be any one of the following: S3, a git repository, or direct upload from local source. We shall proceed forward with S3 as all of our data is stored in the S3 bucket.

Next, we must select dynamic storage as we don’t have a very high resource requirement.

3(i)(3). Default run storage configuration.

We will leave all the other optional fields blank.
The first step towards building a bioinformatics workflow in AWS HealthOmics is complete!

Step II: Add workflow parameters — optional

Although this is an optional step, we shall pass the path to the reads (on S3) via parameters, as our pipeline expects a reads parameter that contains the location of the input reads.

Step III: Container URI remapping — optional

We will choose the source of the mapping file as None (skip remapping). This is because we have already uploaded the Docker images of the tools used to our ECR repo, and we have embedded those links into the source code itself.

Step IV: Review and create

This is just an overview of our configuration. You can edit the fields here and cross-check if anything seems out of place. We are all set to go; you can just scroll to the bottom of the page and click on the Create workflow button.

Congratulations, you just created (probably your first) Nextflow workflow on AWS HealthOmics 🎉 .

HealthOmics might take a few minutes to get everything up and running, so don’t panic if you see such screens for a few minutes.

3(iv)(1). AWS HealthOmics is creating your workflow.

Once the workflow is ready to run, HealthOmics will display a pop-up:

3(iv)(2). AWS HealthOmics successfully created the workflow.

Step 4: Running the workflow

Once your workflow is ready, you can run it with just a single click!

Start run button for running the workflow.

Now, running the workflow is again divided into four steps.

Step I: Specify run details

We need to select the Owned workflow as we just created one. You can provide any name to this particular run. Also, you can provide a run priority; we would be omitting that for now. Lastly, and once again, we choose the run storage type as Dynamic, as our workflow isn’t very resource-heavy.

Then, we need to provide the output destination for our workflow, which would again be an S3 bucket.
Also, we need to create an IAM Service which would allow access of S3 to HealthOmics.

4(i)(2). S3 output destination and Service role.

Step II: Add parameter values

Next, we would need to provide the values for the parameters we just set. We will choose the manual option as we have only one parameter. We need to provide the path to our input file here (i.e., the reads. fastq file)

Step III: Add run group, run cache, and tags

This is a completely optional step; it is meant to help organize and optimize industry-grade workflows. We won’t trouble ourselves with this now.

Step IV: Review and start running

Once again, we have the option to edit and cross-check our run configuration. We can safely proceed with the start run.

4(iv)(1). AWS HealthOmics is preparing to run your workflow.

4(iv)(2). AWS HealthOmics is running your workflow.

Once the run is successful, you’ll receive the following pop-up:

4(iv)(3). Your workflow was executed successfully!

You can view the Run Summary at the bottom of the page:

4(iv)(4). Run tasks for your private workflows.

Congratulations 🥳 Your workflow run is complete and successful, the outputs will be present in the S3 bucket (destination provided earlier during Run Configuration).