Using a biobox Docker image

Bioboxes aims to make it much simpler for anyone to use the most recent advances in bioinformatics software. This page gives a short example of using a biobox genome assembler. The guide illustrates how bioboxes work, and the same steps apply to any application for which a biobox exists, not only genome assembly.

This tutorial uses real sequencing data so that you can run the example biobox as you would with your own data. The data is a FASTQ file of Illumina reads from a real genome that was sequenced at the JGI. The file can be downloaded using wget.

$ mkdir input_data
$ wget \
    --output-document input_data/reads.fq.gz \
    'https://www.dropbox.com/s/uxgn6cqngctqv74/reads.fq.gz?dl=1'
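Before going further it can be worth checking that the download completed. The following is a minimal sketch, not part of the bioboxes tooling: check_fastq_gz is a hypothetical helper name, and the check only assumes that the file is gzip-compressed and that a FASTQ record begins with a header line starting with '@'.

```shell
# Hypothetical helper: check that a gzipped FASTQ file is intact
# and that its first line looks like a FASTQ header.
check_fastq_gz() {
    file="$1"
    # gzip -t tests the integrity of the compressed file without extracting it.
    gzip -t "$file" || { echo "corrupt or incomplete: $file" >&2; return 1; }
    # Read only the first line of the decompressed data.
    first_line="$(gzip -dc "$file" | head -n 1)"
    case "$first_line" in
        @*) echo "ok: $file" ;;
        *)  echo "not FASTQ: $file" >&2; return 1 ;;
    esac
}

# Usage, once the download has finished:
#   check_fastq_gz input_data/reads.fq.gz
```

If the helper reports a corrupt or incomplete file, downloading it again is usually the simplest fix.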

Create a biobox.yaml file

The inputs to a biobox are specified using a file named biobox.yaml. An example file where we specify this data is:

---
version: "0.9.0"
arguments:
  - fastq:
    - id: "test_reads"
      type: "paired"
      value: "/bbx/input/reads.fq.gz"

In this file we specify the current bioboxes version, 0.9.0, along with the arguments to the biobox. In this case we're giving a single FASTQ file. This argument has the identifier test_reads, and the type is paired because the reads are paired sequencing data. The value field specifies the location of the file. In the biobox.yaml file this is in the directory /bbx/input/, which is where the reads will be available inside the biobox container.

This biobox.yaml file can be created as follows:

cat << EOF > input_data/biobox.yaml
---
version: "0.9.0"
arguments:
  - fastq:
    - id: "test_reads"
      type: "paired"
      value: "/bbx/input/reads.fq.gz"
EOF
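Each value in biobox.yaml is a path inside the container, so a path under /bbx/input/ should correspond to a file in the directory you mount there. The sketch below checks this on the host before running anything. It is an illustration only: check_biobox_inputs is a hypothetical helper, it assumes each value line is written as in the example above (double-quoted, one per line), and it assumes the paths contain no spaces.

```shell
# Hypothetical helper: map each container path under /bbx/input/ listed in
# biobox.yaml to the matching file in the host input directory, and report
# whether that file exists.
check_biobox_inputs() {
    input_dir="$1"
    status=0
    # Extract the quoted path from every 'value: "..."' line.
    for container_path in $(sed -n 's/^ *value: *"\(.*\)"$/\1/p' "$input_dir/biobox.yaml"); do
        # Strip the container prefix to get the file name relative to input_dir.
        host_path="$input_dir/${container_path#/bbx/input/}"
        if [ -f "$host_path" ]; then
            echo "found: $host_path"
        else
            echo "missing: $host_path" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_biobox_inputs input_data
```

A "missing" message usually means the file name in biobox.yaml does not match the name of the file in input_data.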

Run the biobox

The input data and biobox.yaml file are all that's required to test a biobox. Run the following command to use the velvet biobox to assemble the test reads:

mkdir -p output_data
docker run \
  --volume="$(pwd)/input_data:/bbx/input:ro" \
  --volume="$(pwd)/output_data:/bbx/output:rw" \
  --rm \
  bioboxes/velvet \
  default

This uses the $(pwd) syntax. If you are unfamiliar with it, the command pwd prints the current working directory, and the construct $(...) replaces itself with the output of the command inside the parentheses. Therefore $(pwd) is replaced with the directory you are currently in. This is necessary because the --volume flags require the full directory path.
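To see what the shell passes to the --volume flags, you can print the expanded paths yourself. This is a small illustration only; the variable names input_dir and output_dir are chosen for this example and are not used by bioboxes.

```shell
# Command substitution: $(pwd) is replaced by the absolute path of the
# current working directory before the command line is run.
input_dir="$(pwd)/input_data"
output_dir="$(pwd)/output_data"

# Print the full paths that would be mounted into the container.
echo "$input_dir"
echo "$output_dir"
```

Both printed paths begin with /, which is the full path form the --volume flag needs.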

The --volume flag links a directory on your computer to a directory inside the biobox. In the example, the first flag mounts the directory input_data at /bbx/input. The ro is an abbreviation of read-only, which means data can only be read from the directory. You will generally want to use ro for your input data to prevent a biobox from accidentally changing it. The second flag mounts output_data at /bbx/output inside the biobox. This is the location where the results will be created. The rw means read-write and allows the biobox to write to this location.

The --rm flag specifies that the biobox container should be removed after it has finished running. If you don't specify this, a finished container is left on your computer each time you run a biobox. If enough of them accumulate, you will run out of disk space.