Using a biobox Docker image
Bioboxes aims to make it much simpler for anyone to use the most recent advances in bioinformatics software. This page gives a short example of using a biobox genome assembler. The guide illustrates how bioboxes work, and the same steps apply to any application for which a biobox exists, not only genome assembly.
This tutorial uses real sequencing data so that the example biobox can be
run as you would run it with your own data. The data is a FASTQ file of
Illumina reads from a real genome that was sequenced at the JGI. This file
can be downloaded using wget:
$ mkdir input_data
$ wget \
--output-document input_data/reads.fq.gz \
'https://www.dropbox.com/s/uxgn6cqngctqv74/reads.fq.gz?dl=1'
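Before moving on, it can be worth confirming that the download completed. The following is a minimal sketch, assuming gzip is installed and that each FASTQ record spans exactly four lines. The function name count_fastq_reads is hypothetical and not part of bioboxes.

```shell
# Hypothetical helper: verify a gzipped FASTQ file and print its record count.
# Assumes each FASTQ record spans exactly four lines.
count_fastq_reads() {
  file="$1"
  # gzip -t exits non-zero if the archive is truncated or corrupt.
  gzip -t "$file" || return 1
  # Decompress to stdout and count lines without writing to disk.
  lines=$(gzip -dc "$file" | wc -l)
  echo $((lines / 4))
}

# Only run the check if the tutorial file has been downloaded.
if [ -f input_data/reads.fq.gz ]; then
  count_fastq_reads input_data/reads.fq.gz
fi
```

If the function reports an error instead of a number, the download was probably interrupted and should be repeated.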
Create a biobox.yaml file
The inputs to a biobox are specified using a file named biobox.yaml. An example file that specifies this data is:
---
version: "0.9.0"
arguments:
  - fastq:
    - id: "test_reads"
      type: "paired"
      value: "/bbx/input/reads.fq.gz"
In this file we specify the current bioboxes version, 0.9.0, along with the
arguments to the biobox. In this case we're giving a single FASTQ file. This
argument has the identifier test_reads, and the type is paired because this
is the type of sequencing data. The final field, value, specifies the location
of the file. In the biobox.yaml file this is in the directory named /bbx/input/,
which is where the reads will be placed inside the biobox container.
This biobox.yaml file can be created as follows:
cat << EOF > input_data/biobox.yaml
---
version: "0.9.0"
arguments:
  - fastq:
    - id: "test_reads"
      type: "paired"
      value: "/bbx/input/reads.fq.gz"
EOF
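Because each value path refers to a location inside the container, a typo in it only shows up once the biobox runs. The sketch below assumes the host directory will be mounted at /bbx/input, as in this tutorial, and translates each value path back to the host to report whether the file exists. The function name check_biobox_inputs is hypothetical, and the sed pattern is a simple line match rather than a real YAML parser, so it will not handle paths containing spaces.

```shell
# Hypothetical helper: map each "value" path in a biobox.yaml back to the
# host directory that will be mounted at /bbx/input, and check it exists.
check_biobox_inputs() {
  host_dir="$1"
  status=0
  # Extract the quoted path from lines such as:  value: "/bbx/input/reads.fq.gz"
  for container_path in $(sed -n 's|^ *value: *"\(.*\)"|\1|p' "$host_dir/biobox.yaml"); do
    # Strip the container prefix to get the file name relative to host_dir.
    host_path="$host_dir/${container_path#/bbx/input/}"
    if [ -f "$host_path" ]; then
      echo "found: $host_path"
    else
      echo "missing: $host_path"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_biobox_inputs input_data from the tutorial directory prints one line per input file and returns a non-zero status if any file is missing.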
Run the biobox
The input data and biobox.yaml file are all that's required to test a biobox. Run the following command to use the velvet biobox to assemble the test reads:
mkdir -p output_data
docker run \
--volume="$(pwd)/input_data:/bbx/input:ro" \
--volume="$(pwd)/output_data:/bbx/output:rw" \
--rm \
bioboxes/velvet \
default
This command uses the $(pwd) syntax. If you are unfamiliar with it, the
command pwd returns the current working directory. The construct $(...)
replaces itself with the result of evaluating the contents of the
parentheses. Therefore $(pwd) will be replaced with the directory you are
currently in. This is necessary because the --volume flags require the full
directory path.
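As a quick illustration of the substitution described above, the following lines can be run in any POSIX shell. They only print paths and do not start a container. The variable name input_volume is introduced here purely for illustration.

```shell
# pwd prints the absolute path of the current working directory.
pwd

# $(pwd) is replaced by that path before the surrounding command runs,
# so this prints the full host path Docker would receive for the input volume.
input_volume="$(pwd)/input_data:/bbx/input:ro"
echo "$input_volume"
```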
The --volume flag is used to link a directory on your computer to a directory
inside the biobox. In the example you mount the directory input_data to the
directory /bbx/input. The ro suffix is an abbreviation of read-only, which
means data can only be read from the directory. You will generally want to use
ro for your input data to prevent a biobox from accidentally changing it. The
second volume mounts output_data to /bbx/output inside the biobox. This is
the location where the results will be created. The rw suffix means read-write
and allows the biobox to write to this location.
The --rm flag specifies that the biobox container should be removed after it
has finished running. If you don't specify this, a finished container is left
on your computer each time you run a biobox. Enough of these will eventually
cause you to run out of disk space.
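If containers were started without --rm, they can be inspected and cleaned up afterwards. The sketch below wraps two standard Docker CLI commands. The function names are hypothetical, and docker container prune removes every stopped container on the machine, not only bioboxes, so it is worth listing them first.

```shell
# Hypothetical helper: list containers that have exited but were not removed.
list_finished_containers() {
  docker ps --all --filter "status=exited"
}

# Hypothetical helper: delete every stopped container without prompting.
# This affects all stopped containers on the machine, not just bioboxes.
remove_finished_containers() {
  docker container prune --force
}
```

Running list_finished_containers shows what would be deleted, and remove_finished_containers then frees the space those containers occupy.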