Putting everything together

The previous guides described building the individual parts of a biobox. This guide puts these parts together to create a working biobox that you can give to users, which includes parsing the biobox.yaml file.

In the previous sections we integrated the file validator and specified the tasks. We will now write a script that combines everything and serves as the entrypoint to your container. An entrypoint points to a binary inside your container that is executed on docker run. Command line arguments that follow the image name in the docker run command are passed on to the entrypoint. This means that the task in the command docker run IMAGE task is available as the first argument to your entrypoint. You can configure the entrypoint in your Dockerfile with ENTRYPOINT ["/path/to/your/script/inside/the/container"].
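The way a task reaches the entrypoint can be sketched without Docker. In this minimal sketch the function entrypoint is a hypothetical stand-in for your script, and calling it with default mimics starting the container with default as the task. The error message is likewise an assumption and not part of the bioboxes tooling.

```shell
# Hypothetical stand-in for the entrypoint script: Docker passes everything
# after the image name to the entrypoint as positional arguments.
entrypoint() {
  # ${1:-} falls back to an empty string instead of failing when no task is given
  task="${1:-}"
  if [ -z "${task}" ]; then
    echo "No task given." >&2
    return 1
  fi
  echo "task=${task}"
}

# Mimics starting the container with "default" as the task
with_task=$(entrypoint default)
echo "${with_task}"

# Mimics starting the container without any arguments
if entrypoint 2>/dev/null; then
  without_task="accepted"
else
  without_task="rejected"
fi
echo "${without_task}"
```

Guarding the first argument like this gives a clearer message than the unbound variable error that set -o nounset raises when the container is started without a task.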

This script will do the following:

  1. Check if the provided biobox.yaml is in the correct format using the file-validator.
  2. Fetch the parameters provided in the input biobox.yaml.
  3. Run the specified task.
  4. Generate an output YAML file and return the assembled contigs.

Example

Let's go through the parts of the script. At the end of this section you will find the entire script and an updated Dockerfile. The first part of the script checks the given /bbx/input/biobox.yaml file with validate-biobox-file:

#!/bin/bash

# exit the script if any command fails
set -o errexit

# exit the script if an unset variable is used
set -o nounset

INPUT=/bbx/input/biobox.yaml
OUTPUT=/bbx/output
METADATA=/bbx/metadata

# Since this script is the entrypoint to your container
# you can access the task in `docker run task` as the first argument
TASK=$1

# Ensure the biobox.yaml file is valid
validate-biobox-file \
  --input "${INPUT}" \
  --schema /schema.yaml

mkdir -p "${OUTPUT}"


You can safely reuse this part in your biobox implementation, since all biobox RFCs have to use the validate-biobox-file binary.

The next part transforms the YAML to JSON and uses the jq tool to fetch the paths to the FASTQ files. jq can slice, filter, map, and transform JSON data, with filters chained together using pipes. In the example we try to access the following YAML:

---
version: 0.9.0
arguments:
  - fastq:
    - id: "pe"
      value: "/test1/reads.fastq.gz"
      type: paired
    - id: "pe_1"
      value: "/test2/reads.fastq.gz"

As you can see below, we first iterate over the array in the arguments property with .arguments[], keep only the entries that have a fastq property with select(has("fastq")), and then access the value entry of each array item with .fastq[].value. The last part, | "-short \(.)" followed by | tr '\n' ' ', prefixes each entry with -short and replaces each newline with a space. -short must be specified for the velvet command. The result of the pipeline is -short /test1/reads.fastq.gz -short /test2/reads.fastq.gz.

# Parse the read locations from this file
READS=$(yaml2json < ${INPUT} \
        | jq --raw-output '.arguments[] | select(has("fastq")) | .fastq[].value | "-short \(.)"' \
        | tr '\n' ' ')

# Create a temporary directory in /tmp
TMP_DIR=$(mktemp -d)
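To see what ends up in READS without installing yaml2json or jq, the last two stages can be imitated with standard tools. In this sketch the first printf is a stand-in for the paths that jq extracts, and sed is swapped in for the jq string interpolation that prefixes each path with -short. Neither is part of the actual script.

```shell
# Stand-in for the jq output: one FASTQ path per line
paths=$(printf '%s\n' /test1/reads.fastq.gz /test2/reads.fastq.gz)

# Prefix each path with -short, then join the lines with spaces,
# mirroring what the jq filter and tr do in the script
READS=$(printf '%s\n' "${paths}" | sed 's/^/-short /' | tr '\n' ' ')

# Brackets make the trailing space visible
echo "[${READS}]"
```

The trailing space comes from tr replacing the final newline as well.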

This part looks up the task passed to the Docker container by running egrep on the Taskfile (see Create a Task):

# Use grep to get $TASK in /Taskfile
CMD=$(egrep ^${TASK}: /Taskfile | cut -f 2 -d ':')
if [[ -z ${CMD} ]]; then
  echo "Abort, no task found for '${TASK}'."
  exit 1
fi

# if /bbx/metadata is mounted create log.txt
if [ -d "$METADATA" ]; then
  CMD="($CMD) >& $METADATA/log.txt"
fi

# Run the given task with eval.
# eval executes a string as if it had been typed on the command line.
eval "${CMD}"
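The lookup and eval steps can be tried against a throwaway Taskfile. The sketch below is an illustration under assumptions: the two task lines are hypothetical placeholders rather than the real velvet commands, the file lives in a temporary location instead of /Taskfile, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical Taskfile with placeholder commands, one "name: command" per line
taskfile=$(mktemp)
cat > "${taskfile}" << 'EOF'
default: echo default-task $READS
careful: echo careful-task $READS
EOF

# Print the command registered for the task name given as $1 (empty if none)
lookup() {
  grep -E "^${1}:" "${taskfile}" | cut -f 2 -d ':'
}

READS="-short /test1/reads.fastq.gz"

# A known task: eval expands $READS inside the stored command and runs it
found=$(lookup careful)
output=$(eval "${found}")
echo "${output}"

# An unknown task yields an empty string, which the script treats as an error
missing=$(lookup does-not-exist)
echo "missing task command: '${missing}'"

rm -f "${taskfile}"
```

Because cut keeps only the second colon-separated field, a task command that itself contains a colon would be truncated; cut -f 2- would keep the remainder.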


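The last part copies the contigs to the output directory and creates the output biobox.yaml, which is also specified in the RFC:

cp ${TMP_DIR}/contigs.fa ${OUTPUT}

# This command writes YAML into biobox.yaml until the EOF marker is reached
cat << EOF > ${OUTPUT}/biobox.yaml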
version: 0.9.0
arguments:
  - fasta:
    - id: velvet_contigs_1
      value: contigs.fa
      type: contigs
EOF
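As a quick sanity check, the heredoc step can be tried outside the container. The sketch below assumes a throwaway directory in place of /bbx/output (out_dir is a hypothetical name) and counts the lines that reference contigs.fa.

```shell
# Throwaway directory standing in for /bbx/output
out_dir=$(mktemp -d)

# Everything up to the EOF marker is written to biobox.yaml
cat << EOF > "${out_dir}/biobox.yaml"
version: 0.9.0
arguments:
  - fasta:
    - id: velvet_contigs_1
      value: contigs.fa
      type: contigs
EOF

# Count the lines that point at the contigs file
count=$(grep -c 'value: contigs.fa' "${out_dir}/biobox.yaml")
echo "entries referencing contigs.fa: ${count}"
```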

The final script, which we call assemble, should be placed in the same directory as your Dockerfile and looks like this:

#!/bin/bash

# exit the script if any command fails
set -o errexit

# exit the script if an unset variable is used
set -o nounset

INPUT=/bbx/input/biobox.yaml
OUTPUT=/bbx/output
METADATA=/bbx/metadata

# Since this script is the entrypoint to your container
# you can access the task in `docker run task` as the first argument
TASK=$1

# Ensure the biobox.yaml file is valid
validate-biobox-file \
  --input "${INPUT}" \
  --schema /schema.yaml

mkdir -p "${OUTPUT}"

# Parse the read locations from this file
READS=$(yaml2json < ${INPUT} \
        | jq --raw-output '.arguments[] | select(has("fastq")) | .fastq[].value | "-short \(.)"' \
        | tr '\n' ' ')

# Create a temporary directory in /tmp
TMP_DIR=$(mktemp -d)


# Use grep to get $TASK in /Taskfile
CMD=$(egrep ^${TASK}: /Taskfile | cut -f 2 -d ':')
if [[ -z ${CMD} ]]; then
  echo "Abort, no task found for '${TASK}'."
  exit 1
fi

# if /bbx/metadata is mounted create log.txt
if [ -d "$METADATA" ]; then
  CMD="($CMD) >& $METADATA/log.txt"
fi

# Run the given task with eval.
# eval executes a string as if it had been typed on the command line.
eval "${CMD}"

cp ${TMP_DIR}/contigs.fa ${OUTPUT}

# This command writes yaml into the biobox.yaml until the EOF symbol is reached
cat << EOF > ${OUTPUT}/biobox.yaml
version: 0.9.0
arguments:
  - fasta:
    - id: velvet_contigs_1
      value: contigs.fa
      type: contigs
EOF

The final Dockerfile, which now has additional RUN commands for downloading yaml2json and jq, looks like this:

FROM ubuntu:14.04
MAINTAINER Michael Barton, mail@michaelbarton.me.uk

ENV PACKAGES make gcc wget libc6-dev zlib1g-dev ca-certificates xz-utils
RUN apt-get update -y && apt-get install -y --no-install-recommends ${PACKAGES}

ENV ASSEMBLER_DIR /tmp/assembler
ENV ASSEMBLER_URL https://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
ENV ASSEMBLER_BLD make 'MAXKMERLENGTH=100' && mv velvet* /usr/local/bin/ && rm -r ${ASSEMBLER_DIR}

RUN mkdir ${ASSEMBLER_DIR}
RUN cd ${ASSEMBLER_DIR} &&\
    wget --quiet ${ASSEMBLER_URL} --output-document - |\
    tar xzf - --directory . --strip-components=1 && eval ${ASSEMBLER_BLD}

# Locations for biobox file validator
ENV VALIDATOR /bbx/validator/
ENV BASE_URL https://s3-us-west-1.amazonaws.com/bioboxes-tools/validate-biobox-file
ENV VERSION  0.x.y
RUN mkdir -p ${VALIDATOR}

# download the validate-biobox-file binary and extract it to the directory $VALIDATOR
RUN wget \
      --quiet \
      --output-document - \
      ${BASE_URL}/${VERSION}/validate-biobox-file.tar.xz \
    | tar xJf - \
      --directory ${VALIDATOR} \
      --strip-components=1

ENV PATH ${PATH}:${VALIDATOR}

# download the assembler schema
RUN wget \
    --output-document /schema.yaml \
    https://raw.githubusercontent.com/bioboxes/rfc/master/container/short-read-assembler/input_schema.yaml

ENV CONVERT https://github.com/bronze1man/yaml2json/raw/master/builds/linux_386/yaml2json
# download yaml2json and make it executable
RUN cd /usr/local/bin && wget --quiet ${CONVERT} && chmod 700 yaml2json

ENV JQ http://stedolan.github.io/jq/download/linux64/jq
# download jq and make it executable
RUN cd /usr/local/bin && wget --quiet ${JQ} && chmod 700 jq

# Add Taskfile to /
ADD Taskfile /

# Add the assemble script to the directory /usr/local/bin inside the container.
# /usr/local/bin is on the $PATH, which means that any script in that
# directory can be run from the shell without specifying its full path.
ADD assemble /usr/local/bin/

ENTRYPOINT ["assemble"]

The Dockerfile also sets the entrypoint to the assemble script, so that it is executed on docker run. If you have followed the examples, you should now have the following directory structure:

  • /Dockerfile

  • /assemble

  • /Taskfile

If you now run docker build -t velvet . in the same directory, you should have a biobox that accepts the tasks default and careful.
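To try the finished image, the host needs an input directory containing a biobox.yaml and an output directory to collect the results. The following sketch prepares both in a temporary location and prints a docker run command rather than executing it, so it also works on a machine without Docker. The host paths, the reads file name, and the choice of the default task are assumptions for illustration; the mount points /bbx/input and /bbx/output follow the paths used in the assemble script.

```shell
# Hypothetical host-side layout in a temporary directory
workdir=$(mktemp -d)
mkdir -p "${workdir}/input" "${workdir}/output"

# Input file; the value is the path of the reads as seen inside the container.
# reads.fastq.gz is a placeholder name and must be copied into input/ yourself.
cat > "${workdir}/input/biobox.yaml" << 'EOF'
---
version: 0.9.0
arguments:
  - fastq:
    - id: "pe"
      value: "/bbx/input/reads.fastq.gz"
      type: paired
EOF

# Assemble the command: mount input read-only, output read-write,
# then name the image and the task
run_cmd="docker run"
run_cmd="${run_cmd} --volume=${workdir}/input:/bbx/input:ro"
run_cmd="${run_cmd} --volume=${workdir}/output:/bbx/output:rw"
run_cmd="${run_cmd} velvet default"

# Printed instead of executed
echo "${run_cmd}"
```

After a successful run, the output directory on the host should contain contigs.fa and the generated biobox.yaml.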