2. User Guide

2.1. Quick Start

Most of the time, Bioconvert simply requires the input and output filenames. If there is no ambiguity, the extensions are used to infer the type of conversion you wish to perform.

For instance, to convert a FASTQ to a FASTA file, use this type of command:

bioconvert test.fastq test.fasta

If the fastq2fasta converter exists in Bioconvert, it will work out of the box. To get a list of all possible conversions, just type:

bioconvert --help

To obtain more specific help about a converter that you found in the list:

bioconvert fastq2fasta --help

Note

All converters are named as <input_extension>2<output_extension>
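
The naming rule in the note can be illustrated with a few lines of Python. This is only a sketch of the convention, not Bioconvert's own code; the helper name converter_name is hypothetical and compressed files are ignored here.

```python
from pathlib import Path


def converter_name(input_file: str, output_file: str) -> str:
    """Build '<input_extension>2<output_extension>' from two filenames."""
    in_ext = Path(input_file).suffix.lstrip(".").lower()
    out_ext = Path(output_file).suffix.lstrip(".").lower()
    if not in_ext or not out_ext:
        raise ValueError("both filenames need an extension")
    return f"{in_ext}2{out_ext}"


print(converter_name("test.fastq", "test.fasta"))  # prints fastq2fasta
```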

2.2. Explicit conversion

Sometimes, Bioconvert cannot work out what you want from the input and output extensions alone. In that case, you need to be explicit and use a subcommand. For instance, to use the fastq2fasta converter, type:

bioconvert fastq2fasta input.fastq output.fasta

There are several reasons for using a subcommand. First, each conversion has its own dedicated help, which may differ from one conversion to another:

bioconvert fastq2fasta --help

Second, the extensions of your input and output files may be non-standard or differ from those chosen by the Bioconvert developers. With the subcommand, you can write:

bioconvert fastq2fasta input.fq output.fas

where the extensions can actually be whatever you want.

If you do not provide the output file, it will be created based on the input filename by replacing the extension automatically. So this command:

bioconvert fastq2fasta input.fq

generates an output file called input.fasta. Note that it is placed in the same directory as the input file, not in the current working directory. So:

bioconvert fastq2fasta ~/test/input.fq

will create the input.fasta file in the ~/test directory.
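
The rule for the default output name can be sketched in Python as follows. This is an illustration of the behaviour described above under simple assumptions (a single extension, no compression suffix); the function name default_output is hypothetical and is not part of Bioconvert.

```python
from pathlib import Path


def default_output(input_file: str, output_extension: str) -> Path:
    """Swap the extension while staying in the input file's directory."""
    return Path(input_file).with_suffix("." + output_extension)


# The tilde is kept as-is here; a shell would expand it before Bioconvert runs.
print(default_output("~/test/input.fq", "fasta").as_posix())  # prints ~/test/input.fasta
```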

If the output file already exists, it will not be overwritten. To overwrite it, use the --force argument:

bioconvert fastq2fasta input.fq output.fa --force

2.3. Implicit conversion

If the extensions match the conversion name, you can perform implicit conversion:

bioconvert input.fastq output.fasta

Internally, a format may be registered with several extensions. For instance, the possible extensions for a FASTA file are fasta and fa, so you can also write:

bioconvert input.fastq output.fa
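
One way to picture this lookup is a small table that maps each format to its accepted extensions. The sketch below is hypothetical: the EXTENSIONS table and both helper functions are made up for illustration, and only the fasta/fa pair is taken from this guide.

```python
# Hypothetical registry: format name -> accepted extensions.
# Only the fasta/fa pair comes from the guide; the rest is illustrative.
EXTENSIONS = {
    "fasta": ["fasta", "fa"],
    "fastq": ["fastq"],
}


def format_of(filename: str) -> str:
    """Return the format whose registered extensions include the file's extension."""
    extension = filename.rsplit(".", 1)[-1].lower()
    for fmt, known in EXTENSIONS.items():
        if extension in known:
            return fmt
    raise ValueError(f"no format registered for extension {extension!r}")


def implicit_converter(input_file: str, output_file: str) -> str:
    """Name of the converter implied by the two filenames."""
    return f"{format_of(input_file)}2{format_of(output_file)}"


print(implicit_converter("input.fastq", "output.fa"))  # prints fastq2fasta
```

An extension missing from the table raises an error in this sketch, which corresponds to the situation where an explicit subcommand is needed.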

2.4. Compression

Input files may be compressed. For instance, most FASTQ files are compressed in GZ format. Compression is handled by some converters; in practice, most converters for human-readable formats support it. For instance, all of the following commands should work, whether you want to read compressed input files, write compressed output files, or both:

bioconvert test.fastq.gz test.fasta
bioconvert test.fastq.gz test.fasta.gz
bioconvert test.fastq.gz test.fasta.bz2

Note that you can also decompress a file and recompress it with another compression method without doing any format conversion (note the fastq extension in both the input and output files):

bioconvert test.fastq.gz test.fastq.dsrc
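
To make the filename convention explicit, the sketch below separates the format extension from an optional compression suffix. It is an illustration only: the function split_compression is hypothetical, and the suffix list contains just the three suffixes used in the examples above, which may not be the full set that Bioconvert supports.

```python
# Compression suffixes taken from the examples above; the real list may differ.
COMPRESSION_SUFFIXES = ("gz", "bz2", "dsrc")


def split_compression(filename: str):
    """Return (format_extension, compression_or_None) for a filename."""
    parts = filename.split(".")
    compression = None
    # A compression suffix only counts if a format extension precedes it.
    if len(parts) > 2 and parts[-1] in COMPRESSION_SUFFIXES:
        compression = parts.pop()
    return parts[-1], compression


print(split_compression("test.fastq.gz"))  # prints ('fastq', 'gz')
print(split_compression("test.fasta"))     # prints ('fasta', None)
```

With this split, test.fastq.gz and test.fastq.dsrc share the same format extension, which is why the last command above only changes the compression.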

2.5. Parallelization

In Bioconvert, if the input contains a wildcard character such as * or ?, the matching input files are handled separately and converted one after the other:

bioconvert fastq2fasta "*.fastq"

Because the files are processed one at a time, you may want to parallelise the computation.

2.5.1. Iteration with unix commands

You can use a bash script under unix to run Bioconvert on a set of files. For instance, the following script takes all files with the .fastq extension and converts them to fasta:

#!/bin/bash
CONVERSION=fastq2fasta
# Loop over every file ending in .fastq in the current directory.
# Quoting "$f" keeps filenames that contain spaces in one piece.
for f in *.fastq
do
  echo "Processing $f file..."
  bioconvert "$CONVERSION" "$f" --force
done

Note, however, that this is still a sequential computation. Yet, you may now change it slightly to run the commands on a cluster. For instance, on a SLURM scheduler, you can use:

#!/bin/bash
CONVERSION=fastq2fasta
# Submit one single-core job per .fastq file in the current directory
for f in *.fastq
do
  echo "Processing $f file..."
  sbatch -c 1 bioconvert "$CONVERSION" "$f" --force
done
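
If no scheduler is available, the same loop can be run in parallel on a single machine. The following is a minimal Python sketch, assuming bioconvert is on your PATH; it only reuses the command line from the shell loop above, and the function names and the choice of 4 workers are arbitrary.

```python
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

CONVERSION = "fastq2fasta"


def build_command(filename: str) -> list:
    """Command line for one file, mirroring the shell loop above."""
    return ["bioconvert", CONVERSION, filename, "--force"]


def convert(filename: str) -> int:
    """Run one conversion and return its exit code."""
    return subprocess.run(build_command(filename)).returncode


if __name__ == "__main__":
    files = sorted(glob.glob("*.fastq"))
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name, code in zip(files, pool.map(convert, files)):
            print(f"{name}: exit code {code}")
```

Each file is still converted by its own bioconvert process; the thread pool only limits how many of them run at the same time.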

2.5.2. Snakemake option

If you have lots of files to convert, a Snakemake pipeline is available in the Sequana project and can be installed using pip install sequana_bioconvert. It also installs Bioconvert with an Apptainer image that contains all dependencies for you.

Here is another way of running your jobs in parallel using a simple Snakefile (snakemake) that can be run easily either locally or on a cluster.

You can use the following Snakefile:

inext = "fastq"
outext = "fasta"
command = "fastq2fasta"

import glob
# Sample names are the input filenames without their final extension
samples = glob.glob("*.{}".format(inext))
samples = [this.rsplit(".", 1)[0] for this in samples]

rule all:
    input: expand("{{dataset}}.{}".format(outext), dataset=samples)

rule bioconvert:
    input: "{{dataset}}.{}".format(inext)
    output: "{{dataset}}.{}".format(outext)
    run:
        cmd = "bioconvert {} ".format(command) + "{input} {output}"
        shell(cmd)

and execute it locally as follows (assuming you have 4 CPUs):

snakemake -s Snakefile --cores 4

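or on a cluster, passing the job submission command to --cluster and the number of simultaneous jobs with -j. The example assumes a SLURM scheduler and a Snakemake version before 8, which still provides the --cluster option:

snakemake -s Snakefile --cluster "sbatch --mem=1000" -j 10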