Current version: 1.1.1, Jul 18, 2023

Bioconvert

Bioconvert is a collaborative project to facilitate the interconversion of life science data from one format to another.

Contributions: Want to add a converter? Please join https://github.com/bioconvert/bioconvert/issues/1

Overview

Life science uses many different formats. They may be old, or with complex syntax and converting those formats may be a challenge. Bioconvert aims at providing a common tool / interface to convert life science data formats from one to another.

Many conversion tools already exist but they may be dispersed, focused on few specific formats, difficult to install, or not optimised. With Bioconvert, we plan to cover a wide spectrum of format conversions; we will re-use existing tools when possible and provide facilities to compare different conversion tools or methods via benchmarking. New implementations are provided when considered better than existing ones.

In Jan 2023, we had 50 formats and 100 direct conversions available.

https://raw.githubusercontent.com/bioconvert/bioconvert/main/doc/conversion.png

Installation

BioConvert is developed in Python. Please use conda or any Python environment manager to install BioConvert using the pip command:

pip install bioconvert

About 50% of the conversions should work out of the box. However, many conversions require external tools, which is why we recommend using a conda environment. In particular, most external tools are available on the bioconda channel. For instance, if you want to convert a SAM file to a BAM file, you would need to install samtools as follows:

conda install -c bioconda samtools
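
Before running a conversion that depends on an external tool, it can help to check that the executable is on the PATH. The sketch below is only an illustration: the helper name missing_tools is hypothetical and not part of Bioconvert, and it relies solely on Python's standard shutil.which.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the executables from `tools` that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without the tools.
    """
    return [tool for tool in tools if which(tool) is None]


# Example: samtools is needed for the SAM to BAM conversion mentioned above.
for tool in missing_tools(["samtools"]):
    print(f"{tool} not found; try: conda install -c bioconda {tool}")
```

An empty result means every listed executable was found; anything else points at a tool that still has to be installed.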

Since bioconvert is available on bioconda, one solution that installs BioConvert and all its dependencies is to use conda/mamba:

conda create --name bioconvert mamba
conda activate bioconvert
mamba install bioconvert
bioconvert --help

See the Installation section for more details and alternative solutions (docker, singularity).

Quick Start

There are many conversions available. Type:

bioconvert --help

to get a list of valid conversions. Taking the example of a conversion from a FastQ file into a FastA file, you could do the conversion as follows:

bioconvert fastq2fasta input.fastq output.fasta
bioconvert fastq2fasta input.fq    output.fasta
bioconvert fastq2fasta input.fq.gz output.fasta.gz
bioconvert fastq2fasta input.fq.gz output.fasta.bz2

When there is no ambiguity, you can be implicit:

bioconvert input.fastq output.fasta

The default method of conversion is used, but you may use another one. Check out the available methods with:

bioconvert fastq2fasta --show-methods

For more help about a conversion, just type:

bioconvert fastq2fasta --help

and more generally:

bioconvert --help

You may also call BioConvert from a Python shell:

# import a converter
from bioconvert.fastq2fasta import FASTQ2FASTA

# Instantiate with infile/outfile names (example paths)
infile = "input.fastq"
outfile = "output.fasta"
convert = FASTQ2FASTA(infile, outfile)

# the conversion itself:
convert()

Available Converters

Conversion table

Converters

CI testing

Default method

abi2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/abi2fasta.yml/badge.svg

BIOPYTHON

abi2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/abi2fastq.yml/badge.svg

BIOPYTHON

abi2qual

https://github.com/bioconvert/bioconvert/actions/workflows/abi2qual.yml/badge.svg

BIOPYTHON

bam2bedgraph

https://github.com/bioconvert/bioconvert/actions/workflows/bam2bedgraph.yml/badge.svg

BEDTOOLS

bam2bigwig

https://github.com/bioconvert/bioconvert/actions/workflows/bam2bigwig.yml/badge.svg

DEEPTOOLS

bam2cov

https://github.com/bioconvert/bioconvert/actions/workflows/bam2cov.yml/badge.svg

BEDTOOLS

bam2cram

https://github.com/bioconvert/bioconvert/actions/workflows/bam2cram.yml/badge.svg

SAMTOOLS

bam2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/bam2fasta.yml/badge.svg

SAMTOOLS

bam2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/bam2fastq.yml/badge.svg

SAMTOOLS

bam2json

https://github.com/bioconvert/bioconvert/actions/workflows/bam2json.yml/badge.svg

BAMTOOLS

bam2sam

https://github.com/bioconvert/bioconvert/actions/workflows/bam2sam.yml/badge.svg

SAMBAMBA

bam2tsv

https://github.com/bioconvert/bioconvert/actions/workflows/bam2tsv.yml/badge.svg

SAMTOOLS

bam2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/bam2wiggle.yml/badge.svg

WIGGLETOOLS

bcf2vcf

https://github.com/bioconvert/bioconvert/actions/workflows/bcf2vcf.yml/badge.svg

BCFTOOLS

bcf2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/bcf2wiggle.yml/badge.svg

WIGGLETOOLS

bed2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/bed2wiggle.yml/badge.svg

WIGGLETOOLS

bedgraph2bigwig

https://github.com/bioconvert/bioconvert/actions/workflows/bedgraph2bigwig.yml/badge.svg

UCSC

bedgraph2cov

https://github.com/bioconvert/bioconvert/actions/workflows/bedgraph2cov.yml/badge.svg

BIOCONVERT

bedgraph2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/bedgraph2wiggle.yml/badge.svg

WIGGLETOOLS

bigbed2bed

https://github.com/bioconvert/bioconvert/actions/workflows/bigbed2bed.yml/badge.svg

DEEPTOOLS

bigbed2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/bigbed2wiggle.yml/badge.svg

WIGGLETOOLS

bigwig2bedgraph

https://github.com/bioconvert/bioconvert/actions/workflows/bigwig2bedgraph.yml/badge.svg

DEEPTOOLS

bigwig2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/bigwig2wiggle.yml/badge.svg

WIGGLETOOLS

bplink2plink

https://github.com/bioconvert/bioconvert/actions/workflows/bplink2plink.yml/badge.svg

PLINK

bplink2vcf

https://github.com/bioconvert/bioconvert/actions/workflows/bplink2vcf.yml/badge.svg

PLINK

bz22gz

https://github.com/bioconvert/bioconvert/actions/workflows/bz22gz.yml/badge.svg

Unix commands

clustal2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/clustal2fasta.yml/badge.svg

BIOPYTHON

clustal2nexus

https://github.com/bioconvert/bioconvert/actions/workflows/clustal2nexus.yml/badge.svg

GOALIGN

clustal2phylip

https://github.com/bioconvert/bioconvert/actions/workflows/clustal2phylip.yml/badge.svg

BIOPYTHON

clustal2stockholm

https://github.com/bioconvert/bioconvert/actions/workflows/clustal2stockholm.yml/badge.svg

BIOPYTHON

cram2bam

https://github.com/bioconvert/bioconvert/actions/workflows/cram2bam.yml/badge.svg

SAMTOOLS

cram2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/cram2fasta.yml/badge.svg

SAMTOOLS

cram2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/cram2fastq.yml/badge.svg

SAMTOOLS

cram2sam

https://github.com/bioconvert/bioconvert/actions/workflows/cram2sam.yml/badge.svg

SAMTOOLS

csv2tsv

https://github.com/bioconvert/bioconvert/actions/workflows/csv2tsv.yml/badge.svg

BIOCONVERT

csv2xls

https://github.com/bioconvert/bioconvert/actions/workflows/csv2xls.yml/badge.svg

Pandas

dsrc2gz

https://github.com/bioconvert/bioconvert/actions/workflows/dsrc2gz.yml/badge.svg

DSRC software

embl2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/embl2fasta.yml/badge.svg

BIOPYTHON

embl2genbank

https://github.com/bioconvert/bioconvert/actions/workflows/embl2genbank.yml/badge.svg

BIOPYTHON

fasta2clustal

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2clustal.yml/badge.svg

BIOPYTHON

fasta2faa

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2faa.yml/badge.svg

BIOCONVERT

fasta2fasta_agp

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2fasta_agp.yml/badge.svg

BIOCONVERT

fasta2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2fastq.yml/badge.svg

PYSAM

fasta2genbank

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2genbank.yml/badge.svg

BIOCONVERT

fasta2nexus

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2nexus.yml/badge.svg

GOALIGN

fasta2phylip

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2phylip.yml/badge.svg

BIOPYTHON

fasta2twobit

https://github.com/bioconvert/bioconvert/actions/workflows/fasta2twobit.yml/badge.svg

UCSC

fasta_qual2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/fasta_qual2fastq.yml/badge.svg

PYSAM

fastq2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/fastq2fasta.yml/badge.svg

BIOCONVERT

fastq2fasta_qual

https://github.com/bioconvert/bioconvert/actions/workflows/fastq2fasta_qual.yml/badge.svg

BIOCONVERT

fastq2qual

https://github.com/bioconvert/bioconvert/actions/workflows/fastq2qual.yml/badge.svg

READFQ

genbank2embl

https://github.com/bioconvert/bioconvert/actions/workflows/genbank2embl.yml/badge.svg

BIOPYTHON

genbank2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/genbank2fasta.yml/badge.svg

BIOPYTHON

genbank2gff3

https://github.com/bioconvert/bioconvert/actions/workflows/genbank2gff3.yml/badge.svg

BIOCODE

gfa2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/gfa2fasta.yml/badge.svg

BIOCONVERT

gff22gff3

https://github.com/bioconvert/bioconvert/actions/workflows/gff22gff3.yml/badge.svg

BIOCONVERT

gff32gff2

https://github.com/bioconvert/bioconvert/actions/workflows/gff32gff2.yml/badge.svg

BIOCONVERT

gff32gtf

https://github.com/bioconvert/bioconvert/actions/workflows/gff32gtf.yml/badge.svg

BIOCONVERT

gz2bz2

https://github.com/bioconvert/bioconvert/actions/workflows/gz2bz2.yml/badge.svg

pigz/pbzip2 software

gz2dsrc

https://github.com/bioconvert/bioconvert/actions/workflows/gz2dsrc.yml/badge.svg

DSRC software

json2yaml

https://github.com/bioconvert/bioconvert/actions/workflows/json2yaml.yml/badge.svg

Python

maf2sam

https://github.com/bioconvert/bioconvert/actions/workflows/maf2sam.yml/badge.svg

BIOCONVERT

newick2nexus

https://github.com/bioconvert/bioconvert/actions/workflows/newick2nexus.yml/badge.svg

GOTREE

newick2phyloxml

https://github.com/bioconvert/bioconvert/actions/workflows/newick2phyloxml.yml/badge.svg

GOTREE

nexus2clustal

https://github.com/bioconvert/bioconvert/actions/workflows/nexus2clustal.yml/badge.svg

GOALIGN

nexus2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/nexus2fasta.yml/badge.svg

BIOPYTHON

nexus2newick

https://github.com/bioconvert/bioconvert/actions/workflows/nexus2newick.yml/badge.svg

GOTREE

nexus2phylip

https://github.com/bioconvert/bioconvert/actions/workflows/nexus2phylip.yml/badge.svg

GOALIGN

nexus2phyloxml

https://github.com/bioconvert/bioconvert/actions/workflows/nexus2phyloxml.yml/badge.svg

GOTREE

ods2csv

https://github.com/bioconvert/bioconvert/actions/workflows/ods2csv.yml/badge.svg

pyexcel library

pdb2faa

https://github.com/bioconvert/bioconvert/actions/workflows/pdb2faa.yml/badge.svg

BIOCONVERT

phylip2clustal

https://github.com/bioconvert/bioconvert/actions/workflows/phylip2clustal.yml/badge.svg

BIOPYTHON

phylip2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/phylip2fasta.yml/badge.svg

BIOPYTHON

phylip2nexus

https://github.com/bioconvert/bioconvert/actions/workflows/phylip2nexus.yml/badge.svg

GOALIGN

phylip2stockholm

https://github.com/bioconvert/bioconvert/actions/workflows/phylip2stockholm.yml/badge.svg

BIOPYTHON

phylip2xmfa

https://github.com/bioconvert/bioconvert/actions/workflows/phylip2xmfa.yml/badge.svg

BIOPYTHON

phyloxml2newick

https://github.com/bioconvert/bioconvert/actions/workflows/phyloxml2newick.yml/badge.svg

GOTREE

phyloxml2nexus

https://github.com/bioconvert/bioconvert/actions/workflows/phyloxml2nexus.yml/badge.svg

GOTREE

plink2bplink

https://github.com/bioconvert/bioconvert/actions/workflows/plink2bplink.yml/badge.svg

PLINK

plink2vcf

https://github.com/bioconvert/bioconvert/actions/workflows/plink2vcf.yml/badge.svg

PLINK

sam2bam

https://github.com/bioconvert/bioconvert/actions/workflows/sam2bam.yml/badge.svg

SAMTOOLS

sam2cram

https://github.com/bioconvert/bioconvert/actions/workflows/sam2cram.yml/badge.svg

SAMTOOLS

sam2paf

https://github.com/bioconvert/bioconvert/actions/workflows/sam2paf.yml/badge.svg

BIOCONVERT

scf2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/scf2fasta.yml/badge.svg

BIOCONVERT

scf2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/scf2fastq.yml/badge.svg

BIOCONVERT

sra2fastq

https://github.com/bioconvert/bioconvert/actions/workflows/sra2fastq.yml/badge.svg

FASTQDUMP

stockholm2clustal

https://github.com/bioconvert/bioconvert/actions/workflows/stockholm2clustal.yml/badge.svg

BIOPYTHON

stockholm2phylip

https://github.com/bioconvert/bioconvert/actions/workflows/stockholm2phylip.yml/badge.svg

BIOPYTHON

tsv2csv

https://github.com/bioconvert/bioconvert/actions/workflows/tsv2csv.yml/badge.svg

BIOCONVERT

twobit2fasta

https://github.com/bioconvert/bioconvert/actions/workflows/twobit2fasta.yml/badge.svg

DEEPTOOLS

vcf2bcf

https://github.com/bioconvert/bioconvert/actions/workflows/vcf2bcf.yml/badge.svg

BCFTOOLS

vcf2bed

https://github.com/bioconvert/bioconvert/actions/workflows/vcf2bed.yml/badge.svg

BIOCONVERT

vcf2bplink

https://github.com/bioconvert/bioconvert/actions/workflows/vcf2bplink.yml/badge.svg

PLINK

vcf2plink

https://github.com/bioconvert/bioconvert/actions/workflows/vcf2plink.yml/badge.svg

PLINK

vcf2wiggle

https://github.com/bioconvert/bioconvert/actions/workflows/vcf2wiggle.yml/badge.svg

WIGGLETOOLS

wig2bed

https://github.com/bioconvert/bioconvert/actions/workflows/wig2bed.yml/badge.svg

BEDOPS

xls2csv

https://github.com/bioconvert/bioconvert/actions/workflows/xls2csv.yml/badge.svg

xlsx2csv

https://github.com/bioconvert/bioconvert/actions/workflows/xlsx2csv.yml/badge.svg

Pandas library

xmfa2phylip

https://github.com/bioconvert/bioconvert/actions/workflows/xmfa2phylip.yml/badge.svg

BIOPYTHON

yaml2json

https://github.com/bioconvert/bioconvert/actions/workflows/yaml2json.yml/badge.svg

Pandas library

Contributors

Setting up and maintaining Bioconvert has been possible thanks to users and contributors. Thanks to all:

https://contrib.rocks/image?repo=bioconvert/bioconvert

Changes

Version

Description

1.1.1

  • Fix benchmark labels.

  • NEW: fast52pod5 conversion

  • FIX: set goalign and gotree instead of go requirements

1.1.0

  • Implement ability to benchmark CPU and memory usage (not just time)

1.0.0

0.6.3

  • add picard method in bam2sam

  • Fixed all CI workflows to use mamba

  • drop python3.7 support and add 3.10 support

  • update bedops test file to fit the latest bedops 2.4.41 version

  • revisit logging system

0.6.2

  • added gff3 to gtf conversion.

  • Added pdb to faa conversion

  • Added missing --reference argument to the cram2sam conversion

0.6.1

  • output file can be in sub-directories allowing syntax such as 'bioconvert fastq2fasta test.fastq outputs/test.fasta'

  • fix all CI actions

  • add more examples as notebooks in ./examples

  • add a Snakefile for the paper in ./doc/Snakefile_paper

0.6.0

  • Fix bug in bam2sam (method sambamba)

  • Fix graph layout

  • add threading in fastq2fasta (seqkit method)

  • multibenchmark feature added

  • stable version used for web interface

0.5.2

  • Update requirements and environment.yml and add a conda spec-file.txt file

0.5.1

  • add genbank2gff3 requirement material in bioconvert.utils.biocode

0.5.0

  • Add CI actions for all converters

  • remove sniffer (now in biosniff on pypi https://pypi.org/project/biosniff/)

  • A complete benchmarking suite (see doc/Snakefile_benchmark file and benchmarking)

  • documentation and tests for all converters

  • removed the validators (we assume inputs are correct)

0.4.X

  • (aug 2019) added nexus2fasta, cram2fasta, fasta2faa ... ; 1-to-many and many-to-one converters are now part of the API.

0.3.X

may 2019. new methods abi2qual, bigbed2bed, etc. added --threads option

0.2.X

aug 2018. abi2fastx, bioconvert_stats tool added

0.1.X

major refactoring to have subcommands with implicit/explicit mode

Complete documentation including User and Developer Guides

Installation

Bioconvert is developed in Python, so you can use pip to install it easily. We recommend using a virtual environment so as not to interfere with your system. In any case, install BioConvert with:

pip install bioconvert

Note, however, that you will be able to use only about half of the conversions (pure Python). Others depend on third-party software.

One solution is to create a dedicated environment using conda. In particular, we use bioconda to install those dependencies.

conda / bioconda /mamba installation

One workable and relatively straightforward installation is based on conda/mamba:

conda create --name bioconvert python=3.8
conda activate bioconvert
conda install mamba

Then, use mamba to install the missing executables. Dependencies and BioConvert are available on the bioconda channel (see more about channels at the bottom of the page). For example, for samtools:

mamba install samtools -c bioconda

Third-party executables can be installed with your own method. We recommend and provide solutions for conda. Indeed, BioConvert is available on the bioconda channel (see Conda channels section for details).

So, you could create a conda environment and install bioconvert directly with all dependencies. This is, however, pretty slow due to the large number of dependencies:

conda create --name bioconvert bioconvert

Instead, we recommend using an intermediate tool called mamba, which provides a more robust and faster installation:

conda create -c bioconda --name bioconvert mamba
conda activate bioconvert
mamba install bioconvert -c bioconda

In Jan 2023, this method worked out of the box and created an environment with Python 3.10 and bioconvert 0.6.2 with all its dependencies.

We also provide a frozen version of an environment in the bioconvert GitHub repository. Note, however, that this file may change with time. The following commands download it and create a conda environment called bioconvert:

wget https://raw.githubusercontent.com/bioconvert/bioconvert/main/environment.yml -O test.yml
conda env create -f test.yml

Docker

A Docker image (version 0.6.1 of BioConvert) is available on Docker Hub:

docker pull bioconvert/bioconvert:0.6.1

Which can be used as follows:

docker run -v /home/user:/home/user bioconvert/bioconvert:0.6.1 bioconvert /home/user/test_file.fastq /home/user/test_file.fasta

Since bioconvert is on bioconda, it is also available on quay.io. For instance, version 0.6.2 is reachable here:

docker pull quay.io/biocontainers/bioconvert:0.6.2--pyhdfd78af_0

Singularity/Apptainer

We provide Singularity/Apptainer images of BioConvert within the https://damona.readthedocs.io project.

Version 0.6.2 of BioConvert is available for download.

Using damona:

pip install damona

# create and activate an environment
damona env --create test_bioconvert
damona activate test_bioconvert
damona install bioconvert
bioconvert

You can also install the singularity image yourself by downloading it:

wget https://zenodo.org/record/7034822/files/bioconvert_0.6.1.img
singularity exec bioconvert_0.6.1.img bioconvert

# you can also create an alias
alias bioconvert="singularity exec bioconvert_0.6.1.img bioconvert"

Warning

You will need Singularity, of course. If you have a conda environment, you are in luck: Singularity is available there.

Conda channels

First, you will need to set up the bioconda channel if not already done:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Warning

It is important to add them in this order, as mentioned on the bioconda webpage (https://bioconda.github.io/).

If you have already set the channels, please check that the order is correct with the following command:

conda config --get channels

You should see:

--add channels 'defaults'
--add channels 'bioconda'
--add channels 'conda-forge'   # highest priority
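
The order check can also be scripted. The following is a minimal sketch, assuming the output of conda config --get channels has the shape shown above; the function names parse_channels and channels_in_expected_order are hypothetical helpers, not part of conda or Bioconvert.

```python
import re

# Expected order as printed by conda, from lowest to highest priority.
EXPECTED_ORDER = ["defaults", "bioconda", "conda-forge"]


def parse_channels(config_output):
    """Extract channel names, in printed order, from `conda config --get channels`."""
    return re.findall(r"--add channels '([^']+)'", config_output)


def channels_in_expected_order(config_output, expected=EXPECTED_ORDER):
    """True when the expected channels appear in the given relative order."""
    channels = [name for name in parse_channels(config_output) if name in expected]
    return channels == list(expected)


sample = """--add channels 'defaults'
--add channels 'bioconda'
--add channels 'conda-forge'   # highest priority
"""
print(channels_in_expected_order(sample))
```

In practice the text passed to the function would be captured from the conda command itself; other channels present in the configuration are ignored by this check.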

User Guide

Quick Start

Most of the time, Bioconvert simply requires the input and output filenames. If there is no ambiguity, the extensions are used to infer the type of conversion you wish to perform.

For instance, to convert a FASTQ to a FASTA file, use this type of command:

bioconvert test.fastq test.fasta

If the fastq to fasta converter exists in Bioconvert, it will work out of the box. In order to get a list of all possible conversions, just type:

bioconvert --help

To obtain more specific help about a converter that you found in the list:

bioconvert fastq2fasta --help

Note

All converters are named as <input_extension>2<output_extension>

Explicit conversion

Sometimes, Bioconvert won't be able to know what you want solely based on the input and output extensions. So, you may need to be explicit and use a subcommand. For instance, to use the converter fastq2fasta, type:

bioconvert fastq2fasta  input.fastq output.fasta

The rationale behind the subcommand choice is manifold. First, you may have dedicated help for a given conversion, which may be different from one conversion to the other:

bioconvert fastq2fasta --help

Second, the extensions of your input and output may be non-standard or different from the choice made by the bioconvert developers. So, using the subcommand you can do:

bioconvert fastq2fasta  input.fq output.fas

where the extensions can actually be whatever you want.

If you do not provide the output file, it will be created based on the input filename by replacing the extension automatically. So this command:

bioconvert fastq2fasta input.fq

generates an output file called input.fasta. Note that it will be placed in the same directory as the input file, not locally. So:

bioconvert fastq2fasta ~/test/input.fq

will create the input.fasta file in the ~/test directory.
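
The default naming rule described above can be sketched in a few lines. This is an illustration only, assuming the rule is a plain extension swap; default_output is a hypothetical helper, and Bioconvert's own logic (for instance with compressed inputs such as .fq.gz) may differ.

```python
from pathlib import PurePosixPath


def default_output(infile, out_ext):
    """Replace the extension of `infile` with `out_ext`, keeping its directory."""
    return str(PurePosixPath(infile).with_suffix("." + out_ext))


print(default_output("input.fq", "fasta"))
print(default_output("~/test/input.fq", "fasta"))
```

The directory part of the input path is preserved, which matches the behaviour described for ~/test/input.fq.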

If an output file exists, it will not be overwritten. If you want to do so, use the --force argument:

bioconvert fastq2fasta  input.fq output.fa --force

Implicit conversion

If the extensions match the conversion name, you can perform implicit conversion:

bioconvert input.fastq output.fasta

Internally, a format may be registered with several extensions. For instance the extensions possible for a FastA file are fasta and fa so you can also write:

bioconvert input.fastq output.fa
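
To make the idea of implicit conversion concrete, here is a minimal sketch of how a converter name could be inferred from two filenames. The EXTENSIONS registry and the functions guess_format and guess_converter are hypothetical; they are not Bioconvert's internal API, and the registry only lists the extensions used in the examples of this guide.

```python
# Hypothetical registry: format name -> accepted extensions.
# "fasta"/"fa" come from the text above; "fq" is taken from the examples.
EXTENSIONS = {
    "fastq": ["fastq", "fq"],
    "fasta": ["fasta", "fa"],
}
COMPRESSION = {"gz", "bz2", "dsrc"}


def guess_format(filename, registry=EXTENSIONS):
    """Return the format whose extension list matches `filename`, or None."""
    parts = filename.lower().split(".")
    # Ignore a trailing compression suffix such as .gz
    if len(parts) > 2 and parts[-1] in COMPRESSION:
        parts = parts[:-1]
    ext = parts[-1] if len(parts) > 1 else ""
    for fmt, extensions in registry.items():
        if ext in extensions:
            return fmt
    return None


def guess_converter(infile, outfile):
    """Build '<input>2<output>' when both formats are recognised."""
    src, dst = guess_format(infile), guess_format(outfile)
    if src is None or dst is None:
        raise ValueError("unknown extension; use an explicit subcommand")
    return f"{src}2{dst}"


print(guess_converter("input.fastq", "output.fa"))
```

When an extension is not recognised, the sketch raises an error, which mirrors the situation where an explicit subcommand is needed.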

Compression

Input files may be compressed. For instance, most FASTQ files are compressed in GZ format. Compression is handled in some converters. Basically, most of the human-readable formats handle compression. For instance, all these commands should work and can be used to compress output files or to handle compressed input files:

bioconvert test.fastq.gz test.fasta
bioconvert test.fastq.gz test.fasta.gz
bioconvert test.fastq.gz test.fasta.bz2

Note that you can also decompress and recompress into another compression format without doing any conversion (note the fastq extension in both input and output files):

bioconvert test.fastq.gz test.fastq.dsrc
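
For readers curious about what transparent compression handling involves, the sketch below picks an opener from the file suffix using only the Python standard library. It is not Bioconvert's implementation: open_text and recompress are hypothetical names, and DSRC is left out because it has no standard-library module.

```python
import bz2
import gzip

# Suffix -> function able to open that kind of compressed file.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path, mode="rt"):
    """Open `path` as text, transparently handling .gz and .bz2 suffixes."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, mode)
    return open(path, mode)


def recompress(src, dst):
    """Copy text from `src` to `dst`; compression follows each file's suffix."""
    with open_text(src, "rt") as fin, open_text(dst, "wt") as fout:
        for line in fin:
            fout.write(line)
```

Calling recompress("test.fastq.gz", "test.fastq.bz2") would then change the compression without touching the FASTQ content.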

Parallelization

In Bioconvert, if the input contains a wildcard character such as * or ?, the matching input files are treated separately:

bioconvert fastq2fasta "*.fastq"

Note, however, that the files are processed sequentially, one by one, so you may want to parallelise the computation.

Iteration with unix commands

You can use a bash script under unix to run Bioconvert on a set of files. For instance, the following script takes all files with the .fastq extension and converts them to fasta:

#!/bin/bash
FILES=*fastq
CONVERSION=fastq2fasta
for f in $FILES
do
  echo "Processing $f file..."
  bioconvert $CONVERSION $f  --force
done

Note, however, that this is still a sequential computation. Yet, you may now change it slightly to run the commands on a cluster. For instance, on a SLURM scheduler, you can use:

#!/bin/bash
FILES=*fastq
CONVERSION=fastq2fasta
for f in $FILES
do
  echo "Processing $f file..."
  sbatch -c 1 bioconvert $CONVERSION $f  --force
done
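
On a single multi-core machine without a scheduler, the same loop can be run concurrently from Python. This is a sketch under stated assumptions: convert_one and convert_all are hypothetical helpers, the command line reuses the explicit syntax and the --force flag described earlier, and the runner argument exists only so the logic can be tried without launching real processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def convert_one(path, converter="fastq2fasta", runner=subprocess.run):
    """Run one explicit conversion; --force overwrites an existing output."""
    cmd = ["bioconvert", converter, path, "--force"]
    runner(cmd, check=True)
    return path


def convert_all(paths, converter="fastq2fasta", workers=4, runner=subprocess.run):
    """Convert several files concurrently, one bioconvert process per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(convert_one, p, converter, runner) for p in paths]
        return [f.result() for f in futures]
```

A typical call would be convert_all(sorted(glob.glob("*.fastq"))) after importing glob. Threads are sufficient here because the actual work happens in the child bioconvert processes.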

Snakemake option

If you have lots of files to convert, a snakemake pipeline is available in the Sequana project and can be installed using pip install sequana_bioconvert. It also installs bioconvert with an apptainer image that contains all dependencies for you.

Here is another way of running your jobs in parallel using a simple Snakefile (snakemake) that can be run easily either locally or on a cluster.

You can download the following file Snakefile

inext = "fastq"
outext = "fasta"
command = "fastq2fasta"

import glob
samples = glob.glob("*.{}".format(inext))
samples = [this.rsplit(".", 1)[0] for this in samples]

rule all:
    input: expand("{{dataset}}.{}".format(outext), dataset=samples)

rule bioconvert:
    input: "{{dataset}}.{}".format(inext)
    output: "{{dataset}}.{}".format(outext)
    run:
        cmd = "bioconvert {} ".format(command) + "{input} {output}"
        shell(cmd)

and execute it locally as follows (assuming you have 4 CPUs):

snakemake -s Snakefile --cores 4

or on a cluster:

snakemake -s Snakefile --cluster "sbatch --mem=1000" -j 10

Tutorial

Here is a tutorial that allows you to quickly start bioconvert and see some features on a real data set.

We want to highlight CNVs (Copy Number Variations) by identifying a significant increase in sequencing coverage in the following samples.

Data

For this tutorial, we will work with 6 sequencing samples of the Staphylococcus aureus genome:

  1. ERR043367

  2. ERR043371

  3. ERR043375

  4. ERR043379

  5. ERR142616

  6. ERR316404

These samples are stored in the compressed SRA format, whereas we want fastq files.

Downloading

To download archive files (SRA):

bioconvert sra2fastq ERR043367

This downloads the ERR043367 archive in SRA format and converts it into fastq, using only the sample ID.

This is paired-end sequencing, so bioconvert creates two fastq files:

  • ERR043367_1.fastq (contains the reads 1)

  • ERR043367_2.fastq (contains the reads 2)

Bioconvert behaves differently for single-end sequencing, but the syntax is the same. For example, the ERR3295124 sample is single-end. Let's try:

bioconvert sra2fastq ERR3295124

It is exactly the same command for single-end sequencing, but there is only one output file:

  • ERR3295124.fastq

Compression

Fastq files can be huge. If you want to keep the files, bioconvert can perform compression on the fly, but in this case you have to be explicit (specify the output, with the gz extension):

bioconvert sra2fastq ERR043367 ERR043367.fastq.gz

Mapping

Now, from fastq files, we can perform an alignment on the reference genome using bwa for example:

bwa index staphylococcus_aureus.fasta
bwa mem -M -t 4 staphylococcus_aureus.fasta ERR043367_1.fastq ERR043367_2.fastq > ERR043367.sam

Note

The reference genome of Staphylococcus aureus is available on NCBI under the accession FN433596.

We get a sam file that we can visualize. To reduce its size, you can use bioconvert in two ways to convert the sam file to a bam file:

  1. Implicit way:

    bioconvert ERR043367.sam ERR043367.bam

This is the implicit way because bioconvert deduces the converter to use from the input and output extensions.

  2. Explicit way:

    bioconvert sam2bam ERR043367.sam

Here we specify the converter, so bioconvert is able to deduce the extension of the output file.

Note

In both cases, we have the same output file (ERR043367.bam)

Visualization

Then, from this bam file, you can visualize the mapping with IGV, for example.

Here we have a global view of 500 bp, from position 2,828,460 to 2,828,960, using IGV. From this point of view, we can see a significant difference between the region in red and the other two in blue.

_images/coverage_igv.png

We expected to obtain fairly uniform coverage across all samples. But on this region we observe that this is not the case. We can therefore say that there is a possible variation in the number of copies.

In order to confirm what we saw, we want to convert our alignment (BAM) to a BED file to know the number of reads mapped by position:

bioconvert bam2bedgraph ERR043367.bam ERR043367.bed

In this bed file, we can check the visual results obtained a little earlier with text-processing tools that give some quick statistics, such as the average coverage (168), and compare it to the most covered regions to identify CNVs.

On all the samples we have identified 2 regions that are significantly more covered.
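
The quick statistics mentioned above can also be computed with a short script. The sketch below assumes the output follows the usual four-column bedGraph layout (chromosome, start, end, coverage); the function names and the factor of 2 used to flag regions are arbitrary choices for illustration, not the procedure used to obtain the figures quoted in this tutorial.

```python
def parse_bedgraph(lines):
    """Yield (chrom, start, end, coverage) from bedGraph lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and track/browser/comment headers
        if len(fields) < 4 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        yield fields[0], int(fields[1]), int(fields[2]), float(fields[3])


def mean_coverage(intervals):
    """Length-weighted mean coverage over the listed intervals."""
    total_bases = sum(end - start for _, start, end, _ in intervals)
    weighted = sum((end - start) * cov for _, start, end, cov in intervals)
    return weighted / total_bases if total_bases else 0.0


def high_coverage(intervals, factor=2.0):
    """Intervals whose coverage exceeds `factor` times the mean coverage."""
    threshold = factor * mean_coverage(intervals)
    return [iv for iv in intervals if iv[3] > threshold]


# Tiny made-up example, not real data from the samples above.
example = [
    "chr1\t0\t100\t10",
    "chr1\t100\t200\t10",
    "chr1\t200\t250\t60",
]
intervals = list(parse_bedgraph(example))
print(mean_coverage(intervals))
print(high_coverage(intervals))
```

For a real file, the list of lines would be replaced by an open file handle on ERR043367.bed. Note that the mean is taken over the intervals present in the file only, so positions that are not listed do not count.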

Developer guide

Quick start

As a developer, assuming you have a valid environment and installed Bioconvert (Installation for developers), go to the bioconvert directory and type the bioconvert_init command for the input and output formats you wish to add (here we want to convert format A to B). You may also just copy an existing file:

cd bioconvert
bioconvert_init -i A -o B > A2B.py

See the How to add a new conversion section for details. Edit the file and update the method that performs the conversion by adding the relevant code (either Python or external tools). Once done, please:

  1. add an input test file in the ./test/data directory (see How to add a test)

  2. add the relevant data test files in the ./bioconvert/test/data/ directory (see How to add a test file)

  3. Update the documentation as explained in the How to add your new converter to the main documentation? section:

    1. add the module in doc/ref_converters.rst in the autosummary section

    2. add the A2B in the README.rst

  4. add a CI action in .github/workflows named after the conversion (A2B.yml)

Note also that a converter (a Python module, e.g., fastq2fasta) may have several methods included and it is quite straightforward to add a new method (How to add a new method to an existing converter). They can later be compared thanks to our benchmarking framework.

If this is a new format, you may also update the glossary.rst file in the documentation.

Installation for developers

To develop on bioconvert, it is highly recommended to install bioconvert in a virtualenv:

mkdir bioconvert
cd bioconvert
python3 -m venv venv
source venv/bin/activate

And clone the bioconvert project

mkdir src
cd src
git clone https://github.com/bioconvert/bioconvert.git
cd  bioconvert

We need to install some extra requirements to run the tests or build the doc. To install these requirements:

pip install -e ".[testing]"

Warning

The extra requirements try to install pygraphviz, so you need to install graphviz on your computer. If you are running a Debian-based distribution, you have to install the libcgraph6, libgraphviz-dev and graphviz packages.

Note

You may need to install extra tools to run some conversions. The requirements_tools.txt file lists the extra tools available through conda.

How to add a new conversion

Officially, Bioconvert supports one-to-one conversions only (from one format to another format). See the note here below about One-to-many and many-to-one conversions.

Let us imagine that we want to include a new format conversion from FastQ to FastA format.

First, you need to add a new file in the ./bioconvert directory called:

fastq2fasta.py

Please note that the name is all lowercase and that we concatenate the input format name, the character 2 and the output format name. Sometimes a format already includes the character 2 in its name (e.g. bz2), which may be confusing. For now, just follow the previous convention, meaning that the character 2 is duplicated if needed (e.g., for bz2 to gz format, use bz22gz).
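
The naming convention, including the case where a format name itself ends with the character 2, can be illustrated with a short sketch. Both functions are hypothetical helpers written for this explanation, and the set of known formats is a small sample rather than Bioconvert's registry.

```python
def converter_name(in_format, out_format):
    """Join two format names with the character 2, all lowercase."""
    return f"{in_format}2{out_format}".lower()


def split_converter_name(name, known_formats):
    """Return every (input, output) pair of known formats that spells `name`."""
    pairs = []
    for i, char in enumerate(name):
        if char == "2":
            left, right = name[:i], name[i + 1:]
            if left in known_formats and right in known_formats:
                pairs.append((left, right))
    return pairs


formats = {"fastq", "fasta", "bz2", "gz"}
print(converter_name("FASTQ", "FASTA"))
print(converter_name("bz2", "gz"))
print(split_converter_name("bz22gz", formats))
```

Splitting a name back into its two formats requires knowing the valid format names, because the character 2 alone does not say where the input format ends.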

As for the class name, we use all uppercase. In the newly created file (fastq2fasta.py) you can (i) copy / paste the content of an existing converter, (ii) use the bioconvert_init executable (see later), or (iii) copy / paste the following code:

 1"""Convert :term:`FastQ` format to :term:`FastA` formats"""
 2from bioconvert import ConvBase
 3from bioconvert.core.decorators import requires, requires_nothing
 4__all__ = ["FASTQ2FASTA"]
 5
 6
 7class FASTQ2FASTA(ConvBase):
 8    """
 9
10    """
11    _default_method = "v1"
12
13    def __init__(self, infile, outfile):
14        """
15        :param str infile: information
16        :param str outfile: information
17        """
18        super().__init__(infile, outfile)
19
20    @requires(external_binary="awk")
21    def _method_v1(self, *args, **kwargs):
22        # Conversion is made here.
23        # You can use self.infile and  self.outfile
24        # If you use an external command, you can use self.execute:
25        self.execute(cmd)
26
27    @requires_nothing
28    def _method_v2(self, *args, **kwargs):
29        #another method
30        pass

On line 1, please explain the conversion using the terms available in the Glossary (./doc/glossary.rst file). If not available, you may edit the glossary.rst file to add a quick description of the formats.

Warning

If the format is not already included in Bioconvert, you will need to update the file core/extensions.py to add the format name and its possible extensions.

On line 2, just import the common class.

On line 7, name the class after your input and output formats; again include the character 2 between the input and output formats. Usually, we use uppercase for the formats since most format names are acronyms. If the input or output format exists already in Bioconvert, please follow the existing conventions.

On line 13, we add the constructor.

On line 21, we add a method to perform the conversion named _method_v1. Here, the prefix _method_ is compulsory: it tells Bioconvert that it is a possible conversion to include in the user interface. This is also where you will add your code to perform the conversion. The suffix (here v1) is the name of the conversion method. That way you can add as many conversion methods as you need (e.g. on line 28, we implemented another method called v2).

Line 20 and line 27 show the decorators that tell bioconvert which external tools are required. See the Decorators section.

Since several methods can be implemented, we need to define a default method (line 11; here v1).
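To make the role of the _method_ prefix and of the default method concrete, here is a minimal sketch of how methods could be discovered by name. MiniConverter is a toy stand-in written for this example; it is not Bioconvert's ConvBase, whose actual implementation may differ.

```python
class MiniConverter:
    """Toy stand-in for a converter base class (NOT Bioconvert's ConvBase)."""

    _default_method = None
    _prefix = "_method_"

    def __init__(self, infile, outfile):
        self.infile = infile
        self.outfile = outfile

    @classmethod
    def available_methods(cls):
        # Collect every callable attribute whose name starts with the prefix
        # and keep only the suffix (the method name shown to the user).
        return sorted(
            name[len(cls._prefix):]
            for name in dir(cls)
            if name.startswith(cls._prefix) and callable(getattr(cls, name))
        )

    def __call__(self, method=None):
        # Fall back to the default method when none is requested.
        method = method or self._default_method
        if method not in self.available_methods():
            raise ValueError(f"unknown method: {method}")
        return getattr(self, self._prefix + method)()


class UPPER2LOWER(MiniConverter):
    """Hypothetical converter with two methods, v1 being the default."""

    _default_method = "v1"

    def _method_v1(self):
        return "ran v1"

    def _method_v2(self):
        return "ran v2"


conv = UPPER2LOWER("in.txt", "out.txt")
print(UPPER2LOWER.available_methods())
print(conv())
print(conv(method="v2"))
```

Adding a third method to the toy class only requires defining another function whose name starts with _method_.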

In order to simplify the creation of new converters, you can also use the standalone bioconvert_init. Example:

$ bioconvert_init -i fastq -o fasta > fastq2fasta.py

Of course, you will need to edit the file to add the conversion itself in the appropriate method (e.g. _method_v1).

If you need to include extra arguments, such as a reference file, you may add them, although this is not yet part of the official Bioconvert API. See for instance the SAM2CRAM converter.

One-to-many and many-to-one conversions

One-to-many and many-to-one conversions are now implemented in Bioconvert. There are only two instances so far, namely bioconvert.fastq2fasta_qual and bioconvert.fasta_qual2fastq. There are no many-to-many conversions so far. The underscore character denotes an "and" connection: for example, fasta_qual2fastq means that you need both a FASTA and a QUAL file to create a FASTQ file.

For developers, we ask the input or output formats to be sorted alphabetically to ease the user experience.
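As an illustration of the alphabetical ordering and of the underscore convention, the name of such a converter could be built as follows. The helper multi_converter_name is hypothetical and only mirrors the rule stated above.

```python
def multi_converter_name(input_fmts, output_fmts):
    """Build a converter name for several input and/or output formats.

    Hypothetical helper for illustration only; not a Bioconvert function.
    """
    # Sort each side alphabetically and join its formats with "_" ("and").
    left = "_".join(sorted(fmt.lower() for fmt in input_fmts))
    right = "_".join(sorted(fmt.lower() for fmt in output_fmts))
    # The two sides are joined with the usual character 2.
    return f"{left}2{right}"


print(multi_converter_name(["qual", "fasta"], ["fastq"]))
print(multi_converter_name(["fastq"], ["qual", "fasta"]))
```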

How to add a new method to an existing converter

As shown above, use this code and add it to the relevant file in ./bioconvert directory:

def _method_UniqueName(self, *args, **kwargs):
    # from kwargs, you can use any kind of arguments.
    # threads is an example, reference, another example.
    # Your code here below
    pass

Then, it will be available in the class and bioconvert automatically; the bioconvert executable should show the name of your new method in the help message.

In order to add your new method, you can add:

  • Pure Python code

  • Python code that relies on third-party library. If so, you may use:

    • Python libraries available on PyPI. Please add the library name to the requirements.txt

    • if the Python library requires lots of compilation and is available on bioconda, you may add the library name to the requirements_tools.txt instead.

  • Third party tools available on bioconda (e.g., squizz, seqtk, etc) that you can add to the requirements_tools.txt

  • Perl and Go code are also accepted. If so, use self.install_tool(NAME) and add a script in ./misc/install_NAME.sh

Decorators

Decorators have been defined in bioconvert/core/decorators.py that can be used to "flag" or "modify" conversion methods:

  • @in_gz can be used to indicate that the method is able to transparently handle input files that are compressed in .gz format. This is done by adding an in_gz attribute (set to True) to the method.

  • @compressor will wrap the method in code that handles input decompression from .gz format and output compression to .gz, .bz2 or .dsrc. This automatically applies @in_gz.

    Example:

@compressor
def _method_noncompressor(self, *args, **kwargs):
    """This method does not handle compressed input or output by itself."""
    pass
# The decorator transforms the method that now handles compressed
# input and output; the method has an in_gz attribute (which is set to True)

  • @out_compressor will wrap the method in code that handles output compression to .gz, .bz2 or .dsrc. It is intended to be used on methods that already handle compressed input transparently, and therefore do not need the input decompression provided by @compressor. Typically, one would also apply @in_gz to such methods. In that case, @in_gz should be applied "on top" of @out_compressor. The reason is that decorators closest to the function are applied first, and applying another decorator on top of @in_gz would typically not preserve the in_gz attribute. Example:

@in_gz
@out_compressor
def _method_incompressor(self, *args, **kwargs):
    """This method already handles compressed .gz input."""
    pass
# This results in a method that handles compressed input and output
# This method is further modified to have an in_gz attribute
# (which is set to True)
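The ordering issue can be reproduced with two toy decorators. The functions below reuse the names in_gz and out_compressor for readability, but they are simplified stand-ins written for this sketch, under the assumption that the wrapping decorator does not copy attributes from the wrapped function; they are not the implementations found in bioconvert/core/decorators.py.

```python
def in_gz(func):
    """Toy version: tag the function as accepting gzipped input."""
    func.in_gz = True
    return func


def out_compressor(func):
    """Toy version: return a new wrapper object around the function."""
    def wrapped(*args, **kwargs):
        # A real implementation would compress the output here.
        return func(*args, **kwargs)
    return wrapped


@in_gz
@out_compressor
def good_order():
    return "done"


@out_compressor
@in_gz
def bad_order():
    return "done"


# The tag set last (on top) survives; the tag set first is hidden
# behind the wrapper created by out_compressor.
print(hasattr(good_order, "in_gz"))
print(hasattr(bad_order, "in_gz"))
```

In the second case the attribute still exists on the inner function, but code inspecting the decorated method no longer sees it.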

Another bioconvert decorator is called requires.

It should be used to annotate a method with the type of tools it needs to work.

It is important to decorate all methods with the requires decorator so that the user interface can tell which tools are properly installed. You can use 4 arguments as explained in bioconvert.core.decorators:

 1@requires_nothing
 2def _method_python(self, *args, **kwargs):
 3    # a pure Python code does not require extra libraries
 4    with open(self.outfile, "w") as fasta, open(self.infile, "r") as fastq:
 5        for (name, seq, _) in FASTQ2FASTA.readfq(fastq):
 6            fasta.write(">{}\n{}\n".format(name, seq))
 7
 8@requires(python_library="mappy")
 9def _method_mappy(self, *args, **kwargs):
10    with open(self.outfile, "w") as fasta:
11        for (name, seq, _) in fastx_read(self.infile):
12            fasta.write(">{}\n{}\n".format(name, seq))
13
14@requires("awk")
15def _method_awk(self, *args, **kwargs):
16    # Note1: since we use .format, we need to escape the { and } characters
17    # Note2: the \n need to be escaped for Popen to work
18    awkcmd = """awk '{{printf(">%s\\n",substr($0,2));}}' """
19    cmd = "{} {} > {}".format(awkcmd, self.infile, self.outfile)
20    self.execute(cmd)

On line 1, we decorate the method with the requires_nothing() decorator because the method is implemented in Pure Python.

On line 8, we decorate the method with the requires() decorator to inform bioconvert that the method relies on the external Python library called mappy.

On line 14, we decorate the method with the requires() decorator to inform bioconvert that the method relies on an external tool called awk. In theory, you should write:

@requires(external_binary="awk")

but external_binary is the first optional argument so it can be omitted. If several tools are required, you can use:

@requires(external_binaries=["awk", ""])

or:

@requires(python_libraries=["scipy", "pandas"])
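To give an idea of what such a requirement check involves, the sketch below tests whether binaries are on the PATH and whether Python libraries can be imported, using only the standard library. The function missing_requirements is hypothetical; it is not how the requires decorator is implemented in Bioconvert, only an illustration of the kind of test it needs to perform.

```python
import importlib.util
import shutil


def missing_requirements(external_binaries=(), python_libraries=()):
    """Return the names of the requirements that cannot be found.

    Hypothetical helper for illustration only; not a Bioconvert function.
    """
    missing = []
    for binary in external_binaries:
        # shutil.which returns None when the executable is not on the PATH.
        if shutil.which(binary) is None:
            missing.append(binary)
    for library in python_libraries:
        # find_spec returns None when a top-level module cannot be imported.
        if importlib.util.find_spec(library) is None:
            missing.append(library)
    return missing


# "json" ships with Python, so nothing is reported as missing here.
print(missing_requirements(python_libraries=["json"]))
# The result of this call depends on the tools installed on your system.
print(missing_requirements(external_binaries=["awk"], python_libraries=["mappy"]))
```

A method whose requirements are all found can then be offered to the user, while the others can be flagged as unavailable.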

Note

For more general explanations about decorators, see https://stackoverflow.com/a/1594484/1878788.

How to add a test

Following the example from above (fastq2fasta), we need to add a test file. To do so, go to the ./test directory and add a file named test_fastq2fasta.py.

 1import pytest
 2
 3from bioconvert.fastq2fasta import FASTQ2FASTA
 4from bioconvert import bioconvert_data
 5from easydev import TempFile, md5
 6
 7from . import test_dir
 8
 9@pytest.mark.parametrize("method", FASTQ2FASTA.available_methods)
10def test_fastq2fasta(method):
11    # your code here
12    # you will need data, for instance "mydata.fastq" and "mydata.fasta".
13    # Put it in bioconvert/bioconvert/data
14    # you can then use ::
15    infile = f"{test_dir}/data/fastq/test_mydata.fastq"
16    expected_outfile = f"{test_dir}/data/fasta/test_mydata.fasta"
17    with TempFile(suffix=".fasta") as tempfile:
18        converter = FASTQ2FASTA(infile, tempfile.name)
19        converter(method=method)
20
21        # Check that the output is correct with a checksum
22        assert md5(tempfile.name) == md5(expected_outfile)

In Bioconvert, we use pytest as our test framework. In principle, we need one test function per method found in the converter. Here on line 9 we serialize the tests by looping through the methods available in the converter using the pytest.mark.parametrize function. That way, the test file remains short and does not need to be duplicated.

How to add a test file

Files used for testing should be added in ./bioconvert/test/data/ext/converter_name.ext.

How to locally run the tests

Go to the source directory of Bioconvert.

If not already done, install all packages required for testing:

cd bioconvert
pip3 install .[testing]

Then, run the tests using:

pytest test/ -v

Or, to run a specific test file, for example for your new converter fastq2fasta:

pytest test/test_fastq2fasta.py -v

or

pytest -v -k test_fastq2fasta

How to benchmark your new method vs others

from bioconvert import Benchmark
from bioconvert.fastq2fasta import FASTQ2FASTA
converter = FASTQ2FASTA(infile, outfile)
b = Benchmark(converter)
b.plot()

You can also use the bioconvert standalone with the -b option.

How to add your new converter to the main documentation?

Edit the doc/ref_converters.rst and add this code (replacing A2B by your conversion):

.. automodule:: bioconvert.A2B
    :members:
    :synopsis:
    :private-members:

and update the autosummary section:

.. autosummary::

    bioconvert.A2B

pep8 and conventions

In order to write your Python code, use PEP8 convention as much as possible. Follow the conventions used in the code. For instance,

class A():
    """Some documentation"""

    def __init__(self):
        """some doc"""
        pass

    def another_method(self):
        """some doc"""
        c = 1 + 2


class B():
    """Another class"""

    def __init__(self, *args, **kwargs):
        """some doc"""
        pass


def AFunction(x):
    """some doc"""
    return x

  • 2 blank lines between classes and functions

  • 1 blank line between methods

  • spaces around operators (e.g. =, +)

  • Try to have 80 characters max on one line

  • Add documentation in triple quotes

Since v0.5.2, we apply black on the different Python modules.

Requirements files

  • requirements.txt : should contain the packages to be retrieved from PyPI only. Those are downloaded and installed (if missing) when using python setup.py install

  • environment_rtd.yml : do not touch. Simple file for readthedocs

  • readthedocs.yml : all conda and pip dependencies to run the example and build the doc

  • environment.yml is a conda list of all dependencies

How to update bioconvert on bioconda

Fork the bioconda-recipes GitHub repository and clone it locally. Follow the instructions on https://bioconda.github.io/contributing.html

In a nutshell, install bioconda-utils, then clone your fork:

git clone YOURFORKED_REPOSITORY
cd bioconda-recipes

Edit the bioconvert recipe and update its contents. If a new version exists on PyPI, you need to change the md5sum in recipes/bioconvert/meta.yaml.

Check the recipe:

bioconda-utils build  recipes/ config.yml --packages bioconvert

Finally, commit and create a PR:

#git push -u origin my-recipe
git commit .
git push

Sphinx Documentation

In order to update the documentation, go to the ./doc directory and update any of the .rst files. Then, for Linux users, just type:

make html

Regarding the Formats page, we provide a simple ontology with 3 entries: Type, Format and Status. Please choose one of the following values:

  • Type: sequence, assembly, alignement, other, index, variant, database, compression

  • Format: binary, human-readable

  • Status: deprecated, included, not included

Docker

In order to build the Docker image, use this command:

docker build .

The Dockerfile found next to setup.py is self-contained and has been tested for v0.5.2; it uses the spec-file.txt that was generated in a conda environment using:

conda list --explicit

Benchmarking

Introduction

Converters (e.g. FASTQ2FASTA) may have several methods implemented. A developer may also want to compare their own methods with those available in Bioconvert.

In order to help developers compare their methods, we provide a benchmark framework.

Of course, the first thing to do is to add your new method inside the converter (see Developer guide) and use the method boxplot_benchmark().

Then, you have two options: either use the bioconvert command or use the bioconvert Python library. In both cases you will first need a local data set as input file. We do not provide such files inside Bioconvert. We have a tool to generate a random FastQ file (the fastq simulator module, used in the example below), but this is not generalised to all input formats.

So, you could use the following code to run the benchmark from Python:

# Generate the dummy data, saving the results in a temporary file
from easydev import TempFile
from bioconvert.simulator.fastq import FastqSim

infile = TempFile(suffix=".fastq")
outfile = TempFile(suffix=".fasta")
fs = FastqSim(infile.name)
fs.nreads = 1000 # 1,000,000 by default
fs.simulate()

# Perform the benchmarking
from bioconvert.fastq2fasta import FASTQ2FASTA
c = FASTQ2FASTA(infile.name, outfile.name)
c.compute_benchmark(N=10)

# you may study the memory or CPU usage using mode="CPU" or mode="memory"
c.boxplot_benchmark(mode="time")

infile.delete()
outfile.delete()


_images/benchmarking-1.png

Here, each available method is run 10 times (N=10 in compute_benchmark) and boxplot_benchmark plots the results.

Be aware that the pure Python methods may be faster for small data and slower for large data. Indeed, each method has an intrinsic delay to start the processing. Therefore, benchmarking needs large files to be meaningful!

If we use 1,000,000 reads instead of just 1,000, we would get different results (which may change depending on your system and IO performance):

_images/benchmark_1000000.png

Here, the results are more robust and reproducible.
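The principle of running each method several times and comparing the resulting distributions can be sketched with the standard library alone. The function time_methods and the two toy methods are hypothetical placeholders; this is not the code used by compute_benchmark, only an illustration of the idea.

```python
import statistics
import time


def time_methods(methods, n=10):
    """Run each callable n times and return {name: [durations in seconds]}."""
    results = {}
    for name, func in methods.items():
        durations = []
        for _ in range(n):
            start = time.perf_counter()
            func()
            durations.append(time.perf_counter() - start)
        results[name] = durations
    return results


# Two toy "methods" doing the same job in different ways (placeholders).
data = ["acgt"] * 1000
methods = {
    "comprehension": lambda: [s.upper() for s in data],
    "map": lambda: list(map(str.upper, data)),
}

timings = time_methods(methods, n=10)
for name, durations in timings.items():
    # The median is less sensitive to occasional slow runs than the mean.
    print(name, statistics.median(durations))
```

With such a small input the timings are dominated by overhead and noise, which is the reason why larger files are recommended above.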

Multiple benchmarking for more robustness

With the previous method, even though you can decrease the error bars by using more trials per method, the results may still be biased by local computation load or IO access. We provide a Snakefile here: Snakefile_benchmark, which runs the previous benchmarking several times, so that in the end you have a benchmark of benchmarks. We found it far more robust. Here is an example for the fastq2fasta case where each method was run 3 times and, in each case, 10 instances of conversion were performed. The orange vertical lines give the median and a final statement indicates whether the best method is significantly better than the others.

_images/multi_benchmark.png

Note

The computation can be long and the Snakefile allows you to parallelise it.

Zenodo

The benchmarking requires input files, which can be large. Those files are stored on Zenodo: https://zenodo.org/communities/bioconvert/

References

Core functions

bioconvert.core.base

Main factory of Bioconvert

bioconvert.core.benchmark

Tools for benchmarking

bioconvert.core.converter

Standalone application dedicated to conversion

bioconvert.core.decorators

Provides a general tool to perform pre/post compression

bioconvert.core.downloader

Download singularity image

bioconvert.core.extensions

List of formats and associated extensions

bioconvert.core.graph

Network tools to manipulate the graph of conversion

bioconvert.core.registry

Main bioconvert registry that fetches automatically the relevant converter

bioconvert.core.shell

Simplified version of shell.py module from snakemake package

bioconvert.core.utils

misc utility functions

Base

Main factory of Bioconvert

class ConvArg(names, help, **kwargs)[source]

This class can be used to add specific extra arguments to any converter

For instance, imagine a conversion named A2B that requires the user to provide a reference. Then, you may want to provide the --reference extra argument. This is possible by adding a class method named get_additional_arguments that will yield an instance of this class for each extra argument.

@classmethod
def get_additional_arguments(cls):
    yield ConvArg(
        names="--reference",
        default=None,
        help="the reference"
    )

Then, when calling bioconvert as follows:

bioconvert A2B --help

the new argument will be shown in the list of arguments.
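One way to picture how yielded arguments end up in the command-line help is to feed them to argparse. In the sketch below, ConvArg is a toy class that only mirrors the documented signature ConvArg(names, help, **kwargs), and add_to_parser is a hypothetical method introduced for this example; the real class may register its arguments differently.

```python
import argparse


class ConvArg:
    """Toy stand-in mirroring the signature ConvArg(names, help, **kwargs)."""

    def __init__(self, names, help, **kwargs):
        # Accept a single option string or a list of option strings.
        self.names = [names] if isinstance(names, str) else list(names)
        self.help = help
        self.kwargs = kwargs

    def add_to_parser(self, parser):
        # Hypothetical method: forward everything to argparse.
        parser.add_argument(*self.names, help=self.help, **self.kwargs)


class A2B:
    """Hypothetical converter declaring one extra argument."""

    @classmethod
    def get_additional_arguments(cls):
        yield ConvArg(names="--reference", default=None, help="the reference")


parser = argparse.ArgumentParser(prog="bioconvert A2B")
for arg in A2B.get_additional_arguments():
    arg.add_to_parser(parser)

args = parser.parse_args(["--reference", "ref.fa"])
print(args.reference)
```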

class ConvBase(infile, outfile)[source]

Base class for all converters.

To build a new converter, create a new class which inherits from ConvBase and implement a method that performs the conversion. The name of the converter method must start with _method_.

For instance:

class FASTQ2FASTA(ConvBase):

    def _method_python(self, *args, **kwargs):
        # include your code here. You can use the infile and outfile
        # attributes.
        self.infile
        self.outfile

constructor

Parameters:
  • infile (str) -- the path of the input file.

  • outfile (str) -- the path of the output file

boxplot_benchmark(rot_xticks=90, boxplot_args={}, mode='time')[source]

This function plots the benchmark computed in compute_benchmark()

compute_benchmark(N=5, to_exclude=[], to_include=[])[source]

Simple wrapper to call Benchmark

This function computes the benchmark

see Benchmark for details.

install_tool(executable)[source]

Install the given tool, using the script: bioconvert/install_script/install_executable.sh if the executable is not already present

Parameters:

executable -- executable to install

Returns:

nothing

property name

The name of the class

class ConvMeta(name, bases, namespace, **kwargs)[source]

This metaclass checks that the converter classes have

  • an attribute input_ext

  • an attribute output_ext

This is a meta class used by ConvBase class. For developers only.

make_chain(converter_map)[source]

Create a class performing step-by-step conversions following a path. converter_map is a list of pairs ((in_fmt, out_fmt), converter). It describes the conversion path.
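The idea of following a conversion path step by step can be illustrated with plain functions. The sketch below is a simplified, in-memory analogue written for this example (make_function_chain is a hypothetical name); the documented make_chain creates a converter class, which this sketch does not attempt to reproduce.

```python
def make_function_chain(converter_map):
    """Compose converters along a path (toy, in-memory version).

    converter_map is a list of pairs ((in_fmt, out_fmt), converter),
    where each converter is here a plain function on strings.
    """
    def chained(data):
        for (_in_fmt, _out_fmt), convert in converter_map:
            # Each step consumes the output of the previous step.
            data = convert(data)
        return data
    return chained


# Toy path A -> B -> C built from plain string functions.
path = [
    (("A", "B"), str.strip),
    (("B", "C"), str.upper),
]
a_to_c = make_function_chain(path)
print(a_to_c("  acgt \n"))
```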

Benchmark

Tools for benchmarking

class Benchmark(obj, N=5, to_exclude=None, to_include=None)[source]

Convenient class to benchmark several methods for a given converter

c = BAM2COV(infile, outfile)
b = Benchmark(c, N=5)
b.run_methods()
b.plot()

Constructor

Parameters:
  • obj -- can be an instance of a converter class or a class name

  • N (int) -- number of replicates

  • to_exclude (list) -- methods to exclude from the benchmark

  • to_include (list) -- methods to include ONLY

Use one of to_exclude or to_include. If both are provided, only the to_include one is used.
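The precedence rule between to_include and to_exclude can be written down in a few lines. The function select_methods is a hypothetical sketch of that rule, not the code used inside Benchmark.

```python
def select_methods(available, to_exclude=None, to_include=None):
    """Return the methods to benchmark, giving priority to to_include."""
    if to_include:
        # to_include wins: to_exclude is ignored when both are given.
        return [m for m in available if m in to_include]
    if to_exclude:
        return [m for m in available if m not in to_exclude]
    return list(available)


methods = ["awk", "python", "seqtk"]
print(select_methods(methods, to_exclude=["awk"]))
print(select_methods(methods, to_exclude=["awk"], to_include=["awk", "seqtk"]))
```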

plot(rerun=False, ylabel=None, rot_xticks=0, boxplot_args={}, mode='time')[source]

Plots the benchmark results, running the benchmarks if needed or if rerun is True.

Parameters:
  • rot_xticks -- rotation of the x-tick labels

  • boxplot_args -- dictionary with any of the pylab.boxplot arguments

  • mode -- either time, CPU or memory

Returns:

dataframe with all results

run_methods()[source]

Runs the benchmarks, and stores the timings in self.results.

plot_multi_benchmark_max(path_json, output_filename='multi_benchmark.png', min_ylim=0, mode=None)[source]

Plotting function for the Snakefile_benchmark to be found in the doc

The json file looks like:

{
  "awk":{
    "0":0.777020216,
    "1":0.9638044834,
    "2":1.7623617649,
    "3":0.8348755836
  },
  "seqtk":{
    "0":1.0024843216,
    "1":0.6313509941,
    "2":1.4048073292,
    "3":1.0554351807
  },
  "Benchmark":{
    "0":1,
    "1":1,
    "2":2,
    "3":2
  }
}

The number of benchmarks is inferred from the field 'Benchmark'.
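To show how this structure can be read, the sketch below loads the JSON example above with the standard library, infers the number of benchmarks from the 'Benchmark' field and groups the timings of each method per benchmark. It is an illustration of the file layout only, not the code of plot_multi_benchmark_max.

```python
import json
from collections import defaultdict

# Same content as the JSON example shown above.
payload = """
{
  "awk": {"0": 0.777020216, "1": 0.9638044834, "2": 1.7623617649, "3": 0.8348755836},
  "seqtk": {"0": 1.0024843216, "1": 0.6313509941, "2": 1.4048073292, "3": 1.0554351807},
  "Benchmark": {"0": 1, "1": 1, "2": 2, "3": 2}
}
"""

data = json.loads(payload)
benchmark_ids = data["Benchmark"]

# The number of benchmarks is the number of distinct identifiers.
n_benchmarks = len(set(benchmark_ids.values()))

# Group timings as {method: {benchmark_id: [timings]}}.
grouped = defaultdict(lambda: defaultdict(list))
for method, timings in data.items():
    if method == "Benchmark":
        continue
    for index, value in timings.items():
        grouped[method][benchmark_ids[index]].append(value)

print(n_benchmarks)
print({method: dict(per_bench) for method, per_bench in grouped.items()})
```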

Converter

Standalone application dedicated to conversion

class Bioconvert(infile, outfile, force=False, threads=None, extra=None)[source]

Universal converter used by the standalone

from bioconvert import Bioconvert
c = Bioconvert("test.fastq", "test.fasta", threads=4, force=True)

constructor

Parameters:
  • infile (str) -- The path of the input file.

  • outfile (str) -- The path of the output file

  • force (bool) -- overwrite the output file if it already exists; otherwise an error is raised

Decorators

Provides a general tool to perform pre/post compression

compressor(func)[source]

Decompress/compress input file without pipes

Does not use pipe: we decompress and compress back the input file. The advantage is that it should work for any files (even very large).

This decorator should be used by methods that use pure Python code

in_gz(func)[source]

Marks a function as accepting gzipped input.

make_in_gz_tester(converter)[source]

Generates a function testing whether a conversion method of converter has the in_gz tag.

out_compressor(func)[source]

Compress output file without pipes

This decorator should be used by methods that use pure Python code

requires(external_binary=None, python_library=None, external_binaries=None, python_libraries=None)[source]
Parameters:
  • external_binary -- a system binary required for the method

  • python_library -- a python library required for the method

  • external_binaries -- an array of system binaries required for the method

  • python_libraries -- an array of python libraries required for the method

Returns:

requires_nothing(func)[source]

Marks a function as not needing dependencies.

Downloader

Download singularity image

Extensions

List of formats and associated extensions

class AttrDict(**kwargs)[source]

Copy from easydev package.

update(content)[source]

See class/constructor documentation for details

Parameters:

content (dict) -- a valid dictionary

extensions = {'abi': ['abi', 'ab1'], 'agp': ['agp'], 'bam': ['bam'], 'bcf': ['bcf'], 'bed': ['bed'], 'bedgraph': ['bedgraph', 'bg'], 'bigbed': ['bb', 'bigbed'], 'bigwig': ['bigwig', 'bw'], 'bplink': ['bplink'], 'bz2': ['bz2'], 'cdao': ['cdao'], 'clustal': ['clustal', 'aln', 'clw'], 'cov': ['cov'], 'cram': ['cram'], 'csv': ['csv'], 'dsrc': ['dsrc'], 'embl': ['embl'], 'ena': ['ena'], 'faa': ['faa', 'mpfa', 'aa'], 'fast5': ['fast5'], 'fasta': ['fasta', 'fa', 'fst'], 'fastq': ['fastq', 'fq'], 'genbank': ['genbank', 'gbk', 'gb'], 'gfa': ['gfa'], 'gff2': ['gff'], 'gff3': ['gff3'], 'gtf': ['gtf'], 'gz': ['gz'], 'json': ['json'], 'maf': ['maf'], 'newick': ['newick', 'nw', 'nhx', 'nwk'], 'nexus': ['nexus', 'nx', 'nex', 'nxs'], 'ods': ['ods'], 'paf': ['paf'], 'pdb': ['pdb'], 'phylip': ['phy', 'ph', 'phylip'], 'phyloxml': ['phyloxml', 'xml'], 'plink': ['plink'], 'pod5': ['pod5'], 'qual': ['qual'], 'sam': ['sam'], 'scf': ['scf'], 'sra': ['sra'], 'stockholm': ['sto', 'sth', 'stk', 'stockholm'], 'tsv': ['tsv'], 'twobit': ['2bit'], 'vcf': ['vcf'], 'wig': ['wig'], 'wiggle': ['wig', 'wiggle'], 'xls': ['xls'], 'xlsx': ['xlsx'], 'xmfa': ['xmfa'], 'yaml': ['yaml', 'YAML']}

List of formats and their extensions included in Bioconvert

Graph

Network tools to manipulate the graph of conversion

create_graph(filename, layout='dot', use_singularity=False, color_for_disabled_converter='red', include_subgraph=False)[source]
Parameters:

filename -- should end in .png or .svg or .dot

If the extension is .dot, only the dot file is created without annotations. This is useful if you have issues installing graphviz. If so, under Linux you could use our singularity container (see github.com/cokelaer/graphviz4all).

create_graph_for_cytoscape(all_converter=False)[source]
Parameters:

all_converter -- use all converters or only the ones available in the current installation

Returns:

Registry

Main bioconvert registry that fetches automatically the relevant converter

class Registry[source]

class to centralise information about available conversions

from bioconvert.core.registry import Registry
r = Registry()
r.conversion_exists("BAM", "BED")
r.info()  # returns number of available methods for each converter

conv_class = r[(".bam", ".bed")]
converter = conv_class(input_file, output_file)
converter.convert()

conversion_exists(input_fmt, output_fmt, allow_indirect=False)[source]
Parameters:
  • input_fmt (str) -- the input format

  • output_fmt (str) -- the output format

  • allow_indirect (boolean) -- whether to count indirect conversions

Returns:

True if a converter which transforms input_fmt into output_fmt exists

Return type:

boolean

conversion_path(input_fmt, output_fmt)[source]

Return a list of conversion steps to get from the input format to the output format

Parameters:

Each step in the list is a pair of formats.
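A path made of pairs of formats can be found with a breadth-first search over the graph of direct conversions. The sketch below uses abstract format names and a standalone function that takes the list of edges explicitly; it illustrates the idea only and is not the algorithm implemented in the Registry.

```python
from collections import deque


def conversion_path(edges, input_fmt, output_fmt):
    """Return the shortest list of (in_fmt, out_fmt) steps, or None."""
    # Build an adjacency list from the direct conversions.
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    queue = deque([input_fmt])
    previous = {input_fmt: None}
    while queue:
        current = queue.popleft()
        if current == output_fmt:
            break
        for neighbour in graph.get(current, []):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    else:
        # The queue was exhausted without reaching the output format.
        return None

    # Walk back from the output format to rebuild the steps.
    steps = []
    node = output_fmt
    while previous[node] is not None:
        steps.append((previous[node], node))
        node = previous[node]
    return steps[::-1]


# Toy graph of direct conversions between abstract formats.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
print(conversion_path(edges, "A", "D"))
print(conversion_path(edges, "D", "A"))
```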

get_all_conversions()[source]
Returns:

a generator which allows iterating over all available conversions and their availability; a conversion is encoded by a tuple of 2 strings (input format, output format)

Return type:

generator (input format, output format, status)

get_conversions()[source]
Returns:

a generator which allows iterating over all available conversions; a conversion is encoded by a tuple of 2 strings (input format, output format)

Return type:

generator

get_conversions_from_ext()[source]
Returns:

a generator which allows iterating over all available conversions; a conversion is encoded by a tuple of 2 strings (input extension, output extension)

Return type:

generator

get_converters_names()[source]
Returns:

a generator yielding the name of each converter from the subclass (ConvBase object)

Return type:

generator

get_ext(ext_pair)[source]

Copy the registry into a dict that behaves like a list to be able to have multiple values for a single key and from a key have all converter able to do the conversion from the input extension to the output extension.

Parameters:

ext_pair (tuple of 2 strings) -- the input extension, the output extension

Returns:

list of objects of subclass of ConvBase

iter_converters(allow_indirect: bool = False)[source]
Parameters:

allow_indirect (bool) -- also return indirect conversion

Returns:

a generator to iterate over (in_fmt, out_fmt, converter class when direct, path when indirect)

Return type:

a generator

set_ext(ext_pair, convertor)[source]

Register new convertor from input extension and output extension in a list. We can have a list of multiple convertors for one ext_pair.

Parameters:
  • ext_pair (tuple) -- tuple containing the input extensions and the output extensions e.g. ( ("fastq",) , ("fasta") )

  • convertor (list of ConvBase object) -- the convertor which handle the conversion from input_ext -> output_ext

Utils

misc utility functions

class TempFile(suffix='', dir=None)[source]

A small wrapper around tempfile.NamedTemporaryFile function

f = TempFile(suffix="csv")
f.name
f.delete() # alias to delete=False and close() calls

Copy from easydev package

class Timer(times)[source]

Timer working with the with statement

Copy from easydev package.
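A timer usable in a with statement can be sketched as a small context manager that appends the elapsed time to the list it was given. The class below follows the documented signature Timer(times) but is written from scratch for this example, so its behaviour is an assumption and may differ from the easydev implementation.

```python
import time


class Timer:
    """Toy context manager appending the elapsed time (seconds) to a list."""

    def __init__(self, times):
        self.times = times

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.times.append(time.perf_counter() - self.start)
        # Returning False lets any exception propagate normally.
        return False


times = []
with Timer(times):
    sum(range(1000))
print(times)
```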

generate_outfile_name(infile, out_extension)[source]

simple utility to replace the file extension with the given one.

Parameters:
  • infile (str) -- the path to the Input file

  • out_extension (str) -- Desired extension

Returns:

The file path with the given extension

Return type:

str
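Replacing an extension can be sketched with os.path.splitext. The function replace_extension below is a hypothetical equivalent written for illustration; it makes no claim about how generate_outfile_name handles compressed or multi-part extensions.

```python
import os


def replace_extension(infile, out_extension):
    """Return infile with its last extension replaced by out_extension."""
    # splitext only splits off the last extension (e.g. ".fastq").
    root, _ = os.path.splitext(infile)
    # Accept the new extension with or without a leading dot.
    return f"{root}.{out_extension.lstrip('.')}"


print(replace_extension("data/test.fastq", "fasta"))
```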

get_extension(filename, remove_compression=False)[source]

Return extension of a filename

>>> get_extension("test.fastq")
fastq
>>> get_extension("test.fastq.gz")
fastq

get_format_from_extension(extension)[source]

get format from extension.

Parameters:

extension -- the extension

Returns:

the corresponding format

Return type:

str
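The lookup from extension to format amounts to searching the extensions dictionary shown in the Extensions section. The sketch below uses a small subset of that dictionary and a hypothetical function name, format_from_extension; how the real function handles unknown or ambiguous extensions is not covered here.

```python
# Subset of the extensions dictionary listed in the Extensions section.
EXTENSIONS = {
    "fasta": ["fasta", "fa", "fst"],
    "fastq": ["fastq", "fq"],
    "genbank": ["genbank", "gbk", "gb"],
}


def format_from_extension(extension):
    """Return the format whose extension list contains the given extension."""
    extension = extension.lstrip(".")
    for fmt, known_extensions in EXTENSIONS.items():
        if extension in known_extensions:
            return fmt
    raise ValueError(f"no format found for extension: {extension}")


print(format_from_extension("fq"))
print(format_from_extension(".gbk"))
```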

md5(fname, chunk=65536)[source]

Return the MD5 checksum of a file
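Reading the file in chunks keeps memory usage constant even for large files. The sketch below shows one way to do this with hashlib; md5_of_file is a hypothetical name and the code is an illustration of the chunked approach suggested by the chunk parameter, not the library's source.

```python
import hashlib
import os
import tempfile


def md5_of_file(fname, chunk=65536):
    """Return the hexadecimal MD5 digest of a file, read chunk by chunk."""
    digest = hashlib.md5()
    with open(fname, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


# Write a tiny FASTA file and compute its checksum with a small chunk size.
content = b">seq1\nACGT\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
path = tmp.name

checksum = md5_of_file(path, chunk=4)
os.remove(path)
print(checksum)
```

Two files with the same checksum can be considered identical for the purpose of the tests described in the developer guide.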

Reference converters

Summary

bioconvert.abi2fasta

Convert ABI format to FASTA format

bioconvert.abi2fastq

Convert ABI format to FASTQ format

bioconvert.abi2qual

Convert ABI format to QUAL format

bioconvert.bam2bedgraph

Convert BAM format to BEDGRAPH format

bioconvert.bam2cov

Convert BAM format to COV format

bioconvert.bam2bigwig

Convert BAM file to BIGWIG format

bioconvert.bam2cram

Convert BAM file to CRAM format

bioconvert.bam2fasta

Convert BAM format to FASTA format

bioconvert.bam2fastq

Convert BAM format to FASTQ format

bioconvert.bam2json

Convert BAM format to JSON format

bioconvert.bam2sam

Convert BAM file to SAM format

bioconvert.bam2tsv

Convert BAM file to TSV format

bioconvert.bam2wiggle

Convert BAM to WIGGLE format

bioconvert.bcf2vcf

Convert BCF file to VCF format

bioconvert.bcf2wiggle

Convert BCF format to WIGGLE format

bioconvert.bed2wiggle

Convert BED format to WIGGLE format

bioconvert.bedgraph2cov

Convert BEDGRAPH file to COV format

bioconvert.bedgraph2bigwig

Convert BEDGRAPH to BIGWIG format

bioconvert.bedgraph2wiggle

Convert BEDGRAPH format to WIGGLE format

bioconvert.bigbed2wiggle

Convert BIGBED format to WIGGLE format

bioconvert.bigbed2bed

Convert BIGBED format to BED format

bioconvert.bigwig2bedgraph

Convert BIGWIG to BEDGRAPH format

bioconvert.bigwig2wiggle

Convert BIGWIG format to WIGGLE format

bioconvert.bplink2plink

Convert BPLINK to PLINK format

bioconvert.bz22gz

Convert BZ2 to GZ format

bioconvert.clustal2fasta

Convert CLUSTAL to FASTA format

bioconvert.clustal2phylip

Convert CLUSTAL to PHYLIP format

bioconvert.clustal2stockholm

Convert CLUSTAL to STOCKHOLM format

bioconvert.cram2bam

Convert CRAM file to BAM format

bioconvert.cram2fasta

Convert CRAM file to FASTA format

bioconvert.cram2fastq

Convert CRAM file to FASTQ format

bioconvert.cram2sam

Convert CRAM file to SAM format

bioconvert.csv2tsv

Convert CSV format to TSV format

bioconvert.csv2xls

Convert CSV to XLS format

bioconvert.dsrc2gz

Convert a compressed FASTQ from DSRC to GZ format

bioconvert.embl2fasta

Convert EMBL file to FASTA format

bioconvert.embl2genbank

Convert EMBL file to GENBANK format

bioconvert.fasta_qual2fastq

Convert FASTA and QUAL formats to FASTQ format

bioconvert.fasta2clustal

Convert FASTA to CLUSTAL format

bioconvert.fasta2faa

Convert FASTA format to FAA format

bioconvert.fasta2fasta_agp

Convert FASTA (scaffold) to FASTA (contig) and AGP formats

bioconvert.fasta2fastq

Convert FASTA format to FASTQ format

bioconvert.fasta2genbank

Convert FASTA to GENBANK format

bioconvert.fasta2nexus

Convert FASTA to NEXUS format

bioconvert.fasta2phylip

Convert FASTA to PHYLIP format

bioconvert.fasta2twobit

Convert FASTA to TWOBIT format

bioconvert.fastq2fasta

Convert FASTQ to FASTA format

bioconvert.genbank2embl

Convert GENBANK to EMBL format

bioconvert.genbank2fasta

Convert GENBANK to FASTA format

bioconvert.genbank2gff3

Convert GENBANK to GFF3 format

bioconvert.gff22gff3

Convert GFF2 to GFF3 format

bioconvert.gff32gff2

Convert GFF3 to GFF2 format

bioconvert.gfa2fasta

Convert GFA to FASTA format

bioconvert.gz2bz2

Convert GZ file to BZ2 format

bioconvert.gz2dsrc

Convert GZ to DSRC format

bioconvert.json2yaml

Convert JSON to YAML format

bioconvert.maf2sam

Convert MAF file to SAM format

bioconvert.newick2nexus

Converts NEWICK file to NEXUS format.

bioconvert.newick2phyloxml

Converts NEWICK file to PHYLOXML format.

bioconvert.nexus2fasta

Convert NEXUS to FASTA format

bioconvert.nexus2newick

Converts NEXUS file to NEWICK format.

bioconvert.nexus2phylip

Converts NEXUS file to PHYLIP format.

bioconvert.nexus2phyloxml

Converts NEXUS file to PHYLOXML format.

bioconvert.ods2csv

Convert ODS format to CSV format

bioconvert.pdb2faa

Convert PDB to FAA format

bioconvert.phylip2clustal

Converts PHYLIP file to CLUSTAL format.

bioconvert.phylip2fasta

Converts PHYLIP file to FASTA format.

bioconvert.phylip2nexus

Converts PHYLIP file to NEXUS format.

bioconvert.phylip2stockholm

Converts PHYLIP file to STOCKHOLM format.

bioconvert.phylip2xmfa

Converts PHYLIP file to XMFA format.

bioconvert.phyloxml2newick

Converts PHYLOXML file to NEWICK format.

bioconvert.phyloxml2nexus

Converts PHYLOXML file to NEXUS format.

bioconvert.plink2bplink

Convert PLINK to BPLINK

bioconvert.sam2bam

Convert SAM file to BAM format

bioconvert.sam2cram

Convert SAM file to CRAM format

bioconvert.sam2paf

Convert SAM file to PAF format

bioconvert.scf2fasta

Convert SCF file to FASTA file

bioconvert.scf2fastq

Convert SCF file to FASTQ file

bioconvert.sra2fastq

Convert SRA format to FASTQ format

bioconvert.stockholm2clustal

Converts STOCKHOLM file to CLUSTAL file.

bioconvert.stockholm2phylip

Converts STOCKHOLM to PHYLIP format.

bioconvert.tsv2csv

Convert TSV format to CSV format

bioconvert.twobit2fasta

Conversion from TWOBIT to FASTA format

bioconvert.vcf2bcf

Convert VCF to BCF format

bioconvert.vcf2bed

Convert VCF to BED3 file

bioconvert.vcf2wiggle

Convert VCF format to WIGGLE format

bioconvert.xls2csv

Convert XLS format to CSV format

bioconvert.xlsx2csv

Convert XLSX format to CSV format

bioconvert.xmfa2phylip

Convert XMFA to PHYLIP format

bioconvert.yaml2json

Convert YAML to JSON format

All converters documentation

Convert ABI format to FASTA format

class ABI2FASTA(infile, outfile, *args, **kargs)[source]

Convert ABI file to FASTA file

ABI files are created by ABI sequencing machines and include PHRED quality scores for base calls. This allows the creation of FASTA files.

Method implemented is based on BioPython [BIOPYTHON].

constructor

Parameters:
  • infile (str) -- input ABI file

  • outfile (str) -- output FASTA filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Reference:

Bio.SeqIO Documentation
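
The converter classes documented in this section share a visible pattern: a _default_method attribute naming one of several _method_<name> implementations. The sketch below illustrates that naming convention only. It is not Bioconvert's base class: ToyConverter, available_methods, the __call__ signature and the upper/lower methods are all hypothetical.

```python
class ToyConverter:
    """Hypothetical class mimicking the _method_<name> naming convention."""

    _default_method = "upper"

    def __init__(self, infile, outfile):
        self.infile = infile
        self.outfile = outfile

    def available_methods(self):
        # Collect every attribute whose name starts with _method_
        prefix = "_method_"
        return sorted(
            name[len(prefix):] for name in dir(self) if name.startswith(prefix)
        )

    def __call__(self, text, method=None):
        # Fall back on the default when no method is requested
        method = method or self._default_method
        return getattr(self, "_method_" + method)(text)

    def _method_upper(self, text):
        return text.upper()

    def _method_lower(self, text):
        return text.lower()


converter = ToyConverter("input.txt", "output.txt")
print(converter.available_methods())
print(converter("AcGt"))
```

Looking methods up by name keeps the choice of implementation a plain string, which is convenient when several tools can perform the same conversion.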

Convert ABI format to FASTQ format

class ABI2FASTQ(infile, outfile, *args, **kargs)[source]

Convert ABI file to FASTQ file

ABI files are created by ABI sequencing machines and include PHRED quality scores for base calls. This allows the creation of FASTQ files.

Method implemented is based on BioPython [BIOPYTHON].

constructor

Parameters:
  • infile (str) -- input ABI file

  • outfile (str) -- output FASTQ filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

Convert ABI format to QUAL format

class ABI2QUAL(infile, outfile, *args, **kargs)[source]

Convert ABI file to QUAL file

ABI files are created by ABI sequencing machines and include PHRED quality scores for base calls. This allows the creation of QUAL files.

Method implemented is based on BioPython [BIOPYTHON].

constructor

Parameters:
  • infile (str) -- input ABI file

  • outfile (str) -- output QUAL filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

Convert BAM format to BEDGRAPH format

class BAM2BEDGRAPH(infile, outfile)[source]

Convert sorted BAM file into BEDGRAPH file

Compute the coverage (depth) in BEDGRAPH. Regions with zero coverage are also reported.

Note that this BEDGRAPH format is of the form:

chrom chromStart chromEnd dataValue

Note that consecutive positions with same values are compressed.

chr1    0   75  0
chr1    75  176 1
chr1    176  177 2
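
The compression of consecutive positions can be sketched in a few lines of Python. This is an illustration of the idea only, not what bedtools or mosdepth do internally; the function name compress_coverage is hypothetical, and intervals are assumed to be 0-based with an exclusive end.

```python
from itertools import groupby


def compress_coverage(chrom, depths):
    """Merge runs of equal depth into (chrom, start, end, depth) intervals.

    depths[i] is the coverage at 0-based position i; end is exclusive.
    """
    intervals = []
    position = 0
    for depth, run in groupby(depths):
        length = sum(1 for _ in run)
        intervals.append((chrom, position, position + length, depth))
        position += length
    return intervals


# 75 uncovered positions, 101 positions covered once, 1 position covered twice
depths = [0] * 75 + [1] * 101 + [2]
for row in compress_coverage("chr1", depths):
    print(*row, sep="\t")
```

With this input, the three printed rows match the three example lines shown above.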

Warning

the BAM file must be sorted. This can be achieved with bamtools.

Methods available are based on bedtools [BEDTOOLS] and mosdepth [MOSDEPTH].

Constructor

Parameters:
  • infile (str) -- The path to the input BAM file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'bedtools'

Default value

_method_bedtools(*args, **kwargs)[source]

Do the conversion using bedtools.

bedtools documentation

_method_mosdepth(*args, **kwargs)[source]

Do the conversion using mosdepth.

mosdepth documentation

Convert BAM format to COV format

class BAM2COV(infile, outfile)[source]

Convert sorted BAM file into COV file

Note that the COV format is of the form:

chr1    1   0
chr1    2   0
chr1    3   0
chr1    4   0
chr1    5   0

that is contig name, position, coverage.

Warning

The BAM file must be sorted. This can be achieved with bamtools, using: bamtools sort -in INPUT.bam

Methods available are based on samtools [SAMTOOLS] or bedtools [BEDTOOLS].

Constructor

Parameters:
  • infile (str) -- The path to the input BAM file. It must be sorted.

  • outfile (str) -- The path to the output file

_method_bedtools(*args, **kwargs)[source]

Do the conversion sorted BAM -> COV using bedtools

bedtools documentation

_method_samtools(*args, **kwargs)[source]

Do the conversion sorted BAM -> COV using samtools

SAMtools documentation

Convert BAM file to BIGWIG format

class BAM2BIGWIG(infile, outfile, *args, **kargs)[source]

Convert BAM file to BIGWIG file

Convert BAM into a binary version of WIG format.

Methods are based on bamCoverage [DEEPTOOLS] and bedGraphToBigWig from wiggletools [WIGGLETOOLS]. The wiggletools method requires an extra argument (--chrom-sizes); therefore the default one is bamCoverage for now.

Moreover, the two methods do not return exactly the same info!

You can check this by using bioconvert to convert into a human readable file such as wiggle. We will use the bamCoverage as our default conversion.

constructor

Parameters:
  • infile (str) -- input BAM file

  • outfile (str) -- output BIGWIG filename

_default_method = 'bamCoverage'

Default value

_method_bamCoverage(*args, **kwargs)[source]

run bamCoverage package.

bamCoverage documentation

_method_ucsc(*args, **kwargs)[source]

Run ucsc tool bedGraphToBigWig.

Requires an extra argument (chrom_sizes), required by the bioconvert standalone.

Convert BAM file to CRAM format

class BAM2CRAM(infile, outfile, *args, **kargs)[source]

Convert BAM file to CRAM file

The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.

Methods available are based on samtools [SAMTOOLS].

constructor

Parameters:
  • infile (str) -- input BAM file

  • outfile (str) -- output CRAM filename

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Here we use the SAMtools tool.

SAMtools documentation

Convert BAM format to FASTA format

class BAM2FASTA(infile, outfile)[source]

Convert sorted BAM file into FASTA file

Methods available are based on samtools [SAMTOOLS] or bedtools [BEDTOOLS].

Warning

Using the bedtools method, the R1 and R2 reads must be next to each other so that the reads are sorted similarly

Warning

There is no guarantee that the R1/R2 output files are sorted similarly in the paired-end case, due to supplementary and secondary reads.

constructor

Parameters:
  • infile (str) -- BAM file

  • outfile (str) -- FASTA file

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

do the conversion BAM -> FASTA using samtools.

Note

fasta are on one line

SAMtools documentation

Convert BAM format to FASTQ format

class BAM2FASTQ(infile, outfile)[source]

Convert sorted BAM file into FASTQ file

Methods available are based on samtools [SAMTOOLS] or bedtools [BEDTOOLS].

Warning

Using the bedtools method, the R1 and R2 reads must be next to each other so that the reads are sorted similarly

Warning

There is no guarantee that the R1/R2 output files are sorted similarly in the paired-end case, due to supplementary and secondary reads.

constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_default_method = 'samtools'

Default value

_method_bedtools(*args, **kwargs)[source]

Do the conversion BAM -> Fastq using bedtools

bedtools documentation

_method_samtools(*args, **kwargs)[source]

Do the conversion BAM -> FASTQ using samtools

SAMtools documentation

Convert BAM format to JSON format

class BAM2JSON(infile, outfile)[source]

Convert BAM format to JSON file

Methods available are based on bamtools [BAMTOOLS].

constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_default_method = 'bamtools'

Default value

_method_bamtools(*args, **kwargs)[source]

Do the conversion BAM -> JSON using bamtools.

BAMTools documentation

Convert BAM file to SAM format

class BAM2SAM(infile, outfile, *args, **kargs)[source]

Convert BAM file to SAM file

Methods available are based on samtools [SAMTOOLS], sam-to-bam [SAMTOBAM], sambamba [SAMBAMBA] and pysam [PYSAM].

constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_default_method = 'sambamba'

default value

_method_pysam(*args, **kwargs)[source]

We use here the python module Pysam.

Pysam documentation

_method_sambamba(*args, **kwargs)[source]

Here we use the Sambamba tool. This is the default method because it is the fastest.

Sambamba documentation

_method_samtools(*args, **kwargs)[source]

Here we use the SAMtools tool.

SAMtools documentation

Convert BAM file to TSV format

class BAM2TSV(infile, outfile, *args, **kargs)[source]

Convert sorted BAM file into TSV stats

This is not a conversion per se but the extraction of BAM statistics saved into a TSV format. The 4 columns of the TSV file are:

Reference sequence name, Sequence length, Mapped reads, Unmapped reads
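
A file with these four tab-separated columns can be read back with the standard csv module. The sketch below assumes there is no header line; the dictionary keys in COLUMNS, the function name read_stats and the example values are hypothetical.

```python
import csv
import io

# Hypothetical key names for the four documented columns
COLUMNS = ("reference", "length", "mapped", "unmapped")


def read_stats(handle):
    """Return one dict per line of a 4-column, tab-separated stats file."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        name, length, mapped, unmapped = fields
        values = (name, int(length), int(mapped), int(unmapped))
        rows.append(dict(zip(COLUMNS, values)))
    return rows


# Made-up content standing in for a real output file
example = io.StringIO("chr1\t1000\t250\t3\nchr2\t500\t40\t0\n")
stats = read_stats(example)
print(sum(row["mapped"] for row in stats))
```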

Methods are based on samtools [SAMTOOLS] and pysam [PYSAM].

constructor

Parameters:
  • infile (str) -- BAM file

  • outfile (str) -- TSV file

_default_method = 'samtools'

Default value

_method_pysam(*args, **kwargs)[source]

We use here the python module Pysam.

Pysam documentation

_method_samtools(*args, **kwargs)[source]

Here we use the SAMtools tool.

SAMtools documentation

Convert BAM to WIGGLE format

class BAM2WIGGLE(infile, outfile)[source]

Convert sorted BAM file into WIGGLE file

Methods available are based on wiggletools [WIGGLETOOLS].

Parameters:
  • infile (str) -- The path to the input BAM file. It must be sorted.

  • outfile (str) -- The path to the output file

_method_wiggletools(*args, **kwargs)[source]

Conversion using wiggletools

wiggletools documentation

Convert BCF file to VCF format

class BCF2VCF(infile, outfile, *args, **kargs)[source]

Convert BCF file to VCF file

Methods available are based on bcftools [BCFTOOLS].

constructor

Parameters:
  • infile (str) -- input BCF file

  • outfile (str) -- output VCF file

_method_bcftools(*args, **kwargs)[source]

Here we use the bcftools tool from samtools.

bcftools documentation

Convert BCF format to WIGGLE format

class BCF2WIGGLE(infile, outfile)[source]

Convert sorted BCF file into WIGGLE file

Methods available are based on wiggletools [WIGGLETOOLS].

Parameters:
  • infile (str) -- The path to the input BCF file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'wiggletools'

Default value

_method_wiggletools(*args, **kwargs)[source]

Conversion using wiggletools

wiggletools documentation

Convert BED format to WIGGLE format

class BED2WIGGLE(infile, outfile)[source]

Convert sorted BED file into WIGGLE file

Methods available are based on wiggletools [WIGGLETOOLS].

Parameters:
  • infile (str) -- The path to the input BED file. It must be sorted.

  • outfile (str) -- The path to the output file

_method_wiggletools(*args, **kwargs)[source]

Convert BED to WIGGLE using wiggletools

wiggletools documentation

Convert BEDGRAPH file to COV format

class BEDGRAPH2COV(infile, outfile)[source]

Converts a BEDGRAPH (4 cols) to COV format (3 cols)

Input example:

chr19   4930200    4930205    -1
chr19   4930205    4930210    1

becomes:

chr19   4930201    -1
chr19   4930202    -1
chr19   4930203    -1
chr19   4930204    -1
chr19   4930205    -1
chr19   4930206    1
chr19   4930207    1
chr19   4930208    1
chr19   4930209    1
chr19   4930210    1

Method available is a Bioconvert implementation (Python).
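
The expansion from intervals to one row per position can be sketched as follows. This is not the Bioconvert implementation; the function name bedgraph_to_cov is hypothetical, and the sketch assumes that an interval with 0-based start and exclusive end maps to the 1-based positions start+1 to end, which is consistent with the output listed above.

```python
def bedgraph_to_cov(lines):
    """Expand 'chrom start end value' lines into (chrom, position, value) rows."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, value = line.split()
        # [start, end) in 0-based coordinates -> start+1 .. end in 1-based ones
        for position in range(int(start) + 1, int(end) + 1):
            rows.append((chrom, position, value))
    return rows


bedgraph = [
    "chr19   4930200    4930205    -1",
    "chr19   4930205    4930210    1",
]
for chrom, position, value in bedgraph_to_cov(bedgraph):
    print(chrom, position, value, sep="\t")
```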

constructor

Parameters:
_default_method = 'python'

Default value

_method_python(*args, **kwargs)[source]

Convert bedgraph file in coverage. Internal method.

Convert BEDGRAPH to BIGWIG format

class BEDGRAPH2BIGWIG(infile, outfile)[source]

Converts BEDGRAPH format to BIGWIG format

Conversion is based on the bedGraphToBigWig tool. Note that the --chrom-sizes argument is required.

constructor

Parameters:
_default_method = 'ucsc'

Default value

_method_ucsc(*args, **kwargs)[source]

Convert a bedgraph file to bigwig format using the ucsc tool.

bigWig documentation

chromosome size

Convert BEDGRAPH format to WIGGLE format

class BEDGRAPH2WIGGLE(infile, outfile)[source]

Convert sorted BEDGRAPH file into WIGGLE file

Methods available are based on wiggletools [WIGGLETOOLS].

Parameters:
  • infile (str) -- The path to the input BEDGRAPH file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'wiggletools'

Default value

_method_wiggletools(*args, **kwargs)[source]

wiggletools based method. Extension must be .bg.

wiggletools documentation

Convert BIGBED format to WIGGLE format

class BIGBED2WIGGLE(infile, outfile)[source]

Convert sorted BIGBED file into WIGGLE file

Methods available are based on wiggletools [WIGGLETOOLS].

Parameters:
  • infile (str) -- The path to the input BIGBED file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'wiggletools'

Default value

_method_wiggletools(*args, **kwargs)[source]

Conversion using wiggletools

wiggletools documentation

Convert BIGBED format to BED format

class BIGBED2BED(infile, outfile)[source]

Converts a sequence alignment in BIGBED format to BED4 format

Methods available are based on pybigwig [DEEPTOOLS].

constructor

Parameters:
  • infile (str) -- input BIGBED file.

  • outfile (str) -- (optional) output BED4 file

_default_method = 'pybigwig'

Default value

_method_pybigwig(*args, **kwargs)[source]

In this method we use the python extension written in C, pyBigWig.

pyBigWig documentation

Convert BIGWIG to BEDGRAPH format

class BIGWIG2BEDGRAPH(infile, outfile)[source]

Converts a sequence alignment in BIGWIG format to BEDGRAPH format

Conversion is based on ucsc bigWigToBedGraph tool or pybigwig (default) [DEEPTOOLS].

constructor

Parameters:
_default_method = 'pybigwig'

Default value

_method_pybigwig(*args, **kwargs)[source]

In this method we use the python extension written in C, pyBigWig.

pyBigWig documentation

_method_ucsc(*args, **kwargs)[source]

Convert a bigwig file to bedgraph format using the ucsc tool.

ucsc.bedgraph documentation

Convert BIGWIG format to WIGGLE format

class BIGWIG2WIGGLE(infile, outfile)[source]

Convert sorted BIGWIG file into WIGGLE file

Methods available are based on wiggletools [WIGGLETOOLS].

Parameters:
  • infile (str) -- The path to the input BIGWIG file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'wiggletools'

Default value

_method_wiggletools(*args, **kwargs)[source]

Conversion using wiggletools

wiggletools documentation

Convert BPLINK to PLINK format

Converts a genotype dataset bed+bim+fam in BPLINK format to ped+map PLINK format.

Conversion is based on plink [PLINK] executable.

Warning

plink takes several inputs and outputs and does not need extensions. What is required is a prefix. Bioconvert usage is therefore:

bioconvert bplink2plink plink_toy

Since there is no extension, you must be explicit by providing the conversion name (bplink2plink). This command will search for 3 input files plink_toy.bed, plink_toy.bim and plink_toy.fam. It will then create two output files named plink_toy.ped and plink_toy.map
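
The relation between the prefix and the file names can be made explicit with a short sketch. The helper names bplink_inputs, plink_outputs and missing_inputs are hypothetical and are not part of Bioconvert or plink; only the extensions come from the description above.

```python
from pathlib import Path


def bplink_inputs(prefix):
    """Input files expected for a given prefix (bed + bim + fam)."""
    return [Path(f"{prefix}{ext}") for ext in (".bed", ".bim", ".fam")]


def plink_outputs(prefix):
    """Output files created for a given prefix (ped + map)."""
    return [Path(f"{prefix}{ext}") for ext in (".ped", ".map")]


def missing_inputs(prefix):
    """List the expected input files that do not exist on disk."""
    return [path for path in bplink_inputs(prefix) if not path.exists()]


print([path.name for path in bplink_inputs("plink_toy")])
print([path.name for path in plink_outputs("plink_toy")])
```

Checking missing_inputs before launching the conversion gives a clearer error than letting the external tool fail on an absent file.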

constructor

Parameters:
  • infile (str) -- input BPLINK files.

  • outfile (str) -- (optional) output PLINK files.

_default_method = 'plink'

Default value

Convert plink file in text using plink executable.

plink documentation

Convert BZ2 to GZ format

class BZ22GZ(infile, outfile, *args, **kargs)[source]

Convert BZ2 file to GZ file

Methods based on bunzip2 or zlib/bz2 Python libraries.

constructor

Parameters:
  • infile (str) -- input BZ2 file

  • outfile (str) -- output GZ filename

_default_method = 'bz2_gz'

Default value

_method_bz2_gz(*args, **kwargs)[source]

Method that uses bunzip2 and gzip.

bunzip2 documentation gzip documentation

_method_python(*args, **kargs)[source]

Internal method
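
A BZ2 to GZ conversion relying only on the Python standard library might look like the sketch below. It is not the code of the internal method; the function name bz2_to_gz is hypothetical, and the file names in the usage lines are placeholders.

```python
import bz2
import gzip
import shutil


def bz2_to_gz(infile, outfile):
    """Decompress a BZ2 file and recompress its content as GZ.

    Data are streamed in chunks, so the whole file is never held in memory.
    """
    with bz2.open(infile, "rb") as source, gzip.open(outfile, "wb") as target:
        shutil.copyfileobj(source, target)


if __name__ == "__main__":
    # Create a small BZ2 file, then convert it
    with bz2.open("example.fastq.bz2", "wb") as handle:
        handle.write(b"@read1\nACGT\n+\nIIII\n")
    bz2_to_gz("example.fastq.bz2", "example.fastq.gz")
```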

Convert CLUSTAL to FASTA format

class CLUSTAL2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from CLUSTAL to FASTA format.

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert CLUSTAL interleaved file to FASTA format using biopython.

Bio.SeqIO Documentation

_method_goalign(*args, **kwargs)[source]

Convert CLUSTAL file to FASTA format using goalign.

goalign documentation

_method_squizz(*args, **kwargs)[source]

Convert CLUSTAL file to FASTA format using the squizz tool.

Convert CLUSTAL to PHYLIP format

class CLUSTAL2PHYLIP(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from CLUSTAL format to PHYLIP format.

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert CLUSTAL interleaved file to PHYLIP format using biopython.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Convert CLUSTAL interleaved file to PHYLIP format using the squizz tool.

Convert CLUSTAL to STOCKHOLM format

class CLUSTAL2STOCKHOLM(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from CLUSTAL format to STOCKHOLM format.

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert CLUSTAL interleaved file to STOCKHOLM format using biopython.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Convert CLUSTAL file to STOCKHOLM format using the squizz tool.

Convert CRAM file to BAM format

class CRAM2BAM(infile, outfile, *args, **kargs)[source]

Convert CRAM file to BAM file

The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.

Methods available are based on samtools [SAMTOOLS].

constructor

Parameters:
  • infile (str) -- input CRAM file

  • outfile (str) -- output BAM filename

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Here we use the SAMtools tool.

SAMtools documentation

Convert CRAM file to FASTA format

class CRAM2FASTA(infile, outfile, *args, **kargs)[source]

Convert CRAM file to FASTA file

Methods available are based on samtools [SAMTOOLS].

constructor

Parameters:
  • infile (str) -- input CRAM file

  • outfile (str) -- output FASTA filename

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Do the conversion CRAM -> FASTA using samtools

SAMtools documentation

Note

fasta are on one line

Convert CRAM file to FASTQ format

class CRAM2FASTQ(infile, outfile, *args, **kargs)[source]

Convert CRAM file to FASTQ file

Methods available are based on samtools [SAMTOOLS].

constructor

Parameters:
  • infile (str) -- input CRAM file

  • outfile (str) -- output FASTQ filename

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Do the conversion CRAM -> FASTQ using samtools

SAMtools documentation

Convert CRAM file to SAM format

class CRAM2SAM(infile, outfile, *args, **kargs)[source]

Convert CRAM file to SAM file

The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.

Methods available are based on samtools [SAMTOOLS].

constructor

Parameters:
  • infile (str) -- input CRAM file

  • outfile (str) -- output SAM filename

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Here we use the SAMtools tool.

SAMtools documentation

Convert CSV format to TSV format

class CSV2TSV(infile, outfile)[source]

Convert CSV file into TSV file

Available methods: Python, Pandas

Methods available are based on python or Pandas [PANDAS].

See also

TSV2CSV

Constructor

Parameters:
  • infile (str) -- comma-separated file

  • outfile (str) -- tabulated file

_default_method = 'python'

Default value

_method_pandas(in_sep=',', out_sep='\t', line_terminator='\n', *args, **kwargs)[source]

Do the conversion CSV -> TSV using Pandas library

pandas documentation

_method_python(in_sep=',', out_sep='\t', line_terminator='\n', *args, **kwargs)[source]

Do the conversion CSV -> TSV using standard Python modules.

csv documentation
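
A conversion based on the standard csv module can be sketched as follows. The function name csv_to_tsv is hypothetical; its keyword arguments mirror the in_sep, out_sep and line_terminator parameters of the signatures above, but the body is an illustration and not the Bioconvert code.

```python
import csv
import io


def csv_to_tsv(source, target, in_sep=",", out_sep="\t", line_terminator="\n"):
    """Copy rows from a delimited text stream to another, changing the separator."""
    reader = csv.reader(source, delimiter=in_sep)
    writer = csv.writer(target, delimiter=out_sep, lineterminator=line_terminator)
    writer.writerows(reader)


# The quoted field contains a comma, which must not be split
source = io.StringIO('gene,description\nBRCA1,"DNA repair, tumour suppressor"\n')
target = io.StringIO()
csv_to_tsv(source, target)
print(target.getvalue(), end="")
```

Using csv.reader rather than str.split means that quoted fields containing the input separator are kept intact.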

_method_python_v2(in_sep=',', out_sep='\t', line_terminator='\n', *args, **kwargs)[source]

Do the conversion CSV -> TSV using csv module.

Note

This method can neither escape nor quote output characters.

csv documentation

Convert CSV to XLS format

class CSV2XLS(infile, outfile, *args, **kargs)[source]

Convert CSV file to XLS file

Methods available are based on python, pyexcel [PYEXCEL], or pandas [PANDAS].

constructor

Parameters:
  • infile (str) -- input CSV file

  • outfile (str) -- output XLS filename

_default_method = 'pandas'

Default value

_method_pandas(in_sep=',', sheet_name='Sheet 1', *args, **kwargs)[source]

Do the conversion CSV -> XLS using the Pandas module.

pandas documentation

_method_pyexcel(in_sep=',', sheet_name='Sheet 1', *args, **kwargs)[source]

Do the conversion CSV -> XLS using pyexcel modules.

pyexcel documentation

Convert a compressed FASTQ from DSRC to GZ format

class DSRC2GZ(infile, outfile, *args, **kargs)[source]

Convert a compressed FASTQ from DSRC to GZ format

Methods available are based on dsrc [DSRC] and pigz [PIGZ].

constructor

Parameters:
  • infile (str) -- input DSRC filename

  • outfile (str) -- output GZ filename

_default_method = 'dsrcpigz'

Default value

_method_dsrcpigz(*args, **kwargs)[source]

Do the conversion dsrc -> GZ. Method that uses pigz and dsrc.

pigz documentation dsrc documentation

The threading option does not work with the dsrc version from conda, so we do not add the -t threads option.

Convert EMBL file to FASTA format

class EMBL2FASTA(infile, outfile, *args, **kargs)[source]

Convert EMBL file to FASTA file

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON].

constructor

Parameters:
  • infile (str) -- input EMBL file

  • outfile (str) -- output FASTA filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Header is less informative than the one obtained with biopython

Convert EMBL file to GENBANK format

class EMBL2GENBANK(infile, outfile, *args, **kargs)[source]

Convert EMBL file to GENBANK file

Methods available are based on squizz [SQUIZZ] and biopython [BIOPYTHON].

constructor

Parameters:
  • infile (str) -- input EMBL file

  • outfile (str) -- output GENBANK filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Header is less informative than the one obtained with biopython.

Convert FASTA and QUAL formats to FASTQ format

class FASTA_QUAL2FASTQ(infile, outfile)[source]

Convert FASTA and QUAL back into a FASTQ file

Method based on pysam [PYSAM].

Parameters:
  • infile (list) -- The path to the input FASTA file, the path to the input QUAL file

  • outfile (str) -- The path to the output FASTQ file

_default_method = 'pysam'

Default value

_method_pysam(*args, **kwargs)[source]

This method uses the FastxFile function of the Pysam python module.

FastxFile documentation
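
Merging a sequence and its quality scores into one FASTQ record can be sketched without any dependency. This is not the pysam-based method: the function name fasta_qual_to_fastq is hypothetical, and the Phred+33 (Sanger) encoding of qualities is an assumption.

```python
def fasta_qual_to_fastq(name, sequence, qualities, offset=33):
    """Build a 4-line FASTQ record from a sequence and its integer qualities.

    offset=33 corresponds to the Phred+33 encoding (assumed here).
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lengths differ")
    encoded = "".join(chr(quality + offset) for quality in qualities)
    return f"@{name}\n{sequence}\n+\n{encoded}\n"


# Qualities as they would appear, space-separated, in a QUAL file
qualities = [int(value) for value in "40 30 20 10".split()]
print(fasta_qual_to_fastq("read1", "ACGT", qualities), end="")
```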

Convert FASTA to CLUSTAL format

class FASTA2CLUSTAL(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from FASTA to CLUSTAL format

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert FASTA interleaved file to CLUSTAL format using biopython.

Bio.SeqIO Documentation

_method_goalign(*args, **kwargs)[source]

Convert FASTA file to CLUSTAL format using the goalign tool.

goalign documentation

_method_squizz(*args, **kwargs)[source]

Convert FASTA file to CLUSTAL format using the squizz tool.

Convert FASTA format to FAA format

class FASTA2FAA(infile, outfile)[source]

The method available is a Bioconvert implementation.

Parameters:
  • infile (str) -- The path to the input FASTA file

  • outfile (str) -- The path to the output FAA file

_default_method = 'bioconvert'

Default value

_method_bioconvert(*args, **kwargs)[source]

Internal method.

Convert FASTA format to FASTQ format

class FASTA2FASTQ(infile, outfile)[source]

Methods available are based on pysam [PYSAM].

Parameters:
  • infile (str) -- The path to the input FASTA file

  • outfile (str) -- The path to the output FASTQ file

_default_method = 'pysam'

Default value

_method_pysam(quality_file=None, *args, **kwargs)[source]

This method uses the FastxFile function of the Pysam python module.

FastxFile documentation

Convert FASTA to GENBANK format

class FASTA2GENBANK(infile, outfile, *args, **kargs)[source]

Convert FASTA file to GENBANK file

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON] or Bioconvert pure implementation (default).

constructor

Parameters:
  • infile (str) -- input FASTA file

  • outfile (str) -- output GENBANK filename

_default_method = 'bioconvert'

Default value

_method_bioconvert(*args, **kwargs)[source]

Internal method

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Header is less informative than the one obtained with biopython

Convert FASTA to NEXUS format

class FASTA2NEXUS(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment in FASTA format to NEXUS format

Methods available are based on goalign [GOALIGN].

constructor

Parameters:
  • infile (str) -- input FASTA file.

  • outfile (str) -- (optional) output NEXUS file

_default_method = 'goalign'

Default value

_method_goalign(*args, **kwargs)[source]

Convert a FASTA file to NEXUS format using the goalign tool.

goalign documentation

The FASTA file must be an alignment file, which means that all sequences must have the same length (gaps included); otherwise an error is raised.
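
Whether a FASTA file satisfies this requirement can be checked beforehand with a short script. The sketch below is independent of goalign; the function names read_fasta and is_alignment are hypothetical, and records sharing the same name would overwrite each other in this simplified reader.

```python
def read_fasta(lines):
    """Return a {name: sequence} dict from the lines of a FASTA file."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def is_alignment(records):
    """True when all sequences (gaps included) have the same length."""
    return len({len(sequence) for sequence in records.values()}) <= 1


aligned = read_fasta([">a", "AC-T", ">b", "ACGT"])
unaligned = read_fasta([">a", "ACT", ">b", "ACGT"])
print(is_alignment(aligned), is_alignment(unaligned))
```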

Convert FASTA to PHYLIP format

class FASTA2PHYLIP(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment in FASTA format to PHYLIP format

Conversion is based on Bio Python modules

Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON] or goalign [GOALIGN]. Squizz is the default (https://github.com/bioconvert/bioconvert/issues/149). The PHYLIP file created is a strict PHYLIP file, that is, with 10 characters in the first column.
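
The 10-character first column of strict PHYLIP can be illustrated with a small writer. This is a sketch under assumptions, not the output routine of squizz, biopython or goalign: the function name to_strict_phylip is hypothetical, and it writes the sequential (non-interleaved) layout for brevity.

```python
def to_strict_phylip(records):
    """Format (name, aligned_sequence) pairs as sequential strict PHYLIP.

    Names are truncated or padded with spaces to exactly 10 characters.
    """
    lengths = {len(sequence) for _, sequence in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # Header line: number of sequences, then alignment length
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, sequence in records:
        lines.append(f"{name[:10]:<10}{sequence}")
    return "\n".join(lines) + "\n"


print(to_strict_phylip([("seq1", "AC-T"), ("a_very_long_name", "ACGT")]), end="")
```

Because names are cut at 10 characters, two long names sharing the same prefix would become indistinguishable; renaming sequences before conversion avoids this.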

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_goalign(*args, **kwargs)[source]

Convert a FASTA file to PHYLIP interleaved format using the goalign tool.

goalign documentation

The FASTA file must be an alignment file, which means that all sequences must have the same length (gaps included); otherwise an error is raised.

_method_squizz(*args, **kwargs)[source]

Convert a FASTA file to PHYLIP interleaved format using the squizz tool. The FASTA file must be an alignment file, which means that all sequences must have the same length (gaps included); otherwise an error is raised.

Convert FASTA to TWOBIT format

class FASTA2TWOBIT(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment in FASTA format to TWOBIT format

Methods available are based on UCSC faToTwoBit [UCSC].

constructor

Parameters:
_default_method = 'ucsc'

default value

_method_ucsc(*args, **kwargs)[source]

Convert a FASTA file to TWOBIT format using ucsc faToTwoBit.

ucsc faToTwoBit Documentation

Convert FASTQ to FASTA format

class FASTQ2FASTA(infile, outfile)[source]

Convert FASTQ to FASTA

This converter has many methods. Some of them have been removed or commented out over time. BioPython, for instance, is commented out due to poor performance compared to the others. This does not mean that it should not be considered: its performance is lower because of the many sanity checks it performs.

Similarly, the bioawk and python_external methods are commented out because they are redundant with other equivalent methods.

Parameters:
  • infile (str) -- The path to the input FASTQ file.

  • outfile (str) -- The path to the output file.

_default_method = 'bioconvert'

default value

_method_awk(*args, **kwargs)[source]

Here we are using the awk method.

Note

Another method with awk has been tested but is less efficient. Here is the alternative that was tested:

box.awkcmd = """awk '{{if(NR%4==1) {{printf(">%s\n",substr($0,2));}} else if(NR%4==2) print;}}' """

awk documentation

_method_bioconvert(*args, **kwargs)[source]

Bioconvert implementation in pure Python. This is the default method because it is the fastest.
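
A pure-Python FASTQ to FASTA conversion can be sketched as below. This is not the Bioconvert implementation; the function name fastq_to_fasta is hypothetical, and the sketch assumes that every record spans exactly four lines, the same assumption made by the awk command shown above.

```python
import io


def fastq_to_fasta(source, target):
    """Write the header and sequence of each 4-line FASTQ record as FASTA."""
    for index, line in enumerate(source):
        position = index % 4
        if position == 0:
            # Header line: replace the leading '@' with '>'
            target.write(">" + line[1:])
        elif position == 1:
            # Sequence line: copied unchanged
            target.write(line)
        # The '+' separator and the quality line are skipped


fastq = io.StringIO("@r1 comment\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
```

Relying on line positions rather than on the first character matters because a quality line may itself start with '@'.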

_method_mappy(*args, **kwargs)[source]

This method provides a fast and accurate C program to align genomic sequences and transcribe nucleotides.

mappy method

_method_mawk(*args, **kwargs)[source]

This variant of the awk method uses mawk, a lighter and faster implementation of awk.

Note

Other methods with mawk have been tested but are less efficient. Here are the alternatives that were tested:

mawkcmd_v2 = """mawk '{{if(NR%4==1) {{printf(">%s\n",substr($0,2));}} else if(NR%4==2) print;}}' """
mawkcmd_v3 = """mawk '(++n<=0){next}(n!=1){print;n=-2;next}{print">"substr($0,2)}'"""

mawk documentation

_method_perl(*args, **kwargs)[source]

This method uses the perl command which will call the "fastq2fasta.pl" script.

Perl documentation

_method_readfq(*args, **kwargs)[source]

This method is inspired by Readfq coded by Heng Li.

original Readfq method

_method_sed(*args, **kwargs)[source]

This method uses the UNIX function sed which is a non-interactive editor.

Note

Another method with sed has been tested but is less efficient. Here is the alternative that was tested:

cmd = """sed -n 's/^@/>/p;n;p;n;n'"""

sed documentation

_method_seqkit(*args, **kwargs)[source]

We use the Seqkit library.

Documentation of the Seqkit method

_method_seqtk(*args, **kwargs)[source]

We use the Seqtk library.

Documentation of the Seqtk method

static just_name(record)[source]

This method takes a Biopython sequence record (record) and returns its name. The comment part is not included.

static unwrap_fasta(infile, outfile, strip_comment=False)[source]

This method reads fasta sequences from infile and writes them unwrapped in outfile. Used in the test suite.

Parameters:
  • infile (str) -- The path to the input FASTA file.

  • outfile (str) -- The path to the output file.
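To make the notion of "unwrapped" FASTA concrete, the following sketch joins the lines of each sequence into a single line. It is only an illustration working on an iterable of lines, not the Bioconvert code; the helper name unwrap_fasta_lines is hypothetical, and the behaviour attributed to strip_comment (keeping only the text before the first space of the header) is an assumption.

```python
import io


def unwrap_fasta_lines(lines, strip_comment=False):
    """Yield FASTA lines with each sequence joined on a single line."""
    sequence_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if sequence_parts:
                # Flush the sequence of the previous record.
                yield "".join(sequence_parts)
                sequence_parts = []
            # Assumption: stripping the comment keeps the name only.
            yield line.split(" ", 1)[0] if strip_comment else line
        elif line:
            sequence_parts.append(line)
    if sequence_parts:
        yield "".join(sequence_parts)


wrapped = io.StringIO(">seq1 some comment\nACGT\nACGT\n>seq2\nTT\n")
unwrapped = list(unwrap_fasta_lines(wrapped, strip_comment=True))
print(unwrapped)
```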

Convert GENBANK to EMBL format

class GENBANK2EMBL(infile, outfile, *args, **kargs)[source]

Convert GENBANK file to EMBL file

Some description.

constructor

Parameters:
  • infile (str) -- input GENBANK file

  • outfile (str) -- output EMBL filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Header is less informative than the one obtained with biopython

Convert GENBANK to FASTA format

class GENBANK2FASTA(infile, outfile, *args, **kargs)[source]

Convert GENBANK file to FASTA file

Methods are based on biopython [BIOPYTHON], squizz [SQUIZZ] and our own Bioconvert implementation.

constructor

Parameters:
  • infile (str) -- input GENBANK file

  • outfile (str) -- output FASTA filename

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_python(*args, **kwargs)[source]

Internal method.

_method_squizz(*args, **kwargs)[source]

Header is less informative than the one obtained with biopython

Convert GENBANK to GFF3 format

class GENBANK2GFF3(infile, outfile, *args, **kargs)[source]

Convert GENBANK file to GFF3 file

Method based on biocode.

constructor

Parameters:
  • infile (str) -- input GENBANK file

  • outfile (str) -- output GFF3 filename

_default_method = 'biocode'

Default value

_method_biocode(*args, **kwargs)[source]

Uses scripts from biocode copied and modified in bioconvert.utils.biocode

Please see Main entry

Convert GFF2 to GFF3 format

class GFF22GFF3(infile, outfile, *args, **kargs)[source]

Convert GFF2 to GFF3

constructor

Parameters:
  • infile (str) --

  • outfile (str) --

Method available is pure Python.

_default_method = 'bioconvert'

Default value

_method_bioconvert(*args, **kwargs)[source]

This method is a basic mapping of the 9th column of gff2 to gff3. Other methods with smart translations must be created for specific usages. There is no good solution for this translation.
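A basic mapping of the 9th column could look like the sketch below. It is an illustration only, assuming that GFF2 attributes are written as key "value" pairs separated by semicolons and that GFF3 expects key=value pairs; the function names are hypothetical and the code is not taken from Bioconvert. Values that themselves contain a semicolon are not handled.

```python
def gff2_attributes_to_gff3(attributes):
    """Map 'key "value"; key2 "value2"' to 'key=value;key2=value2'."""
    pairs = []
    for field in attributes.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(" ")
        value = value.strip().strip('"')
        pairs.append(key + "=" + value)
    return ";".join(pairs)


def gff2_line_to_gff3(line):
    """Rewrite only the 9th column of a tab-separated GFF line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) == 9:
        columns[8] = gff2_attributes_to_gff3(columns[8])
    return "\t".join(columns)


gff2 = 'chr1\tsrc\tgene\t10\t90\t.\t+\t.\tID "gene1"; Note "kinase"'
print(gff2_line_to_gff3(gff2))
```

Such a mechanical translation ignores the semantic differences between the two versions (for instance parent and child relationships), which is why smarter, use-case specific methods are needed.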

Convert GFF3 to GFF2 format

class GFF32GFF2(infile, outfile, *args, **kargs)[source]

Convert GFF3 to GFF2

Method available is Python-based.

constructor

Parameters:
  • infile (str) -- input GFF3 filename

  • outfile (str) -- output GFF2 filename

_default_method = 'bioconvert'

Default value

_method_bioconvert(*args, **kwargs)[source]

This method is a basic mapping of the 9th column of gff3 to gff2. Other methods with smart translations must be created for specific usages. There is no good solution for this translation.

Convert GFA to FASTA format

class GFA2FASTA(infile, outfile)[source]

Convert sorted GFA file into FASTA file

Available methods are based on awk or python (default)

(Source code)

Reference:

https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md

See also

bioconvert.simulator.gfa

Parameters:
  • infile (str) -- The path to the input GFA file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'python'

Default value

_method_awk(*args, **kwargs)[source]

For this method, we use the awk tools.

awk documentation

Returns:

the standard output

Return type:

io.StringIO object.

Note

This method folds the sequence to 80 characters per line.

_method_python(*args, **kwargs)[source]

Internal method
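For readers unfamiliar with GFA, the conversion amounts to extracting the segment (S) records. The sketch below is an illustration under the assumption that segment lines follow the GFA layout "S, name, sequence" separated by tabs; the function name gfa_to_fasta and the width parameter are hypothetical and this is not the Bioconvert code.

```python
import io


def gfa_to_fasta(gfa_handle, fasta_handle, width=80):
    """Write each GFA segment (S line) as a FASTA record folded to `width`."""
    for line in gfa_handle:
        fields = line.rstrip("\n").split("\t")
        # Segment records: S <name> <sequence> [optional tags]
        if fields[0] == "S" and len(fields) >= 3:
            name, sequence = fields[1], fields[2]
            fasta_handle.write(">" + name + "\n")
            for start in range(0, len(sequence), width):
                fasta_handle.write(sequence[start:start + width] + "\n")


gfa = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\tctg1\tACGTACGTAC\n"
    "L\tctg1\t+\tctg2\t-\t4M\n"
    "S\tctg2\tTTGA\n"
)
fasta = io.StringIO()
gfa_to_fasta(gfa, fasta, width=4)
print(fasta.getvalue())
```

Header (H) and link (L) records carry no sequence and are simply ignored here.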

Convert GZ file to BZ2 format

class GZ2BZ2(infile, outfile, *args, **kargs)[source]

Convert GZ file to BZ2 file

Unzip input file using pigz or gunzip and compress using pbzip2. Default is pigz/pbzip2.

constructor

Parameters:
  • infile (str) -- input GZ file

  • outfile (str) -- output BZ2 filename

_default_method = 'pigz_pbzip2'

Default value

_method_gunzip_bzip2(*args, **kwargs)[source]

Single-threaded conversion. Method that uses gunzip and bzip2.

gunzip documentation bzip2 documentation

_method_pigz_pbzip2(*args, **kwargs)[source]

Method that uses pigz pbzip2.

pigz documentation pbzip2 documentation

_method_python(*args, **kwargs)[source]

Internal method
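A pure-Python GZ to BZ2 conversion can be written with the gzip and bz2 standard modules, as sketched below. This is an illustration, not necessarily what the internal method does; the function name gz_to_bz2 and the chunk_size parameter are hypothetical.

```python
import bz2
import gzip
import os
import shutil
import tempfile


def gz_to_bz2(infile, outfile, chunk_size=1024 * 1024):
    """Decompress a gzip file and recompress it as bzip2, chunk by chunk."""
    with gzip.open(infile, "rb") as source, bz2.open(outfile, "wb") as target:
        shutil.copyfileobj(source, target, chunk_size)


# Round trip on a temporary file.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "example.txt.gz")
bz2_path = os.path.join(workdir, "example.txt.bz2")
with gzip.open(gz_path, "wb") as handle:
    handle.write(b"ACGT\n" * 1000)
gz_to_bz2(gz_path, bz2_path)
with bz2.open(bz2_path, "rb") as handle:
    restored = handle.read()
print(restored == b"ACGT\n" * 1000)
```

Streaming by chunks keeps memory usage constant, at the price of being single-threaded, unlike pigz and pbzip2.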

Convert GZ to DSRC format

class GZ2DSRC(infile, outfile, *args, **kargs)[source]

Convert compressed fastq.gz file into DSRC compressed file

(Source code)

constructor

Parameters:
  • infile (str) -- input GZ filename

  • outfile (str) -- output DSRC filename

_default_method = 'pigzdsrc'

Default value

_method_pigzdsrc(*args, **kwargs)[source]

Do the conversion GZ -> DSRC.

Returns:

the standard output

Return type:

io.StringIO object.

Method that uses pigz and dsrc.

pigz documentation dsrc documentation

Convert JSON to YAML format

class JSON2YAML(infile, outfile, *args, **kargs)[source]

Convert JSON file into YAML file

Conversion is based on the yaml and json Python modules. Indentation is set to 4 by default and affects the sections (not the lists). For example:

fruits_list:
- apple
- orange
section1:
    do: true
    misc: 1

constructor

Parameters:
  • infile (str) -- input JSON file

  • outfile (str) -- output YAML file.

_default_method = 'yaml'

Default value

_method_yaml(*args, **kwargs)[source]

Internal method

Convert MAF file to SAM format

class MAF2SAM(infile, outfile)[source]

This is the Multiple alignment format or MIRA assembly format

This is not Mutation Annotation Format (somatic)

pbsim creates this kind of data

Some references:

Those two codes were in Py2 at the time of this implementation. We re-used some of the information from maf-convert but the code in bioconvert.io.maf can be considered original.

constructor

Parameters:
  • infile (str) -- the path of the input file.

  • outfile (str) -- the path of the output file

_default_method = 'python'

Default value

_method_python(*args, **kwargs)[source]

Internal module

MAF documentation

Converts NEWICK file to NEXUS format.

class NEWICK2NEXUS(infile, outfile=None, *args, **kwargs)[source]

Converts a tree file from NEWICK format to NEXUS format.

Methods available are based on gotree [GOTREE].

constructor

Parameters:
_default_method = 'gotree'

Default value

_method_gotree(*args, **kwargs)[source]

Convert NEWICK file in NEXUS format using gotree tool.

gotree documentation

Converts NEWICK file to PHYLOXML format.

class NEWICK2PHYLOXML(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a tree file from NEWICK format to PHYLOXML format.

Methods available are based on gotree [GOTREE].

constructor

Parameters:
_default_method = 'gotree'

Default value

_method_gotree(*args, **kwargs)[source]

Convert NEWICK file in PHYLOXML format using gotree tool.

gotree documentation

Convert NEXUS to FASTA format

class NEXUS2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from NEXUS format to FASTA format.

constructor

Parameters:
  • infile (str) -- input NEXUS file.

  • outfile (str) -- (optional) output FASTA file

_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert NEXUS interleaved or sequential file to FASTA format using biopython.

The FASTA output file will be an aligned FASTA file.

Bio.AlignIO

For instance:

We have a NEXUS input file that looks like

#NEXUS
[TITLE: Test file]

begin data;
dimensions ntax=3 nchar=123;
format interleave datatype=DNA missing=N gap=-;

matrix
read3                -AT--------CCCGCTCGATGGGCCTCATTGCGTCCACTAGTTGATCTT
read2                -----------------------GGAAGCCCACGCCACGGTCTTGATACG
read4                ---------------------AGGGATGAACGATGCTCGCAGTTGATGCT

read3                CTGGAGTAT---T----TAGGAAAGCAAGTAAACTCCTTGTACAAATAAA
read2                AATTTTTCTAATGGCTATCCCTACATAACCTAACCGGGCATGTAATGTGT
read4                CAGAAGTGCCATTGCGGTAGAAACAAATGTTCCCAGATTGTTGACTGATA

read3                GATCTTA-----GATGGGCAT--
read2                CACCGTTGTTTCGACGTAAAGAG
read4                AGTAGGACCTCAGTCGTGACT--
;

end;
begin assumptions;
options deftype=unord;
end;

the output file will look like

>read3
-AT--------CCCGCTCGATGGGCCTCATTGCGTCCACTAGTTGATCTTCTGGAGTAT-
--T----TAGGAAAGCAAGTAAACTCCTTGTACAAATAAAGATCTTA-----GATGGGCA
T--
>read2
-----------------------GGAAGCCCACGCCACGGTCTTGATACGAATTTTTCTA
ATGGCTATCCCTACATAACCTAACCGGGCATGTAATGTGTCACCGTTGTTTCGACGTAAA
GAG
>read4
---------------------AGGGATGAACGATGCTCGCAGTTGATGCTCAGAAGTGCC
ATTGCGGTAGAAACAAATGTTCCCAGATTGTTGACTGATAAGTAGGACCTCAGTCGTGAC
T--

and not

>read3
ATCCCGCTCGATGGGCCTCATTGCGTCCACTAGTTGATCTTCTGGAGTATTTAGGAAAGC
AAGTAAACTCCTTGTACAAATAAAGATCTTAGATGGGCAT
>read2
GGAAGCCCACGCCACGGTCTTGATACGAATTTTTCTAATGGCTATCCCTACATAACCTAA
CCGGGCATGTAATGTGTCACCGTTGTTTCGACGTAAAGAG
>read4
AGGGATGAACGATGCTCGCAGTTGATGCTCAGAAGTGCCATTGCGGTAGAAACAAATGTT
CCCAGATTGTTGACTGATAAGTAGGACCTCAGTCGTGACT

_method_goalign(*args, **kwargs)[source]

Convert NEXUS interleaved file in FASTA format using goalign tool.

goalign documentation

Warning

The sequential format is not supported.

_method_squizz(*args, **kwargs)[source]

Convert NEXUS sequential or interleaved file to FASTA format using the squizz tool.

command used:

squizz -c FASTA infile > outfile
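The difference between an aligned FASTA output (gaps kept) and a plain one, illustrated in the biopython method above, can be reproduced with a small pure-Python sketch that concatenates the interleaved blocks of the NEXUS matrix. It is an illustration only: the function name nexus_matrix_to_fasta is hypothetical, and the parser assumes that the matrix keyword stands alone on its line, that sequence names contain no spaces and that the matrix ends with a line starting with a semicolon.

```python
def nexus_matrix_to_fasta(nexus_text, width=60):
    """Concatenate interleaved NEXUS blocks and return aligned FASTA text."""
    sequences = {}  # insertion order is preserved (Python 3.7+)
    in_matrix = False
    for raw_line in nexus_text.splitlines():
        line = raw_line.strip()
        if not in_matrix:
            in_matrix = line.lower() == "matrix"
            continue
        if line.startswith(";"):
            break  # end of the matrix
        if not line:
            continue  # blank line between two interleaved blocks
        name, chunk = line.split(None, 1)
        # Gaps ('-') are kept on purpose so the output stays aligned.
        sequences[name] = sequences.get(name, "") + chunk.replace(" ", "")
    records = []
    for name, sequence in sequences.items():
        records.append(">" + name)
        records.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "\n".join(records) + "\n"


example = """#NEXUS
begin data;
dimensions ntax=2 nchar=8;
format interleave datatype=DNA missing=N gap=-;
matrix
read1   -AT-
read2   GG--

read1   CC-A
read2   --TT
;
end;
"""
fasta_text = nexus_matrix_to_fasta(example)
print(fasta_text)
```

Removing the gaps (for example with sequence.replace("-", "")) would give the unaligned variant, in which sequences no longer share the same length.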

Converts NEXUS file to NEWICK format.

class NEXUS2NEWICK(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a tree file from NEXUS format to NEWICK format.

Methods available are based on biopython [BIOPYTHON] or gotree [GOTREE].

constructor

Parameters:
_default_method = 'gotree'

Default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.Phylo.

Bio.Phylo Documentation

_method_gotree(*args, **kwargs)[source]

Convert NEXUS file in NEWICK format using gotree tool.

gotree documentation

Converts NEXUS file to PHYLIP format.

class NEXUS2PHYLIP(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from NEXUS format to PHYLIP format.

Methods available are based on goalign [GOALIGN].

constructor

Parameters:
_default_method = 'goalign'

Default value

_method_goalign(*args, **kwargs)[source]

Convert NEXUS interleaved file in PHYLIP format using goalign tool.

goalign documentation

Converts NEXUS file to PHYLOXML format.

class NEXUS2PHYLOXML(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a tree file from NEXUS format to PHYLOXML format.

Methods available are based on gotree [GOTREE].

constructor

Parameters:
_default_method = 'gotree'

Default value

_method_gotree(*args, **kwargs)[source]

uses gotree tool:

gotree documentation

Convert ODS format to CSV format

class ODS2CSV(infile, outfile)[source]

Convert ODS file into CSV file

Method based on pyexcel [PYEXCEL].

constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_default_method = 'pyexcel'

Default value

_method_pyexcel(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]

Do the conversion ODS -> CSV using the pyexcel library

pyexcel documentation

Converts PHYLIP file to CLUSTAL format.

class PHYLIP2CLUSTAL(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from PHYLIP format to CLUSTAL format

Methods available are based on biopython [BIOPYTHON] and squizz [SQUIZZ].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert PHYLIP interleaved file in CLUSTAL format using biopython.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Convert PHYLIP interleaved file in CLUSTAL format using squizz tool.

Converts PHYLIP file to FASTA format.

class PHYLIP2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment in PHYLIP format to FASTA format

Methods available are based on biopython [BIOPYTHON], goalign [GOALIGN] and squizz [SQUIZZ].

constructor

Parameters:
_default_method = 'biopython'

default value

_method_biopython(*args, **kwargs)[source]

For this method we use the biopython package Bio.SeqIO.

Bio.SeqIO Documentation

_method_goalign(*args, **kwargs)[source]

Convert PHYLIP interleaved file to FASTA format using the goalign tool.

goalign documentation

The FASTA file must be an alignment file, i.e. all the sequences must have the same length (gaps included); otherwise an error will be raised.

_method_squizz(*args, **kwargs)[source]

Convert PHYLIP interleaved file to FASTA format using the squizz tool. The FASTA file is an alignment, which means that the gaps are conserved.

Converts PHYLIP file to NEXUS format.

class PHYLIP2NEXUS(infile, outfile=None, *args, **kwargs)[source]

Converts a sequence alignment from PHYLIP format to NEXUS format.

Methods available are based on goalign [GOALIGN].

constructor

Parameters:
_default_method = 'goalign'

Default value

_method_goalign(*args, **kwargs)[source]

Convert PHYLIP interleaved file in NEXUS format using goalign tool.

goalign documentation

Converts PHYLIP file to STOCKHOLM format.

class PHYLIP2STOCKHOLM(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from PHYLIP interleaved to STOCKHOLM

Methods available are based on biopython [BIOPYTHON] and squizz [SQUIZZ].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert PHYLIP interleaved file in STOCKHOLM format using biopython.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Convert PHYLIP interleaved file in STOCKHOLM format using squizz tool.

Converts PHYLIP file to XMFA format.

class PHYLIP2XMFA(infile, outfile=None, *args, **kwargs)[source]

Converts a sequence alignment from PHYLIP format to XMFA

Methods available are based on biopython [BIOPYTHON].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert PHYLIP interleaved file to XMFA (Mauve) format.

Bio.AlignIO

Converts PHYLOXML file to NEWICK format.

class PHYLOXML2NEWICK(infile, outfile=None, *args, **kwargs)[source]

Converts a tree file from PHYLOXML format to NEWICK format.

Methods available are based on gotree [GOTREE].

constructor

Parameters:
_default_method = 'gotree'

Default value

_method_gotree(*args, **kwargs)[source]

Convert PHYLOXML file in NEWICK format using gotree tool.

gotree documentation

Converts PHYLOXML file to NEXUS format.

class PHYLOXML2NEXUS(infile, outfile=None, *args, **kwargs)[source]

Converts a tree file from PHYLOXML format to NEXUS format.

Methods available are based on gotree [GOTREE].

constructor

Parameters:
_default_method = 'gotree'

Default value

_method_gotree(*args, **kwargs)[source]

Convert PHYLOXML file in NEXUS format using gotree tool.

gotree documentation

Convert PLINK to BPLINK

Converts a genotype dataset ped+map in PLINK format to bed+bim+fam BPLINK format

Conversion is based on plink executable

constructor

Parameters:
_default_method = 'plink'

Default value

Convert plink file in text using plink executable.

plink documentation

Convert SAM file to BAM format

class SAM2BAM(infile, outfile, *args, **kargs)[source]

Convert SAM file to BAM file

constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Do the conversion SAM -> BAM using samtools

SAMtools documentation

Convert SAM file to CRAM format

class SAM2CRAM(infile, outfile, reference=None, *args, **kargs)[source]

Convert SAM file to CRAM file

The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.

Methods available are based on samtools [SAMTOOLS].

constructor

Parameters:
  • infile (str) -- input SAM file

  • outfile (str) -- output CRAM filename

_default_method = 'samtools'

Default value

_method_samtools(*args, **kwargs)[source]

Here we use the SAMtools tool.

SAMtools documentation

Convert SAM to PAF format

class SAM2PAF(infile, outfile, *args, **kargs)[source]

Convert SAM file to PAF file

The SAM and PAF formats are described in the Formats section.

Description:

The header lines of the SAM file (lines starting with @) are dropped. However, the length of the target is retrieved from the @SQ line, which must be present.

Consider this SAM file with only two alignments. The first is aligned on the target while the second is not (indicated by the * characters):

@SQ     SN:ENA|K01711|K01711.1  LN:15894
@PG     ID:minimap2     PN:minimap2     VN:2.5-r572     CL:minimap2 -a measles.fa Hm2_GTGAAA_L005_R1_001.fastq.gz
HISEQ:426:C5T65ACXX:5:2302:1943:2127    0       ENA|K01711|K01711.1     448     60      101M    *       0       0       CTTACCTTCGCATCAAGAGGTACCAACATGGAGGATGAGGCGGACCAATACTTTTCACATGATGATCCAATTAGTAGTGATCAATCCAGGTTCGGATGGTT   BCCFFFFFHHHHHIIJJJJJJIIJJJJJJJJFHIHIJJJIJIIIIGHFFFFFFEEEEEEEDDDDDFDDDDDDDDD>CDDEDEEDDDDDDCCDDDDDDDDCD   NM:i:0  ms:i:202        AS:i:202        nn:i:0  tp:A:P  cm:i:14 s1:i:94 s2:i:0
HISEQ:426:C5T65ACXX:5:2302:4953:2090    4       *       0       0       *       *       0       0       AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAACAACCAAAAAGAGACGAACAA   CCCFFDDFAFFBHJHGGGIHIJBGGHIIJJJJJJJHGEIJGIFIIIHCBGHIJIIIIIJJHHHHEF@D@=;=,0)0&5&))+(((+((((&+(((()&&)(

The equivalent PAF file is

HISEQ:426:C5T65ACXX:5:2302:1943:2127    101     0       101     +       ENA|K01711|K01711.1     15894   447     548     101     101     60      NM:i:0  ms:i:202        AS:i:202        nn:i:0  tp:A:P  cm:i:14 s1:i:94 s2:i:0  cg:Z:101M

In brief, the sequences are dropped, so the final file is smaller. Extra fields (starting from NM:i:0) can be dropped or kept using the keep_extra_field argument. Alignments with * characters are dropped. The first line (@SQ) is used to retrieve the length of the contig, which is stored in the PAF file (column 7).

The 12 compulsory PAF fields are:

===  ======  =========================================
Col  Type    Description
===  ======  =========================================
1    string  Query sequence name
2    int     Query sequence length
3    int     Query start (0-based)
4    int     Query end (0-based)
5    char    Relative strand: "+" or "-"
6    string  Target sequence name
7    int     Target sequence length
8    int     Target start on original strand (0-based)
9    int     Target end on original strand (0-based)
10   int     Number of residue matches
11   int     Alignment block length
12   int     Mapping quality (0-255; 255 for missing)
===  ======  =========================================

For developers:

Get the measles data from Sequana library (2 paired fastq files):

minimap2 measles.fa R1.fastq > approx-mapping.paf

You can ask minimap2 to generate CIGAR at the cg tag of PAF with:

minimap2 -c measles.fa R1.fastq > alignment.paf

or to output alignments in the SAM format:

minimap2 -a measles.fa R1.fastq > alignment.sam

Each SAM line must contain the 11 positional fields and the NM:i and nn:i fields (see example above).

constructor

Parameters:
  • infile (str) -- input SAM file

  • outfile (str) -- output PAF filename

Reference:

This function is a direct translation of https://github.com/lh3/miniasm/blob/master/misc/sam2paf.js (Dec. 2017).

_default_method = 'python'

Default value

_method_python(*args, **kwargs)[source]

Internal method
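The mapping from a SAM record to the 12 compulsory PAF fields can be sketched as follows. This is a simplified illustration, not the Bioconvert code nor a faithful port of sam2paf.js: the function names are hypothetical, the number of residue matches is approximated from the CIGAR string and the NM tag, and extra fields are not copied.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def read_target_lengths(header_lines):
    """Collect {target name: length} from the @SQ header lines."""
    lengths = {}
    for line in header_lines:
        if line.startswith("@SQ"):
            items = line.rstrip("\n").split("\t")[1:]
            tags = dict(item.split(":", 1) for item in items)
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def sam_record_to_paf(line, target_lengths):
    """Return the 12 compulsory PAF fields, or None for unmapped records."""
    fields = line.rstrip("\n").split("\t")
    name, flag, target, position, mapq, cigar = fields[:6]
    flag = int(flag)
    if flag & 4 or cigar == "*":
        return None  # alignments with '*' are dropped
    edit_distance = 0
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
    matched = inserted = deleted = skipped = 0
    clips = [0, 0]  # clipped bases before and after the aligned part
    for length, operation in CIGAR_OPERATION.findall(cigar):
        length = int(length)
        if operation in "M=X":
            matched += length
        elif operation == "I":
            inserted += length
        elif operation == "D":
            deleted += length
        elif operation == "N":
            skipped += length
        elif operation in "SH":
            clips[1 if (matched or inserted or deleted) else 0] += length
    query_length = clips[0] + matched + inserted + clips[1]
    reverse = bool(flag & 16)
    query_start = clips[1] if reverse else clips[0]
    query_end = query_length - (clips[0] if reverse else clips[1])
    target_start = int(position) - 1  # SAM is 1-based, PAF is 0-based
    target_end = target_start + matched + deleted + skipped
    block_length = matched + inserted + deleted
    # Assumption: mismatches = NM minus the inserted and deleted bases.
    residue_matches = matched - (edit_distance - inserted - deleted)
    return [name, query_length, query_start, query_end,
            "-" if reverse else "+", target, target_lengths[target],
            target_start, target_end, residue_matches, block_length,
            int(mapq)]


header = ["@SQ\tSN:ENA|K01711|K01711.1\tLN:15894"]
record = "\t".join(["read1", "0", "ENA|K01711|K01711.1", "448", "60", "101M",
                    "*", "0", "0", "A" * 101, "I" * 101, "NM:i:0"])
paf = sam_record_to_paf(record, read_target_lengths(header))
print(paf)
```

With the position, CIGAR string and NM tag of the first alignment of the example above (shortened read name and dummy sequence), the sketch returns the same values as columns 2 to 12 of the PAF line: 101, 0, 101, +, ENA|K01711|K01711.1, 15894, 447, 548, 101, 101 and 60.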

Convert SCF file to FASTA file

class SCF2FASTA(infile, outfile)[source]

Converts a binary SCF/ABI file to Fasta format.

Parameters:
  • infile (str) -- input SCF/ABI file

  • outfile (str) -- output name file

constructor

Parameters:
  • infile (str) -- the path of the input file.

  • outfile (str) -- the path of the output file

_default_method = 'python'

Default value

_method_python(*args, **kwargs)[source]

Internal method

Convert SCF file to FASTQ file

class SCF2FASTQ(infile, outfile)[source]

Converts a binary SCF file to FastQ file

Parameters:
  • infile (str) -- input SCF file

  • outfile (str) -- output name file

constructor

Parameters:
  • infile (str) -- the path of the input file.

  • outfile (str) -- the path of the output file

_default_method = 'python'

Default value

_method_python(*args, **kwargs)[source]

Internal method

Convert SRA format to FASTQ format

class SRA2FASTQ(infile, outfile, test=False)[source]

Download FASTQ from SRA archive

bioconvert sra2fastq ERR043367

This may take some time since the files are downloaded from the SRA website.

constructor

https://edwards.flinders.edu.au/fastq-dump/

library used: sra-toolkit

_default_method = 'fastq_dump'

Default value

_method_fastq_dump(*args, **kwargs)[source]

Uses Sratoolkit (fastq-dump) to convert a sra file to fastq

Fastq-dump documentation

Converts STOCKHOLM file to CLUSTAL file.

class STOCKHOLM2CLUSTAL(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts a sequence alignment from STOCKHOLM format to CLUSTAL format

Methods available are based on squizz [SQUIZZ] and biopython [BIOPYTHON].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert STOCKHOLM interleaved file in CLUSTAL format using biopython.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Convert STOCKHOLM file in CLUSTAL format using squizz tool.

Converts STOCKHOLM to PHYLIP format.

class STOCKHOLM2PHYLIP(infile, outfile=None, *args, **kwargs)[source]

Converts a sequence alignment from STOCKHOLM format to PHYLIP interleaved format

Methods available are based on squizz [SQUIZZ], and biopython [BIOPYTHON].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert STOCKHOLM interleaved file in PHYLIP format using biopython.

Bio.SeqIO Documentation

_method_squizz(*args, **kwargs)[source]

Convert STOCKHOLM interleaved file in PHYLIP interleaved format using squizz tool.

Convert TSV format to CSV format

class TSV2CSV(infile, outfile)[source]

Convert TSV file into CSV file

Available methods: Python, Pandas

Methods available are based on python or Pandas [PANDAS].

See also

CSV2TSV

Constructor

Parameters:
  • infile (str) -- tabulated file

  • outfile (str) -- comma-separated file

_default_method = 'python'

Default value

_method_pandas(in_sep='\t', out_sep=',', line_terminator='\n', *args, **kwargs)[source]

Do the conversion TSV -> CSV using Pandas library

pandas documentation

_method_python(in_sep='\t', out_sep=',', line_terminator='\n', *args, **kwargs)[source]

Do the conversion TSV -> CSV using csv module.

csv documentation

_method_python_v2(in_sep='\t', out_sep=',', line_terminator='\n', *args, **kwargs)[source]

Do the conversion TSV -> CSV using csv module

Note

This method can neither escape nor quote characters in the output.

csv documentation
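The role of quoting can be seen with the csv standard module. The sketch below is an illustration (the function name tsv_to_csv is hypothetical; its keyword arguments mirror the signatures listed above) and compares a csv-based conversion with a naive separator replacement.

```python
import csv
import io


def tsv_to_csv(tsv_handle, csv_handle, in_sep="\t", out_sep=",",
               line_terminator="\n"):
    """Copy rows from a TSV handle to a CSV handle, quoting when needed."""
    reader = csv.reader(tsv_handle, delimiter=in_sep)
    writer = csv.writer(csv_handle, delimiter=out_sep,
                        lineterminator=line_terminator)
    writer.writerows(reader)


source = io.StringIO("name\tnote\nread1\tlow, noisy\n")
target = io.StringIO()
tsv_to_csv(source, target)
print(target.getvalue())

# Naive replacement of the separator: the comma inside the note is not
# protected, so the row now appears to have three fields instead of two.
naive = "read1\tlow, noisy".replace("\t", ",")
print(naive)
```

The csv writer surrounds the field containing a comma with double quotes, whereas a plain string replacement silently changes the number of columns.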

Conversion from TWOBIT to FASTA format

class TWOBIT2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]

Converts sequences in TWOBIT format to FASTA format

Conversion is based on UCSC [UCSC] and py2bit.

constructor

Parameters:
_default_method = 'py2bit'

Default value

_method_py2bit(*args, **kwargs)[source]

This method uses the py2bit python extension.

py2bit documentation

_method_ucsc(*args, **kwargs)[source]

Convert twobit file to FASTA format using UCSC twoBitToFa.

UCSC faToTwoBit documentation

Convert VCF to BCF format

class VCF2BCF(infile, outfile=None, *args, **kwargs)[source]

Convert VCF file to BCF format

Method based on bcftools [BCFTOOLS].

Parameters:
  • infile (str) -- The path to the input VCF file.

  • outfile (str) -- The path to the output file.

_default_method = 'bcftools'

Default value

_method_bcftools(*args, **kwargs)[source]

For this method, we use the BCFtools tool

BCFtools documentation

command used:

bcftools view -Sb

Parameters:
  • args --

  • kwargs --

Returns:

Convert VCF to BED3 file

class VCF2BED(infile, outfile)[source]

Convert VCF file to BED3 file by extracting positions.

The awk method implemented below reports an interval of length 1 for a SNP, the length of the insertion for an insertion, or the length of the deleted part for a deletion.

constructor

Parameters:
  • infile (str) -- the path of the input file.

  • outfile (str) -- the path of the output file

_default_method = 'awk'

Default value

_method_awk(*args, **kwargs)[source]

Do the conversion VCF -> BED using awk.

awk documentation

Returns:

the standard output

Return type:

io.StringIO object.

Convert VCF format to WIGGLE format

class VCF2WIGGLE(infile, outfile)[source]

Convert sorted VCF file into WIGGLE file

Parameters:
  • infile (str) -- The path to the input VCF file. It must be sorted.

  • outfile (str) -- The path to the output file

_default_method = 'wiggletools'

Default value

_method_wiggletools(*args, **kwargs)[source]

Conversion using wiggletools

wiggletools documentation

Convert XLS format to CSV format

class XLS2CSV(infile, outfile)[source]

Convert XLS file into CSV file

Extra arguments when using Bioconvert executable.

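=================  ===========================================
name               Description
=================  ===========================================
--sheet-name       The name or id of the sheet to convert
--out-sep          The separator used in the output file
--line-terminator  The line terminator used in the output file
=================  ===========================================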

Methods available are based on pandas [PANDAS] and pyexcel [PYEXCEL].

Constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_method_pandas(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]

Do the conversion XLS -> CSV using Pandas library.

pandas documentation

_method_pyexcel(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]

Do the conversion XLS -> CSV using pyexcel library

pyexcel documentation

Convert XLSX format to CSV format

class XLSX2CSV(infile, outfile)[source]

Convert XLSX file into CSV file

Extra arguments when using Bioconvert executable.

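=================  ===========================================
name               Description
=================  ===========================================
--sheet-name       The name or id of the sheet to convert
--out-sep          The separator used in the output file
--line-terminator  The line terminator used in the output file
=================  ===========================================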

Methods available are based on pandas [PANDAS] and pyexcel [PYEXCEL].

Constructor

Parameters:
  • infile (str) --

  • outfile (str) --

_default_method = 'pandas'

Default value

_method_pandas(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]

Do the conversion XLSX -> CSV using Pandas library.

pandas documentation

_method_pyexcel(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]

Do the conversion XLSX -> CSV using pyexcel library

pyexcel documentation

Convert XMFA to PHYLIP format

class XMFA2PHYLIP(infile, outfile=None, *args, **kwargs)[source]

Converts a sequence alignment from XMFA to PHYLIP format.

Method available based on biopython [BIOPYTHON].

constructor

Parameters:
_default_method = 'biopython'

Default value

_method_biopython(*args, **kwargs)[source]

Convert XMFA (Mauve) file to PHYLIP format.

Bio.SeqIO Documentation

Convert YAML to JSON format

class YAML2JSON(infile, outfile, *args, **kargs)[source]

Convert YAML file into JSON file

Conversion is based on yaml and json standard Python modules

Note

YAML comments will be lost in JSON output

Reference:

http://yaml.org/spec/1.2/spec.html#id2759572

constructor

Parameters:
  • infile (str) -- input YAML file.

  • outfile (str) -- output JSON file

_default_method = 'python'

Default value

_method_python(*args, **kwargs)[source]

Internal method

get_json()[source]

Return the JSON dictionary corresponding to the YAML input.

IO functions

bioconvert.io.maf

bioconvert.io.scf

read_from_buffer(f_file, length, offset)[source]

Return 'length' bits of file 'f_file' starting at offset 'offset'

class MAF(filename, outfile=None)[source]

A reader for MAF format.

count_insertions(alnString)[source]

return length without insertion, forward and reverse shift

class MAFLine(line)[source]

A reader for MAF format.

mode refname start algsize strand refsize alignment

a
s ref    100 10 + 100000 ---AGC-CAT-CATT
s contig 0   10 + 10     ---AGC-CAT-CATT

a
s ref    100 12 + 100000 ---AGC-CAT-CATTTT
s contig 0   12 + 12     ---AGC-CAT-CATTTT

The alignments are stored by pair, one item for the reference, one for the query. The query (second line) starts at zero.
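The layout above can be read with a few lines of Python. The sketch below is an illustration independent of the MAF and MAFLine classes; the names MafRecord, parse_maf_line and read_alignment_pairs are hypothetical, and the fields are named after the column list given above.

```python
from collections import namedtuple

MafRecord = namedtuple(
    "MafRecord", "mode refname start algsize strand refsize alignment"
)


def parse_maf_line(line):
    """Split one 's' line into the seven fields named in the text above."""
    mode, refname, start, algsize, strand, refsize, alignment = line.split()
    return MafRecord(mode, refname, int(start), int(algsize), strand,
                     int(refsize), alignment)


def read_alignment_pairs(lines):
    """Group consecutive 's' lines two by two: (reference, query)."""
    records = [parse_maf_line(line) for line in lines
               if line.startswith("s ")]
    return list(zip(records[0::2], records[1::2]))


example = """a
s ref    100 10 + 100000 ---AGC-CAT-CATT
s contig 0   10 + 10     ---AGC-CAT-CATT

a
s ref    100 12 + 100000 ---AGC-CAT-CATTTT
s contig 0   12 + 12     ---AGC-CAT-CATTTT
"""
pairs = read_alignment_pairs(example.splitlines())
for reference, query in pairs:
    print(reference.refname, reference.start, query.refname, query.start)
```

In this example, algsize equals the number of non-gap characters of the alignment string (10 and 12), which gives a simple consistency check when reading such files.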

Formats

Below, we provide a list of formats used in bioinformatics or computational biology. Most of these formats are used in Bioconvert and can be converted to other formats. Some are listed for book-keeping only.

We hope that this page will be useful to all developers and scientists. If you would like to contribute, please edit the file doc/formats.rst in our GitHub repository.

If you wish to update this page, please see the Developer guide page.

TWOBIT

Format: binary
Status: available
Type: sequence

A 2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.

The file begins with a 16-byte header containing the following fields:

  • signature: the number 0x1A412743 in the architecture of the machine that created the file

  • version: zero for now. Readers should abort if they see a version number higher than 0

  • sequenceCount: the number of sequences in the file

  • reserved: always zero for now

All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.

The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:

  • nameSize: a byte containing the length of the name field

  • name: the sequence name itself (in ASCII-compatible byte string), of variable length depending on nameSize

  • offset: the 32-bit offset of the sequence data relative to the start of the file, not aligned to any 4-byte padding boundary

The index is followed by the sequence records, which contain nine fields:

  • dnaSize - number of bases of DNA in the sequence

  • nBlockCount - the number of blocks of Ns in the file (representing unknown sequence)

  • nBlockStarts - an array of length nBlockCount of 32 bit integers indicating the (0-based) starting position of a block of Ns

  • nBlockSizes - an array of length nBlockCount of 32 bit integers indicating the length of a block of Ns

  • maskBlockCount - the number of masked (lower-case) blocks

  • maskBlockStarts - an array of length maskBlockCount of 32 bit integers indicating the (0-based) starting position of a masked block

  • maskBlockSizes - an array of length maskBlockCount of 32 bit integers indicating the length of a masked block

  • reserved - always zero for now

  • packedDna - the DNA packed to two bits per base, represented as so: T - 00, C - 01, A - 10, G - 11. The first base is in the most significant 2-bit byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011.
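The header check and the 2-bit encoding described above can be sketched with the struct standard module. This is an illustration of the layout only, not a complete 2bit reader (the index, N blocks and mask blocks are ignored), and the function names read_header and unpack_dna are hypothetical.

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743
BASES = "TCAG"  # 2-bit codes 00, 01, 10 and 11


def read_header(header_bytes):
    """Parse the 16-byte header; return (byte_order, version, count)."""
    # Try both byte orders: a swapped signature means that all multi-byte
    # fields of the file must be byte-swapped as well.
    for byte_order in ("<", ">"):
        signature, version, sequence_count, _reserved = struct.unpack(
            byte_order + "4I", header_bytes[:16]
        )
        if signature == TWOBIT_SIGNATURE:
            if version != 0:
                raise ValueError("unsupported 2bit version")
            return byte_order, version, sequence_count
    raise ValueError("not a 2bit file: signature mismatch")


def unpack_dna(packed, dna_size):
    """Decode dna_size bases from bytes holding four bases each."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):  # most significant pair of bits first
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:dna_size])


header = struct.pack(">4I", TWOBIT_SIGNATURE, 0, 1, 0)
print(read_header(header))
print(unpack_dna(bytes([0b00011011]), 4))
```

The second call decodes the byte 00011011 given as an example above and returns TCAG. The dna_size argument is needed because the last byte of a sequence may be only partially filled.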

Bioconvert conversions

TWOBIT2FASTA

AGP

Format: human-readable
Status:
Type: assembly

AGP files describe the assembly of a sequence from smaller fragments. The large object can be a contig, a scaffold (supercontig), or a chromosome. Each line (row) of the AGP file describes a different piece of the object, and has the column entries defined below. Several versions of the format exist: 1.0, 2.0 and 2.1.

You can validate your AGP file using this website: https://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi

ABI

Format: binary
Status: available
Type: sequence

ABI files are trace files that include the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note that each ABI file contains one and only one sequence (no need to index the file). The trace data contains the probabilities of the four nucleotide bases along the sequencing run, together with the sequence deduced from that data. ABI trace is a binary format.

This file format is produced by ABI sequencing machines, which generate ABI "Sanger" capillary sequences.

Bioconvert conversions:

ABI2QUAL, ABI2FASTQ, ABI2FASTA

See also

SCF, SCF2FASTA, SCF2FASTQ

ASQG

Format:

human-readable

Status:

not included (deprecated)

Type:

assembly

The ASQG format describes an assembly graph. Each line is a tab-delimited record. The first field in each record describes the record type. The three types are:

  • HT: Header record. This record contains metadata tags for the file version (VN tag) and parameters associated with the graph (for example the minimum overlap length).

  • VT: Vertex records. The second field contains the vertex identifier, the third field contains the sequence. Subsequent fields contain optional tags.

  • ED: Edge description records. Fields are:
    • sequence 1 name

    • sequence 2 name

    • sequence 1 overlap start (0 based)

    • sequence 1 overlap end (inclusive)

    • sequence 1 length

    • sequence 2 overlap start (0 based)

    • sequence 2 overlap end (inclusive)

    • sequence 2 length

    • sequence 2 orientation (1 for reversed with respect to sequence 1)

    • number of differences in overlap (0 for perfect overlaps, which is the default).

Example:

HT  VN:i:1  ER:f:0  OL:i:45 IN:Z:reads.fa   CN:i:1  TE:i:0
VT  read1   GATCGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGG
VT  read2   CGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATA
VT  read3   ATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATATT
ED  read2 read1 0 46 50 3 49 50 0 0
ED  read3 read2 0 47 50 2 49 50 0 0
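As a rough illustration of the ED field order listed above, the following sketch parses one edge record. The names Edge and parse_edge are hypothetical, and the sketch assumes the record carries no optional tags after the ten documented fields.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    seq1: str
    seq2: str
    start1: int
    end1: int
    len1: int
    start2: int
    end2: int
    len2: int
    reversed2: bool
    differences: int


def parse_edge(line):
    """Parse one ED record; fields follow the order listed above."""
    fields = line.split()
    if len(fields) != 11 or fields[0] != "ED":
        raise ValueError(f"not an ED record: {line!r}")
    seq1, seq2 = fields[1:3]
    numbers = [int(x) for x in fields[3:]]
    return Edge(seq1, seq2, *numbers[:6], bool(numbers[6]), numbers[7])


edge = parse_edge("ED\tread2 read1 0 46 50 3 49 50 0 0")
# Overlap ends are inclusive, so the overlap length is end - start + 1.
print(edge.end1 - edge.start1 + 1)  # 47
```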

BAI

Format:

binary

Status:

not included

Type:

index

A BAI file is the index file of a BAM file. BAI files are not used in Bioconvert.

BAM

Format:

binary

Status:

included

Type:

Sequence alignment

The BAM (Binary Alignment Map) is the binary version of the Sequence Alignment Map (SAM) format. It is a compact and index-able representation of nucleotide sequence alignments.

See also

The SAM and BAI formats.

BCF

Format:

binary

Status:

included

Type:

variant

Binary version of the Variant Call Format (VCF).

Bioconvert conversions

BCF2VCF, VCF2BCF, BCF2WIGGLE

BCL

Format:

binary

Status:

not included

Type:

sequence

BCL is the raw format used by Illumina sequencers. These data are converted into FastQ by a tool called bcl2fastq. This type of conversion is not included in Bioconvert: Illumina provides a bcl2fastq executable, and its user guide is available online. In most cases, the BCL files are already converted and users only receive the FastQ files, so we do not provide such a converter.

BEDGRAPH

Format:

human-readable

Status:

included

Type:

database

BedGraph is a subset of the BED12 format. It is a 4-column tab-delimited file containing the chromosome name, the start and end positions, and a number that is often used to show coverage depth. It is therefore the same format as BED4. Example:

chr1    0     75  0
chr1    75   176  1
chr1    176  177  2

See also

BED

BED

Format:

human-readable

Status:

not included

Type:

database

A Browser Extensible Data (BED) file is a tab-delimited text file. It is a concise way to represent genomic features and annotations.

The BED format is very versatile, which makes it difficult to handle in Bioconvert, so we describe it exhaustively here.

Although the BED format supports up to 12 columns, only the first 3 are required by tools such as the UCSC browser, Galaxy, or bedtools.

So, in general, BED lines have 3 required fields and 9 additional optional fields.

Generally, all BED files have the same extension (.bed) irrespective of the number of columns. We refer to the 3-column version as BED3, the 4-column version as BED4, and so on.

The number of fields per line must be consistent. If some fields are empty, additional column information must be filled for consistency (e.g., with a "."). BED fields can be whitespace-delimited or tab-delimited although some variations of BED types such as "bed Detail" require a tab character delimitation for the detail columns (see Note box here below).

Note

BED detail format

It is an extension of the BED format with 2 additional fields. The first one is an ID, which can be used in place of the name field for creating links from the details pages. The second additional field is a description of the item, which can be long and can consist of HTML.

Requirements:
  • fields must be tab-separated

  • "type=bedDetail" must be included in the track line,

  • the name and position fields should uniquely describe items so that the correct ID and description will be displayed on the details pages.

The following example uses the first 4 columns of the BED format, but up to 12 may be used. Note the header, which contains the type=bedDetail string:

track name=HbVar type=bedDetail description="HbVar custom track" db=hg19  visibility=3 url="blabla.html"
chr11  5246919 5246920 Hb_North_York   2619    Hemoglobin variant
chr11  5255660 5255661 HBD c.1 G>A 2659    delta0 thalassemia
chr11  5247945 5247946 Hb Sheffield    2672    Hemoglobin variant
chr11  5255415 5255416 Hb A2-Lyon  2676    Hemoglobin variant
chr11  5248234 5248235 Hb Aix-les-Bains    2677    Hemoglobin variant

Warning

Browsers such as the Genome Browser (http://genome.ucsc.edu/) can visualise BED files. BED files can usually be annotated with header lines, which begin with the word "browser" or "track", to assist the browser in the display and interpretation.

Such annotation track header lines are not permissible in utilities such as bedToBigBed, which convert lines of BED text to indexed binary files.

The file description below is modified from: http://genome.ucsc.edu/FAQ/FAQformat#format1.

The first three required BED fields are:

  1. chrom - The name of the chromosome (e.g. chr3) or scaffold.

  2. chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.

  3. chromEnd - The ending position of the feature in the chromosome. The chromEnd base is not included in the display of the feature.

The 9 additional optional BED fields are:

  1. name - Label of the BED line

  2. score - A score between 0 and 1000. In the Genome Browser, if the track line useScore attribute is set to 1 for this annotation data set, the score value determines the level of gray in which this feature is displayed.

  3. strand - Defines the strand. Either "." (=no strand) or "+" or "-".

  4. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.

  5. thickEnd - The ending position at which the feature is drawn thickly.

  6. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).

  7. blockCount - The number of blocks (exons) in the BED line.

  8. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

  9. blockStarts - A comma-separated list of block starts. Should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

In BED files with block definitions, the first blockStart value must be 0, so that the first block begins at chromStart. Similarly, because block starts are relative to chromStart, chromStart plus the final blockStart plus the final blockSize must equal chromEnd. Blocks may not overlap.

Here is a simple example:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
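The block rules can be checked mechanically. The sketch below is a hypothetical helper (check_blocks is not part of Bioconvert or bedtools) that tests a whitespace-delimited BED12 line, assuming the block starts are listed in increasing order.

```python
def check_blocks(line):
    """Return True if a BED12 line satisfies the block rules stated above."""
    fields = line.split()
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    # A trailing comma (as in "567,488,") is tolerated: drop empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if block_count < 1 or len(sizes) != block_count or len(starts) != block_count:
        return False
    # The first block must begin at chromStart.
    if starts[0] != 0:
        return False
    # Block starts are relative to chromStart; the last block ends at chromEnd.
    if chrom_start + starts[-1] + sizes[-1] != chrom_end:
        return False
    # Blocks may not overlap: each must end before the next one starts.
    return all(s + n <= nxt for s, n, nxt in zip(starts, sizes, starts[1:]))


print(check_blocks("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"))  # True
```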

Note

If your data set is BED-like but very large (over 50MB), you can convert it to the BIGBED format.

See also

BEDGRAPH

BED3

A BED3 is supported by bedtools. It is a BED file where each feature is described by chrom, start and end (with tab-delimited values). Example:

chr1    100    120

See BED section for details.

BED4

A BED4 is a BED file where each feature is described by chrom, start, end and name (with tab-delimited values). The last column could also be a number. Example:

chr1    100    120    gene1

See BED section for details.

See also

BEDGRAPH

BED5

A BED5 is supported by bedtools. It is a BED file where each feature is described by chrom, start, end, name and score (with tab-delimited values). Example:

chr1    100    120    gene1 0

See BED section for details.

BED6

A BED6 is supported by bedtools. It is a BED file where each feature is described by chrom, start, end, name, score and strand (with tab-delimited values). Example:

chr1    100    120    gene1 0 +

See BED section for details.

BED12

A BED12 is supported by bedtools. It is a BED file where each feature is described by all 12 BED fields. Example:

chr1    100    120    gene1 0 + 100 100 0 3 1,2,3 4,5,6

See BED section.

BIGBED

Format:

binary

Status:

included

Type:

database/track

The bigBed format stores annotation items. BigBed files are created initially from BED type files. The resulting bigBed files are in an indexed binary format. The main advantage of bigBed files is that only the portions of the files needed to display a particular region are used.

Bioconvert conversions

BIGBED2COV, BIGBED2WIGGLE

BIGWIG

Format:

binary

Status:

included

Type:

database/track

The bigWig format is useful for dense, continuous data. BigWig files can be created from wiggle files (see WIGGLE (WIG)). This type of file is an indexed binary format.

Wiggle data must be continuous, unlike BED. You can convert a BED/BEDGraph file to bigWig using BEDGRAPH2BIGWIG.

To create a bigWig file from a wiggle file, you need to remove the existing "track" header.

Bioconvert conversions:

BIGWIG2WIGGLE, BEDGRAPH2BIGWIG

Note

Wiggle, bigWig, and bigBed files use 0-based half-open coordinates, which are also used by this extension. So to access the value for the first base on chr1, one would specify the starting position as 0 and the end position as 1. Similarly, bases 100 to 115 would have a start of 99 and an end of 115. This is simply for the sake of consistency with the underlying bigWig file and may change in the future in various formats and tools dealing with those formats.
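The two examples in this note follow a single rule, sketched here with a hypothetical helper (to_half_open is not a Bioconvert function): subtract one from the first base and keep the last base unchanged.

```python
def to_half_open(first_base, last_base):
    """Convert a 1-based inclusive range to a 0-based half-open (start, end)."""
    return first_base - 1, last_base


print(to_half_open(1, 1))      # (0, 1)
print(to_half_open(100, 115))  # (99, 115)
```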

BIM

Format:

human-readable

Status:

included

Type:

variants

The BIM formatted file is a variant information file accompanying a .bed or biallelic .pgen binary genotype table. Please see the PLINK binary files (BED/BIM/FAM) section.

The fields are:

  • chromosome number (integer)

  • SNP marker ID (string) / variant ID

  • SNP genetic position (cM) (float) / position in centimorgans (safe to use dummy value 0)

  • SNP physical position (bp) (1-based)

  • Alternate allele code

  • Reference allele code

Here is an example:

1    rs0     0   1000    0   1
1    rs10    0   1001    2   1

BZ2

Format:

binary

Status:

included

Type:

Compression

bzip2 is a file compression program that uses the Burrows–Wheeler algorithm. The extension is usually .bz2. BZ2 compression is usually better than gzip for FastQ compression (factor 2-3).

Bioconvert conversions:

GZ2BZ2, GZ2DSRC, BZ22GZ, DSRC2GZ
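For illustration only, a gzip to bzip2 re-compression can be sketched with the Python standard library. This is not how the GZ2BZ2 converter is implemented (that is not described here); the function name gz_to_bz2 and the file paths are hypothetical.

```python
import bz2
import gzip
import shutil


def gz_to_bz2(gz_path, bz2_path):
    """Re-compress a gzip file as bzip2 without loading it fully in memory."""
    with gzip.open(gz_path, "rb") as src, bz2.open(bz2_path, "wb") as dst:
        # Stream the decompressed bytes from one file object to the other.
        shutil.copyfileobj(src, dst)
```

A call such as gz_to_bz2("reads.fastq.gz", "reads.fastq.bz2") leaves the input file untouched and writes the bzip2 version next to it.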

COV

A simple TSV file with 3 columns to store coverage in a continuous way. First column is contig/chromosome name, second is position and third is coverage. Expected positions are continuous. The BEDGRAPH stores an extra column but can be a more compact way of storing coverage/depth.

Example:

chr1   1    10
chr1   2    11
chr1   3    15
chr1   4    12
chr1   5    11
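The relation between BEDGRAPH and COV can be sketched as an expansion of each interval into one row per base. This is a minimal sketch, not the BEDGRAPH2COV implementation; the function name is hypothetical, and the 1-based position convention is an assumption taken from the example above.

```python
def bedgraph_to_cov(lines):
    """Yield (chrom, position, coverage) for every base of each interval."""
    for line in lines:
        chrom, start, end, value = line.split()
        # BedGraph intervals are 0-based and half-open; the positions emitted
        # here are 1-based (an assumption based on the COV example above).
        for position in range(int(start) + 1, int(end) + 1):
            yield chrom, position, value


for row in bedgraph_to_cov(["chr1 0 2 10", "chr1 2 3 11"]):
    print(*row, sep="\t")
```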

CRAM

Format:

binary

Status:

not included

Type:

Alignment

The CRAM file format is a more dense form of BAM files with the benefit of saving much disk space. While BAM files contain all sequence data within a file, CRAM files are smaller by taking advantage of an additional external reference sequence file. This file is needed to both compress and decompress the read information.

See also

BAM

Bioconvert Conversions

BAM2CRAM, SAM2CRAM, CRAM2BAM, CRAM2SAM.

CLUSTAL

Format:

human-readable

Status:

included

Type:

multiple alignment

A file in Clustal format is organised as follows:

  • The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Nevertheless, many such files start with CLUSTAL or CLUSTAL X. Other information in the first line is ignored.

  • One or more empty lines.

  • One or more blocks of sequence data. Each block consists of:
    • one line for each sequence in the alignment. Each line consists of the sequence name, white space, up to 60 sequence symbols and, optionally, white space followed by a cumulative count of residues for the sequence

    • a line showing the degree of conservation for the columns of the alignment in this block

    • one or more empty lines.

Some rules about representing sequences:

  • Case does not matter.

  • Sequence symbols should be from a valid alphabet.

  • Gaps are represented using hyphens ("-").

  • The characters used to represent the degree of conservation are:
    • "*": all residues or nucleotides in that column are identical

    • ":": conserved substitutions have been observed

    • ".": semi-conserved substitutions have been observed

    • " " (space): no match.

Here is an example of a multiple alignment in CLUSTAL W format:

CLUSTAL W (1.82) multiple sequence alignment


FOSB_MOUSE      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
FOSB_HUMAN      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
                ************************************************************

FOSB_MOUSE      TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL 98
FOSB_HUMAN      TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL 98
                ***********************:**************
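To make the block structure concrete, here is a minimal reader that concatenates the sequence lines of successive blocks. It is a sketch, not the parser used by the CLUSTAL converters; read_clustal is a hypothetical name, and the sketch assumes conservation lines always start with white space.

```python
def read_clustal(text):
    """Return {name: sequence} from CLUSTAL-formatted text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header")
    sequences = {}
    for line in lines[1:]:
        # Skip empty lines and conservation lines (which start with a space).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        # fields[2], when present, is the optional cumulative residue count.
        name, symbols = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + symbols
    return sequences
```

Python dictionaries preserve insertion order, so the sequences come back in the order in which they first appear in the file.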

Some Bioconvert conversions

CLUSTAL2FASTA, CLUSTAL2NEXUS, CLUSTAL2PHYLIP, CLUSTAL2STOCKHOLM

CSV

Format:

human-readable

Type:

database

Status:

included

The comma-separated values (CSV) format is a delimited text format that uses commas to separate values. See the CSV format page for details.

DSRC

Format:

binary

Status:

included

Type:

Compression

DSRC is a compression format dedicated to DNA sequences.

Bioconvert conversions:

GZ2BZ2, GZ2DSRC, BZ22GZ, DSRC2GZ

EMBL

Format:

human-readable

Status:

included

Type:

database

The EMBL format stores a sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word "ID". The start of the sequence section is marked by a line beginning with the word "SQ". The "//" (terminator) line contains no data or comments and designates the end of an entry.

An example sequence in EMBL format is:

ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg        60
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg       120
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc       180
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag       240
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga       300
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca       360
     gacctgaa                                                                368

Bioconvert conversions:

EMBL2GENBANK, GENBANK2EMBL

FAM

Format:

human-readable

Status:

included

Type:

database

The FAM format is used to store sample information accompanying a .bed or biallelic .pgen binary genotype table. Please see the PLINK binary files (BED/BIM/FAM) section.

In brief, it stores the first 6 columns of the PED file. So it is a text file with no header line, and one line per sample with the following six fields:

  • Family ID ('FID')

  • Individual ID ('IID'; cannot be '0')

  • Individual ID of father ('0' if father isn't in dataset)

  • Individual ID of mother ('0' if mother isn't in dataset)

  • Sex code ('1' = male, '2' = female, '0' = unknown)

  • Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

For example:

1 1000000000 0 0 1 1
1 1000000001 0 0 1 2
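The six fields can be read into a small record as sketched below. The function name and dictionary keys are hypothetical, chosen only for this illustration; the phenotype is kept as text because its interpretation depends on whether the trait is case/control.

```python
SEX_CODES = {"1": "male", "2": "female", "0": "unknown"}


def parse_fam_line(line):
    """Parse one FAM line into a dictionary of its six fields."""
    fid, iid, father, mother, sex, phenotype = line.split()
    return {
        "family_id": fid,
        "individual_id": iid,
        # '0' means the parent is not in the dataset.
        "father_id": None if father == "0" else father,
        "mother_id": None if mother == "0" else mother,
        "sex": SEX_CODES.get(sex, "unknown"),
        "phenotype": phenotype,
    }


print(parse_fam_line("1 1000000000 0 0 1 1")["sex"])  # male
```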

FAA

FASTA formatted file storing amino acid sequences. A multiple protein FASTA file can have the more specific extension mpfa.

FASTA

Format:

human-readable

Status:

included

Type:

Sequence

FASTA is one of the most widely used sequence formats. It can store multiple sequence records together with their identifiers.

A sequence entry has a one-line header followed by one or more lines of sequence. The header must start with the ">" character. The next word is the sequence identifier or the accession number; the rest of the line is considered as description.

The NCBI recommendation does not allow blank lines in the middle of FASTA files. Note, however, that some tools can handle blank lines by ignoring them. Including blank lines is nevertheless not recommended.

There is no standard file extension for a text file containing FASTA formatted sequences. Although there is a plethora of ad hoc file extensions (fasta, fas, fa, seq, fsa, fna, ffn, faa, frn), we use only fasta, fa and fst within Bioconvert (see extensions). For completeness, fasta is the generic FASTA file, fna stands for FASTA nucleic acid, ffn for FASTA nucleotide of gene regions, faa for FASTA amino acid, frn for FASTA non-coding RNA, etc.

An example sequence in FASTA format is:

>X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
GCCTCACTGGAGGGCATTGCCCCGGAAGATCAAGTCGTGCTCCTGGCAGGCGCGCCCCTGGAGGATGAGG
CCACTCTGGGCCAGTGCGGGGTGGAGGCCCTGACTACCCTGGAAGTAGCAGGCCGCATGCTTGGAGGTAA
AGTTCATGGTTCCCTGGCCCGTGCTGGAAAAGTGAGAGGTCAGACTCCTAAGGTGGCCAAACAGGAGAAG
AAGAAGAAGAAGACAGGTCGGGCTAAGCGGCGGATGCAGTACAACCGGCGCTTTGTCAACGTTGTGCCCA
CCTTTGGCAAGAAGAAGGGCCCCAATGCCAACTCTTAAGTCTTTTGTAATTCTGGCTTTCTCTAATAAAA
AAGCCACTTAGTTCAGTCAAAAAAAAAA

In this example, the header (also known as description line) is formatted as:

>ID description

Many variants of the FASTA format exist, but they differ only in the way the header is written; all start with the ">" sign. We cite a few variants below (for simplicity, we show only 2 lines per sequence).

The NCBI style defines the identifier with database name, entry ID and optional accession or sequence version number separated by pipes:

>embl|X65923|X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA

The NCBI FASTA databases are listed at https://tinyurl.com/y6wrzyad

The GI style is the same as NCBI style except that the sequence GI code is given instead of the entry ID:

>gi|31302|gnl|genbank|X65923 (X65923.1) H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA

There is also a CGC-style FASTA format (not to be confused with the GCG format). Its header includes an optional database name as part of the identifier by using the : sign:

>DATABASE_NAME:ID accession description

>embl:X65923 X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA

More generally, we have the FASTA with accession and description style. The accession number or sequence version is included after the identifier:

>X65923 X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
GCCTCACTGGAGGGCATTGCCCCGGAAGATCAAGTCGTGCTCCTGGCAGGCGCGCCCCTGGAGGATGAGG
CCACTCTGGGCCAGTGCGGGGTGGAGGCCCTGACTACCCTGGAAGTAGCAGGCCGCATGCTTGGAGGTAA
AGTTCATGGTTCCCTGGCCCGTGCTGGAAAAGTGAGAGGTCAGACTCCTAAGGTGGCCAAACAGGAGAAG
AAGAAGAAGAAGACAGGTCGGGCTAAGCGGCGGATGCAGTACAACCGGCGCTTTGTCAACGTTGTGCCCA
CCTTTGGCAAGAAGAAGGGCCCCAATGCCAACTCTTAAGTCTTTTGTAATTCTGGCTTTCTCTAATAAAA
AAGCCACTTAGTTCAGTCAAAAAAAAAA
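All the header styles above share the same outer layout: a ">" sign, an identifier (the first word) and an optional description. The following reader is a minimal sketch built on that layout only; read_fasta is a hypothetical name, and no attempt is made to decode NCBI or GI style identifiers.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, as some tools do
        if line.startswith(">"):
            if header is not None:
                yield (*header, "".join(chunks))
            # The first word is the identifier; the rest is the description.
            identifier, _, description = line[1:].partition(" ")
            header, chunks = (identifier, description), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield (*header, "".join(chunks))
```

Because the function accepts any iterable of lines, it works equally on an open file handle or on a list of strings.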

Note

The original FASTA format may include comments introduced by the ; sign. This is no longer supported by most programs.

Bioconvert conversions

FASTQ2FASTA, FASTA2FASTQ, FASTA2CLUSTAL, FASTA2NEXUS, FASTA2TWOBIT

See also

FastQ and QUAL

FastG

Format:

Status:

not included

Type:

assembly

FastG is a graph format used to faithfully represent genome assemblies in the face of allelic polymorphism and assembly uncertainty.

reference:

http://fastg.sourceforge.net/FASTG_Spec_v1.00.pdf

FastQ

Format:

human-readable

Status:

included

Type:

Sequence

FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (QUAL). A FASTQ file can contain several sequences. FASTQ variants differ only in the formatting of the quality scores. Currently, the recommended variant is the Sanger encoding, also used by Illumina 1.8. It encodes Phred quality scores from 0 to 93 using ASCII 33 to 126. This format is also referred to as PHRED+33, meaning there is an offset of 33 in the ASCII code. Other variants exist, such as FASTQ-solexa or earlier Illumina versions. Currently, the conversions included in Bioconvert do not need to be aware of the quality score encoding.

A FASTQ file uses four lines per sequence:

  1. a '@' character, followed by a sequence identifier and an optional description

  2. the raw sequence letters.

  3. a '+' character, optionally followed by the same sequence identifier (and any description)

  4. quality values for the sequence in Line 2

An example sequence in FASTQ format is:

@SEQUENCE_ID1
GTGGAAGTTCTTAGGGCATGGCAAAGAGT
+
FAFFADEDGDBGEGGBCGGHE>EEBA@@=
@SEQUENCE_ID2
GTGGAAGTTCTTAGG
+
FAFFADEDGDBGEGG
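The PHRED+33 offset and the four-line layout can be illustrated with the short sketch below. Both function names are hypothetical, and the reader assumes strict four-line records (no wrapped sequence or quality lines).

```python
def phred33_scores(quality):
    """Decode a PHRED+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality]


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        # Keep only the identifier, dropping any optional description.
        yield header[1:].split()[0], sequence, quality


print(phred33_scores("FA@="))  # [37, 32, 31, 28]
```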

Bioconvert conversions

FASTQ2FASTA, FASTA2FASTQ

See also

FASTA and QUAL

GENBANK

Format:

human-readable

Status:

included

Type:

annotation/sequence

GenBank format (GenBank Flat File Format) stores sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word LOCUS. The start of sequence section is marked by a line beginning with the word ORIGIN and the end of the section is marked by a line with only "//".

GenBank format for protein has been renamed GenPept.

An example sequence in GenBank format is:

LOCUS       AB000263                 368 bp    mRNA    linear   PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete
            cds.
ACCESSION   AB000263
ORIGIN
        1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
       61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
      121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
      181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
      241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
      301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
      361 gacctgaa
//
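The ORIGIN and "//" markers are enough to extract the sequence from a record, as the sketch below shows. It is not the GENBANK2FASTA implementation; genbank_sequence is a hypothetical name and only the first record of a file is read.

```python
def genbank_sequence(lines):
    """Return the sequence stored between the ORIGIN line and "//"."""
    parts, in_origin = [], False
    for line in lines:
        if line.startswith("//"):
            break
        if in_origin:
            # Drop the leading position number, keep the blocks of bases.
            parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
    return "".join(parts)
```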

Bioconvert conversions

GENBANK2FASTA, GENBANK2EMBL

GENPEPT

See GENBANK.

GFA

Format:

human-readable

Status:

included

Type:

assembly graph

The Graphical Fragment Assembly (GFA) can be used to represent genome assemblies. GFA stores sequence graphs as the product of an assembly, a representation of variation in genomes, splice graphs in genes, or even overlap between reads from long-read sequencing technology.

The GFA format is a tab-delimited text format for describing a set of sequences and their overlap. The first field of the line identifies the type of the line. Header lines start with H. Segment lines start with S. Link lines start with L. A containment line starts with C. A path line starts with P.

  • Segment: a continuous sequence or subsequence.

  • Link: an overlap between two segments. Each link is from the end of one segment to the beginning of another segment. The link stores the orientation of each segment and the number of overlapping base pairs.

  • Containment: an overlap between two segments where one is contained in the other.

  • Path: an ordered list of oriented segments, where each consecutive pair of oriented segments is supported by a link record.

See details in the reference above.

Example:

H VN:Z:1.0
S 11 ACCTT
S 12 TCAAGG
S 13 CTTGATT
L 11 + 12 - 4M
L 12 - 13 + 5M
L 11 + 13 + 3M
P 14 11+,12-,13+ 4M,5M

Note: segment lines sometimes carry an extra (fourth) field. Conversion to FASTA stores this fourth field after the name.
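A conversion of segment lines to FASTA can be sketched as follows. This is an illustration under assumptions, not the GFA2FASTA code: the function name is hypothetical, the input is assumed to be tab-delimited, and extra segment fields are joined to the name with a space.

```python
def gfa_to_fasta(lines):
    """Yield one FASTA entry for each segment (S) line of a GFA file."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S":
            continue  # headers, links, containments and paths are skipped
        name, sequence = fields[1], fields[2]
        # Any extra field is appended to the header after the name.
        header = " ".join([name] + fields[3:])
        yield f">{header}\n{sequence}\n"


print("".join(gfa_to_fasta(["H\tVN:Z:1.0", "S\t11\tACCTT", "L\t11\t+\t12\t-\t4M"])), end="")
```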

GFA2 is a generalization of GFA that allows one to specify an assembly graph in either less detail, e.g. just the topology of the graph, or more detail, e.g. the multi-alignment of reads giving rise to each sequence. It is further designed to be able to represent a string graph at any stage of assembly, from the graph of all overlaps, to a final resolved assembly of contig paths with multi-alignments. Apart from meeting these needs, the extensions also support other assembly and variation graph types.

Like GFA, GFA2 is tab-delimited in that every lexical token is separated from the next by a single tab.

Bioconvert conversions

GFA2FASTA

GTF

Format:

human-readable

Status:

included

Type:

Annotation

GTF2 (Gene Transfer Format version 2) is a file format used to represent genomic features and their locations in a genome. It is a tab-delimited text file that contains one line for each genomic feature, with each line consisting of nine fields separated by tabs.

The fields in a GTF2 file are as follows:

  • Seqid: The identifier of the genomic sequence.

  • Source: The source of the annotation.

  • Feature: The type of feature.

  • Start: The starting position of the feature.

  • End: The ending position of the feature.

  • Score: A score associated with the feature.

  • Strand: The strand on which the feature is located.

  • Phase: The phase of the feature, if applicable.

  • Attributes: A list of attributes associated with the feature, encoded as a semicolon-separated list of key-value pairs.

GFF

Format:

human-readable

Status:

included

Type:

Annotation

GFF is a standard file format for storing genomic features in a text file. GFF stands for Generic Feature Format. It is a 9-column tab-delimited file, each line of which corresponds to an annotation, or feature.

GFF v2 is deprecated and v3 should be used instead. In particular, GFF2 is unable to deal with the three-level hierarchy of gene -> transcript -> exon.

The first line is a comment (starting with #) followed by a series of data lines, each of which corresponds to an annotation. Here is an example:

##gff-version 3
ctg123  .  exon  1300  1500  .  +  .  ID=exon00001
ctg123  .  exon  1050  1500  .  +  .  ID=exon00002
ctg123  .  exon  3000  3902  .  +  .  ID=exon00003
ctg123  .  exon  5000  5500  .  +  .  ID=exon00004
ctg123  .  exon  7000  9000  .  +  .  ID=exon00005

The header is compulsory and the following lines must have 9 columns as follows:

  1. seqname - The name of the sequence (e.g. chromosome) on which the feature exists. Any string can be used. For example, chr1, III, contig1112.23. Any character not in [a-zA-Z0-9.:^*$@!+_?-|] must be escaped with the % character followed by its hexadecimal value.

  2. source - The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc. If there is no source, use the . character.

  3. feature - The feature type name. Equivalent to BED’s name field. For example, exon, etc. Should be a term from the lite sequence ontology (SOFA).

  4. start - The one-based starting position of feature on seqname. bedtools uses a one-based position and BED uses a zero-based start position.

  5. end - The one-based ending position of feature on seqname.

  6. score - A score assigned to the GFF feature.

  7. strand - Defines the strand. Use +, - or .

  8. frame/phase - The frame of the coding sequence. Use 0, 1, 2. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.

  9. attribute - A list of feature attributes in the format tag=value separated by semicolons. All non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as ‘\n’, tabs as ‘\t’). Tabs must be replaced with the %09 URL escape. There are predefined tags:

    • ID: unique identifier of the feature.

    • Name: name of the feature

    • Alias

    • Parent: can be used to group exons into transcripts, transcripts into genes and so on.

    • Target

    • Gap

    • Derives_from

    • Note

    • Dbxref

    • Ontology_term

    Multiple attributes of the same type are separated by commas. Tags are case sensitive: Parent is different from parent.
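The attribute column can be decoded with a few lines of Python. The sketch below is a hypothetical helper (parse_attributes is not a Bioconvert function) that splits on semicolons, then on commas, and finally decodes %XX escapes with the standard urllib.parse.unquote function.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse the GFF3 attribute column into {tag: [values]}."""
    attributes = {}
    for item in column9.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        tag, _, value = item.partition("=")
        # Multiple values are comma-separated; escapes are decoded last so
        # that an escaped comma or semicolon is not mistaken for a separator.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


print(parse_attributes("ID=exon00001;Parent=mRNA1,mRNA2"))
```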

Bioconvert conversions:

GZ

Format:

binary

Status:

included

Type:

Compression

gzip is a file compression program that is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.

Bioconvert conversions:

GZ2BZ2, GZ2DSRC, BZ22GZ, DSRC2GZ

JSON

Format:

human-readable

Status:

included

Type:

database

JSON stands for JavaScript Object Notation. Basic data types used in JSON:

  • Number: a signed decimal number that may contain a fractional part and may use exponential E notation, but cannot include non-numbers such as NaN. The format makes no distinction between integer and floating-point. JavaScript uses a double-precision floating-point format for all its numeric values, but other languages implementing JSON may encode numbers differently.

  • String: a sequence of zero or more Unicode characters. Strings are delimited with double-quotation marks and support a backslash escaping syntax.

  • Boolean: either of the values true or false

  • Array: an ordered list of zero or more values, each of which may be of any type. Arrays use square bracket notation and elements are comma-separated.

  • Object: an unordered collection of name–value pairs where the names (also called keys) are strings. Since objects are intended to represent associative arrays, it is recommended that each key is unique within an object. Objects are delimited with curly brackets and use commas to separate each pair, while within each pair the colon ':' character separates the key or name from its value.

  • null: An empty value, using the word null

Limited whitespace is allowed and ignored around or between syntactic elements (values and punctuation, but not within a string value). Only four specific characters are considered whitespace for this purpose: space, horizontal tab, line feed, and carriage return. In particular, the byte order mark must not be generated by a conforming implementation (though it may be accepted when parsing JSON). JSON does not provide syntax for comments.

Example:

{
"database": "AB",
"date": "13-10-2010",
"entries":
    [
      {
        "ID": 1,
        "coverage": 10
      },
      {
        "ID": 2,
        "coverage": 15
      }
    ]
}

Bioconvert conversions

JSON2YAML, YAML2JSON.

MAF (Mutation Annotation Format)

Format:

human-readable

Status:

not included

Type:

multiple alignment

MAF (Multiple Alignment Format)

Format:

human-readable

Status:

included

Type:

phylogeny

The Multiple Alignment Format stores a series of multiple alignments.

Warning

Not to be confused with MAF (Mutation Annotation Format)

Here are some rules about the MAF syntax:

  • It is line-oriented.

  • Each multiple alignment ends with a blank line.

  • Each sequence in an alignment is on a single line, which can get quite long, but there is no length limit.

  • Words in a line are delimited by any white space.

  • Lines starting with # are considered to be comments.

  • Lines starting with ## can be ignored by most programs, but contain meta-data of one form or another.

  • The file is divided into paragraphs that terminate in a blank line.

  • Within a paragraph, the first word of a line indicates its type.

Each multiple alignment is in a separate paragraph that begins with an a line and contains an s line for each sequence in the multiple alignment.

Some MAF files may contain other optional line types:

  • i line contains information about what is in the aligned species DNA before and after the immediately preceding s line

  • e line contains information about the size of the gap between the alignments that span the current block

  • q line indicates the quality of each aligned base for the species.

Here is an example of s lines (alignment block):

s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon         249182 13 +   4622798 gcagctgaaaaca
s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

The s and a lines define a multiple alignment. The columns of the s lines have the following fields:

  • src: The name of one of the source sequences for the alignment. The form 'database.chromosome' allows automatic creation of links to other assemblies in some browsers.

  • start: The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

  • size: The size of the aligning region in the source sequence. This number is equal to the number of non-dash characters in the alignment text field below.

  • strand: Either + or -. If -, then the alignment is to the reverse-complemented source.

  • srcSize: The size of the entire source sequence, not just the parts involved in the alignment.

  • text: The nucleotides (or amino acids) in the alignment and any insertions (dashes).
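The field list above can be sketched as a small Python parser. This is an illustration only, not Bioconvert's implementation; the names MafSequenceLine and parse_s_line are hypothetical.

```python
from typing import NamedTuple


class MafSequenceLine(NamedTuple):
    src: str        # source sequence name, e.g. "hg16.chr7"
    start: int      # zero-based start in the source sequence
    size: int       # number of non-dash characters in text
    strand: str     # "+" or "-"
    src_size: int   # length of the whole source sequence
    text: str       # aligned residues; dashes mark insertions


def parse_s_line(line: str) -> MafSequenceLine:
    """Split one MAF 's' line on whitespace and convert the numeric fields."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError(f"not a MAF 's' line: {line!r}")
    _, src, start, size, strand, src_size, text = fields
    record = MafSequenceLine(src, int(start), int(size), strand, int(src_size), text)
    # The size field must equal the number of non-dash characters in the text.
    if record.size != len(record.text.replace("-", "")):
        raise ValueError("size does not match the alignment text")
    return record


record = parse_s_line("s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca")
print(record.src, record.start, record.size)
```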

Lines starting with i give information about what's happening before and after this block in the aligning species:

s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
i panTro1.chr6 N 0 C 0
s baboon         249182 13 +   4622798 gcagctgaaaaca
i baboon       I 234 n 19

The i lines contain information about the context of the sequence lines immediately preceding them. The following fields are defined by position rather than name=value pairs:

  • src: The name of the source sequence for the alignment. Should be the same as the s line immediately above this line.

  • leftStatus: A character that specifies the relationship between the sequence in this block and the sequence that appears in the previous block.

  • leftCount: Usually the number of bases in the aligning species between the start of this alignment and the end of the previous one.

  • rightStatus: A character that specifies the relationship between the sequence in this block and the sequence that appears in the subsequent block.

  • rightCount: Usually the number of bases in the aligning species between the end of this alignment and the start of the next one.

The status characters can be one of the following values:

C: the sequence before or after is contiguous with this block.
I: there are bases between the bases in this block and the one before or
   after it.
N: this is the first sequence from this src chrom or scaffold.
n: this is the first sequence from this src chrom or scaffold but it is
   bridged by another alignment from a different chrom or scaffold.
M: there is missing data before or after this block (Ns in the sequence).
T: the sequence in this block has been used before in a previous block
   (likely a tandem duplication)

Lines starting with e give information about empty parts of the alignment block:

s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
e mm4.chr6     53310102 13 + 151104725 I

The e lines indicate that there isn't aligning DNA for a species but that the current block is bridged by a chain that connects blocks before and after this block. The following fields are defined by position rather than name=value pairs.

  • src: The name of one of the source sequences for the alignment.

  • start: The start of the non-aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

  • size: The size in base pairs of the non-aligning region in the source sequence.

  • strand: Either + or -. If -, then the alignment is to the reverse-complemented source.

  • srcSize: The size of the entire source sequence, not just the parts involved in the alignment.

  • status: A character that specifies the relationship between the non-aligning sequence in this block and the sequence that appears in the previous and subsequent blocks.

The status character can be one of the following values:

C: the sequence before and after is contiguous implying that this region
   was either deleted in the source or inserted in the reference sequence.
   The browser draws a single line or a "-" in base mode in these blocks.
I: there are non-aligning bases in the source species between chained alignment
   blocks before and after this block. The browser shows a double line
   or "=" in base mode.
M: there are non-aligning bases in the source and more than 90% of them are Ns in
   the source. The browser shows a pale yellow bar.
n: there are non-aligning bases in the source and the next aligning block starts
   in a new chromosome or scaffold that is bridged by a chain between still
   other blocks. The browser shows either a single line or a double line based
   on how many bases are in the gap between the bridging alignments.

Lines starting with q give information about the quality of each aligned base for the species:

s hg18.chr1                  32741 26 + 247249719 TTTTTGAAAAACAAACAACAAGTTGG
s panTro2.chrUn            9697231 26 +  58616431 TTTTTGAAAAACAAACAACAAGTTGG
q panTro2.chrUn                                   99999999999999999999999999
s dasNov1.scaffold_179265     1474  7 +      4584 TT----------AAGCA---------
q dasNov1.scaffold_179265                         99----------32239---------

The q lines contain a compressed version of the actual raw quality data, representing the quality of each aligned base for the species with a single character of 0-9 or F. The following fields are defined by position rather than name=value pairs:

  • src: The name of the source sequence for the alignment. Should be the same as the "s" line immediately preceding this line.

  • value: A MAF quality value corresponding to the aligning nucleotide in the preceding "s" line. Insertions (dashes) in the preceding "s" line are represented by dashes in the "q" line as well. The quality value can be "F" (finished sequence) or a number derived from the actual quality scores (which range from 0 to 97) or the manually assigned score of 98. These numeric values are calculated as:

    MAF quality value = min( floor(actual quality value/5), 9 )
    

This results in the following mapping:

MAF quality value     Raw quality score range     Quality level
0-8     0-44     Low
9     45-97     High
0     98     Manually assigned
F     99     Finished
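The mapping in the table above can be written as a short function. This is a sketch of the formula only; the function name maf_quality_char is hypothetical and the special cases for 98 and 99 simply follow the table.

```python
def maf_quality_char(raw_quality: int) -> str:
    """Map a raw quality score (0-99) to the single-character MAF quality value."""
    if raw_quality == 99:
        return "F"  # finished sequence
    if raw_quality == 98:
        return "0"  # manually assigned score, as listed in the table
    if not 0 <= raw_quality <= 97:
        raise ValueError("raw quality must be in the range 0-99")
    # min(floor(actual quality value / 5), 9)
    return str(min(raw_quality // 5, 9))


for raw in (0, 44, 45, 97, 98, 99):
    print(raw, maf_quality_char(raw))
```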

A Simple Example (three alignment blocks derived from five starting sequences). Repeats are shown as lowercase, and each block may have a subset of the input sequences. All sequence columns and rows must contain at least one nucleotide (no columns or rows that contain only insertions):

##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))

a score=23262.0
s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

a score=5062.0
s hg18.chr7    27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon         241163 6 +   4622798 TAAAGA
s mm4.chr6     53303881 6 + 151104725 TAAAGA
s rn3.chr4     81444246 6 + 187371129 taagga

a score=6636.0
s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon         249182 13 +   4622798 gcagctgaaaaca
s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

MAP

Format:

human-readable

Status:

included

Type:

Genotypic

PLINK is a very widely used application for analyzing genotypic data.

The fields in a MAP file are:

  • Chromosome

  • Marker ID

  • Genetic distance

  • Physical position

Example of a MAP file of the standard PLINK format:

21     rs11511647   0          26765
X      rs3883674    0           32380
X      rs12218882   0           48172
9      rs10904045   0           48426
9      rs10751931   0           49949
8      rs11252127   0           52087
10     rs12775203   0           52277
8      rs12255619   0           52481

NEWICK

Format:

human-readable

Status:

included

Type:

phylogeny

Newick format is typically used for tools like PHYLIP and is a minimal definition for a phylogenetic tree. It is a way of representing graph-theoretical trees with edge lengths using parentheses and commas.

(,,(,));                              no nodes are named
(A,B,(C,D));                          leaf nodes are named
(A,B,(C,D)E)F;                        all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5);          all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;      all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);      distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;    distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)A:0.1)F;   a tree rooted on a leaf node (rare)
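To illustrate how the parentheses and commas encode the tree, the sketch below lists the leaf names of a Newick string with a regular expression. It is not a full parser: it assumes unquoted names and relies on the fact that a leaf label directly follows "(" or ",", whereas an internal node label follows ")". The name leaf_names is hypothetical.

```python
import re
from typing import List

# A leaf label starts right after "(" or "," and runs until the next
# structural character; labels that follow ")" belong to internal nodes.
LEAF_PATTERN = re.compile(r"(?<=[(,])[^(),:;]+")


def leaf_names(newick: str) -> List[str]:
    """Return the leaf labels of an unquoted Newick string, in order."""
    return [name.strip() for name in LEAF_PATTERN.findall(newick)]


print(leaf_names("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"))
```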

Bioconvert conversions

NEWICK2NEXUS, NEWICK2PHYLOXML

NEXUS

Format:

human-readable

Status:

included

Type:

phylogeny

The NEXUS format, also known as PAUP format, is used to store multiple alignments or phylogenetic trees.

After a header to indicate the format (#NEXUS ), blocks are stored and start with Begin NAME; and end with END;

Example of a DNA alignment:

#NEXUS
Begin data;
Dimensions ntax=4 nchar=15;
Format datatype=dna missing=? gap=-;
Matrix
Species1   atgctagctagctcg
Species2   atgcta??tag-tag
Species3   atgttagctag-tgg
Species4   atgttagctag-tag
;
End;

It can be used to store phylogenetic trees using the TREES block:

#NEXUS
BEGIN TAXA;
  TAXLABELS A B C;
END;

BEGIN TREES;
  TREE tree1 = ((A,B),C);
END;

Bioconvert conversions

NEXUS2CLUSTAL, NEXUS2NEWICK, NEXUS2PHYLIP, NEXUS2PHYLOXML

ODS

Format:

human-readable

Status:

included

Type:

Sequence

ODS stands for OpenDocument Spreadsheet (.ods) file format. It should be equivalent to the XLS format.

Bioconvert conversions

ODS2CSV

PAF (Pairwise mApping Format)

Format:

human-readable

Status:

included

Type:

mapping

PAF is a text format describing the approximate mapping positions between two sets of sequences. PAF is used for instance by the miniasm tool (see reference above), an ultrafast de novo assembler for long noisy reads. PAF is TAB-delimited with each line consisting of the following predefined fields:

Col  Type    Description
1    string  Query sequence name
2    int     Query sequence length
3    int     Query start (0-based)
4    int     Query end (0-based)
5    char    Relative strand: "+" or "-"
6    string  Target sequence name
7    int     Target sequence length
8    int     Target start on original strand (0-based)
9    int     Target end on original strand (0-based)
10   int     Number of residue matches
11   int     Alignment block length
12   int     Mapping quality (0-255; 255 for missing)

If PAF is generated from an alignment, column 10 equals the number of sequence matches, and column 11 equals the total number of sequence matches, mismatches, insertions and deletions in the alignment. If alignment is not available, column 10 and 11 are still required but can be approximate.

A PAF file may optionally contain SAM-like typed key-value pairs at the end of each line.
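As a rough sketch, a PAF line can be split into the twelve predefined fields plus the optional key-value pairs. The names PafRecord and parse_paf_line are hypothetical, the input line is made up for illustration, and the ratio of column 10 to column 11 is shown only as one possible way to use those two columns.

```python
from typing import Dict, NamedTuple, Union


class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int          # column 10
    block_length: int     # column 11
    mapping_quality: int  # column 12
    tags: Dict[str, Union[int, float, str]]


def parse_paf_line(line: str) -> PafRecord:
    """Parse one TAB-delimited PAF line, including SAM-like optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 TAB-separated fields")
    tags: Dict[str, Union[int, float, str]] = {}
    for item in fields[12:]:
        tag, type_code, value = item.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # other types are kept as plain strings
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )


# Hypothetical record, not taken from a real dataset.
line = "read1\t1000\t10\t910\t+\tcontig1\t5000\t100\t1000\t850\t900\t60\tNM:i:50"
record = parse_paf_line(line)
ratio = record.matches / record.block_length
print(f"{record.query_name} -> {record.target_name}: {ratio:.3f}")
```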

Bioconvert conversion

SAM2PAF

PDB

Todo

coming soon

PED

Format:

human-readable

Status:

included

Type:

Genotypic

PLINK is a very widely used application for analyzing genotypic data.

The fields in a PED file are:

  • Family ID

  • Sample ID

  • Paternal ID

  • Maternal ID

  • Sex (1=male; 2=female; other=unknown)

  • Affection (0=unknown; 1=unaffected; 2=affected)

  • Genotypes (space or tab separated, 2 for each marker. 0=missing)

Example of a PED file of the standard PLINK format:

FAM1    NA06985 0   0   1   1   A   T   T   T   G   G   C   C   A   T   T   T   G   G   C   C
FAM1    NA06991 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
0       NA06993 0   0   1   1   C   T   T   T   G   G   C   T   C   T   T   T   G   G   C   T
0       NA06994 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
0       NA07000 0   0   2   1   C   T   T   T   G   G   C   T   C   T   T   T   G   G   C   T
0       NA07019 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
0       NA07022 0   0   2   1   C   T   T   T   G   G   0   0   C   T   T   T   G   G   0   0
0       NA07029 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
FAM2    NA07056 0   0   0   2   C   T   T   T   A   G   C   T   C   T   T   T   A   G   C   T
FAM2    NA07345 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
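A PED line can be split into its six leading fields and a list of genotype pairs, as in the sketch below. The function name parse_ped_line and the dictionary keys are illustrative assumptions, not PLINK or Bioconvert names.

```python
def parse_ped_line(line: str) -> dict:
    """Split a PED line into the six leading fields and genotype pairs."""
    fields = line.split()  # accepts space- or tab-separated input
    if len(fields) < 6:
        raise ValueError("a PED line needs at least six fields")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("each marker needs exactly two alleles")
    return {
        "family_id": fields[0],
        "sample_id": fields[1],
        "paternal_id": fields[2],
        "maternal_id": fields[3],
        "sex": fields[4],
        "affection": fields[5],
        # Pair consecutive alleles: (allele 1, allele 2) for each marker.
        "genotypes": list(zip(alleles[0::2], alleles[1::2])),
    }


sample = parse_ped_line(
    "0       NA07022 0   0   2   1   C   T   T   T   G   G   0   0"
    "   C   T   T   T   G   G   0   0"
)
# Markers whose genotype is missing (both alleles coded as 0).
missing = [i for i, pair in enumerate(sample["genotypes"]) if pair == ("0", "0")]
print(sample["sample_id"], missing)
```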

PHYLOXML

Format:

human-readable

Status:

included

Type:

phylogeny

PhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees.

A shortcoming of formats such as Nexus and Newick is a lack of a standardized means to annotate tree nodes and branches with distinct data fields (species names, branch lengths, multiple support values). A well defined XML format addresses these problems in a general and extensible manner and allows for interoperability between specialized and general purpose software.

Here is an example (source https://en.wikipedia.org/wiki/PhyloXML)

<phyloxml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd"
xmlns="http://www.phyloxml.org">
<phylogeny rooted="true">
  <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
  <description>MrBayes based on MAFFT alignment</description>
  <clade>
     <clade branch_length="0.06">
        <confidence type="probability">0.88</confidence>
        <clade branch_length="0.102">
           <name>A</name>
        </clade>
        <clade branch_length="0.23">
           <name>B</name>
        </clade>
      </clade>
      <clade branch_length="0.5">
        <name>C</name>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>

Bioconvert conversions

PHYLOXML2NEXUS, PHYLOXML2NEWICK

PHYLIP

Format:

human-readable

Status:

included

Type:

phylogeny / alignment

The PHYLIP format stores a multiple sequence alignment.

It is a plain text format with a header describing the dimensions of the alignment followed by the multiple sequence alignment. Each sequence name is exactly 10 characters long (padded with spaces if needed).

PHYLIP does not support blank lines between header and the alignment.

In the header, the first integer defines the number of sequences. The second integer defines the length of the alignment (number of columns). There are several spaces between the two integers.

Here is an example:

   5   50
Seq0000  GATTAATTTG CCGTAGGCCA GAATCTGAAG ATCGAACACT TTAAGTTTTC
Seq0001  ACTTCTAATG GAGAGGACTA GTTCATACTT TTTAAACACT TTTACATCGA
Seq0002  TGTCGGACCT AAGTATTGAG TACAACGGTG TATTCCAGCG GTGGAGAGGT
Seq0003  CTATTTTTCC GGTTGAAGGA CTCTAGAGCT GTAAAGGGTA TGGCCATGTG
Seq0004  CTAAGCGCGG GCGGATTGCT GTTGGAGCAA GGTTAAATAC TCGGCAATGC
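The header and the sequence lines can be checked against each other with a few lines of Python. This sketch is a relaxed reader for sequential (non-interleaved) files only: it splits each line on whitespace, which suits the example above, instead of relying on a fixed-width name field. The function name parse_phylip and the tiny input are hypothetical.

```python
from typing import Dict


def parse_phylip(text: str) -> Dict[str, str]:
    """Read a sequential PHYLIP alignment and validate it against its header."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_sequences, n_columns = (int(value) for value in lines[0].split())
    alignment: Dict[str, str] = {}
    for line in lines[1:1 + n_sequences]:
        name, *chunks = line.split()
        # Sequence data may be written in space-separated blocks.
        alignment[name] = "".join(chunks)
    if len(alignment) != n_sequences:
        raise ValueError("number of sequences does not match the header")
    if any(len(seq) != n_columns for seq in alignment.values()):
        raise ValueError("sequence length does not match the header")
    return alignment


# Hypothetical two-sequence alignment with 10 columns.
example = "   2   10\nSeqA      ACGTA CGTAC\nSeqB      TTGCA TTGCA\n"
alignment = parse_phylip(example)
print(alignment)
```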

QUAL

Format:

human-readable

Status:

included

Type:

Sequence

QUAL files include qualities of each nucleotide in FASTA format.

Bioconvert conversions

See also

FASTA and FastQ

SAM

Format:

human readable

Status:

included

Type:

alignment

In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line has 11 mandatory fields in the same order. Their values can be 0 or * if the field is unavailable. Here is an overview of those fields:

Col  Field  Type    Regexp/Range             Brief description
1    QNAME  String  [!-?A-~]{1,254}          Query template NAME
2    FLAG   Int     [0,2^16-1]               bitwise FLAG
3    RNAME  String  *|[!-()+-<>-~][!-~]*     Reference sequence NAME
4    POS    Int     [0,2^31-1]               1-based leftmost mapping POSition
5    MAPQ   Int     [0,2^8-1]                MAPping Quality
6    CIGAR  String  *|([0-9]+[MIDNSHPX=])+   CIGAR string
7    RNEXT  String  *|=|[!-()+-<>-~][!-~]*   Ref. name of the mate/next read
8    PNEXT  Int     [0,2^31-1]               Position of the mate/next read
9    TLEN   Int     [-2^31+1,2^31-1]         observed Template LENgth
10   SEQ    String  *|[A-Za-z=.]+            segment SEQuence
11   QUAL   String  [!-~]+                   ASCII of Phred-scaled base QUALity+33

All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string that matches /[A-Za-z][A-Za-z0-9]/ . Each TAG can only appear once in one alignment line.

The tag NM:i:2 means: Edit distance to the reference (number of changes necessary to make this equal to the reference, excluding clipping).
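The TAG:TYPE:VALUE layout can be decoded with a short helper, sketched below. The type codes handled here ("i" for integer, "f" for float, anything else kept as a string) come from the SAM specification rather than from this page, and the function name parse_optional_fields is hypothetical.

```python
import re
from typing import Dict, Iterable, Union

# A tag is two characters: a letter followed by a letter or a digit.
TAG_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]")


def parse_optional_fields(fields: Iterable[str]) -> Dict[str, Union[int, float, str]]:
    """Decode SAM optional fields such as 'NM:i:2' into a dictionary."""
    tags: Dict[str, Union[int, float, str]] = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        if not TAG_PATTERN.fullmatch(tag):
            raise ValueError(f"invalid tag name: {tag!r}")
        if tag in tags:
            raise ValueError(f"tag {tag!r} appears more than once")
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # e.g. type Z (string) is kept as text
    return tags


print(parse_optional_fields(["NM:i:2", "MD:Z:10A5"]))
```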

The optional fields are tool-dependent. For instance, with the BWA mapper, we can get these tags:

Tag  Meaning
NM   Edit distance
MD   Mismatching positions/bases
AS   Alignment score
BC   Barcode sequence
X0   Number of best hits
X1   Number of suboptimal hits found by BWA
XN   Number of ambiguous bases in the reference
XM   Number of mismatches in the alignment
XO   Number of gap opens
XG   Number of gap extensions
XT   Type: Unique/Repeat/N/Mate-sw
XA   Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS   Suboptimal alignment score
XF   Support from forward/reverse alignment
XE   Number of supporting seeds

Bioconvert conversions

BAM2SAM, SAM2BAM

SCF

Format:

binary

Status:

included

Type:

alignment

Trace File Format - Sequence Chromatogram Format (SCF) is a binary file containing raw data output from automated sequencing instruments.

This converter was translated from BioPerl.

SCF file organisation (more or less)

Length in bytes                   Data
128                               header
Number of samples * sample size   Samples for A trace
Number of samples * sample size   Samples for C trace
Number of samples * sample size   Samples for G trace
Number of samples * sample size   Samples for T trace
Number of bases * 4               Offset into peak index for each base
Number of bases                   Accuracy estimate bases being 'A'
Number of bases                   Accuracy estimate bases being 'C'
Number of bases                   Accuracy estimate bases being 'G'
Number of bases                   Accuracy estimate bases being 'T'
Number of bases                   The called bases
Number of bases * 3               Reserved for future use
Comments size                     Comments
Private data size                 Private data

Bioconvert conversions

SCF2FASTQ, SCF2FASTA.

SRA

The Sequence Read Archive (SRA) makes biological sequence data available to the research community. It stores raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT.

It is not a format per se but is included in Bioconvert by allowing the retrieval of sequencing data given a SRA identifier:

bioconvert sra2fastq <SRA_ID>

This will retrieve the fastq reads (single read or paired end data).

Bioconvert conversions

SRA2FASTQ

TSV

Format:

human readable

Type:

database

Status:

included

A tab-separated values format is a delimited text file that uses a tab character to separate values. See CSV format page for details.

Bioconvert conversions:

TSV2CSV

STOCKHOLM

Format:

human readable

Status:

included

Type:

multiple sequence alignment

Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to store protein and RNA sequence alignments.

Here is a simple example:

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

A minimal well-formed Stockholm file should contain a header which states the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and corresponding unique sequence names:

<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>

Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space:

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>
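A minimal reader for the rules above might look like the sketch below. It only collects the aligned sequences and the #=GF annotations, skips the other mark-up lines, and stops at the // terminator; the function name parse_stockholm is hypothetical and the input is a shortened version of the example above.

```python
from typing import Dict, List, Tuple


def parse_stockholm(text: str) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (sequences, per-file annotations) from a Stockholm 1.0 string."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "# STOCKHOLM 1.0":
        raise ValueError("missing '# STOCKHOLM 1.0' header")
    sequences: Dict[str, str] = {}
    file_annotations: Dict[str, List[str]] = {}
    for raw_line in lines[1:]:
        line = raw_line.strip()
        if not line:
            continue
        if line == "//":
            break  # end of the alignment
        if line.startswith("#=GF"):
            _, feature, value = line.split(None, 2)
            file_annotations.setdefault(feature, []).append(value)
        elif line.startswith("#"):
            continue  # #=GC, #=GS, #=GR and comments are skipped in this sketch
        else:
            name, aligned = line.split()
            sequences[name] = sequences.get(name, "") + aligned
    return sequences, file_annotations


example = """# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
"""
sequences, annotations = parse_stockholm(example)
print(list(sequences), annotations["ID"])
```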

Bioconvert conversions:

STOCKHOLM2CLUSTAL, STOCKHOLM2PHYLIP

VCF

Format:

human readable

Status:

included

Type:

variant

Variant Call Format (VCF) is a flexible and extendable format for storing variation in sequences such as single nucleotide variants, insertions/deletions, copy number variants and structural variants.

Bioconvert conversions:

WIG

See WIGGLE (WIG).

WIGGLE (WIG)

Format:

human readable

Status:

included

Type:

database-style

The wiggle (WIG) format is a format used for display of dense, continuous data such as GC percent. Wiggle data elements must be equally sized.

The related bedGraph format is used to display sparse data or data that contains elements of varying size.

For speed and efficiency, wiggle data is usually stored in BIGWIG format.

Wiggle format is line-oriented. It is composed of declaration lines and data lines. There are two options: variableStep and fixedStep.

The variableStep format is used for data with irregular intervals between new data points, and is the more commonly used wiggle format. The variableStep section begins with a declaration line and is followed by two columns containing chromosome positions and data values:

variableStep  chrom=chrN  [span=windowSize]
  chromStartA  dataValueA
  chromStartB  dataValueB
  ... etc ...  ... etc ...

The declaration line starts with the word variableStep and is followed by a specification for a chromosome. The optional span parameter (default: span=1) allows data composed of contiguous runs of bases with the same data value to be specified more succinctly. The span begins at each chromosome position specified and indicates the number of bases that data value should cover. For example, this variableStep specification:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

The variableStep format becomes very inefficient when there are only a few data points per 1024 bases. If variableStep data points (i.e., chromStarts) are greater than about 100 bases apart, it is advisable to use BedGraph format.

The fixedStep format is used for data with regular intervals between new data values and is the more compact wiggle format. The fixedStep begins with a declaration line and is followed by a single column of data values:

fixedStep  chrom=chrN  start=position  step=stepInterval  [span=windowSize]
  dataValue1
  dataValue2
  ... etc ...

The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:

fixedStep chrom=chr3 start=400601 step=100
11
22
33

displays the values 11, 22, and 33 as single-base regions on chromosome 3 at positions 400601, 400701, and 400801, respectively. Adding span=5 to the declaration line:

fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

causes the values 11, 22, and 33 to be displayed as 5-base regions on chromosome 3 at positions 400601-400605, 400701-400705, and 400801-400805, respectively.
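The effect of start, step and span can be reproduced with a small helper that turns a fixedStep section into explicit regions. This is a sketch for a single section using the 1-start, fully-closed coordinates of wiggle files; the function name expand_fixed_step is hypothetical.

```python
from typing import List, Tuple


def expand_fixed_step(lines: List[str]) -> List[Tuple[str, int, int, float]]:
    """Return (chrom, first_base, last_base, value) for each data line."""
    declaration, *values = [line.strip() for line in lines if line.strip()]
    words = declaration.split()
    if words[0] != "fixedStep":
        raise ValueError("expected a fixedStep declaration line")
    params = dict(word.split("=", 1) for word in words[1:])
    chrom = params["chrom"]
    start = int(params["start"])
    step = int(params.get("step", 1))  # default step is 1
    span = int(params.get("span", 1))  # default span is 1
    regions = []
    for index, value in enumerate(values):
        first = start + index * step
        regions.append((chrom, first, first + span - 1, float(value)))
    return regions


section = ["fixedStep chrom=chr3 start=400601 step=100 span=5", "11", "22", "33"]
for region in expand_fixed_step(section):
    print(region)
```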

Note that for both variableStep and fixedStep formats, the same span must be used throughout the dataset. If no span is specified, the default span of 1 is used. As the name suggests, fixedStep wiggles require the same size step throughout the dataset. If not specified, a step size of 1 is used.

Data values can be integer or real, positive or negative values. Positions specified in the input data must be in numerical order.

Warning

BigWig files created from bedGraph format use "0-start, half-open" coordinates, but bigWigs that represent variableStep and fixedStep data are generated from wiggle files that use 1-start, fully-closed coordinates. For example, for a chromosome of length N, the first position is 1 and the last position is N. For more information, see:

Bioconvert conversions

WIG2BED

XLS

Format:

human readable

Type:

database

Status:

included

Spreadsheet file format (Microsoft Excel file format).

Until 2007, Microsoft Excel used a proprietary binary file format called Excel Binary File Format (.XLS). In Excel 2007, the Office Open XML format was introduced. We support the latter format only.

With bioconvert you can convert an XLS file into CSV or TSV format. If several sheets are to be found, you can select one or the other.

Bioconvert conversions:

XLS2CSV, XLSX2CSV

XLSX

Type:

database

Status:

included

Spreadsheet file format in Office Open XML format.

With bioconvert you can convert an XLSX file into CSV or TSV format. If several sheets are to be found, you can select one or the other.

Bioconvert conversions:

XLS2CSV, XLSX2CSV

See also

XLS format.

XMFA

Format:

human-readable

Status:

included

Type:

alignment

XMFA stands for eXtended Multi-FastA file format. The .alignment file contains the complete genome alignment. This standard file format is also used by other genome alignment systems that align sequences with rearrangements.

The XMFA file format supports the storage of several collinear sub-alignments, each separated with an = sign, that constitute a single genome alignment. Each sub-alignment consists of one FastA format sequence entry per genome where the entry’s defline gives the strand (orientation) and location in the genome of the sequence in the alignment.

Example (from darlinglab.org/mauve ):

>seq_num:start1-end1 ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...

> seq_num:startN-endN ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...
= comments, and optional field-value pairs, i.e. score=12345

> seq_num:start1-end1 ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...

> seq_num:startN-endN ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...
= comments, and optional field-value pairs, i.e. score=12345

Bioconvert conversions

XMFA2PHYLIP

YAML

Format:

human-readable

Status:

included

Type:

database

YAML ("YAML Ain't Markup Language") is a human-readable data-serialization language. It is commonly used for configuration files, but could be used in many applications where data is being stored.

The full syntax cannot be described here. The full specification is available at the official site (https://yaml.org/refcard.html).

In brief:

  • Whitespace indentation is used to denote structure. Tab characters are not allowed for indentation.

  • Comments begin with the number sign # and can start anywhere on a line.

  • Lists are denoted by the - character with one member per line, or enclosed in square brackets [ ].

  • Associative arrays are represented with a colon followed by a space, in the form key: value.

  • Strings can be unquoted or quoted.

Example:

# example of a yaml file
- {name: Jean, age: 33}
- name: Marie
  age: 32
---
# a second document in the same file: a mapping of lists
men:
    - Pierre
    - Jean
women:
    - Marie

Bioconvert conversions

JSON2YAML, YAML2JSON.

Others

ACE

Human-readable file format used by the AceDB database, which is a genome database designed for the handling of bioinformatics data. The data looks like:

DNA : "HSFAU"
ttccttccagctactgttccttccagc
tactg

This format is obsolete and will not be included in Bioconvert for now. Biopython seems to handle this format.

ASN1

ASN.1 (Abstract Syntax Notation One) is an International Organization for Standardization (ISO) data representation format used to achieve interoperability. It is a formal notation used for describing data transmitted by telecommunications protocols, regardless of language implementation and physical representation of these data, whatever the application, whether complex or very simple. NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, PubMed records, and more.

GCG

Format:

human-readable

Status:

not included

Type:

sequence

GCG format contains exactly one sequence. It begins with annotation lines and the start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package.

An example sequence in GCG format is:

ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
AB000263  Length: 368  Check: 4514  ..
       1  acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
      61  ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     121  caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     181  aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     241  gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     301  agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     361  gacctgaa

GVF

Format:

human-readable

Status:

not included

Type:

variant

The Genome Variation Format (GVF) is a very simple file format for describing sequence_alteration features at nucleotide resolution relative to a reference genome.

Example:

##gvf-version 1.10
##genome-build NCBI B36.3
##sequence-region chr16 1 88827254

chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;
chr16 samtools SNV 49291360 49291360 . + . ID=ID_2;Variant_seq=G;Reference_seq=C;
chr16 samtools SNV 49302125 49302125 . + . ID=ID_3;Variant_seq=T,C;Reference_seq=C;
chr16 samtools SNV 49302365 49302365 . + . ID=ID_4;Variant_seq=G,C;Reference_seq=C;
chr16 samtools SNV 49302700 49302700 . + . ID=ID_5;Variant_seq=T;Reference_seq=C;
chr16 samtools SNV 49303084 49303084 . + . ID=ID_6;Variant_seq=G,T;Reference_seq=T;
chr16 samtools SNV 49303156 49303156 . + . ID=ID_7;Variant_seq=T,C;Reference_seq=C;
chr16 samtools SNV 49303427 49303427 . + . ID=ID_8;Variant_seq=T,C;Reference_seq=C;
chr16 samtools SNV 49303596 49303596 . + . ID=ID_9;Variant_seq=T,C;Reference_seq=C;

IG

The IntelliGenetics (IG) format is a sequence format. It can contain several sequences, each consisting of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself terminated with the termination character '1' for linear or '2' for circular sequences.

An example sequence in IG format is:

; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA1
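
The rules above (comment lines, a name line, then sequence lines closed by '1' or '2') can be sketched as a small reader. This is a minimal illustration under those stated rules only; the function name `parse_ig` is hypothetical and is not a Bioconvert API.

```python
def parse_ig(text):
    """Parse IntelliGenetics (IG) text into a list of records (hypothetical helper)."""
    records = []
    comments, name, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if name is None:
            # Before the name line, only ";" comment lines are expected.
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line
            continue
        if line[-1] in "12":
            # '1' closes a linear sequence, '2' a circular one.
            chunks.append(line[:-1])
            records.append({
                "name": name,
                "comments": comments,
                "sequence": "".join(chunks),
                "topology": "linear" if line[-1] == "1" else "circular",
            })
            comments, name, chunks = [], None, []
        else:
            chunks.append(line)
    return records


ig_text = """; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCC
TTTAATTACAGACCTGAA1
"""
for record in parse_ig(ig_text):
    print(record["name"], record["topology"], len(record["sequence"]))
```

A sequence that never reaches its termination character is silently dropped by this sketch; a stricter reader would raise an error instead.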

PIR

Format:

human-readable

Status:

not included

Type:

sequence

The PIR (Protein Information Resource) format may contain several sequences. A sequence in PIR format consists of one line starting with a ">" character followed by a 2-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), a semicolon, and the sequence identification code (the database ID-code). The next line contains a textual description of the sequence, and it is followed by one or more lines containing the sequence itself. The end of the sequence is marked by a "*" character.

The PIR format is also often referred to as the NBRF format.

Example:

>P1;CRAB_ANAPL
Example protein sequence. Note the final * character
MDITIHNPLI RRPLFSWLAP SRIFDQIFGE HLQESELLPA SPSLSPFLMR
SPIFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMVEIH
GKHEERQDEH GFIAREFNRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ
SDVPERSIPI TREEKPAIAG AQRK*
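
The record layout described above (header line, description line, sequence lines ending with "*") can be sketched as follows. This is an illustrative reader only; the function name `parse_pir` is hypothetical and is not provided by Bioconvert.

```python
def parse_pir(text):
    """Parse PIR/NBRF text into a list of records (hypothetical helper)."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # ignore anything outside a record
        # Header: ">" + 2-letter type code + ";" + identification code.
        code, _, identifier = line[1:].strip().partition(";")
        description = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            cleaned = "".join(seq_line.split())  # drop the column spacing
            if cleaned.endswith("*"):
                residues.append(cleaned[:-1])
                break
            residues.append(cleaned)
        records.append({
            "type": code,
            "id": identifier,
            "description": description,
            "sequence": "".join(residues),
        })
    return records


pir_text = """>P1;CRAB_ANAPL
Example protein sequence.
MDITIHNPLI RRPLFSWLAP
SDVPERSIPI TREEKPAIAG AQRK*
"""
for record in parse_pir(pir_text):
    print(record["type"], record["id"], len(record["sequence"]))
```

The inner loop shares the same iterator as the outer one, so after the "*" terminator the outer loop resumes at the next header line.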
  • imgt Unspecified (*.txt) This refers to the IMGT variant of the EMBL plain text file format.

  • phd PHD files are output from PHRED, used by PHRAP and CONSED for input.

  • seqxml Simple sequence XML file format.

  • sff Standard Flowgram Format (SFF): binary files produced by Roche 454 and IonTorrent/IonProton sequencing machines.

  • swiss Swiss-Prot aka UniProt format.

  • uniprot-xml UniProt XML format, successor to the plain text Swiss-Prot format.

  • pdb2gmx: This program reads a .pdb (or .gro) file, reads some database files, adds hydrogens to the molecules and generates coordinates in GROMACS (GROMOS), or optionally .pdb, format and a topology in GROMACS format. See http://manual.gromacs.org/archive/4.6.7/online/pdb2gmx.html for details. This tool is already quite complete, so no equivalent conversion will be provided for now.

  • rfam: https://en.wikipedia.org/wiki/Rfam

Glossary

Note that the formats mentioned below have a dedicated description in the Formats section.

ABI

File format produced by ABI sequencing machines. Contains the trace data which includes probabilities of the four nucleotides. See the ABI format page for details.

ASQG

The ASQG format describes an assembly graph. Each line is a tab-delimited record. The first field in each record describes the record type. See the ASQG page for details.

BAI

The index file associated with a file in the BAM format. (This is a non-standard file type.) See the BAI page for details.

BAM

Binary version of the Sequence Alignment Map (SAM) format. See the BAM format page for details.

BCF

Binary version of the Variant Call Format (VCF). See BCF page for details.

BCL

BCL is the raw format used by Illumina sequencers. See the BCL format page for details.

BED

BEDGRAPH/BED format is line-oriented and allows display of continuous-valued data. Similar to WIG format. See the BED format page for details.

BED3

Variant of the BED format with 3 columns storing the track name, start and end positions. See the BED format page for details.

BED4

Variant of the BED format with 4 columns storing the track name, start and end positions, and values. See the BED4 format page for details.

BEDGRAPH

BEDGRAPH/BED format is line-oriented and allows display of continuous-valued data. Similar to WIG format. See the BED format page for details.

BIGBED

An indexed binary version of a BED file. See BIGBED page for details.

BIGWIG

Indexed binary version of the Wiggle format. See BIGWIG page for details.

Binary version of the PLINK format used for analyzing genotypic data for Genome-wide Association Studies (GWAS). See PLINK binary files (BED/BIM/FAM) page for details.

BZ2

bzip2 is a file compression program that uses the Burrows–Wheeler algorithm. The extension is usually .bz2. See BZ2 page for details.

CLUSTAL

The alignment format of Clustal X and Clustal W. See CLUSTAL page for details.

COV

A Bioconvert format to store coverage in the form of a 3-column tab-delimited file. See COV page for details.

CRAM

A more compact alternative to BAM files for storing Sequence Alignment Map (SAM) data. See CRAM page for details.

CSV

A comma-separated values format is a delimited text file that uses a comma to separate values. See CSV format page for details.

DSRC

A compression tool dedicated to FastQ files. See DSRC page for details.

EMBL

EMBL Flat File Format. See EMBL page for details.

FAA

FASTA-formatted sequence files containing amino acid sequences. See FAA page for details.

FASTA

FASTA-formatted sequence files contain either nucleic acid sequence (such as DNA) or protein sequence information. FASTA files can also store multiple sequences in a single file. See FASTA page for details.

FASTQ

FASTQ-formatted sequence files are used to represent high-throughput sequencing data, where each read is described by a name, its sequence, and its qualities. See FastQ page for details.

GENBANK

GenBank Flat File Format. See GENBANK page for details.

GFA

Graphical Fragment Assembly format. https://github.com/GFA-spec/GFA-spec

GFF2

General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. See GTF page for details.

GFF3

General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. http://genome.ucsc.edu/FAQ/FAQformat#format3 See GTF page for details.

GZ

gzip is a file compression program based on the DEFLATE algorithm. See GZ page for details.

JSON

A human-readable data serialization language commonly used in configuration files. See JSON page for details.

MAF

A human-readable multiple alignment format. See MAF (Multiple Alignment Format) page for details.

NEWICK

Plain text minimal format used to store phylogenetic trees. See NEWICK page for details.

NEXUS

Plain text minimal format used to store multiple alignments and phylogenetic trees. See NEXUS page for details.

PAF

PAF is a text format describing the approximate mapping positions between two sets of sequences.

PHYLIP

Plain text format to store a multiple sequence alignment. See PHYLIP page for details.

PHYLOXML

XML format to store phylogenetic trees and their associated data. See PHYLOXML page for details.

Format used for analyzing genotypic data for Genome-wide Association Studies (GWAS). See PLINK flat files (MAP/PED) page for details.

QUAL

Sequence of qualities associated with a sequence of nucleotides. Combined with the matching FastA file, it allows the original FastQ file to be rebuilt. See QUAL page for details.

SAM

Sequence Alignment Map is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. See SAM page for details.

SCF

Standard Chromatogram Format, a binary chromatogram format described in the Staden package documentation (SCF file format).

SRA

The Sequence Read Archive (SRA) is a website that stores sequencing data at https://www.ncbi.nlm.nih.gov/sra. It is not a format per se. See SRA page for details.

STOCKHOLM

Stockholm format is a format used to store multiple sequence alignments. See STOCKHOLM page for details.

TSV

A tab-separated values format is a delimited text file that uses a tab character to separate values. See TSV format page for details.

TWOBIT

A 2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself. See TWOBIT format page for details.

VCF

Variant Call Format (VCF) is a flexible and extendable format for storing variation in sequences such as single nucleotide variants, insertions/deletions, copy number variants and structural variants. See VCF page for details.

WIG

Synonym for the wiggle (WIG) format. See WIGGLE.

WIGGLE

The wiggle (WIG) format stores dense, continuous data such as GC percent, probability scores, and transcriptome data. See WIG page for details.

XLS

Spreadsheet file format (Microsoft Excel file format). See XLS page for details.

XLSX

Spreadsheet file format defined in the Office Open XML specification. See XLSX page for details.

XMFA

TODO

YAML

A human-readable data serialization language commonly used in configuration files. See https://en.wikipedia.org/wiki/YAML and the YAML page for details.

FAQs

Installation

On Ubuntu, you need the libz-dev and python3-dev libraries, which are not necessarily installed by default:

sudo apt-get install libz-dev python3-dev

Libraries

Bibliography

[BEDTOOLS]

BEDTools: a flexible suite of utilities for comparing genomic features. Aaron R. Quinlan, Ira M. Hall 2010 Bioinformatics 26(6) https://doi.org/10.1093/bioinformatics/btq033

[BIOCONVERT]

BioConvert: a comprehensive format converter for life sciences https://bioconvert.readthedocs.io

[BIOPYTHON]

Biopython: freely available Python tools for computational molecular biology and bioinformatics. Cock et al 2009, Bioinformatics 25(11) https://doi.org/10.1093/bioinformatics/btp163

[DEEPTOOLS]

Ramírez, Fidel, Devon P. Ryan, Björn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S. Richter, Steffen Heyne, Friederike Dündar, and Thomas Manke. “deepTools2: a next generation web server for deep-sequencing data analysis.” Nucleic Acids Research (2016): gkw257.

[MOSDEPTH]

Mosdepth: quick coverage calculation for genomes and exomes Brent S Pedersen, Aaron R Quinlan 2018 Bioinformatics, 34(5) https://doi.org/10.1093/bioinformatics/btx699

[PANDAS]

Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)

[PYSAM]

The Sequence Alignment/Map format and SAMtools. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. Bioinformatics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8. PMID: 19505943

[SAMTOBAM]

Ogasawara T, Cheng Y, Tzeng T-HK (2016) Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools. PLoS ONE 11(11): e0167100. doi:10.1371/journal.pone.0167100

[SAMTOOLS]

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup, The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics (2009) 25(16) 2078-9 [19505943]

[FASTQDUMP]

NCBI SRA tools https://edwards.flinders.edu.au/fastq-dump/

[WIGGLETOOLS]

Zerbino DR, Johnson N, Juettemann T, Wilder SP and Flicek PR: WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 2014 30:1008-1009.