Current version: 1.1.1, Jul 18, 2023
Bioconvert¶
Bioconvert is a collaborative project to facilitate the interconversion of life science data from one format to another.

- contributions:
Want to add a converter? Please join https://github.com/bioconvert/bioconvert/issues/1
Overview¶
Life science uses many different formats. Some are old or have a complex syntax, and converting between them can be a challenge. Bioconvert aims to provide a common tool and interface to convert life science data from one format to another.
Many conversion tools already exist but they may be dispersed, focused on few specific formats, difficult to install, or not optimised. With Bioconvert, we plan to cover a wide spectrum of format conversions; we will re-use existing tools when possible and provide facilities to compare different conversion tools or methods via benchmarking. New implementations are provided when considered better than existing ones.
In Jan 2023, 50 formats and 100 direct conversions were available.

Installation¶
BioConvert is developed in Python. Please use conda or any Python environment manager to install BioConvert using the pip command:
pip install bioconvert
About 50% of the conversions should work out of the box. However, many conversions require external tools, which is why we recommend using a conda environment. In particular, most external tools are available on the bioconda channel. For instance, to convert a SAM file to a BAM file you would need to install samtools as follows:
conda install -c bioconda samtools
Since bioconvert is available on bioconda, one solution that installs BioConvert and all its dependencies is to use conda/mamba:
conda create --name bioconvert mamba
conda activate bioconvert
mamba install bioconvert
bioconvert --help
See the Installation section for more details and alternative solutions (docker, singularity).
Quick Start¶
There are many conversions available. Type:
bioconvert --help
to get a list of valid conversions. For example, to convert a FastQ file into a FastA file, you could use any of the following commands:
bioconvert fastq2fasta input.fastq output.fasta
bioconvert fastq2fasta input.fq output.fasta
bioconvert fastq2fasta input.fq.gz output.fasta.gz
bioconvert fastq2fasta input.fq.gz output.fasta.bz2
When there is no ambiguity, you can be implicit:
bioconvert input.fastq output.fasta
The default method of conversion is used, but you may use another one. Check out the available methods with:
bioconvert fastq2fasta --show-methods
For more help about a conversion, just type:
bioconvert fastq2fasta --help
and more generally:
bioconvert --help
You may also call BioConvert from a Python shell:
# import a converter
from bioconvert.fastq2fasta import FASTQ2FASTA
# input and output filenames (example values)
infile = "input.fastq"
outfile = "output.fasta"
# instantiate with infile/outfile names
convert = FASTQ2FASTA(infile, outfile)
# the conversion itself:
convert()
Available Converters¶
| Converters | CI testing | Default method |
|---|---|---|
| | | Unix commands |
| | | Pandas |
| | | DSRC software |
| | | pigz/pbzip2 software |
| | | DSRC software |
| | | Python |
| | | pyexcel library |
| | | Pandas library |
| | | Pandas library |
Contributors¶
Setting up and maintaining Bioconvert has been possible thanks to users and contributors. Thanks to all:
Changes¶
Version | Description |
---|---|
1.1.1 | |
1.1.0 | |
1.0.0 | |
0.6.3 | |
0.6.2 | |
0.6.1 | |
0.6.0 | |
0.5.2 | |
0.5.1 | |
0.5.0 | |
0.4.X | |
0.3.X | May 2019. New methods abi2qual, bigbed2bed, etc. Added --threads option |
0.2.X | Aug 2018. abi2fastx, bioconvert_stats tool added |
0.1.X | Major refactoring to have subcommands with implicit/explicit mode |
Complete documentation including User and Developer Guides¶
Installation¶
Bioconvert is developed in Python, so you can install it easily with pip. We recommend using a virtual environment so as not to interfere with your system. In any case, install BioConvert with:
pip install bioconvert
Note, however, that you will be able to use only about half of the conversions (pure Python). Others depend on third-party software.
One solution is to create a dedicated environment using conda. In particular, we use bioconda to install those dependencies.
conda / bioconda /mamba installation¶
One workable and relatively straightforward installation is based on conda/mamba:
conda create --name bioconvert python=3.8
conda activate bioconvert
conda install mamba
Then, use mamba to install the missing executable. Dependencies and BioConvert are available on the bioconda channel (see more about channels at the bottom of the page). For example for samtools:
mamba install samtools -c bioconda
Third-party executables can be installed with your own method. We recommend and provide solutions for conda. Indeed, BioConvert is available on the bioconda channel (see Conda channels section for details).
So, you could create a conda environment and install bioconvert directly with all dependencies. This is, however, pretty slow due to the large number of dependencies:
conda create --name bioconvert bioconvert
Instead, we recommend using an intermediate tool called mamba, which provides a more robust and faster installation:
conda create -c bioconda --name bioconvert mamba
conda activate bioconvert
mamba install bioconvert -c bioconda
In Jan 2023, this method worked out of the box and created an environment with Python 3.10 and bioconvert 0.6.2 with all its dependencies.
We also provide a frozen version of an environment (environment.yml) in the bioconvert GitHub repository. Note, however, that this file may change with time. The following commands create a conda environment called bioconvert:
wget https://raw.githubusercontent.com/bioconvert/bioconvert/main/environment.yml -O test.yml
conda env create -f test.yml
Docker¶
A Docker image (version 0.6.1 of BioConvert) is available on Docker Hub:
docker pull bioconvert/bioconvert:0.6.1
It can be used as follows (the -v option mounts your directory inside the container):
docker run -v /home/user:/home/user bioconvert/bioconvert:0.6.1 bioconvert /home/user/test_file.fastq /home/user/test_file.fasta
Since bioconvert is on bioconda, it is also available on quay.io. For instance, version 0.6.2 is reachable here:
docker pull quay.io/biocontainers/bioconvert:0.6.2--pyhdfd78af_0
Singularity/Apptainer¶
We provide Singularity/Apptainer images of BioConvert within the https://damona.readthedocs.io project.
Version 0.6.2 of BioConvert is available for download.
Using damona:
pip install damona
# create and activate an environment
damona env --create test_bioconvert
damona activate test_bioconvert
damona install bioconvert
bioconvert
You can also install the singularity image yourself by downloading it:
wget https://zenodo.org/record/7034822/files/bioconvert_0.6.1.img
singularity exec bioconvert_0.6.1.img bioconvert
# you can also create an alias
alias bioconvert="singularity exec bioconvert_0.6.1.img bioconvert"
Warning
You will need singularity, of course. If you have a conda environment, you are in luck: singularity can be installed there.
Conda channels¶
First, you will need to set up the bioconda channel if not already done:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
Warning
It is important to add them in this order, as mentioned on the bioconda webpage (https://bioconda.github.io/).
If you have already set the channels, please check that the order is correct with the following command:
conda config --get channels
You should see:
--add channels 'defaults'
--add channels 'bioconda'
--add channels 'conda-forge'   # highest priority
User Guide¶
Quick Start¶
Most of the time, Bioconvert simply requires the input and output filenames. If there is no ambiguity, the extensions are used to infer the type of conversion you wish to perform.
For instance, to convert a FASTQ to a FASTA file, use this type of command:
bioconvert test.fastq test.fasta
If the fastq to fasta converter exists in Bioconvert, it will work out of the box. In order to get a list of all possible conversions, just type:
bioconvert --help
To obtain more specific help about a converter that you found in the list:
bioconvert fastq2fasta --help
Note
All converters are named as <input_extension>2<output_extension>
Explicit conversion¶
Sometimes, Bioconvert won't be able to know what you want solely based on the input and output extensions. So, you may need to be explicit and use a subcommand. For instance, to use the converter fastq2fasta, type:
bioconvert fastq2fasta input.fastq output.fasta
The rationale behind the subcommand choice is twofold. First, you may have dedicated help for a given conversion, which may be different from one conversion to the other:
bioconvert fastq2fasta --help
Second, the extensions of your input and output may be non-standard or different from the choice made by the bioconvert developers. So, using the subcommand you can do:
bioconvert fastq2fasta input.fq output.fas
where the extensions can actually be whatever you want.
If you do not provide the output file, it will be created based on the input filename by replacing the extension automatically. So this command:
bioconvert fastq2fasta input.fq
generates an output file called input.fasta. Note that it will be placed in the same directory as the input file, not in the current working directory. So:
bioconvert fastq2fasta ~/test/input.fq
will create the input.fasta file in the ~/test directory.
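The output-name rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour, not Bioconvert's own code; the helper name default_output_path is hypothetical.

```python
from pathlib import Path


def default_output_path(infile: str, out_ext: str) -> Path:
    """Return the input path with its last extension replaced by out_ext.

    Hypothetical helper for illustration; not part of Bioconvert.
    """
    path = Path(infile).expanduser()
    # with_suffix swaps only the final suffix (".fq" -> ".fasta") and
    # keeps the parent directory, so the output lands next to the input.
    return path.with_suffix("." + out_ext)


print(default_output_path("~/test/input.fq", "fasta"))
```

The printed path depends on your home directory, but it always ends in test/input.fasta, next to the input file.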
If an output file exists, it will not be overwritten. If you want to do so, use the --force argument:
bioconvert fastq2fasta input.fq output.fa --force
Implicit conversion¶
If the extensions match the conversion name, you can perform implicit conversion:
bioconvert input.fastq output.fasta
Internally, a format may be registered with several extensions. For instance, the possible extensions for a FastA file are fasta and fa, so you can also write:
bioconvert input.fastq output.fa
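As a rough sketch of how implicit conversion could resolve a subcommand name from two filenames, consider the snippet below. The registry and function names are hypothetical, and only the fasta/fa pair comes from the text above; the fq alias for FASTQ is an assumption made for the example.

```python
# Hypothetical registry: format name -> accepted extensions.
# Only the fasta/fa pair is taken from the documentation; "fq" is assumed.
EXTENSIONS = {
    "fastq": ["fastq", "fq"],
    "fasta": ["fasta", "fa"],
}


def format_of(filename: str) -> str:
    """Map a filename to a format name using its last extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    for fmt, exts in EXTENSIONS.items():
        if ext in exts:
            return fmt
    raise ValueError(f"unknown extension: {ext!r}")


def infer_converter(infile: str, outfile: str) -> str:
    """Build the '<input>2<output>' subcommand name."""
    return f"{format_of(infile)}2{format_of(outfile)}"


print(infer_converter("input.fastq", "output.fa"))  # fastq2fasta
```

An unknown extension raises an error here, which mirrors the situation where you would fall back to an explicit subcommand.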
Compression¶
Input files may be compressed. For instance, most FASTQ files are compressed in GZ format. Compression is handled in some converters. Basically, most of the converters for human-readable formats handle compression. For instance, all of the following commands should work and can be used to compress output files or to handle compressed input files:
bioconvert test.fastq.gz test.fasta
bioconvert test.fastq.gz test.fasta.gz
bioconvert test.fastq.gz test.fasta.bz2
Note that you can also decompress a file and recompress it into another compression format without doing any conversion (note the fastq extension in both input and output files):
bioconvert test.fastq.gz test.fastq.dsrc
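To make the idea of transparent compression concrete, here is a minimal standard-library sketch that picks an opener from the filename suffix. It is not how Bioconvert implements compression; the function names are hypothetical, the FASTQ parsing assumes strict 4-line records, and only .gz and .bz2 are covered (no DSRC).

```python
import bz2
import gzip

# Suffix -> opener; anything else is treated as an uncompressed file.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_any(filename, mode):
    """Open a plain, .gz or .bz2 file in text mode based on its suffix."""
    for suffix, opener in OPENERS.items():
        if filename.endswith(suffix):
            return opener(filename, mode + "t")
    return open(filename, mode)


def fastq_to_fasta(infile, outfile):
    """Naive converter that assumes 4-line FASTQ records."""
    with open_any(infile, "r") as fin, open_any(outfile, "w") as fout:
        for i, line in enumerate(fin):
            if i % 4 == 0:      # header line, starts with '@'
                fout.write(">" + line[1:])
            elif i % 4 == 1:    # sequence line
                fout.write(line)
```

With these helpers, fastq_to_fasta("test.fastq.gz", "test.fasta.bz2") would read gzip input and write bzip2 output without any intermediate file.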
Parallelization¶
In Bioconvert, if the input contains a wildcard such as the * or ? characters, then the input filenames are treated separately and converted sequentially:
bioconvert fastq2fasta "*.fastq"
Note, however, that the files are processed sequentially one by one. So, we may want to parallelise the computation.
Iteration with unix commands¶
You can use a bash script under unix to run Bioconvert on a set of files. For instance, the following script takes all files with the .fastq extension and converts them to fasta:
#!/bin/bash
CONVERSION=fastq2fasta
# loop over every file ending in .fastq in the current directory
for f in *.fastq
do
    echo "Processing $f file..."
    bioconvert "$CONVERSION" "$f" --force
done
Note, however, that this is still a sequential computation. Yet, you may now change it slightly to run the commands on a cluster. For instance, on a SLURM scheduler, you can use:
#!/bin/bash
CONVERSION=fastq2fasta
for f in *.fastq
do
    echo "Processing $f file..."
    # sbatch expects a script; --wrap turns the command line into one
    sbatch -c 1 --wrap "bioconvert $CONVERSION $f --force"
done
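On a single machine, the same loop could also be run in parallel from Python. The sketch below is one possible approach, assuming the bioconvert executable is on your PATH; the helper names are hypothetical and the worker count is arbitrary.

```python
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(pattern, conversion):
    """One bioconvert command (as an argument list) per matching file."""
    return [["bioconvert", conversion, name, "--force"]
            for name in sorted(glob.glob(pattern))]


def run_all(commands, workers=4):
    """Run the commands concurrently and return their exit codes."""
    def run(cmd):
        return subprocess.run(cmd).returncode
    # Threads are enough here: each one only waits for a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    codes = run_all(build_commands("*.fastq", "fastq2fasta"))
    print(f"{codes.count(0)} of {len(codes)} conversions succeeded")
```

Passing each command as a list avoids shell quoting problems with unusual filenames.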
Snakemake option¶
If you have lots of files to convert, a snakemake pipeline is available in the Sequana project and can be installed using pip install sequana_bioconvert. It also installs bioconvert with an apptainer image that contains all dependencies for you.
Here is another way of running your jobs in parallel using a simple Snakefile (snakemake) that can be run easily either locally or on a cluster.
The Snakefile reads as follows:
inext = "fastq"
outext = "fasta"
command = "fastq2fasta"

import glob
samples = glob.glob("*.{}".format(inext))
# strip only the last extension so that names containing dots are preserved
samples = [this.rsplit(".", 1)[0] for this in samples]

rule all:
    input: expand("{{dataset}}.{}".format(outext), dataset=samples)

rule bioconvert:
    input: "{{dataset}}.{}".format(inext)
    output: "{{dataset}}.{}".format(outext)
    run:
        cmd = "bioconvert {} ".format(command) + "{input} {output}"
        shell(cmd)
and execute it locally as follows (assuming you have 4 CPUs):
snakemake -s Snakefile --cores 4
or on a cluster:
snakemake -s Snakefile --cluster "sbatch --mem=1000" -j 10
Tutorial¶
This tutorial lets you quickly get started with bioconvert and explore some of its features on a real data set.
We aim to highlight CNVs (Copy Number Variations) by identifying a significant increase in sequencing coverage in the samples described below.
Data¶
For this tutorial, we will work with 6 sequencing samples of the Staphylococcus aureus genome.
These samples are distributed in SRA format, whereas we want FASTQ files.
Downloading¶
To download archive files (SRA):
bioconvert sra2fastq ERR043367
This downloads the ERR043367 archive in SRA format and converts it to FASTQ, using only the sample ID.
This is paired-end sequencing, so bioconvert creates two FASTQ files:
ERR043367_1.fastq (contains the reads 1)
ERR043367_2.fastq (contains the reads 2)
Bioconvert behaves differently for single-end sequencing, but the syntax is the same. For example, the ERR3295124 sample is single-end. Let's try:
bioconvert sra2fastq ERR3295124
The command is exactly the same for single-end sequencing, but only one output file is created:
ERR3295124.fastq
Compression¶
FASTQ files can be huge. To save disk space, bioconvert can compress the output on the fly, but in this case you have to be explicit (specify the output name, with the gz extension):
bioconvert sra2fastq ERR043367 ERR043367.fastq.gz
Mapping¶
Now, from fastq files, we can perform an alignment on the reference genome using bwa for example:
bwa index staphylococcus_aureus.fasta
bwa mem -M -t 4 staphylococcus_aureus.fasta ERR043367_1.fastq ERR043367_2.fastq > ERR043367.sam
Note
The reference genome of Staphylococcus aureus is available on NCBI under the accession FN433596.
We get a SAM file that we can visualize. To reduce its size, you can use bioconvert in two ways to convert the SAM file to a BAM file:
Implicit way:
bioconvert ERR043367.sam ERR043367.bam
This is the implicit way because bioconvert deduces the converter to use from the input and output extensions.
Explicit way:
bioconvert sam2bam ERR043367.sam
This way, we specify the converter, so bioconvert is able to deduce the extension of the output file.
Note
In both cases, we have the same output file (ERR043367.bam)
Visualization¶
Then, from this BAM file, you can visualize the mapping, with IGV for example.
Here we have a global view of 500 bp, from position 2,828,460 to 2,828,960, using IGV. In this view, we can see a significant difference between the region in red and the other two in blue.

We expected to obtain fairly uniform coverage across all samples, but in this region we observe that this is not the case. We can therefore say that there is a possible variation in the number of copies.
In order to confirm what we saw, we want to convert our alignment (BAM) to a BED file to know the number of reads mapped at each position:
bioconvert bam2bedgraph ERR043367.bam ERR043367.bed
In this BED file, we can check the visual results obtained earlier with text processing tools that give quick statistics such as the average coverage (168), and compare it to the most covered regions to identify CNVs.
Across all the samples, we identified 2 regions that are significantly more covered.
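As an illustration of the kind of quick statistics mentioned above, the following sketch computes a length-weighted mean coverage and lists the intervals above a threshold. It assumes four-column bedGraph records (chromosome, start, end, coverage); the function names and the sample records are made up for the example.

```python
def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) from four-column bedGraph lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments and optional header lines.
        if (len(fields) < 4 or fields[0] in ("track", "browser")
                or fields[0].startswith("#")):
            continue
        yield fields[0], int(fields[1]), int(fields[2]), float(fields[3])


def mean_coverage(lines):
    """Length-weighted mean of the coverage column."""
    bases = 0
    depth = 0.0
    for _, start, end, value in parse_bedgraph(lines):
        bases += end - start
        depth += (end - start) * value
    return depth / bases if bases else 0.0


def regions_above(lines, threshold):
    """Intervals whose coverage exceeds the threshold."""
    return [(chrom, start, end)
            for chrom, start, end, value in parse_bedgraph(lines)
            if value > threshold]


# Made-up records for illustration only.
records = [
    "chr1\t0\t100\t150",
    "chr1\t100\t200\t170",
    "chr1\t200\t250\t600",
]
average = mean_coverage(records)
print(average)                              # 248.0
print(regions_above(records, 2 * average))  # [('chr1', 200, 250)]
```

For a real file, you would pass an open file handle instead of the records list; the factor of 2 used as a threshold is an arbitrary choice.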
Developer guide¶
Quick start¶
As a developer, assuming you have a valid environment and have installed Bioconvert (see Installation for developers), go to the bioconvert directory and run the bioconvert_init command for the input and output formats you wish to add (here we want to convert format A to B). You may also just copy an existing file:
cd bioconvert
bioconvert_init -i A -o B > A2B.py
See the How to add a new conversion section for details. Edit the file and update the method that performs the conversion by adding the relevant code (either Python or external tools). Once done, please:
add an input test file in the ./test/data directory (see How to add a test)
add the relevant data test files in the ./bioconvert/test/data/ directory (see How to add a test file)
update the documentation as explained in the How to add your new converter to the main documentation ? section:
add the module in doc/ref_converters.rst in the autosummary section
add the A2B in the README.rst
add a CI action in .github/workflows named after the conversion (A2B.yml)
Note also that a converter (a Python module, e.g., fastq2fasta) may have several methods included and it is quite straightforward to add a new method (How to add a new method to an existing converter). They can later be compared thanks to our benchmarking framework.
If this is a new format, you may also update the glossary.rst file in the documentation.
Installation for developers¶
To develop on bioconvert, it is highly recommended to install bioconvert in a virtualenv:
mkdir bioconvert
cd bioconvert
python3.7 -m venv py37
source py37/bin/activate
Then clone the bioconvert project:
mkdir src
cd src
git clone https://github.com/bioconvert/bioconvert.git
cd bioconvert
We need to install some extra requirements to run the tests or build the doc. To install these requirements:
pip install -e ".[testing]"
Warning
The extra requirements try to install pygraphviz, so you need to install graphviz on your computer. If you are running a Debian-based distribution, you have to install the libcgraph6, libgraphviz-dev and graphviz packages.
Note
You may need to install extra tools to run some conversions. The requirements_tools.txt file lists the extra tools available through conda.
How to add a new conversion¶
Officially, Bioconvert supports one-to-one conversions only (from one format to another format). See the note here below about One-to-many and many-to-one conversions.
Let us imagine that we want to include a new format conversion from FastQ to FastA format.
First, you need to add a new file in the ./bioconvert directory called:
fastq2fasta.py
Please note that the name is all in lowercase and that we concatenate the input format name, the character 2 and the output format name. Sometimes a format already includes the character 2 in its name (e.g. bz2), which may be confusing. For now, just follow the previous convention, meaning duplicate the character 2 if needed (e.g., for bz2 to gz format, use bz22gz).
As for the class name, we use all uppercase. In the newly created file (fastq2fasta.py) you can (i) copy / paste the content of an existing converter, (ii) use the bioconvert_init executable (see later), or (iii) copy / paste the following code:
 1  """Convert :term:`FastQ` format to :term:`FastA` formats"""
 2  from bioconvert import ConvBase
 3  from bioconvert.core.decorators import requires, requires_nothing
 4  __all__ = ["FASTQ2FASTA"]
 5
 6
 7  class FASTQ2FASTA(ConvBase):
 8      """
 9
10      """
11      _default_method = "v1"
12
13      def __init__(self, infile, outfile):
14          """
15          :param str infile: information
16          :param str outfile: information
17          """
18          super().__init__(infile, outfile)
19
20      @requires(external_library="awk")
21      def _method_v1(self, *args, **kwargs):
22          # Conversion is made here.
23          # You can use self.infile and self.outfile
24          # If you use an external command, you can use self.execute:
25          self.execute(cmd)
26
27      @requires_nothing
28      def _method_v2(self, *args, **kwargs):
29          # another method
30          pass
On line 1, please explain the conversion using the terms available in the Glossary (./doc/glossary.rst file). If not available, you may edit the glossary.rst file to add a quick description of the formats.
Warning
If the format is not already included in Bioconvert, you will need to update the file core/extensions.py to add the format name and its possible extensions.
On line 2, just import the common class.
On line 7, name the class after your input and output formats; again include the character 2 between the input and output formats. Usually, we use uppercase for the formats since most format names are acronyms. If the input or output format exists already in Bioconvert, please follow the existing conventions.
On line 13, we add the constructor.
On line 21, we add a method to perform the conversion named _method_v1. Here, the prefix _method_ is compulsory: it tells Bioconvert that it is a possible conversion to include in the user interface. This is also where you will add your code to perform the conversion. The suffix (here v1) is the name of the conversion method. That way you can add as many conversion methods as you need (e.g. on line 28, we implemented another method called v2).
Line 20 and line 27 show the decorator that tells bioconvert which external tools are required. See Decorators section.
Since several methods can be implemented, we need to define a default method (line 11; here v1).
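To illustrate why the _method_ prefix and the default method matter, here is a toy base class that discovers such methods by introspection and dispatches to one of them. It is a simplified stand-in written for this page, not Bioconvert's ConvBase; the names MiniConvBase and DEMO2DEMO are hypothetical, and here available_methods is a class method, which may differ from the real API.

```python
class MiniConvBase:
    """Toy stand-in for a converter base class (not Bioconvert's ConvBase)."""

    _default_method = None

    def __init__(self, infile, outfile):
        self.infile = infile
        self.outfile = outfile

    @classmethod
    def available_methods(cls):
        """Names of all methods defined with the '_method_' prefix."""
        prefix = "_method_"
        return sorted(name[len(prefix):] for name in dir(cls)
                      if name.startswith(prefix))

    def __call__(self, method=None):
        """Dispatch to the requested method, or to the default one."""
        name = method or self._default_method
        if name not in self.available_methods():
            raise ValueError(f"unknown method: {name!r}")
        return getattr(self, "_method_" + name)()


class DEMO2DEMO(MiniConvBase):
    _default_method = "v1"

    def _method_v1(self):
        return "v1 called"

    def _method_v2(self):
        return "v2 called"


print(DEMO2DEMO.available_methods())     # ['v1', 'v2']
print(DEMO2DEMO("a", "b")())             # v1 called
print(DEMO2DEMO("a", "b")(method="v2"))  # v2 called
```

Adding a third method to the subclass would make it appear in the list without any further registration, which is the point of the naming convention.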
In order to simplify the creation of new converters, you can also use the standalone bioconvert_init. Example:
$ bioconvert_init -i fastq -o fasta > fastq2fasta.py
Of course, you will need to edit the file to add the conversion itself in the appropriate method (e.g. _method_v1).
If you need to include extra arguments, such as a reference file, you may add them, although this is not yet part of the official Bioconvert API. See for instance the SAM2CRAM converter.
One-to-many and many-to-one conversions¶
The one-to-many and many-to-one conversions are now implemented in Bioconvert. We have only 2 instances so far, namely bioconvert.fastq2fasta_qual and bioconvert.fasta_qual2fastq. We have no instances of many-to-many so far. The purpose of the underscore character is to indicate an "and" connection: you need both QUAL and FASTA to create a FASTQ file.
For developers, we ask the input or output formats to be sorted alphabetically to ease the user experience.
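The naming conventions described so far (lowercase module name, the character 2 as separator, underscore-joined formats sorted alphabetically, uppercase class name) can be summarised in a small sketch. The helper converter_names is hypothetical and only restates the conventions; it is not part of Bioconvert.

```python
def converter_names(inputs, outputs):
    """Return (module_name, class_name) for the given format names.

    Hypothetical helper that restates the documented conventions.
    """
    # Formats on each side are lowercased, sorted and joined with "_".
    left = "_".join(sorted(fmt.lower() for fmt in inputs))
    right = "_".join(sorted(fmt.lower() for fmt in outputs))
    module = f"{left}2{right}"
    return module, module.upper()


print(converter_names(["fastq"], ["fasta"]))          # ('fastq2fasta', 'FASTQ2FASTA')
print(converter_names(["qual", "fasta"], ["fastq"]))  # ('fasta_qual2fastq', 'FASTA_QUAL2FASTQ')
print(converter_names(["bz2"], ["gz"]))               # ('bz22gz', 'BZ22GZ')
```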
How to add a new method to an existing converter¶
As shown above, use this code and add it to the relevant file in the ./bioconvert directory:
def _method_UniqueName(self, *args, **kwargs):
    # from kwargs, you can use any kind of arguments.
    # threads is an example, reference, another example.
    # Your code here below
    pass
Then, it will be available in the class and bioconvert automatically; the bioconvert executable should show the name of your new method in the help message.
In order to add your new method, you can add:
Pure Python code
Python code that relies on a third-party library. If so, you may use:
Python libraries available on PyPI. Please add the library name to the requirements.txt
if the Python library requires lots of compilation and is available on bioconda, you may add the library name to the requirements_tools.txt instead.
Third-party tools available on bioconda (e.g., squizz, seqtk, etc.) that you can add to the requirements_tools.txt
Perl and Go code are also accepted. If so, use self.install_tool(NAME) and add a script in ./misc/install_NAME.sh
Decorators¶
Decorators that can be used to "flag" or "modify" conversion methods have been defined in bioconvert/core/decorators.py:
@in_gz can be used to indicate that the method is able to transparently handle input files that are compressed in .gz format. This is done by adding an in_gz attribute (set to True) to the method.
@compressor will wrap the method in code that handles input decompression from .gz format and output compression to .gz, .bz2 or .dsrc. This automatically applies @in_gz. Example:
@compressor
def _method_noncompressor(self, *args, **kwargs):
    """This method does not handle compressed input or output by itself."""
    pass
# The decorator transforms the method that now handles compressed
# input and output; the method has an in_gz attribute (which is set to True)
@out_compressor will wrap the method in code that handles output compression to .gz, .bz2 or .dsrc. It is intended to be used on methods that already handle compressed input transparently, and therefore do not need the input decompression provided by @compressor. Typically, one would also apply @in_gz to such methods. In that case, @in_gz should be applied "on top" of @out_compressor. The reason is that decorators closest to the function are applied first, and applying another decorator on top of @in_gz would typically not preserve the in_gz attribute. Example:
@in_gz
@out_compressor
def _method_incompressor(self, *args, **kwargs):
    """This method already handles compressed .gz input."""
    pass
# This results in a method that handles compressed input and output
# This method is further modified to have an in_gz attribute
# (which is set to True)
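The ordering rule can be checked with a self-contained toy example. The two decorators below are simplified re-creations written for this illustration (an attribute-setting flag and a wrapper that does not copy attributes); they are not the implementations found in bioconvert/core/decorators.py.

```python
def in_gz(func):
    """Toy flag decorator: mark the function by setting an attribute."""
    func.in_gz = True
    return func


def naive_wrapper(func):
    """Toy wrapping decorator that does NOT copy attributes from func."""
    def wrapped(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapped


@in_gz
@naive_wrapper
def flag_on_top():
    pass


@naive_wrapper
@in_gz
def flag_below():
    pass


# The wrapper is applied first, then the flag is set on the wrapped function.
print(hasattr(flag_on_top, "in_gz"))  # True
# The flag is set first, then hidden by the wrapper that replaces the function.
print(hasattr(flag_below, "in_gz"))   # False
```

A wrapper written with functools.wraps would copy the attribute, which is why the text says the attribute is "typically" lost rather than always.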
Another bioconvert decorator is called requires. It should be used to annotate a method with the type of tools it needs to work. It is important to decorate all methods with the requires decorator so that the user interface can tell which tools are properly installed. You can use 4 arguments as explained in bioconvert.core.decorators:
 1  @requires_nothing
 2  def _method_python(self, *args, **kwargs):
 3      # a pure Python code does not require extra libraries
 4      with open(self.outfile, "w") as fasta, open(self.infile, "r") as fastq:
 5          for (name, seq, _) in FASTQ2FASTA.readfq(fastq):
 6              fasta.write(">{}\n{}\n".format(name, seq))
 7
 8  @requires(python_library="mappy")
 9  def _method_mappy(self, *args, **kwargs):
10      with open(self.outfile, "w") as fasta:
11          for (name, seq, _) in fastx_read(self.infile):
12              fasta.write(">{}\n{}\n".format(name, seq))
13
14  @requires("awk")
15  def _method_awk(self, *args, **kwargs):
16      # Note1: since we use .format, we need to escape the { and } characters
17      # Note2: the \n need to be escaped for Popen to work
18      awkcmd = """awk '{{printf(">%s\\n",substr($0,2));}}' """
19      cmd = "{} {} > {}".format(awkcmd, self.infile, self.outfile)
20      self.execute(cmd)
On line 1, we decorate the method with the requires_nothing() decorator because the method is implemented in pure Python.
On line 8, we decorate the method with the requires() decorator to inform bioconvert that the method relies on the external Python library called mappy.
On line 14, we decorate the method with the requires() decorator to inform bioconvert that the method relies on an external tool called awk. In theory, you should write:
@requires(external_library="awk")
but external_library is the first optional argument so it can be omitted. If several libraries are required, you can use:
@requires(external_libraries=["awk", ""])
or:
@requires(python_libraries=["scipy", "pandas"])
Note
For more general explanations about decorators, see https://stackoverflow.com/a/1594484/1878788.
How to add a test¶
Following the example from above (fastq2fasta), we need to add a test file. To do so, go to the ./test directory and add a file named test_fastq2fasta.py.
 1  import pytest
 2
 3  from bioconvert.fastq2fasta import FASTQ2FASTA
 4  from bioconvert import bioconvert_data
 5  from easydev import TempFile, md5
 6
 7  from . import test_dir
 8
 9  @pytest.mark.parametrize("method", FASTQ2FASTA.available_methods)
10  def test_fastq2fasta(method):
11      # your code here
12      # you will need data for instance "mydata.fastq" and "mydata.fasta".
13      # Put it in bioconvert/bioconvert/data
14      # you can then use ::
15      infile = f"{test_dir}/data/fastq/test_mydata.fastq"
16      expected_outfile = f"{test_dir}/data/fasta/test_mydata.fasta"
17      with TempFile(suffix=".fasta") as tempfile:
18          converter = FASTQ2FASTA(infile, tempfile.name)
19          converter(method=method)
20
21          # Check that the output is correct with a checksum
22          assert md5(tempfile.name) == md5(expected_outfile)
In Bioconvert, we use pytest as our test framework. In principle, we need one test function per method found in the converter. Here, on line 9, we serialize the tests by looping through the methods available in the converter using the pytest.mark.parametrize function. That way, the test file remains short and does not need to be duplicated.
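The checksum comparison used in the test can be reproduced with the standard library alone. The sketch below is an assumption-free stand-in for the idea (hashing both files and comparing digests), not the easydev implementation; the names md5sum and same_content are hypothetical.

```python
import hashlib
import os
import tempfile


def md5sum(filename, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(filename, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have the same MD5 digest."""
    return md5sum(path_a) == md5sum(path_b)


# Toy demonstration with two temporary files holding the same FASTA record.
with tempfile.TemporaryDirectory() as workdir:
    produced = os.path.join(workdir, "produced.fasta")
    expected = os.path.join(workdir, "expected.fasta")
    for path in (produced, expected):
        with open(path, "w") as handle:
            handle.write(">read1\nACGT\n")
    print(same_content(produced, expected))  # True
```

Keep in mind that a checksum is strict: two FASTA files that differ only in line wrapping would be reported as different.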
How to add a test file¶
Files used for testing should be added in ./bioconvert/test/data/ext/converter_name.ext.
How to locally run the tests¶
Go to the source directory of Bioconvert.
If not already done, install all packages required for testing:
cd bioconvert
pip3 install .[testing]
Then, run the tests using:
pytest test/ -v
Or, to run a specific test file, for example for your new converter fastq2fasta:
pytest test/test_fastq2fasta.py -v
or
pytest -v -k test_fastq2fasta
How to benchmark your new method vs others¶
from bioconvert import Benchmark
from bioconvert.fastq2fasta import FASTQ2FASTA
# input and output filenames (example values)
infile = "input.fastq"
outfile = "output.fasta"
converter = FASTQ2FASTA(infile, outfile)
b = Benchmark(converter)
b.plot()
You can also use the bioconvert standalone with the -b option.
How to add your new converter to the main documentation ?¶
Edit the doc/ref_converters.rst and add this code (replacing A2B by your conversion):
.. automodule:: bioconvert.A2B
:members:
:synopsis:
:private-members:
and update the autosummary section:
.. autosummary::
bioconvert.A2B
pep8 and conventions¶
In order to write your Python code, use PEP8 convention as much as possible. Follow the conventions used in the code. For instance,
class A():
    """Some documentation"""

    def __init__(self):
        """some doc"""
        pass

    def another_method(self):
        """some doc"""
        c = 1 + 2


class B():
    """Another class"""

    def __init__(self, *args, **kwargs):
        """some doc"""
        pass


def AFunction(x):
    """some doc"""
    return x
2 blank lines between classes and functions
1 blank line between methods
spaces around operators (e.g. =, +)
Try to have 80 characters max on one line
Add documentation in triple quotes
Since v0.5.2, we apply black on the different Python modules.
Requirements files¶
requirements.txt : should contain the packages to be retrieved from PyPI only. Those are downloaded and installed (if missing) when using python setup.py install
environment_rtd.yml : do not touch. Simple file for readthedocs
readthedocs.yml : all conda and pip dependencies to run the example and build the doc
environment.yml is a conda list of all dependencies
How to update bioconvert on bioconda¶
Fork bioconda-recipes github repository and clone locally. Follow instructions on https://bioconda.github.io/contributing.html
In a nutshell, install bioconda-utils:
git clone YOURFORKED_REPOSITORY
cd bioconda-recipes
Edit the bioconvert recipe and update its contents. If a new version exists on PyPI, you need to change the md5sum in recipes/bioconvert/meta.yaml.
Check the recipe:
bioconda-utils build recipes/ config.yml --packages bioconvert
Finally, commit and create a PR:
#git push -u origin my-recipe
git commit .
git push
Sphinx Documentation¶
In order to update the documentation, go to the ./doc directory and update any of the .rst files. Then, for Linux users, just type:
make html
Regarding the Formats page, we provide simple ontology with 3 entries: Type, Format and Status. Please choose one of the following values:
Type: sequence, assembly, alignement, other, index, variant, database, compression
Format: binary, human-readable
Status: deprecated, included, not included
Docker¶
In order to build the Docker image, use this command:
docker build .
The Dockerfile found next to setup.py is self-contained and has been tested for v0.5.2; it uses the spec-file.txt that was generated in a conda environment using:
conda list --explicit
Benchmarking¶
Introduction¶
Converters (e.g. FASTQ2FASTA) may have several methods implemented. A developer may also want to compare his/her methods with those available in Bioconvert.
In order to help developers compare their methods, we provide a benchmark framework.
Of course, the first thing to do is to add your new method inside the converter (see Developer guide) and use the method boxplot_benchmark().
Then, you have two options. Either use the bioconvert command or use the bioconvert Python library. In both cases you will first need a local data set as input file. We do not provide such files inside Bioconvert. We have a tool to generate random FastQ files (the bioconvert.simulator.fastq module used in the example below), but this is not generalised for all input formats.
So, you could use the following code to run the benchmark fro Python:
# Generate the dummy data, saving the results in a temporary file
from easydev import TempFile
from bioconvert.simulator.fastq import FastqSim
infile = TempFile(suffix=".fastq")
outfile = TempFile(suffix=".fasta")
fs = FastqSim(infile.name)
fs.nreads = 1000 # 1,000,000 by default
fs.simulate()
# Perform the benchmarking
from bioconvert.fastq2fasta import FASTQ2FASTA
c = FASTQ2FASTA(infile.name, outfile.name)
c.compute_benchmark(N=10)
# you may study the memory or CPU usage using mode="CPU" or mode="memory"
c.boxplot_benchmark(mode="time")
infile.delete()
outfile.delete()

Here, compute_benchmark(N=10) runs each available method 10 times, and boxplot_benchmark plots the resulting measurements.
Be aware that the pure Python methods may be faster for small data and slower for large data. Indeed, each method has an intrinsic delay before the processing starts. Therefore, benchmarking needs large files to be meaningful!
If we use 1,000,000 reads instead of just 1,000, we would get different results (which may change depending on your system and IO performance):

Here, the results are more robust and reproducible.
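To make the idea of running each method several times concrete, here is a minimal, library-independent sketch. It is not Bioconvert's Benchmark class: the function names time_methods and fastest, and the two toy callables standing in for conversion methods, are assumptions made for illustration.

```python
import statistics
import time


def time_methods(methods, n_trials=10):
    """Run each callable `n_trials` times and return {name: [seconds, ...]}."""
    timings = {}
    for name, func in methods.items():
        timings[name] = []
        for _ in range(n_trials):
            start = time.perf_counter()
            func()
            timings[name].append(time.perf_counter() - start)
    return timings


def fastest(timings):
    """Return the method name with the smallest median run time."""
    return min(timings, key=lambda name: statistics.median(timings[name]))


def sum_loop():
    total = 0
    for value in range(1000):
        total += value
    return total


# Two toy "methods" standing in for real conversion methods
toy_methods = {"sum_builtin": lambda: sum(range(1000)), "sum_loop": sum_loop}
results = time_methods(toy_methods, n_trials=5)
best = fastest(results)
```

Comparing medians rather than single runs reduces the influence of an occasional slow trial, which is the same motivation as the boxplots above.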
Multiple benchmarking for more robustness¶
With the previous method, even though you can decrease the error bars by using more trials per method, the results may still be biased by local computation or IO access. We provide a Snakefile here: Snakefile_benchmark that runs the previous benchmarking several times, so that in the end you have a benchmark of benchmarks. We found it far more robust. Here is an example for the fastq2fasta case where each method was run 3 times and, in each case, 10 instances of conversion were performed. The orange vertical lines give the median and a final statement indicates whether the best method is significantly better than the others.

Note
The computation can be long; the Snakefile allows you to parallelise it.
Zenodo¶
The benchmarking requires input files, which can be large. Those files are stored on Zenodo: https://zenodo.org/communities/bioconvert/
Gallery¶
Possible Conversion (annotated)¶
Plot a directed graph of possible conversions with annotation (color indicates the degree of each format)
from bioconvert.core.graph import create_graph
If you use pygraphviz, you can have a good quality image using:
import matplotlib as mpl
mpl.rcParams["figure.dpi"] = 250
In order to create the following image, you need graphviz and pygraphviz. If you cannot install those packages, you may use a singularity image like in the following example by setting the use_singularity parameter to True. This would work under Linux. Not tested on other systems yet.
try:
    create_graph("conversion.png", use_singularity=False)
except:
    create_graph("conversion.png", use_singularity=True)
from pylab import imshow, imread, xticks, yticks, gca
imshow(imread("conversion.png"), interpolation="nearest")
xticks([])
yticks([])
ax = gca()
ax.axis("off")

(-0.5, 2746.5, 826.5, -0.5)
Total running time of the script: ( 0 minutes 0.617 seconds)
Possible Conversion¶
Plot directed graph of possible conversion
from bioconvert.core.graph import create_graph
If you use pygraphviz, you can have a good quality image using:
import matplotlib as mpl
mpl.rcParams["figure.dpi"] = 250
In order to create the following image, you need graphviz and pygraphviz. If you cannot install those packages, you may use a singularity image like in the following example by setting the use_singularity parameter to True. This would work under Linux. Not tested on other systems yet.
try:
    create_graph("conversion.png", use_singularity=True)
except:
    create_graph("conversion.png", use_singularity=False)
Creating directory /home/docs/.config/bioconvert
Downloading singularity. Please wait
singularity pull --name /home/docs/.config/bioconvert/graphviz.simg shub://cokelaer/graphviz4all:v1
Warning ! Singularity must be installed if you want to you used it ! Switching to local graphviz executable if available
from pylab import imshow, imread, xticks, yticks, gca
imshow(imread("conversion.png"), interpolation="nearest")
xticks([])
yticks([])
ax = gca()
ax.axis("off")

(-0.5, 2278.5, 826.5, -0.5)
Total running time of the script: ( 0 minutes 3.068 seconds)
Possible Conversion (clustered)¶
Plot directed graph of possible conversions clustered by field
from bioconvert.core.graph import create_graph
If you use pygraphviz, you can have a good quality image using:
import matplotlib as mpl
mpl.rcParams["figure.dpi"] = 250
In order to create the following image, you need graphviz and pygraphviz. If you cannot install those packages, you may use a singularity image like in the following example by setting the use_singularity parameter to True. This would work under Linux. Not tested on other systems yet.
try:
    create_graph("conversion.png", use_singularity=False, include_subgraph=True)
except:
    create_graph("conversion.png", use_singularity=True, include_subgraph=True)
from pylab import imshow, imread, xticks, yticks, gca
imshow(imread("conversion.png"), interpolation="nearest")
xticks([])
yticks([])
ax = gca()
ax.axis("off")

(-0.5, 2962.5, 651.5, -0.5)
Total running time of the script: ( 0 minutes 0.466 seconds)
Converter benchmarking¶
Converters have a default method.
Note, however, that several methods may be available. Moreover, you may have a method that you want to compare with the implemented ones. To do so you will need to implement your method first. Then, simply use our benchmarking framework as follows.
from bioconvert import Benchmark
from bioconvert import bioconvert_data
from bioconvert.bam2cov import BAM2COV
from bioconvert.fastq2fasta import FASTQ2FASTA
Get the converter you wish to benchmark
input_file = bioconvert_data("test_measles.sorted.bam")
conv = BAM2COV(input_file, "test.cov")
# input_file = bioconvert_data("test_fastq2fasta_v1.fastq")
# conv = FASTQ2FASTA(input_file, "test.fasta")
Get the Benchmark instance
bench = Benchmark(conv)
bench.plot()
# You can now see the different methods implemented in this
# converter and which one is the fastest.

Evaluating method bedtools: 0%| | 0/5 [00:00<?, ?it/s]
Evaluating method bedtools: 40%|#### | 2/5 [00:00<00:00, 18.07it/s]
Evaluating method bedtools: 80%|######## | 4/5 [00:00<00:00, 14.24it/s]
Evaluating method bedtools: 100%|##########| 5/5 [00:00<00:00, 13.21it/s]
{'time': {'bedtools': [0.05561065673828125, 0.05501890182495117, 0.10530495643615723, 0.05533647537231445, 0.10578155517578125]}, 'CPU': {'bedtools': [26.1, 45.5, 31.85, 18.2, 39.4]}, 'memory': {'bedtools': [16.0, 16.0, 16.0, 16.0, 16.0]}}
Total running time of the script: ( 0 minutes 0.517 seconds)
Available methods per converter¶
Plot number of implemented methods per converter.
from bioconvert.core.registry import Registry
r = Registry()
info = r.get_info()
# The available unique converters
converters = [x for x in info.items()]
# the number of methods per converter
data = [info[k] for k, v in info.items()]
# the number of formats
A1 = [y for x in list(r.get_conversions()) for y in x[0]]
A2 = [y for x in list(r.get_conversions()) for y in x[1]]
formats = set(A1 + A2)
print("Number of formats: {}".format(len(formats)))
print("Number of converters: {}".format(len(converters)))
print("Number of methods : {}".format(sum(data)))
Number of formats: 51
Number of converters: 101
Number of methods : 89
from pylab import hist, clf, xlabel, grid
clf()
hist(data, range(17), ec="k", zorder=2, align="left")
xlabel("Number of methods")
grid(zorder=-1)

Total running time of the script: ( 0 minutes 0.145 seconds)
References¶
Core functions¶
Base: Main factory of Bioconvert
Benchmark: Tools for benchmarking
Converter: Standalone application dedicated to conversion
Decorators: Provides a general tool to perform pre/post compression
Downloader: Download singularity image
Extensions: List of formats and associated extensions
Graph: Network tools to manipulate the graph of conversion
Registry: Main bioconvert registry that fetches automatically the relevant converter
Simplified version of shell.py module from snakemake package
Utils: misc utility functions
Base¶
Main factory of Bioconvert
- class ConvArg(names, help, **kwargs)[source]¶
This class can be used to add specific extra arguments to any converter
For instance, imagine a conversion named A2B that requires the user to provide a reference. Then, you may want to provide the --reference extra argument. This is possible by adding a class method named get_additional_arguments that will yield instance of this class for each extra argument.
@classmethod
def get_additional_arguments(cls):
    yield ConvArg(
        names="--reference",
        default=None,
        help="the reference"
    )
Then, when calling bioconvert as follows:
bioconvert A2B --help
the new argument will be shown in the list of arguments.
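The mechanism described above, where a class method yields argument descriptions that end up in the command-line parser, can be sketched with argparse alone. ArgSpec and the A2B class below are hypothetical stand-ins written for this illustration; they are not Bioconvert's ConvArg or a real converter.

```python
import argparse


class ArgSpec:
    """Hypothetical stand-in for ConvArg: stores names, help and extra options."""

    def __init__(self, names, help, **kwargs):
        self.names = [names] if isinstance(names, str) else list(names)
        self.help = help
        self.kwargs = kwargs

    def add_to_parser(self, parser):
        # Forward the stored description to argparse
        parser.add_argument(*self.names, help=self.help, **self.kwargs)


class A2B:
    """Hypothetical converter declaring one extra argument."""

    @classmethod
    def get_additional_arguments(cls):
        yield ArgSpec(names="--reference", default=None, help="the reference")


parser = argparse.ArgumentParser(prog="A2B")
for spec in A2B.get_additional_arguments():
    spec.add_to_parser(parser)

args = parser.parse_args(["--reference", "ref.fa"])
```

Because the arguments are declared on the converter class, the generic command-line code does not need to know about any individual converter.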
- class ConvBase(infile, outfile)[source]¶
Base class for all converters.
To build a new converter, create a new class which inherits from ConvBase and implement a method that performs the conversion. The name of the converter method must start with _method_. For instance:
class FASTQ2FASTA(ConvBase):
    def _method_python(self, *args, **kwargs):
        # include your code here. You can use the infile and outfile
        # attributes.
        self.infile
        self.outfile
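One way a base class can discover methods by their name prefix is sketched below. MiniConvBase and UPPER2LOWER are hypothetical, simplified classes written for this illustration; how ConvBase itself collects its methods is not shown in this documentation.

```python
class MiniConvBase:
    """Hypothetical, simplified base class; not Bioconvert's ConvBase."""

    prefix = "_method_"
    _default_method = None

    def __init__(self, infile, outfile):
        self.infile = infile
        self.outfile = outfile

    @property
    def available_methods(self):
        # Collect every callable attribute whose name starts with the prefix
        return sorted(
            name[len(self.prefix):]
            for name in dir(self)
            if name.startswith(self.prefix) and callable(getattr(self, name))
        )

    def __call__(self, method=None):
        # Fall back to the default method when none is requested
        method = method or self._default_method
        return getattr(self, self.prefix + method)()


class UPPER2LOWER(MiniConvBase):
    _default_method = "python"

    def _method_python(self):
        return f"python: {self.infile} -> {self.outfile}"

    def _method_other(self):
        return f"other: {self.infile} -> {self.outfile}"


conv = UPPER2LOWER("in.txt", "out.txt")
```

With this convention, adding a new method to a converter only requires defining one more function with the right prefix.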
constructor
- Parameters:
- boxplot_benchmark(rot_xticks=90, boxplot_args={}, mode='time')[source]¶
This function plots the benchmark computed in compute_benchmark()
- compute_benchmark(N=5, to_exclude=[], to_include=[])[source]¶
Simple wrapper to call Benchmark
This function computes the benchmark; see Benchmark for details.
- install_tool(executable)[source]¶
Install the given tool, using the script: bioconvert/install_script/install_executable.sh if the executable is not already present
- Parameters:
executable -- executable to install
- Returns:
nothing
- property name¶
The name of the class
Benchmark¶
Tools for benchmarking
- class Benchmark(obj, N=5, to_exclude=None, to_include=None)[source]¶
Convenient class to benchmark several methods for a given converter
c = BAM2COV(infile, outfile)
b = Benchmark(c, N=5)
b.run_methods()
b.plot()
Constructor
- Parameters:
Use one of to_exclude or to_include. If both are provided, only the to_include one is used.
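The precedence rule stated above (to_include wins when both lists are given) can be written as a small selection function. This is a sketch of the rule only; select_methods is a hypothetical helper, not a function of the Benchmark class.

```python
def select_methods(available, to_exclude=None, to_include=None):
    """Return the methods to benchmark; to_include wins if both are given."""
    if to_include:
        return [m for m in available if m in to_include]
    if to_exclude:
        return [m for m in available if m not in to_exclude]
    return list(available)


methods = ["awk", "bioconvert", "seqtk"]
only_awk = select_methods(methods, to_exclude=["awk"], to_include=["awk"])
```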
- plot(rerun=False, ylabel=None, rot_xticks=0, boxplot_args={}, mode='time')[source]¶
Plots the benchmark results, running the benchmarks if needed or if rerun is True.
- Parameters:
rot_xticks -- rotation of the x-axis tick labels
boxplot_args -- dictionary with any of the pylab.boxplot arguments
mode -- either time, CPU or memory
- Returns:
dataframe with all results
- plot_multi_benchmark_max(path_json, output_filename='multi_benchmark.png', min_ylim=0, mode=None)[source]¶
Plotting function for the Snakefile_benchmark to be found in the doc
The json file looks like:
{
    "awk": {"0": 0.777020216, "1": 0.9638044834, "2": 1.7623617649, "3": 0.8348755836},
    "seqtk": {"0": 1.0024843216, "1": 0.6313509941, "2": 1.4048073292, "3": 1.0554351807},
    "Benchmark": {"0": 1, "1": 1, "2": 2, "3": 2}
}
The number of benchmarks is inferred from the field 'Benchmark'.
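As an illustration of how such a file can be read, the sketch below groups the timings of each method by the benchmark they belong to and takes the median of each group. It uses the example values above; the function name medians_per_benchmark is an assumption and this is not the code behind plot_multi_benchmark_max.

```python
import json
import statistics
from collections import defaultdict

# Same layout and values as the example above
text = """
{"awk": {"0": 0.777020216, "1": 0.9638044834, "2": 1.7623617649, "3": 0.8348755836},
 "seqtk": {"0": 1.0024843216, "1": 0.6313509941, "2": 1.4048073292, "3": 1.0554351807},
 "Benchmark": {"0": 1, "1": 1, "2": 2, "3": 2}}
"""


def medians_per_benchmark(data):
    """Return {method: {benchmark_id: median time}} from the parsed JSON."""
    labels = data["Benchmark"]
    summary = {}
    for method, values in data.items():
        if method == "Benchmark":
            continue
        grouped = defaultdict(list)
        for trial, seconds in values.items():
            # The 'Benchmark' field tells which benchmark each trial belongs to
            grouped[labels[trial]].append(seconds)
        summary[method] = {b: statistics.median(v) for b, v in grouped.items()}
    return summary


data = json.loads(text)
n_benchmarks = len(set(data["Benchmark"].values()))
summary = medians_per_benchmark(data)
```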
Converter¶
Standalone application dedicated to conversion
Decorators¶
Provides a general tool to perform pre/post compression
- compressor(func)[source]¶
Decompress/compress input file without pipes
This decorator does not use pipes: the input file is decompressed, then compressed back afterwards. The advantage is that it should work for any file (even very large ones).
This decorator should be used by methods that use pure Python code
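The general idea of wrapping a pure Python conversion so that it transparently handles compressed files can be sketched as follows. This is a simplified variant written for illustration: it only handles .gz and works on temporary copies rather than decompressing the input in place, and gz_compressor and to_upper are hypothetical names, not Bioconvert code.

```python
import functools
import gzip
import os
import shutil
import tempfile


def gz_compressor(func):
    """Sketch: let func(infile, outfile) accept and produce .gz files."""

    @functools.wraps(func)
    def wrapper(infile, outfile):
        with tempfile.TemporaryDirectory() as tmpdir:
            real_in, real_out = infile, outfile
            if infile.endswith(".gz"):
                # Decompress the input into a temporary plain file
                real_in = os.path.join(tmpdir, "input")
                with gzip.open(infile, "rb") as src, open(real_in, "wb") as dst:
                    shutil.copyfileobj(src, dst)
            if outfile.endswith(".gz"):
                real_out = os.path.join(tmpdir, "output")
            func(real_in, real_out)
            if outfile.endswith(".gz"):
                # Compress the plain result into the requested output file
                with open(real_out, "rb") as src, gzip.open(outfile, "wb") as dst:
                    shutil.copyfileobj(src, dst)

    return wrapper


@gz_compressor
def to_upper(infile, outfile):
    """Toy pure Python 'conversion': upper-case a text file."""
    with open(infile) as fin, open(outfile, "w") as fout:
        fout.write(fin.read().upper())
```

The wrapped function never sees compressed data, so the same conversion code serves both plain and compressed inputs.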
- make_in_gz_tester(converter)[source]¶
Generates a function testing whether a conversion method of converter has the in_gz tag.
- out_compressor(func)[source]¶
Compress output file without pipes
This decorator should be used by methods that use pure Python code
- requires(external_binary=None, python_library=None, external_binaries=None, python_libraries=None)[source]¶
- Parameters:
external_binary -- a system binary required for the method
python_library -- a python library required for the method
external_binaries -- an array of system binaries required for the method
python_libraries -- an array of python libraries required for the method
- Returns:
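What checking for required binaries and libraries could look like is sketched below with standard-library calls only. The helper missing_requirements is a hypothetical function for illustration; it does not reproduce the behaviour of the requires decorator, and it only handles top-level module names.

```python
import importlib.util
import shutil


def missing_requirements(external_binaries=(), python_libraries=()):
    """Return the names that cannot be found on this system."""
    # A binary is considered available if it is found on the PATH
    missing = [b for b in external_binaries if shutil.which(b) is None]
    # A library is considered available if Python can locate its module spec
    missing += [
        lib for lib in python_libraries if importlib.util.find_spec(lib) is None
    ]
    return missing


report = missing_requirements(
    external_binaries=["surely-not-a-binary-xyz"],
    python_libraries=["json", "surely_not_a_module_xyz"],
)
```

A conversion method whose report is empty can be offered to the user; otherwise it can be flagged as unavailable.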
Downloader¶
Download singularity image
Extensions¶
List of formats and associated extensions
- extensions = {'abi': ['abi', 'ab1'], 'agp': ['agp'], 'bam': ['bam'], 'bcf': ['bcf'], 'bed': ['bed'], 'bedgraph': ['bedgraph', 'bg'], 'bigbed': ['bb', 'bigbed'], 'bigwig': ['bigwig', 'bw'], 'bplink': ['bplink'], 'bz2': ['bz2'], 'cdao': ['cdao'], 'clustal': ['clustal', 'aln', 'clw'], 'cov': ['cov'], 'cram': ['cram'], 'csv': ['csv'], 'dsrc': ['dsrc'], 'embl': ['embl'], 'ena': ['ena'], 'faa': ['faa', 'mpfa', 'aa'], 'fast5': ['fast5'], 'fasta': ['fasta', 'fa', 'fst'], 'fastq': ['fastq', 'fq'], 'genbank': ['genbank', 'gbk', 'gb'], 'gfa': ['gfa'], 'gff2': ['gff'], 'gff3': ['gff3'], 'gtf': ['gtf'], 'gz': ['gz'], 'json': ['json'], 'maf': ['maf'], 'newick': ['newick', 'nw', 'nhx', 'nwk'], 'nexus': ['nexus', 'nx', 'nex', 'nxs'], 'ods': ['ods'], 'paf': ['paf'], 'pdb': ['pdb'], 'phylip': ['phy', 'ph', 'phylip'], 'phyloxml': ['phyloxml', 'xml'], 'plink': ['plink'], 'pod5': ['pod5'], 'qual': ['qual'], 'sam': ['sam'], 'scf': ['scf'], 'sra': ['sra'], 'stockholm': ['sto', 'sth', 'stk', 'stockholm'], 'tsv': ['tsv'], 'twobit': ['2bit'], 'vcf': ['vcf'], 'wig': ['wig'], 'wiggle': ['wig', 'wiggle'], 'xls': ['xls'], 'xlsx': ['xlsx'], 'xmfa': ['xmfa'], 'yaml': ['yaml', 'YAML']}¶
List of formats and their extensions included in Bioconvert
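Since the mapping goes from format to extensions, finding the format of a given file requires the inverse mapping. A minimal sketch, using a small excerpt of the dictionary above (the helper name formats_by_extension is an assumption), shows that one extension may map to several formats:

```python
from collections import defaultdict

# Small excerpt of the mapping above
extensions = {
    "fasta": ["fasta", "fa", "fst"],
    "fastq": ["fastq", "fq"],
    "wig": ["wig"],
    "wiggle": ["wig", "wiggle"],
}


def formats_by_extension(mapping):
    """Invert {format: [extensions]} into {extension: [formats]}."""
    inverted = defaultdict(list)
    for fmt, exts in mapping.items():
        for ext in exts:
            inverted[ext.lower()].append(fmt)
    return dict(inverted)


lookup = formats_by_extension(extensions)
```

Here the wig extension is ambiguous because it is listed under both wig and wiggle.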
Graph¶
Network tools to manipulate the graph of conversion
- create_graph(filename, layout='dot', use_singularity=False, color_for_disabled_converter='red', include_subgraph=False)[source]¶
- Parameters:
filename -- should end in .png or .svg or .dot
If the extension is .dot, only the dot file is created, without annotations. This is useful if you have issues installing graphviz. If so, under Linux you could use our singularity container; see github.com/cokelaer/graphviz4all
Registry¶
Main bioconvert registry that fetches automatically the relevant converter
- class Registry[source]¶
class to centralise information about available conversions
from bioconvert.core.registry import Registry
r = Registry()
r.conversion_exists("BAM", "BED")
r.info() # returns number of available methods for each converter
conv_class = r[(".bam", ".bed")]
converter = conv_class(input_file, output_file)
converter.convert()
- conversion_path(input_fmt, output_fmt)[source]¶
Return a list of conversion steps to get from the input format to the output format
Each step in the list is a pair of formats.
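A path made of pairs of formats can be found with a breadth-first search over the graph of direct conversions. The sketch below is a generic illustration under that assumption; it is not the Registry implementation, and the small list of direct conversions is only a sample based on converters documented in this reference.

```python
from collections import deque


def conversion_path(conversions, input_fmt, output_fmt):
    """Shortest chain of (in, out) pairs from input_fmt to output_fmt, or []."""
    graph = {}
    for src, dst in conversions:
        graph.setdefault(src, []).append(dst)

    # Breadth-first search, remembering how each format was reached
    previous = {input_fmt: None}
    queue = deque([input_fmt])
    while queue:
        current = queue.popleft()
        if current == output_fmt:
            break
        for nxt in graph.get(current, []):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    if output_fmt not in previous:
        return []

    # Walk back from the target to rebuild the list of steps
    steps = []
    node = output_fmt
    while previous[node] is not None:
        steps.append((previous[node], node))
        node = previous[node]
    return steps[::-1]


direct = [("BAM", "SAM"), ("BAM", "FASTQ"), ("FASTQ", "FASTA"), ("CRAM", "BAM")]
path = conversion_path(direct, "CRAM", "FASTA")
```

An empty list is returned both when no path exists and when the input and output formats are identical.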
- get_all_conversions()[source]¶
- Returns:
a generator which allows iterating over all available conversions and their availability; a conversion is encoded by a tuple of 2 strings (input format, output format)
- Return type:
generator (input format, output format, status)
- get_conversions()[source]¶
- Returns:
a generator which allows iterating over all available conversions; a conversion is encoded by a tuple of 2 strings (input format, output format)
- Return type:
generator
- get_conversions_from_ext()[source]¶
- Returns:
a generator which allows iterating over all available conversions; a conversion is encoded by a tuple of 2 strings (input extension, output extension)
- Return type:
generator
- get_converters_names()[source]¶
- Returns:
a generator that yields the name of each converter from the subclass (ConvBase object)
- Return type:
generator
- get_ext(ext_pair)[source]¶
Copy the registry into a dict that behaves like a list, so that a single key can hold multiple values and a key gives access to all converters able to perform the conversion from the input extension to the output extension.
- Parameters:
ext_pair (tuple of 2 strings) -- the input extension, the output extension
- Returns:
list of objects of subclass of ConvBase
- iter_converters(allow_indirect: bool = False)[source]¶
- Parameters:
allow_indirect (bool) -- also return indirect conversion
- Returns:
a generator to iterate over (in_fmt, out_fmt, converter class when direct, path when indirect)
- Return type:
a generator
- set_ext(ext_pair, convertor)[source]¶
Register a new convertor for a pair of input and output extensions in a list. There can be multiple convertors for one ext_pair.
- Parameters:
ext_pair (tuple) -- tuple containing the input extensions and the output extensions e.g. ( ("fastq",) , ("fasta",) )
convertor (list of ConvBase object) -- the convertor which handles the conversion from input_ext -> output_ext
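The idea of keeping several convertors under one extension pair can be sketched with a dictionary of lists. ExtRegistry below is a hypothetical, simplified class: in particular, it registers each combination of single extensions separately, which is an assumption of this sketch and may differ from how the Registry stores its keys.

```python
from collections import defaultdict


class ExtRegistry:
    """Hypothetical sketch: several converters may share one extension pair."""

    def __init__(self):
        self._by_ext = defaultdict(list)

    def set_ext(self, ext_pair, converter):
        in_exts, out_exts = ext_pair
        # Register the converter under every (input ext, output ext) combination
        for in_ext in in_exts:
            for out_ext in out_exts:
                self._by_ext[(in_ext, out_ext)].append(converter)

    def get_ext(self, ext_pair):
        # Return a copy so that callers cannot modify the registry by accident
        return list(self._by_ext.get(ext_pair, []))


registry = ExtRegistry()
registry.set_ext((("fastq", "fq"), ("fasta", "fa")), "FASTQ2FASTA")
registry.set_ext((("fq",), ("fa",)), "ANOTHER_CONVERTER")
```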
Utils¶
misc utility functions
- class TempFile(suffix='', dir=None)[source]¶
A small wrapper around tempfile.NamedTemporaryFile function
f = TempFile(suffix="csv")
f.name
f.delete() # alias to delete=False and close() calls
Copy from easydev package
- generate_outfile_name(infile, out_extension)[source]¶
simple utility to replace the file extension with the given one.
- get_extension(filename, remove_compression=False)[source]¶
Return extension of a filename
>>> get_extension("test.fastq")
fastq
>>> get_extension("test.fastq.gz")
fastq
Reference converters¶
Summary¶
Convert PDB to FAA format
All converters documentation¶
Convert ABI format to FASTA format
- class ABI2FASTA(infile, outfile, *args, **kargs)[source]¶
Convert ABI file to FASTA file
ABI files are created by ABI sequencing machines and include PHRED quality scores for base calls. This allows the creation of FastA files.
Method implemented is based on BioPython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Convert ABI format to FASTQ format
- class ABI2FASTQ(infile, outfile, *args, **kargs)[source]¶
Convert ABI file to FASTQ file
ABI files are created by ABI sequencing machines and include PHRED quality scores for base calls. This allows the creation of FastQ files.
Method implemented is based on BioPython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Convert ABI format to QUAL format
- class ABI2QUAL(infile, outfile, *args, **kargs)[source]¶
ABI files are created by ABI sequencing machines and include PHRED quality scores for base calls. This allows the creation of QUAL files.
Method implemented is based on BioPython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Convert BAM format to BEDGRAPH format
- class BAM2BEDGRAPH(infile, outfile)[source]¶
Convert sorted BAM file into BEDGRAPH file
Compute the coverage (depth) in BEDGRAPH. Regions with zero coverage are also reported.
Note that this BEDGRAPH format is of the form:
chrom chromStart chromEnd dataValue
Note that consecutive positions with same values are compressed.
chr1 0 75 0
chr1 75 176 1
chr1 176 177 2
Warning
the BAM file must be sorted. This can be achieved with bamtools.
Methods available are based on bedtools [BEDTOOLS] and mosdepth [MOSDEPTH].
Constructor
- Parameters:
- _default_method = 'bedtools'¶
Default value
Convert BAM format to COV format
- class BAM2COV(infile, outfile)[source]¶
Convert sorted BAM file into COV file
Note that the COV format is of the form:
chr1 1 0
chr1 2 0
chr1 3 0
chr1 4 0
chr1 5 0
that is contig name, position, coverage.
Warning
the BAM file must be sorted. This can be achieved with bamtools using bamtools sort -in INPUT.bam
Methods available are based on samtools [SAMTOOLS] or bedtools [BEDTOOLS].
Constructor
- Parameters:
Convert BAM file to BIGWIG format
- class BAM2BIGWIG(infile, outfile, *args, **kargs)[source]¶
Convert BAM file to BIGWIG file
Convert BAM into a binary version of WIG format.
Methods are based on bamCoverage [DEEPTOOLS] and bedGraphToBigWig from wiggletools [WIGGLETOOLS]. The wiggletools method requires an extra argument (--chrom-sizes); therefore the default one is bamCoverage for now.
Moreover, the two methods do not return exactly the same information!
You can check this by using bioconvert to convert into a human-readable file such as wiggle. We will use bamCoverage as our default conversion.
constructor
- _default_method = 'bamCoverage'¶
Default value
Convert BAM file to CRAM format
- class BAM2CRAM(infile, outfile, *args, **kargs)[source]¶
The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.
Methods available are based on samtools [SAMTOOLS].
constructor
- _default_method = 'samtools'¶
Default value
Convert BAM format to FASTA format
- class BAM2FASTA(infile, outfile)[source]¶
Convert sorted BAM file into FASTA file
Methods available are based on samtools [SAMTOOLS] or bedtools [BEDTOOLS].
Warning
Using the bedtools method, the R1 and R2 reads must be next to each other so that the reads are sorted similarly
Warning
there is no guarantee that the R1/R2 output files are sorted similarly in the paired-end case, due to supplementary and secondary reads
constructor
- _default_method = 'samtools'¶
Default value
Convert BAM format to FASTQ format
- class BAM2FASTQ(infile, outfile)[source]¶
Convert sorted BAM file into FASTQ file
Methods available are based on samtools [SAMTOOLS] or bedtools [BEDTOOLS].
Warning
Using the bedtools method, the R1 and R2 reads must be next to each other so that the reads are sorted similarly
Warning
there is no guarantee that the R1/R2 output files are sorted similarly in the paired-end case, due to supplementary and secondary reads
constructor
- _default_method = 'samtools'¶
Default value
Convert BAM format to JSON format
- class BAM2JSON(infile, outfile)[source]¶
Convert BAM format to JSON file
Methods available are based on bamtools [BAMTOOLS].
constructor
- _default_method = 'bamtools'¶
Default value
Convert BAM file to SAM format
- class BAM2SAM(infile, outfile, *args, **kargs)[source]¶
Methods available are based on samtools [SAMTOOLS], sam-to-bam [SAMTOBAM], sambamba [SAMBAMBA] and pysam [PYSAM].
constructor
- _default_method = 'sambamba'¶
default value
- _method_sambamba(*args, **kwargs)[source]¶
Here we use the Sambamba tool. This is the default method because it is the fastest.
Convert BAM file to TSV format
- class BAM2TSV(infile, outfile, *args, **kargs)[source]¶
Convert sorted BAM file into TSV stats
This is not a conversion per se but the extraction of BAM statistics saved into a TSV format. The 4 columns of the TSV file are:
Reference sequence name, Sequence length, Mapped reads, Unmapped reads
Methods are based on samtools [SAMTOOLS] and pysam [PYSAM].
constructor
- _default_method = 'samtools'¶
Default value
- class BAM2WIGGLE(infile, outfile)[source]¶
Convert sorted BAM file into WIGGLE file
Methods available are based on wiggletools [WIGGLETOOLS].
- Parameters:
Convert BCF file to VCF format
- class BCF2VCF(infile, outfile, *args, **kargs)[source]¶
Methods available are based on bcftools [BCFTOOLS].
constructor
Convert BCF format to WIGGLE format
- class BCF2WIGGLE(infile, outfile)[source]¶
Convert sorted BCF file into WIGGLE file
Methods available are based on wiggletools [WIGGLETOOLS].
- Parameters:
- _default_method = 'wiggletools'¶
Default value
Convert BED format to WIGGLE format
- class BED2WIGGLE(infile, outfile)[source]¶
Convert sorted BED file into WIGGLE file
Methods available are based on wiggletools [WIGGLETOOLS].
- Parameters:
Convert BEDGRAPH file to COV format
- class BEDGRAPH2COV(infile, outfile)[source]¶
Converts a BEDGRAPH (4 cols) to COV format (3 cols)
Input example:
chr19 4930200 4930205 -1
chr19 4930205 4930210 1
becomes:
chr19 4930201 -1
chr19 4930202 -1
chr19 4930203 -1
chr19 4930204 -1
chr19 4930205 -1
chr19 4930206 1
chr19 4930207 1
chr19 4930208 1
chr19 4930209 1
chr19 4930210 1
Method available is a Bioconvert implementation (Python).
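The expansion from intervals to one line per position can be sketched in a few lines. This is an illustration rather than the Bioconvert implementation; the function name bedgraph_to_cov is an assumption, and the interval coordinates are chosen so that the result matches the 10-line output listed above.

```python
def bedgraph_to_cov(lines):
    """Yield (chrom, position, value) for every base covered by each interval."""
    for line in lines:
        if not line.strip():
            continue
        chrom, start, end, value = line.split()[:4]
        # BEDGRAPH starts are 0-based and ends exclusive; COV positions are 1-based
        for position in range(int(start) + 1, int(end) + 1):
            yield chrom, position, value


example = ["chr19 4930200 4930205 -1", "chr19 4930205 4930210 1"]
rows = list(bedgraph_to_cov(example))
```

Each 4-column interval therefore produces as many 3-column rows as it has positions.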
constructor
- _default_method = 'python'¶
Default value
Convert BEDGRAPH to BIGWIG format
- class BEDGRAPH2BIGWIG(infile, outfile)[source]¶
Converts BEDGRAPH format to BIGWIG format
Conversion is based on the bedGraphToBigWig tool. Note that the --chrom-sizes argument is required.
constructor
- _default_method = 'ucsc'¶
Default value
Convert BEDGRAPH format to WIGGLE format
- class BEDGRAPH2WIGGLE(infile, outfile)[source]¶
Convert sorted BEDGRAPH file into WIGGLE file
Methods available are based on wiggletools [WIGGLETOOLS].
- Parameters:
- _default_method = 'wiggletools'¶
Default value
Convert BIGBED format to WIGGLE format
- class BIGBED2WIGGLE(infile, outfile)[source]¶
Convert sorted BIGBED file into WIGGLE file
Methods available are based on wiggletools [WIGGLETOOLS].
- Parameters:
- _default_method = 'wiggletools'¶
Default value
Convert BIGBED format to BED format
- class BIGBED2BED(infile, outfile)[source]¶
Converts a sequence alignment in BIGBED format to BED4 format
Methods available are based on pybigwig [DEEPTOOLS].
constructor
- _default_method = 'pybigwig'¶
Default value
Convert BIGWIG to BEDGRAPH format
- class BIGWIG2BEDGRAPH(infile, outfile)[source]¶
Converts a sequence alignment in BIGWIG format to BEDGRAPH format
Conversion is based on ucsc bigWigToBedGraph tool or pybigwig (default) [DEEPTOOLS].
constructor
- _default_method = 'pybigwig'¶
Default value
- _method_pybigwig(*args, **kwargs)[source]¶
In this method we use the python extension written in C, pyBigWig.
Convert BIGWIG format to WIGGLE format
- class BIGWIG2WIGGLE(infile, outfile)[source]¶
Convert sorted BIGWIG file into WIGGLE file
Methods available are based on pybigwig [DEEPTOOLS].
- Parameters:
- _default_method = 'wiggletools'¶
Default value
Convert BPLINK to PLINK format
- class BPLINK2PLINK(infile, outfile=None, *args, **kwargs)[source]¶
Converts a genotype dataset bed+bim+fam in BPLINK format to ped+map PLINK format.
Conversion is based on plink [PLINK] executable.
Warning
plink takes several inputs and outputs and does not need extensions. What is required is a prefix. Bioconvert usage is therefore:
bioconvert bplink2plink plink_toy
Since there is no extension, you must be explicit by providing the conversion name (bplink2plink). This command will search for 3 input files plink_toy.bed, plink_toy.bim and plink_toy.fam. It will then create two output files named plink_toy.ped and plink_toy.map
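The relationship between the prefix and the files involved can be spelled out as follows. The helper names plink_files and missing_inputs are hypothetical; only the file extensions come from the description above.

```python
import os


def plink_files(prefix):
    """Return (inputs, outputs) file names derived from a PLINK prefix."""
    inputs = [f"{prefix}.{ext}" for ext in ("bed", "bim", "fam")]
    outputs = [f"{prefix}.{ext}" for ext in ("ped", "map")]
    return inputs, outputs


def missing_inputs(prefix):
    """List the expected input files that do not exist on disk."""
    inputs, _ = plink_files(prefix)
    return [name for name in inputs if not os.path.exists(name)]


inputs, outputs = plink_files("plink_toy")
```

Checking for missing inputs before launching the conversion gives a clearer error than letting the external tool fail.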
constructor
- _default_method = 'plink'¶
Default value
- class BZ22GZ(infile, outfile, *args, **kargs)[source]¶
Methods based on bunzip2 or zlib/bz2 Python libraries.
constructor
- _default_method = 'bz2_gz'¶
Default value
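A conversion from bz2 to gz relying only on the Python standard library can be sketched as below. This is a minimal illustration, not the code of the BZ22GZ methods; the function name bz2_to_gz is an assumption.

```python
import bz2
import gzip
import shutil


def bz2_to_gz(infile, outfile):
    """Re-compress a .bz2 file as .gz, streaming so memory use stays small."""
    with bz2.open(infile, "rb") as src, gzip.open(outfile, "wb") as dst:
        # copyfileobj reads and writes in blocks instead of loading the file
        shutil.copyfileobj(src, dst)
```

Streaming matters here because sequencing files are often too large to hold in memory once decompressed.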
Convert CLUSTAL to FASTA format
- class CLUSTAL2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from CLUSTAL to FASTA format.
Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].
constructor
- _default_method = 'biopython'¶
Default value
Convert CLUSTAL to PHYLIP format
- class CLUSTAL2PHYLIP(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from CLUSTAL format to PHYLIP format.
Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].
constructor
- _default_method = 'biopython'¶
Default value
Convert CLUSTAL to STOCKHOLM format
- class CLUSTAL2STOCKHOLM(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from CLUSTAL format to STOCKHOLM format.
Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].
constructor
- _default_method = 'biopython'¶
Default value
Convert CRAM file to BAM format
- class CRAM2BAM(infile, outfile, *args, **kargs)[source]¶
The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.
Methods available are based on samtools [SAMTOOLS].
constructor
- _default_method = 'samtools'¶
Default value
Convert CRAM file to FASTA format
- class CRAM2FASTA(infile, outfile, *args, **kargs)[source]¶
Convert CRAM file to FASTA file
Methods available are based on samtools [SAMTOOLS].
constructor
- _default_method = 'samtools'¶
Default value
Convert CRAM file to FASTQ format
- class CRAM2FASTQ(infile, outfile, *args, **kargs)[source]¶
Convert CRAM file to FASTQ file
Methods available are based on samtools [SAMTOOLS].
constructor
- _default_method = 'samtools'¶
Default value
Convert CRAM file to SAM format
- class CRAM2SAM(infile, outfile, *args, **kargs)[source]¶
The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.
Methods available are based on samtools [SAMTOOLS].
constructor
- _default_method = 'samtools'¶
Default value
Convert CSV format to TSV format
- class CSV2TSV(infile, outfile)[source]¶
Convert CSV file into TSV file
Methods available are based on Python or Pandas [PANDAS].
See also
TSV2CSV
Constructor
- _default_method = 'python'¶
Default value
- _method_pandas(in_sep=',', out_sep='\t', line_terminator='\n', *args, **kwargs)[source]¶
- _method_python(in_sep=',', out_sep='\t', line_terminator='\n', *args, **kwargs)[source]¶
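A pure Python conversion with the same parameters as the signatures above can be sketched with the csv module. This is an illustration under that assumption, not the body of _method_python; the function name csv_to_tsv is hypothetical.

```python
import csv


def csv_to_tsv(infile, outfile, in_sep=",", out_sep="\t", line_terminator="\n"):
    """Rewrite a delimited text file with another separator."""
    with open(infile, newline="") as fin, open(outfile, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter=in_sep)
        writer = csv.writer(fout, delimiter=out_sep, lineterminator=line_terminator)
        # Rows are streamed one by one, so the file is never fully in memory
        writer.writerows(reader)
```

Using the csv module instead of a plain string replacement keeps quoted fields that contain the input separator intact.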
- class CSV2XLS(infile, outfile, *args, **kargs)[source]¶
Methods available are based on python, pyexcel [PYEXCEL], or pandas [PANDAS].
constructor
- _default_method = 'pandas'¶
Default value
- _method_pandas(in_sep=',', sheet_name='Sheet 1', *args, **kwargs)[source]¶
Convert a compressed FASTQ from DSRC to GZ format
- class DSRC2GZ(infile, outfile, *args, **kargs)[source]¶
Convert a compressed FASTQ from DSRC to GZ format
Methods available are based on dsrc [DSRC] and pigz [PIGZ].
constructor
- _default_method = 'dsrcpigz'¶
Default value
- _method_dsrcpigz(*args, **kwargs)[source]¶
Do the conversion dsrc -> GZ. Method that uses pigz and dsrc.
pigz documentation dsrc documentation
The threading option does not work with the dsrc version from conda, so we do not add the -t threads option
Convert EMBL file to FASTA format
- class EMBL2FASTA(infile, outfile, *args, **kargs)[source]¶
Convert EMBL file to FASTA file
Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Convert EMBL file to GENBANK format
- class EMBL2GENBANK(infile, outfile, *args, **kargs)[source]¶
Convert EMBL file to GENBANK file
Methods available are based on squizz [SQUIZZ] and biopython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Convert FASTA format to FASTQ format
- class FASTA_QUAL2FASTQ(infile, outfile)[source]¶
Convert FASTA and QUAL back into a FASTQ file
Method based on pysam [PYSAM].
- Parameters:
- _default_method = 'pysam'¶
Default value
Convert FASTA to CLUSTAL format
- class FASTA2CLUSTAL(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from FASTA to CLUSTAL format
Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON], and goalign [GOALIGN].
constructor
- _default_method = 'biopython'¶
Default value
- _method_biopython(*args, **kwargs)[source]¶
Convert FASTA interleaved file in CLUSTAL format using biopython.
Convert FASTA format to FAA format
- class FASTA2FAA(infile, outfile)[source]¶
The method available is a Bioconvert implementation.
- Parameters:
- _default_method = 'bioconvert'¶
Default value
Convert FASTA format to FASTQ format
- class FASTA2FASTQ(infile, outfile)[source]¶
Methods available are based on pysam [PYSAM].
- Parameters:
- _default_method = 'pysam'¶
Default value
Convert FASTA to GENBANK format
- class FASTA2GENBANK(infile, outfile, *args, **kargs)[source]¶
Convert FASTA file to GENBANK file
Methods available are based on squizz [SQUIZZ] or biopython [BIOPYTHON] or Bioconvert pure implementation (default).
constructor
- _default_method = 'bioconvert'¶
Default value
- class FASTA2NEXUS(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment in FASTA format to NEXUS format
Methods available are based on goalign [GOALIGN].
constructor
- _default_method = 'goalign'¶
Default value
Convert FASTA to PHYLIP format
- class FASTA2PHYLIP(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment in FASTA format to PHYLIP format
Conversion is based on Biopython modules.
Methods available are based on squizz [SQUIZZ], biopython [BIOPYTHON] or goalign [GOALIGN]. See https://github.com/bioconvert/bioconvert/issues/149 regarding the default method; the default listed below is biopython. The PHYLIP file created is a strict PHYLIP file, that is, with 10 characters in the first column.
constructor
- _default_method = 'biopython'¶
Default value
Convert FASTA to TWOBIT format
- class FASTA2TWOBIT(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment in FASTA format to TWOBIT format
Methods available are based on UCSC faToTwoBit [UCSC].
constructor
- _default_method = 'ucsc'¶
Default value
- class FASTQ2FASTA(infile, outfile)[source]¶
This converter has many methods. Some of them have been removed or commented out over time. The BioPython method, for instance, is commented out because of its poor performance compared to the others. This does not mean that it should not be considered: its performance is reduced by the many sanity checks it performs.
Similarly, the bioawk and python_external methods are commented out because they are redundant with other equivalent methods.
- Parameters:
- _default_method = 'bioconvert'¶
Default value
- _method_awk(*args, **kwargs)[source]¶
Here we are using the awk method.
Note
Another method with awk has been tested but is less efficient. Here is the command that was tested:
box.awkcmd = """awk '{{if(NR%4==1) {{printf(">%s\n",substr($0,2));}} else if(NR%4==2) print;}}' """
- _method_bioconvert(*args, **kwargs)[source]¶
Bioconvert implementation in pure Python. This is the default method because it is the fastest.
- _method_mappy(*args, **kwargs)[source]¶
This method provides a fast and accurate C program to align genomic sequences and transcribe nucleotides.
- _method_mawk(*args, **kwargs)[source]¶
This variant of the awk method uses mawk, a lighter and faster implementation of awk.
Note
Other methods with mawk have been tested but are less efficient. Here are the commands that were tested:
mawkcmd_v2 = """mawk '{{if(NR%4==1) {{printf(">%s\n",substr($0,2));}} else if(NR%4==2) print;}}' """
mawkcmd_v3 = """mawk '(++n<=0){next}(n!=1){print;n=-2;next}{print">"substr($0,2)}'"""
- _method_perl(*args, **kwargs)[source]¶
This method uses the perl command which will call the "fastq2fasta.pl" script.
- _method_sed(*args, **kwargs)[source]¶
This method uses the UNIX function sed which is a non-interactive editor.
Note
Another method with sed has been tested but is less efficient. Here is the command that was tested:
cmd = """sed -n 's/^@/>/p;n;p;n;n'"""
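The awk and mawk commands quoted in the notes above all rely on the same idea: in a FASTQ file made of 4-line records, keep line 1 (the header, with "@" replaced by ">") and line 2 (the sequence). The sketch below illustrates that idea in Python. It is not the Bioconvert implementation; the function name fastq_to_fasta is hypothetical, and it assumes strict 4-line records (no multi-line sequences).

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write the header and sequence lines of each 4-line FASTQ record."""
    for index, line in enumerate(fastq_handle):
        position = index % 4
        if position == 0:
            # Header line: replace the leading "@" with ">"
            fasta_handle.write(">" + line[1:])
        elif position == 1:
            # Sequence line: copied unchanged
            fasta_handle.write(line)
        # Lines 3 ("+") and 4 (qualities) are dropped


fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nJJJJ\n")
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue())
```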
Convert GENBANK to EMBL format
- class GENBANK2EMBL(infile, outfile, *args, **kargs)[source]¶
Convert GENBANK file to EMBL file
Some description.
constructor
- _default_method = 'biopython'¶
Default value
Convert GENBANK to FASTA format
- class GENBANK2FASTA(infile, outfile, *args, **kargs)[source]¶
Convert GENBANK file to FASTA file
Methods are based on biopython [BIOPYTHON], squizz [SQUIZZ] and our own Bioconvert implementation.
constructor
- _default_method = 'biopython'¶
Default value
Convert GENBANK to GFF3 format
- class GENBANK2GFF3(infile, outfile, *args, **kargs)[source]¶
Convert GENBANK file to GFF3 file
Method based on biocode.
constructor
- _default_method = 'biocode'¶
Default value
- _method_biocode(*args, **kwargs)[source]¶
Uses scripts from biocode copied and modified in bioconvert.utils.biocode
Please see Main entry
- class GFF22GFF3(infile, outfile, *args, **kargs)[source]¶
constructor
Method available is pure Python.
- _default_method = 'bioconvert'¶
Default value
- class GFF32GFF2(infile, outfile, *args, **kargs)[source]¶
Method available is Python-based.
constructor
- _default_method = 'bioconvert'¶
Default value
- class GFA2FASTA(infile, outfile)[source]¶
Convert sorted GFA file into FASTA file
Available methods are based on awk or python (default)
See also
bioconvert.simulator.gfa
- Parameters:
- _default_method = 'python'¶
Default value
- _method_awk(*args, **kwargs)[source]¶
For this method, we use the awk tools.
- Returns:
the standard output
- Return type:
io.StringIO
object.
Note
This method folds the sequence to 80 characters per line.
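Folding a sequence to 80 characters per line can be sketched in Python as follows. This only illustrates what folding means; the helper name fold_sequence is hypothetical and is not part of the Bioconvert API.

```python
def fold_sequence(sequence, width=80):
    """Split a sequence into lines of at most `width` characters."""
    return "\n".join(
        sequence[start:start + width] for start in range(0, len(sequence), width)
    )


folded = fold_sequence("A" * 170)
print([len(line) for line in folded.split("\n")])
```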
- class GZ2BZ2(infile, outfile, *args, **kargs)[source]¶
Unzip input file using pigz or gunzip and compress using pbzip2. Default is pigz/pbzip2.
constructor
- _default_method = 'pigz_pbzip2'¶
Default value
- _method_gunzip_bzip2(*args, **kwargs)[source]¶
Single-threaded conversion. Method that uses gunzip and bzip2.
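For readers without the external tools, the same single-threaded decompress-then-recompress step can be sketched with the Python standard library. This is an illustration only, not what Bioconvert runs; the function name gz_to_bz2 and the file names are hypothetical.

```python
import bz2
import gzip
import os
import shutil
import tempfile


def gz_to_bz2(infile, outfile):
    """Stream a gzip file into a bzip2 file without loading it in memory."""
    with gzip.open(infile, "rb") as source, bz2.open(outfile, "wb") as target:
        shutil.copyfileobj(source, target)


# Small round trip on a temporary file to show the usage
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "reads.fastq.gz")
    bz2_path = os.path.join(tmp, "reads.fastq.bz2")
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b"@read1\nACGT\n+\nIIII\n")
    gz_to_bz2(gz_path, bz2_path)
    with bz2.open(bz2_path, "rb") as handle:
        roundtrip = handle.read()
```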
- class GZ2DSRC(infile, outfile, *args, **kargs)[source]¶
Convert compressed fastq.gz file into DSRC compressed file
constructor
- _default_method = 'pigzdsrc'¶
Default value
- _method_pigzdsrc(*args, **kwargs)[source]¶
do the conversion gz -> DSRC
- Returns:
the standard output
- Return type:
io.StringIO
object.
Method that uses pigz and dsrc.
- class JSON2YAML(infile, outfile, *args, **kargs)[source]¶
Convert JSON file into YAML file
Conversion is based on the yaml and json Python modules. Indentation is set to 4 by default and affects the sections (not the list). For example:
fruits_list:
- apple
- orange
section1:
    do: true
    misc: 1
constructor
- _default_method = 'yaml'¶
Default value
Convert MAF file to SAM format
- class MAF2SAM(infile, outfile)[source]¶
This is the Multiple alignment format or MIRA assembly format
This is not Mutation Annotation Format (somatic)
pbsim creates this kind of data
Some references:
https://github.com/arq5x/nanopore-scripts/master/maf-convert.py
http://bioperl.org/formats/alignment_formats/MAF_multiple_alignment_format.html
Those two codes were in Py2 at the time of this implementation. We re-used some of the information from maf-convert but the code in bioconvert.io.maf can be considered original.
constructor
- Parameters:
- _default_method = 'python'¶
Default value
Converts NEWICK file to NEXUS format.
- class NEWICK2NEXUS(infile, outfile=None, *args, **kwargs)[source]¶
Converts a tree file from NEWICK format to NEXUS format.
Methods available are based on gotree [GOTREE].
constructor
- _default_method = 'gotree'¶
Default value
Converts NEWICK file to PHYLOXML format.
- class NEWICK2PHYLOXML(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a tree file from NEWICK format to PHYLOXML format.
Methods available are based on gotree [GOTREE].
constructor
- _default_method = 'gotree'¶
Default value
- class NEXUS2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from NEXUS format to FASTA format.
constructor
- _default_method = 'biopython'¶
Default value
- _method_biopython(*args, **kwargs)[source]¶
- Convert NEXUS interleaved or sequential file in FASTA format using biopython.
The FASTA output file will be an aligned FASTA file.
For instance:
We have a NEXUS input file that looks like
#NEXUS
[TITLE: Test file]
begin data;
dimensions ntax=3 nchar=123;
format interleave datatype=DNA missing=N gap=-;
matrix
read3 -AT--------CCCGCTCGATGGGCCTCATTGCGTCCACTAGTTGATCTT
read2 -----------------------GGAAGCCCACGCCACGGTCTTGATACG
read4 ---------------------AGGGATGAACGATGCTCGCAGTTGATGCT
read3 CTGGAGTAT---T----TAGGAAAGCAAGTAAACTCCTTGTACAAATAAA
read2 AATTTTTCTAATGGCTATCCCTACATAACCTAACCGGGCATGTAATGTGT
read4 CAGAAGTGCCATTGCGGTAGAAACAAATGTTCCCAGATTGTTGACTGATA
read3 GATCTTA-----GATGGGCAT--
read2 CACCGTTGTTTCGACGTAAAGAG
read4 AGTAGGACCTCAGTCGTGACT--
;
end;
begin assumptions;
options deftype=unord;
end;
the output file will look like
>read3
-AT--------CCCGCTCGATGGGCCTCATTGCGTCCACTAGTTGATCTTCTGGAGTAT-
--T----TAGGAAAGCAAGTAAACTCCTTGTACAAATAAAGATCTTA-----GATGGGCA
T--
>read2
-----------------------GGAAGCCCACGCCACGGTCTTGATACGAATTTTTCTA
ATGGCTATCCCTACATAACCTAACCGGGCATGTAATGTGTCACCGTTGTTTCGACGTAAA
GAG
>read4
---------------------AGGGATGAACGATGCTCGCAGTTGATGCTCAGAAGTGCC
ATTGCGGTAGAAACAAATGTTCCCAGATTGTTGACTGATAAGTAGGACCTCAGTCGTGAC
T--
and not
>read3
ATCCCGCTCGATGGGCCTCATTGCGTCCACTAGTTGATCTTCTGGAGTATTTAGGAAAGC
AAGTAAACTCCTTGTACAAATAAAGATCTTAGATGGGCAT
>read2
GGAAGCCCACGCCACGGTCTTGATACGAATTTTTCTAATGGCTATCCCTACATAACCTAA
CCGGGCATGTAATGTGTCACCGTTGTTTCGACGTAAAGAG
>read4
AGGGATGAACGATGCTCGCAGTTGATGCTCAGAAGTGCCATTGCGGTAGAAACAAATGTT
CCCAGATTGTTGACTGATAAGTAGGACCTCAGTCGTGACT
Converts NEXUS file to NEWICK format.
- class NEXUS2NEWICK(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a tree file from NEXUS format to NEWICK format.
Methods available are based on biopython [BIOPYTHON] or gotree [GOTREE].
constructor
- _default_method = 'gotree'¶
Default value
Converts NEXUS file to PHYLIP format.
- class NEXUS2PHYLIP(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from NEXUS format to PHYLIP format.
Methods available are based on goalign [GOALIGN].
constructor
- _default_method = 'goalign'¶
Default value
Converts NEXUS file to PHYLOXML format.
- class NEXUS2PHYLOXML(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a tree file from NEXUS format to PHYLOXML format.
Methods available are based on gotree [GOTREE].
constructor
- _default_method = 'gotree'¶
Default value
Convert ODS format to CSV format
- class ODS2CSV(infile, outfile)[source]¶
Convert ODS file into CSV file
Method based on pyexcel [PYEXCEL].
constructor
- _default_method = 'pyexcel'¶
Default value
Converts PHYLIP file to CLUSTAL format.
- class PHYLIP2CLUSTAL(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from PHYLIP format to CLUSTAL format
Methods available are based on biopython [BIOPYTHON] and squizz [SQUIZZ].
constructor
- _default_method = 'biopython'¶
Default value
Converts PHYLIP file to FASTA format.
- class PHYLIP2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment in PHYLIP format to FASTA format
Methods available are based on biopython [BIOPYTHON] and squizz [SQUIZZ].
constructor
- _default_method = 'biopython'¶
Default value
Converts PHYLIP file to NEXUS format.
- class PHYLIP2NEXUS(infile, outfile=None, *args, **kwargs)[source]¶
Converts a sequence alignment from PHYLIP format to NEXUS format.
Methods available are based on goalign [GOALIGN].
constructor
- _default_method = 'goalign'¶
Default value
Converts PHYLIP file to STOCKHOLM format.
- class PHYLIP2STOCKHOLM(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from PHYLIP interleaved to STOCKHOLM
Methods available are based on biopython [BIOPYTHON] and squizz [SQUIZZ].
constructor
- _default_method = 'biopython'¶
Default value
Converts PHYLIP file to XMFA format.
- class PHYLIP2XMFA(infile, outfile=None, *args, **kwargs)[source]¶
Converts a sequence alignment from PHYLIP format to XMFA
Methods available are based on biopython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Converts PHYLOXML file to NEWICK format.
- class PHYLOXML2NEWICK(infile, outfile=None, *args, **kwargs)[source]¶
Converts a tree file from PHYLOXML format to NEWICK format.
Methods available are based on gotree [GOTREE].
constructor
- _default_method = 'gotree'¶
Default value
Converts PHYLOXML file to NEXUS format.
- class PHYLOXML2NEXUS(infile, outfile=None, *args, **kwargs)[source]¶
Converts a tree file from PHYLOXML format to NEXUS format.
Methods available are based on gotree [GOTREE].
constructor
- _default_method = 'gotree'¶
Default value
- class PLINK2BPLINK(infile, outfile=None, *args, **kwargs)[source]¶
Converts a genotype dataset ped+map in PLINK format to bed+bim+fam BPLINK format
Conversion is based on plink executable
constructor
- _default_method = 'plink'¶
Default value
Convert SAM file to BAM format
- class SAM2BAM(infile, outfile, *args, **kargs)[source]¶
constructor
- _default_method = 'samtools'¶
Default value
Convert SAM file to CRAM format
- class SAM2CRAM(infile, outfile, reference=None, *args, **kargs)[source]¶
The conversion requires the reference corresponding to the input file. It can be provided as an argument with the standalone (--reference). Otherwise, users are asked to provide it.
Methods available are based on samtools [SAMTOOLS].
constructor
- _default_method = 'samtools'¶
Default value
- class SAM2PAF(infile, outfile, *args, **kargs)[source]¶
The SAM and PAF formats are described in the Formats section.
Description:
The header of the SAM file (lines starting with @) is dropped. However, the length of the target is retrieved from the @SQ line, which must be present.
Consider this SAM file with only two alignments. The first one is aligned on the target while the second is not (as indicated by the * characters):
@SQ SN:ENA|K01711|K01711.1 LN:15894
@PG ID:minimap2 PN:minimap2 VN:2.5-r572 CL:minimap2 -a measles.fa Hm2_GTGAAA_L005_R1_001.fastq.gz
HISEQ:426:C5T65ACXX:5:2302:1943:2127 0 ENA|K01711|K01711.1 448 60 101M * 00 CTTACCTTCGCATCAAGAGGTACCAACATGGAGGATGAGGCGGACCAATACTTTTCACATGATGATCCAATTAGTAGTGATCAATCCAGGTTCGGATGGTT BCCFFFFFHHHHHIIJJJJJJIIJJJJJJJJFHIHIJJJIJIIIIGHFFFFFFEEEEEEEDDDDDFDDDDDDDDD>CDDEDEEDDDDDDCCDDDDDDDDCD NM:i:0 ms:i:202 AS:i:202 nn:i:0 tp:A:P cm:i:14 s1:i:94 s2:i:0
HISEQ:426:C5T65ACXX:5:2302:4953:2090 4 * 0 0 * * 0 0 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAACAACCAAAAAGAGACGAACAA CCCFFDDFAFFBHJHGGGIHIJBGGHIIJJJJJJJHGEIJGIFIIIHCBGHIJIIIIIJJHHHHEF@D@=;=,0)0&5&))+(((+((((&+(((()&&)(
The equivalent PAF file is
HISEQ:426:C5T65ACXX:5:2302:1943:2127 101 0 101 + ENA|K01711|K01711.1 15894 447 548 101 101 60 NM:i:0 ms:i:202 AS:i:202 nn:i:0 tp:A:P cm:i:14 s1:i:94 s2:i:0 cg:Z:101M
In brief, the sequences are dropped. The final file is therefore smaller. Extra fields (starting from NM:i:0) can be dropped or kept using the keep_extra_field argument. Alignments with * characters are dropped. The first line (@SQ) is used to retrieve the length of the contigs, which is stored in the PAF file (column 7).
The 12 compulsory PAF fields are:
Col  Type    Description
1    string  Query sequence name
2    int     Query sequence length
3    int     Query start (0-based)
4    int     Query end (0-based)
5    char    Relative strand: "+" or "-"
6    string  Target sequence name
7    int     Target sequence length
8    int     Target start on original strand (0-based)
9    int     Target end on original strand (0-based)
10   int     Number of residue matches
11   int     Alignment block length
12   int     Mapping quality (0-255; 255 for missing)
For developers:
Get the measles data from Sequana library (2 paired fastq files):
minimap2 measles.fa R1.fastq > approx-mapping.paf
You can ask minimap2 to generate CIGAR at the cg tag of PAF with:
minimap2 -c measles.fa R1.fastq > alignment.paf
or to output alignments in the SAM format:
minimap2 -a measles.fa R1.fastq > alignment.sam
The SAM lines must contain 11 positional elements and the NM:i and nn:i fields (see example above).
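To make the mapping between SAM columns and the 12 PAF fields concrete, here is a minimal Python sketch. It is not the Bioconvert code nor a faithful port of sam2paf.js: the function name sam_to_paf is hypothetical, extra tags and the cg:Z: field are not written, the nn:i field is ignored, and the number of residue matches is assumed to be the alignment block length minus NM.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_to_paf(sam_lines):
    """Return one list of 12 PAF fields per mapped SAM record."""
    target_lengths = {}
    records = []
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("@"):
            # Header lines are dropped, but @SQ gives the target length
            if fields[0] == "@SQ":
                tags = dict(tag.split(":", 1) for tag in fields[1:])
                target_lengths[tags["SN"]] = int(tags["LN"])
            continue
        qname, flag, rname, pos, mapq, cigar = fields[:6]
        flag = int(flag)
        if flag & 4 or rname == "*" or cigar == "*":
            continue  # unmapped alignments are dropped
        # Assumption: the NM:i tag is present, as the text above requires
        nm = next(int(tag[5:]) for tag in fields[11:] if tag.startswith("NM:i:"))
        matched = inserted = deleted = ref_span = 0
        clips = [0, 0]  # soft/hard clipping at the start and at the end
        for index, (length, op) in enumerate(CIGAR_RE.findall(cigar)):
            length = int(length)
            if op in "M=X":
                matched += length
                ref_span += length
            elif op == "I":
                inserted += length
            elif op == "D":
                deleted += length
                ref_span += length
            elif op == "N":
                ref_span += length
            elif op in "SH":
                clips[0 if index == 0 else 1] += length
        query_length = matched + inserted + clips[0] + clips[1]
        if flag & 16:  # reverse strand: clipping is mirrored
            strand, query_start, query_end = "-", clips[1], query_length - clips[0]
        else:
            strand, query_start, query_end = "+", clips[0], query_length - clips[1]
        block_length = matched + inserted + deleted
        target_start = int(pos) - 1  # SAM is 1-based, PAF is 0-based
        records.append([
            qname, query_length, query_start, query_end, strand,
            rname, target_lengths[rname], target_start, target_start + ref_span,
            block_length - nm, block_length, int(mapq),
        ])
    return records


example = [
    "@SQ\tSN:ENA|K01711|K01711.1\tLN:15894",
    "\t".join(["read1", "0", "ENA|K01711|K01711.1", "448", "60", "101M",
               "*", "0", "0", "A" * 101, "I" * 101, "NM:i:0", "nn:i:0"]),
    "\t".join(["read2", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII"]),
]
paf = sam_to_paf(example)
print(paf)
```

With the mapped read of this toy input (same position and CIGAR as in the example above), the sketch reproduces the first 12 columns of the PAF line shown earlier, apart from the shortened read name.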
constructor
- Parameters:
- Reference:
This function is a direct translation of https://github.com/lh3/miniasm/blob/master/misc/sam2paf.js (Dec. 2017).
- _default_method = 'python'¶
Default value
Convert SCF file to FASTA file
- class SCF2FASTA(infile, outfile)[source]¶
Converts a binary SCF/ABI file to Fasta format.
constructor
- Parameters:
- _default_method = 'python'¶
Default value
Convert SCF file to FASTQ file
- class SCF2FASTQ(infile, outfile)[source]¶
Converts a binary SCF file to FastQ file
constructor
- Parameters:
- _default_method = 'python'¶
Default value
Convert SRA format to FASTQ format
- class SRA2FASTQ(infile, outfile, test=False)[source]¶
Download FASTQ from SRA archive
bioconvert sra2fastq ERR043367
This may take some time since the files are downloaded from the SRA website.
constructor
https://edwards.flinders.edu.au/fastq-dump/
library used: sra-toolkit
- _default_method = 'fastq_dump'¶
Default value
Converts STOCKHOLM file to CLUSTAL file.
- class STOCKHOLM2CLUSTAL(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment from STOCKHOLM format to CLUSTAL format
Methods available are based on squizz [SQUIZZ] and biopython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Converts STOCKHOLM to PHYLIP format.
- class STOCKHOLM2PHYLIP(infile, outfile=None, *args, **kwargs)[source]¶
Converts a sequence alignment from STOCKHOLM format to PHYLIP interleaved format
Methods available are based on squizz [SQUIZZ], and biopython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
Convert TSV format to CSV format
- class TSV2CSV(infile, outfile)[source]¶
Convert TSV file into CSV file
Methods available are based on Python or pandas [PANDAS].
See also
CSV2TSV
Constructor
- _default_method = 'python'¶
Default value
- _method_pandas(in_sep='\t', out_sep=',', line_terminator='\n', *args, **kwargs)[source]¶
- _method_python(in_sep='\t', out_sep=',', line_terminator='\n', *args, **kwargs)[source]¶
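The three keyword arguments shared by both methods (in_sep, out_sep, line_terminator) map naturally onto the standard csv module. The sketch below shows one way such a conversion could be written. It is an illustration under that assumption, not the Bioconvert implementation, and the function name tsv_to_csv is hypothetical.

```python
import csv
import io


def tsv_to_csv(in_handle, out_handle, in_sep="\t", out_sep=",", line_terminator="\n"):
    """Re-write delimited text with another separator and line terminator."""
    reader = csv.reader(in_handle, delimiter=in_sep)
    writer = csv.writer(out_handle, delimiter=out_sep, lineterminator=line_terminator)
    for row in reader:
        # Fields containing the output separator are quoted by the csv module
        writer.writerow(row)


tsv = io.StringIO("gene\tdescription\nBRCA1\tDNA repair, tumour suppressor\n")
csv_out = io.StringIO()
tsv_to_csv(tsv, csv_out)
print(csv_out.getvalue())
```

Using a CSV writer rather than a plain string replacement matters when a field already contains the output separator, as in the second row of this example.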
Conversion from TWOBIT to FASTA format
- class TWOBIT2FASTA(infile, outfile=None, alphabet=None, *args, **kwargs)[source]¶
Converts a sequence alignment in TWOBIT format to FASTA format
Conversion is based on UCSC [UCSC] and py2bit.
constructor
- _default_method = 'py2bit'¶
Default value
- class VCF2BCF(infile, outfile=None, *args, **kwargs)[source]¶
Convert VCF file to BCF format
Method based on bcftools [BCFTOOLS].
- Parameters:
- _default_method = 'bcftools'¶
Default value
- class VCF2BED(infile, outfile)[source]¶
Convert VCF file to BED3 file by extracting positions.
The awk method implemented here below reports an interval of 1 for SNP, the length of the insertion or the length of the deleted part in case of deletion.
constructor
- Parameters:
- _default_method = 'awk'¶
Default value
- _method_awk(*args, **kwargs)[source]¶
do the conversion VCF -> BED using awk
- Returns:
the standard output
- Return type:
io.StringIO
object.
Convert VCF format to WIGGLE format
- class VCF2WIGGLE(infile, outfile)[source]¶
Convert sorted VCF file into WIGGLE file
- Parameters:
- _default_method = 'wiggletools'¶
Default value
Convert XLS format to CSV format
- class XLS2CSV(infile, outfile)[source]¶
Convert XLS file into CSV file
Extra arguments when using Bioconvert executable.
name               Description
--sheet-name       The name or id of the sheet to convert
--out-sep          The separator used in the output file
--line-terminator  The line terminator used in the output file
Methods available are based on pandas [PANDAS] and pyexcel [PYEXCEL].
Constructor
- _method_pandas(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]¶
Convert XLS format to CSV format
- class XLSX2CSV(infile, outfile)[source]¶
Convert XLSX file into CSV file
Extra arguments when using Bioconvert executable.
name               Description
--sheet-name       The name or id of the sheet to convert
--out-sep          The separator used in the output file
--line-terminator  The line terminator used in the output file
Methods available are based on pandas [PANDAS] and pyexcel [PYEXCEL].
Constructor
- _default_method = 'pandas'¶
Default value
- _method_pandas(out_sep=',', line_terminator='\n', sheet_name=0, *args, **kwargs)[source]¶
- class XMFA2PHYLIP(infile, outfile=None, *args, **kwargs)[source]¶
Converts a sequence alignment from XMFA to PHYLIP format.
Method available based on biopython [BIOPYTHON].
constructor
- _default_method = 'biopython'¶
Default value
IO functions¶
- read_from_buffer(f_file, length, offset)[source]¶
Return 'length' bits of file 'f_file' starting at offset 'offset'
- class MAFLine(line)[source]¶
A reader for MAF format.
mode refname start algsize strand refsize alignment
a
s ref 100 10 + 100000 ---AGC-CAT-CATT
s contig 0 10 + 10 ---AGC-CAT-CATT
a
s ref 100 12 + 100000 ---AGC-CAT-CATTTT
s contig 0 12 + 12 ---AGC-CAT-CATTTT
The alignments are stored by pair, one item for the reference, one for the query. The query (second line) starts at zero.
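As an illustration of the column layout above (mode, refname, start, algsize, strand, refsize, alignment) and of the pairing of reference and query lines, here is a small Python sketch. It is not the MAFLine class itself; the names MafRow, parse_maf_line and read_pairs are hypothetical.

```python
from collections import namedtuple

MafRow = namedtuple("MafRow", "mode refname start algsize strand refsize alignment")


def parse_maf_line(line):
    """Split one 's' line into typed fields."""
    mode, refname, start, algsize, strand, refsize, alignment = line.split()
    return MafRow(mode, refname, int(start), int(algsize), strand, int(refsize), alignment)


def read_pairs(lines):
    """Group consecutive 's' lines two by two: (reference, query)."""
    rows = [parse_maf_line(line) for line in lines if line.startswith("s ")]
    return list(zip(rows[0::2], rows[1::2]))


pairs = read_pairs([
    "a",
    "s ref 100 10 + 100000 ---AGC-CAT-CATT",
    "s contig 0 10 + 10 ---AGC-CAT-CATT",
])
reference, query = pairs[0]
print(reference.start, query.start)
```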
Formats¶
Below, we provide a list of formats used in bioinformatics or computational biology. Most of these formats are used in Bioconvert and available for conversion to other formats. Some are listed for book-keeping only.
We hope that this page will be useful to all developers and scientists. If you would like to contribute, please edit the file doc/formats.rst in our GitHub repository.
If you wish to update this page, please see the Developer guide page.
TWOBIT¶
- Format:
binary
- Status:
available
- Type:
sequence
A 2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.
The file begins with a 16-byte header containing the following fields:
signature: the number 0x1A412743 in the architecture of the machine that created the file
version: zero for now. Readers should abort if they see a version number higher than 0
sequenceCount: the number of sequences in the file
reserved: always zero for now
All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
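The signature check and byte-swap rule described above can be sketched with the standard struct module: try one byte order, and if the signature does not match, try the other. This is an illustrative reader for the 16-byte header only; the function name parse_twobit_header and the returned dictionary layout are hypothetical.

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743


def parse_twobit_header(header):
    """Decode the 16-byte 2bit header and detect the byte order."""
    for byte_order in ("<", ">"):  # little-endian first, then big-endian
        signature, version, sequence_count, reserved = struct.unpack(
            byte_order + "4I", header[:16]
        )
        if signature == TWOBIT_SIGNATURE:
            break
    else:
        raise ValueError("not a 2bit file: signature does not match")
    if version != 0:
        raise ValueError("unsupported 2bit version: %d" % version)
    return {
        "byte_order": byte_order,
        "sequence_count": sequence_count,
        "reserved": reserved,
    }


# A header written on a big-endian machine, declaring 3 sequences
big_endian_header = struct.pack(">4I", TWOBIT_SIGNATURE, 0, 3, 0)
info = parse_twobit_header(big_endian_header)
print(info)
```

The detected byte order would then be reused for every multi-byte field in the rest of the file.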
The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:
nameSize: a byte containing the length of the name field
name: the sequence name itself (in ASCII-compatible byte string), of variable length depending on nameSize
offset: the 32-bit offset of the sequence data relative to the start of the file, not aligned to any 4-byte padding boundary
The index is followed by the sequence records, which contain nine fields:
dnaSize - number of bases of DNA in the sequence
nBlockCount - the number of blocks of Ns in the file (representing unknown sequence)
nBlockStarts - an array of length nBlockCount of 32 bit integers indicating the (0-based) starting position of a block of Ns
nBlockSizes - an array of length nBlockCount of 32 bit integers indicating the length of a block of Ns
maskBlockCount - the number of masked (lower-case) blocks
maskBlockStarts - an array of length maskBlockCount of 32 bit integers indicating the (0-based) starting position of a masked block
maskBlockSizes - an array of length maskBlockCount of 32 bit integers indicating the length of a masked block
reserved - always zero for now
packedDna - the DNA packed to two bits per base, represented as follows: T - 00, C - 01, A - 10, G - 11. The first base is in the most significant 2 bits of the byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011.
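The packing rule for packedDna can be checked with a few lines of Python. This is a sketch of the encoding only, not a 2bit writer: the function name pack_dna is hypothetical, it accepts upper-case A, C, G and T only (N blocks and masking are stored in separate fields), and it assumes the last byte is padded with zero bits on the right.

```python
CODES = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}


def pack_dna(sequence):
    """Pack a DNA string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            # Shift previous bases left so the first base ends up
            # in the most significant 2 bits
            byte = (byte << 2) | CODES[base]
        # Assumption: an incomplete last byte is padded on the right
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


print(format(pack_dna("TCAG")[0], "08b"))
```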
Bioconvert conversions
AGP¶
- Format:
human-readable
- Status:
- Type:
assembly
AGP files are used to describe the assembly of a sequence from smaller fragments. The large object can be a contig, a scaffold (supercontig), or a chromosome. Each line (row) of the AGP file describes a different piece of the object, and has the column entries defined below. Several versions of the format exist: 1.0, 2.0 and 2.1.
You can validate your AGP file using this website: https://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi
ABI¶
- Format:
binary
- Status:
available
- Type:
sequence
ABI files are trace files that include the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note that each ABI file contains one and only one sequence (no need for indexing the file). The trace data contains the probabilities of the four nucleotide bases along the sequencing run together with the sequence deduced from that data. ABI trace is a binary format.
This file format is produced by ABI sequencing machines, which generate ABI "Sanger" capillary sequences.
ASQG¶
- Format:
human-readable
- Status:
not included (deprecated)
- Type:
assembly
The ASQG format describes an assembly graph. Each line is a tab-delimited record. The first field in each record describes the record type. The three types are:
HT: Header record. This record contains metadata tags for the file version (VN tag) and parameters associated with the graph (for example the minimum overlap length).
VT: Vertex records. The second field contains the vertex identifier, the third field contains the sequence. Subsequent fields contain optional tags.
- ED: Edge description records. Fields are:
sequence 1 name
sequence 2 name
sequence 1 overlap start (0 based)
sequence 1 overlap end (inclusive)
sequence 1 length
sequence 2 overlap start (0 based)
sequence 2 overlap end (inclusive)
sequence 2 length
sequence 2 orientation (1 for reversed with respect to sequence 1)
number of differences in overlap (0 for perfect overlaps, which is the default).
Example:
HT VN:i:1 ER:f:0 OL:i:45 IN:Z:reads.fa CN:i:1 TE:i:0
VT read1 GATCGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGG
VT read2 CGATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATA
VT read3 ATCTAGCTAGCTAGCTAGCTAGTTAGATGCATGCATGCTAGCTGGATATT
ED read2 read1 0 46 50 3 49 50 0 0
ED read3 read2 0 47 50 2 49 50 0 0
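The ten fields of an ED record listed above can be read into a named structure as in the following Python sketch. The names Edge and parse_ed_record are hypothetical and only serve the illustration. Because the overlap end is inclusive, the overlap length is end - start + 1.

```python
from collections import namedtuple

Edge = namedtuple(
    "Edge",
    "name1 name2 start1 end1 length1 start2 end2 length2 is_reversed differences",
)


def parse_ed_record(line):
    """Parse one ED (edge description) record of an ASQG file."""
    fields = line.split()
    if fields[0] != "ED":
        raise ValueError("not an ED record: %r" % line)
    name1, name2 = fields[1:3]
    numbers = [int(value) for value in fields[3:11]]
    # numbers[6] is the orientation flag, numbers[7] the number of differences
    return Edge(name1, name2, *numbers[:6], bool(numbers[6]), numbers[7])


edge = parse_ed_record("ED read2 read1 0 46 50 3 49 50 0 0")
overlap1 = edge.end1 - edge.start1 + 1  # overlap end is inclusive
overlap2 = edge.end2 - edge.start2 + 1
print(edge.name1, edge.name2, overlap1, overlap2)
```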
References
BAI¶
- Format:
binary
- Status:
not included
- Type:
index
The index file of a BAM file is a BAI file format. The BAI files are not used in Bioconvert.
BAM¶
- Format:
binary
- Status:
included
- Type:
Sequence alignment
The BAM (Binary Alignment Map) is the binary version of the Sequence Alignment Map (SAM) format. It is a compact and index-able representation of nucleotide sequence alignments.
Bioconvert Conversions
BAM2SAM, BAM2CRAM, BAM2BEDGRAPH, BAM2COV, BAM2BIGWIG, BAM2FASTA, BAM2FASTQ, BAM2JSON, BAM2TSV, BAM2WIGGLE
References
BCF¶
- Format:
binary
- Status:
included
- Type:
variant
Binary version of the Variant Call Format (VCF).
Bioconvert conversions
BCL¶
- Format:
binary
- Status:
not included
- Type:
sequence
BCL is the raw format used by Illumina sequencers. This data is converted into FastQ with a tool called bcl2fastq. This type of conversion is not included in Bioconvert. Indeed, Illumina provides a bcl2fastq executable and its user guide is available online. In most cases, the BCL files are already converted and users will only get the FastQ files, so we will not provide such a converter.
BED for plink¶
- Format:
binary
- Status:
included
- Type:
genotypic
This BED format is the binary PED file. Not to be confused with BED format used with BAM files. Please see PLINK binary files (BED/BIM/FAM) section.
BEDGRAPH¶
- Format:
human-readable
- Status:
included
- Type:
database
BedGraph is a subset of BED12 format. It is a 4-columns tab-delimited file with chromosome name, start and end positions and the fourth column is a number that is often used to show coverage depth. So, this is the same format as the BED4 format. Example:
chr1 0 75 0
chr1 75 176 1
chr1 176 177 2
See also
BED¶
- Format:
human-readable
- Status:
not included
- Type:
database
A Browser Extensible Data (BED) file is a tab-delimited text file. It is a concise way to represent genomic features and annotations.
The BED file is a very versatile format, which makes it difficult to handle in Bioconvert. So, let us describe exhaustively the BED format.
Although the BED format description supports up to 12 columns, only the first 3 are required for some tools such as the UCSC browser, Galaxy, or bedtools software.
So, in general, BED lines have 3 required fields and 9 additional optional fields.
Generally, all BED files have the same extensions (.bed) irrespective of the number of columns. We can refer to the 3-columns version as BED3, the 4-columns BED as BED4 and so on.
The number of fields per line must be consistent. If some fields are empty, additional column information must be filled for consistency (e.g., with a "."). BED fields can be whitespace-delimited or tab-delimited although some variations of BED types such as "bed Detail" require a tab character delimitation for the detail columns (see Note box here below).
Note
BED detail format
It is an extension of BED format plus 2 additional fields. The first one is an ID, which can be used in place of the name field for creating links from the details pages. The second additional field is a description of the item, which can be a long description and can consist of html.
- Requirements:
fields must be tab-separated
"type=bedDetail" must be included in the track line,
the name and position fields should uniquely describe items so that the correct ID and description will be displayed on the details pages.
The following example uses the first 4 columns of BED format, but up to 12 may be used. Note the header, which contains the type=bedDetail string:
track name=HbVar type=bedDetail description="HbVar custom track" db=hg19 visibility=3 url="blabla.html"
chr11 5246919 5246920 Hb_North_York 2619 Hemoglobin variant
chr11 5255660 5255661 HBD c.1 G>A 2659 delta0 thalassemia
chr11 5247945 5247946 Hb Sheffield 2672 Hemoglobin variant
chr11 5255415 5255416 Hb A2-Lyon 2676 Hemoglobin variant
chr11 5248234 5248235 Hb Aix-les-Bains 2677 Hemoglobin variant
Warning
Browsers such as the Genome Browser (http://genome.ucsc.edu/) can visualise BED files. Usually, BED files can be annotated using header lines, which begin with the word "browser" or "track" to assist the browser in the display and interpretation.
Such annotation track header lines are not permissible in utilities such as bedToBigBed, which convert lines of BED text to indexed binary files.
The file description below is modified from: http://genome.ucsc.edu/FAQ/FAQformat#format1.
The first three required BED fields are:
chrom - The name of the chromosome (e.g. chr3) or scaffold.
chromStart - The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome. The chromEnd base is not included in the display of the feature.
The 9 additional optional BED fields are:
name - Label of the BED line
score - A score between 0 and 1000. In the Genome Browser, if the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed.
strand - Defines the strand. Either "." (=no strand) or "+" or "-".
thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
thickEnd - The ending position at which the feature is drawn thickly.
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
blockCount - The number of blocks (exons) in the BED line.
blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
blockStarts - A comma-separated list of block starts. Should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
In BED files with block definitions, the first blockStart value must be 0, so that the first block begins at chromStart. Similarly, the final blockStart position plus the final blockSize value must equal chromEnd. Blocks may not overlap.
Here is a simple example:
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
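The block rules stated before the example (first blockStart equal to 0, last block ending at chromEnd, no overlapping blocks, list lengths equal to blockCount) can be checked with a short Python sketch. The function name check_bed12_blocks is hypothetical, and block starts are interpreted relative to chromStart, as the blockStarts description says.

```python
def check_bed12_blocks(line):
    """Return a list of problems found in the block fields of a BED12 line."""
    fields = line.split()
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    # Trailing commas are allowed, hence the filtering of empty items
    sizes = [int(item) for item in fields[10].split(",") if item]
    starts = [int(item) for item in fields[11].split(",") if item]

    problems = []
    if len(sizes) != block_count or len(starts) != block_count:
        problems.append("blockSizes/blockStarts do not match blockCount")
    if starts and starts[0] != 0:
        problems.append("first blockStart is not 0")
    if starts and sizes and chrom_start + starts[-1] + sizes[-1] != chrom_end:
        problems.append("last block does not end at chromEnd")
    for start, size, next_start in zip(starts, sizes, starts[1:]):
        if start + size > next_start:
            problems.append("blocks overlap")
            break
    return problems


good = "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
bad = "chr22 1000 5000 cloneX 960 + 1000 5000 0 2 567,488, 10,3000"
print(check_bed12_blocks(good))
print(check_bed12_blocks(bad))
```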
Note
If your data set is BED-like, but it is very large (over 50MB) you can convert it to a BIGBED format.
See also
BED3¶
A BED3 is supported by bedtools. It is a BED file where each feature is described by chrom, start and end (with tab-delimited values). Example:
chr1 100 120
See BED section for details.
BED4¶
A BED4 is a BED file where each feature is described by chrom, start, end and name (with tab-delimited values). The last column could also be a number. Example:
chr1 100 120 gene1
See BED section for details.
See also
BED5¶
A BED5 is supported by bedtools. It is a BED file where each feature is described by chrom, start, end, name and score (with tab-delimited values). Example:
chr1 100 120 gene1 0
See BED section for details.
BED6¶
A BED6 is supported by bedtools. It is a BED file where each feature is described by chrom, start, end, name, score and strand (with tab-delimited values). Example:
chr1 100 120 gene1 0 +
See BED section for details.
BED12¶
A BED12 is supported by bedtools. It is a BED file where each feature is described by all 12 BED fields. Example:
chr1 100 120 gene1 0 + 100 100 0 3 1,2,3 4,5,6
See BED section.
BIGBED¶
- Format:
binary
- Status:
included
- Type:
database/track
The bigBed format stores annotation items. BigBed files are created initially from BED type files. The resulting bigBed files are in an indexed binary format. The main advantage of bigBed files is that only the portions of the file needed to display a particular region are used.
Bioconvert conversions: BIGBED2COV, BIGBED2WIGGLE
BIGWIG¶
- Format:
binary
- Status:
included
- Type:
database/track
The bigWig format is useful for dense, continuous data. A bigWig file can be created from a wiggle file (WIGGLE (WIG)). It is an indexed binary format.
Wiggle data must be continuous, unlike BED data. You can convert a BED/BEDGraph file to bigWig using BEDGRAPH2BIGWIG.
To create a bigWig file from a wiggle file, you need to remove the existing "track" header.
Bioconvert conversions: BIGWIG2WIGGLE, BEDGRAPH2BIGWIG
Note
Wiggle, bigWig, and bigBed files use 0-based half-open coordinates. So to access the value for the first base on chr1, one would specify the starting position as 0 and the end position as 1. Similarly, bases 100 to 115 would have a start of 99 and an end of 115. Other formats, and the tools dealing with them, may use a different convention.
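As an illustration of the convention described in this note, here is a small sketch that converts between 1-based inclusive ranges and 0-based half-open ranges. The function names are hypothetical and not part of any library.

```python
def one_based_to_half_open(first, last):
    """Convert a 1-based inclusive range to a 0-based half-open (start, end)."""
    return first - 1, last


def half_open_to_one_based(start, end):
    """Convert a 0-based half-open (start, end) to a 1-based inclusive range."""
    return start + 1, end


print(one_based_to_half_open(1, 1))      # first base: (0, 1)
print(one_based_to_half_open(100, 115))  # bases 100 to 115: (99, 115)
```

With half-open coordinates, the length of a region is simply end - start.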
BIM¶
- Format:
human-readable
- Status:
included
- Type:
variants
The BIM formatted file is a variant information file accompanying a .bed or biallelic .pgen binary genotype table. Please see PLINK binary files (BED/BIM/FAM) section.
The fields are:
chromosome number (integer)
SNP marker ID (string) / variant ID
SNP genetic position (cM) (float) / position in centimorgans (safe to use dummy value 0)
SNP physical position (bp) (1-based)
Alternate allele code
Reference allele code
Here is an example:
1 rs0 0 1000 0 1
1 rs10 0 1001 2 1
BZ2¶
- Format:
binary
- Status:
included
- Type:
Compression
bzip2 is a file compression program that uses the Burrows–Wheeler algorithm. The extension is usually .bz2. For FastQ files, BZ2 compression is usually better than gzip (factor 2-3).
COV¶
A simple TSV file with 3 columns to store coverage in a continuous way. First column is contig/chromosome name, second is position and third is coverage. Expected positions are continuous. The BEDGRAPH stores an extra column but can be a more compact way of storing coverage/depth.
Example:
chr1 1 10
chr1 2 11
chr1 3 15
chr1 4 12
chr1 5 11
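To show why a BEDGRAPH-like layout can be more compact, the sketch below merges consecutive positions with identical coverage into intervals. It is an illustration only, not Bioconvert's implementation: the function name is hypothetical, COV positions are assumed to be 1-based, and the output uses 0-based half-open intervals.

```python
def cov_to_bedgraph(lines):
    """Collapse COV rows (chrom, position, coverage) into
    (chrom, start, end, coverage) intervals."""
    intervals = []
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, cov = line.split()
        pos = int(pos)
        last = intervals[-1] if intervals else None
        if last and last[0] == chrom and last[2] == pos - 1 and last[3] == cov:
            last[2] = pos  # extend the current run
        else:
            intervals.append([chrom, pos - 1, pos, cov])
    return [tuple(interval) for interval in intervals]


# Made-up input: two positions share the same coverage.
rows = ["chr1 1 10", "chr1 2 10", "chr1 3 15"]
print(cov_to_bedgraph(rows))  # [('chr1', 0, 2, '10'), ('chr1', 2, 3, '15')]
```

When the coverage changes at every position, as in the example above, nothing is merged and the interval form is longer than the COV form.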
CRAM¶
- Format:
binary
- Status:
not included
- Type:
Alignment
The CRAM file format is a more compact alternative to BAM files, with the benefit of saving much disk space. While BAM files contain all sequence data within a file, CRAM files are smaller because they rely on an additional external reference sequence file. This file is needed to both compress and decompress the read information.
See also
Bioconvert conversions: BAM2CRAM, SAM2CRAM, CRAM2BAM, CRAM2SAM.
CLUSTAL¶
- Format:
human-readable
- Status:
included
- Type:
multiple alignment
In the Clustal format, the first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Nevertheless, many such files start with CLUSTAL or CLUSTAL X. Other information in the first line is ignored. The rest of the file is organised as follows:
One or more empty lines.
One or more blocks of sequence data. Each block consists of one line for each sequence in the alignment, followed by a line showing the degree of conservation for the columns of the alignment in this block.
Each sequence line consists of the sequence name, white space, up to 60 sequence symbols and, optionally, white space followed by a cumulative count of residues for the sequence.
One or more empty lines after each block.
Some rules about representing sequences:
Case does not matter.
Sequence symbols should be from a valid alphabet.
Gaps are represented using hyphens ("-").
The characters used to represent the degree of conservation are:
* : all residues or nucleotides in that column are identical
: : conserved substitutions have been observed
. : semi-conserved substitutions have been observed
<SPACE> : no match.
Here is an example of a multiple alignment in CLUSTAL W format:
CLUSTAL W (1.82) multiple sequence alignment
FOSB_MOUSE MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
FOSB_HUMAN MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
************************************************************
FOSB_MOUSE TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL 98
FOSB_HUMAN TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL 98
***********************:**************
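The layout described above can be read with a few lines of Python. This is a minimal sketch, not the parser used by Bioconvert: the function name parse_clustal is hypothetical, the sample alignment is made up, and the sketch assumes that sequence names contain no white space.

```python
def parse_clustal(text):
    """Return {sequence name: aligned sequence} from CLUSTAL-formatted text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header")
    sequences = {}
    for line in lines[1:]:
        # Skip empty lines and conservation lines (only '*', ':', '.' and spaces).
        if not line.strip() or set(line) <= set("*:. "):
            continue
        fields = line.split()
        # fields[2], when present, is the optional cumulative residue count.
        name, chunk = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


example = """CLUSTAL W (1.82) multiple sequence alignment

seqA  MFQAF-GDY 8
seqB  MFQAFPGDY 9
      ***** ***

seqA  TSSF 12
seqB  TSSY 13
      ***:
"""
print(parse_clustal(example))
# {'seqA': 'MFQAF-GDYTSSF', 'seqB': 'MFQAFPGDYTSSY'}
```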
Some Bioconvert conversions: CLUSTAL2FASTA, CLUSTAL2NEXUS, CLUSTAL2PHYLIP, CLUSTAL2STOCKHOLM
CSV¶
- Format:
human-readable
- Type:
database
- Status:
included
A comma-separated values format is a delimited text file that uses a comma to separate values. See CSV format page for details.
DSRC¶
- Format:
binary
- Status:
included
- Type:
Compression
DSRC is a compression format dedicated to DNA sequences.
EMBL¶
- Format:
human-readable
- Status:
included
- Type:
database
EMBL format stores a sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word "ID". The start of the sequence section is marked by a line beginning with the word "SQ". The "//" (terminator) line contains no data or comments and designates the end of an entry.
An example sequence in EMBL format is:
ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240
gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga 300
agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca 360
gacctgaa 368
//
Bioconvert conversions:
FAM¶
- Format:
human-readable
- Status:
included
- Type:
database
The FAM format is used to store sample information accompanying a .bed or biallelic .pgen binary genotype table. Please see PLINK binary files (BED/BIM/FAM) section.
In brief, it stores the first 6 columns of the PED file. So it is a text file with no header line, and one line per sample with the following six fields:
Family ID ('FID')
Individual ID ('IID'; cannot be '0')
Individual ID of father ('0' if father isn't in dataset)
Individual ID of mother ('0' if mother isn't in dataset)
Sex code ('1' = male, '2' = female, '0' = unknown)
Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
For example:
1 1000000000 0 0 1 1
1 1000000001 0 0 1 2
FAA¶
Fasta formatted file storing amino acid sequences. A multiple protein fasta file can have the more specific extension mpfa.
FASTA¶
- Format:
human-readable
- Status:
included
- Type:
Sequence
FASTA format is one of the most widely used sequence formats. It can store multiple sequence records together with their identifiers.
A sequence entry has a one-line header followed by one or more lines of sequence. The header must start with the ">" character. The next word is the sequence identifier or the accession number; the rest of the line is considered as description.
The NCBI recommendations do not allow blank lines in the middle of FASTA files. Note, however, that some tools can handle blank lines by ignoring them. Including blank lines is nevertheless not recommended.
There is no standard file extension for a text file containing FASTA formatted sequences. Although there is a plethora of ad-hoc file extensions (fasta, fas, fa, seq, fsa, fna, ffn, faa, frn), we use only fasta, fa and fst within Bioconvert (see extensions). For completeness, fasta is the generic fasta file, fna stands for fasta nucleic acid, ffn for fasta nucleotide of gene regions, faa for fasta amino acid, frn for fasta non-coding RNA, etc.
An example sequence in FASTA format is:
>X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
GCCTCACTGGAGGGCATTGCCCCGGAAGATCAAGTCGTGCTCCTGGCAGGCGCGCCCCTGGAGGATGAGG
CCACTCTGGGCCAGTGCGGGGTGGAGGCCCTGACTACCCTGGAAGTAGCAGGCCGCATGCTTGGAGGTAA
AGTTCATGGTTCCCTGGCCCGTGCTGGAAAAGTGAGAGGTCAGACTCCTAAGGTGGCCAAACAGGAGAAG
AAGAAGAAGAAGACAGGTCGGGCTAAGCGGCGGATGCAGTACAACCGGCGCTTTGTCAACGTTGTGCCCA
CCTTTGGCAAGAAGAAGGGCCCCAATGCCAACTCTTAAGTCTTTTGTAATTCTGGCTTTCTCTAATAAAA
AAGCCACTTAGTTCAGTCAAAAAAAAAA
In this example, the header (also known as description line) is formatted as:
>ID description
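A reader for this layout can be sketched in a few lines. The code below is an illustration only, not Bioconvert's reader: the names read_fasta and split_header are hypothetical, and blank lines are tolerated and ignored, as some tools do.

```python
def split_header(header):
    """Split a header (without '>') into (identifier, description)."""
    identifier, _, description = header.partition(" ")
    return identifier, description


def read_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


records = list(read_fasta([">X65923.1 H.sapiens fau mRNA", "TTCCTC", "TTTCTC"]))
print(records)  # [('X65923.1', 'H.sapiens fau mRNA', 'TTCCTCTTTCTC')]
```

Because only the header layout differs between FASTA variants, the same reader works for all of them; only split_header would need to change to extract, for example, pipe-separated fields.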
Many variants of the FASTA format exist, but they differ only in the way the header is written. All start with the ">" sign though. We cite a few variants below (for simplicity, we give only 2 lines per sequence).
The NCBI style defines the identifier with database name, entry ID and optional accession or sequence version number separated by pipes:
>embl|X65923|X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
The NCBI FASTA databases are listed in https://tinyurl.com/y6wrzyad
The GI style is the same as NCBI style except that the sequence GI code is given instead of the entry ID:
>gi|31302|gnl|genbank|X65923 (X65923.1) H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
There is also a CGC-style FASTA format (not to be confused with the GCG format). Its header includes an optional database name as part of the identifier, separated by the : sign:
>DATABASE_NAME:ID accession description
>embl:X65923 X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
More generally, we have the FASTA with accession and description style. The accession number or sequence version is included after the identifier:
>X65923 X65923.1 H.sapiens fau mRNA
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGCAGCTCTTTGT
CCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCGCCCAGATCAAGGCTCATGTA
GCCTCACTGGAGGGCATTGCCCCGGAAGATCAAGTCGTGCTCCTGGCAGGCGCGCCCCTGGAGGATGAGG
CCACTCTGGGCCAGTGCGGGGTGGAGGCCCTGACTACCCTGGAAGTAGCAGGCCGCATGCTTGGAGGTAA
AGTTCATGGTTCCCTGGCCCGTGCTGGAAAAGTGAGAGGTCAGACTCCTAAGGTGGCCAAACAGGAGAAG
AAGAAGAAGAAGACAGGTCGGGCTAAGCGGCGGATGCAGTACAACCGGCGCTTTGTCAACGTTGTGCCCA
CCTTTGGCAAGAAGAAGGGCCCCAATGCCAACTCTTAAGTCTTTTGTAATTCTGGCTTTCTCTAATAAAA
AAGCCACTTAGTTCAGTCAAAAAAAAAA
Note
The original FASTA format may include comments introduced by the ; sign. This is no longer supported by most programs.
Bioconvert conversions: FASTQ2FASTA, FASTA2FASTQ, FASTA2CLUSTAL, FASTA2NEXUS, FASTA2TWOBIT
References
NCBI recommendations: https://tinyurl.com/y6wrzyad
FastG¶
- Format:
- Status:
not included
- Type:
assembly
FastG is a graph format used to faithfully represent genome assemblies in the face of allelic polymorphism and assembly uncertainty.
FastQ¶
- Format:
human-readable
- Status:
included
- Type:
Sequence
FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (QUAL). A FASTQ file can contain several sequences. FASTQ variants differ only in the encoding of the quality scores. Currently, the recommended variant is the Sanger encoding, also used by Illumina 1.8. It encodes Phred quality scores from 0 to 93 using ASCII characters 33 to 126. This format is also referred to as PHRED+33, meaning there is an offset of 33 in the ASCII code. Other variants, such as FASTQ-solexa or earlier Illumina versions, use a different offset. Currently, the conversions included in Bioconvert do not need to be aware of the quality score encoding.
A FASTQ file uses four lines per sequence:
a '@' character, followed by a sequence identifier and an optional description
the raw sequence letters.
a '+' character, optionally followed by the same sequence identifier (and any description)
quality values for the sequence in Line 2
An example sequence in FASTQ format is:
@SEQUENCE_ID1
GTGGAAGTTCTTAGGGCATGGCAAAGAGT
+
FAFFADEDGDBGEGGBCGGHE>EEBA@@=
@SEQUENCE_ID2
GTGGAAGTTCTTAGG
+
FAFFADEDGDBGEGG
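The four-line layout and the PHRED+33 offset can be illustrated as follows. This is a minimal sketch under stated assumptions, not Bioconvert's reader: the function name read_fastq is hypothetical, each record is assumed to span exactly four lines, and the qualities are assumed to use the Sanger (PHRED+33) encoding.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, Phred scores) from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # PHRED+33: subtract the ASCII offset of 33 from each character.
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


record = ["@SEQUENCE_ID2", "GTGGAAGTTCTTAGG", "+", "FAFFADEDGDBGEGG"]
for identifier, sequence, scores in read_fastq(record):
    print(identifier, scores[:3])  # SEQUENCE_ID2 [37, 32, 37]
```

The character F has ASCII code 70, hence a Phred score of 70 - 33 = 37.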
Bioconvert conversions: FASTQ2FASTA, FASTA2FASTQ
GENBANK¶
- Format:
human-readable
- Status:
included
- Type:
annotation/sequence
GenBank format (GenBank Flat File Format) stores sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word LOCUS. The start of sequence section is marked by a line beginning with the word ORIGIN and the end of the section is marked by a line with only "//".
GenBank format for protein has been renamed GenPept.
An example sequence in GenBank format is:
LOCUS AB000263 368 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete
cds.
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa
//
Bioconvert conversions
GENPEPT¶
see GENBANK
GFA¶
- Format:
human-readable
- Status:
included
- Type:
assembly graph
The Graphical Fragment Assembly (GFA) can be used to represent genome assemblies. GFA stores sequence graphs as the product of an assembly, a representation of variation in genomes, splice graphs in genes, or even overlap between reads from long-read sequencing technology.
The GFA format is a tab-delimited text format for describing a set of sequences and their overlap. The first field of the line identifies the type of the line. Header lines start with H. Segment lines start with S. Link lines start with L. A containment line starts with C. A path line starts with P.
Segment: a continuous sequence or subsequence.
Link: an overlap between two segments. Each link is from the end of one segment to the beginning of another segment. The link stores the orientation of each segment and the number of overlapping base pairs.
Containment: an overlap between two segments where one is contained in the other.
Path: an ordered list of oriented segments, where each consecutive pair of oriented segments is supported by a link record.
See details in the reference above.
Example:
H VN:Z:1.0
S 11 ACCTT
S 12 TCAAGG
S 13 CTTGATT
L 11 + 12 - 4M
L 12 - 13 + 5M
L 11 + 13 + 3M
P 14 11+,12-,13+ 4M,5M
Note: segment lines sometimes have an extra (fourth) field. The conversion to FASTA stores this fourth field after the name.
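The behaviour described in this note can be sketched as follows. This is an illustration of the idea, not Bioconvert's actual implementation: the function name is hypothetical, fields are assumed to be tab-separated, and the LN:i:6 tag in the sample is a made-up fourth field.

```python
def gfa_segments_to_fasta(lines):
    """Build FASTA text from the segment (S) lines of a GFA file."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue  # keep only segment lines
        name, sequence = fields[1], fields[2]
        extra = fields[3:4]  # optional fourth field, kept after the name
        header = " ".join([name] + extra)
        records.append(f">{header}\n{sequence}")
    return "\n".join(records)


gfa = ["H\tVN:Z:1.0", "S\t11\tACCTT", "S\t12\tTCAAGG\tLN:i:6", "L\t11\t+\t12\t-\t4M"]
print(gfa_segments_to_fasta(gfa))
```

Header, link, containment and path lines carry no sequence of their own, so they are skipped.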
GFA2 is a generalization of GFA that allows one to specify an assembly graph in either less detail, e.g. just the topology of the graph, or more detail, e.g. the multi-alignment of reads giving rise to each sequence. It is further designed to be able to represent a string graph at any stage of assembly, from the graph of all overlaps to a final resolved assembly of contig paths with multi-alignments. Apart from meeting these needs, the extensions also support other assembly and variation graph types.
Like GFA, GFA2 is tab-delimited in that every lexical token is separated from the next by a single tab.
Bioconvert conversions
References:
GTF¶
- Format:
human-readable
- Status:
included
- Type:
Annotation
GTF (Gene Transfer Format, also referred to as GTF2) is a file format used to represent genomic features and their locations in a genome. It is derived from the General Feature Format version 2 (GFF2). It is a tab-delimited text file that contains one line for each genomic feature, with each line consisting of nine fields.
The fields in a GTF2 file are as follows:
Seqid: The identifier of the genomic sequence.
Source: The source of the annotation.
Feature: The type of feature.
Start: The starting position of the feature.
End: The ending position of the feature.
Score: A score associated with the feature.
Strand: The strand on which the feature is located.
Phase: The phase of the feature, if applicable.
Attributes: A list of attributes associated with the feature, encoded as a semicolon-separated list of key-value pairs.
GFF¶
- Format:
human-readable
- Status:
included
- Type:
Annotation
GFF is a standard file format for storing genomic features in a text file. GFF stands for Generic Feature Format. It is a 9-column tab-delimited file, each line of which corresponds to an annotation, or feature.
GFF v2 is deprecated and v3 should be used instead. In particular, GFF2 is unable to deal with the three-level hierarchy of gene -> transcript -> exon.
The first line is a comment (starting with #) followed by a series of data lines, each of which corresponds to an annotation. Here is an example:
##gff-version 3
ctg123 . exon 1300 1500 . + . ID=exon00001
ctg123 . exon 1050 1500 . + . ID=exon00002
ctg123 . exon 3000 3902 . + . ID=exon00003
ctg123 . exon 5000 5500 . + . ID=exon00004
ctg123 . exon 7000 9000 . + . ID=exon00005
The header is compulsory and the following lines must have 9 columns as follows:
seqname - The name of the sequence (e.g. chromosome) on which the feature exists. Any string can be used. For example, chr1, III, contig1112.23. Any character not in [a-zA-Z0-9.:^*$@!+_?-|] must be escaped with the % character followed by its hexadecimal value.
source - The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc. If there is no source, use the . character.
feature - The feature type name. Equivalent to BED’s name field. For example, exon, etc. Should be a term from the lite sequence ontology (SOFA).
start - The one-based starting position of feature on seqname. Note that GFF uses one-based positions whereas BED uses a zero-based start position.
end - The one-based ending position of feature on seqname.
score - A score assigned to the GFF feature.
strand - Defines the strand. Use +, - or .
frame/phase - The frame of the coding sequence. Use 0, 1, 2. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
attribute - A list of feature attributes in the format tag=value separated by semicolons. All non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t'). Tabs must be replaced with the %09 URL escape. There are predefined tags:
ID: unique identifier of the feature.
Name: name of the feature
Alias
Parent: can be used to group exons into transcripts, transcripts into genes and so on.
Target
Gap
Derives_from
Note
Dbxref
Ontology_term
Multiple values of the same attribute are separated by commas. Tags are case sensitive: Parent is different from parent.
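The nine columns and the attribute syntax can be parsed as sketched below. This is an illustration, not Bioconvert's parser: the names GFF_COLUMNS and parse_gff3_line are hypothetical, and the sample line (including its Parent values) is made up.

```python
from urllib.parse import unquote

GFF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 data line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 9:
        raise ValueError("a GFF data line must have 9 tab-separated columns")
    record = dict(zip(GFF_COLUMNS, values))
    record["start"] = int(record["start"])  # one-based
    record["end"] = int(record["end"])      # one-based, inclusive
    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair or pair == ".":
            continue
        tag, _, value = pair.partition("=")
        # Values of one tag are comma-separated; decode %XX escapes.
        attributes[tag] = [unquote(item) for item in value.split(",")]
    record["attributes"] = attributes
    return record


line = "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001;Parent=mRNA1,mRNA2"
record = parse_gff3_line(line)
print(record["attributes"])  # {'ID': ['exon00001'], 'Parent': ['mRNA1', 'mRNA2']}
print(record["start"] - 1)   # 1299, the equivalent zero-based BED start
```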
GZ¶
- Format:
binary
- Status:
included
- Type:
Compression
gzip is a file compression program that is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.
JSON¶
- Format:
human-readable
- Status:
included
- Type:
database
JSON stands for JavaScript Object Notation. Basic data types used in JSON:
Number: a signed decimal number that may contain a fractional part and may use exponential E notation, but cannot include non-numbers such as NaN. The format makes no distinction between integer and floating-point. JavaScript uses a double-precision floating-point format for all its numeric values, but other languages implementing JSON may encode numbers differently.
String: a sequence of zero or more Unicode characters. Strings are delimited with double-quotation marks and support a backslash escaping syntax.
Boolean: either of the values true or false
Array: an ordered list of zero or more values, each of which may be of any type. Arrays use square bracket notation and elements are comma-separated.
Object: an unordered collection of name–value pairs where the names (also called keys) are strings. Since objects are intended to represent associative arrays, it is recommended that each key is unique within an object. Objects are delimited with curly brackets and use commas to separate each pair, while within each pair the colon ':' character separates the key or name from its value.
null: An empty value, using the word null
Limited whitespace is allowed and ignored around or between syntactic elements (values and punctuation, but not within a string value). Only four specific characters are considered whitespace for this purpose: space, horizontal tab, line feed, and carriage return. In particular, the byte order mark must not be generated by a conforming implementation (though it may be accepted when parsing JSON). JSON does not provide syntax for comments.
Example:
{
"database": "AB",
"date": "13-10-2010",
"entries":
[
{
"ID": 1,
"coverage": 10
},
{
"ID": 2,
"coverage": 15
}
]
}
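As a short illustration, the example above can be loaded with the json module of the Python standard library; objects become dictionaries and arrays become lists. The variable names are only illustrative.

```python
import json

text = """
{
  "database": "AB",
  "date": "13-10-2010",
  "entries": [
    {"ID": 1, "coverage": 10},
    {"ID": 2, "coverage": 15}
  ]
}
"""

data = json.loads(text)
coverage_by_id = {entry["ID"]: entry["coverage"] for entry in data["entries"]}
print(coverage_by_id)  # {1: 10, 2: 15}
```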
MAF (Mutation Annotation Format)¶
- Format:
human-readable
- Status:
not included
- Type:
multiple alignment
MAF (Multiple Alignment Format)¶
- Format:
human-readable
- Status:
included
- Type:
phylogeny
The Multiple Alignment Format stores a series of multiple alignments.
Warning
Not to be confused with MAF (Mutation Annotation Format)
Here are some rules about the MAF syntax:
It is line-oriented.
Each multiple alignment ends with a blank line.
Each sequence in an alignment is on a single line, which can get quite long, but there is no length limit.
Words in a line are delimited by any white space.
Lines starting with # are considered to be comments.
Lines starting with ## can be ignored by most programs, but contain meta-data of one form or another.
The file is divided into paragraphs that terminate in a blank line.
Within a paragraph, the first word of a line indicates its type.
Each multiple alignment is in a separate paragraph that begins with an a line and contains an s line for each sequence in the multiple alignment.
Some MAF files may contain other optional line types:
i line contains information about what is in the aligned species DNA before and after the immediately preceding s line
e line contains information about the size of the gap between the alignments that span the current block
q line indicates the quality of each aligned base for the species.
Here is an example of s lines (alignment block):
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
The s and a lines define a multiple alignment. The columns of the s lines have the following fields:
src: The name of one of the source sequences for the alignment. The form 'database.chromosome' allows automatic creation of links to other assemblies in some browsers.
start: The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).
size: The size of the aligning region in the source sequence. This number is equal to the number of non-dash characters in the alignment text field below.
strand: Either + or -. If -, then the alignment is to the reverse-complemented source.
srcSize: The size of the entire source sequence, not just the parts involved in the alignment.
text: The nucleotides (or amino acids) in the alignment and any insertions (dashes).
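The fields of an s line can be read as sketched below. The names MafSequence, parse_maf_s_line and forward_start are hypothetical. The forward-strand formula follows from the definition of start on the reverse-complemented sequence given above.

```python
from collections import namedtuple

MafSequence = namedtuple("MafSequence", "src start size strand src_size text")


def parse_maf_s_line(line):
    """Parse one MAF 's' line into a MafSequence."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a MAF 's' line")
    record = MafSequence(fields[1], int(fields[2]), int(fields[3]),
                         fields[4], int(fields[5]), fields[6])
    # size must equal the number of non-dash characters in the text field.
    if record.size != len(record.text.replace("-", "")):
        raise ValueError("size does not match the aligned text")
    return record


def forward_start(record):
    """Zero-based start of the aligned region on the forward strand."""
    if record.strand == "+":
        return record.start
    return record.src_size - (record.start + record.size)


record = parse_maf_s_line("s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca")
print(record.src, record.start, record.size)  # hg16.chr7 27707221 13
```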
Lines starting with i give information about what's happening before and after this block in the aligning species:
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
i panTro1.chr6 N 0 C 0
s baboon 249182 13 + 4622798 gcagctgaaaaca
i baboon I 234 n 19
The i lines contain information about the context of the sequence lines immediately preceding them. The following fields are defined by position rather than name=value pairs:
src: The name of the source sequence for the alignment. Should be the same as the s line immediately above this line.
leftStatus: A character that specifies the relationship between the sequence in this block and the sequence that appears in the previous block.
leftCount: Usually the number of bases in the aligning species between the start of this alignment and the end of the previous one.
rightStatus: A character that specifies the relationship between the sequence in this block and the sequence that appears in the subsequent block.
rightCount: Usually the number of bases in the aligning species between the end of this alignment and the start of the next one.
The status characters can be one of the following values:
C: the sequence before or after is contiguous with this block.
I: there are bases between the bases in this block and the one before or after it.
N: this is the first sequence from this src chrom or scaffold.
n: this is the first sequence from this src chrom or scaffold but it is bridged by another alignment from a different chrom or scaffold.
M: there is missing data before or after this block (Ns in the sequence).
T: the sequence in this block has been used before in a previous block (likely a tandem duplication).
Lines starting with e give information about empty parts of the alignment block:
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
e mm4.chr6 53310102 13 + 151104725 I
The e lines indicate that there isn't aligning DNA for a species but that the current block is bridged by a chain that connects blocks before and after this block. The following fields are defined by position rather than name=value pairs.
src: The name of one of the source sequences for the alignment.
start: The start of the non-aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).
size: The size in base pairs of the non-aligning region in the source sequence.
strand: Either + or -. If -, then the alignment is to the reverse-complemented source.
srcSize: The size of the entire source sequence, not just the parts involved in the alignment.
status: A character that specifies the relationship between the non-aligning sequence in this block and the sequence that appears in the previous and subsequent blocks.
The status character can be one of the following values:
C: the sequence before and after is contiguous, implying that this region was either deleted in the source or inserted in the reference sequence. The browser draws a single line or a "-" in base mode in these blocks.
I: there are non-aligning bases in the source species between chained alignment blocks before and after this block. The browser shows a double line or "=" in base mode.
M: there are non-aligning bases in the source and more than 90% of them are Ns in the source. The browser shows a pale yellow bar.
n: there are non-aligning bases in the source and the next aligning block starts in a new chromosome or scaffold that is bridged by a chain between still other blocks. The browser shows either a single line or a double line based on how many bases are in the gap between the bridging alignments.
Lines starting with q give information about the quality of each aligned base for the species:
s hg18.chr1 32741 26 + 247249719 TTTTTGAAAAACAAACAACAAGTTGG
s panTro2.chrUn 9697231 26 + 58616431 TTTTTGAAAAACAAACAACAAGTTGG
q panTro2.chrUn 99999999999999999999999999
s dasNov1.scaffold_179265 1474 7 + 4584 TT----------AAGCA---------
q dasNov1.scaffold_179265 99----------32239---------
The q lines contain a compressed version of the actual raw quality data, representing the quality of each aligned base for the species with a single character of 0-9 or F. The following fields are defined by position rather than name=value pairs:
src: The name of the source sequence for the alignment. Should be the same as the "s" line immediately preceding this line.
value: A MAF quality value corresponding to the aligned nucleotide in the preceding "s" line. Insertions (dashes) in the preceding "s" line are represented by dashes in the "q" line as well. The quality value can be "F" (finished sequence) or a number derived from the actual quality scores (which range from 0-97) or the manually assigned score of 98. These numeric values are calculated as:
MAF quality value = min( floor(actual quality value/5), 9 )
This results in the following mapping:
MAF quality value | Raw quality score range | Quality level
---|---|---
0-8 | 0-44 | Low
9 | 45-97 | High
0 | 98 | Manually assigned
F | 99 | Finished
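The formula and the mapping above translate directly into code. In this sketch the function name maf_quality is hypothetical, and the raw score is assumed to be an integer between 0 and 99.

```python
def maf_quality(raw):
    """Map a raw quality score (0-99) to its MAF quality character."""
    if raw == 99:
        return "F"  # finished sequence
    if raw == 98:
        return "0"  # manually assigned score
    if not 0 <= raw <= 97:
        raise ValueError("raw quality score must be between 0 and 99")
    # MAF quality value = min(floor(actual quality value / 5), 9)
    return str(min(raw // 5, 9))


print([maf_quality(q) for q in (0, 44, 45, 97, 98, 99)])
# ['0', '8', '9', '9', '0', 'F']
```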
A Simple Example (three alignment blocks derived from five starting sequences). Repeats are shown as lowercase, and each block may have a subset of the input sequences. All sequence columns and rows must contain at least one nucleotide (no columns or rows that contain only insertions):
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
a score=23262.0
s hg18.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
a score=5062.0
s hg18.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga
a score=6636.0
s hg18.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
MAP¶
- Format:
human-readable
- Status:
included
- Type:
Genotypic
PLINK is a very widely used application for analyzing genotypic data.
The fields in a MAP file are:
Chromosome
Marker ID
Genetic distance
Physical position
Example of a MAP file of the standard PLINK format:
21 rs11511647 0 26765
X rs3883674 0 32380
X rs12218882 0 48172
9 rs10904045 0 48426
9 rs10751931 0 49949
8 rs11252127 0 52087
10 rs12775203 0 52277
8 rs12255619 0 52481
NEWICK¶
- Format:
human-readable
- Status:
included
- Type:
phylogeny
Newick format is typically used for tools like PHYLIP and is a minimal definition for a phylogenetic tree. It is a way of representing graph-theoretical trees with edge lengths using parentheses and commas.
(,,(,)); no nodes are named
(A,B,(C,D)); leaf nodes are named
(A,B,(C,D)E)F; all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5); all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0; all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5); distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)A:0.1)F; a tree rooted on a leaf node (rare)
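For simple trees such as those listed above, leaf names can be extracted with a regular expression: a leaf label directly follows "(" or ",", whereas internal node labels follow ")". This sketch is not a full Newick parser: the function name is hypothetical, and it assumes labels without quotes or white space.

```python
import re


def newick_leaf_names(tree):
    """Return the leaf labels of a simple Newick string."""
    return re.findall(r"[(,]([^(),:;]+)", tree)


print(newick_leaf_names("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"))
# ['A', 'B', 'C', 'D']
```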
Bioconvert conversions
NEXUS¶
- Format:
human-readable
- Status:
included
- Type:
phylogeny
The NEXUS multiple alignment format, also known as the PAUP format, is used to store multiple alignments or phylogenetic trees.
After a header indicating the format (#NEXUS), the data are stored in blocks that start with Begin NAME; and end with END;
Example of a DNA alignment:
#NEXUS
Begin data;
Dimensions ntax=4 nchar=15;
Format datatype=dna missing=? gap=-;
Matrix
Species1 atgctagctagctcg
Species2 atgcta??tag-tag
Species3 atgttagctag-tgg
Species4 atgttagctag-tag
;
End;
It can be used to store phylogenetic trees using the TREES block:
#NEXUS
BEGIN TAXA;
TAXLABELS A B C;
END;
BEGIN TREES;
TREE tree1 = ((A,B),C);
END;
Bioconvert conversions
NEXUS2CLUSTAL, NEXUS2NEWICK, NEXUS2PHYLIP, NEXUS2PHYLOXML
References
ODS¶
- Format:
human-readable
- Status:
included
- Type:
Sequence
ODS stands for OpenDocument Spreadsheet (.ods) file format. It should be equivalent to the XLS format.
PAF (Pairwise mApping Format)¶
- Format:
human-readable
- Status:
included
- Type:
mapping
PAF is a text format describing the approximate mapping positions between two sets of sequences. PAF is used for instance in the miniasm tool (see reference above), an ultrafast de novo assembler for long noisy reads. PAF is TAB-delimited, with each line consisting of the following predefined fields:
Col | Type | Description |
---|---|---|
1 | string | Query sequence name |
2 | int | Query sequence length |
3 | int | Query start (0-based) |
4 | int | Query end (0-based) |
5 | char | Relative strand: "+" or "-" |
6 | string | Target sequence name |
7 | int | Target sequence length |
8 | int | Target start on original strand (0-based) |
9 | int | Target end on original strand (0-based) |
10 | int | Number of residue matches |
11 | int | Alignment block length |
12 | int | Mapping quality (0-255; 255 for missing) |
If PAF is generated from an alignment, column 10 equals the number of sequence matches, and column 11 equals the total number of sequence matches, mismatches, insertions and deletions in the alignment. If the alignment is not available, columns 10 and 11 are still required but can be approximate.
A PAF file may optionally contain SAM-like typed key-value pairs at the end of each line.
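The twelve mandatory columns and the optional SAM-like tags can be read with a few lines of Python. This is a minimal sketch: `PafRecord` and `parse_paf_line` are hypothetical names, and the example line is made up for illustration rather than taken from a real mapping.

```python
from typing import Dict, NamedTuple


class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int
    tags: Dict[str, str]


def parse_paf_line(line: str) -> PafRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 TAB-delimited fields")
    # Optional fields use the SAM-like TAG:TYPE:VALUE layout.
    tags = {}
    for item in fields[12:]:
        tag, _type_code, value = item.split(":", 2)
        tags[tag] = value
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
        tags,
    )


# Made-up example line (not real data).
line = "read1\t1000\t10\t990\t+\tcontig1\t50000\t2000\t2980\t950\t980\t60\ttp:A:P"
record = parse_paf_line(line)
# Columns 10 and 11 give a rough identity estimate for the mapping.
identity = record.matches / record.block_length
print(record.query_name, record.target_name, round(identity, 3))
```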
Bioconvert conversion
PDB¶
Todo
coming soon
PED¶
- Format:
human-readable
- Status:
included
- Type:
Genotypic
PLINK is a very widely used application for analyzing genotypic data.
The fields in a PED file are:
Family ID
Sample ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Affection (0=unknown; 1=unaffected; 2=affected)
Genotypes (space or tab separated, 2 for each marker. 0=missing)
Example of a PED file of the standard PLINK format:
FAM1 NA06985 0 0 1 1 A T T T G G C C A T T T G G C C
FAM1 NA06991 0 0 1 1 C T T T G G C C C T T T G G C C
0 NA06993 0 0 1 1 C T T T G G C T C T T T G G C T
0 NA06994 0 0 1 1 C T T T G G C C C T T T G G C C
0 NA07000 0 0 2 1 C T T T G G C T C T T T G G C T
0 NA07019 0 0 1 1 C T T T G G C C C T T T G G C C
0 NA07022 0 0 2 1 C T T T G G 0 0 C T T T G G 0 0
0 NA07029 0 0 1 1 C T T T G G C C C T T T G G C C
FAM2 NA07056 0 0 0 2 C T T T A G C T C T T T A G C T
FAM2 NA07345 0 0 1 1 C T T T G G C C C T T T G G C C
PHYLOXML¶
- Format:
human-readable
- Status:
included
- Type:
phylogeny
PhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees.
A shortcoming of formats such as Nexus and Newick is a lack of a standardized means to annotate tree nodes and branches with distinct data fields (species names, branch lengths, multiple support values). A well defined XML format addresses these problems in a general and extensible manner and allows for interoperability between specialized and general purpose software.
Here is an example (source https://en.wikipedia.org/wiki/PhyloXML)
<phyloxml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd"
xmlns="http://www.phyloxml.org">
<phylogeny rooted="true">
<name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
<description>MrBayes based on MAFFT alignment</description>
<clade>
<clade branch_length="0.06">
<confidence type="probability">0.88</confidence>
<clade branch_length="0.102">
<name>A</name>
</clade>
<clade branch_length="0.23">
<name>B</name>
</clade>
</clade>
<clade branch_length="0.5">
<name>C</name>
</clade>
</clade>
</phylogeny>
</phyloxml>
Bioconvert conversions
PHYLIP¶
- Format:
human-readable
- Status:
included
- Type:
phylogeny / alignment
The PHYLIP format stores a multiple sequence alignment.
It is a plain text format with a header describing the dimensions of the alignment, followed by the multiple sequence alignment. Each sequence name is exactly 10 characters long (padded with spaces if needed).
PHYLIP does not support blank lines between the header and the alignment.
In the header, the first integer defines the number of sequences. The second integer defines the length of the alignment (number of columns). There are several spaces between the two integers.
Here is an example:
5 50
Seq0000 GATTAATTTG CCGTAGGCCA GAATCTGAAG ATCGAACACT TTAAGTTTTC
Seq0001 ACTTCTAATG GAGAGGACTA GTTCATACTT TTTAAACACT TTTACATCGA
Seq0002 TGTCGGACCT AAGTATTGAG TACAACGGTG TATTCCAGCG GTGGAGAGGT
Seq0003 CTATTTTTCC GGTTGAAGGA CTCTAGAGCT GTAAAGGGTA TGGCCATGTG
Seq0004 CTAAGCGCGG GCGGATTGCT GTTGGAGCAA GGTTAAATAC TCGGCAATGC
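The layout described above (a header with two integers, then names padded to 10 characters) can be read with the following Python sketch. It assumes the strict, non-interleaved variant of the format, and `parse_phylip` is a hypothetical helper name. The example input is a shortened version of the alignment shown above, with names explicitly padded to 10 characters.

```python
from typing import Dict


def parse_phylip(text: str) -> Dict[str, str]:
    """Parse a strict, sequential PHYLIP alignment into {name: sequence}."""
    lines = text.splitlines()
    n_sequences, n_columns = (int(value) for value in lines[0].split())
    records = {}
    for line in lines[1:1 + n_sequences]:
        name = line[:10].strip()                # first 10 characters hold the name
        sequence = line[10:].replace(" ", "")   # blocks are separated by spaces
        if len(sequence) != n_columns:
            raise ValueError(f"{name}: expected {n_columns} columns, got {len(sequence)}")
        records[name] = sequence
    if len(records) != n_sequences:
        raise ValueError("number of sequences does not match the header")
    return records


example = "\n".join([
    " 2 20",
    "Seq0000".ljust(10) + "GATTAATTTG CCGTAGGCCA",
    "Seq0001".ljust(10) + "ACTTCTAATG GAGAGGACTA",
])
alignment = parse_phylip(example)
print(alignment)
```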
Bioconvert conversions
PHYLIP2CLUSTAL, PHYLIP2FASTA, PHYLIP2NEXUS, PHYLIP2STOCKHOLM
PLINK flat files (MAP/PED)¶
- Format:
human-readable
- Status:
included
- Type:
genotypic
PLINK is a widely used application for analyzing genotypic data. It can be considered the de facto standard of the field.
The standard PLINK files can be a bundle of plain text files (PED & MAP dataset, or its transpose, TPED & FAM dataset), or a bundle of binary files (BED, BIM & FAM) as explained in PLINK binary files (BED/BIM/FAM).
PLINK provides commands to convert between text and binary formats. In Bioconvert, you can use the plink2bplink conversion:
bioconvert plink2bplink input_prefix output_prefix
Note
Since there are several input and output files, we do not provide the extension. Instead, we use the prefix filename.
Since PLINK files do not specify for a variant which allele is reference and which is alternative, importing data to a variant tools project requires matching each variant to the reference sequence to determine reference and alternative alleles.
The genotypic data are separated into two flat files: MAP and PED.
The MAP file describes the SNPs and contains the following fields:
chromosome number (integer)
SNP marker ID (string)
SNP genetic position (cM) (float)
SNP physical position (bp)
It is a space- or tab-delimited file with 4 columns. All SNPs must be ordered by physical position. Example:
X rs3883674 0 32380
X rs12218882 0 48172
9 rs10904045 0 48426
9 rs10751931 0 49949
The PED (pedigree) file describes the individuals and the genetic data. The PED file can be space or tab delimited. Each line corresponds to a single individual. The first 6 columns are:
family ID (or pedigree name): a unique alphanumeric identifier
individual ID: should be unique within the family
father ID: 0 if unknown. If specified, must also appear as an individual in the file
mother ID: same as above
Sex: 1 Male, 2 Female
Phenotype
Then, additional columns can be:
columns 7 and 8 code for the observed alleles at SNP1
columns 9 and 10 code for the observed alleles at SNP2, and so on
Missing data are coded as "0 0". So we have N lines and 2L + 6 columns, where N is the number of individuals and L the number of SNPs.
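The 2L + 6 column layout can be checked with a short Python sketch. The helper name `parse_ped_line` is hypothetical; the example line is taken from the PED section of this page, truncated to four SNPs.

```python
from typing import Any, Dict


def parse_ped_line(line: str) -> Dict[str, Any]:
    """Split one PED line into its 6 leading columns and per-SNP allele pairs."""
    fields = line.split()  # space or tab delimited
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 columns followed by 2 alleles per SNP")
    family, individual, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    # Columns 7-8 describe SNP1, columns 9-10 describe SNP2, and so on.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family": family,
        "individual": individual,
        "father": father,
        "mother": mother,
        "sex": sex,
        "phenotype": phenotype,
        "genotypes": genotypes,
    }


record = parse_ped_line("FAM1 NA06985 0 0 1 1 A T T T G G C C")
n_snps = len(record["genotypes"])
print(n_snps, 2 * n_snps + 6)  # number of SNPs (L) and number of columns (2L + 6)
```

A pair equal to ("0", "0") corresponds to a missing genotype.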
Given a .ped file (PLINK format), we can convert it into the 012 format (0: homozygous ancestral, 1: heterozygous, 2: homozygous derived) using
plink --file [.ped/.map fileset prefix] --recodeA --out [output prefix]
Bioconvert conversions
PLINK2BPLINK, BPLINK2PLINK, PLINK2VCF
PLINK binary files (BED/BIM/FAM)¶
- Format:
human-readable and binary
- Status:
included
- Type:
genotypic
PLINK binary format (BED, BIM and FAM) is a valid input for many software packages. If you have the PLINK flat files (MAP/PED) version, use PLINK to convert text to binary format if necessary. In Bioconvert, you can use the plink2bplink conversion as explained in the PLINK flat files (MAP/PED) section.
Here, the BED file is binary and is not to be confused with the BEDGRAPH format.
Bioconvert conversions
PLINK2BPLINK, BPLINK2PLINK, PLINK2VCF
QUAL¶
- Format:
human-readable
- Status:
included
- Type:
Sequence
QUAL files include qualities of each nucleotide in FASTA format.
Bioconvert conversions
FASTA2FASTQ
SAM¶
- Format:
human readable
- Status:
included
- Type:
alignment
In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line has 11 mandatory fields in the same order. Their values can be 0 or * if the field is unavailable. Here is an overview of those fields:
Col | Field | Type | Regexp/Range | Brief description |
---|---|---|---|---|
1 | QNAME | String | [!-?A-~]{1,254} | Query template NAME |
2 | FLAG | Int | [0,2^16-1] | bitwise FLAG |
3 | RNAME | String | *\|[!-()+-<>-~][!-~]* | Reference sequence NAME |
4 | POS | Int | [0,2^31-1] | 1-based leftmost mapping POSition |
5 | MAPQ | Int | [0,2^8-1] | MAPping Quality |
6 | CIGAR | String | *\|([0-9]+[MIDNSHPX=])+ | CIGAR string |
7 | RNEXT | String | *\|=\|[!-()+-<>-~][!-~]* | Ref. name of the mate/next read |
8 | PNEXT | Int | [0,2^31-1] | Position of the mate/next read |
9 | TLEN | Int | [-2^31+1,2^31-1] | observed Template LENgth |
10 | SEQ | String | *\|[A-Za-z=.]+ | segment SEQuence |
11 | QUAL | String | [!-~]+ | ASCII of Phred-scaled base QUALity+33 |
All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string that matches /[A-Za-z][A-Za-z0-9]/ . Each TAG can only appear once in one alignment line.
The tag NM:i:2 means: edit distance to the reference (number of changes necessary to make this equal to the reference, excluding clipping).
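The 11 mandatory fields and the TAG:TYPE:VALUE optional fields can be split as in the Python sketch below. `parse_sam_line` is a hypothetical helper, the example line is made up, and only the integer tag type (i) is converted; other tag types are kept as strings.

```python
from typing import Any, Dict

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Parse one SAM alignment line (not a header line starting with '@')."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    tags: Dict[str, Any] = {}
    for item in columns[len(SAM_FIELDS):]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


# Made-up alignment line for illustration.
line = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:2"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["tags"]["NM"])
```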
The optional fields are tool-dependent. For instance with BWA mapper, we can get these tags
Tag | Meaning |
---|---|
NM | Edit distance |
MD | Mismatching positions/bases |
AS | Alignment score |
BC | Barcode sequence |
X0 | Number of best hits |
X1 | Number of suboptimal hits found by BWA |
XN | Number of ambiguous bases in the reference |
XM | Number of mismatches in the alignment |
XO | Number of gap opens |
XG | Number of gap extensions |
XT | Type: Unique/Repeat/N/Mate-sw |
XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
XS | Suboptimal alignment score |
XF | Support from forward/reverse alignment |
XE | Number of supporting seeds |
SCF¶
- Format:
binary
- Status:
included
- Type:
alignment
Trace File Format - Sequence Chromatogram Format (SCF) is a binary file containing raw data output from automated sequencing instruments.
This converter was translated from BioPerl.
SCF file organisation (more or less)
Length in bytes | Data |
---|---|
128 | header |
Number of samples * sample size | Samples for A trace |
Number of samples * sample size | Samples for C trace |
Number of samples * sample size | Samples for G trace |
Number of samples * sample size | Samples for T trace |
Number of bases * 4 | Offset into peak index for each base |
Number of bases | Accuracy estimate bases being 'A' |
Number of bases | Accuracy estimate bases being 'C' |
Number of bases | Accuracy estimate bases being 'G' |
Number of bases | Accuracy estimate bases being 'T' |
Number of bases | The called bases |
Number of bases * 3 | Reserved for future use |
Comments size | Comments |
Private data size | Private data |
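The table above can be turned into byte offsets with a simple cumulative sum. The Python sketch below only mirrors the ordering and sizes given in the table; `scf_layout` is a hypothetical helper, the input values are arbitrary, and a real reader should rely on the offsets stored in the SCF header rather than on this computed layout.

```python
from typing import List, Tuple


def scf_layout(num_samples: int, sample_size: int, num_bases: int,
               comments_size: int, private_size: int) -> List[Tuple[str, int, int]]:
    """Return (section name, byte offset, length in bytes) following the table."""
    sections = [("header", 128)]
    for base in "ACGT":
        sections.append((f"samples_{base}", num_samples * sample_size))
    sections.append(("peak_index_offsets", num_bases * 4))
    for base in "ACGT":
        sections.append((f"accuracy_{base}", num_bases))
    sections.append(("called_bases", num_bases))
    sections.append(("reserved", num_bases * 3))
    sections.append(("comments", comments_size))
    sections.append(("private_data", private_size))

    layout = []
    offset = 0
    for name, length in sections:
        layout.append((name, offset, length))
        offset += length
    return layout


# Arbitrary sizes chosen for illustration only.
layout = scf_layout(num_samples=1000, sample_size=2, num_bases=100,
                    comments_size=50, private_size=0)
for name, offset, length in layout:
    print(f"{name:20s} offset={offset:6d} length={length}")
```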
Bioconvert conversions
SCF2FASTQ, SCF2FASTA.
SRA¶
The Sequence Read Archive (SRA) makes biological sequence data available to the research community. It stores raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT.
It is not a format per se but is included in Bioconvert by allowing the retrieval of sequencing data given a SRA identifier:
bioconvert sra2fastq <SRA_ID>
This will retrieve the fastq reads (single read or paired end data).
Bioconvert conversions
Reference:
TSV¶
- Format:
human readable
- Type:
database
- Status:
included
A tab-separated values format is a delimited text file that uses a tab character to separate values. See CSV format page for details.
Bioconvert conversions:
STOCKHOLM¶
- Format:
human readable
- Status:
included
- Type:
multiple sequence alignment
Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to store protein and RNA sequence alignments.
Here is a simple example:
# STOCKHOLM 1.0
#=GF ID UPSK
#=GF SE Predicted; Infernal
#=GF SS Published; PMID 9223489
#=GF RN [1]
#=GF RM 9223489
#=GF RT The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT polymerase.
#=GF RA Deiman BA, Kortlever RM, Pleij CW;
#=GF RL J Virol 1997;71:5990-5996.
AF035635.1/619-641 UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104 UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234 UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23 UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons .AAA....<<<<aaa....>>>>
//
A minimal well-formed Stockholm file should contain a header which states the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and corresponding unique sequence names:
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space:
#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>
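A minimal reader for the structure described above (header, #=GF annotations, sequence lines, closing //) could look like the following Python sketch. `parse_stockholm` is a hypothetical helper; it collects sequences and per-file annotations only, and ignores the other mark-up lines.

```python
from typing import Dict, List, Tuple


def parse_stockholm(text: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return ({seqname: aligned sequence}, [(GF feature, value), ...])."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM 1.0' header")
    sequences: Dict[str, str] = {}
    file_annotations: List[Tuple[str, str]] = []
    for line in lines[1:]:
        line = line.rstrip()
        if not line:
            continue
        if line == "//":          # end of the alignment
            break
        if line.startswith("#=GF"):
            parts = line.split(None, 2)
            if len(parts) == 3:
                file_annotations.append((parts[1], parts[2]))
        elif line.startswith("#"):
            continue              # other mark-up lines (#=GC, #=GS, #=GR) are skipped
        else:
            name, aligned = line.split(None, 1)
            # Concatenate in case the alignment is split over several blocks.
            sequences[name] = sequences.get(name, "") + aligned.strip()
    return sequences, file_annotations


example = """# STOCKHOLM 1.0
#=GF ID UPSK
AF035635.1/619-641 UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104 UGAGUUCUCUAUCUCUAAAAUCG
#=GC SS_cons .AAA....<<<<aaa....>>>>
//
"""
sequences, annotations = parse_stockholm(example)
print(annotations, list(sequences))
```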
Bioconvert conversions:
VCF¶
- Format:
human readable
- Status:
included
- Type:
variant
Variant Call Format (VCF) is a flexible and extendable format for storing variation in sequences such as single nucleotide variants, insertions/deletions, copy number variants and structural variants.
Bioconvert conversions:
VCF2PLINK
VCF2BPLINK
WIG¶
See WIGGLE (WIG).
WIGGLE (WIG)¶
- Format:
human readable
- Status:
included
- Type:
database-style
The wiggle (WIG) format is a format used for display of dense, continuous data such as GC percent. Wiggle data elements must be equally sized.
The similar bedGraph format is also an older format, used to display sparse data or data that contains elements of varying size.
For speed and efficiency, wiggle data is usually stored in BIGWIG format.
Wiggle format is line-oriented. It is composed of declaration lines and data lines. There are two options: variableStep and fixedStep.
The variableStep format is used for data with irregular intervals between new data points, and is the more commonly used wiggle format. The variableStep begins with a declaration line and is followed by two columns containing chromosome positions and data values:
variableStep chrom=chrN [span=windowSize]
chromStartA dataValueA
chromStartB dataValueB
... etc ... ... etc ...
The declaration line starts with the word variableStep and is followed by a specification for a chromosome. The optional span parameter (default: span=1) allows data composed of contiguous runs of bases with the same data value to be specified more succinctly. The span begins at each chromosome position specified and indicates the number of bases that data value should cover. For example, this variableStep specification:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5
The variableStep format becomes very inefficient when there are only a few data points per 1024 bases. If variableStep data points (i.e., chromStarts) are greater than about 100 bases apart, it is advisable to use BedGraph format.
The fixedStep format is used for data with regular intervals between new data values and is the more compact wiggle format. The fixedStep begins with a declaration line and is followed by a single column of data values:
fixedStep chrom=chrN start=position step=stepInterval [span=windowSize]
dataValue1
dataValue2
... etc ...
The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:
fixedStep chrom=chr3 start=400601 step=100
11
22
33
displays the values 11, 22, and 33 as single-base regions on chromosome 3 at positions 400601, 400701, and 400801, respectively. Adding span=5 to the declaration line:
fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33
causes the values 11, 22, and 33 to be displayed as 5-base regions on chromosome 3 at positions 400601-400605, 400701-400705, and 400801-400805, respectively.
Note that for both variableStep and fixedStep formats, the same span must be used throughout the dataset. If no span is specified, the default span of 1 is used. As the name suggests, fixedStep wiggles require the same size step throughout the dataset. If not specified, a step size of 1 is used.
Data values can be integer or real, positive or negative values. Positions specified in the input data must be in numerical order.
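The variableStep and fixedStep rules above can be summarised by expanding each data line into an explicit interval. The Python sketch below does so with 1-based, inclusive coordinates, reusing the examples given in this section. `expand_wiggle` is a hypothetical helper and does not reflect how Bioconvert implements its WIG conversions.

```python
from typing import Iterable, Iterator, Tuple


def expand_wiggle(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start, end, value) with 1-based, inclusive coordinates."""
    mode = None
    chrom = ""
    span = 1
    step = 1
    position = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("track"):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            # Declaration line: a keyword followed by key=value parameters.
            mode, *params = line.split()
            options = dict(param.split("=", 1) for param in params)
            chrom = options["chrom"]
            span = int(options.get("span", 1))      # default span is 1
            if mode == "fixedStep":
                position = int(options["start"])
                step = int(options.get("step", 1))  # default step is 1
            continue
        if mode == "variableStep":
            start_text, value_text = line.split()
            start = int(start_text)
            yield chrom, start, start + span - 1, float(value_text)
        elif mode == "fixedStep":
            yield chrom, position, position + span - 1, float(line)
            position += step
        else:
            raise ValueError("data line found before any declaration line")


fixed = list(expand_wiggle([
    "fixedStep chrom=chr3 start=400601 step=100 span=5",
    "11", "22", "33",
]))
variable = list(expand_wiggle([
    "variableStep chrom=chr2 span=5",
    "300701 12.5",
]))
print(fixed)
print(variable)
```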
Warning
BigWig files created from bedGraph format use "0-start, half-open" coordinates, but bigWigs that represent variableStep and fixedStep data are generated from wiggle files that use "1-start, fully-closed" coordinates. For example, for a chromosome of length N, the first position is 1 and the last position is N.
Bioconvert conversions
WIG2BED
XLS¶
- Format:
human readable
- Type:
database
- Status:
included
Spreadsheet file format (Microsoft Excel file format).
Until 2007, Microsoft Excel used a proprietary binary file format called Excel Binary File Format (.XLS). In Excel 2007, the Office Open XML format was introduced. We support the latter format only.
With bioconvert you can convert an XLS file into CSV or TSV format. If several sheets are to be found, you can select one or the other.
Bioconvert conversions:
XLS2CSV, XLSX2CSV
XLSX¶
- Type:
database
- Status:
included
Spreadsheet file format in Office Open XML format.
With bioconvert you can convert an XLSX file into CSV or TSV format. If several sheets are to be found, you can select one or the other.
See also
XLS format.
XMFA¶
- Format:
human-readable
- Status:
included
- Type:
alignment
XMFA stands for eXtended Multi-FastA file format. The .alignment file contains the complete genome alignment. This standard file format is also used by other genome alignment systems that align sequences with rearrangements.
The XMFA file format supports the storage of several collinear sub-alignments, each separated with an = sign, that constitute a single genome alignment. Each sub-alignment consists of one FastA format sequence entry per genome where the entry’s defline gives the strand (orientation) and location in the genome of the sequence in the alignment.
Example (from darlinglab.org/mauve ):
>seq_num:start1-end1 ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...
> seq_num:startN-endN ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...
= comments, and optional field-value pairs, i.e. score=12345
> seq_num:start1-end1 ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...
> seq_num:startN-endN ± comments (sequence name, etc.)
AC-TG-NAC--TG
AC-TG-NACTGTG
...
= comments, and optional field-value pairs, i.e. score=12345
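The structure above (FastA-like entries grouped into sub-alignments closed by an = line) can be read with the following Python sketch. `parse_xmfa` is a hypothetical helper, the defline pattern assumes the `seq_num:start-end strand comments` layout shown in the example, and the input coordinates are made up.

```python
import re
from typing import Any, Dict, List

# Matches deflines such as "> 1:1-13 + genomeA" (the space after ">" is optional).
DEFLINE = re.compile(r">\s*(\d+):(\d+)-(\d+)\s+([+-])\s*(.*)")


def parse_xmfa(text: str) -> List[List[Dict[str, Any]]]:
    """Return a list of sub-alignments, each a list of sequence entries."""
    blocks: List[List[Dict[str, Any]]] = []
    current: List[Dict[str, Any]] = []
    entry = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("="):          # end of a collinear sub-alignment
            blocks.append(current)
            current, entry = [], None
        elif line.startswith(">"):
            match = DEFLINE.match(line)
            if match is None:
                raise ValueError(f"unexpected defline: {line}")
            seq_num, start, end, strand, comment = match.groups()
            entry = {"seq_num": int(seq_num), "start": int(start), "end": int(end),
                     "strand": strand, "comment": comment, "sequence": ""}
            current.append(entry)
        else:
            if entry is None:
                raise ValueError("sequence data found before any defline")
            entry["sequence"] += line
    if current:                            # tolerate a missing final "=" line
        blocks.append(current)
    return blocks


example = """> 1:1-13 + genomeA
AC-TG-NAC--TG
> 2:5-17 - genomeB
AC-TG-NACTGTG
= score=12345
> 1:20-26 + genomeA
ACGTACG
=
"""
blocks = parse_xmfa(example)
print(len(blocks), [entry["seq_num"] for entry in blocks[0]])
```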
Bioconvert conversions
YAML¶
- Format:
human-readable
- Status:
included
- Type:
database
YAML ("YAML Ain't Markup Language") is a human-readable data-serialization language. It is commonly used for configuration files, but could be used in many applications where data is being stored.
The full syntax cannot be described here. The full specification is available at the official site (https://yaml.org/refcard.html).
In brief:
- Whitespace indentation is used to denote structure. Tab characters are not allowed.
- Comments begin with the number sign # and can start anywhere on a line.
- Lists are denoted by the - character with one member per line, or enclosed in square brackets [ ].
- Associative arrays are represented with a colon followed by a space, in the form key: value.
- Strings can be unquoted or quoted.
Example:
# example of a yaml file
- {name: Jean, age: 33}
- name: Marie
age : 32
men:
- Pierre
- Jean
women:
- Marie
Bioconvert conversions
JSON2YAML, YAML2JSON.
Others¶
ACE¶
Human-readable file format used by the AceDB database, which is a genome database designed for the handling of bioinformatics data. The data looks like:
DNA : "HSFAU"
ttccttccagctactgttccttccagc
tactg
This format is obsolete and will not be included in Bioconvert for now. BioPython seems to handle this format.
ASN1¶
ASN.1 (Abstract Syntax Notation One) is an International Organization for Standardization (ISO) data representation format used to achieve interoperability. It is a formal notation used for describing data transmitted by telecommunications protocols, regardless of language implementation and physical representation of these data, whatever the application, whether complex or very simple. NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, PubMed records, and more.
GCG¶
- Format:
human-readable
- Status:
not included
- Type:
sequence
GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package.
An example sequence in GCG format is:
ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa
GVF¶
- Format:
human-readable
- Status:
not included
- Type:
variant
The Genome Variation Format (GVF) is a very simple file format for describing sequence_alteration features at nucleotide resolution relative to a reference genome.
Example:
##gvf-version 1.10
##genome-build NCBI B36.3
##sequence-region chr16 1 88827254
chr16 samtools SNV 49291141 49291141 . + . ID=ID_1;Variant_seq=A,G;Reference_seq=G;
chr16 samtools SNV 49291360 49291360 . + . ID=ID_2;Variant_seq=G;Reference_seq=C;
chr16 samtools SNV 49302125 49302125 . + . ID=ID_3;Variant_seq=T,C;Reference_seq=C;
chr16 samtools SNV 49302365 49302365 . + . ID=ID_4;Variant_seq=G,C;Reference_seq=C;
chr16 samtools SNV 49302700 49302700 . + . ID=ID_5;Variant_seq=T;Reference_seq=C;
chr16 samtools SNV 49303084 49303084 . + . ID=ID_6;Variant_seq=G,T;Reference_seq=T;
chr16 samtools SNV 49303156 49303156 . + . ID=ID_7;Variant_seq=T,C;Reference_seq=C;
chr16 samtools SNV 49303427 49303427 . + . ID=ID_8;Variant_seq=T,C;Reference_seq=C;
chr16 samtools SNV 49303596 49303596 . + . ID=ID_9;Variant_seq=T,C;Reference_seq=C;
IG¶
The IntelliGenetics (IG) format is a sequence format. It can contain several sequences, each consisting of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself terminated with the termination character '1' for linear or '2' for circular sequences.
An example sequence in IG format is:
; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA1
PIR¶
- Format:
human-readable
- Status:
not included
- Type:
sequence
The PIR (Protein Information Resource) format may contain several sequences. A sequence in PIR format consists of one line starting with a ">" character followed by a 2-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by a semicolon, followed by the sequence identification code (the database ID-code). Then comes one line containing a textual description of the sequence, and finally one or more lines containing the sequence itself. The end of the sequence is marked by a "*" character.
The PIR format is also often referred to as the NBRF format.
Example:
>P1;CRAB_ANAPL
Example protein sequence. Note the final * character
MDITIHNPLI RRPLFSWLAP SRIFDQIFGE HLQESELLPA SPSLSPFLMR
SPIFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMVEIH
GKHEERQDEH GFIAREFNRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ
SDVPERSIPI TREEKPAIAG AQRK*
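The three parts of a PIR entry (header line, description line, sequence terminated by "*") can be read with the following Python sketch. `parse_pir` is a hypothetical helper, since Bioconvert does not include this format, and the example is a shortened version of the entry shown above.

```python
from typing import Dict, List


def parse_pir(text: str) -> List[Dict[str, str]]:
    """Parse PIR/NBRF entries into dictionaries."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # Header: ">" + 2-letter type code + ";" + identification code.
        type_code, _, identifier = line[1:].partition(";")
        description = next(lines, "").strip()
        residues = []
        for sequence_line in lines:
            chunk = sequence_line.replace(" ", "").strip()
            if chunk.endswith("*"):       # "*" marks the end of the sequence
                residues.append(chunk[:-1])
                break
            residues.append(chunk)
        records.append({
            "type": type_code,
            "id": identifier.strip(),
            "description": description,
            "sequence": "".join(residues),
        })
    return records


example = """>P1;CRAB_ANAPL
Example protein sequence.
MDITIHNPLI RRPLFSWLAP
SDVPERSIPI TREEKPAIAG AQRK*
"""
records = parse_pir(example)
print(records[0]["id"], len(records[0]["sequence"]))
```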
imgt Unspecified (*.txt) This refers to the IMGT variant of the EMBL plain text file format.
phd PHD files are output from PHRED, used by PHRAP and CONSED for input.
seqxml Simple sequence XML file format.
sff Standard Flowgram Format (SFF): binary files produced by Roche 454 and IonTorrent/IonProton sequencing machines.
swiss Swiss-Prot aka UniProt format.
uniprot-xml UniProt XML format, successor to the plain text Swiss-Prot format.
pdb2gmx: This program reads a .pdb (or .gro) file, reads some database files, adds hydrogens to the molecules and generates coordinates in GROMACS (GROMOS), or optionally .pdb, format and a topology in GROMACS format. See http://manual.gromacs.org/archive/4.6.7/online/pdb2gmx.html for details. This tool is already quite complete, so this conversion will not be provided for now.
Glossary¶
Note that the formats mentioned below have a dedicated description in the Formats section.
- ABI¶
File format produced by ABI sequencing machines. Contains the trace data which includes probabilities of the four nucleotides. See the ABI format page for details.
- ASQG¶
The ASQG format describes an assembly graph. Each line is a tab-delimited record. The first field in each record describes the record type. See the ASQG page for details.
- BAI¶
The index file associated with a file in the BAM format. (This is a non-standard file type.) See the BAI page for details.
- BAM¶
Binary version of the Sequence Alignment Map (SAM) format. See the BAM format page for details.
- BCF¶
Binary version of the Variant Call Format (VCF). See BCF page for details.
- BCL¶
BCL is the raw format used by Illumina sequencers. See the BCL format page for details.
- BED¶
BEDGRAPH/BED format is line-oriented and allows display of continuous-valued data. Similar to WIG format. See the BED format page for details.
- BED3¶
Variant of the BED format with 3 columns storing the track name, start and end positions. See the BED format page for details.
- BED4¶
Variants of the BED format with 4 columns storing the track name, start and end positions and values. See the BED4 format page for details.
- BEDGRAPH¶
BEDGRAPH/BED format is line-oriented and allows display of continuous-valued data. Similar to WIG format. See the BED format page for details.
- BIGBED¶
An indexed binary version of a BED file. See BIGBED page for details.
- BIGWIG¶
Indexed binary version of the Wiggle format. See BIGWIG page for details.
- BPLINK¶
Binary version of the PLINK format used for analyzing genotypic data for Genome-wide Association Studies (GWAS). See PLINK binary files (BED/BIM/FAM) page for details.
- BZ2¶
bzip2 is a file compression program that uses the Burrows–Wheeler algorithm. The extension is usually .bz2. See BZ2 page for details.
- CLUSTAL¶
The alignment format of Clustal X and Clustal W. See CLUSTAL page for details.
- COV¶
A Bioconvert format to store coverage in the form of a 3-column tab-delimited file. See COV page for details.
- CRAM¶
A more compact version of BAM files used to store Sequence Alignment Map (SAM) format. See CRAM page for details.
- CSV¶
A comma-separated values format is a delimited text file that uses a comma to separate values. See CSV format page for details.
- DSRC¶
A compression tool dedicated to FastQ files. See DSRC page for details.
- EMBL¶
EMBL Flat File Format. See EMBL page for details.
- FAA¶
FASTA-formatted sequence files containing amino acid sequences. See FAA page for details.
- FASTA¶
FASTA-formatted sequence files contain either nucleic acid sequence (such as DNA) or protein sequence information. FASTA files can also store multiple sequences in a single file. See FASTA page for details.
- FASTQ¶
FASTQ-formatted sequence files are used to represent high-throughput sequencing data, where each read is described by a name, its sequence, and its qualities. See FastQ page for details.
- GENBANK¶
GenBank Flat File Format. See GENBANK page for details.
- GFA¶
Graphical Fragment Assembly format. https://github.com/GFA-spec/GFA-spec
- GFF2¶
General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. See GTF page for details.
- GFF3¶
General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. http://genome.ucsc.edu/FAQ/FAQformat#format3 See GTF page for details.
- GZ¶
gzip is a file compression program based on the DEFLATE algorithm. See GZ page for details.
- JSON¶
A human-readable data serialization language commonly used in configuration files. See JSON page for details.
- MAF¶
A human-readable multiple alignment format. See MAF (Multiple Alignement Format) page for details.
- NEWICK¶
Plain text minimal format used to store phylogenetic trees. See NEWICK page for details.
- NEXUS¶
Plain text format used to store multiple alignments and phylogenetic trees. See NEXUS page for details.
- PAF¶
PAF is a text format describing the approximate mapping positions between two sets of sequences.
- PHYLIP¶
Plain text format to store a multiple sequence alignment. See PHYLIP page for details.
- PHYLOXML¶
XML format for the analysis, exchange, and storage of phylogenetic trees. See PHYLOXML page for details.
- PLINK¶
Format used for analyzing genotypic data for Genome-wide Association Studies (GWAS). See PLINK flat files (MAP/PED) page for details.
- QUAL¶
Sequence of qualities associated with a sequence of nucleotides. Associated with FastA file, the original FastQ file can be built back. See QUAL page for details.
- SAM¶
Sequence Alignment Map is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. See SAM page for details.
- SCF¶
Standard Chromatogram Format, a binary chromatogram format described in Staden package documentation SCF file format.
- SRA¶
The Sequence Read Archive (SRA) is a website that stores sequencing data at https://www.ncbi.nlm.nih.gov/sra. It is not a format per se. See SRA page for details.
- STOCKHOLM¶
Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to store protein and RNA sequence alignments. See STOCKHOLM page for details.
- TSV¶
A tab-separated values format is a delimited text file that uses a tab character to separate values. See TSV format page for details.
- TWOBIT¶
2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself. See TWOBIT format page for details.
- VCF¶
Variant Call Format (VCF) is a flexible and extendable format for storing variation in sequences such as single nucleotide variants, insertions/deletions, copy number variants and structural variants. See VCF page for details.
- WIG¶
Synonym for the wiggle (WIG) format. See WIG.
- WIGGLE¶
The wiggle (WIG) format stores dense, continuous data such as GC percent, probability scores, and transcriptome data. See WIG page for details.
- XLS¶
Spreadsheet file format (Microsoft Excel file format). See XLS page for details.
- XLSX¶
Spreadsheet file format defined in the Office Open XML specification. See XLSX page for details.
- XMFA¶
TODO
- YAML¶
A human-readable data serialization language commonly used in configuration files (see https://en.wikipedia.org/wiki/YAML). See YAML page for details.
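The QUAL entry above notes that a QUAL file combined with its FastA file is enough to rebuild a FastQ file. The following is a minimal pure-Python sketch of that idea, not Bioconvert's own implementation. The function names `parse_records` and `fasta_qual_to_fastq` are hypothetical, and the sketch assumes that both inputs list the same records in the same order and that qualities are encoded with the common Phred+33 offset.

```python
def parse_records(text):
    """Parse FastA-like text into a list of (name, body_lines) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the identifier (first word after '>').
            records.append((line[1:].split()[0], []))
        elif records:
            records[-1][1].append(line)
    return records


def fasta_qual_to_fastq(fasta_text, qual_text, offset=33):
    """Merge FastA sequences and QUAL scores into FastQ text.

    Assumption: qualities are written as space-separated integers and
    are encoded in the output as chr(score + offset), i.e. Phred+33.
    """
    sequences = parse_records(fasta_text)
    qualities = parse_records(qual_text)
    if len(sequences) != len(qualities):
        raise ValueError("FastA and QUAL files have different record counts")

    chunks = []
    for (name, seq_lines), (qual_name, qual_lines) in zip(sequences, qualities):
        if name != qual_name:
            raise ValueError(f"record mismatch: {name} vs {qual_name}")
        sequence = "".join(seq_lines)
        scores = [int(value) for value in " ".join(qual_lines).split()]
        if len(sequence) != len(scores):
            raise ValueError(f"{name}: sequence and quality lengths differ")
        encoded = "".join(chr(score + offset) for score in scores)
        chunks.append(f"@{name}\n{sequence}\n+\n{encoded}\n")
    return "".join(chunks)


if __name__ == "__main__":
    fasta = ">read1\nACGT\n"
    qual = ">read1\n40 30 20 10\n"
    print(fasta_qual_to_fastq(fasta, qual), end="")
```

For real data, prefer the conversion provided by Bioconvert itself; the sketch only shows why the two files together carry the same information as the FastQ file.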
Faqs¶
Installation¶
On Ubuntu, you need the libz-dev and python3-dev libraries, which are not necessarily installed by default:
sudo apt-get install libz-dev python3-dev
Plink¶
If you have installed plink1.9 but bioconvert still cannot use it, the likely cause is that bioconvert calls the program by the name "plink". In that case you need to create a symbolic link. First, find the directory that contains plink1.9 using the which command:
which plink1.9
Go into that directory, then run:
ln -s plink1.9 plink
After this, bioconvert will be able to call plink.
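The same steps can be scripted. The sketch below is only an illustration, not part of Bioconvert: the helper name `link_plink` is hypothetical, and it assumes a POSIX system where you have write permission in the directory that holds plink1.9 (otherwise run the ln command above with suitable privileges).

```python
import os
import shutil


def link_plink(source_name="plink1.9", link_name="plink"):
    """Make `link_name` callable by symlinking it to `source_name`.

    Returns the path of the executable found or created, or None if
    `source_name` is not on the PATH.
    """
    # Nothing to do if a program called `link_name` is already on the PATH.
    existing = shutil.which(link_name)
    if existing:
        return existing

    # Equivalent of `which plink1.9`.
    source = shutil.which(source_name)
    if source is None:
        return None

    # Equivalent of `ln -s plink1.9 plink` in the same directory.
    link = os.path.join(os.path.dirname(source), link_name)
    os.symlink(source, link)
    return link


if __name__ == "__main__":
    path = link_plink()
    if path is None:
        print("plink1.9 was not found on the PATH")
    else:
        print(f"plink is available at {path}")
```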
Libraries¶
Bibliography¶
BEDTools: a flexible suite of utilities for comparing genomic features. Aaron R. Quinlan, Ira M. Hall 2010 Bioinformatics 26(6) https://doi.org/10.1093/bioinformatics/btq033
BioConvert: a comprehensive format converter for life sciences https://bioconvert.readthedocs.io
Biopython: freely available Python tools for computational molecular biology and bioinformatics. Cock et al 2009, Bioinformatics 25(11) https://doi.org/10.1093/bioinformatics/btp163
Ramírez, Fidel, Devon P. Ryan, Björn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S. Richter, Steffen Heyne, Friederike Dündar, and Thomas Manke. “deepTools2: a next generation web server for deep-sequencing data analysis.” Nucleic Acids Research (2016): gkw257.
Mosdepth: quick coverage calculation for genomes and exomes Brent S Pedersen, Aaron R Quinlan 2018 Bioinformatics, 34(5) https://doi.org/10.1093/bioinformatics/btx699
Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)
The Sequence Alignment/Map format and SAMtools. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. Bioinformatics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8. PMID: 19505943
Ogasawara T, Cheng Y, Tzeng T-HK (2016) Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools. PLoS ONE 11(11): e0167100. doi:10.1371/journal.pone.0167100
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup, The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics (2009) 25(16) 2078-9 [19505943]
NCBI SRA tools https://edwards.flinders.edu.au/fastq-dump/
Zerbino DR, Johnson N, Juettemann T, Wilder SP and Flicek PR: WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 2014 30:1008-1009.