7.1. Core functions

bioconvert.core.base

Main factory of Bioconvert

bioconvert.core.benchmark

Tools for benchmarking

bioconvert.core.converter

Standalone application dedicated to conversion

bioconvert.core.decorators

Provides a general tool to perform pre/post compression

bioconvert.core.downloader

Download singularity image

bioconvert.core.extensions

List of formats and associated extensions

bioconvert.core.graph

Network tools to manipulate the graph of conversion

bioconvert.core.registry

Main bioconvert registry that fetches automatically the relevant converter

bioconvert.core.shell

Simplified version of shell.py module from snakemake package

bioconvert.core.utils

misc utility functions

7.1.1. Base

Main factory of Bioconvert

class ConvArg(names, help, **kwargs)[source]

This class can be used to add specific extra arguments to any converter

For instance, imagine a conversion named A2B that requires the user to provide a reference. You may then want to provide the --reference extra argument. This is possible by adding a class method named get_additional_arguments that yields an instance of this class for each extra argument.

@classmethod
def get_additional_arguments(cls):
    yield ConvArg(
        names="--reference",
        default=None,
        help="the reference"
    )

Then, when calling bioconvert as follows:

bioconvert A2B --help

the new argument will be shown in the list of arguments.
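As a rough illustration of how yielded arguments could end up in the help output, the sketch below forwards them to argparse. DemoArg and this A2B class are hypothetical stand-ins written for this example, and the use of argparse is an assumption, not a description of bioconvert's internals.

```python
import argparse


class DemoArg:
    """Hypothetical stand-in for ConvArg: stores option names and keywords."""

    def __init__(self, names, help, **kwargs):
        # Accept a single option name or a list of names.
        self.names = [names] if isinstance(names, str) else list(names)
        self.kwargs = dict(help=help, **kwargs)

    def add_to_parser(self, parser):
        # Forward everything to argparse unchanged.
        parser.add_argument(*self.names, **self.kwargs)


class A2B:
    """Hypothetical converter exposing one extra argument."""

    @classmethod
    def get_additional_arguments(cls):
        yield DemoArg(names="--reference", default=None, help="the reference")


parser = argparse.ArgumentParser(prog="A2B")
for arg in A2B.get_additional_arguments():
    arg.add_to_parser(parser)

args = parser.parse_args(["--reference", "genome.fa"])
print(args.reference)
```

Because the option is registered on the parser, it also appears in the text produced by parser.format_help().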

class ConvBase(infile, outfile)[source]

Base class for all converters.

To build a new converter, create a new class that inherits from ConvBase and implement a method that performs the conversion. The name of the conversion method must start with _method_.

For instance:

class FASTQ2FASTA(ConvBase):

    def _method_python(self, *args, **kwargs):
        # include your code here. You can use the infile and outfile
        # attributes.
        self.infile
        self.outfile
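The _method_ naming convention makes it possible to discover conversion methods by introspection. The sketch below shows one way this could work; DemoBase, available_methods and run are hypothetical names chosen for the example, not part of the ConvBase API.

```python
class DemoBase:
    """Hypothetical base class; not bioconvert's ConvBase."""

    def __init__(self, infile, outfile):
        self.infile = infile
        self.outfile = outfile

    @classmethod
    def available_methods(cls):
        # Collect the suffix of every callable attribute named _method_<suffix>.
        prefix = "_method_"
        return sorted(
            name[len(prefix):]
            for name in dir(cls)
            if name.startswith(prefix) and callable(getattr(cls, name))
        )

    def run(self, method):
        # Dispatch to the method selected by its short name.
        return getattr(self, "_method_" + method)()


class FASTQ2FASTA(DemoBase):
    def _method_python(self):
        return f"python: {self.infile} -> {self.outfile}"

    def _method_awk(self):
        return f"awk: {self.infile} -> {self.outfile}"


conv = FASTQ2FASTA("test.fastq", "test.fasta")
print(FASTQ2FASTA.available_methods())
print(conv.run("python"))
```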

constructor

Parameters:
  • infile (str) -- the path of the input file.

  • outfile (str) -- the path of the output file.

boxplot_benchmark(rot_xticks=90, boxplot_args={}, mode='time')[source]

This function plots the benchmark computed in compute_benchmark()

compute_benchmark(N=5, to_exclude=[], to_include=[])[source]

Simple wrapper to call Benchmark

This function computes the benchmark

see Benchmark for details.

install_tool(executable)[source]

Install the given tool, using the script: bioconvert/install_script/install_executable.sh if the executable is not already present

Parameters:

executable -- executable to install

Returns:

nothing

property name

The name of the class

class ConvMeta(name, bases, namespace, **kwargs)[source]

This metaclass checks that the converter classes have

  • an attribute input_ext

  • an attribute output_ext

This is a metaclass used by the ConvBase class. For developers only.
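A metaclass can enforce such attribute checks at class-creation time. The following is a minimal sketch of the idea only; CheckedMeta and DemoBase are hypothetical, and the real ConvMeta may perform its checks differently.

```python
class CheckedMeta(type):
    """Hypothetical metaclass refusing subclasses that lack required attributes."""

    required = ("input_ext", "output_ext")

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if not bases:
            return  # the root base class itself is exempt
        for attr in cls.required:
            if not hasattr(cls, attr):
                raise TypeError(f"{name} must define {attr}")


class DemoBase(metaclass=CheckedMeta):
    pass


class Good(DemoBase):
    input_ext = ["fastq"]
    output_ext = ["fasta"]


try:
    # Class creation itself fails because output_ext is missing.
    class Bad(DemoBase):
        input_ext = ["fastq"]
except TypeError as err:
    message = str(err)
    print(message)
```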

make_chain(converter_map)[source]

Create a class performing step-by-step conversions following a path. converter_map is a list of pairs ((in_fmt, out_fmt), converter). It describes the conversion path.
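To make the shape of converter_map concrete, here is a simplified sketch of chaining conversions through intermediate temporary files. make_chain_sketch and the two toy converters are hypothetical; in particular, the converters here are plain functions taking (infile, outfile), which is an assumption made to keep the example short.

```python
import os
import tempfile


def make_chain_sketch(converter_map):
    """Build a class that applies each converter in turn (hypothetical helper)."""

    class Chain:
        def __init__(self, infile, outfile):
            self.infile = infile
            self.outfile = outfile

        def __call__(self):
            current = self.infile
            last = len(converter_map) - 1
            with tempfile.TemporaryDirectory() as tmp:
                for i, ((in_fmt, out_fmt), converter) in enumerate(converter_map):
                    # Intermediate results go to temporary files; the last
                    # step writes to the requested output path.
                    if i == last:
                        target = self.outfile
                    else:
                        target = os.path.join(tmp, f"step{i}.{out_fmt}")
                    converter(current, target)
                    current = target

    return Chain


def to_upper(infile, outfile):
    # Toy "a -> b" conversion: upper-case the content.
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write(src.read().upper())


def add_mark(infile, outfile):
    # Toy "b -> c" conversion: append a marker.
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write(src.read() + "!")


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.a")
dst_path = os.path.join(workdir, "out.c")
with open(src_path, "w") as handle:
    handle.write("acgt")

Chain = make_chain_sketch([(("a", "b"), to_upper), (("b", "c"), add_mark)])
Chain(src_path, dst_path)()
with open(dst_path) as handle:
    result = handle.read()
print(result)
```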

7.1.2. Benchmark

Tools for benchmarking

class Benchmark(obj, N=5, to_exclude=None, to_include=None)[source]

Convenient class to benchmark several methods for a given converter

c = BAM2COV(infile, outfile)
b = Benchmark(c, N=5)
b.run_methods()
b.plot()

Constructor

Parameters:
  • obj -- can be an instance of a converter class or a class name

  • N (int) -- number of replicates

  • to_exclude (list) -- methods to exclude from the benchmark

  • to_include (list) -- methods to include ONLY

Use one of to_exclude or to_include. If both are provided, only the to_include one is used.
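The selection rule and the replicate loop can be sketched without bioconvert as follows. select_methods and time_methods are hypothetical helpers that only mimic the behaviour described above; they are not the Benchmark implementation.

```python
import time


def select_methods(available, to_exclude=None, to_include=None):
    # to_include wins when both are given, mirroring the rule stated above.
    if to_include:
        return [m for m in available if m in to_include]
    if to_exclude:
        return [m for m in available if m not in to_exclude]
    return list(available)


def time_methods(methods, N=5):
    """Return {name: [elapsed seconds, ...]} with N timings per method."""
    results = {}
    for name, func in methods.items():
        timings = []
        for _ in range(N):
            start = time.perf_counter()
            func()
            timings.append(time.perf_counter() - start)
        results[name] = timings
    return results


methods = {
    "sum": lambda: sum(range(1000)),
    "sorted": lambda: sorted(range(1000)),
}
chosen = select_methods(methods, to_exclude=["sum"], to_include=["sum"])
results = time_methods({name: methods[name] for name in chosen}, N=3)
print(chosen, len(results["sum"]))
```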

plot(rerun=False, ylabel=None, rot_xticks=0, boxplot_args={}, mode='time')[source]

Plots the benchmark results, running the benchmarks if needed or if rerun is True.

Parameters:
  • rot_xticks -- rotation of the x-axis tick labels

  • boxplot_args -- dictionary with any of the pylab.boxplot arguments

  • mode -- either time, CPU or memory

Returns:

dataframe with all results

run_methods()[source]

Runs the benchmarks, and stores the timings in self.results.

plot_multi_benchmark_max(path_json, output_filename='multi_benchmark.png', min_ylim=0, mode=None)[source]

Plotting function for the Snakefile_benchmark to be found in the doc

The json file looks like:

{
  "awk":{
    "0":0.777020216,
    "1":0.9638044834,
    "2":1.7623617649,
    "3":0.8348755836
  },
  "seqtk":{
    "0":1.0024843216,
    "1":0.6313509941,
    "2":1.4048073292,
    "3":1.0554351807
  },
  "Benchmark":{
    "0":1,
    "1":1,
    "2":2,
    "3":2
  }
}

The number of benchmarks is inferred from the 'Benchmark' field.
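The sketch below reads the JSON layout shown above, counts the distinct values of the 'Benchmark' field, and groups each method's timings by benchmark. Taking the maximum per group is only a guess at what the "max" in the function name refers to; the variable names are illustrative.

```python
import json
from collections import defaultdict

raw = """
{
  "awk":   {"0": 0.777020216,  "1": 0.9638044834, "2": 1.7623617649, "3": 0.8348755836},
  "seqtk": {"0": 1.0024843216, "1": 0.6313509941, "2": 1.4048073292, "3": 1.0554351807},
  "Benchmark": {"0": 1, "1": 1, "2": 2, "3": 2}
}
"""

data = json.loads(raw)
labels = data["Benchmark"]

# The number of benchmarks is the number of distinct labels.
n_benchmarks = len(set(labels.values()))

# Group each method's timings by the benchmark label sharing the same row key.
grouped = defaultdict(lambda: defaultdict(list))
for method, timings in data.items():
    if method == "Benchmark":
        continue
    for row, value in timings.items():
        grouped[method][labels[row]].append(value)

# Assumption: "max" means the slowest timing per method and benchmark.
maxima = {
    method: {bench: max(values) for bench, values in per_bench.items()}
    for method, per_bench in grouped.items()
}
print(n_benchmarks, maxima)
```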

7.1.3. Converter

Standalone application dedicated to conversion

class Bioconvert(infile, outfile, force=False, threads=None, extra=None)[source]

Universal converter used by the standalone

from bioconvert import Bioconvert
c = Bioconvert("test.fastq", "test.fasta", threads=4, force=True)

constructor

Parameters:
  • infile (str) -- The path of the input file.

  • outfile (str) -- The path of the output file.

  • force (bool) -- overwrite the output file if it already exists; otherwise, an error is raised

7.1.4. Decorators

Provides a general tool to perform pre/post compression

compressor(func)[source]

Decompress/compress input file without pipes

Does not use pipe: we decompress and compress back the input file. The advantage is that it should work for any files (even very large).

This decorator should be used by methods that use pure Python code.
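The decompress, convert, recompress cycle can be sketched with the standard gzip module. gz_roundtrip is a hypothetical decorator written for this example; it works on a temporary decompressed copy and handles only .gz, both of which are simplifying assumptions rather than a description of compressor itself.

```python
import functools
import gzip
import os
import shutil
import tempfile


def gz_roundtrip(func):
    """Hypothetical decorator: func(infile, outfile) only ever sees plain files."""

    @functools.wraps(func)
    def wrapper(infile, outfile):
        with tempfile.TemporaryDirectory() as tmp:
            plain_in = infile
            if infile.endswith(".gz"):
                # Decompress to a temporary file instead of streaming via a pipe.
                plain_in = os.path.join(tmp, os.path.basename(infile)[:-3])
                with gzip.open(infile, "rb") as src, open(plain_in, "wb") as dst:
                    shutil.copyfileobj(src, dst)
            plain_out = outfile[:-3] if outfile.endswith(".gz") else outfile
            func(plain_in, plain_out)
            if outfile.endswith(".gz"):
                # Compress the result and drop the uncompressed copy.
                with open(plain_out, "rb") as src, gzip.open(outfile, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                os.remove(plain_out)

    return wrapper


@gz_roundtrip
def upper_case(infile, outfile):
    # Pure Python "conversion" that knows nothing about compression.
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write(src.read().upper())


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "reads.txt.gz")
dst_path = os.path.join(workdir, "reads.out.gz")
with gzip.open(src_path, "wt") as handle:
    handle.write("acgt")

upper_case(src_path, dst_path)
with gzip.open(dst_path, "rt") as handle:
    converted = handle.read()
print(converted)
```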

in_gz(func)[source]

Marks a function as accepting gzipped input.

make_in_gz_tester(converter)[source]

Generates a function testing whether a conversion method of converter has the in_gz tag.

out_compressor(func)[source]

Compress output file without pipes

This decorator should be used by methods that use pure Python code.

requires(external_binary=None, python_library=None, external_binaries=None, python_libraries=None)[source]
Parameters:
  • external_binary -- a system binary required for the method

  • python_library -- a python library required for the method

  • external_binaries -- an array of system binaries required for the method

  • python_libraries -- an array of python libraries required for the method

Returns:

requires_nothing(func)[source]

Marks a function as not needing dependencies.
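A dependency check of the kind requires describes can be sketched with shutil.which for system binaries and importlib.util.find_spec for Python libraries. requires_sketch and the is_available attribute are hypothetical; how bioconvert records and evaluates dependencies is not shown here.

```python
import importlib.util
import shutil


def requires_sketch(external_binaries=(), python_libraries=()):
    """Hypothetical decorator attaching an availability check to a function."""

    def decorator(func):
        def is_available():
            # A binary is available if it is found on the PATH.
            binaries_ok = all(
                shutil.which(binary) is not None for binary in external_binaries
            )
            # A top-level library is available if an import spec can be found.
            libraries_ok = all(
                importlib.util.find_spec(library) is not None
                for library in python_libraries
            )
            return binaries_ok and libraries_ok

        func.is_available = is_available
        return func

    return decorator


@requires_sketch(python_libraries=["json"])
def method_json():
    return "ok"


@requires_sketch(external_binaries=["no-such-binary-xyz"])
def method_missing():
    return "never runs"


print(method_json.is_available(), method_missing.is_available())
```

A function with no dependencies corresponds to calling the decorator with both lists empty, in which case the check is trivially true.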

7.1.5. Downloader

Download singularity image

7.1.6. Extensions

List of formats and associated extensions

class AttrDict(**kwargs)[source]

Copy from easydev package.

update(content)[source]

See class/constructor documentation for details

Parameters:

content (dict) -- a valid dictionary

extensions = {'abi': ['abi', 'ab1'], 'agp': ['agp'], 'bam': ['bam'], 'bcf': ['bcf'], 'bed': ['bed'], 'bedgraph': ['bedgraph', 'bg'], 'bigbed': ['bb', 'bigbed'], 'bigwig': ['bigwig', 'bw'], 'bplink': ['bplink'], 'bz2': ['bz2'], 'cdao': ['cdao'], 'clustal': ['clustal', 'aln', 'clw'], 'cov': ['cov'], 'cram': ['cram'], 'csv': ['csv'], 'dsrc': ['dsrc'], 'embl': ['embl'], 'ena': ['ena'], 'faa': ['faa', 'mpfa', 'aa'], 'fast5': ['fast5'], 'fasta': ['fasta', 'fa', 'fst'], 'fastq': ['fastq', 'fq'], 'genbank': ['genbank', 'gbk', 'gb'], 'gfa': ['gfa'], 'gff2': ['gff'], 'gff3': ['gff3'], 'gtf': ['gtf'], 'gz': ['gz'], 'json': ['json'], 'maf': ['maf'], 'newick': ['newick', 'nw', 'nhx', 'nwk'], 'nexus': ['nexus', 'nx', 'nex', 'nxs'], 'ods': ['ods'], 'paf': ['paf'], 'pdb': ['pdb'], 'phylip': ['phy', 'ph', 'phylip'], 'phyloxml': ['phyloxml', 'xml'], 'plink': ['plink'], 'pod5': ['pod5'], 'qual': ['qual'], 'sam': ['sam'], 'scf': ['scf'], 'sra': ['sra'], 'stockholm': ['sto', 'sth', 'stk', 'stockholm'], 'tsv': ['tsv'], 'twobit': ['2bit'], 'vcf': ['vcf'], 'wig': ['wig'], 'wiggle': ['wig', 'wiggle'], 'xls': ['xls'], 'xlsx': ['xlsx'], 'xmfa': ['xmfa'], 'yaml': ['yaml', 'YAML']}

List of formats and their extensions included in Bioconvert
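Since the dictionary maps each format to its extensions, looking up a format from an extension requires inverting it. The sketch below does so on a small excerpt copied from the listing above; note that one extension ('wig') belongs to two formats, so the inverted values are lists.

```python
from collections import defaultdict

# Excerpt of the mapping shown above (format -> extensions).
extensions_excerpt = {
    "fasta": ["fasta", "fa", "fst"],
    "fastq": ["fastq", "fq"],
    "wig": ["wig"],
    "wiggle": ["wig", "wiggle"],
}

# Invert it: extension -> list of formats using that extension.
by_extension = defaultdict(list)
for fmt, exts in extensions_excerpt.items():
    for ext in exts:
        by_extension[ext].append(fmt)

print(by_extension["fq"])
print(by_extension["wig"])
```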

7.1.7. Graph

Network tools to manipulate the graph of conversion

create_graph(filename, layout='dot', use_singularity=False, color_for_disabled_converter='red', include_subgraph=False)[source]
Parameters:

filename -- should end in .png or .svg or .dot

If the extension is .dot, only the dot file is created, without annotations. This is useful if you have issues installing graphviz. In that case, under Linux you could use our singularity container (see github.com/cokelaer/graphviz4all).

create_graph_for_cytoscape(all_converter=False)[source]
Parameters:

all_converter -- use all converters or only the ones available in the current installation

Returns:

7.1.8. Registry

Main bioconvert registry that fetches automatically the relevant converter

class Registry[source]

class to centralise information about available conversions

from bioconvert.core.registry import Registry
r = Registry()
r.conversion_exists("BAM", "BED")
r.info()  # returns number of available methods for each converter

conv_class = r[(".bam", ".bed")]
converter = conv_class(input_file, output_file)
converter.convert()

conversion_exists(input_fmt, output_fmt, allow_indirect=False)[source]
Parameters:
  • input_fmt (str) -- the input format

  • output_fmt (str) -- the output format

  • allow_indirect (boolean) -- whether to count indirect conversions

Returns:

True if a converter that transforms input_fmt into output_fmt exists

Return type:

boolean

conversion_path(input_fmt, output_fmt)[source]

Return a list of conversion steps to get from the input format to the output format

Parameters:

Each step in the list is a pair of formats.
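Finding such a list of steps amounts to a shortest-path search in the conversion graph. The breadth-first sketch below illustrates the idea on a toy list of conversions; find_path is a hypothetical function, the conversions are illustrative, and the registry may use a different algorithm or library.

```python
from collections import deque


def find_path(conversions, input_fmt, output_fmt):
    """Return a list of (in_fmt, out_fmt) steps, or None if unreachable."""
    graph = {}
    for src, dst in conversions:
        graph.setdefault(src, []).append(dst)

    # Breadth-first search, remembering how each format was reached.
    queue = deque([input_fmt])
    previous = {input_fmt: None}
    while queue:
        fmt = queue.popleft()
        if fmt == output_fmt:
            break
        for nxt in graph.get(fmt, []):
            if nxt not in previous:
                previous[nxt] = fmt
                queue.append(nxt)
    if output_fmt not in previous:
        return None

    # Walk back from the target to rebuild the steps in order.
    steps = []
    fmt = output_fmt
    while previous[fmt] is not None:
        steps.append((previous[fmt], fmt))
        fmt = previous[fmt]
    return steps[::-1]


conversions = [("BAM", "SAM"), ("SAM", "FASTQ"), ("FASTQ", "FASTA"), ("BAM", "BED")]
print(find_path(conversions, "BAM", "FASTA"))
```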

get_all_conversions()[source]
Returns:

a generator that iterates over all conversions and their availability; each conversion is encoded as a tuple of 2 strings (input format, output format)

Return type:

generator (input format, output format, status)

get_conversions()[source]
Returns:

a generator that iterates over all available conversions; each conversion is encoded as a tuple of 2 strings (input format, output format)

Return type:

generator

get_conversions_from_ext()[source]
Returns:

a generator that iterates over all available conversions; each conversion is encoded as a tuple of 2 strings (input extension, output extension)

Return type:

generator

get_converters_names()[source]
Returns:

a generator yielding the name of each converter, taken from its ConvBase subclass

Return type:

generator

get_ext(ext_pair)[source]

Copy the registry into a dict whose values are lists, so that a single key can map to several converters. A key then gives access to all converters able to convert from the input extension to the output extension.

Parameters:

ext_pair (tuple of 2 strings) -- the input extension, the output extension

Returns:

list of objects that are subclasses of ConvBase

iter_converters(allow_indirect: bool = False)[source]
Parameters:

allow_indirect (bool) -- also return indirect conversion

Returns:

a generator to iterate over (in_fmt, out_fmt, converter class when direct, path when indirect)

Return type:

a generator

set_ext(ext_pair, convertor)[source]

Register a new converter for a pair of input and output extensions. Converters are stored in a list, so several converters can be registered for one ext_pair.

Parameters:
  • ext_pair (tuple) -- tuple containing the input extensions and the output extensions, e.g. (("fastq",), ("fasta",))

  • convertor (list of ConvBase object) -- the converter that handles the conversion from input_ext -> output_ext

7.1.9. Utils

misc utility functions

class TempFile(suffix='', dir=None)[source]

A small wrapper around tempfile.NamedTemporaryFile function

f = TempFile(suffix="csv")
f.name
f.delete() # alias to delete=False and close() calls

Copy from easydev package

class Timer(times)[source]

Timer working with the with statement

Copy from easydev package.
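A timer usable in a with statement can be sketched as a small context manager. TimerSketch is a hypothetical class; treating times as a list that receives one elapsed duration per block is an assumption about how the argument is used.

```python
import time


class TimerSketch:
    """Hypothetical context manager appending elapsed seconds to a list."""

    def __init__(self, times):
        self.times = times

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Record the duration even if the block raised an exception.
        self.times.append(time.perf_counter() - self.start)
        return False  # do not swallow exceptions


times = []
with TimerSketch(times):
    sum(range(10000))
with TimerSketch(times):
    sorted(range(10000))
print(len(times))
```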

generate_outfile_name(infile, out_extension)[source]

Simple utility to replace the file extension with the given one.

Parameters:
  • infile (str) -- the path to the input file

  • out_extension (str) -- Desired extension

Returns:

The file path with the given extension

Return type:

str
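Replacing an extension can be sketched with os.path.splitext. swap_extension is a hypothetical helper; whether the real function expects the extension with or without a leading dot is not stated here, so this example assumes it is given without one.

```python
import os


def swap_extension(infile, out_extension):
    # Drop the last extension and append the requested one.
    root, _ = os.path.splitext(infile)
    return f"{root}.{out_extension}"


print(swap_extension("data/test.fastq", "fasta"))
```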

get_extension(filename, remove_compression=False)[source]

Return extension of a filename

>>> get_extension("test.fastq")
fastq
>>> get_extension("test.fastq.gz")
fastq

get_format_from_extension(extension)[source]

Get the format from the extension.

Parameters:

extension -- the extension

Returns:

the corresponding format

Return type:

str

md5(fname, chunk=65536)[source]

Return the MD5 checksum of a file
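The chunk parameter suggests the file is read in blocks so that large files need not fit in memory. A minimal sketch of that approach with hashlib follows; md5_sketch is a hypothetical name and the temporary file exists only for the demonstration.

```python
import hashlib
import os
import tempfile


def md5_sketch(fname, chunk=65536):
    """Compute an MD5 hex digest by reading the file chunk by chunk."""
    digest = hashlib.md5()
    with open(fname, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


payload = b"ACGT" * 100000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

checksum = md5_sketch(tmp.name, chunk=1024)
os.remove(tmp.name)
print(checksum)
```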