.. _developer_guide:

Developer guide
===============

.. contents::

Quick start
-----------

As a developer, assuming you have a valid environment and have installed
Bioconvert (:ref:`install_dev`), go to the bioconvert directory and run the
``bioconvert_init`` command for the input and output formats you wish to add
(here we want to convert format A to B). You may also just copy an existing
file::

    cd bioconvert
    bioconvert_init -i A -o B > A2B.py

See the :ref:`add_converter` section for details. Edit the file and update the
method that performs the conversion by adding the relevant code (either Python
or external tools). Once done, please

1. add a test file in the ``./test`` directory (see :ref:`add_test`)
2. add the relevant data test files in the ``./bioconvert/test/data/``
   directory (see :ref:`add_test_file`)
3. update the documentation as explained in the :ref:`update_doc` section:

   1. add the module in doc/ref_converters.rst in the autosummary section
   2. add the A2B in the README.rst

4. add a CI action in .github/workflows named after the conversion (A2B.yml)

Note also that a converter (a Python module, e.g., fastq2fasta) may include
several methods, and it is quite straightforward to add a new method
(:ref:`add_method`). The methods can later be compared thanks to our
benchmarking framework.

If this is a new format, you may also update the glossary.rst file in the
documentation.

.. _install_dev:

Installation for developers
---------------------------

To develop on `bioconvert` it is highly recommended to install `bioconvert` in
a virtualenv ::

    mkdir bioconvert
    cd bioconvert
    python3.7 -m venv py37
    source py37/bin/activate

And clone the bioconvert project ::

    mkdir src
    cd src
    git clone https://github.com/bioconvert/bioconvert.git
    cd bioconvert

We need some extra requirements to run the tests or build the doc. To install
these requirements ::

    pip install -e .[testing]

.. warning::
    The extra requirements try to install `pygraphviz` so you need to install
    `graphviz` on your computer. If you are running a Debian-based
    distribution, you have to install the `libcgraph6`, `libgraphviz-dev` and
    `graphviz` packages.

.. note::
    You may need to install extra tools to run some conversions. The
    requirements_tools.txt file lists the extra tools available through conda.

.. _add_converter:

How to add a new conversion
---------------------------

Officially, **Bioconvert** supports one-to-one conversions only (from one
format to another format). See the note below about :ref:`multi_conversion`.

Let us imagine that we want to include a new format conversion from
:term:`FastQ` to :term:`FastA` format.

First, you need to add a new file in the ``./bioconvert`` directory called::

    fastq2fasta.py

Please note that the name is **all in lowercase** and that we concatenate the
input format name, the character **2** and the output format name. Sometimes a
format already includes the character 2 in its name (e.g. bz2), which may be
confusing. For now, just follow the previous convention, meaning that the
character 2 is duplicated if needed (e.g., for bz2 to gz format, use bz22gz).

As for the class name, we use **all uppercase**. In the newly created file
(**fastq2fasta.py**) you can (i) copy / paste the content of an existing
converter, (ii) use the **bioconvert_init** executable (see later), or (iii)
copy / paste the following code:

.. code-block:: python
    :linenos:

    """Convert :term:`FastQ` format to :term:`FastA` formats"""
    from bioconvert import ConvBase
    from bioconvert.core.decorators import requires, requires_nothing

    __all__ = ["FASTQ2FASTA"]


    class FASTQ2FASTA(ConvBase):
        """

        """
        _default_method = "v1"

        def __init__(self, infile, outfile):
            """
            :param str infile: information
            :param str outfile: information
            """
            super().__init__(infile, outfile)

        @requires(external_library="awk")
        def _method_v1(self, *args, **kwargs):
            # Conversion is made here.
            # You can use self.infile and self.outfile
            # If you use an external command, you can use self.execute:
            self.execute(cmd)

        @requires_nothing
        def _method_v2(self, *args, **kwargs):
            # another method
            pass

On line 1, please explain the conversion using the terms available in the
:ref:`glossary` (``./doc/glossary.rst`` file). If not available, you may edit
the glossary.rst file to add a quick description of the formats.

.. warning::
    If the format is not already included in **Bioconvert**, you will need to
    update the file core/extensions.py to add the format name and its possible
    extensions.

On lines 2 and 3, just import the common class and the decorators used below.

On line 8, name the class after your input and output formats; again include
the character 2 between the input and output formats. Usually, we use uppercase
for the formats since most format names are acronyms. If the input or output
format exists already in **Bioconvert**, please follow the existing
conventions.

On line 14, we add the constructor.

On line 22, we add a method to perform the conversion named **_method_v1**.
Here, the prefix **_method_** is compulsory: it tells **Bioconvert** that it is
a possible conversion to include in the user interface. This is also where you
will add your code to perform the conversion. The suffix (here **v1**) is the
name of the conversion method. That way you can add as many conversion methods
as you need (e.g. on line 29, we implemented another method called **v2**).

Lines 21 and 28 show the decorators that tell **bioconvert** which external
tools are required. See the :ref:`decorator` section.

Since several methods can be implemented, we need to define a default method
(line 12; here **v1**).

In order to simplify the creation of new converters, you can also use the
standalone **bioconvert_init**. Example::

    $ bioconvert_init -i fastq -o fasta > fastq2fasta.py

Of course, you will need to edit the file to add the conversion itself in the
appropriate method (e.g. _method_v1).
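
The ``_method_`` prefix convention described above can be illustrated with a
minimal, self-contained sketch. It is not the actual **Bioconvert**
implementation: ``ConverterSketch``, ``list_methods`` and ``module_name`` are
hypothetical names introduced here only to show how a base class could discover
conversion methods from their prefix, fall back on ``_default_method``, and how
a file name is derived from two format names.

.. code-block:: python

    PREFIX = "_method_"


    class ConverterSketch:
        """Hypothetical stand-in for a converter base class (not ConvBase)."""

        _default_method = None

        def __init__(self, infile, outfile):
            self.infile = infile
            self.outfile = outfile

        @classmethod
        def list_methods(cls):
            """Return the sorted suffixes of all callables named _method_<suffix>."""
            return sorted(
                name[len(PREFIX):]
                for name in dir(cls)
                if name.startswith(PREFIX) and callable(getattr(cls, name))
            )

        def __call__(self, method=None, **kwargs):
            """Run the requested method, or the default one if none is given."""
            name = method or self._default_method
            if name not in self.list_methods():
                raise ValueError("unknown method: {}".format(name))
            return getattr(self, PREFIX + name)(**kwargs)


    class FASTQ2FASTA(ConverterSketch):
        """Toy converter: the methods only report what they would do."""

        _default_method = "v1"

        def _method_v1(self, *args, **kwargs):
            return "v1: {} -> {}".format(self.infile, self.outfile)

        def _method_v2(self, *args, **kwargs):
            return "v2: {} -> {}".format(self.infile, self.outfile)


    def module_name(informat, outformat):
        """Build a converter file name: lowercase formats joined by '2'."""
        return "{}2{}.py".format(informat.lower(), outformat.lower())


    converter = FASTQ2FASTA("reads.fastq", "reads.fasta")
    print(FASTQ2FASTA.list_methods())  # names discovered from the prefix
    print(converter())                 # runs the default method
    print(converter(method="v2"))      # runs a method selected by name
    print(module_name("BZ2", "GZ"))    # the 2 is duplicated, as for bz22gz

Because discovery relies on the name alone, adding a ``_method_v3`` to the
subclass of this sketch would make ``v3`` selectable without any other change,
which is the behaviour the prefix convention is meant to provide.
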

If you need to include extra arguments, such as a reference file, you may add
extra arguments, although this is not yet part of the official **Bioconvert**
API. See for instance the :class:`~bioconvert.sam2cram.SAM2CRAM` converter.

.. _multi_conversion:

One-to-many and many-to-one conversions
---------------------------------------

The one-to-many and many-to-one conversions are now implemented in
**Bioconvert**. We have only two instances so far, namely
:class:`bioconvert.fastq2fasta_qual` and :class:`bioconvert.fasta_qual2fastq`.
We have no instances of many-to-many so far. The purpose of the underscore
character is to indicate an **and** connection. So you need QUAL *and* FASTA to
create a FASTQ file.

For developers, we ask that the input or output formats be sorted
alphabetically to ease the user experience.

.. _add_method:

How to add a new method to an existing converter
------------------------------------------------

As shown above, use this code and add it to the relevant file in the
``./bioconvert`` directory::

    def _method_UniqueName(self, *args, **kwargs):
        # from kwargs, you can use any kind of arguments.
        # threads is an example, reference, another example.
        # Your code here below
        pass

Then, it will be available in the class and in **bioconvert** automatically;
the **bioconvert** executable should show the name of your new method in the
help message.

In order to add your new method, you can add:

* Pure Python code
* Python code that relies on a third-party library. If so, you may use:

  * Python libraries available on pypi. Please add the library name to the
    requirements.txt
  * if the Python library requires lots of compilation and is available on
    bioconda, you may add the library name to the requirements_tools.txt
    instead.

* Third-party tools available on **bioconda** (e.g., squizz, seqtk, etc) that
  you can add to the requirements_tools.txt
* Perl and Go code are also accepted. If so, use the self.install_tool(NAME)
  and add a script in ``./misc/install_NAME.sh``

.. _decorator:

Decorators
----------

Decorators defined in ``bioconvert/core/decorators.py`` can be used to "flag"
or "modify" conversion methods:

- ``@in_gz`` can be used to indicate that the method is able to transparently
  handle input files that are compressed in ``.gz`` format. This is done by
  adding an ``in_gz`` attribute (set to ``True``) to the method.

- ``@compressor`` will wrap the method in code that handles input
  decompression from ``.gz`` format and output compression to ``.gz``,
  ``.bz2`` or ``.dsrc``. This automatically applies ``@in_gz``.

  Example:

  ::

      @compressor
      def _method_noncompressor(self, *args, **kwargs):
          """This method does not handle compressed input or output by itself."""
          pass
      # The decorator transforms the method that now handles compressed
      # input and output; the method has an in_gz attribute (which is set to True)

- ``@out_compressor`` will wrap the method in code that handles output
  compression to ``.gz``, ``.bz2`` or ``.dsrc``. It is intended to be used on
  methods that already handle compressed input transparently, and therefore do
  not need the input decompression provided by ``@compressor``. Typically, one
  would also apply ``@in_gz`` to such methods. In that case, ``@in_gz`` should
  be applied "on top" of ``@out_compressor``. The reason is that decorators
  closest to the function are applied first, and applying another decorator on
  top of ``@in_gz`` would typically not preserve the ``in_gz`` attribute.

  Example:

  ::

      @in_gz
      @out_compressor
      def _method_incompressor(self, *args, **kwargs):
          """This method already handles compressed .gz input."""
          pass
      # This results in a method that handles compressed input and output
      # This method is further modified to have an in_gz attribute
      # (which is set to True)

Another **bioconvert** decorator is called **requires**. It should be used to
annotate a method with the type of tools it needs to work.
It is important to decorate all methods with the **requires** decorator so
that the user interface can tell which tools are properly installed and which
are not. You can use 4 arguments as explained in
:mod:`bioconvert.core.decorators`:

.. code-block:: python
    :linenos:

    @requires_nothing
    def _method_python(self, *args, **kwargs):
        # a pure Python code does not require extra libraries
        with open(self.outfile, "w") as fasta, open(self.infile, "r") as fastq:
            for (name, seq, _) in FASTQ2FASTA.readfq(fastq):
                fasta.write(">{}\n{}\n".format(name, seq))

    @requires(python_library="mappy")
    def _method_mappy(self, *args, **kwargs):
        from mappy import fastx_read
        with open(self.outfile, "w") as fasta:
            for (name, seq, _) in fastx_read(self.infile):
                fasta.write(">{}\n{}\n".format(name, seq))

    @requires("awk")
    def _method_awk(self, *args, **kwargs):
        # Note1: since we use .format, we need to escape the { and } characters
        # Note2: the \n need to be escaped for Popen to work
        awkcmd = """awk '{{printf(">%s\\n",substr($0,2));}}' """
        cmd = "{} {} > {}".format(awkcmd, self.infile, self.outfile)
        self.execute(cmd)

On line 1, we decorate the method with the
:func:`~bioconvert.core.decorators.requires_nothing` decorator because the
method is implemented in pure Python.

On line 8, we decorate the method with the
:func:`~bioconvert.core.decorators.requires` decorator to inform
**bioconvert** that the method relies on the external Python library called
mappy.

On line 15, we decorate the method with the
:func:`~bioconvert.core.decorators.requires` decorator to inform
**bioconvert** that the method relies on an external tool called awk. In
theory, you should write::

    @requires(external_library="awk")

but ``external_library`` is the first optional argument so it can be omitted.
If several libraries are required, you can use::

    @requires(external_libraries=["awk", ""])

or::

    @requires(python_libraries=["scipy", "pandas"])

.. note::
    For more general explanations about decorators, see
    https://stackoverflow.com/a/1594484/1878788.

.. _add_test:

How to add a test
-----------------

Following the example from above (fastq2fasta), we need to add a test file. To
do so, go to the ``./test`` directory and add a file named
``test_fastq2fasta.py``.

.. code-block:: python
    :linenos:

    import pytest
    from bioconvert.fastq2fasta import FASTQ2FASTA
    from easydev import TempFile, md5
    from . import test_dir


    @pytest.mark.parametrize("method", FASTQ2FASTA.available_methods)
    def test_fastq2fasta(method):
        # your code here
        # you will need data, for instance "mydata.fastq" and "mydata.fasta".
        # Put them in the test data directory (here data/fastq and data/fasta);
        # you can then use:
        infile = f"{test_dir}/data/fastq/test_mydata.fastq"
        expected_outfile = f"{test_dir}/data/fasta/test_mydata.fasta"
        with TempFile(suffix=".fasta") as tempfile:
            converter = FASTQ2FASTA(infile, tempfile.name)
            converter(method=method)

            # Check that the output is correct with a checksum
            assert md5(tempfile.name) == md5(expected_outfile)

In **Bioconvert**, we use **pytest** as our test framework. In principle, we
need one test function per method found in the converter. Here on line 7, the
pytest.mark.parametrize function runs the same test once for each method
available in the converter. That way, the test file remains short and the test
code does not need to be duplicated.

.. _add_test_file:

How to add a test file
----------------------

Files used for testing should be added in
``./bioconvert/test/data/ext/converter_name.ext``.

How to locally run the tests
----------------------------

Go to the source directory of **Bioconvert**.
If not already done, install all packages required for testing::

    cd bioconvert
    pip3 install .[testing]

Then, run the tests using::

    pytest test/ -v

Or, to run a specific test file, for example for your new converter
fastq2fasta::

    pytest test/test_fastq2fasta.py -v

or ::

    pytest -v -k test_fastq2fasta

How to benchmark your new method vs others
------------------------------------------

To compare the methods of a converter, use the ``Benchmark`` class::

    from bioconvert import Benchmark
    from bioconvert.fastq2fasta import FASTQ2FASTA
    # infile and outfile are the paths of your input and output files
    converter = FASTQ2FASTA(infile, outfile)
    b = Benchmark(converter)
    b.plot()

You can also use the **bioconvert** standalone with the -b option.

.. _update_doc:

How to add your new converter to the main documentation?
--------------------------------------------------------

Edit the doc/ref_converters.rst and add this code (replacing A2B by your
conversion)::

    .. automodule:: bioconvert.A2B
        :members:
        :synopsis:
        :private-members:

and update the autosummary section::

    .. autosummary::

        bioconvert.A2B

PEP8 and conventions
--------------------

In order to write your Python code, use the PEP8 convention as much as
possible. Follow the conventions used in the code. For instance, ::

    class A():
        """Some documentation"""

        def __init__(self):
            """some doc"""
            pass

        def another_method(self):
            """some doc"""
            c = 1 + 2


    class B():
        """Another class"""

        def __init__(self, *args, **kwargs):
            """some doc"""
            pass


    def AFunction(x):
        """some doc"""
        return x

- 2 blank lines between classes and functions
- 1 blank line between methods
- spaces around operators (e.g. =, +)
- Try to have 80 characters max on one line
- Add documentation in triple quotes

Since v0.5.2, we apply black on the different Python modules.

Requirements files
------------------

- requirements.txt : should contain the packages to be retrieved from Pypi
  only. Those are downloaded and installed (if missing) when using
  **python setup.py install**
- environment_rtd.yml : do not touch. Simple file for readthedocs
- readthedocs.yml : all conda and pip dependencies to run the example and
  build the doc
- environment.yml : a conda list of all dependencies

How to update bioconvert on bioconda
------------------------------------

Fork the bioconda-recipes github repository and clone it locally. Follow the
instructions on https://bioconda.github.io/contributing.html

In a nutshell, once bioconda-utils is installed, clone your fork::

    git clone YOURFORKED_REPOSITORY
    cd bioconda-recipes

Edit the bioconvert recipe and update its contents. If a new version exists on
pypi, you need to change the md5sum in ``recipes/bioconvert/meta.yaml``.

Check the recipes::

    bioconda-utils build recipes/ config.yml --packages bioconvert

Finally, commit and create a PR::

    #git push -u origin my-recipe
    git commit .
    git push

Sphinx Documentation
--------------------

In order to update the documentation, go to the *./doc* directory and update
any of the .rst files. Then, for Linux users, just type::

    make html

Regarding the :ref:`formats` page, we provide a simple ontology with 3 entries:
Type, Format and Status. Please choose one of the following values:

- Type: sequence, assembly, alignment, other, index, variant, database,
  compression
- Format: binary, human-readable
- Status: deprecated, included, not included

Docker
------

In order to build the Docker image, use this command::

    docker build .

The Dockerfile found next to setup.py is self-contained and has been tested for
v0.5.2; it uses the spec-file.txt that was generated in a conda environment
using::

    conda list --explicit