9. Glossary

Note that formats mentionned here below have dedicated description in the Formats section.

ABI

File format produced by ABI sequencing machines. Contains the trace data which includes probabilities of the four nucleotides. See the ABI format page for details.

ASQG

The ASQG format describes an assembly graph. Each line is a tab-delimited record. The first field in each record describes the record type. See the ASQG page for details.

BAI

The index file related to file generated in the BAM format. (This is a non-standard file type.) See the BAI page for details.

BAM

Binary version of the Sequence Alignment Map (SAM) format. See the BAM format page for details.

BCF

Binary version of the Variant Call Format (VCF). See BCF page for details.

BCL

BCL is the raw format used by Illumina sequencers. See the BCL format page for details.

BED

BEDGRAPH/BED format is line-oriented and allows display of continuous-valued data. Similar to WIG format. See the BED format page for details.

BED3

Variants of the BED format with 4 columns storing the track name, start and end positions and values. See the BED4 format page for details.

BED4

Variants of the BED format with 4 columns storing the track name, start and end positions and values. See the BED4 format page for details.

BEDGRAPH

BEDGRAPH/BED format is line-oriented and allows display of continuous-valued data. Similar to WIG format. See the BED format page for details.

BIGBED

An indexed binary version of a BED file See BIGBED page for details.

BIGWIG

Indexed binary version of the Wiggle format. See BIGWIG page for details.

Binary version of the PlINK forat used for analyzing genotypic data for Genome-wide Association Studies (GWAS). See PLINK binary files (BED/BIM/FAM) page for details.

BZ2

bzip2 is a file compression program that uses the Burrows–Wheeler algorithm. Extension is usually .bz2 See BZ2 page for details.

CLUSTAL

The alignment format of Clustal X and Clustal W. See CLUSTAL page for details.

COV

A bioconvert format to store coverage in the form of a 3 column tab-tabulated file. See COV page for details.

CRAM

A more compact version of BAM files used to store Sequence Alignment Map (SAM) format. See CRAM page for details.

CSV

A comma-separated values format is a delimited text file that uses a comma to separate values. See CSV format page for details.

DSRC

A compression tool dedicated to FastQ files See DSRC page for details.

EMBL

EMBL Flat File Format. See EMBL page for details.

FAA

FASTA-formatted sequence files containing amino acid sequences See FAA page for details.

FASTA

FASTA-formatted sequence files contain either nucleic acid sequence (such as DNA) or protein sequence information. FASTA files can also store multiple sequences in a single file. See FASTA page for details.

FASTQ

FASTQ-formatted sequence files are used to represent high-throughput sequencing data, where each read is described by a name, its sequence, and its qualities. See FastQ page for details.

GENBANK

GenBank Flat File Format. See GENBANK page for details.

GFA

Graphical Fragment Assembly format. https://github.com/GFA-spec/GFA-spec

GFF2

General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. See GTF page for details.

GFF3

General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. http://genome.ucsc.edu/FAQ/FAQformat#format3 See GTF page for details.

GZ

gzip is a file compression program based on the DEFLATE algorithm. See GZ page for details.

JSON

A human-readable data serialization language commonly used in configuration files. See JSON page for details.

MAF

A human-readable multiple alignment format. See MAF (Multiple Alignement Format) page for details.

NEWICK

Plain text minimal format used to store phylogenetic tree. See NEWICK page for details.

NEXUS

Plain text minimal format used to store multiple alignment and phylogenetic trees. See NEXUS page for details.

PAF

PAF is a text format describing the approximate mapping positions between two set of sequences.

PHYLIP

Plain text format to store a multiple sequence alignment. See PHYLIP page for details.

PHYLOXML

XML format to store a multiple sequence alignment. See PHYLOXML page for details.

Format used for analyzing genotypic data for Genome-wide Association Studies (GWAS). See PLINK flat files (MAP/PED) page for details.

QUAL

Sequence of qualities associated with a sequence of nucleotides. Associated with FastA file, the original FastQ file can be built back. See QUAL page for details.

SAM

Sequence Alignment Map is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. See SAM page for details.

SCF

Standard Chromatogram Format, a binary chromatogram format described in Staden package documentation SCF file format.

SRA

The Sequence Read Archive (SRA) is a website that stores sequencing data at https://www.ncbi.nlm.nih.gov/sra It is not a format per se. See SRA page for details.

STOCKHOLM

Stockholm format is a multiple sequence alignment format used to store multiple sequence alignment. See STOCKHOLM page for details.

TSV

A tab-separated values format is a delimited text file that uses a tab character to separate values. See TSV format page for details.

TWOBIT

2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself. See TWOBIT format page for details.

VCF

Variant Call Format (VCF) is a flexible and extendable format for storing variation in sequences such as single nucleotide variants, insertions/deletions, copy number variants and structural variants. See VCF page for details.

WIG

Synonym for the wiggle (WIG) format. See WIG.

WIGGLE

The wiggle (WIG) format stores dense, continuous data such as GC percent, probability scores, and transcriptome data. See WIG page for details.

XLS

Spreadsheet file format (Microsoft Excel file format). See XLS page for details.

XLSX

Spreadsheet file format defined in the Office Open XML specification. See XLSX page for details.

XMFA

TODO

YAML

A human-readable data serialization language commonly used in configuration files. See https://en.wikipedia.org/wiki/YAML See YAML page for details.