
File Formats

FASTQ is a text-based format that stores biological sequences (usually nucleotide sequences) together with their corresponding per-base quality scores.

Each sequence entry contains four lines:

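  1. Sequence identifier (starts with @), optionally followed by a description
  2. Raw sequence letters
  3. Plus sign (+), optionally followed by the same identifier and description
  4. Quality scores encoded as ASCII characters, one character per sequence letter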

Fields of an Illumina-style sequence identifier line:

Element | Requirements | Description
@ | @ | Each sequence identifier line starts with @
<instrument> | Characters allowed: a–z, A–Z, 0–9 and underscore | Instrument ID
<run number> | Numerical | Run number on instrument
<flowcell ID> | Characters allowed: a–z, A–Z, 0–9 | Flowcell ID
<lane> | Numerical | Lane number
<tile> | Numerical | Tile number
<x_pos> | Numerical | X coordinate of cluster
<y_pos> | Numerical | Y coordinate of cluster
<read> | Numerical | Read number: 1 for a single read or read 1 of a pair, 2 for read 2 of a paired-end run
<is filtered> | Y or N | Y if the read is filtered (did not pass), N otherwise
<control number> | Numerical | 0 when none of the control bits are on, otherwise an even number. On HiSeq X systems, control specification is not performed and this number is always 0
<sample number> | Numerical | Sample number from sample sheet

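
The fields above can be pulled apart with a few string operations. The following is a minimal sketch, not a reference parser. It assumes the common Illumina layout in which the first seven fields are colon-separated, followed by a single space and the remaining four colon-separated fields; that layout is not spelled out in the table. The class name `IlluminaId`, the function name `parse_illumina_id`, and the sample identifier are illustrative only.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    sample_number: int


def parse_illumina_id(line: str) -> IlluminaId:
    """Split one identifier line into its fields (assumed layout, see text)."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # Assumption: location fields and read flags are separated by one space.
    location, _, flags = line[1:].strip().partition(" ")
    loc = location.split(":")
    flg = flags.split(":")
    if len(loc) != 7 or len(flg) != 4:
        raise ValueError("unexpected number of fields in identifier line")
    if flg[1] not in ("Y", "N"):
        raise ValueError("<is filtered> must be Y or N")
    return IlluminaId(
        instrument=loc[0],
        run_number=int(loc[1]),
        flowcell_id=loc[2],
        lane=int(loc[3]),
        tile=int(loc[4]),
        x_pos=int(loc[5]),
        y_pos=int(loc[6]),
        read=int(flg[0]),
        is_filtered=(flg[1] == "Y"),
        control_number=int(flg[2]),
        sample_number=int(flg[3]),
    )


# Illustrative identifier, not taken from a real run.
print(parse_illumina_id("@SIM:1:FCX:1:15:6329:1045 1:N:0:2"))
```

Some pipelines write an index sequence instead of a sample number in the last field; in that case the final `int(...)` conversion would need to be relaxed.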
Example.fastq
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
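
To make the four-line layout concrete, here is a small sketch that reads records and converts the quality string to numeric scores. It assumes strict four-line records (no wrapped sequence or quality lines) and the Phred+33 ASCII offset, which is the usual encoding but is not stated above. The function names `read_fastq` and `phred_scores` are made up for this example, and the demo record is modelled on the second entry of Example.fastq.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from strict four-line records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode ASCII quality characters; offset 33 is an assumption."""
    return [ord(ch) - offset for ch in qual]


demo = io.StringIO(
    "@071112_SLXA-EAS1_s_7:5:1:801:338\n"
    "GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA\n"
    "+\n"
    + "I" * 32 + "6IBI\n"
)
for name, seq, qual in read_fastq(demo):
    print(name, len(seq), phred_scores(qual)[-4:])
```

For real data, an established parser is usually a safer choice because it also handles wrapped records and alternative quality encodings.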
  • SAM (Sequence Alignment/Map): Text format for storing sequence alignments against a reference genome
  • BAM: Binary, compressed version of SAM that can be indexed for faster processing
  • CRAM: Highly compressed reference-based alternative to BAM, designed for long-term storage

FASTA is a simple text-based format for representing nucleotide or peptide sequences. It is typically used to store reference data (such as sequences from curated databases).

Each entry consists of:

  • A description line (starts with ’>’)
  • The sequence data on subsequent lines
Example.fasta
>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
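
Because a sequence may span several lines, a reader has to concatenate every line up to the next description line. The sketch below shows one way to do that; `read_fasta` is a hypothetical helper name and the demo input is invented.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)  # sequence continues on this line
    if name is not None:
        yield name, "".join(chunks)


demo_lines = [">seq1 demo entry", "ACGTAC", "GT", ">seq2", "GGCC"]
for description, sequence in read_fasta(demo_lines):
    print(description, sequence)
```

The function accepts any iterable of lines, so an open file handle can be passed directly.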
  • VCF (Variant Call Format): Text file format for storing genetic sequence variations
  • BCF: Binary version of VCF

Contains information about:

  • Genomic position of variants
  • Reference and alternative alleles
  • Quality scores
  • Filter statuses
  • Additional annotations
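
As an illustration of how these pieces of information sit in a VCF data line, the sketch below splits one tab-separated line into the eight fixed columns. The column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) comes from the VCF specification rather than from the list above, and both `parse_vcf_line` and the demo line are invented for this example.

```python
from typing import Any, Dict

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one VCF data line (not a header)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated columns")
    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, fields))
    record["POS"] = int(record["POS"])  # genomic position
    # Several alternative alleles are separated by commas; '.' means none.
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    # INFO holds additional annotations as key=value pairs or bare flags.
    info: Dict[str, Any] = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info
    return record


# Hypothetical variant line for demonstration only.
print(parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=14;DB"))
```

Per-sample genotype columns, which follow the eight fixed columns when present, are ignored here.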

BED (Browser Extensible Data) is a text format for defining genomic features.

Contains:

  • Chromosome name
  • Start position
  • End position
  • Optional fields (name, score, strand, etc.)

BED files are commonly used for displaying data tracks in genome browsers.

Column number | Title | Definition | Required
1 | chrom | Chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671) name | Yes
2 | chromStart | Start coordinate of the feature on the chromosome or scaffold. Zero-based: the first base on the chromosome is numbered 0 | Yes
3 | chromEnd | End coordinate of the feature on the chromosome or scaffold. This position is not included in the feature (half-open interval), so chromEnd equals the one-based position of the feature's last base | Yes
4 | name | Name of the line in the BED file | No
5 | score | Score between 0 and 1000 | No
6 | strand | DNA strand orientation: positive ("+"), negative ("-"), or "." if no strand | No
7 | thickStart | Start coordinate from which the annotation is drawn thicker in a graphical representation (e.g. the start codon of a gene) | No
8 | thickEnd | End coordinate after which the annotation is no longer drawn thicker in a graphical representation (e.g. the stop codon of a gene) | No
9 | itemRgb | RGB value in the form R,G,B (e.g. 255,0,0) that sets the display color of the annotation | No
10 | blockCount | Number of blocks (e.g. exons) on the line of the BED file | No
11 | blockSizes | Comma-separated list of block sizes (the number of values must match blockCount) | No
12 | blockStarts | Comma-separated list of block start coordinates, relative to chromStart (the number of values must match blockCount) | No
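
The coordinate conventions in the table are easiest to see in code. The sketch below parses one BED line and converts the relative block starts into absolute, zero-based, half-open intervals. It assumes tab-separated fields; `parse_bed_line`, `block_intervals`, and the demo line are illustrative names and data, not part of any library.

```python
from typing import Any, Dict, List, Tuple

OPTIONAL_COLUMNS = [
    "name", "score", "strand", "thickStart", "thickEnd",
    "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line: str) -> Dict[str, Any]:
    """Parse one BED line; only the first three columns are required."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED lines need chrom, chromStart and chromEnd")
    record: Dict[str, Any] = {
        "chrom": fields[0],
        "chromStart": int(fields[1]),  # zero-based, included
        "chromEnd": int(fields[2]),    # not included in the feature
    }
    # Optional columns are kept as strings, in the order of the table.
    record.update(zip(OPTIONAL_COLUMNS, fields[3:]))
    return record


def block_intervals(record: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Return absolute (start, end) pairs for each block, half-open."""
    sizes = [int(v) for v in record["blockSizes"].rstrip(",").split(",")]
    starts = [int(v) for v in record["blockStarts"].rstrip(",").split(",")]
    count = int(record["blockCount"])
    if len(sizes) != count or len(starts) != count:
        raise ValueError("blockSizes/blockStarts must match blockCount")
    base = record["chromStart"]
    return [(base + s, base + s + size) for s, size in zip(starts, sizes)]


# Hypothetical two-exon feature.
bed = parse_bed_line(
    "chr7\t1000\t5000\tgeneA\t960\t+\t1200\t4800\t255,0,0\t2\t500,700\t0,3300"
)
print(bed["chromEnd"] - bed["chromStart"])  # feature length in bases
print(block_intervals(bed))
```

With half-open intervals the feature length is simply chromEnd minus chromStart, with no off-by-one correction.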
  • GFF (General Feature Format): Describes genes and other features of DNA, RNA, and protein sequences
  • GTF (Gene Transfer Format): More specialized version of GFF

Contains:

  • Feature coordinates
  • Feature types
  • Score
  • Strand information
  • Frame
  • Attribute-value pairs

Column number | Title | Definition | Required
1 | seqname | Name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Must be a standard chromosome name or an Ensembl identifier such as a scaffold ID, without additional content like species or assembly | Yes
2 | source | Name of the program that generated this feature, or the data source (database or project name) | Yes
3 | feature | Feature type name, e.g. Gene, Variation, Similarity | Yes
4 | start | Start position of the feature, with sequence numbering starting at 1 | Yes
5 | end | End position of the feature, with sequence numbering starting at 1 | Yes
6 | score | A floating point value | Yes (use '.' if empty)
7 | strand | Defined as + (forward) or - (reverse) | Yes (use '.' if empty)
8 | frame | One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on | Yes (use '.' if empty)
9 | attribute | A semicolon-separated list of tag-value pairs, providing additional information about each feature | Yes
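
The nine columns can be read with a short parser. The sketch below assumes GTF-style attributes, where each tag is followed by a space and a quoted value (GFF3 uses tag=value instead), and it splits naively on semicolons, so values containing a semicolon would break it. `parse_gtf_line` and the demo line, including its identifiers, are hypothetical.

```python
from typing import Any, Dict


def parse_gtf_attributes(text: str) -> Dict[str, str]:
    """Turn 'tag "value"; tag "value";' into a dictionary."""
    attributes: Dict[str, str] = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        tag, _, value = item.partition(" ")
        attributes[tag] = value.strip().strip('"')
    return attributes


def parse_gtf_line(line: str) -> Dict[str, Any]:
    """Parse one tab-separated GTF line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # one-based, inclusive
        "end": int(end),      # one-based, inclusive
        "score": None if score == "." else float(score),
        "strand": None if strand == "." else strand,
        "frame": None if frame == "." else int(frame),
        "attributes": parse_gtf_attributes(attrs),
    }


# Hypothetical annotation line.
row = parse_gtf_line(
    'chr1\tdemo_source\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "G0001"; gene_name "DEMO1";'
)
print(row["end"] - row["start"] + 1, row["attributes"])
```

Because both coordinates are one-based and inclusive, the feature length is end minus start plus one, unlike BED.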

The WIG (wiggle) format is used for displaying continuous-valued data in track format. It is particularly useful for showing expression data, probability scores, and GC percentage.

WIG files can be formatted in two main ways, fixedStep and variableStep:

fixedStep chrom=chrN start=pos step=stepInterval [span=windowSize]
dataValue
dataValue
dataValue

Required fields:

Field | Description
chrom | Chromosome name (e.g., chr1)
start | Starting position
step | Distance between starts of adjacent windows
dataValue | Numerical data value for each position

Optional fields:

Field | Description
span | Size of window (defaults to 1)

variableStep chrom=chrN [span=windowSize]
chromStart dataValue
chromStart dataValue
chromStart dataValue

Required fields:

Field | Description
chrom | Chromosome name
chromStart | Start position of each window
dataValue | Numerical data value for each position

Optional fields:

Field | Description
span | Size of window (defaults to 1)

Example1.wig
fixedStep chrom=chr3 start=400601 step=100 span=100
11.0
22.0
33.0
Example2.wig
variableStep chrom=chr3 span=150
500701 5.0
500801 3.0
500901 8.0
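
To show how the two declarations translate into genomic windows, the sketch below expands WIG lines into (chrom, start, end, value) tuples. It treats positions as one-based with an inclusive end and uses a span of 1 when none is given, following the UCSC description of the format. The function name `parse_wig` is invented for this example.

```python
from typing import Iterable, Iterator, Tuple

Window = Tuple[str, int, int, float]


def parse_wig(lines: Iterable[str]) -> Iterator[Window]:
    """Expand fixedStep and variableStep sections into explicit windows."""
    mode = None
    chrom = ""
    span = 1
    step = 1
    position = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip comments and display settings
        parts = line.split()
        if parts[0] in ("fixedStep", "variableStep"):
            mode = parts[0]
            params = dict(p.split("=", 1) for p in parts[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
        elif mode == "fixedStep":
            # One value per line; the start advances by `step` each time.
            yield chrom, position, position + span - 1, float(parts[0])
            position += step
        elif mode == "variableStep":
            # Each line carries its own start position.
            start = int(parts[0])
            yield chrom, start, start + span - 1, float(parts[1])
        else:
            raise ValueError("data line found before a declaration line")


example1 = [
    "fixedStep chrom=chr3 start=400601 step=100 span=100",
    "11.0", "22.0", "33.0",
]
for window in parse_wig(example1):
    print(window)
```

Applied to Example1.wig, the first window covers positions 400601 through 400700 on chr3 with value 11.0.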
  • BAI: Index format for BAM files
  • CSI: Coordinate-sorted index format

Common command-line tools typically require these index files for region-based queries.

Enables:

  • Random access to compressed files
  • Quick retrieval of alignments
  • Efficient genome browsing
  • bigWig: Binary format of .wig files
  • bigBed: Binary format of .bed files

Advantages:

  • Efficient random access
  • Reduced memory usage
  • Fast display in genome browsers