Reference genome

A reference genome is a genome assembly that represents the complete genetic sequence of an organism as a continuous string of nucleotides (A, T, C, and G). For an assembly to serve as a reference genome, it is typically accompanied by annotations, produced through a process known as DNA annotation or genome annotation. The annotations specify the genomic coordinates (start and end positions) of features such as genes, exons, introns, and mRNAs, and are often paired with the corresponding transcript (mRNA) and protein sequences, which may be algorithmically predicted or experimentally validated.
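As a rough illustration of how annotation coordinates relate to the underlying sequence, the sketch below pulls the sequence of an annotated feature out of a reference. It assumes 1-based, inclusive coordinates and a strand field, as used by the GFF3 format; the text above does not prescribe a convention. The `Feature` class, the `extract` helper, and the toy `chr1` sequence are hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Complement table for DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass
class Feature:
    """A hypothetical, minimal annotation record."""
    seqid: str   # name of the reference sequence, e.g. a chromosome
    kind: str    # feature type, e.g. "gene" or "exon"
    start: int   # 1-based, inclusive (assumed GFF3-style convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(reference: dict, feature: Feature) -> str:
    """Return the sequence of a feature, read along its own strand."""
    chromosome = reference[feature.seqid]
    # Convert 1-based inclusive coordinates to a 0-based, end-exclusive slice.
    subsequence = chromosome[feature.start - 1:feature.end]
    if feature.strand == "-":
        return reverse_complement(subsequence)
    return subsequence


# Toy reference with one short "chromosome" (made-up sequence).
reference = {"chr1": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"}
exon_fwd = Feature("chr1", "exon", 1, 6, "+")
exon_rev = Feature("chr1", "exon", 4, 9, "-")

print(extract(reference, exon_fwd))  # ATGGCC
print(extract(reference, exon_rev))  # AATGGC
```

Other annotation formats use different conventions (BED, for example, uses 0-based, end-exclusive coordinates), so the index arithmetic in `extract` would need to be adapted to the format at hand.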

Reference genomes exist for a wide variety of species, including viruses, bacteria, fungi, plants, and animals, and they differ in how they are constructed and represented. A reference may be derived from a single individual or from multiple individuals whose sequences are collapsed into a single representative assembly (haplotype). Two main factors determine the quality of a reference genome assembly: the sequencing technology, which affects sequence accuracy, and the assembly level, which indicates how complete the genome representation is.

The ideal is a chromosome-level assembly, which provides a complete DNA sequence for each chromosome with no unplaced segments. However, achieving this remains technically challenging, especially for genomes that are large or dense in repetitive elements. Earlier sequencing technologies often produced assemblies at the contig level (short contiguous sequences) or the scaffold level (ordered sets of contigs), with limited chromosomal context. The size of these fragments depends on the sequencing platform and bioinformatic methods available at the time.

For assemblies that are not fully resolved, summary statistics such as N50 and L50 are commonly used to characterise contiguity and assembly fragmentation; these metrics are explained in the Contigs and Scaffolds section.
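A minimal sketch of how these two statistics can be computed from a list of contig or scaffold lengths is given below. It uses the common definition in which N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length, and L50 is the number of sequences in that set. The function name `n50_l50` and the example lengths are made up for illustration.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")

    # Walk through the sequences from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2

    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        # The first sequence that pushes the running total to at least half
        # of the assembly defines N50; the sequences used so far define L50.
        if running_total >= half_total:
            return length, count


# Illustrative contig lengths in base pairs (total 290, half 145).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50_l50(contig_lengths))  # (70, 2)
```

A larger N50 together with a smaller L50 indicates a more contiguous assembly. Neither value says anything about whether the sequences were assembled correctly.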

Reference genomes are central to omics research, particularly genomics. They provide a common sequence against which DNA sequence data from many individuals can be "mapped", enabling efficient identification of the genomic location of these sequences and the detection of polymorphisms (sequence differences among individuals) through a process known as variant calling.
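The idea behind mapping and variant calling can be sketched with a deliberately simplified toy example. It assumes that each read has already been placed at a known 0-based position on the reference and that reads contain only substitutions, with no insertions or deletions. Real pipelines use dedicated aligners and statistical variant callers. The function names and the `min_depth` and `min_fraction` thresholds are hypothetical choices made for this illustration.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Count the bases observed at each reference position.

    `reads` is a list of (start, sequence) pairs, where `start` is the
    0-based reference position at which the read was mapped.
    """
    columns = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns


def call_snvs(reference, reads, min_depth=2, min_fraction=0.8):
    """Report positions where most reads disagree with the reference.

    Returns (1-based position, reference base, alternate base) tuples.
    The thresholds are arbitrary illustrative values.
    """
    calls = []
    for position, counts in sorted(pileup(reads).items()):
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if (depth >= min_depth
                and base != reference[position]
                and support / depth >= min_fraction):
            calls.append((position + 1, reference[position], base))
    return calls


reference = "ACGTACGTAC"
# Three overlapping reads that all carry a T where the reference has an A
# at the fifth base.
reads = [(0, "ACGTTCG"), (2, "GTTCGTA"), (3, "TTCGTAC")]

print(call_snvs(reference, reads))  # [(5, 'A', 'T')]
```

Requiring a minimum depth and a minimum supporting fraction is one simple way to avoid reporting isolated sequencing errors as polymorphisms. Production variant callers use probabilistic models instead of fixed cut-offs.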

The limitations of mapping to a single reference genome, such as reference bias and under-representation of population diversity, have led to the development of population-level reference sets and pangenomes.

Reference genomes and their annotations are publicly accessible through online genome browsers and archives such as Ensembl, the European Nucleotide Archive (ENA) at EMBL-EBI, the UCSC Genome Browser, and NCBI.