Major International Sequencing Projects
DISCLAIMER: Generated 2025-11-14 by ChatGPT 5.1 Deep Research; content and links validated by Gemini 2.5 Flash.
Intended as a starting reading list and reference. Source sites are cited in parentheses; follow them for supporting text.
Human Genome Project (HGP)
Years active: 1990–2003 (wi.mit.edu).
Scale: Sequenced the ~3 billion base pairs of the human genome (covering ~99% of gene-containing regions with 99.99% accuracy) at an estimated cost of $2.7 billion (wi.mit.edu). The international consortium involved 20 sequencing centers worldwide (wi.mit.edu).
Tech: Employed automated Sanger dideoxy sequencing of cloned DNA fragments (a hierarchical shotgun approach) to assemble the first human reference genome. Improvements in sequencing technology and computing were integral to the project (britannica.com).
Goals: Determine the complete sequence of human DNA and identify all human genes, while improving sequencing technologies and addressing the ethical, legal, and social implications (ELSI) of genomics (britannica.com). Ancillary goals included mapping model organism genomes and developing data standards.
Impact: Produced the first human reference genome, a foundational resource that “opened new avenues of discovery” in biology and medicine (britannica.com). The HGP’s data enabled countless advances, from pinpointing disease genes to launching the field of genomics. It also fostered international collaboration and data-sharing norms (e.g. the 1996 “Bermuda Principles” for rapid data release (wi.mit.edu)). Economically, the project has been credited with a significant return on investment through biotech innovation (doe-humangenomeproject.ornl.gov).
Key Publications: The first draft sequence was published in 2001 (Nature and Science), followed by the “finished” human genome in 2003. Notably, the International Human Genome Sequencing Consortium’s 2004 paper detailed the completed sequence (99% of euchromatic DNA) (wi.mit.edu). (The remaining tiny gaps were finally closed in 2022 by the T2T Consortium.)
Project Website: The NIH’s Genome.gov HGP page provides archival information and links (britannica.com).
Resource Manifest: All human sequence data were deposited in public databases (GenBank/EMBL/DDBJ) with no usage restrictions, updated daily during the project (wi.mit.edu). Physical reagents (clones, cell lines) were made available to researchers via repositories.
Data Accessibility: Open access. HGP data were released rapidly and freely worldwide (wi.mit.edu). The consortium’s commitment to open science set a precedent – e.g. sequence assemblies ≥2 kb were released immediately (per the Bermuda Principles) (wi.mit.edu). Browsers like the UCSC Genome Browser and Ensembl were created to visualize the HGP data.
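As a small illustration of how the HGP-derived reference is consumed programmatically today, here is a minimal sketch that fetches a slice of the human reference sequence through Ensembl’s public REST API (rest.ensembl.org). The endpoint is real; the coordinates are arbitrary, and the snippet is illustrative rather than an official HGP tool.

```python
import requests

# Fetch ~100 bp of the human reference genome (GRCh38) from Ensembl's
# public REST API. The region chosen here is arbitrary.
SERVER = "https://rest.ensembl.org"
REGION = "7:140753300..140753400"   # chromosome:start..end

resp = requests.get(
    f"{SERVER}/sequence/region/human/{REGION}",
    headers={"Content-Type": "text/plain"},  # ask for raw sequence text
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # the reference bases for the requested interval
```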
Major Sequencing Centers: Five centers produced ~85% of the sequence: the Wellcome Trust Sanger Institute (UK), Washington University (USA), the Whitehead Institute/MIT Center for Genome Research (now the Broad Institute, USA), Baylor College of Medicine (USA), and the DOE Joint Genome Institute (USA) (plato.stanford.edu). These, along with other labs in France, Germany, Japan, China, and elsewhere, formed the international consortium (wi.mit.edu). The Sanger Institute alone contributed about one-third of the total sequence (plato.stanford.edu).
International HapMap Project
Years active: 2002–2010 (en.wikipedia.org). (Launched October 2002; Phase I HapMap published 2005, Phase II in 2007, final Phase III results in 2010 (en.wikipedia.org).)
Scale: Genotyped over 1 million common SNPs in 269 people from 4 populations in Phase I (rising to ~3.1 million SNPs in Phase II), expanding to 11 populations in Phase III (en.wikipedia.org). The final HapMap release catalogued common genetic variants (>1% frequency) across diverse global populations. Phase III analyzed 1,184 samples across those 11 groups (en.wikipedia.org).
Tech: Employed high-throughput SNP genotyping arrays and targeted sequencing. Ten genotyping centers across the US, Canada, China, Japan, and the UK used multiple platforms (Illumina, Affymetrix, Perlegen, etc.) to genotype millions of SNPs (en.wikipedia.org). A large-scale re-sequencing effort discovered >10 million candidate SNPs by 2006 (en.wikipedia.org), which informed array designs.
Goals: Develop a genome-wide haplotype map of human DNA sequence variation to facilitate finding genetic variants influencing health, disease, and drug response (en.wikipedia.org). By identifying blocks of SNPs inherited together (haplotypes), researchers could use tag SNPs to capture most variation efficiently (genome.gov). The project’s aim was to chart common variation to enable genome-wide association studies of complex diseases (en.wikipedia.org).
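To make the tag-SNP idea concrete, the toy calculation below computes the standard linkage-disequilibrium statistic r² between two SNPs from phased haplotypes; when r² is high, genotyping one SNP effectively captures the other. The haplotype data are invented for illustration.

```python
from collections import Counter

# Each string is one phased haplotype: allele at SNP1, then allele at SNP2.
haplotypes = ["AG", "AG", "AG", "AT", "CG", "CT", "CT", "CT", "CT", "AT"]

n = len(haplotypes)
p_a = sum(h[0] == "A" for h in haplotypes) / n   # freq of allele A at SNP1
p_g = sum(h[1] == "G" for h in haplotypes) / n   # freq of allele G at SNP2
p_ag = Counter(haplotypes)["AG"] / n             # freq of the A-G haplotype

d = p_ag - p_a * p_g                             # disequilibrium coefficient D
r2 = d**2 / (p_a * (1 - p_a) * p_g * (1 - p_g))  # squared correlation
print(f"D = {d:.3f}, r^2 = {r2:.3f}")            # r^2 near 1 => good tag SNP
```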
Impact: The HapMap was a landmark resource that “revolutionized human genetic studies” (broadinstitute.org). It enabled the first wave of successful GWAS, yielding hundreds of gene loci for common diseases by the late 2000s (broadinstitute.org). It demonstrated the haplotype structure of the human genome and validated the strategy of using tag SNPs for association studies, dramatically reducing genotyping costs (genome.gov; en.wikipedia.org). The HapMap also catalyzed international collaboration and trained genomics researchers in several countries (e.g. Nigeria, China).
Key Publications: The International HapMap Consortium’s Phase I paper (Nature, Oct 2005) reported the first haplotype map (en.wikipedia.org). Phase II results appeared in Nature (Oct 2007). The final Phase III analysis, integrating common and rare variants across 11 populations, was published in Nature in 2010 (en.wikipedia.org; researchgate.net). These publications provided maps of millions of SNPs and extensive analyses of linkage disequilibrium and population structure.
Project Website: The HapMap Data Portal (hosted at NCBI, hapmap.ncbi.nlm.nih.gov) provided project data, tools, and participant information (genome.gov). (Now archived; data are available through dbSNP and other archives.)
Resource Manifest: All HapMap genotype data and SNP lists were made public. The project generated a reference panel of common SNP haplotypes – the phased genotype data are available for download, and the SNPs are in dbSNP. Cell lines from sampled individuals (270 in Phase I) are available via the Coriell repository (en.wikipedia.org).
Data Accessibility: Open access. HapMap data were freely available without embargo. The consortium released genotype data and haplotype frequencies on the public website and via dbSNP as soon as they were validated (genome.gov). Researchers worldwide could use the HapMap database for GWAS and for imputation of untyped variants.
Major Sequencing Centers: Genotyping for HapMap was performed at 10 centers, including the Broad Institute (USA), Baylor College of Medicine (USA), Illumina Inc. (USA), the University of Tokyo (Japan), BGI Shenzhen led by Dr. Huanming Yang (China), McGill University (Canada), the Sanger Institute (UK), and UCSF (USA) (en.wikipedia.org). Each group analyzed assigned chromosome regions or subsets of SNPs. The project was truly international, with participating groups in Canada, China, Japan, Nigeria, the UK, and the USA (en.wikipedia.org). Coordination and data integration were led by NIH and Wellcome Trust scientists.
1000 Genomes Project
Years active: 2008–2015 (en.wikipedia.org). (International planning started in 2007; the full project ran from 2008 through the final publications in 2015.)
Scale: Whole-genome sequenced 2,504 individuals from 26 populations worldwide (internationalgenome.org), producing a deep catalogue of human genetic variation. The final Phase 3 data (released 2013) identified ~88 million variants (84.7M SNPs, 3.6M short indels, 60k structural variants) across those genomes (internationalgenome.org). The project generated >200 terabases of sequence data, including low-coverage WGS (~4× per genome), higher-coverage exomes (2,662 samples), and some high-coverage genomes (internationalgenome.org).
Tech: Took advantage of new next-generation sequencing (NGS) technologies (e.g. Illumina short-read sequencers) that dramatically lowered costs (internationalgenome.org). Used a low-coverage sequencing strategy (~4× per genome) plus targeted deep exome sequencing, combining data across samples to detect variants down to ~1% frequency (internationalgenome.org). Pilot studies in 2008 compared low-coverage WGS against high-coverage sequencing of trios (internationalgenome.org). The project established methods for multi-center NGS data production and joint variant calling on thousands of genomes.
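The arithmetic behind the low-coverage design can be sketched with a simple binomial model: a heterozygous site in a single 4× genome is missed fairly often, but a variant carried by several cohort members is almost never missed everywhere. The model below (fixed depth, alt reads ~ Binomial(depth, 0.5)) is a deliberate simplification of what the project’s joint callers actually did.

```python
from math import comb

def p_alt_reads_at_least(k, depth, het_p=0.5):
    """P(>= k alt-supporting reads) at a heterozygous site with fixed depth."""
    return sum(comb(depth, i) * het_p**i * (1 - het_p)**(depth - i)
               for i in range(k, depth + 1))

DEPTH = 4
print(f"One 4x genome, >=2 alt reads: {p_alt_reads_at_least(2, DEPTH):.2f}")

# Pooling across carriers: P(variant seen in at least one of m carriers).
p_seen_once = p_alt_reads_at_least(1, DEPTH)
for m in (1, 5, 10):
    print(f"{m} carrier(s): {1 - (1 - p_seen_once) ** m:.4f}")
```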
Goals: Create a public reference database of human genetic variation by sequencing at least 1,000 diverse individuals (broadinstitute.org). Specifically, find the majority of variants with ≥1% allele frequency in each of multiple continental populations (internationalgenome.org). This high-resolution map of SNPs, indels, and structural variants would improve genotype imputation and empower disease studies across global populations (broadinstitute.org). In essence, 1000 Genomes aimed to build on the HapMap by capturing the lower-frequency and structural variants that HapMap’s arrays missed (broadinstitute.org).
Impact: Provided the first comprehensive, open-access resource on human genetic variation at population scale. The 1000 Genomes Project is cited as a foundational reference panel for human genetics (broadinstitute.org). It enabled researchers to impute millions of untyped variants into GWAS, greatly increasing the power to find disease associations (broadinstitute.org). The project discovered millions of new variants (especially rare variants) and demonstrated extensive human genetic diversity, informing studies of selection and population history (nature.com). Its multi-ancestry data have been critical for ensuring genetic studies benefit diverse populations (en.wikipedia.org). Overall, 1000 Genomes pushed forward standards for big-data sharing in genomics and spurred the development of new analysis tools.
Key Publications: Notable are the consortium’s flagship papers in Nature: the Pilot Project analysis (2010) (internationalgenome.org), the Phase 1 paper “An integrated map of genetic variation from 1,092 human genomes” (Nov 2012) (internationalgenome.org), and the Phase 3 final papers (Oct 2015) – “A global reference for human genetic variation” plus a companion on structural variants (internationalgenome.org). These articles detailed the project’s methodology and key findings, such as the allele frequency spectrum and recombination patterns (en.wikipedia.org).
Project Website: The International Genome Sample Resource (IGSR) maintains the 1000 Genomes data portal (internationalgenome.org). (The original site, 1000genomes.org, now redirects to IGSR.) The portal provides data downloads, analysis results, and sample information for all phases.
Resource Manifest: All sequencing reads and variant calls are publicly available. Data can be browsed via the IGSR Data Portal or downloaded from the EBI FTP site (internationalgenome.org). The project established a sequence index listing every file (internationalgenome.org) and released periodic data freezes (the final freeze in May 2013) (internationalgenome.org). Cell lines and DNA samples are available from the Coriell Institute (internationalgenome.org). Variant call format (VCF) files for all releases are on public archives, and the data are integrated into Ensembl, dbSNP, the UCSC Genome Browser, and other resources.
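Since the release files are plain VCF, basic analyses need no special tooling. The sketch below (pure standard library; the filename is a placeholder) walks an uncompressed VCF and derives each record’s alternate-allele frequency from the per-sample GT fields.

```python
def alt_allele_freqs(vcf_path):
    """Yield (chrom, pos, ref, alt, alt-allele frequency) per VCF record."""
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):                     # header lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, _, ref, alt = fields[:5]
            gts = (s.split(":")[0] for s in fields[9:])  # GT is 1st subfield
            alleles = [a for gt in gts
                       for a in gt.replace("|", "/").split("/") if a != "."]
            if alleles:
                af = sum(a != "0" for a in alleles) / len(alleles)
                yield chrom, int(pos), ref, alt, af

# Example (placeholder filename):
# for rec in alt_allele_freqs("ALL.chr22.phase3.vcf"):
#     print(*rec)
```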
Data Accessibility: Open access. All 1000 Genomes data were released openly, without embargo or usage restrictions (internationalgenome.org). The project adhered to rapid data release principles, making data available to researchers worldwide via public databases (ENA/NCBI); use of the data requires only citation of the source (internationalgenome.org). Because samples came from consented, anonymous donors (with no phenotype data), the data could be fully open.
Major Sequencing Centers: A collaborative effort led by the Wellcome Sanger Institute (UK), the Beijing Genomics Institute (BGI, China), and the Broad Institute (USA) – the key funders and data producers at the project’s launch (broadinstitute.org). Other major contributors included the US National Institutes of Health (NHGRI) and the large-scale sequencing centers at Washington University and Baylor College of Medicine. The Broad Institute coordinated data production (with David Altshuler as co-chair) (broadinstitute.org). The project brought together dozens of institutions; research teams from China, the UK, the USA, Kenya, Nigeria, Japan, Peru, and many other countries participated (en.wikipedia.org), reflecting its international scope.
The Cancer Genome Atlas (TCGA)
Years active: 2006–2018. (Launched as a pilot in 2006; expanded in 2009; initial genomic data generation completed by 2014 (en.wikipedia.org), with the final integrative analyses – the Pan-Cancer Atlas – published in 2018 (cancer.gov).)
Scale: Molecularly characterized >11,000 tumor cases across 33 cancer types, generating 2.5 petabytes of sequence and other omics data (cancer.gov). For each case, TCGA produced multi-platform data: genomic DNA sequencing (exome for all, whole-genome for ~10%), RNA sequencing, SNP arrays, DNA methylation arrays, miRNA sequencing, and clinical metadata (en.wikipedia.org). The project created a public repository now hosting 2.5 PB of genomic, epigenomic, transcriptomic, and proteomic data from these cancers (cancer.gov). TCGA’s target was roughly 500 patient samples per cancer type (in many cases, tumor/normal pairs) across the 33 types (en.wikipedia.org).
Tech: Employed a multi-omics approach. The pilot phase used array-based technologies (Affymetrix SNP arrays, Agilent expression arrays, CpG methylation arrays) and capillary sequencing of targeted gene sets (en.wikipedia.org). As NGS costs fell, TCGA shifted to whole-exome sequencing (WES) for all tumors and whole-genome sequencing (WGS) for ~10% of cases, as well as RNA-seq for transcriptomes (en.wikipedia.org). Standardized sample processing and rigorous quality control were crucial given the multiple data types. Biospecimen collection was centralized, and sequencing was performed by dedicated Genome Sequencing Centers using Illumina and ABI platforms (moving to Illumina HiSeq for most WES/WGS by 2011) (cancer.gov).
Goals: To apply high-throughput genome analysis to improve our ability to diagnose, treat, and prevent cancer by comprehensively cataloguing the genomic alterations in many cancer types (en.wikipedia.org). The pilot (2006–2009) aimed to establish infrastructure and confirm feasibility by characterizing 3 cancer types (brain, lung, ovarian) (en.wikipedia.org). The full project then sought to “complete the genomic characterization” of ~20–25 cancers (en.wikipedia.org) (ultimately 33), including identifying all major somatic mutations, copy number changes, gene expression shifts, and more. The overarching goal was to create a rich public dataset that cancer researchers could mine to discover new drivers, subtypes, and therapeutic targets (en.wikipedia.org).
Impact: TCGA has profoundly influenced cancer biology. Its multidimensional maps of cancer genomes enabled the discovery of new cancer subtypes and driver genes across many tumor types (blogs.bcm.edu). For example, TCGA data showed that cancers are better classified by molecular characteristics than by anatomy – some breast and ovarian cancers share molecular profiles (cancer.gov), suggesting common treatments (blogs.bcm.edu). The Pan-Cancer Atlas synthesis (2018) redefined the understanding of cancer pathways and interactions across tumor types (blogs.bcm.edu). Clinically, TCGA paved the way for precision oncology by identifying biomarkers (e.g. IDH1 mutations in glioma, subtypes of endometrial cancer) now used in diagnosis and trials (blogs.bcm.edu). It also generated new data-analysis tools and a culture of open data in oncology. In sum, TCGA’s comprehensive “atlas” of genomic changes has become an essential reference for cancer research, accelerating discoveries and drug development (blogs.bcm.edu).
Key Publications: TCGA published many landmark studies, often “marker papers” in Nature or Cell for each cancer. Notable are the 2008 Nature paper on glioblastoma (the first TCGA pilot report) (cancer.gov) and subsequent integrative papers on ovarian cancer (2011, Nature), colorectal cancer (2012, Nature), breast cancer (2012, Nature), and lung squamous cell carcinoma (2012, Nature) (cancer.gov). In 2018, the consortium released a Cell and Cell Reports series (the “Pan-Cancer Atlas”) – 27 papers analyzing all 33 tumor types together, synthesizing a decade of TCGA data (blogs.bcm.edu). These publications provided a compendium of the molecular taxonomy of cancer.
Project Website: The NCI’s TCGA Data Portal (now part of the Genomic Data Commons, GDC) was the main site. cancergenome.nih.gov (legacy) and the GDC portal host data and analysis tools. For a project overview, the NCI’s TCGA program page (cancer.gov) and its publications list are useful.
Resource Manifest: TCGA generated multiple data types for each tumor: DNA sequence (whole exome/genome), RNA expression, miRNA, SNP genotypes, DNA methylation, and more. All raw data (FASTQs, BAMs) and processed results (mutation calls, segmented copy number, expression matrices) are available through the GDC. A marker paper and “analysis working group” publications exist for each cancer, summarizing key findings (cancer.gov). The project’s final Pan-Cancer Atlas release provides an integrated resource across cancers. TCGA also established a Biospecimen Core Resource to distribute tissue materials and a Data Coordinating Center (at UCSC, then OHSU) to organize data (en.wikipedia.org).
Data Accessibility: Partially open/controlled. Summary results (e.g. gene-level mutations, anonymized profiles) are open-access on the GDC Data Portal (cancer.gov). However, individual-level genomic data that could identify patients (e.g. raw sequencing reads, germline variants, clinical data) sit in a controlled-access tier; qualified researchers can apply (via dbGaP/Data Use Committees) for access to these protected data. This two-tier model follows TCGA’s data policy balancing rapid data release with patient privacy (cancer.gov). As of 2016, TCGA data were merged into the broader Genomic Data Commons, which provides cloud-based access and computing.
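The open tier is also scriptable: the GDC exposes a public JSON API at api.gdc.cancer.gov. The sketch below queries the /projects endpoint for TCGA projects; the endpoint and filter syntax follow the public GDC API documentation, though the exact fields chosen here are just one example.

```python
import json
import requests

# List a few TCGA projects via the NCI Genomic Data Commons open API.
resp = requests.get(
    "https://api.gdc.cancer.gov/projects",
    params={
        "filters": json.dumps(
            {"op": "=", "content": {"field": "program.name", "value": "TCGA"}}
        ),
        "fields": "project_id,name,primary_site",
        "size": "5",
        "format": "json",
    },
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["data"]["hits"]:
    print(hit["project_id"], "-", hit.get("name"))
```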
Major Sequencing Centers: Three large Genome Sequencing Centers funded by TCGA did most of the sequencing: the Broad Institute (Cambridge, MA), the Washington University Genome Institute (St. Louis), and the Baylor College of Medicine Human Genome Sequencing Center (Houston) (en.wikipedia.org). These centers collectively generated about 80–90% of TCGA’s sequence data (en.wikipedia.org) and coordinated on pipeline standards (e.g. converging on Illumina platforms as the technology matured). Additionally, TCGA had multiple Genome Characterization Centers for other assays (e.g. UNC for expression, Johns Hopkins for epigenetics) and Genome Data Analysis Centers (institutions such as Memorial Sloan Kettering and the Institute for Systems Biology that led data integration) (en.wikipedia.org). The project office at NCI/NHGRI oversaw all components (en.wikipedia.org). (In total, over 150 institutions contributed.)
International Cancer Genome Consortium (ICGC)
Years active: 2008–2018 (initial phase). (Launched in 2008 with a 10-year goal; by 2018 it had met its primary objectives and transitioned to a new phase (dcc.icgc.org). A follow-on, ICGC-ARGO, continues from 2019 onward.)
Scale: Coordinated more than 50 cancer genome projects across the globe. ICGC’s initial goal was to generate reference genomes for 25,000 tumors spanning at least 50 tumor types/subtypes (en.wikipedia.org). By 2018, the consortium had produced >3,400 high-quality cancer genome sequences across a wide range of cancer subtypes (sanger.ac.uk). Specifically, by the end of 2017, ICGC members had released 1,664 tumor genomes in peer-reviewed papers and contributed another ~1,800 to public repositories, totaling 3,465 genomes (many studies sequenced both tumor and matched normal) (sanger.ac.uk). Each ICGC project typically aimed to sequence ~500 tumors of a given type. The consortium sought to cover all major cancers – breast, prostate, liver, brain, blood, and more – and as of 2018 had at least some data for 33 primary sites (many overlapping with TCGA) (en.wikipedia.org).
Tech: Employed whole-genome sequencing (WGS) and/or whole-exome sequencing of tumors and matched normals, often at high coverage (30× or deeper), to catalog somatic mutations. Many ICGC projects also generated transcriptomes (RNA-seq) and other molecular data, but the hallmark was whole-genome sequencing of cancers. The consortium set standards for sequence quality and uniform analysis pipelines so data could be compared across projects (en.wikipedia.org); members adopted uniform consent and data release policies and agreed on common file formats and annotation methods. Technologically, the project coincided with the rapid evolution of NGS in the 2010s – moving from the Illumina GAIIx to the HiSeq X and NovaSeq – which enabled genome sequencing at scale.
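The core logic of tumor/normal somatic calling can be caricatured in a few lines: demand clear alternate-allele support in the tumor while requiring that any alt reads in the matched normal be explainable as sequencing error. The thresholds and error model below are illustrative only and do not reproduce any production pipeline.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): upper tail used as a noise test."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def looks_somatic(t_alt, t_depth, n_alt, n_depth,
                  min_t_vaf=0.10, err_rate=0.01, alpha=0.05):
    t_vaf = t_alt / t_depth
    # Alt reads in the normal should be consistent with sequencing error;
    # a tiny tail probability suggests real germline or contaminating signal.
    p_noise = binom_sf(n_alt, n_depth, err_rate)
    return t_vaf >= min_t_vaf and p_noise > alpha

print(looks_somatic(t_alt=12, t_depth=60, n_alt=0, n_depth=30))  # True
print(looks_somatic(t_alt=12, t_depth=60, n_alt=6, n_depth=30))  # False
```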
Goals: To generate a comprehensive catalogue of the genomic abnormalities in all main types of human cancer (en.wikipedia.org). The vision was an international “moonshot” akin to the HGP, but for cancer genomes: roughly 50 projects, each sequencing ~500 tumors, covering 50 cancer types (en.wikipedia.org). ICGC aimed to identify all recurrent mutations (in genes and regulatory regions), mutational signatures, and pathways involved in cancer by aggregating data across many studies (en.wikipedia.org). A key goal was to share data rapidly so the global scientific community could analyze it, and to ensure standardized, high-quality genomes were produced for diverse cancer types (avoiding duplication of effort). Essentially, ICGC coordinated what would otherwise have been disparate national projects into a cohesive effort, maximizing discovery and minimizing redundancy (en.wikipedia.org).
Impact: ICGC greatly expanded the scope of cancer genomics beyond what any single country could do and fostered data sharing for tens of thousands of cancer genomes. Its data led to the discovery of numerous new cancer driver genes and mutational patterns across different populations (en.wikipedia.org). For instance, ICGC studies identified novel drivers in liver cancer (Japan), gastric cancer (China), and pediatric brain tumors (Germany), and revealed environmental carcinogen signatures (e.g. aflatoxin mutations in liver tumors) by comparing across cohorts. The consortium also pioneered large-scale pan-cancer analyses in collaboration with TCGA – culminating in the PCAWG project, which compared ~2,658 whole genomes across dozens of cancers to reveal commonalities and differences (published in Nature, 2020). Moreover, ICGC advanced international norms for ethical data governance, demonstrated the feasibility of sharing genomic data across borders, and built capacity – e.g. funding local sequencing for member projects and training scientists. Its legacy is a massive dataset now foundational for cancer genomics, fueling research into cancer evolution, precision oncology, and comparative oncology (sanger.ac.uk).
Key Publications: Each ICGC project published its findings in journals – and the Pan-Cancer Analysis of Whole Genomes (PCAWG) marker papers (a special collection in Nature, 2020), a collaboration of ICGC and TCGA, analyzed whole genomes across 38 tumor types. Earlier flagship papers included pancreatic cancer (Nature 2012, Australian ICGC), liver cancer (Cell 2013, Japanese ICGC), and medulloblastoma (Nature 2012, German ICGC). In 2018, ICGC published a summary in Nature outlining the project’s pan-cancer conclusions and announcing the shift to clinical implementation (ICGCmed). Notable overviews include the ICGC design paper in Nature (2010) (dkfz.de) and the 2020 Nature article on PCAWG that integrated ICGC/TCGA data.
Project Website: icgc.org – with the ICGC Data Portal – was the primary site. It listed all member projects and offered a data browser and downloads for available genomes (academic.oup.com). (As of 2024, the original ICGC portal has been retired (dcc.icgc.org); data have moved to the ICGC ARGO Data Platform and PCAWG repositories.) Publications and policy documents are also on the site.
Resource Manifest: ICGC established a federated data system. Member projects deposited analysis results (e.g. mutation calls, curated data) into the ICGC Data Portal database, while raw sequencing data often resided in EGA or dbGaP due to patient privacy. By the project’s end, the portal provided simple somatic mutation files for >18,000 tumors from 27 primary sites (dkfz.de; goldenhelix.com). ICGC’s data coordination enabled integrative analyses – for example, the PCAWG effort compiled 800+ terabytes of aligned genomes from many projects for joint analysis. The consortium released data to authorized users on a quarterly basis.
Data Accessibility: Controlled access (with open summary data). Given the clinical nature of cancer genomes, ICGC set up a managed access system. Aggregated somatic mutation catalogs were made openly available (e.g. via the portal or publications) (goldenhelix.com), but individual-level genomic data required authorization: researchers applied to the ICGC Data Access Compliance Office for permission to download raw sequences or genotype data. This controlled-access model was analogous to TCGA’s, and the two projects harmonized their policies by 2010 (cancer.gov). Today, legacy ICGC data (including PCAWG) can be accessed through the GA4GH Data Repository or cloud platforms by approved users.
Major Sequencing Centers: ICGC was a truly global consortium, with each participating country funding its own sequencing. Major contributors included the Wellcome Sanger Institute (which led the UK’s contributions, e.g. breast, prostate, and oesophageal cancers) (sanger.ac.uk), the Chinese Cancer Genome Consortium (gastric and liver cancers) (en.wikipedia.org), the National Cancer Center Japan (liver and haematopoietic cancers) (en.wikipedia.org), the US National Cancer Institute (with many TCGA overlaps), Canada’s OICR in Toronto (which served as the ICGC Secretariat and Data Coordination Centre) (en.wikipedia.org), and Germany’s DKFZ (pediatric brain tumors), among many others. By 2025, ICGC’s network included over 2,200 scientists across 88 countries (sanger.ac.uk). Sequencing was distributed: the Australian project sequenced pancreatic cancers in Melbourne, Brazil ran a liver cancer project, India did oral cancer, and so on (en.wikipedia.org). Notably, even though sequencing was done locally, the data were shared globally.
ENCODE (Encyclopedia of DNA Elements)
Years active: 2003–present. (Started as a pilot in September 2003 (genome.gov; ngdc.cncb.ac.cn); major phases ran 2007–2012 and 2012–2017; currently in Phase 4, underway since 2017 (genome.gov).)
Scale: Produced an enormous volume of functional genomic data on human and model organisms. By the completion of ENCODE Phase 3 (around 2020), ENCODE had mapped millions of candidate functional elements across >1,000 experiments. For human alone, the project generated 1,640 datasets in Phase 2 (covering 147 cell types) (nature.com) and by 2020 had catalogued ~926,000 candidate human cis-regulatory elements (cCREs) in hundreds of cell/tissue types (genome.gov). The data span hundreds of cell lines and tissues – including transcription factor binding profiles (>700 ChIP-seq datasets), open chromatin maps (DNase/ATAC-seq), histone modification maps, and RNA transcripts (long and small RNAs) (nature.com). ENCODE also encompassed the model-organism modENCODE projects for Drosophila and C. elegans and a parallel Mouse ENCODE, adding thousands more datasets (genome.gov). In summary, ENCODE has profiled hundreds of millions of sequencing reads across >1,000 experiments, finding some biochemical signal over roughly 80% of the genome (nature.com). (As of 2023, the ENCODE Portal lists >32,000 data files across all phases.)
Tech: Utilized a suite of functional genomic assays, primarily based on next-generation sequencing. Key technologies included ChIP-seq (to map transcription factor binding and histone marks), DNase-seq/ATAC-seq (open chromatin), RNA-seq (total and poly(A)+ transcripts), CAGE and RAMPAGE (transcription start sites), and Hi-C/ChIA-PET (3D chromatin looping, in later phases) (nature.com). Phase 1 (the pilot) focused on comparing methods on 1% of the genome (genome.gov); Phases 2 and 3 scaled to the whole genome by leveraging high-throughput sequencing (mostly Illumina). Data production was divided among specialized centers – e.g. one group might perform all transcription factor ChIP-seq, another all histone mark profiling (med.stanford.edu; britannica.com). The project also invested heavily in computational pipelines for uniform processing (e.g. standardized peak calling and data QC) and an integrated data portal to organize all results (genome.gov).
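ENCODE’s processed outputs are mostly simple tab-delimited files (e.g. narrowPeak, a BED6+4 format used for peak calls), so quick summaries are easy to script. A minimal sketch, with a placeholder filename in place of a real portal accession:

```python
from collections import Counter

def peaks_per_chrom(path):
    """Count peak calls per chromosome in a narrowPeak/BED file."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            if line.startswith(("#", "track", "browser")):
                continue                      # skip comment/track lines
            counts[line.split("\t", 1)[0]] += 1
    return counts

# Example (placeholder filename):
# for chrom, n in sorted(peaks_per_chrom("example.narrowPeak").items()):
#     print(chrom, n)
```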
Goals: To identify all functional elements in the human genome sequence (nature.com). ENCODE’s operational definition of functional elements included protein-coding genes, non-coding RNA genes, regulatory DNA elements (promoters, enhancers), and other DNA elements with biochemical activity (e.g. bound by proteins or marked by chromatin features) (nature.com). Essentially, if the Human Genome Project gave the “letters” of the genome, ENCODE aimed to annotate the “grammar” – mapping which sequences do what. The pilot phase (ENCODE 1) set out to evaluate strategies for finding functional DNA in 1% of the genome (genome.gov). Subsequent phases scaled up to whole-genome catalogs in human (and mouse), with the ultimate goal of a comprehensive parts list of genomic functional elements (genome.gov). An ongoing goal is to integrate these elements into an Encyclopedia that links genomic coordinates to biological roles (e.g. which enhancer regulates which gene).
Impact: ENCODE transformed our understanding of the non-coding genome. Notably, in 2012 it reported that ~80% of the human genome shows some biochemical “function” (broadly defined) – a controversial finding that stimulated vigorous debate about the definition of functional DNA (nature.com). The project has identified hundreds of thousands of candidate enhancers and promoters, providing a foundational resource for studying gene regulation (nature.com). ENCODE data have been used to interpret non-coding GWAS variants by linking them to regulatory elements and target genes (nature.com). The project also yielded fundamental insights – revealing pervasive transcription of the genome, mapping chromatin state landscapes (with hundreds of thousands of enhancer-like and promoter-like regions), and showing complex networks of factor binding (nature.com) – and drove technology development (improved ChIP-seq, new assays like ATAC-seq) and community resource standards. The data have been cited in thousands of studies and remain a go-to reference for genomic annotation. By 2020, ENCODE’s registry of candidate regulatory elements had become a key track in genome browsers, and ENCODE methods enabled parallel projects like Roadmap Epigenomics and GTEx to flourish (genome.gov). In summary, ENCODE dramatically advanced the functional annotation of the human genome, providing rich datasets that fuel research in gene regulation, disease mechanisms, and comparative genomics.
Key Publications: The 2012 Nature ENCODE collection was seminal – particularly the consortium’s integrative paper “An integrated encyclopedia of DNA elements in the human genome” (Nature 489:57–74, 2012) (nature.com), which summarized the findings of ENCODE Phase 2. Another wave of key papers came in 2020: the ENCODE Phase 3 package in Nature and other journals (e.g. Nature 583:699–710, 2020) described the expanded human and mouse catalogs (pubmed.ncbi.nlm.nih.gov). Earlier, the pilot project results were published in Nature in 2007, demonstrating the approaches on 1% of the genome. Dozens of component papers (in Genome Research, Genome Biology, etc.) detail specific datasets or analyses (e.g. the 2015 mouse ENCODE paper and the 2013 chromatin state analyses). ENCODE’s rapid data release policy meant many findings appeared incrementally over time, but the 2012 and 2020 compilations serve as the primary references.
Project Website: encodeproject.org – the official ENCODE Portal – provides access to all ENCODE data and results through a user-friendly interface (genome.gov). It includes the ENCODE Encyclopedia, which organizes annotated elements (e.g. the Registry of candidate cis-Regulatory Elements) (genome.gov). The portal also hosts tutorials, the complete list of ENCODE publications (genome.gov), and links to related projects (modENCODE, Roadmap Epigenomics). The data are also mirrored in the UCSC Genome Browser and other genome browsers for visualization.
Resource Manifest: ENCODE data types include ChIP-seq peaks for hundreds of TFs and histone marks, DNase I hypersensitive site maps, RNA-seq expression quantifications, and metadata for every experiment (cell type, treatment, etc.). All experimental raw data (FASTQs/BAMs) and processed files (e.g. bigWig signal tracks, peak BED files) are available on the portal. The ENCODE Encyclopedia (v5 as of 2023) compiles integrated annotations – for example, a registry of roughly one million human candidate cis-regulatory elements (the count grows with each release), with activity annotated across cell and tissue contexts (genome.gov). ENCODE also established data standards (minimum sequencing depth, replication, etc.) and a uniform processing pipeline for each data type (genome.gov), ensuring consistency. The resource is continuously updated (a “release early and often” ethos), and the portal provides both bulk download and an API for programmatic access (see the sketch below).
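For the programmatic route, the portal’s search pages can return JSON directly (format=json). The sketch below lists a few human TF ChIP-seq experiments; the query parameters are one plausible combination, and the embedded fields available in search results can vary.

```python
import requests

# Search the ENCODE portal for a handful of TF ChIP-seq experiments.
resp = requests.get(
    "https://www.encodeproject.org/search/",
    params={
        "type": "Experiment",
        "assay_title": "TF ChIP-seq",
        "format": "json",
        "limit": "5",
    },
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
for exp in resp.json()["@graph"]:
    target = exp.get("target")
    # target may be an embedded object or a path string, depending on framing
    label = target.get("label") if isinstance(target, dict) else target
    print(exp["accession"], label)
```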
Data Accessibility: Open access. All ENCODE data are freely available immediately upon production (the project has a no-embargo data-release policy). The consortium has made its entire dataset public through the portal (genome.gov), with no usage restrictions beyond citation. Even unpublished datasets were released rapidly to maximize utility. (A small subset of human data from primary tissues may be controlled-access where donor privacy is a concern, but the vast majority is open.) ENCODE’s open science approach has been widely lauded, and its data are heavily used by the community (genome.gov).
Major Sequencing Centers: ENCODE was organized as a consortium of Data Production Centers, each specializing in certain assays. During Phase 2, for example: the Broad/MIT group led by Bradley Bernstein focused on chromatin marks; the Stanford/Yale/Harvard group led by Rick Myers (HudsonAlpha) and Michael Snyder did transcription factor ChIP-seq and open chromatin mapping (hudsonalpha.org; med.stanford.edu); UNC Chapel Hill handled RNA analyses; and the University of Washington (John Stamatoyannopoulos) did DNase-seq mapping of regulatory DNA (hudsonalpha.org; med.stanford.edu). The Data Coordination Center was at UC Santa Cruz (UCSC), which served as the central repository and portal, managing data releases and UCSC Genome Browser integration (britannica.com); in later phases the DCC moved to Stanford (J. Michael Cherry’s group). Phase 4 includes new production centers (e.g. Stanford for long-range 3D genome assays) and a Data Analysis Center at UMass Medical for uniform processing (britannica.com). In total, ENCODE has engaged dozens of labs – ~440 scientists were authors on the 2012 main paper (britannica.com) – with key scientific leadership from NHGRI program directors and PIs such as Ewan Birney and Ian Dunham (EBI). The collaborative network and centralized DCC ensured data from all centers were integrated into a cohesive resource (britannica.com).
Human Microbiome Project (HMP)
Years active: 2007–2016 (en.wikipedia.org). (Funded as an NIH Common Fund “Roadmap” initiative beginning in 2007; Phase 1 spanned ~2007–2012 and Phase 2, the Integrative HMP, 2014–2016 (en.wikipedia.org).)
Scale: Profiled the microbiomes of ~300 healthy adults at five major body-site regions (oral, skin, gut, airway, vagina), collecting >11,000 specimens and generating nearly 10,000 metagenomic datasets (academic.oup.com). HMP Phase 1 sequenced ~3.5 terabases of microbial DNA, including ~800 reference genomes of isolated bacterial strains plus 16S rRNA gene libraries and shallow whole-metagenome sequences for each sample (en.wikipedia.org). The flagship analysis characterized nearly 300 individuals at up to 18 body sites, yielding a comprehensive microbiome reference dataset (labmanager.com). In total, HMP isolated and sequenced ~3,000 bacterial reference genomes (surpassing its initial goal of 600) (en.wikipedia.org). Phase 2 (iHMP) added longitudinal multi-omics on smaller cohorts (pregnancy, IBD, prediabetes) with deep sequencing of strains and host interactions. By 2019, HMP-related projects had produced >155,000 combined metagenome/host datasets and sequenced tens of thousands of microbial genomes (pmc.ncbi.nlm.nih.gov). (Note: Phase 1 focused on healthy baseline microbiomes; Phase 2 generated >1 TB of multi-omics data on disease cohorts.)
Tech: Employed culture-independent sequencing methods. The key techniques in Phase 1 were 16S rRNA gene amplicon sequencing (to profile community composition) and whole-metagenome shotgun sequencing (to catalog genes and pathways) (en.wikipedia.org), run on next-generation sequencers (454 and Illumina). In addition, HMP conducted whole-genome sequencing of individual microbial isolates from human samples, creating a reference genome library of ~1,000 prevalent strains (en.wikipedia.org). Extensive data analysis pipelines were developed for microbial assembly, binning, and diversity analysis. Phase 2 (iHMP) layered in host omics: host genome sequencing, transcriptomics, proteomics, and metabolomics alongside microbiome sequencing (en.wikipedia.org). All sequencing and analysis were standardized by HMP’s Data Analysis and Coordination Center (DACC). HMP pioneered large-scale metagenomics, pushing forward both 16S and shotgun approaches and their associated informatics tools (QIIME, MetaPhlAn, etc.).
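A typical first computation on HMP-style 16S profiles is a within-sample (alpha) diversity index. The sketch below computes Shannon diversity H' from per-taxon read counts; the counts are invented for illustration.

```python
from math import log

def shannon(counts):
    """Shannon diversity H' from a list of per-taxon read counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

gut = [500, 300, 120, 50, 20, 10]   # toy read counts per taxon
skin = [900, 50, 30, 15, 5]
print(f"gut  H' = {shannon(gut):.2f}")
print(f"skin H' = {shannon(skin):.2f}")
```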
Goals: To characterize the human microbiome and analyze its role in human health and disease (en.wikipedia.org). Phase 1’s specific goal was to create a baseline catalogue of the microbial communities at major body sites in healthy humans (en.wikipedia.org): identifying which microorganisms are present (taxonomy), what genes they carry (the microbiome gene catalog), and how they vary between individuals. A target was set to sequence 600 reference microbial genomes and perform comprehensive 16S/shotgun surveys of at least 250 healthy subjects (en.wikipedia.org). Phase 2 (iHMP) aimed to go beyond association and explore microbiome dynamics in disease by integrating longitudinal multi-omics in conditions such as pregnancy/preterm birth, inflammatory bowel disease, and type 2 diabetes (en.wikipedia.org). Overarching all of this was the goal of establishing a community resource (data, methods, isolates) to enable future microbiome research.
Impact: The HMP jump-started the microbiome field. It provided the first extensive “normal microbiome” dataset, revealing enormous inter-person variability alongside certain common patterns (e.g. the dominance of Bacteroides versus Prevotella enterotypes in the gut) (labmanager.com). HMP found that there is no single “healthy” microbiome, but rather a range of community types even among healthy people (labmanager.com). It also uncovered millions of new genes (many with no known function), greatly expanding our understanding of human-associated microbial diversity, and its reference genomes improved microbial detection and metabolic pathway analysis in metagenomes. HMP data have been used to link microbiome shifts to diseases (from obesity to autoimmune conditions) and to identify previously uncultivated organisms. Importantly, HMP established standards and protocols for microbiome studies – from sample collection and DNA extraction to data analysis pipelines – which accelerated and harmonized subsequent research (en.wikipedia.org). It also produced resources like the HMP reference genome database and trained a generation of microbiome scientists. Overall, HMP turned microbiome research from anecdotal studies into a data-rich science, much as the Human Genome Project did for genomics.
Key Publications: In June 2012, the HMP Consortium published a series of reports in Nature and PLoS journals – including a flagship Nature paper detailing microbiome composition across up to 18 body sites in 242 healthy individuals and a companion on metagenomic function (Nature 486:207–214, 2012) (labmanager.com). These are considered the foundational HMP papers. Many site-specific papers (on the gut, oral, and vaginal microbiomes) appeared in PLoS ONE and Genome Research in 2012. Phase 2 results were published in 2019, highlighted by three Nature papers covering the integrative analyses of the preterm birth, IBD, and prediabetes cohorts (Integrative HMP (iHMP) Research Network, Nature, 2019). Those papers demonstrated links between temporal microbiome changes and host measurements.
Project Website: The HMP Data Coordination Center (HMP DACC) site – hmpdacc.org – served as the central repository (en.wikipedia.org). (The site is now archived by NIH.) It provided data downloads, protocols, and a catalog of all HMP-produced sequences. The NIH Common Fund’s HMP program page also offers an overview and links. HMP data are additionally accessible via NCBI BioProject (ID 43021) and the HMP data portal.
Resource Manifest: HMP Phase 1 generated: (a) ~690 metagenomic samples with both 16S and shallow shotgun data from oral, gut, skin, vaginal, and nasal sites; (b) a reference collection of ~3,000 microbial isolate genomes (from culture collections or new isolates) (en.wikipedia.org); and (c) associated clinical data for each human subject (age, BMI, etc.), all organized in the HMP DACC. Phase 2 added longitudinal multi-omics datasets (dozens of timepoints per subject, multiple data types). The DACC hosts a metadata registry describing each sample and sequencing run and provides precomputed microbiome profiles (taxonomy abundances, gene catalogs). There is also an HMP 16S rRNA reference sequence collection and a 16S sequence variant (OTU) table for the cohort. The HMP reference gene catalog contains on the order of 5–10 million unique genes. In short, HMP delivered both raw data (reads) and processed data (profiles, catalogs) as community resources (en.wikipedia.org).
Data Accessibility: Open access (with some controlled clinical information). All HMP genomic data were released publicly through repositories like NCBI’s Sequence Read Archive, and the HMP DACC made data downloadable without restriction. Summary data and reference genomes are open. Limited human metadata that could identify individuals (e.g. specific health status in the iHMP disease cohorts) were coded to protect privacy, but microbiome sequences themselves carry no personal identifiers and were freely available. The project adhered to NIH’s open-data ethos, enabling broad use of the datasets – by 2017, over 650 papers had been published using HMP data (en.wikipedia.org).
Major Sequencing Centers: Sequencing was shared across a network. The Baylor College of Medicine Human Genome Sequencing Center led the metagenomic sequencing and analysis for a large fraction of HMP samples (en.wikipedia.org). The Broad Institute and Washington University were also key contributors (holding NIH grants for microbial reference genome sequencing). The J. Craig Venter Institute (JCVI) and Washington University collaborated on 16S rRNA sequencing and isolate genomes; WashU’s Genome Center sequenced hundreds of bacterial isolates, and JCVI ran an HMP sequencing center and contributed early informatics. Additional participants included Stanford University (vaginal microbiome analysis), the University of Michigan (data analysis), and Virginia Commonwealth and Northeastern Universities (culturing new microbes) (en.wikipedia.org). In total, over 80 university and medical center teams were involved in Phase 1. The Data Analysis and Coordination Center was at the Institute for Genome Sciences, University of Maryland (led by Owen White), coordinating data flow and quality control. HMP’s inclusive consortium built genomics capacity in microbiology labs and forged academia–industry partnerships (Illumina and 454 Life Sciences provided early access to technology).
100,000 Genomes Project (Genomics England)
Years active: 2013–2018 (en.wikipedia.org). (Announced in 2012; project infrastructure set up in 2013; sequencing of 100,000 genomes completed by the end of 2018 (en.wikipedia.org).)
Scale: Whole-genome sequenced 100,000 genomes from around 85,000 NHS patients in England, focusing on rare genetic diseases, cancer, and infection (en.wikipedia.org). This included ~73,000 genomes from patients with rare diseases (and their relatives) and ~25,000 genomes from cancer patients (tumor/normal pairs) (en.wikipedia.org). In total, about 85,000 individuals had at least one genome sequenced, each to ~30× coverage on short-read Illumina technology, generating tens of petabytes of raw data. By October 2018, 87,231 whole genomes had been completed (en.wikipedia.org), and the 100,000th genome was delivered in December 2018. These data, combined with detailed electronic health records for participants, created an unprecedented dataset for genomic medicine in the UK. The project also amassed a large variant database, and pathogenic findings were returned to the NHS for a subset of patients – by 2019, over 1,200 rare disease diagnoses had been made through the project.
Tech: Performed whole-genome sequencing (WGS) at high coverage on Illumina HiSeq X Ten and NovaSeq platforms. Sequencing was carried out in partnership with Illumina at a state-of-the-art facility on the Wellcome Genome Campus (Hinxton, UK) – an Illumina clinical services lab established specifically for this project (en.wikipedia.org). Genomes were sequenced to ~30× depth and variant-called with the Illumina DRAGEN pipeline, harmonized across all samples (nature.com). For cancer patients, both tumor and normal DNA were sequenced. The project also implemented protocols for sample handling in NHS Genomic Medicine Centres, LIMS tracking, and high-throughput analysis pipelines; variant interpretation was aided by software platforms such as Congenica and GenomOncology (contracted for clinical annotation in 2015) (en.wikipedia.org). Overall, the project was notable for scaling clinical WGS to a national healthcare system, requiring robust automation, logistical coordination across 13 regional centers (en.wikipedia.org), and cloud computing (data were stored and analyzed in an Amazon Web Services cloud).
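Why 30×? Under the idealized Lander–Waterman model, the fraction of bases left uncovered at mean coverage c is about e^(−c), so sampling gaps become negligible well before 30×; real-world gaps instead come from mappability and GC bias. A back-of-envelope check:

```python
from math import exp

GENOME_BP = 3.1e9                 # approximate human genome size
for c in (4, 15, 30):
    uncovered = exp(-c)           # Poisson P(depth == 0) at a given base
    print(f"{c:>2}x: ~{uncovered:.2e} of bases uncovered "
          f"(~{uncovered * GENOME_BP:,.0f} bp)")
```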
Goals: To sequence 100,000 genomes and usher in a new era of genomic medicine in the NHS (en.wikipedia.org). The focus was on rare diseases (providing diagnoses for patients with undiagnosed genetic conditions) and common cancers (identifying mutations that could guide therapy) (en.wikipedia.org). By linking WGS data with clinical records, the project aimed to discover new gene–disease relationships, inform drug discovery, and build the evidence base for integrating genomics into routine care. A parallel goal was to kickstart the UK genomics industry – building infrastructure and an anonymized data resource for research and drug development (genomicsengland.co.uk; frontlinegenomics.com). The project also had educational aims: training clinicians and scientists in genomics and engaging the public. Ultimately, the 100K Project’s ambition was to prove the value of WGS at scale, leading to a lasting NHS Genomic Medicine Service (which indeed launched in 2018 as a result) (genomicsengland.co.uk).
Impact: The project put the UK at the forefront of implementing genomic medicine. It delivered diagnoses to hundreds of families – around 25% of rare disease patients received a genetic diagnosis that had previously eluded them (nature.com). In oncology, it demonstrated the feasibility of WGS in routine cancer care, identifying clinically actionable mutations (with results still being analyzed). Importantly, the project catalyzed the creation of a permanent NHS Genomic Medicine Service in 2018, which now aims to sequence 500,000+ genomes as part of routine care (genomicsengland.co.uk). It also generated a rich research database: the de-identified genomic data are accessible to researchers and have already led to new gene discoveries (e.g. novel disease genes for pediatric disorders) and insights into population structure, given the diversity within the UK. The project further advanced standards for clinical-grade genome sequencing and interpretation, attracted investment to a UK genomics industry cluster, and, perhaps most importantly, proved that large-scale WGS can be done in a national health system – helping convince policymakers of the value of genomics in healthcare (genomicsengland.co.uk).
Key Publications: Many findings appear in case-specific papers, but notable high-level publications include the 2019 BMJ article describing the project’s rationale and design (en.wikipedia.org) and the 2021 New England Journal of Medicine paper by the 100,000 Genomes Project Pilot Investigators reporting early results in rare disease and cancer. (A 2022 Nature paper on the first 150,000 UK Biobank genomes came from a distinct research cohort, but its sequencing drew on the 100K experience.) As the project data mature, flagship papers on aggregate findings (e.g. comprehensive rare variant yields or pan-cancer genome analyses) are expected. Press releases from December 2018 and publications on specific discoveries (such as new intellectual disability genes) highlight the successes. The project’s output is also measured in how its data are used by others: approved researchers analyze the 100K data within Genomics England’s Research Environment, producing papers across genetics and clinical journals.
Project Website: Genomics England’s site (genomicsengland.co.uk) hosts information about the 100,000 Genomes Project (en.wikipedia.org). The “Results and Findings” section and the research portal are key resources, and there is an NHS webpage for participants. Data access for researchers is through a secure cloud portal (the Genomics England Research Environment).
Resource Manifest: The project produced: (a) CRAM files (compressed reads) and variant call files for 100K genomes; (b) an integrated variant database (with frequency annotations) of all variants found; (c) clinical phenotype data for participants (e.g. HPO terms for rare disease patients, cancer pathology reports); and (d) a curated knowledge base of pathogenic variants that were fed back to clinicians. About 95% of participants consented to the use of their de-identified genomic data for research (en.wikipedia.org), so these data are available in the Research Environment. By 2022, sequence and clinical data for ~470,000 genomes (including subsequent efforts beyond the initial 100K) had been made available to approved researchers (ukbiobank.ac.uk). The Genomics England dataset is now one of the largest of its kind, with extensive metadata and continually improving annotations (e.g. periodic reanalysis with the latest pipelines).
Data Accessibility: Controlled access. Because whole genomes and linked health data are sensitive, the 100K genomes are not publicly downloadable. Researchers instead apply to use the data in a secure cloud-based Research Environment (nature.com). Once approved, they access de-identified individual-level data and analysis tools within Genomics England’s platform (data cannot be removed; you bring the analysis to the data). This “data passport” model has a quick turnaround and ensures data security (nature.com). Aggregate data such as allele frequencies are openly available through an online browser (similar to gnomAD), Genomics England releases summary reports, and the project has an open-access publications policy. In short, while raw data are controlled to protect patient privacy, the project strongly encourages wide research use – as of 2023, hundreds of research users worldwide are mining the data under Genomics England’s approved protocols.
Major Sequencing Centers: Illumina, Inc. was the principal sequencing provider, operating a dedicated lab on the Wellcome Genome Campus in Cambridgeshire that performed the bulk of the 100K sequencing (en.wikipedia.org). For the Genomics England 100K effort, Illumina’s clinical services lab was the primary center. The Wellcome Sanger Institute played a supporting role in validation and later extended the approach to population cohorts: in the subsequent UK Biobank WGS project, Sanger sequenced ~243,000 genomes and its collaborator deCODE Genetics (Iceland) ~257,000 (sanger.ac.uk). For analysis and interpretation, Genomics England partnered with UK regional genetics labs and companies – e.g. Congenica and Omicia (now Fabric Genomics) were contracted in 2015 for variant interpretation software (en.wikipedia.org). The project was managed by Genomics England, a company wholly owned by the UK Department of Health (en.wikipedia.org), which coordinated the NHS Genomic Medicine Centres (sample collection and library preparation) and the sequencing labs. The strong involvement of Illumina and the Sanger Institute on the Wellcome Campus created a public–private partnership that enabled the project’s massive scale (en.wikipedia.org).
All of Us Research Program
Years active: 2015–present. (Announced as part of the US Precision Medicine Initiative in 2015; national enrollment opened in May 2018 (nature.com) and is ongoing through the 2020s.)
Scale: Aims to enroll 1 million or more U.S. participants from diverse backgrounds (nature.com). As of early 2024, over 413,000 participants had contributed data, with 245,000+ genomes sequenced (whole-genome or exome) in the program’s initial data releases (nature.com). Ultimately, All of Us will generate one of the world’s largest biomedical datasets: each participant provides DNA (for genotyping and WGS), blood/urine for biomarker assays, electronic health records (longitudinal clinical data), survey information (lifestyle, demographics), and often digital health data from wearables. On the genomics side specifically, All of Us has already produced 245,388 whole-genome sequences (clinical-grade, 30× coverage) plus genotyping array data for ~165,000 participants (nature.com). (The NIH’s separate Trans-Omics for Precision Medicine (TOPMed) program has likewise sequenced well over 100,000 whole genomes.) The data continue to grow: by 2025, All of Us expects genomic data on more than 600,000 participants (it passed 500,000 genomes sequenced in 2023, with sequencing ongoing).
Tech: Participants’ DNA undergoes both whole-genome sequencing (WGS) and genotyping. All of Us established three dedicated Genome Centers (at Baylor College of Medicine, the Broad Institute and partners, and the University of Washington) which collectively perform high-throughput Illumina sequencing and SNP genotypingnature.comnature.com. WGS is done at ~30× coverage (short-read Illumina), and since 2022 also at 0.5× ultra-low coverage for detecting large variants; a subset of samples is being sequenced with long-read technology for structural variant discoverysupport.researchallofus.org. Genotyping is performed on a customized array (~1.7 million markers optimized for diverse ancestries and medically relevant variants) across all samplesnature.com. The genome centers harmonized lab protocols and use uniform bioinformatics pipelines (DNAnexus/DRAGEN for alignment and variant calling)nature.comnature.com. Additionally, All of Us collects a wealth of other data: EHR data are standardized (FHIR format), surveys are administered through a digital platform, and physical measurements (BMI, blood pressure, etc.) are taken at enrollment. The data are stored and provided to researchers via a cloud-based Researcher Workbench, ensuring consistency and securitynature.comnature.com. In summary, All of Us leverages modern genomics and big-data technology (cloud computing, electronic consent, etc.) to aggregate an immense, diverse dataset.
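To make the access model concrete: inside the Researcher Workbench, analyses typically run in Jupyter notebooks against the curated dataset. The following is a minimal sketch only – it assumes a BigQuery backend with OMOP-style tables and a WORKSPACE_CDR environment variable, which are simplifications rather than the exact All of Us schema:

    # Hypothetical Workbench-style query (Python). Table/column names and the
    # WORKSPACE_CDR variable are illustrative assumptions, not the real schema.
    import os
    from google.cloud import bigquery

    client = bigquery.Client()
    cdr = os.environ.get("WORKSPACE_CDR", "my_project.my_dataset")  # placeholder

    query = f"""
        SELECT gender_concept_id, COUNT(*) AS n_participants
        FROM `{cdr}.person`
        GROUP BY gender_concept_id
        ORDER BY n_participants DESC
    """
    df = client.query(query).to_dataframe()  # needs pandas and db-dtypes installed
    print(df)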
Goals: To build a nationwide research cohort that is reflective of America’s diversity, in order to advance precision medicinenature.comnature.com. The program’s primary goal is to enable studies of how genetic, environmental, and lifestyle factors influence health and disease in populations that have been historically underrepresented in researchnature.comnature.com. At least 50% of participants are to be racial/ethnic minorities, with a focus on including groups often left out of genomics (All of Us has ~80% from underrepresented groups as of 2023)nature.comnature.com. Scientific aims include discovering new disease-risk markers, improving drug response prediction (pharmacogenomics), and identifying tailored prevention strategies. Another major goal is to return value to participants: All of Us offers participants options to receive their genetic results (e.g. ancestry, traits, CDC tier 1 genetic risk findings, and pharmacogenetic results) with proper counselingnature.comnature.com. More broadly, All of Us seeks to establish a model for engaging citizens in research, with open data and ongoing feedback, and to create a resource that will be used for decades to come to study myriad diseases.
Impact: Even in its early stage, All of Us has made an impact by dramatically increasing diversity in genomic research. Over 77% of the sequenced participants are from communities historically under-represented in biomedical researchnature.comnature.com, making this by far the largest such dataset in the US. The sheer scale and diversity have already led to discoveries: All of Us has catalogued ~1 billion genetic variants, of which ~275 million were previously unknownnature.comnature.com, many contributed by its diverse genomes. The program has also shown that large-scale return of genetic results in a research setting is feasible: as of 2022, tens of thousands of participants had received personalized DNA reports (ancestry and health-related) from All of Us. Additionally, All of Us is spurring methodological advances in how to analyze massive, heterogeneous health data (genomes linked with EHRs, wearables, etc.), and it is accelerating policy discussions on data privacy and sharing. In the long run, as researchers mine the data, we can expect many novel gene–disease associations and improvements in polygenic risk scores that are more applicable to non-European ancestries (a key need in genomics)nature.comnature.com. All of Us is essentially creating an enduring infrastructure for precision medicine research – its true impact will unfold over many years as thousands of studies leverage the resource. It has also set new standards for participant engagement and transparency in research.
Key Publications: The All of Us Research Program marked a milestone with a February 2024 set of papers in Nature. Chief among them: “Genomic data in the All of Us Research Program” (Nature, Feb 2024), which details the first 245,000 genomes and highlights the diversity and initial findingsnature.comnature.com. Another is the program’s 2019 design paper in NEJM, outlining its protocol and early enrollmentnejm.org. Additionally, there have been policy and framework papers (e.g. in Science Translational Medicine on returning results) and numerous conference abstracts. As data releases continue, more analytical publications are appearing – e.g. a 2023 study on subcontinental ancestry gradients in All of Us (Cell Genomics 2023), and a 2022 paper in Nature reporting that >80% of participants are from underrepresented groupsnejm.orgnejm.org. The program also shares research outcomes through its online dashboard and annual symposia rather than traditional publications alone.
Project Website: ResearchAllofUs.org and JoinAllofUs.org are the main portals. The research hub (researchallofus.org) contains the Data Browser (aggregate data viewer) and documentation for the Researcher Workbenchnature.comnature.com. The participant site (joinallofus.org) provides info for volunteers. All of Us also has an NIH program page and regularly updates through blog posts and press releases (e.g. announcing when 500k genomes were achieved).
Resource Manifest: The All of Us data resource includes: Genomic data – full genome sequences (CRAMs, jointly called variant sets) and genotyping array data; Electronic Health Record (EHR) data – diagnoses, medications, labs, procedures, etc., normalized to standard terminologies; Survey data – participant questionnaires on lifestyle, medical history, and social determinants; Physical measurements from baseline visits; and digital health data (e.g. a Fitbit data pilot). The Data Browser publicly shows aggregate statistics (allele frequencies of variantsnature.comnature.com, prevalence of conditions, etc.). Approved researchers can access the individual-level dataset via the cloud-based Workbench, which includes tools like Jupyter notebooks and cohort-selection GUIsnature.comnature.com. There is also a curated set of educational materials and code snippets to help researchers use the data. On the clinical side, the program returns to participants a Genetic Results Report (for ancestry/traits) and is beginning to return a curated list of medically actionable variants (per ACMG guidelines) – which in turn yields data on how many participants carry such variants. All of Us is committed to updating the dataset continually (new EHR data stream in regularly, and new surveys are deployed). In summary, the resource is a living database of genomic and health data for a huge cohort.
Data Accessibility: Controlled access (with open summary data). Given the sensitivity, individual-level All of Us data (genomic and EHR) are accessible only to registered researchers via a secure cloud environmentnature.comnature.com. Researchers must complete a data use agreement and training to use the Researcher Workbench, which provides de-identified data. No raw data leave the environment; analyses are performed on the platform and only summary results can be exported. This model ensures privacy while “democratizing access” to approved usersnature.comnature.com. At the same time, All of Us provides public summary data through its Data Browser – anyone can look up aggregated allele frequencies, disease prevalences, etc., without a loginnature.comnature.com. So the ethos is transparency with safeguards. The median time from a researcher registering to getting data access is only ~29 hoursnature.comnature.com, reflecting an efficient process. Participants themselves can access their own data via a participant portal. Thus, All of Us tries to serve participants (returning information to them), the scientific community (broad data access), and the general public (open summaries), all while maintaining trust and privacy.
Major Sequencing Centers: All of Us selected three national Genome Centers via a competitive process: 1) Baylor College of Medicine – Johns Hopkins University Clinical Genome Center (Texas/Maryland)nature.comnature.com, 2) Broad Institute of MIT & Harvard in partnership with Color Genomics and Mass. General Brigham (Massachusetts/California)nature.com, and 3) University of Washington Northwest Genomics Center (Seattle)nature.com. These centers collectively handle all DNA processing, genotyping, and sequencing. They have harmonized their lab procedures and quality controls (e.g. all use the same DNA extraction kits, the same sequencing coverage targets, etc.)nature.comnature.com. Each week, sample shipments go from the central biobank to the genome centers, and the centers return data to the Data Coordinating Center (run by Vanderbilt University Medical Center)nature.comnature.com. The Genome Centers each focus on both WGS and genotyping in parallel to increase throughputnature.comnature.com. The Mayo Clinic serves as the central biobank, storing and distributing the samples. In addition, All of Us leverages the Broad’s and Baylor’s large informatics teams for joint variant calling and data analysis. On the enrollment side, over 340 clinic sites (through healthcare provider organizations) contribute participants across all 50 states, but the core sequencing is concentrated at those three centers. In summary, the heavy lifting of genome sequencing for All of Us is done by powerhouse genome institutes: Baylor-Hopkins, Broad-Color, and UW, which have delivered hundreds of thousands of genomes efficientlynature.comnature.com. This distributed but coordinated model ensures both volume and data consistency for the program.
Years active: 2016–presentsanger.ac.uk. (Launched in October 2016 as an international initiative; work is ongoing with no fixed end date, though a first draft atlas is expected by around 2027.)
Scale: An unprecedented single-cell profiling effort – as of 2023, HCA researchers have profiled over 100 million individual cells from human tissueshumancellatlas.orghumancellatlas.org. These cells come from 10,000+ samples spanning >50 organs and tissues of the body (e.g. blood, bone marrow, lung, brain, skin, placenta) across diverse donorshumancellatlas.orghumancellatlas.org. The project has generated massive single-cell RNA sequencing (scRNA-seq) datasets – for example, a Science series (May 2022) published atlases of ~1 million cells across 33 tissues in healthy humanshumancellatlas.orghumancellatlas.org. As of 2024, the HCA Data Portal has mapped ~62 million cells into at least 18 major organ “biological networks” (organ systems)en.wikipedia.orgen.wikipedia.org. In addition to transcriptomes, HCA includes other modalities such as single-cell chromatin accessibility (snATAC-seq), spatial transcriptomics, and protein markers. The consortium’s first phase produced landmark atlases: e.g. a cell atlas of the human immune system, the Tabula Sapiens covering 24 organs, a brain cell atlas, etc. The ambition is to ultimately represent all ~37 trillion cells of the human body in a reference map (practically, a representative profile of every distinct cell type). With contributions from >3,600 members in 100 countrieshumancellatlas.orghumancellatlas.org, the data generation is distributed but collectively enormous – many terabytes of single-cell data made public.
Tech: Relies on cutting-edge single-cell and spatial genomics technologies. The primary workhorse is single-cell RNA sequencing (scRNA-seq) – droplet-based 10x Genomics Chromium, with Smart-seq for full-length transcripts in some cases – to capture gene expression in individual cellssanger.ac.uksanger.ac.uk. Complementary methods include single-nucleus RNA-seq (for tissues like brain), single-cell ATAC-seq for chromatin accessibility, CyTOF and high-dimensional flow cytometry for protein, and various spatial transcriptomics platforms (MERFISH, Visium, Slide-seq) to map cells in situ. The project places strong emphasis on tissue processing to obtain viable single-cell suspensions – dozens of tissue-specific dissociation protocols have been developed. Another aspect is data integration: HCA has driven the creation of algorithms for clustering, trajectory inference, and cell type annotation across massive datasets (e.g. Seurat, Scanpy, cellxgene). The data are stored in a cloud-based Data Coordination Platform (DCP), and standardized pipelines (like the Optimus or Cell Ranger pipelines for 10x data) ensure consistent processing. There is also a metadata standard (MAv2) to capture donor information, tissue source, cell isolation method, etc. HCA’s pilot projects included development of DNA/RNA indexing techniques to multiplex samples. In summary, HCA is a tech-intensive effort leveraging state-of-the-art single-cell sequencing, computational integration (e.g. constructing a common coordinate framework for cell types), and visualization tools to handle millions of cells.
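To ground the workflow described above, here is a minimal sketch of the standard normalize–reduce–cluster chain using Scanpy (one of the toolkits named earlier); the input file name is a placeholder for any downloaded cell-by-gene matrix in .h5ad format, and the parameter values are common defaults rather than HCA requirements:

    # Minimal scRNA-seq clustering sketch with Scanpy.
    import scanpy as sc

    adata = sc.read_h5ad("hca_tissue_counts.h5ad")     # cells x genes counts (placeholder)

    sc.pp.filter_cells(adata, min_genes=200)           # drop near-empty droplets
    sc.pp.filter_genes(adata, min_cells=3)             # drop ultra-rare genes
    sc.pp.normalize_total(adata, target_sum=1e4)       # library-size normalization
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable].copy()
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata, n_neighbors=15)             # kNN graph in PCA space
    sc.tl.leiden(adata)                                # graph clustering (needs leidenalg)
    sc.tl.umap(adata)                                  # 2-D embedding
    sc.pl.umap(adata, color="leiden")                  # clusters = candidate cell types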
Goals: To create a comprehensive reference map of all human cells, in order to understand health and serve as a baseline to interpret diseasesanger.ac.uksanger.ac.uk. Specifically, HCA aims to identify every cell type and sub-type in the human body, delineate their molecular signatures (gene expression, epigenetic states), and map their spatial organization in tissuessanger.ac.uksanger.ac.uk. The rationale is that cells are the fundamental units of organs, and having an atlas of cell types will advance biology similarly to how the Human Genome Project advanced genetics. Secondary goals include developing new methods for single-cell analysis, promoting open data and collaboration, and ensuring representation of diverse ancestries and developmental stages in the atlas. The HCA also emphasizes open science, equity, and ethics – making data freely available and building capacity worldwide so that the atlas benefits all communitiessanger.ac.uksanger.ac.uk. Ultimately, the HCA will provide a reference that helps interpret how cells change in disease, guide regenerative medicine (by knowing what cell types to target or grow), and chart the course of human development and aging at the cellular level.
Impact: Already, HCA efforts have yielded numerous discoveries. Cell atlases of specific organs have identified new cell subtypes (for example, new neuron subtypes in the cortex and novel intestinal tuft cell subsets), clarified developmental lineages, and pinpointed expression of disease-related genes in specific cell populations. For instance, a lung cell atlas helped identify the cells expressing the ACE2 receptor targeted by SARS-CoV-2, informing COVID researchsanger.ac.uksanger.ac.uk. The HCA’s focus on diversity is addressing gaps – e.g. including samples from multiple ancestries to detect population differences in cell type abundance or gene expression. Technologically, HCA has driven improvements in data integration across donors, enabling the concept of a “universal reference” for each cell type. The atlases are already being used as a baseline to study disease: researchers can compare patient tissues to the healthy atlas to find disease-specific cell states (e.g. inflammatory fibroblasts). The project has also galvanized the scientific community: nearly 4,000 scientists are part of HCA, forming a highly collaborative networksanger.ac.uksanger.ac.uk. Additionally, by working closely with organizations like the Chan Zuckerberg Initiative (CZI) and the EU’s LifeTime initiative, HCA has influenced funding and policy in favor of open, large-scale cell mapping. As sub-atlases are completed, they are often published in high-impact journals (e.g. the multi-organ cell atlases in Science, 2022). Over the long term, HCA’s impact will be in providing a reference that underpins a new era of cell-targeted therapies and a precise understanding of human biology at cellular resolution.
Key Publications: Thus far, HCA results have been released piecemeal, organ by organ. Major publications include a series of papers in Science (May 2022) covering atlases of immune cells, epithelial cells, etc., compiled into a draft multi-tissue cell atlas of ~1M cellshumancellatlas.org. In October 2022, Nature and Nature Medicine published a set of papers on a human immune cell atlas (mapping blood and immune organs). Another landmark was the “Tabula Sapiens” (Science, May 2022), an HCA-aligned effort that profiled 24 organs from 15 donors (~500k cells). For specific organs: a 2019 Cell paper detailed a human lung cell atlas, and a 2018 Nature paper presented a cell atlas of the maternal–fetal interface. The HCA organizers also outlined their vision in an influential 2017 Science perspective titled “The Human Cell Atlas” by Regev, Teichmann, et al. Another key paper is the 2021 Nature publication describing the Data Coordination Platform and HCA’s open-science approach. As the project continues, expect combined analysis papers (e.g. integrating all atlases into a truly comprehensive reference) in top journals.
Project Website: humancellatlas.org – the official HCA site – provides overall info, news, and links. The HCA Data Portal (data.humancellatlas.org) is where scientists can search and download datasets, and view metadata. Another resource is cellxgene (hosted by CZ CELLxGENE), an interactive browser for single-cell data, which hosts many HCA datasets for easy exploration. The HCA site also lists all participating labs, protocols, and how to contribute. There’s an HCA GitHub with analysis pipelines. Because HCA is a loose consortium, data is also hosted in multiple places: e.g. EMBL-EBI’s Single Cell Expression Atlas and the Chan Zuckerberg-funded UCSC Cell Browser. But the central portal is the primary entry point.
Resource Manifest: The HCA outputs include raw sequencing data (FASTQ files for scRNA-seq, etc.), processed data (cell-by-gene count matrices, dimensionality reduction coordinates, cluster annotations), and higher-level annotations (like lists of marker genes for cell types). For each sample/donor, rich metadata is recorded: donor attributes (age, sex, ethnicity), tissue source and anatomical location, technical details of the library prep, etc. All data in the HCA Data Portal is indexed with unique identifiers. The portal allows one to download full datasets or query specific cell types. As an example, a user can obtain the transcriptomes of all type II pneumocyte cells across donors. The HCA also maintains a preliminary ontology of cell types to standardize labels across studies. Moreover, HCA has a cloud-based Data Storage System (on AWS and Azure) where labs can upload and others can access large datasets efficiently. The resource is constantly expanding: e.g. a 2023 update added an atlas of 1 million+ immune cells. In addition, HCA works closely with initiatives like the Human Developmental Cell Atlas to include embryonic and pediatric samples. In summary, the HCA resource is a dynamic, distributed collection of single-cell data with accompanying tools to make it usable – representing a fundamentally new kind of reference (a “Google Maps of the human body,” at cellular resolution).
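As a concrete version of the type II pneumocyte example above, here is a minimal sketch using the anndata library; the file name is a placeholder, and the obs columns "cell_type" and "donor_id" are common conventions that vary between datasets:

    # Pull one annotated cell type across donors from a processed .h5ad file.
    import anndata as ad

    adata = ad.read_h5ad("lung_atlas.h5ad")                       # placeholder path
    at2 = adata[adata.obs["cell_type"] == "type II pneumocyte"]   # boolean subset
    print(f"{at2.n_obs} cells from {at2.obs['donor_id'].nunique()} donors")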
Data Accessibility: Open access. HCA adheres to immediate open data release – all data submitted to the HCA portal is freely downloadable to anyonesanger.ac.uksanger.ac.uk. The consortium has strong principles of equity and openness, meaning no embargoes for contributing labs beyond what is needed for quality checks. This occasionally led to data being public before publication. The data is consented for research use and is de-identified (focus is on cells, not on personal health info), so privacy concerns are low. There are some controlled aspects: for example, donors are anonymized and certain donor metadata (like rare health conditions if any) might be protected. But the cell-level genomic data (gene expression profiles etc.) are openly accessible. Tools like cellxgene allow interactive exploration without any login. For bulk downloads, users may go through EBI’s system but there’s no special approval needed. The HCA also fosters open code – all analysis pipelines and visualization software are open-source. Additionally, the collaboration with UNESCO signals HCA’s commitment to ethical open science globallysanger.ac.uksanger.ac.uk. Overall, HCA’s philosophy is that this atlas is a global public good, and thus data should be available to all researchers and communities.
Major Sequencing Centers: The HCA is a decentralized effort with contributions from many labs worldwide. Key organizing nodes have been: the Wellcome Sanger Institute (UK) – its Cellular Genetics programme (led by Sarah Teichmann, co-founder of HCA) has sequenced millions of cells (e.g. leading the immune cell atlas and the UK’s adult tissue atlases)sanger.ac.uksanger.ac.uk – and the Broad Institute (USA) – led by Aviv Regev (co-founder) and colleagues – which drove major organ-specific projects and developed many analysis tools. The Chan Zuckerberg Initiative (CZI) provided funding and convening but is not a sequencing center per se. Other major contributors include Stanford University (Stephen Quake’s group, which led the Tabula Sapiens), the Karolinska Institute in Sweden (lung and brain atlases), the Hubrecht Institute in the Netherlands (intestinal organoid atlases), the University of Cambridge (immune, liver), ShanghaiTech in China (Asia initiatives), and CSIR-IGIB in India (Indian cell atlas efforts). By volume, the Sanger Institute’s Cellular Genetics programme has been one of the largest single contributors of data (it runs a dedicated high-throughput pipeline for single-cell genomics). The HCA’s network structure means no one center sequences everything; instead, dozens of labs each tackle different tissues or questions, often in national consortia (e.g. KPMP for kidney, the BRAIN Initiative for brain) that feed into HCA. To coordinate, HCA has working groups (analysis, metadata, etc.) and annual meetings. Notably, Oxford Nanopore sequencing is also used by some groups for full-length transcripts, but the bulk of the data is Illumina, generated at the academic centers mentioned. EU Horizon 2020 funds have also supported much of the single-cell work in Europe. In summary, Sanger and Broad can be thought of as two linchpins (reflecting co-chairs Teichmann and Regev)sanger.ac.uksanger.ac.uk, but the “army” of HCA involves hundreds of institutes globally, making it a truly international, collaborative genome project – at the cell level.
Years active: 2018–presentdkfz.de. (Announced in Nov 2018; Phase 1 was 2018–2022, now in Phase 2 (2023–2026) with a planned completion around 2035sanger.ac.uksanger.ac.uk.)
Scale: An ambitious “moonshot” to sequence the genomes of all known eukaryotic species on Earth, estimated at ~1.8 million speciessanger.ac.uksanger.ac.uk. The project’s ultimate scope spans >1.6 million plants, animals, fungi, and protists. Progress so far: by the end of 2024, the EBP and its affiliated projects had produced 3,465 high-quality genome assemblies representing >500 taxonomic familiessanger.ac.uksanger.ac.uk – just ~0.2% of the goal, but including reference representatives for many major lineages. Phase 1 (2018–2022) focused on sequencing at least one species from each of the ~9,400 eukaryotic taxonomic families. It achieved ~1,667 published genomes from 554 families, plus 1,798 more genomes pending publicationsanger.ac.uksanger.ac.uk. In 2023, the project entered Phase 2, targeting ~150,000 species – roughly one per genus – by 2026sanger.ac.uksanger.ac.uk. This will require a dramatic ramp-up to sequencing ~3,000 new species per monthsanger.ac.uksanger.ac.uk. The scale is distributed across many sub-projects: e.g. the Vertebrate Genomes Project (aiming for all ~70K vertebrates), the 10,000 Plant Genomes Project, the Global Invertebrate Genomics Alliance, the 1000 Fungal Genomes project, etc. EBP acts as an umbrella coordinating these. Ultimately, if completed, EBP will involve on the order of a few petabases of sequence data and create a digital genomic library of Earth’s biodiversity.
Tech: Primarily employs long-read sequencing and advanced assembly techniques to generate high-quality, chromosome-level reference genomes for each species. The project has set standards for “EBP quality” assemblies: ideally >90% of the genome assembled at the chromosome level, high contiguity (contig N50 >1 Mb), and annotation of at least 90% of genesfrontiersin.orgearthbiogenome.org. To do this, EBP projects use technologies like PacBio HiFi reads and/or Oxford Nanopore ultra-long reads for contigs, supplemented by Hi-C or optical mapping for scaffolding to chromosomes. For many species, they also sequence transcriptomes to aid annotation. The project leverages new cost efficiencies – as of 2025, PacBio HiFi and other long-read technologies are much cheaper, making it feasible to tackle tens of thousands of genomes. EBP has developed pipelines for assembly (often using tools like FALCON, Flye, or HiCanu) and annotation (using Ensembl or MAKER pipelines). Samples (DNA) are often provided by museums or field biologists – ensuring correct species identification and voucher specimens is part of the pipeline. There is a focus on assembling haplotype-resolved genomes when possible (using trio binning or HiFi trio workflows) to get phased assemblies for species with high heterozygosity. The project also emphasizes metadata and data sharing: assembled genomes are submitted to the INSDC databases (GenBank/ENA/DDBJ) regularlysanger.ac.uksanger.ac.uk. Additionally, EBP envisions innovations like “gBox” genome-sequencing labs in a box (shipping-container labs) to accelerate sequencing in remote areas and build local capacitysanger.ac.uksanger.ac.uk. In summary, EBP is pushing the frontiers of genome assembly, developing best practices for non-model organisms (often with large, repetitive genomes) at an unprecedented scale.
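Because the “EBP quality” standard above hinges on contig N50, a short dependency-free sketch of how that metric is computed from an assembly FASTA may be useful (the file name is a placeholder):

    # Compute contig N50: the length L such that contigs of length >= L
    # together cover at least half of the total assembly.
    def n50(lengths):
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length
        return 0

    def contig_lengths(fasta_path):
        lengths, current = [], 0
        with open(fasta_path) as fh:
            for line in fh:
                if line.startswith(">"):      # new record header
                    if current:
                        lengths.append(current)
                    current = 0
                else:
                    current += len(line.strip())
        if current:
            lengths.append(current)
        return lengths

    lengths = contig_lengths("assembly.contigs.fasta")  # placeholder path
    print(f"contig N50 = {n50(lengths):,} bp")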
Goals: “Sequence life for the future of life” – the primary goal is to sequence, catalog, and characterize the genomes of all eukaryotic species on Earthacademic.oup.comacademic.oup.com. The motivating rationale is that this comprehensive genomic library will facilitate conservation (understanding and preserving biodiversity), evolutionary biology (illuminating the tree of life), agriculture (identifying genes for crop and livestock improvement), medicine (discovering bioactive compounds or disease models), and fundamental research. The project is often likened to a “moonshot” for biologysanger.ac.uksanger.ac.uk. Its first-phase goal (2018–2022) was to produce at least one reference genome for each eukaryotic family – establishing a genomic framework. The second phase (to 2026) aims for one genome from every genus (~150k)sanger.ac.uksanger.ac.uk. The final phase (2027–2035) would sequence the remaining species in each genus. Importantly, EBP emphasizes not just sequencing, but open data and equitable benefit sharing – partnering with local communities and adhering to international treaties on biodiversity datasanger.ac.uksanger.ac.uk. Another goal is to advance sequencing and assembly technology by driving demand for ultra-efficient pipelines. Ultimately, EBP’s vision is to create a “digital Biogenome Archive” of Earth’s biodiversity, which can serve as a foundation for biological research for centuries to comesanger.ac.uksanger.ac.uk.
Impact: Although still in its early phases, EBP and its affiliated projects have already made significant contributions. The Vertebrate Genomes Project (VGP), for example, has produced some of the most accurate vertebrate genomes to date (chromosome-level assemblies of >130 species, including bats, birds, and fishes). These have led to discoveries such as new insight into karyotype evolution (e.g. how bird and reptile chromosomes correspond)sanger.ac.uksanger.ac.uk. The Darwin Tree of Life Project (the UK’s contribution to EBP) is sequencing thousands of UK species, uncovering new species and genetic adaptations. More broadly, having draft genomes for many species is enabling comparative genomics studies that elucidate how certain genes or pathways evolved (e.g. identifying the genetic basis of elephants’ cancer resistance, or of butterflies’ wing patterns). Conservationists are beginning to use genomic data from EBP to identify distinct populations and prioritize protection (the field of conservation genomics is growing rapidly). The project also fosters global collaboration: with 2,200 scientists in 88 countries involvedsanger.ac.uksanger.ac.uk, it has united disparate taxon-specific genome projects under a common vision, promoting technology transfer to less-resourced countries so they can sequence their native species. Additionally, EBP’s push for open-access data (most genome assemblies are released openly) means that rare or long-neglected species are now in public databases for any researcher to analyze. On the innovation side, EBP’s ambitious goals are driving improvements in sequencing technology – for example, the need to sequence 3,000 genomes per month is pushing labs toward greater automation, such as the proposed gBox mobile labssanger.ac.uksanger.ac.uk. In essence, EBP’s anticipated long-term impact is enormous: it will create a foundation for understanding biodiversity, much as the HGP did for human biology, and provide tools to address challenges from climate change impacts on ecosystems to new sources for drug discovery.
Key Publications: A seminal perspective in PNAS (2018) outlined the Earth BioGenome Project roadmap (Lewin et al., 2018)frontiersin.org. In 2022, an update was published in PNAS (Lewin et al., 2022) describing progress and lessons learned. Specific affiliated projects have had high-profile papers: e.g. the Vertebrate Genomes Project’s flagship paper in Nature (2021) presenting 16 new reference vertebrate genomes; the Bird 10,000 Genomes (B10K) project paper in Nature (2020) mapping bird phylogeny from 363 genomes; and the 10KP (10,000 Plants) project paper in Nature Plants (2019). In June 2023, a collection in Nature titled “Sequencing all life” described EBP’s vision and milestones such as the Darwin Tree of Life’s first 500 species. Science (2022) also ran news articles on EBP progress, including acknowledging the delay of the original 2028 targetscience.org. The “Biological moonshot accelerates efforts…” Sanger Institute news item (Sept 2025)sanger.ac.uk is an example of public communications around Phase 2. As more genomes are completed, look for hundreds of species-specific genome papers (often in GigaScience or Scientific Data as genome announcements). But the true landmark will come when Phase 2 or Phase 3 goals are met – likely culminating in major publications synthesizing genomic insights across the tree of life.
Project Website: earthbiogenome.org – this is the central EBP site with overall information, news, and links to affiliated project pagesearthbiogenome.org. Each affiliate (like VGP, Darwin Tree of Life, etc.) has its own site as well. The EBP site provides the project’s goals, progress metrics, and working group info. Data-wise, genomes are being released via the usual public repositories (NCBI/ENA/DDBJ); the EMBL-EBI ENA has an Earth BioGenome umbrella BioProject. The Earth BioGenome Digital Library is envisioned as a future portal integrating all sequences. Many genomes can also be browsed via Ensembl or UCSC Genome Browser hubs. The project also coordinates via conferences and a Slack group for consortium members.
Resource Manifest: The primary outputs are high-quality reference genome assemblies for each species, accompanied by genome annotations (gene models) and basic metadata (taxonomy, specimen location, collectors). EBP maintains a centralized database of the status of each family/genus/species – tracking which are sequenced and which are in progress (this helps avoid duplication and identify gaps). By the end of Phase 1, reference genomes existed for 554 of the ~9,400 eukaryotic families (~6%)sanger.ac.uksanger.ac.uk. These genome assemblies (often chromosome-scale, sometimes haplotype-phased) are submitted to GenBank/ENA and made public under a CC0 waiver. The resource also includes a sample metadata repository to ensure compliance with the Nagoya Protocol (documenting the origin of samples and permits). Additionally, EBP fosters training materials and protocols – for example, best practices for DNA extraction from different organisms and assembly workflows, shared on the site or in publications. Going forward, a huge manifest of 300,000 species samples is being assembled (for Phases 2 and 3)sanger.ac.uksanger.ac.uk – the project will maintain a catalog of these samples (often museum vouchers). After sequencing, the raw reads are stored in ENA/NCBI SRA under umbrella BioProjects for each affiliate. In summary, the “EBP resource” will be an indexed collection of the genomes of all life – the ultimate reference database for comparative genomics and many other fields.
Data Accessibility: Open access. The EBP consortium has committed to rapid and open release of all genome assemblies and raw data, in line with the “Bermuda/HGP principles” for open sciencesanger.ac.uksanger.ac.uk. As each genome is completed and validated, it is released to public repositories (GenBank/ENA). This open policy is crucial given the project’s global nature; however, EBP also respects biodiversity treaties (Nagoya Protocol) – ensuring that source countries are acknowledged and that data sharing does not violate any national laws. Practically, most of these genomes are not human or personally identifiable, so there are no privacy restrictions. For some endangered species, location data might be sensitive (to avoid poaching, etc.), but the genome sequences themselves are freely accessible. The data (particularly for widely studied organisms) can be browsed, downloaded, and used by anyone without restriction. Some affiliated projects have had short pre-publication embargos to allow the producing team to publish first, but EBP central policy encourages immediate release or release within <6 months of assembly. The consortium’s ethos is that genomic data of biodiversity is a global public good – hence the push for openness and even the partnership with UNESCO on open science principlessanger.ac.uksanger.ac.uk. This means researchers worldwide, including those from the species’ home countries, can access and utilize the data to advance science and conservation. In summary, aside from normal data-use etiquette (cite the source, involve original collectors when appropriate), the EBP data is fully open.
Major Sequencing Centers: EBP is a network of many regional projects, each with its own sequencing hubs. Some key players: the Wellcome Sanger Institute (UK) leads the Darwin Tree of Life Project, which is sequencing all ~70,000 UK species – Sanger’s high-throughput pipelines contribute a large volume of assemblies, and it has pioneered methods such as using PacBio HiFi at scalesanger.ac.uksanger.ac.uk. In the USA, the Vertebrate Genomes Project is led by Rockefeller University and UCSC, with sequencing done at the New York Genome Center and BGI – the VGP has delivered hundreds of vertebrate genomes. BGI (China) is a major player – e.g. leading the 1K Plant & Animal genomes initiative, with capacity for massive sequencing (it has announced the sequencing of thousands of plant and insect genomes) – and its centers in Shenzhen and Wuhan are contributing large numbers of fish, bird, and insect genomes. The DOE Joint Genome Institute (USA) focuses on fungi and plants (it has sequenced hundreds of fungal genomes for the 1000 Fungal Genomes project). Cold Spring Harbor Laboratory and partners handle some crop genomes. European facilities beyond Sanger include CNAG in Spain, Max Planck institutes in Germany, and SciLifeLab in Sweden, which contribute to specific clades. In Oceania, the Australian Genome Initiative and New Zealand Genomics Ltd handle, for example, Australasian endemic species. In Africa, the African BioGenome Project (AfricaBP) is gearing up, with SANBI (South Africa) likely to sequence African plant and animal genomes. With 88 countries involved, many national genome centers (Canada’s McGill centre for Arctic species, Brazil’s Butantan Institute for reptiles, etc.) are part of the mosaic. The EBP central leadership (Lewin, Teeling, et al.) coordinates, but much of the work is decentralized. Certain organizations nonetheless stand out: the Sanger Institute for overall capacity and coordination, BGI for sheer throughput, Rockefeller/NYGC for vertebrates, and UC Davis (the lab of Harris Lewin, chair of EBP) for project management. The consortium leverages these major centers for high-profile genomes (e.g. BGI did the giant panda, Sanger the robin) while encouraging technology transfer so that more local centers can join the effort. Overall, EBP’s success hinges on a distributed network of sequencing powerhouses working in concert to map all eukaryotic life.
Years active: 2006–present. (UK Biobank began participant recruitment in 2006; genomic data generation started later – genotyping around 2014, sequencing pilot in 2018, and full-cohort sequencing completed in 2021–2023statnews.combiobank.ctsu.ox.ac.uk.)
Scale: A prospective cohort of 500,000 UK adults, all of whom now have genome-wide genetic data. Specifically, ~488,000 participants have genome-wide SNP genotypes (array data), ~450,000 have whole-exome sequences, and ~500,000 have whole-genome sequences completed or in processukbiobank.ac.uksanger.ac.uk. In July 2022, UK Biobank released 150,000 whole-genome sequences as an initial tranchenature.com; by March 2023, nearly 500k WGS had been generated and made available to researchersscience.orgscience.org. These WGS data (~30× coverage) amount to >15 petabases of sequence. In addition, the 500k participants were genotyped on a ~800K-SNP array (with imputation to ~96 million variants), and the whole-exome sequences (~50× coverage) were released in 2020–21. The resource also includes extensive phenotype data: 10+ years of electronic health records, imaging data, lifestyle questionnaires, and linkage to death and cancer registries. But focusing on genomics: UKB stands as one of the largest uniformly sequenced human cohorts in the worldukbiobank.ac.ukillumina.com. The combination of 500k genomes and rich health data is of unprecedented scale.
Tech: Genotyping: done in 2014–2015 on the UK Biobank Axiom Array (Affymetrix), producing ~820K SNPs per person, followed by imputation (to the Haplotype Reference Consortium panel plus a merged UK10K + 1000 Genomes panel) yielding ~96M variants. Whole-exome sequencing: completed in 2020 on ~470k samples using a bespoke exome capture (IDT xGen) and NovaSeq sequencing at Regeneron and the Broad Institute at ~50× depth, yielding ~15k–20k protein-coding variants per person (with GLnexus joint calling). Whole-genome sequencing: the big push – sequencing of all 500k at ~30× – was performed by a partnership of the Wellcome Sanger Institute and deCODE Genetics using Illumina NovaSeq 6000 instrumentssanger.ac.uk. Sanger sequenced ~243,633 genomes and deCODE ~245k to cover the cohortsanger.ac.uk. They used PCR-free library prep and 150 bp paired-end reads, achieving high uniformity and an average of 98% of the genome covered at ≥15×. Data processing was centralized (using the DRAGEN pipeline for alignment and calling of SNVs and indels, with separate calling for structural variants). The WGS effort ran in two phases: an initial “Vanguard” pilot of 50k samples (2018–2019), then a Main Phase of 450k (2019–2021)statnews.comstatnews.com. Given the data volume, genomic data are stored and made available via the Amazon Web Services cloud. In summary, UK Biobank harnessed both array and NGS technologies at massive scale – requiring meticulous QC (e.g. confirming sample identity across array, exome, and genome data, and handling the many participants who are related to one another). The datasets are also integrated: the exome and genome data are combined into comprehensive variant sets (with WGS contributing non-coding variants too).
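To illustrate the ≥15× coverage criterion mentioned above, here is a minimal sketch using pysam to check one region of a single 30× CRAM; the paths, region, and reference are placeholders, and CRAM decoding requires the same reference FASTA used for alignment:

    # Fraction of a region covered at >= 15x in one CRAM, via pysam.
    import pysam

    cram = pysam.AlignmentFile("sample.cram", "rc",
                               reference_filename="GRCh38.fa")  # placeholders

    covered = total = 0
    # count_coverage returns four per-base depth arrays (A, C, G, T)
    for base_depths in zip(*cram.count_coverage("chr1", 1_000_000, 1_100_000)):
        total += 1
        if sum(base_depths) >= 15:
            covered += 1
    print(f"{100 * covered / total:.1f}% of region covered at >=15x")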
Goals: The overarching goal is to create a large, population-based resource for studying the genetic and environmental determinants of common diseases. The genomics component specifically aims to enable discovery of genetic variants (common and rare) associated with diseases and traits, by providing a well-powered sample size. By sequencing everyone, UKB aims to capture the full spectrum of human genetic variation in the UK (including rare mutations unique to families). With 500k samples, even variants of modest effect size can be detected, and polygenic risk scores can be refined. Another goal is to promote open science: UKB data are shared with approved researchers worldwide, democratizing access to a large genomic/phenotypic dataset. Also, by pairing genomics with deep phenotyping (MRI imaging, lifestyle, laboratory assays), the project facilitates holistic research – e.g. discovering biomarkers or identifying repurposable drugs via Mendelian randomization. Notably, UKB’s genomic initiative (especially WGS) was also driven by proof-of-concept aims: to show that population-scale whole-genome sequencing is feasible and valuable for public health, informing future biobank projects and integration into healthcare. On the technical side, a specific goal of the WGS project was to compare its value against exome + array data – early evidence shows WGS identifies structural and non-coding variants that exomes miss, filling research gaps. In sum, the goal is to leverage a huge genetically characterized cohort to improve the understanding, prediction, and treatment of diseases ranging from cancer and heart disease to dementia and depression.
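Since polygenic risk scores recur throughout this section, a toy example of the underlying arithmetic may help: a participant's score is simply the sum of imputed allele dosages weighted by per-variant effect sizes. All numbers below are made up:

    # Toy polygenic-score arithmetic: scores = dosages @ betas
    import numpy as np

    betas = np.array([0.12, -0.05, 0.30])      # per-variant effect sizes (from a GWAS)
    dosages = np.array([[0.0, 1.0, 2.0],        # participants x variants,
                        [1.0, 1.0, 0.0]])       # imputed dosages in [0, 2]
    scores = dosages @ betas                    # one score per participant
    print(scores)                               # [0.55 0.07]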
Impact: The availability of UK Biobank genetic data has revolutionized human genetics. Hundreds of GWAS have been performed using the UKB array/imputed data, identifying thousands of new loci for various diseases and traits (since 2015, UKB data have featured in >2,000 publications). The size and richness of the data have enabled studies of polygenic risk – for instance, showing that a polygenic score can stratify risk for heart disease nearly as strongly as high cholesterol does. The release of exome sequences led to the discovery of rare coding variants associated with disease (e.g. in genes like PCSK9 for LDL cholesterol and MC4R for obesity)illumina.comillumina.com. The recent WGS data are enabling more – such as uncovering structural variants associated with conditions, more precise fine-mapping of GWAS signals, and characterization of somatic mutations in blood (clonal hematopoiesis). Because of its breadth of data, UKB has also shaped methodology: new statistical genetics methods (ML-based phenotype prediction, genetic correlation estimation, etc.) have been developed and tested on UKB. Moreover, complete genome data on half a million people yield a reference panel far richer than 1000 Genomes, improving imputation and variant frequency catalogs for European (and to some extent South Asian and African) populationsnature.comnature.com. UKB has also been a testbed for cloud-based research – demonstrating that large-scale data can be shared responsibly (the Research Analysis Platform launched in 2021 lets researchers work with UKB data on the DNAnexus cloud). Another major impact is on collaboration between the public and private sectors: the pharma industry contributed to the WES and WGS funding (a £200M consortium including e.g. Regeneron and AbbVie, alongside government funding)illumina.comillumina.com in return for early access, reflecting a new model of public–private partnership in genomics. Finally, the success of UKB’s genomic efforts is influencing national healthcare – e.g. informing the design of Genomics England’s next initiatives and other countries’ biobanks. In summary, UKB’s genetic dataset has accelerated genetic discovery immensely and set a template for future population genomics resources.
Key Publications: Many foundational findings have come out of UKB. A few notable ones: the 2018 Nature paper by Bycroft et al. describing the UKB genotype data resource (and early population structure findings)en.wikipedia.org; the 2020 Nature paper by Van Hout et al. analyzing the first ~50k exomes; and the 2021 Nature paper by Backman et al. on ~450k exomes, linking rare coding variants to human phenotypes. In 2022, the UKB WGS consortium published in Nature the analysis of the first 150K whole genomesnature.com, illustrating the value of non-coding variation (e.g. structural variants associated with lung function). Another high-profile example: in 2019, a Lancet paper by Gill et al. used UKB to show how a polygenic risk score for coronary disease could identify at-risk individuals. UKB data also underpinned a landmark 2019 GWAS of human lifespan and countless GWAS meta-analyses. The UKB research community keeps an updated list of publications on the UKB website. With the full 500K genomes now available, expect a wave of new papers in 2023–2024 using the WGS data (for instance, mapping telomere length variation and detecting mosaic mutations).
Project Website: ukbiobank.ac.uk – the main site for information and data access (it has sections for researchers, participants, and the general public). The Access portal is where researchers apply for data. There’s also a new UK Biobank Research Analysis Platform (UKB-RAP) which is a cloud environment on DNAnexus for approved researchers to directly query the data without local downloadillumina.comillumina.com. Additionally, the Neale Lab UKB browser and PHESANT tool allow browsing association results. The UKB site also provides protocols, data dictionaries, and news (e.g. announcements of data releases).
Resource Manifest: The UK Biobank genomic data resource comprises: Genotype data – raw intensity data and processed SNP genotypes for ~820K markers; Imputed genotype data – dosages for ~96 million variants (using a reference panel that includes UK10K and HRC data); Whole-exome sequence data – per-sample variant call (VCF) files and aggregated allele frequency information, plus functional annotations; Whole-genome sequence data – CRAM files (several hundred million reads each) and joint-called VCFs/BCFs of SNVs/indels, with a separate call set for structural variants. These genomic files are linked via a unique identifier to each participant’s phenotypic records. The resource also includes extensive phenotype datasets: ~2,500 phenotypic fields ranging from basic measurements to disease outcomes (with ~17 million ICD-coded events from EHRs). Researchers typically combine genotypic and phenotypic data in analyses on the platform. UKB provides various derived datasets as well – e.g. principal components for ancestry, relatedness matrices (identifying the pairs of third-degree or closer relatives in the cohort), recommended quality flags (such as samples to exclude for contamination), and curated phenotype definitions for common diseases. It also supplies linkage to the death registry, cancer registry, and primary care data. New data such as whole-body MRI images for 100k participants are also part of the resource (and imaging genetics analyses often use the genomic data too). In essence, the manifest is enormous – but well organized into categories (genetic data, biochemical assays, health outcomes, etc.), all indexed by participant ID. UKB continuously updates the resource (e.g. new hospitalization data or new genetic releases), and users are notified via periodic showcases and emails.
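For a sense of how joint-called variant files like those in this manifest are typically consumed, here is a minimal sketch using the cyvcf2 library to stream one region of an indexed VCF; the file name and region are placeholders:

    # Stream variants from a region of an indexed, joint-called VCF.
    from cyvcf2 import VCF

    vcf = VCF("joint_calls.chr1.vcf.gz")            # placeholder path (needs .tbi/.csi index)
    for variant in vcf("chr1:1000000-1010000"):     # random access via the index
        af = variant.INFO.get("AF")                 # allele frequency, if annotated
        print(variant.CHROM, variant.POS, variant.REF, variant.ALT, af)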
Data Accessibility: Controlled but broad access. UK Biobank operates on a managed-access model: bona fide researchers worldwide can apply to use the data for health-related research in the public interest. Once an application is approved, researchers pay a modest fee and sign a data use agreement, then can download datasets or use the cloud platform. The data itself is de-identified but potentially re-identifiable, hence not public open-access. However, the scale of access is huge – over 30,000 researchers from 100+ countries have accessed UKB data to date. Summary statistics from analyses (e.g. GWAS results) can be made public by researchers with no issues (indeed many GWAS consortia have released summary stats based on UKB). UKB’s protocol prohibits trying to re-identify participants or contacting them. Importantly, in 2022 UKB launched the Research Analysis Platform (a cloud gateway) to ease access to the full 500K WGS data, since downloading such a volume is impracticalillumina.comillumina.com. This platform allows users to work with WGS and other data through Jupyter notebooks, etc., without exposing the raw data externally. In terms of openness: while not open to the general public, the resource is about as openly available to scientists as possible, with a streamlined application process. This model has been praised for balancing privacy with scientific utility. Additionally, UKB has an open return-of-results policy: researchers are expected to return derived data (like GWAS results or new variables) back to UKB for integration, which further enriches the resource for others. Overall, UKB data access is extremely broad (virtually any qualified researcher can get it), making it a cornerstone of open science in human geneticsgenomicsengland.co.uk.
Major Sequencing Centers: The massive whole-genome sequencing effort was led by two institutions: the Wellcome Sanger Institute (UK) and deCODE Genetics (Iceland), as part of a consortium funded by UK government, Wellcome, and industrysanger.ac.uk. Sanger and deCODE essentially split the 500k samples ~50/50 and sequenced them in parallel, sharing protocols and QC to ensure consistencysanger.ac.uk. Sanger brought its experience in population-scale genomics (having done the 1000 Genomes and UK10K projects), and deCODE (a subsidiary of Amgen) brought high-throughput clinical sequencing capacity. The exome sequencing was performed by the Regeneron Genetics Center (USA) in cooperation with the Broad Institute – Regeneron sequenced ~50K exomes initially (for a pilot published 2020), then scaled up to all 450K by 2021 at their facilities in New York, with Broad handling part of the pipeline. Genotyping was done earlier by the Affymetrix (ThermoFisher) Services Laboratory for array processing, with support from the Biobank’s own coordinators at Oxford. Data storage and imputation involved the Wellcome Centre for Human Genetics (Oxford) and University of Michigan (for providing the imputation server and HRC reference). In summary:
Genotyping: Affymetrix UK and analysis by Oxford.
Exome sequencing: Regeneron Genetics Center (with Broad Institute).
Genome sequencing: Wellcome Sanger Institute & deCODE Genetics.
These were the major players. The project was overseen and coordinated by UK Biobank’s own team (based in Stockport and Oxford). Funding partners included UK Research and Innovation (via the Medical Research Council) and the Wellcome Trust, which contributed a large grant, plus a consortium of pharmaceutical companies for the exomes (e.g. AbbVie, AstraZeneca, Merck). The engagement of Sanger and deCODE for WGS was significant: it paired a top academic genome center with a top industry genome center, maximizing both throughput and accuracy. The results have been exemplary – Sanger and deCODE delivered on time and with high quality (the final 500K WGS release was announced in March 2023). Thus, UKB’s genomic powerhouse consisted of world-class centers working in synergy and setting new records for sequencing throughput in human samplessanger.ac.uk.