Two decades after the Human Genome Project produced a draft sequence, an international research team, including University of California San Diego computer scientists, has published the first complete human genome. The work was done by the Telomere-to-Telomere (T2T) consortium, and six papers describing the project will be published April 1 in a special edition of Science.
The team from UC San Diego’s Department of Computer Science and Engineering contributed to two of the papers: “Complete genomic and epigenetic maps of human centromeres” and “The complete sequence of a human genome.” The second pulls together the many strands of research that went into completing the project.
“This is a major milestone,” said Pavel Pevzner, Ronald R. Taylor Distinguished Professor of Computer Science at UC San Diego. “Around 8% of the human genome had gone unsequenced for decades. By filling these gaps, we gain a better understanding of human biology and can now identify formerly hidden genetic anomalies that may lead to disease.”
The T2T researchers found that the 8% gap contains repetitive DNA, numerous genes and almost as much genetic information as an entire chromosome. Most of the newly sequenced DNA was near chromosomal telomeres (the caps at chromosome ends) and centromeres (dense middle sections).
The now-complete genome sequence illuminates more than two million additional variants, providing new information on 622 medically relevant genes. This complete sequence will also boost our understanding of chromosomes, opening new lines of research into how they segregate and divide.
“Generating a truly complete human genome sequence represents an incredible scientific achievement, providing the first comprehensive view of our DNA blueprint,” said Eric Green, M.D., Ph.D., director of the National Human Genome Research Institute, which helped fund the project. “This foundational information will strengthen the many ongoing efforts to understand all the functional nuances of the human genome, which in turn will empower genetic studies of human disease.”
Centromeres are found in the middle of chromosomes, where the chromosome arms meet, and help pull chromosomes apart during cell division. These structures make up around 6.2% of the entire human genome and have been virtually impossible to assemble because they contain so many repeating DNA sequences.
“Centromeres are where cell division is initiated, so it’s important to know what’s going on there,” said Andrey Bzikadze, a computer science Ph.D. student at UC San Diego in Pevzner’s lab and co-author on both papers. “But these repeating sequences are incredibly difficult to assemble because there’s almost no variation. It’s like doing a jigsaw puzzle that only shows blue sky.”
In addition to determining the centromere sequences, the UC San Diego team helped validate the entire genome assembly.
“Our tool was the only one that was able to look into the most complicated regions of the genome, including the centromeres,” said Bzikadze. “We were really dedicated to finding the problems in these assemblies in difficult, complex regions.”
While most genome sequencing is currently done on short-read instruments, which chop DNA into small snippets (around 300 nucleotides) before reassembling it, this technology proved incapable of filling the gaps in the human genome. In the past few years, however, long-read machines (which read more than 10,000 nucleotides at a time) have shown tremendously improved accuracy, making this complete human genome sequence possible.
“Long reads were essential to accomplish this,” said Bzikadze. “Without the long reads, this would not have been possible.”
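The difference between short and long reads in repetitive regions can be illustrated with a small simulation. The sketch below is only a toy model, not the consortium's method: the function names, the 50-base repeat unit, the 40 copies, the 500-base flanks and the read lengths of 30 and 2,100 are all assumptions chosen to keep the example small, and the reads are treated as error-free. It counts how many possible reads of a given length have a sequence that appears at more than one position, which means they cannot be placed unambiguously.

```python
import random


def make_toy_genome(unit_len=50, copies=40, flank_len=500, seed=0):
    """Build a toy sequence: random flank + tandem repeat array + random flank.

    All sizes are illustrative assumptions, far smaller than real centromeres.
    """
    rng = random.Random(seed)

    def random_seq(n):
        return "".join(rng.choice("ACGT") for _ in range(n))

    unit = random_seq(unit_len)
    return random_seq(flank_len) + unit * copies + random_seq(flank_len)


def ambiguous_fraction(genome, read_len):
    """Fraction of error-free reads of length read_len whose sequence
    occurs at more than one position in the genome."""
    n_reads = len(genome) - read_len + 1
    counts = {}
    for start in range(n_reads):
        read = genome[start:start + read_len]
        counts[read] = counts.get(read, 0) + 1
    # Every read that shares its sequence with another read is ambiguous.
    ambiguous = sum(c for c in counts.values() if c > 1)
    return ambiguous / n_reads


if __name__ == "__main__":
    genome = make_toy_genome()
    for read_len in (30, 2100):
        print(read_len, round(ambiguous_fraction(genome, read_len), 3))
```

In this toy setup, every 30-base read that falls entirely inside the repeat array is identical to at least one other read, so most short reads are ambiguous. A 2,100-base read is longer than the whole 2,000-base array, so it always includes some flanking sequence, which should make it unique. Real centromeric repeats are far longer and not perfectly identical, so the sketch only conveys the intuition behind the "blue sky" jigsaw comparison.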