When the human genetic code was first unveiled two decades ago, the Government-backed international Human Genome Project and its rival, privately funded Celera Genomics, joined with President Clinton and Prime Minister Blair to announce ‘the book of life’ at the White House.
Though significant – the news made headlines around the world – those first draft human genomes released in June 2000 were incomplete.
They consisted of only one set of chromosomes (our cells contain 23 pairs), that is, around three billion ‘letters’ of DNA code. They were also composites of several people’s DNA.
One British Nobel prize-winner at the heart of the public effort remarked, ‘We were just a bunch of phoneys.’
The first complete genetic makeup of one person emerged when Craig Venter – who led the Celera effort – unveiled all six billion letters of his own genome on September 4, 2007.
The genetic code of James Watson, who championed the public genome effort and co-discovered the structure of DNA in 1953, followed soon after. An iconic model of the ‘double helix’ can be seen in the Science Museum.
But we still did not have the full picture of one person’s entire genetic code.
Those first historic draft genomes missed as much as 15 per cent of the human genetic code (closer to 8 per cent in subsequent human DNA sequences) because of the way that older generation sequencing – DNA reading – technologies worked.
The sequencing machines of the day – which can be seen in our collection – were unable to read each of the 46 packages – chromosomes – of DNA within a human cell in one go.
Instead, researchers broke up the chromosomes and used the sequencers to read millions of small fragments from many copies of the human genome in parallel, each fragment consisting of up to 300 ‘letters’ of code, and then used computers to look for overlaps to work out the original genetic sequence.
Because the machines could not read every DNA molecule from one end to the other, this approach could not deal with repetitive stretches of DNA code – even with a supercomputer it is not possible to reassemble such stretches with confidence.
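The overlap idea described above can be sketched in a few lines of Python. This is a toy illustration only, not the software used by either genome project: the sequences are made up, the reads are 8 letters long rather than 300, and the function names `overlap_lengths` and `greedy_assemble` are invented for this example.

```python
def overlap_lengths(a, b, min_len=3):
    """Lengths k >= min_len where the last k letters of a match the first k of b."""
    longest = min(len(a), len(b)) - 1
    return [k for k in range(min_len, longest + 1) if a[-k:] == b[:k]]


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two pieces that share the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = max(overlap_lengths(a, b, min_len), default=0)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more, so leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up 16-letter sequence with no repeats, cut into overlapping 8-letter reads.
genome = "ATGCGTACCTGAAGTC"
reads = [genome[i:i + 8] for i in range(0, 9, 4)]
print(greedy_assemble(reads) == [genome])

# A read from inside a made-up repeat lines up with itself in more than one way.
repeat_read = "CACACACA"
print(overlap_lengths(repeat_read, repeat_read))
```

With the non-repetitive sequence, each read has only one sensible neighbour, so the pieces snap back together. The repetitive read, by contrast, matches itself at two different offsets (overlaps of 4 and 6 letters), so the computer cannot tell how many copies of the repeat the original sequence held.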
These repetitive, enigmatic, hard-to-sequence regions lie in the centromeres, which are the pinched parts of each chromosome that play a key role in cell division, and lurk within the short arms of chromosomes where the centromere is off-centre.
They can consist of regions of 200 ‘letters’ of genetic code, repeated again and again over stretches of millions of letters.
This genetic terra incognita can be explored today because it is now possible to read DNA stretches that are considerably longer – 10,000 to 100,000 letters – by using ‘third generation’ sequencing technologies developed by Pacific Biosciences (PacBio) in America and Oxford Nanopore Technologies (Nanopore) in the UK.
These larger DNA pieces are much easier to reassemble with a computer – even if full of repeats – because they are more likely to contain sequences that overlap.
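A small Python sketch, under loudly stated assumptions, shows why read length matters for repeats. The sequences, the repeat unit and the read lengths (6 and 20 letters, standing in for hundreds versus tens of thousands) are all invented for illustration, as are the names `read_set` and `with_repeat`.

```python
def read_set(sequence, read_len):
    """Every distinct read of read_len letters, taken at every position."""
    return {sequence[i:i + read_len] for i in range(len(sequence) - read_len + 1)}


def with_repeat(copies, unit="CA", left="ATGGT", right="TTGAC"):
    """A made-up sequence: unique flank, a tandem repeat, unique flank."""
    return left + unit * copies + right


few = with_repeat(6)   # 6 copies of the repeat unit
many = with_repeat(9)  # 9 copies of the same unit

SHORT_READ = 6   # shorter than the repeat region
LONG_READ = 20   # long enough to cover the shorter repeat plus both flanks

# Short reads: the two sequences yield exactly the same set of fragments.
print(read_set(few, SHORT_READ) == read_set(many, SHORT_READ))

# Long reads: the fragments differ, so the copy number can be told apart.
print(read_set(few, LONG_READ) == read_set(many, LONG_READ))
```

The first comparison prints True and the second prints False. Short reads taken from inside the repeat look identical however many copies there are, whereas a read long enough to reach unique sequence on either side pins down how long the repeat really is.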
A few days ago, a consortium of around 30 institutions published a preprint (so it has not yet been peer reviewed) that describes how long-read technologies have been used to complete a human genome, releasing over 200 million letters of our genetic code that have never been seen before.
The team is the Telomere-to-Telomere (T2T) Consortium. Telomeres are the caps at the ends of chromosomes – think of them like the plastic coating (aglet) on the ends of shoelaces – so telomere to telomere means end to end.
The team sequenced DNA from an unusual kind of cell, taken from a hydatidiform mole, a growth in a woman’s uterus caused when a sperm fertilized an egg that lacked its own DNA. These cells contain two identical copies of the father’s chromosomes, rather than the usual mix of chromosomes from both parents, making it much easier to reassemble the pieces.
Karen Miga at the University of California, Santa Cruz, is part of the consortium and has already successfully used this method to tackle one chromosome, the X chromosome.
‘The genome has this really interesting landscape of what we call tandem repeats. They are almost like their own little kingdom which we have not been able to study until now.’
The end-to-end sequence has already revealed 115 new genes that code for proteins, the building blocks of cells, which brings the total to around 20,000. In addition, it probably contains many other ‘genic’ regions that play a role in gene regulation and other functions.
Miga is also part of another group, the Human Pangenome Reference Consortium, which aims over the next three years to sequence more than 300 genomes from people from around the world to see how these repeating regions change from person to person, revealing whether any of the newly-sequenced regions are associated with human diseases.
However, it is a long way from the human genetic code – genotype – to phenotype, which is how we look, function and behave.
Since the White House announcement, she said, ‘I have been blown away by the recent genetic discoveries and progress with precision medicine, but it takes a lot of time – years and years – for these exciting results to formally reach the public.’
However, in some patients, a survey of their genetic sequence has not revealed anything wrong at all and the problem ‘could have been hiding in these regions,’ says Miga. With the complete human genome, ‘a lot of rare diseases and new discoveries will start to emerge.’
The potential of long-read sequencing has been underlined by a team under Kari Stefansson of deCODE genetics in Iceland, which has published a peer-reviewed study in Nature Genetics of 3,622 Icelanders sequenced with long-read Nanopore technology, revealing differences in human genetic codes that did not show up in traditional ‘short-read’ sequencing.
These variants account for as much as 40 per cent of the variation in the human genome, so they will provide a boost for efforts to understand disease.
In a conversation, Bjarni V. Halldorsson, Stefansson’s head of sequence analysis, explained that there are parts of the genome where ‘traditional short read sequencing does not reach so well, places that are deleted between individuals and places that are repetitive. These regions we can now access.’
These so-called ‘structural variants’ also have a much larger chance of affecting the phenotype, that is, the appearance, biology, health and behaviour of a person, he said. ‘They play a truly causative role in disease.’
There is also the possibility now of spotting single letter ‘spelling mistakes’ in highly repetitive stretches of DNA in the human genetic code, along with changes to DNA that affect whether a gene is used or not.
They found a few thousand structural variants that were linked with human traits, including a deletion in a gene called PCSK9, which lowered LDL cholesterol levels, and a repetitive motif in a gene called ACAN, which correlated with the height of a person: the more repeats, the taller the individual.
However, some parts are so repetitive that they are beyond even this technology. Halldorsson added: ‘We will never know the full human variation until we study even larger sets (of DNA)’.