Genetic Sequencing: The possibility of Gattaca
“For what it’s worth, I’m here to tell you that it IS possible.” – Vincent in Gattaca
Gattaca certainly had a vision of things to come. The film also came at a prime time, during the “great high-tech boom of the late 1990’s”1. Ethan Hawke as an ‘In-Valid’ hit the big screen in 1997, approximately 45 years after the discovery of the structure of DNA by Watson and Crick, and 7 years after the formal inception of the Human Genome Project (HGP), an undertaking by the DOE (U.S. Department of Energy) and the NIH (National Institutes of Health) to “obtain the 3-billion-base-pair map of the human genome”2.
The full objectives of the project included mapping and sequencing the human genome and the genomes of model organisms, data collection and distribution, ethical, legal, and social considerations, research training, and technology development and transfer2. One question we might ask: why did it take nearly 40 years after Watson and Crick’s breakthrough before sequencing the human genome became a realistic goal, and before the public (including Hollywood) started expressing concerns over what would come of being able to decode a person’s most personal and unique identity? It helps to know the two major advancements that made such a project possible: (1) developments in the field of molecular genetics, and (2) the development of information theory and computer algorithms capable of making sense of a vast array of sequenced DNA fragments. (The genome is sequenced in parts that are then reassembled; sequencing an entire, intact chromosome from start to end is problematic and produces far too many errors.) According to The Human Genome Project: A Player’s Perspective1: “The confluence of genetics and computer science must rank as one of the great coincidences in the history of science and technology…. Humans discovered that biological information is digital – a mechanism of information storage and processing that evolved within cells over billions of years – and, quite independently, invented new technological means of storing, processing, and transmitting information based on digital codes.”
How the Genome Was Sequenced Animation
Which brings up a side note concerning the rather primitive-looking computers used by the employees at Gattaca. In an era when the human genome is engineered to the extent depicted in Gattaca, the film might have been better served by a set incorporating more advanced computers and electronic displays, perhaps more like those in the movie Avatar. However, such technologies might have given the film a more unrealistic air for viewers in the late 1990s, when genetic discrimination, not robots-come-to-life (despite the Terminator movies), was at the forefront of the science-savvy public’s concerns.
If the advent and goals of the Human Genome Project are not enough in line with a Gattaca-esque future, the early 1990s also saw a rise in concerns over genetic discrimination, defined as “discrimination against otherwise-healthy individuals on the basis of a genotypic variation.”3 These fears led to provisions in the federal Health Insurance Portability and Accountability Act of 1996, which offered “important new protection for people who want to undergo genetic testing but fear discrimination by health insurers if their test results indicate an increased risk for developing a serious disease.”3 The Act prevented health insurance plans from applying the preexisting-condition rule to genetic information “unless a person had been diagnosed with the illness predicted by genetic testing”3. Fears such as these underlie the dark future of the ‘In-Valids’ as portrayed in Gattaca, where so-called Genoism, although illegal, is widespread and drives discrimination based on genetic qualifications in health insurance coverage, job placement, social status, and almost every other aspect of an individual’s life.
So far, that makes one point for Gattaca (initiatives to sequence the human genome) and one point against the not-so-distant future portrayed on screen (preemptive legislation against genetic discrimination).
While the government sector was pursuing the human genome in the form of the Human Genome Project, a race in the private sector was also afoot. J. Craig Venter and supporters started their own initiative in the form of the company Celera Genomics (founded 1998)4 to “sequence a large portion of the human genome in 3 years for $300 million”, i.e. in a much shorter time frame and at a tenth of the cost of the HGP1. The company would use its own sequencing technologies as well as data and resources made publicly available by the HGP. Venter’s initiative employed the technique of whole-genome shotgun sequencing (a technique in fact pioneered by Celera). Shotgun sequencing was originally criticized on the grounds that decoding overlapping ‘random’ fragments, although it had worked for simpler organisms like bacteria, would be inadequate for piecing together a genome as complex as that of a human. However, this “shotgun” approach “has subsequently become a standard method for sequencing complex organisms that is now broadly accepted and routinely used by many of the same scientists who originally scorned the approach.”4
Shotgun sequencing made the job of decoding the human genome much faster than previous methods (such as the “clone-by-clone” method employed by the public HGP1), which required mapping the genome prior to sequencing, i.e. determining where each larger DNA fragment originated in a chromosome before analyzing its base (nucleotide) sequence. Shotgun sequencing instead cuts several copies of a single genome into many smaller, overlapping DNA fragments, decodes them, and uses computer algorithms to reassemble them, circumventing the preliminary physical mapping of chromosomal fragments (which is time-consuming, yet less expensive than complete sequencing1). On the other hand, proponents of traditional clone-by-clone sequencing have claimed that, although slower, it is more accurate than whole-genome shotgun sequencing, which carries a risk of misplacing reads of short fragments1 and requires advanced computing technologies to reassemble the fragments5. A primer on both methods of sequencing can be found here.
Human Genome Project – Approaches to Sequencing
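To make the reassembly idea concrete, here is a toy sketch in Python of how overlapping reads can be stitched back together. It is not the assembler used by Celera or the HGP; it is a simple greedy overlap merge, and the function names, the `min_len` threshold, and the example sequence are all made up for illustration. Real assemblers also have to cope with sequencing errors, both DNA strands, and repeated regions, which is exactly where the risk of misplacing short reads comes from.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the pair of reads with the largest overlap until none remain.

    Returns the resulting contigs (one contig if everything joined up).
    """
    # Drop exact duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the rest as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Hypothetical 16-base "genome" cut into three overlapping reads, in scrambled order.
    reads = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]
    print(greedy_assemble(reads))
```

With a repeat-free toy sequence like this one the reads snap back into a single contig. Feed the same function a sequence with long repeats and the greedy choice can join the wrong pair, which is the accuracy concern raised by the clone-by-clone camp.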
“And the winner is…”
In February of 2001, literally days apart, publications appeared in the prestigious journals Science (Venter’s group) and Nature (HGP) announcing initial drafts of the human genome. The cover of the Science issue, in referring to the feat: “A scientific milestone of enormous proportions, the sequencing of the human genome will impact all of us in diverse ways - from our views of ourselves as human beings to new paradigms in medicine.”6 The ten years since the initial sequencing have been a whirlwind of discovery and of newly surfacing challenges. Although a monumental accomplishment8, the initial draft as published in 2001 covered only ~90% of the euchromatic genome (loosely packed regions of DNA under active transcription), and contained many gaps (around 250,000) and base sequence errors7. Not until 2004 was a draft human genome sequence published that covered 99.7% of the euchromatic genome (at greater than 99.999% accuracy). This more complete draft was produced via the clone-by-clone approach by the HGP.
What’s Next? A real-life GATTACA?
A recent issue of Nature, The Future is Bright, focused on the last ten years of human genomics. In predicting science innovations to come in 2011, an article in the issue cites the drop in the cost of human genome sequencing to $1000 a pop, as well as breakthroughs in understanding mechanistic links between DNA regions on the genome and various medical conditions, as revealed by ongoing genome-wide association studies (GWAS)7.
Sequencing machines have also come a long way since the original days of the Human Genome Project. Older machines used electrophoretic separation of DNA fragments, with detection based on fluorescent marker nucleotide terminators or ‘caps’ on the ends of the fragments. In this fashion, they could produce around 115,000 base-pairs every 24 hours (~102 kbp per instrument run). Current sequencing machines employ ‘massively parallel’ technologies (many strands of DNA sequenced at once) and can spit out 5 million bases (i.e. a whole genome) per week (~1014 kbp per instrument run)8. Massively parallel sequencers “work in a nucleotide-by-nucleotide fashion, rather than by discrete separation and detection of already produced Sanger sequencing products on a capillary (electrophoresis) instrument”8. This means much faster sequencing, potentially days for an entire genome, but it also demands greater information technology resources to quickly gather and store the vast amounts of data required to assemble a whole genome sequence.
Which brings up an interesting point… in Gattaca, a genome sequence report is portrayed as taking up a length of paper about the height of a human being. However, a true full sequence, printed out on standard printer paper in a font similar to that in this article, would consume a stack of paper about the length of a football field (or 3x the height of a 10-meter diving platform). If the film’s intention was that Vincent’s entire genome be displayed in that report, they needed a LOT more paper.
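For readers who like to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. The genome size (3 billion bases) and the older machines’ output (115,000 bases per day) come from the text above; the characters per page and the sheet thickness are my own assumptions for single-sided printing, so treat the results as order-of-magnitude estimates only.

```python
GENOME_BASES = 3_000_000_000        # ~3 billion base pairs (from the article)
CAPILLARY_BASES_PER_DAY = 115_000   # older electrophoresis machines (from the article)

# Assumptions, not from the article:
CHARS_PER_PAGE = 3_000              # one letter per base, single-sided page
SHEET_THICKNESS_MM = 0.1            # typical office paper


def pages_needed(bases: int, chars_per_page: int) -> int:
    """Number of pages needed to print one letter per base (rounded up)."""
    return -(-bases // chars_per_page)


def stack_height_m(pages: int, sheet_thickness_mm: float) -> float:
    """Height in meters of a stack of single-sided pages."""
    return pages * sheet_thickness_mm / 1000


def days_to_sequence(bases: int, bases_per_day: int) -> float:
    """Days for one instrument to read every base once at a fixed daily rate."""
    return bases / bases_per_day


if __name__ == "__main__":
    pages = pages_needed(GENOME_BASES, CHARS_PER_PAGE)
    print(f"pages: {pages:,}")
    print(f"stack height: {stack_height_m(pages, SHEET_THICKNESS_MM):.0f} m")
    days = days_to_sequence(GENOME_BASES, CAPILLARY_BASES_PER_DAY)
    print(f"one older machine, one pass: {days:,.0f} days (~{days / 365:.0f} years)")
```

Under these assumptions the printout runs to about a million pages and a stack roughly 100 meters tall, in the same ballpark as a football field, and a single older machine would need on the order of 70 years for just one pass over the genome. Change the page density or paper thickness and the numbers shift, but not by enough to rescue the film’s human-height printout.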
“Of course, a car with only an engine is unworkable; as such, DNA sequencing technology provides an integral part of a larger system, one with multiple components… We need the raw materials, such as fuel (DNA), sparks to ignite the fuel (reagents), mechanical parts to translate fuel and ignition into motion (robotics), and direction (bioinformatics), all working in a carefully engineered balance, and a driver (genome center) to steer the automobile quickly and efficiently to the desired destination (biological understanding).”8
Here are just a few of the discoveries and studies made possible by our knowledge of the human genome7:
(1) maps of evolutionary conservation and selective sweeps during human history
(2) gene transcription
(3) chromatin structure
(4) methylation (epigenetic) patterns
(5) genetic variation
(6) recombinational distance and linkage disequilibrium
(7) association to inherited diseases
(9) genetic alterations in cancer (we can compare an individual’s normal genome to that sequenced from tumor tissue)
(10) Synthetic biology: “Only when we can write regulatory elements de novo will we truly understand how they work.”
Direct-to-consumer genome testing, in the form of ‘spit-kits’ marketed by companies such as 23andMe and Navigenics, provides reports (for a ‘small fee’) that compare your genome and selected genetic traits to an online ‘standard’ genome database. This kind of consumer testing has raised concerns over the accuracy of the analyses, and over whether they are scientifically ‘robust’ enough to allow useful health interventions. One concern is the diagnosis or risk prediction of disease based on standard genome databases. These databases are hardly complete, and efforts are currently ongoing to “improve the overall completeness and correctness of the human reference genome.”8 Unaccounted-for genetic diversity across populations may skew predictions of human disease, which is why projects such as the 1000 Genomes Project (http://www.1000genomes.org/about) are so important. Dr. Mardis also raises the issue of accurate interpretation of direct-to-consumer testing: “The results will require interpretation by a physician, which raises a separate but equally important issue: the significant need to develop and implement training programs in genomics for medical professionals.” This raises yet another issue: what does the non-scientist consumer do with the knowledge of their genome sequence and reported genetic traits? Without a team of experts, which according to Mardis is likely to include biologists, geneticists, pathologists, physicians, research nurses, genetic counselors, and even IT and system support specialists, the average consumer may be left quite helpless, even overwhelmed, when facing a vastly complex report of their genome sequence and various genetic predispositions to disease. The goal of genome reporting is to provide patients with personalized disease risk prediction, and to study the genetic origins of the diseases found.
“Partial prediction will be feasible… (but) fundamental limits arise (in such predictions) due to complex architecture of common traits… and many non-genetic factors.”
According to “A Player’s Perspective”, “The Human Genome Project will emerge as a natural step in the scientific quest to understand one of nature’s deepest mysteries: How can a fertilized egg cell, an object too small to see with the unaided eye, contain all the information required to guide the development of a unique human being?”
I asked Elaine R. Mardis, Co-Director of the Genome Institute at Washington University in St. Louis, a few questions about the science behind Gattaca and the newest technologies that help us to explore the human genome:
Gattaca portrays genome sequencing from raw blood and tissue samples with readout/identification of an individual within seconds. What sequencing technologies today do you believe are closest to this type of capability?
We haven’t really anything at present that is anywhere near that quick at returning whole genome sequencing data. As you might imagine, it’s more than just producing the sequence, it’s also about aligning sequences to the human reference genome for “interpretation”, which means identifying mutations, etc.
Do you think we could ever really get there?
Sure we can.
What are some of the main challenges in getting there?
The main challenges are as follows (not exhaustive):
1. Getting the DNA out of the nucleus of the cells to be studied. This is especially true in a raw blood sample as tested from individuals in Gattaca using the finger-prick test.
2. Sequencing the DNA directly without an initial step to amplify or label it (since human chromosomes are often 100s of megabases in length, the technology would need to be able to handle such lengths or a fragmentation step would be required).
3. Rapidly generating 3 billion bases of the genome data in a short timeframe and…
4. Just as rapidly analyzing the resulting sequence, including a comparison to the reference and its annotation OR assembling the data into distinct chromosomes for subsequent comparison to the reference.
The film portrays determination of risk factors based on genetic factors within seconds of birth, using the heel-prick test on the newborn, making predictions possible such as “Risk of heart disease… 99%.” What do you think are the main challenges for making such predictions, based upon genetic factors, a reality?
This is, of course, already a reality. Babies are tested every day from heel stick blood for diseases such as phenylketonuria. That said, there are many diseases for which we still haven’t sorted out the key genetic contributors, those will largely be sorted out in the next 8-10 years at the current rate of discovery in sequencing cases and controls for each major disease.
Finally, we hear much in today’s society about the fears of personal genome sequencing based on privacy concerns. What are your thoughts on the ethical issues and privacy concerns in an age of personal genomics?
I have a lot of thoughts, but basically I think that firstly there is good legislation already in place (GINA) to protect people from discrimination based on genetic susceptibility. Second, I think people often don’t appreciate the fact that, having your genome sequenced and held anonymously in a database still requires a second testing with your genetic identity (must be known) to make the match. The likelihood of that is pretty remote. So, in general we are very careful with people’s data by depositing only in restricted access databases, and allowing the government to vet those other scientists interested in downloading the information. The absolute reality is that most patients readily volunteer to participate in genomic studies because they understand their genome will contribute to disease treatments and cures. It’s altruism that saves the day, and fear of privacy breach is minimal in our experience.
1. Olson, The Human Genome Project: A Player’s Perspective, J. Mol. Biol. (2002) 319: 931-942
2. Human Genome News, May 1990; 2(1)
3. Human Genome News, January-June 1997; 8:(3-4)
4. Celera Genomics Website
5. Human Genome Project Approaches
6. Cover, Science (16 February 2001); 291(5507)
7. Lander, Initial impact of the sequencing of the human genome, Nature (Feb 2011) 470(7333): 187-197, PMID: 21307931
8. Mardis, A decade’s perspective on DNA sequencing technology, Nature (Feb 2011) 470