DNA and the Code - The Biological Self

The Question That Reorganised Biology

For most of the history of biology, the deepest question was not how living things worked. It was how living things stayed the same. Every oak tree produces acorns that become oak trees. Every cat produces kittens that become cats. Offspring resemble parents across generations with a fidelity that no purely chemical process seemed able to explain. Something was being transmitted. Something was being copied. But what?

The answer arrived with a violence that reorganised the entire science. On 25 April 1953, Francis Crick and James Watson published a 900-word paper in Nature proposing a structure for deoxyribonucleic acid. The paper contained a sentence that is one of the most understated claims in the history of science: that the specific pairing they had postulated immediately suggested a possible copying mechanism for the genetic material. The possible copying mechanism they were referring to was, in fact, the way every living thing on Earth copies the information in its genome, and has done so since the beginning of life.

The structure they proposed, the double helix, was not arrived at in isolation. Rosalind Franklin at King's College London had produced X-ray diffraction images of DNA of extraordinary quality, most significantly an image known as Photo 51, taken in May 1952, that showed with unusual clarity the helical structure of DNA and provided critical parameters for building any accurate model. Watson saw this image, shown to him without Franklin's knowledge by her colleague Maurice Wilkins, and later acknowledged that seeing it was decisive. Franklin died in 1958, four years before Watson, Crick, and Wilkins received the Nobel Prize in Physiology or Medicine in 1962. The Nobel Prize is not awarded posthumously.

The double helix is not merely a beautiful structure. It is a solution to what turns out to be the central problem of biological inheritance: how to store information reliably, copy it accurately, and transmit it between generations. DNA solved this problem with an elegance that no human engineer has yet improved upon.

The Structure of the Molecule

Deoxyribonucleic acid is a polymer: a long chain molecule built from repeating units. The repeating units are called nucleotides, and each nucleotide consists of three components: a sugar molecule called deoxyribose, a phosphate group, and one of four nitrogen-containing bases. The four bases are adenine (A), thymine (T), guanine (G), and cytosine (C). The sugar and phosphate components are identical in every nucleotide. The base is the variable part, the letter of the biological alphabet, and it is the sequence of bases along the DNA chain that carries the information.

The double helix consists of two such chains, wound around each other in a right-handed helix. The two chains run antiparallel: one runs in the direction designated 5' to 3' and the other runs 3' to 5', the designations referring to the carbon atoms of the deoxyribose sugar at each end of the chain. The two sugar-phosphate backbones face outward, toward the watery environment of the cell. The bases face inward, stacked on top of each other and paired across the centre of the helix.

The pairing is not random. Chargaff's rules, established empirically by Erwin Chargaff in 1950, held that in any DNA sample, the amount of adenine always equals the amount of thymine, and the amount of guanine always equals the amount of cytosine. The structural explanation became clear with the double helix: A always pairs with T, connected by two hydrogen bonds, and G always pairs with C, connected by three hydrogen bonds. This complementary base pairing is the structural foundation of every biological function DNA performs.
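
Complementary pairing and Chargaff's rules can be made concrete with a few lines of code. The following is a minimal Python sketch, not anything from the original research: the function names and the example sequence are hypothetical, chosen only for illustration. It builds the partner strand of a short sequence and confirms that, across both strands of the duplex, A equals T and G equals C.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own 5' to 3' direction."""
    return "".join(PAIR[base] for base in reversed(strand))

def duplex_base_counts(strand: str) -> dict:
    """Count each base across both strands of the double helix."""
    both = strand + reverse_complement(strand)
    return {base: both.count(base) for base in "ATGC"}

top = "ATGCGGATTA"  # arbitrary illustrative sequence, not a real gene
bottom = reverse_complement(top)
counts = duplex_base_counts(top)

print(bottom)   # TAATCCGCAT
print(counts)   # A and T counts match; G and C counts match
```

Because the strands are antiparallel, the partner is read in reverse as well as complemented, which is why the function reverses the input before pairing.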

The helix makes one complete turn every 10.5 base pairs, with a rise of 3.4 angstroms per base pair. The diameter of the helix is approximately 2 nanometres. The helix has a major groove and a minor groove, produced by the geometry of the base pairs relative to the backbone. These grooves are not mere structural features. They are the surfaces along which the proteins that read and regulate DNA make their contacts with the molecule. The shape of the double helix is not just a way of packaging the bases. It is a functional surface for molecular interaction.
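
The geometric figures above imply some striking totals. As a rough Python sketch, using only the rise and turn values quoted here and the article's figure of 3 billion base pairs for the haploid human genome (the variable names are illustrative), the pitch of the helix and the stretched-out length of one genome copy follow directly:

```python
RISE_NM_PER_BP = 0.34               # 3.4 angstroms per base pair
BP_PER_TURN = 10.5                  # base pairs per helical turn
HAPLOID_GENOME_BP = 3_000_000_000   # article's figure for the human genome

# Length of one full turn of the helix along its axis.
pitch_nm = RISE_NM_PER_BP * BP_PER_TURN

# End-to-end length if one haploid genome were laid out as a single helix.
genome_length_m = HAPLOID_GENOME_BP * RISE_NM_PER_BP * 1e-9

# Number of helical turns in that length.
turns = HAPLOID_GENOME_BP / BP_PER_TURN

print(f"pitch: {pitch_nm:.2f} nm per turn")
print(f"haploid genome length: {genome_length_m:.2f} m")
print(f"helical turns: {turns:.3g}")
```

On these numbers a single copy of the genome is about a metre of DNA, only 2 nanometres wide.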

The chemical stability of the double helix derives from two sources: the hydrogen bonds between complementary base pairs, which hold the two strands together, and the base stacking interactions between adjacent bases on the same strand, which are hydrophobic in nature and contribute substantially to the stability of the helix in aqueous solution. Neither source of stability is particularly strong individually. Together they produce a molecule that is stable enough to preserve information over the lifetime of a cell, flexible enough to be opened and read by molecular machines, and recoverable enough to be repaired when damaged.

The double helix: two antiparallel sugar-phosphate backbones wound around each other, with complementary base pairs stacked in the interior. The major groove (wider) and minor groove (narrower) are the surfaces along which regulatory proteins read and interpret the sequence. The diameter is approximately 2 nanometres. Every chromosome in every eukaryotic cell is composed of this structure, extended in human chromosomes to tens or hundreds of millions of base pairs in length.

The Language of the Genome

Information in DNA is stored as a linear sequence of the four bases: A, T, G, and C. The sequence is read in triplets called codons, each codon corresponding to a specific amino acid or to a stop signal that terminates protein synthesis. There are 64 possible codons and 20 amino acids, so the genetic code is degenerate: most amino acids are encoded by more than one codon. This redundancy is not a flaw. It provides a buffer against certain types of mutation, because many single-base changes, particularly at the third position of a codon, leave the encoded amino acid unchanged.
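
The counting behind the degeneracy of the code is easy to check. The sketch below, in Python, builds the standard genetic code from a compact 64-letter string (a common shorthand in which codons are enumerated in T, C, A, G order and `*` marks a stop) and tallies how many codons map to each amino acid. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' is stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# How many codons encode each amino acid (one-letter symbols) or stop.
codons_per_residue = Counter(CODON_TABLE.values())
amino_acids = [aa for aa in codons_per_residue if aa != "*"]

print(len(CODON_TABLE))              # 64 codons in total
print(len(amino_acids))              # 20 amino acids
print(codons_per_residue["*"])       # 3 stop codons
print(codons_per_residue["L"])       # leucine: 6 codons
print(codons_per_residue["W"])       # tryptophan: 1 codon
```

Only methionine and tryptophan have a single codon; every other amino acid has between two and six.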

The path from DNA sequence to protein is indirect. DNA in the nucleus is first transcribed into a related molecule called messenger RNA (mRNA), which carries the sequence information out of the nucleus to the ribosomes in the cytoplasm. At the ribosome, the mRNA sequence is translated into a chain of amino acids, one amino acid added per codon, until a stop codon terminates the chain. The amino acid chain then folds into a specific three-dimensional structure determined by its sequence, and that structure is the protein. This sequence of events, DNA to RNA to protein, is what Francis Crick called the central dogma of molecular biology in 1958. It describes the direction of information flow in biological systems.
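
The two steps of the central dogma can be sketched as two small functions. This is a simplified Python illustration under stated assumptions: it treats the input as the coding strand, ignores splicing and untranslated regions, starts at the first AUG, and uses the standard code. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' is stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def transcribe(coding_strand: str) -> str:
    """mRNA carries the coding strand's sequence with U in place of T."""
    return coding_strand.replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

gene = "ATGGCCTTTGGATAA"   # arbitrary illustrative sequence
mrna = transcribe(gene)
print(mrna)                # AUGGCCUUUGGAUAA
print(translate(mrna))     # MAFG
```

The protein is written in one-letter amino acid symbols: methionine, alanine, phenylalanine, glycine, then the stop codon ends the chain.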

The genetic code was cracked between 1961 and 1966 by a series of biochemical experiments, most significantly by Marshall Nirenberg and Heinrich Matthaei, who in 1961 used synthetic mRNA of known sequence to identify which codons encoded which amino acids. Har Gobind Khorana developed methods for synthesising nucleotides in defined sequences that enabled the decipherment of the remaining ambiguous codons. Nirenberg and Khorana shared the Nobel Prize in Physiology or Medicine in 1968. The genetic code is essentially universal across all life on Earth, the same triplet assignments for the same amino acids appearing in bacteria, plants, animals, and fungi, with only minor variations in a small number of organisms. This universality is one of the most powerful pieces of evidence for the common ancestry of all living things.

Not all of the DNA sequence encodes proteins. In the human genome, only approximately 1.5 percent of the total sequence codes for protein. The rest, the remaining 98.5 percent of the 3 billion base pairs in the haploid human genome, was for decades described as junk DNA, a term reflecting the assumption that non-coding sequence had no function. This assumption has been progressively dismantled. A large fraction of the non-coding genome encodes regulatory sequences that control when, where, and how much each gene is expressed. Some encodes functional RNA molecules that are not translated into protein but perform critical regulatory and structural roles. Some represents the evolutionary remnants of ancient viral insertions, transposons, and repeated sequences that have accumulated over billions of years of evolutionary history. The non-coding genome is not junk. It is a regulatory landscape of extraordinary complexity that is still being mapped.

Chart 01: Composition of the Human Genome

Replication: The Copy Machine of Life

Every time a cell divides, it must first copy its entire genome: all 3 billion base pairs of human DNA, every chromosome, in full. The copy must be accurate enough that the daughter cells function properly. In practice, this means an error rate of less than 1 mistake per billion base pairs copied. This is an accuracy greater than any comparable process in human technology.

The mechanism of DNA replication is directly suggested by the structure of the double helix, as Watson and Crick noted in their 1953 paper. Because A always pairs with T and G always pairs with C, each strand of the helix contains all the information needed to reconstruct the other. Separate the two strands and use each as a template for building a new complementary strand, and the result is two double helices, each containing one original strand and one newly synthesised strand. This is semiconservative replication, a prediction confirmed experimentally by Matthew Meselson and Franklin Stahl in 1958 in an experiment widely regarded as one of the most elegant in the history of molecular biology. They used nitrogen isotopes to label parental DNA and tracked the fate of the label through successive rounds of replication. The pattern was exactly what semiconservative replication predicted.
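
The logic of the Meselson and Stahl result can be reproduced with a toy simulation. The Python sketch below is an illustration of the bookkeeping, not a model of the actual experiment: each duplex is a pair of strand labels, "H" for heavy-isotope parental strands and "L" for strands made in light medium, and the function names are hypothetical.

```python
from collections import Counter

def replicate(duplexes):
    """One semiconservative round: every old strand templates a new light strand."""
    daughters = []
    for strand_a, strand_b in duplexes:
        daughters.append((strand_a, "L"))
        daughters.append((strand_b, "L"))
    return daughters

def density_classes(duplexes):
    """Group duplexes as HH (heavy), HL (hybrid), or LL (light)."""
    return Counter("".join(sorted(duplex)) for duplex in duplexes)

population = [("H", "H")]   # start with fully heavy parental DNA
history = []
for generation in range(1, 4):
    population = replicate(population)
    history.append(dict(density_classes(population)))
    print(generation, history[-1])
```

After one round all the DNA is hybrid; after two rounds it is half hybrid and half light; after three the hybrid fraction has fallen to a quarter. A conservative scheme, in which the parental duplex stayed intact, would never produce the all-hybrid first generation.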

The molecular machinery of DNA replication involves a large ensemble of proteins working in coordinated sequence. Helicase unwinds the double helix, separating the two strands at a structure called the replication fork. DNA polymerase synthesises the new strand, reading the template in the 3' to 5' direction and adding new nucleotides to the growing 5' to 3' chain. Because DNA polymerase can only synthesise in one direction, only one new strand, the leading strand, can be synthesised continuously. The other, the lagging strand, is synthesised discontinuously in short fragments called Okazaki fragments, named for Reiji Okazaki who identified them in 1968. The fragments are subsequently joined by an enzyme called DNA ligase.

The accuracy of replication derives from multiple mechanisms. DNA polymerase makes an error approximately once every 100,000 nucleotides added. A proofreading function built into DNA polymerase detects mismatched bases as they are incorporated and removes them, reducing the error rate to approximately 1 in 10 million. A subsequent mismatch repair system scans the newly synthesised DNA, identifies remaining mismatches, and corrects them, reducing the final error rate to below 1 in a billion. These successive layers of error correction are not redundant. They are each essential. When the mismatch repair system is inactivated, as it is in certain inherited cancer syndromes, the mutation rate increases dramatically and cancer risk rises correspondingly.
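
To see what these rates mean for one round of copying, multiply each by the size of the genome. The short Python sketch below uses the error rates quoted above and the article's 3 billion base pairs; the dictionary and variable names are illustrative.

```python
GENOME_BP = 3_000_000_000   # haploid human genome, article's figure

# Per-nucleotide error rates at each stage, as quoted in the text.
ERROR_RATES = {
    "polymerase alone": 1e-5,        # 1 in 100,000
    "after proofreading": 1e-7,      # 1 in 10 million
    "after mismatch repair": 1e-9,   # 1 in a billion (upper bound)
}

expected_errors = {
    stage: GENOME_BP * rate for stage, rate in ERROR_RATES.items()
}

for stage, errors in expected_errors.items():
    print(f"{stage}: about {errors:,.0f} errors per genome copy")
```

On these figures an unchecked polymerase would introduce tens of thousands of errors per copy, proofreading cuts that to hundreds, and mismatch repair to a handful at most.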

In human cells, DNA replication begins at approximately 30,000 points across the genome, called origins of replication. Each origin fires once per cell cycle, and replication proceeds in both directions until it meets the forks coming from adjacent origins. At a typical replication speed of approximately 50 base pairs per second per polymerase, copying the 3 billion base pairs of the genome from a single origin would take thousands of hours, the better part of a year, even with forks moving in both directions. By spreading the work across some 30,000 origins, which fire in a staggered programme rather than all at once, the cell completes the full copy in roughly 8 hours.
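
The arithmetic can be checked directly. The Python sketch below takes the genome size, fork speed, and origin count quoted above and assumes, as an idealisation, two forks per origin, evenly spaced origins, and all origins firing at once. The function name is hypothetical.

```python
GENOME_BP = 3_000_000_000    # article's figure
FORK_SPEED_BP_PER_S = 50     # article's figure, per polymerase
ORIGINS = 30_000             # article's figure

def copy_time_hours(origins: int, forks_per_origin: int = 2) -> float:
    """Idealised time to copy the genome with all forks working in parallel."""
    forks = origins * forks_per_origin
    seconds = GENOME_BP / (forks * FORK_SPEED_BP_PER_S)
    return seconds / 3600

single_origin_hours = copy_time_hours(1)
all_origins_hours = copy_time_hours(ORIGINS)

print(f"one origin: {single_origin_hours:,.0f} hours")
print(f"{ORIGINS:,} origins at once: {all_origins_hours * 60:.1f} minutes")
```

Under these assumptions a single origin needs thousands of hours, while the all-at-once case needs well under an hour. The gap between that lower bound and an S phase lasting several hours reflects the fact that real origins are unevenly spaced and do not all fire together.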

The Human Genome Project and What It Found

In 1990, an international consortium of research institutions began the Human Genome Project: a coordinated effort to determine the complete sequence of the 3 billion base pairs of the human genome. It was the largest biological research undertaking in history. The initial estimate was 15 years and 3 billion dollars, roughly one dollar per base pair. In 2000, a draft sequence was announced jointly by the public consortium and by Craig Venter's private company Celera Genomics, which had used a faster computational approach. The complete reference sequence was published in April 2003, 50 years almost to the day after the Nature paper describing the double helix.

What the genome project found upended a series of confident assumptions about the relationship between genome size, gene number, and biological complexity. The human genome was expected to contain approximately 100,000 protein-coding genes. The actual number is approximately 19,000 to 20,000, fewer than a zebrafish, and roughly the same as the small roundworm C. elegans, whose adult hermaphrodite has just 959 somatic cells. The number of genes does not track biological complexity in any straightforward way.

Approximately 45 percent of the human genome consists of transposable elements: sequences that have, over evolutionary time, copied and inserted themselves throughout the genome. The class that occupies the most sequence in humans is the LINE-1 element, approximately 6 kilobases long when intact, with approximately 500,000 copies scattered across the genome. Most of these copies are truncated or otherwise defective, evolutionary remnants of ancient mobile elements that have lost the ability to transpose. A small fraction remain active and continue to move.

Approximately 8 percent of the human genome consists of sequences derived from endogenous retroviruses: the genomic remnants of ancient viral infections of the germ line that have been incorporated and transmitted across subsequent generations. Some of these sequences have been co-opted for biological functions. The protein syncytin, essential for forming the placenta in primates, is derived from an ancient retroviral envelope gene. The genome is not a pristine collection of carefully selected sequences. It is a palimpsest: the accumulated residue of billions of years of evolutionary history, containing the molecular traces of every major event in the lineage.

The genome project also revealed the extraordinary degree of genetic similarity between humans and other organisms. Humans share approximately 98.7 percent of their protein-coding genome with chimpanzees, approximately 85 percent with mice, approximately 60 percent with fruit flies, and approximately 31 percent with brewer's yeast. These figures are not guesses. They are the direct result of sequencing and comparison, and they encode the shared evolutionary history of all eukaryotic life.

Chart 02: Gene Count Across Species: Complexity Does Not Scale with Genes

Mutation, Error, and the Raw Material of Evolution

The fidelity of DNA replication is remarkable but not perfect. Errors escape the proofreading and mismatch repair systems. DNA is also damaged by radiation, reactive oxygen species produced by metabolism, and a range of chemical agents in the environment. Cells have extensive DNA repair systems to correct this damage, but repair is also imperfect, and some changes are permanently incorporated into the genome. These permanent changes in DNA sequence are mutations.

The simplest mutations are point mutations: changes to a single base pair. A substitution replaces one base with another. If the substitution occurs in the protein-coding sequence and changes the codon to one that specifies a different amino acid, the result is a missense mutation. If the substitution changes a codon to a stop codon, prematurely terminating the protein, it is a nonsense mutation. If the substitution changes the codon but the new codon specifies the same amino acid as the original, the result is a synonymous or silent mutation, which has no effect on the protein sequence.
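
The three categories can be expressed as a small classifier. The Python sketch below is illustrative only: the function name is hypothetical, it uses the standard genetic code, and it assumes the original codon encodes an amino acid rather than a stop.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' is stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution within one sense codon.

    position is 0, 1, or 2 within the codon.
    """
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "missense"

# GAG encodes glutamic acid (E).
print(classify_substitution("GAG", 1, "T"))  # GTG, valine: missense
print(classify_substitution("GAG", 2, "A"))  # GAA, still glutamic acid: synonymous
print(classify_substitution("GAG", 0, "T"))  # TAG, stop: nonsense
```

The same codon yields all three outcomes depending on which base changes, which is why the position of a substitution matters as much as its occurrence.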

Sickle cell anaemia results from a single nucleotide substitution in the gene encoding the beta chain of haemoglobin: an A replaced by a T at position 17 of the coding sequence. This changes one codon from specifying glutamic acid to specifying valine. The resulting haemoglobin molecule, under low-oxygen conditions, polymerises into rigid fibres that distort the red blood cell into the characteristic sickle shape, causing the cell to rupture and block small blood vessels. One letter change in 3 billion. The downstream consequences span a lifetime.

Insertions and deletions, known collectively as indels, add or remove one or more base pairs from the sequence. If the number of bases inserted or deleted is not a multiple of three, the result is a frameshift mutation, which shifts the reading frame of all downstream codons and typically produces a completely non-functional or truncated protein. An indel that is a multiple of three preserves the reading frame but can still be devastating. The most common form of cystic fibrosis results from a deletion of precisely 3 base pairs in the gene encoding the protein CFTR, removing a single phenylalanine from position 508 of the protein. One deleted codon. The loss of function in this single protein causes the multi-organ disease that defines cystic fibrosis.
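
The difference between an in-frame deletion and a frameshift shows up as soon as a sequence is split into codons. The Python sketch below uses an arbitrary made-up sequence and hypothetical helper names; it is not drawn from the CFTR gene.

```python
def codons(seq: str) -> list:
    """Split a sequence into complete codons, discarding any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def delete(seq: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at index `start`."""
    return seq[:start] + seq[start + length:]

def indel_effect(length: int) -> str:
    """An indel preserves the reading frame only if it is a multiple of three."""
    return "in-frame" if length % 3 == 0 else "frameshift"

reference = "ATGAAAGGGCCCTTTGGATAA"   # arbitrary illustrative sequence

print(codons(reference))
print(indel_effect(3), codons(delete(reference, 3, 3)))
print(indel_effect(1), codons(delete(reference, 3, 1)))
```

Deleting three bases removes one codon and leaves every other codon intact. Deleting a single base changes every codon downstream of the deletion and, in this example, destroys the stop codon as well.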

The rate at which new mutations arise in the human germline has been directly measured through whole-genome sequencing of children and their parents. Each human child carries approximately 60 to 70 new single nucleotide variants not present in either parent, the result of replication errors and DNA damage accumulated during the production of the egg and sperm. Most of these mutations fall in non-coding regions and have no detectable effect. Vanishingly few, but not zero, fall in positions where they improve function. This is the raw material on which natural selection operates: a continuous trickle of molecular variation, most of it inconsequential, generated with every generation, in every lineage, since the origin of life.

Rosalind Franklin and the Structure That Was Taken

The history of the double helix cannot be honestly told without attending to Rosalind Franklin in full, not as a footnote but as a central figure whose contribution was decisive and whose credit was withheld in ways that were clear at the time and are undeniable in retrospect.

Franklin arrived at King's College London in 1951 as a world-class X-ray crystallographer with experience applying the technique to the structure of carbon compounds. She was assigned to work on DNA. Over the following 18 months she produced X-ray diffraction data of a quality that was not matched by any other group working on DNA structure. She identified two distinct forms of DNA, which she designated A and B, and recognised that they had different structural properties. Her image of the B form, Photo 51, taken in May 1952, showed the helical form with a clarity that was not available from any other source.

In January 1953, Maurice Wilkins showed Photo 51 to James Watson without Franklin's knowledge or consent. Watson later wrote that seeing the image made his jaw drop and his heart race. The helical pattern it showed was unambiguous. Around the same time, Watson and Crick also obtained, through the Medical Research Council, a report summarising Franklin's unpublished data on the unit cell parameters of DNA. These parameters were critical for determining that the backbone of the helix was on the outside rather than the inside, a point Watson and Crick had previously gotten wrong. The 1953 paper was submitted and published while Franklin was unaware of how her data had been used.

Franklin left King's College in 1953 for Birkbeck College, where she did pioneering work on the structure of viruses. She died of ovarian cancer in April 1958, at the age of 37. She did not know the extent to which her data had contributed to the double helix model. The Nobel Committee awarded the prize to Watson, Crick, and Wilkins in 1962. Franklin was not mentioned in any of the Nobel lectures delivered on that occasion.

The subsequent decades brought a partial reckoning. Aaron Klug, Franklin's collaborator at Birkbeck, won the Nobel Prize in Chemistry in 1982 for work that built on the structural methods she had pioneered. Watson's memoir The Double Helix, published in 1968, depicted Franklin dismissively, in terms that Crick himself acknowledged were unfair. The science does not change. The structure of DNA is what it is, and Watson and Crick's model-building insight was real. But the most important piece of empirical evidence that made the model possible came from a crystallographer whose name was not on the paper.

Rosalind Franklin (1920 to 1958). Her X-ray crystallography of DNA produced data of a quality unmatched by any other group working on the problem. The image designated Photo 51, taken in May 1952, provided the clearest evidence then available for the helical form of DNA. She died four years before the Nobel Prize was awarded for the discovery her data had made possible.

DNA Damage, Repair, and Cancer

Every cell in the human body is subject to a continuous barrage of DNA damage. The genome of a typical human cell suffers approximately 10,000 to 20,000 DNA lesions every day, from sources including ultraviolet radiation from sunlight, reactive oxygen species generated as by-products of metabolism, spontaneous chemical reactions such as the hydrolytic loss of bases, and a variety of environmental mutagens. Most of this damage is repaired before it can be replicated or cause gene expression changes. The DNA repair systems of mammalian cells are among the most sophisticated molecular machinery in biology.

Base excision repair removes individual damaged bases, recognising the chemically altered nucleotide, cutting it out, and replacing it using the complementary strand as a template. Nucleotide excision repair handles bulkier lesions, such as the pyrimidine dimers formed when adjacent thymine bases are cross-linked by ultraviolet radiation. Double-strand break repair addresses the most dangerous form of DNA damage, the severing of both strands of the helix. It operates via two main pathways: homologous recombination, which uses the sister chromatid as a template for accurate repair, and non-homologous end joining, which reconnects the broken ends directly, a faster but more error-prone process.

Mary-Claire King at the University of California Berkeley identified in 1990 a region on chromosome 17 strongly linked to hereditary breast and ovarian cancer. The gene at this locus, cloned in 1994 and designated BRCA1, encodes a protein that plays a central role in the homologous recombination pathway of double-strand break repair. Individuals who inherit a defective copy of BRCA1 or the related gene BRCA2 have a substantially elevated lifetime risk of breast and ovarian cancer because their cells have a compromised capacity to repair certain types of DNA damage accurately.

Cancer, understood at the molecular level, is what happens when the systems that maintain genomic integrity fail to prevent the accumulation of mutations in genes that control cell growth and division. Proto-oncogenes are normal genes whose protein products promote cell growth; mutations that render them constitutively active convert them into oncogenes that drive uncontrolled cell proliferation. Tumour suppressor genes encode proteins that inhibit cell growth or promote apoptosis; their inactivation removes brakes on division. Cancer is not a single disease. It is the consequence of genomic instability: the genome rewriting itself in ways that progressively erode the cell's compliance with the growth-control systems of the organism.

CRISPR and the Editable Genome

For most of the history of molecular genetics, the genome could be read but only laboriously altered. Site-specific changes required techniques that were slow, expensive, and applicable only to a limited range of organisms. The discovery and development of the CRISPR-Cas9 system transformed this situation with a speed that had no precedent in the history of biological technology.

CRISPR, standing for Clustered Regularly Interspaced Short Palindromic Repeats, was first identified as a feature of bacterial genomes in 1987 by Yoshizumi Ishino in Japan, though its function was not understood at the time. Francisco Mojica at the University of Alicante identified CRISPR sequences in archaea in the 1990s and proposed in 2005 that they represented an adaptive immune system: a molecular record of past viral infections that bacteria and archaea use to recognise and destroy viruses they have encountered before.

The key mechanistic insight came from Emmanuelle Charpentier and Jennifer Doudna, who published in Science in 2012 a demonstration that the Cas9 protein could be programmed with a synthetic guide RNA to cut any DNA sequence specified by the guide. The cut was precise: at exactly the position in the genome complementary to the 20-nucleotide sequence specified in the guide RNA. A molecular machine that could be directed to cut any specific sequence in any genome, in any cell, had no precedent. Charpentier and Doudna received the Nobel Prize in Chemistry in 2020.
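
At the level of sequence, targeting amounts to a search problem. The Python sketch below finds sites matching a 20-nucleotide guide on either strand. It adds one detail not covered in the text above: the Cas9 from Streptococcus pyogenes also requires a short motif, NGG, immediately after the matched site (the protospacer adjacent motif, or PAM). The function names, the guide, and the toy genome are hypothetical, and the search demands an exact match, which real Cas9 does not strictly require.

```python
import re

COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_targets(genome: str, guide: str) -> list:
    """Return (strand, start) for each exact guide match followed by NGG.

    Start positions are indices on the strand that was scanned.
    """
    if len(guide) != 20:
        raise ValueError("this sketch assumes a 20-nucleotide guide")
    # Lookahead so that overlapping sites would also be reported.
    pattern = re.compile(f"(?=({re.escape(guide)}[ATGC]GG))")
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for match in pattern.finditer(seq):
            hits.append((strand, match.start()))
    return hits

guide = "GATTACAGATTACAGATTAC"              # made-up guide, DNA alphabet
with_pam = "CCGT" + guide + "TGG" + "AACC"  # site followed by a PAM
without_pam = "CCGT" + guide + "TAA" + "AACC"

print(find_targets(with_pam, guide))        # [('+', 4)]
print(find_targets(without_pam, guide))     # []
```

The second example shows why the motif matters: the same 20 nucleotides are present, but without the adjacent NGG the site is not reported as a target.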

In late 2023, the first CRISPR-based therapy, a treatment for sickle cell disease and beta-thalassaemia, received its first regulatory approvals: in the United Kingdom in November and, for sickle cell disease, in the United States in December. The treatment involves editing the patient's own haematopoietic stem cells to reactivate the production of foetal haemoglobin, compensating for the defective adult haemoglobin. For patients with sickle cell disease, the condition that Linus Pauling identified in 1949 as a molecular disease can now be treated by rewriting the patient's own genome.

He Jiankui, a Chinese researcher, announced in November 2018 that he had used CRISPR to edit the germline of human embryos that were subsequently implanted and born as live infants, introducing a mutation intended to confer resistance to HIV infection by disrupting the CCR5 gene. The announcement was met with near-universal condemnation from the scientific community. He was subsequently sentenced to three years in prison by Chinese authorities. The episode clarified the distinction between somatic gene editing, which affects only the treated individual, and germline editing, which alters the genome of all subsequent descendants.

The Telomere and the Limit of the Cell

At the ends of every chromosome is a structure called the telomere: a repeating sequence of the hexanucleotide TTAGGG that caps the chromosome and protects it from degradation and from being recognised by the cell's DNA repair machinery as a double-strand break. Human telomeres consist of roughly 1,500 to 2,500 of these repeats at birth, spanning about 10 to 15 kilobases per chromosome end.

The problem the telomere addresses is fundamental. DNA polymerase cannot replicate the very end of a linear chromosome. This is the end-replication problem, first articulated by Alexey Olovnikov in 1971 and independently by James Watson in 1972. On the lagging strand, the final stretch of template cannot be copied, so each round of replication leaves the new strand short at its 5' end and the template's 3' end exposed as a single-stranded overhang. The net effect is that the chromosome shortens by approximately 50 to 100 base pairs with each division. Over successive cell divisions, the telomere shrinks. When it reaches a critical minimum length, a checkpoint is triggered. The cell can no longer divide. It enters a state of permanent growth arrest called replicative senescence.

The consequence had been observed before its cause was known. Leonard Hayflick showed in 1961 that normal human cells grown in culture undergo a finite number of divisions, typically 40 to 60, before permanently losing the ability to divide. The Hayflick limit had no mechanistic explanation until telomere biology provided one.
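
The link between telomere shortening and a finite division count can be sketched as a simple countdown. In the Python example below, the starting length of 10,000 base pairs and the critical length of 5,000 base pairs are assumptions chosen for illustration, not measured values; only the loss of 50 to 100 base pairs per division comes from the text. The function name is hypothetical.

```python
def divisions_until_senescence(start_bp: int,
                               loss_per_division_bp: int,
                               critical_bp: int) -> int:
    """Count divisions possible before the telomere would drop below critical_bp."""
    divisions = 0
    length = start_bp
    while length - loss_per_division_bp >= critical_bp:
        length -= loss_per_division_bp
        divisions += 1
    return divisions

START_BP = 10_000      # assumed starting telomere length (illustrative)
CRITICAL_BP = 5_000    # assumed checkpoint threshold (illustrative)

fast_loss = divisions_until_senescence(START_BP, 100, CRITICAL_BP)
slow_loss = divisions_until_senescence(START_BP, 50, CRITICAL_BP)

print(fast_loss)   # 50
print(slow_loss)   # 100
```

With these illustrative inputs the countdown lands in the same range of tens of divisions that Hayflick observed, which is the sense in which the telomere acts as a division counter. The exact figure depends entirely on the assumed starting and critical lengths.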

Elizabeth Blackburn, then at Yale and later at UC Berkeley and UCSF, identified the telomere sequence in 1978 and subsequently, with graduate student Carol Greider, discovered in 1984 an enzyme called telomerase that can extend telomeres using an RNA template it carries within itself. Telomerase adds new TTAGGG repeats to chromosome ends, compensating for the sequence lost in each round of replication. Blackburn and Greider, together with Jack Szostak, received the Nobel Prize in Physiology or Medicine in 2009.

Most normal somatic cells express little or no telomerase, which is why they are subject to the Hayflick limit. Stem cells and germ line cells express telomerase, which is how they maintain their capacity for division. Cancer cells almost universally reactivate telomerase expression, enabling them to divide without limit. The telomere system is therefore a counting mechanism built into the molecular architecture of the chromosome, one that counts cell divisions and imposes a limit that constrains the ability of normal cells to accumulate the additional mutations that cancer requires. When cancer circumvents that limit, it does so by resurrecting the same molecular mechanism that maintains stem cells and keeps the germ line immortal.

What the Genome Says About What a Human Is

The complete sequence of the human genome is a document 3 billion characters long. It can be read, searched, and compared across individuals and across species with tools that did not exist 30 years ago. What reading it reveals is a picture of a human being that has no precedent in any previous description of human nature.

Single nucleotide polymorphisms (SNPs), positions in the genome where individuals in a population differ by a single base, number approximately 4 to 5 million between any two unrelated humans. Any two randomly chosen humans share approximately 99.9 percent of their genome sequence. The 0.1 percent that differs, distributed across 3 billion positions, is nonetheless sufficient to make every human genome unique in the history of the species. It is also sufficient to carry the signals of ancestry, migration, and evolutionary selection that population genomics reads as a record of human prehistory.

The genomes of non-African humans include approximately 1 to 4 percent derived from Neanderthal ancestors, the result of interbreeding events that occurred approximately 50,000 to 60,000 years ago as anatomically modern humans moved out of Africa and encountered Neanderthal populations in Eurasia. Individuals of Melanesian and Aboriginal Australian ancestry carry an additional 4 to 6 percent of their genome from Denisovan ancestors, a hominin group known almost entirely from genomic data derived from fragmentary fossils found in a Siberian cave.

The human genome also carries within it the traces of events that shaped the genome itself over far longer timescales. Humans and other great apes lack the ability to synthesise vitamin C, a capacity that most mammals retain. The gene encoding L-gulonolactone oxidase, an enzyme required for vitamin C synthesis, is present in the human genome but is non-functional, inactivated by mutations accumulated after a dietary shift in a primate ancestor made endogenous synthesis redundant. The dead gene remains in the genome as a molecular fossil of a biological capability the lineage once had and no longer needs.

What the genome ultimately says about what a human is can be stated simply and with full scientific authority: a human being is a set of instructions, accumulated and refined by 3.8 billion years of evolutionary trial and selection, for building and operating a system of 37 trillion eukaryotic cells. Those instructions are encoded in a molecule whose structure was described in 1953, whose sequence was determined between 1990 and 2003, and whose meaning is still being read.

The genome is not the blueprint it was sometimes described as in the early genomic era. It is better understood as a dynamic resource: a set of sequences whose expression is regulated by a vast system of molecular controls that respond to developmental signals, environmental inputs, and the history of the cell's experience. What a genome does depends on where it is, when in development it is being read, and what signals the cell is receiving. The genome, and the organism that carries it, are not separable in the way that a blueprint and a building can be separable. They are aspects of the same process.
