What has been learned from analysis of the working draft
sequence of the human genome? What is still unknown?
By the Numbers
- The human genome contains 3.2 billion chemical nucleotide base pairs (A,
C, T, and G).
- The average gene consists of 3,000 base pairs, but sizes vary greatly, with
the largest known human gene being dystrophin at 2.4 million base pairs.
- The total number of genes is estimated at 25,000, much lower than previous
estimates of 80,000 to 140,000 that had been based on extrapolations from
gene-rich areas as opposed to a composite of gene-rich and gene-poor areas.
- The human genome sequence is almost exactly the same (99.9%) in all people.
- Functions are unknown for more than 50% of discovered genes.
The Wheat from the Chaff
- About 2% of the genome encodes instructions for the synthesis of proteins.
- Repeat sequences that do not code for proteins make up at least 50% of the
human genome.
- Repeat sequences are thought to have no direct functions, but they shed
light on chromosome structure and dynamics. Over time, these repeats reshape
the genome by rearranging it, thereby creating entirely new genes or modifying
and reshuffling existing genes.
- During the past 50 million years, a dramatic decrease seems to have occurred
in the rate of accumulation of repeats in the human genome.
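The percentages quoted above are easier to picture as absolute sizes. The following back-of-the-envelope sketch uses only figures stated in the text (3.2 billion base pairs, about 2% protein-coding, at least 50% repeats, 99.9% identity between people, 25,000 genes averaging 3,000 base pairs). The constant and function names are illustrative, and the results are rough estimates, not measured values.

```python
# Figures taken from the text; all results are rough estimates.
GENOME_BP = 3_200_000_000   # total base pairs in the human genome
CODING_FRACTION = 0.02      # "about 2%" encodes protein instructions
REPEAT_FRACTION = 0.50      # "at least 50%" is repeat sequence
IDENTITY = 0.999            # 99.9% of the sequence is shared by all people
GENE_COUNT = 25_000         # estimated number of genes
AVG_GENE_BP = 3_000         # average gene size in base pairs


def genome_summary():
    """Convert the quoted percentages into approximate base-pair counts."""
    return {
        "coding_bp": round(GENOME_BP * CODING_FRACTION),
        "repeat_bp": round(GENOME_BP * REPEAT_FRACTION),
        "differing_bp": round(GENOME_BP * (1 - IDENTITY)),
        # Naive estimate: every gene is assumed to be of average size.
        "gene_span_bp": GENE_COUNT * AVG_GENE_BP,
    }


if __name__ == "__main__":
    for name, value in genome_summary().items():
        print(f"{name}: {value:,}")
```

Under these assumptions, about 64 million base pairs encode proteins, and any two people differ at roughly 3.2 million positions.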
How It's Arranged
- The human genome's gene-dense "urban centers" are predominantly composed
of the DNA building blocks G and C.
- In contrast, the gene-poor "deserts" are rich in the DNA building blocks
A and T. GC- and AT-rich regions usually can be seen through a microscope
as light and dark bands on chromosomes.
- Genes appear to be concentrated in random areas along the genome, with vast
expanses of noncoding DNA between.
- Particular gene sequences have been associated with numerous diseases and
disorders, including breast cancer, muscle disease, deafness, and blindness.
- Stretches of up to 30,000 C and G bases repeating over and over often occur
adjacent to gene-rich areas, forming a barrier between the genes and the "junk
DNA." These CpG islands are believed to help regulate gene activity.
- Chromosome 1 (the largest human chromosome) has the most genes (3,168),
and the Y chromosome has the fewest (344).
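The contrast between GC-rich "urban centers" and AT-rich "deserts" can be made concrete by measuring the fraction of G and C bases in consecutive windows of a sequence. The sketch below is a minimal illustration on a toy string, not a real genomic analysis. The function names and the window size are assumptions chosen for the example, and counting "CG" dinucleotides is only a crude stand-in for how CpG islands are actually identified.

```python
def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=None):
    """Return (start, GC fraction) for each full window along seq.

    Windows do not overlap unless a smaller step is given.
    """
    step = step or window
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def cpg_count(seq):
    """Count C-followed-by-G dinucleotides, a rough indicator of CpG content."""
    return seq.upper().count("CG")


if __name__ == "__main__":
    # Toy sequence: a GC-rich stretch followed by an AT-rich stretch.
    toy = "GCGCGCGCGC" + "ATATATATAT"
    for start, fraction in windowed_gc(toy, window=10):
        print(start, fraction)
    print("CpG dinucleotides in first window:", cpg_count(toy[:10]))
```

On real chromosomes the windows would span thousands of bases, and the GC fraction would shift gradually, not jump between extremes as it does in this toy string.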
How the Human Genome Compares with That of Other Organisms
- Unlike the human's seemingly random distribution of gene-rich areas, many
other organisms' genomes are more uniform, with genes evenly spaced throughout.
- Humans have on average three times as many kinds of proteins as the fly
or worm because of mRNA transcript "alternative splicing" and chemical modifications
to the proteins. These processes can yield different protein products from the
same gene.
- Humans share most of the same protein families with worms, flies, and plants,
but the number of gene family members has expanded in humans, especially in
proteins involved in development and immunity.
- The human genome has a much greater portion (50%) of repeat sequences than
the mustard weed (11%), the worm (7%), and the fly (3%).
- Over 40% of predicted human proteins share similarity with fruit-fly or
worm proteins.
- Although humans appear to have stopped accumulating repeated DNA over 50
million years ago, there seems to be no such decline in rodents. This may
account for some of the fundamental differences between hominids and rodents,
although gene estimates are similar in these species. Scientists have proposed
many theories to explain evolutionary contrasts between humans and other organisms,
including those of life span, litter sizes, inbreeding, and genetic drift.
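To illustrate how alternative splicing lets one gene yield several products, the sketch below treats a gene as an ordered list of exons, some always included and some optional, and enumerates every transcript that results from keeping or skipping the optional ones. This is a deliberately simplified model: the exon sequences are made up, the function name is hypothetical, and in real cells splicing is regulated, so not every combination is actually produced.

```python
from itertools import product


def splice_isoforms(exons):
    """Enumerate transcripts from a list of (sequence, is_constitutive) exons.

    Constitutive exons are always kept; each optional exon is either kept
    or skipped. Exon order is preserved in every transcript.
    """
    choices = [
        (seq,) if constitutive else (seq, "")
        for seq, constitutive in exons
    ]
    return sorted({"".join(combo) for combo in product(*choices)})


if __name__ == "__main__":
    # Made-up toy gene: two constitutive exons flanking two optional ones.
    toy_gene = [
        ("AUG", True),
        ("GCU", False),
        ("UUU", False),
        ("UAA", True),
    ]
    for transcript in splice_isoforms(toy_gene):
        print(transcript)
```

With two optional exons the toy gene gives four distinct transcripts, which shows how the number of protein kinds can exceed the number of genes.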
Variations and Mutations
- Scientists have identified millions of locations where single-base DNA differences
occur in humans. This information promises to revolutionize the processes
of finding DNA sequences associated with such common diseases as cardiovascular
disease, diabetes, arthritis, and cancers.
- The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females.
Researchers point to several reasons for the higher mutation rate in the male
germline, including the greater number of cell divisions required for sperm
formation than for eggs.
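A single-base difference is simply a position where two otherwise matching sequences carry different letters. The sketch below compares two short, already aligned toy sequences and lists the positions where they differ. The function name and sequences are illustrative assumptions. Real SNP discovery works on many individuals and must first align the sequences and account for sequencing errors, none of which is modeled here.

```python
def single_base_differences(reference, sample):
    """List (position, reference_base, sample_base) where the two differ.

    Both sequences must already be aligned and of equal length.
    Positions are zero-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    # Toy sequences differing at a single position.
    reference = "ACGTACGT"
    sample = "ACGTTCGT"
    print(single_base_differences(reference, sample))
```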
What We Still Don't Understand: A Checklist for Future Research
- Exact gene number, exact locations, and functions
- Gene regulation
- DNA sequence organization
- Chromosomal structure and organization
- Noncoding DNA types, amount, distribution, information content, and functions
- Coordination of gene expression, protein synthesis, and post-translational
events
- Interaction of proteins in complex molecular machines
- Predicted vs experimentally determined gene function
- Evolutionary conservation among organisms
- Protein conservation (structure and function)
- Proteomes (total protein content and function) in organisms
- Correlation of SNPs (single-base DNA variations among individuals) with
health and disease
- Disease-susceptibility prediction based on gene sequence variation
- Genes involved in complex traits and multigene diseases
- Complex systems biology, including microbial consortia useful for environmental
restoration
- Developmental genetics, genomics
Applications, Future Challenges
Deriving meaningful knowledge from the DNA sequence will define research through
the coming decades to inform our understanding of biological systems. This enormous
task will require the expertise and creativity of tens of thousands of scientists
from varied disciplines in both the public and private sectors worldwide.
The draft sequence already is having an impact on finding genes associated
with disease. Over 30 genes have been pinpointed and associated with breast
cancer, muscle disease, deafness, and blindness. Additionally, finding the DNA
sequences underlying such common diseases as cardiovascular disease, diabetes,
arthritis, and cancers is being aided by the human variation maps (SNPs) generated
in the HGP in cooperation with the private sector. These genes and SNPs provide
focused targets for the development of effective new therapies.
One of the greatest impacts of having the sequence may well be in enabling
an entirely new approach to biological research. In the past, researchers studied
one or a few genes at a time. With whole-genome sequences and new high-throughput
technologies, they can approach questions systematically and on a grand scale.
They can study all the genes in a genome, for example, or all the transcripts
in a particular tissue or organ or tumor, or how tens of thousands of genes
and proteins work together in interconnected networks to orchestrate the chemistry
of life.
Post-sequencing projects are well under way worldwide. (See Genomic Science Program.)
These explorations will result in a profound, new, and more comprehensive understanding
of complex living systems, with applications to agriculture, human health, energy,
global climate change, and environmental remediation, among others.