MOLECULAR & CELLULAR NEUROBIOLOGY

Master Course Cognitive Neuroscience - Radboud University, Nijmegen

Chapter 4: Genomics

The Human Genome Project: purpose and goal

The Human Genome Project, the first large international effort in the history of biological research, was initiated on October 1, 1990, to be completed in the year 2005. However, with improvements in technology and competition from the private sector, the timetable was accelerated. A rough draft of 90% was completed in 2000, and the complete sequence became available in 2003. The Human Genome Project sequenced the DNA blueprint for the development of a single fertilized egg into a complex organism. However, while the overall objective was to sequence the human genome, other goals were completed along the way that markedly accelerated the efforts of all investigators involved in biological or medical research. The first goal was to develop a genetic map. This meant developing markers (unique DNA sequences) along each chromosome that would have a readily identifiable chromosomal position to provide highly informative signposts for the identification of nearby genes. This goal provided thousands of markers spaced 5 to 10 million base pairs apart, spanning the entire human genome, leading to the creation of a genetic “road map” for each chromosome. As will become evident in a future section of this text, it is the use of this genetic map, with DNA sequences (markers) of known positions (loci) along each chromosome, that enables the mapping of a gene’s chromosomal location by genetic linkage analysis. The tool of genetic linkage analysis led to the acceleration of mapping the position of numerous genes responsible for diseases. Currently over 1500 disease-causing genes are known, due to the more rapid identification of genes facilitated by the Human Genome Project.

The policy of the Human Genome Project is that the entire human DNA sequence, including all identified genes, will be available to the public. Each gene, as it is sequenced, is entered into a publicly accessible database and made available at no cost. In the United States, GenBank (at http://www.ncbi.nlm.nih.gov) is run by the National Center for Biotechnology Information (NCBI) and serves as the public repository of DNA sequence information. The results of the publicly funded Human Genome Project consist not only of the DNA sequences of the various genes but also of the intervening sequences.

Another goal was to develop a physical map of the regions of the DNA that are expressed as genes. The markers for this map are referred to as expressed sequence tags (ESTs), short sequences of 200 to 300 bp. Each EST is unique and represents a fragment of a specific gene that has yet to be fully characterized. ESTs are generated by extracting all of the mRNAs in a cell type, which together represent all of the genes expressed in that cell at that time. The mRNA is converted to cDNA with the enzyme reverse transcriptase, and the sequences are amplified by the polymerase chain reaction (PCR); unique sequences are then selected and entered into GenBank as ESTs. The sequences of these ESTs are then matched to the plethora of sequences available in the DNA sequence repository. Thus, ESTs mapped to their chromosomal locations can be used as markers to identify novel genes responsible for disease. The development of this physical map has tremendously accelerated the efforts of investigators to identify novel genes relevant to normal physiology or disease. When a locus harboring a disease gene is mapped to a chromosomal region, the ESTs in that region serve as candidate genes and greatly facilitate the identification of the gene of interest.
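
The EST workflow described above (mRNA, then cDNA, then matching against known sequence) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the sequences and the names region_1 and region_2 are invented, and exact string matching stands in for the alignment tools that real EST mapping relies on.

```python
# Toy illustration only: all sequences and region names are invented.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def first_strand_cdna(mrna: str) -> str:
    """First-strand cDNA: the DNA reverse complement of an mRNA written 5'->3'."""
    return reverse_complement(mrna.upper().replace("U", "T"))


def locate_est(est: str, reference: dict) -> list:
    """Return (region name, 0-based position, strand) for each exact match.

    Both strands are searched, because a cDNA read may correspond to
    either strand of the reference sequence.
    """
    hits = []
    for name, seq in reference.items():
        for strand, query in (("+", est), ("-", reverse_complement(est))):
            start = seq.find(query)
            while start != -1:
                hits.append((name, start, strand))
                start = seq.find(query, start + 1)
    return hits


# Hypothetical reference regions with known chromosomal positions.
REFERENCE = {
    "region_1": "TTTTATGGCCTTAGCAAAA",
    "region_2": "CCCCGGGG",
}

est = first_strand_cdna("AUGGCCUUAGC")
print(est, locate_est(est, REFERENCE))
```

In this sketch the EST is found on the minus strand of region_1, which is how a mapped EST comes to act as a signpost for an expressed gene at a known position.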

Information from the draft human genome sequence

By the numbers

The human genome contains 3164.7 million chemical nucleotide bases (A, C, T, and G). The average gene consists of 30,000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases. The total number of genes is estimated at 21,000 — much lower than previous estimates of 80,000 to 140,000 that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor areas. Almost all (99.9%) nucleotide bases are exactly the same in all people. The functions are unknown for over 50% of discovered genes. Less than 2% of the genome codes for proteins. Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome. Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes. During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.
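
The figures in this paragraph can be combined with simple arithmetic. The sketch below only restates the quoted numbers; the variable names are illustrative, and the derived values are rough bounds rather than measured quantities.

```python
# Figures quoted in the paragraph above; derived values are simple arithmetic.
GENOME_BASES = 3164.7e6      # total nucleotide bases
IDENTICAL_FRACTION = 0.999   # bases that are the same in all people
CODING_FRACTION = 0.02       # upper bound: "less than 2%" codes for proteins
REPEAT_FRACTION = 0.50       # lower bound: "at least 50%" is repeated sequence
GENE_COUNT = 21_000          # estimated number of genes
AVERAGE_GENE_BASES = 30_000  # average gene size


def genome_summary() -> dict:
    """Return derived quantities, in bases, from the quoted figures."""
    return {
        "variable_bases": GENOME_BASES * (1 - IDENTICAL_FRACTION),
        "max_coding_bases": GENOME_BASES * CODING_FRACTION,
        "min_repeat_bases": GENOME_BASES * REPEAT_FRACTION,
        "bases_spanned_by_genes": GENE_COUNT * AVERAGE_GENE_BASES,
    }


for name, value in genome_summary().items():
    print(f"{name}: {value / 1e6:,.1f} million bases")
```

Taken at face value, the 0.1% of bases that differ between people amounts to roughly 3 million positions, and 21,000 genes of average size would span about 630 million bases, far more than the fraction that actually codes for protein.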
How it is arranged
The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C. In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes. Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between. Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity. Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
How the human compares with other organisms
Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout. Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. This process can yield different protein products from the same gene. Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity. The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%). Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.
Variations and mutations
There are about 1.4 million locations where single-base DNA differences (single-nucleotide polymorphisms, SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history. Germline (sperm or egg cell) mutations arise about twice as often in males as in females (a ratio of 2:1). Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.
Applications, future challenges
Deriving meaningful knowledge from the DNA sequence will define research through the coming decades to inform our understanding of biological systems. This enormous task will require the expertise and creativity of tens of thousands of scientists from varied disciplines in both the public and private sectors worldwide. The draft sequence already is having an impact on finding genes associated with disease. A number of genes have been pinpointed and associated with breast cancer, muscle disease, deafness, and blindness. Additionally, finding the DNA sequences underlying such common diseases as cardiovascular disease, diabetes, arthritis, and cancers is being aided by the human variation maps (SNPs) generated in the Human Genome Project in cooperation with the private sector. These genes and SNPs provide focused targets for the development of effective new therapies. One of the greatest impacts of having the sequence may well be in enabling an entirely new approach to biological research. In the past, researchers studied one or a few genes at a time. With whole-genome sequences and new high-throughput technologies, they can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example, or all the transcripts in a particular tissue or organ or tumor, or how tens of thousands of genes and proteins work together in interconnected networks to orchestrate the chemistry of life.
Anticipated benefits
improved diagnosis of disease
earlier detection of genetic predispositions to disease
rational drug design
gene therapy and control systems for drugs
personalized, custom drugs

Genomics timeline

1869	DNA first isolated
1909	Word "gene" is coined
1952	Genes are made of DNA
1953	DNA double helix described
1961	mRNA isolated
1966	Genetic code cracked
1972	First animal gene cloned
1981	First transgenic mice and fruit flies
1983	First disease gene mapped - Huntington
1987	First human genetic map
1994	First GM food on the market: Flavr Savr tomato
1996	Yeast genome sequenced
1996	First mammal cloned - Dolly
1997	E. coli genome sequenced
1998	Roundworm C. elegans genome sequenced
2000	Fruit fly genome sequenced
2000	90% of human genome sequenced
2003	Complete human genome sequenced

Translation of genomic information to future clinical practice

As the annotation of the human genome becomes stable, a user-friendly, distilled view can be developed, as in the figure above. The diagram (a) of a chromosome 3 region (12,300–12,450 kb) contains the PPAR-γ gene structure (dark blue) with an alternative promoter (light blue), hypothetical noncoding functional regions (green shaded boxes), and functional variants (red). Note that introns in the gene structure are scaled down relative to the exons. Zooming in on two sequence segments (b) shows the translated sequence with functional variants highlighted in blue (nucleotide changes) and pink (amino-acid changes). Amino-acid numbering includes the propeptide sequence. The variants (c, pink) can be viewed in the monomer protein structure (grey) in a linked database. Also shown is the binding position of an antidiabetic thiazolidinedione drug (blue), part of the other monomeric unit (green) of the dimeric receptor, and the ligand (yellow). Using linked information from a range of sources, a summary of the known, modelled or predicted biological consequences (such as biochemical, structural, medical or pharmacological) could be curated (and updated regularly) for each functional variant in tabular form (d). A small subset of this information would define the disease or drug outcome or side effect associated with each variant, would constitute specific risk information of value in clinical assessment, and would be exported (red outlined boxes). For maximum usefulness, therefore, the exported information would be subject to stringent filters and would include only data for which the medical relevance was well established for each particular disease discipline. For example, variants of uncertain significance would be excluded from the filtered risk information, although all data would be available in the public domain. All the information in a–d would be curated in the public domain.
The use of personal genetic information in a clinical setting would be initiated or consented to by an individual. The individual sequence acquired could be as little as one or more individual genotypes, or as much as a complete genome sequence. The information would be private and owned by the individual, and might be stored electronically, protected by a high-security code requiring unique personal identifiers (such as multiple fingerprint identification) for access only with consent of the individual (e). The information might be taken either before consultation (as illustrated here) or afterwards, and in either case would be subject to counselling by the practitioner and consent by the individual. A specific investigation would be initiated by a consultation (f). The personal genetic information would then be supplied by the individual, for interpretation with respect to an agreed set of variants and/or a specific phenotype. The practitioner would use the available risk information concerning each variant to provide a genetic assessment for the individual (g). The top line refers to the variant featured in d and f; the second line is a hypothetical entry for a variant on another chromosome and does not represent a known variant. In the case illustrated, the individual has the heterozygous genotype TC at position 3: 12,450,610. This corresponds to having both Pro 495 and Ala 495 forms of the protein PPAR-γ. This genotype confers an increased risk of insulin-resistant diabetes on the individual, and also resistance to the thiazolidinedione class of antidiabetic drugs. Combining this with risk information for other genotypes would help to inform subsequent clinical decisions (h).
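
The flow from curated table (d) to exported risk information to individual assessment (g) can be sketched as a small lookup. This is a minimal illustration under stated assumptions, not a description of any existing clinical system: the class VariantRisk, the well_established flag, and the second table entry are hypothetical, and only the chromosome 3 entry restates the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRisk:
    chromosome: str
    position: int
    genotype: str
    consequence: str
    well_established: bool  # hypothetical stand-in for the "stringent filters"


# Curated table (d). The first entry restates the example in the text; the
# second is an invented placeholder, like the hypothetical second line in (g).
CURATED = [
    VariantRisk("3", 12_450_610, "TC",
                "increased risk of insulin-resistant diabetes; "
                "resistance to thiazolidinedione antidiabetic drugs", True),
    VariantRisk("7", 1_000_000, "AG", "uncertain significance", False),
]


def exported_risk_table(curated) -> dict:
    """Keep only entries whose medical relevance is well established."""
    return {(v.chromosome, v.position, v.genotype): v.consequence
            for v in curated if v.well_established}


def assess(personal_genotypes: dict, risk_table: dict) -> dict:
    """Match an individual's genotypes against the exported risk information."""
    return {locus: risk_table[(locus[0], locus[1], genotype)]
            for locus, genotype in personal_genotypes.items()
            if (locus[0], locus[1], genotype) in risk_table}


# Genotypes supplied by the individual, keyed by (chromosome, position).
personal = {("3", 12_450_610): "TC", ("7", 1_000_000): "AG"}
assessment = assess(personal, exported_risk_table(CURATED))
print(assessment)
```

Only the well-established chromosome 3 entry reaches the assessment; the variant of uncertain significance stays in the curated table but is filtered out before export, as the text proposes.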

The HapMap project

While the Human Genome Project was completed in 2003, other large-scale human genome projects continue. The sequence of the human genome differs by only 0.1% among human beings. This one-tenth of 1%, however, translates into 3 million bases. These 3 million bases are now considered to be responsible for essentially all of the human variation including predisposition or resistance to diseases. Thus, it became evident that identifying the sequence responsible for human variation would represent a major quest for the next decade.

A great deal of human variation appears to be due to single-nucleotide polymorphisms (SNPs), which are distributed throughout the human genome at an average frequency of about one SNP per 1000 base pairs. While identifying the SNPs responsible for human variation, and the mechanisms by which they induce change, is of crucial importance, it is perhaps of even more immediate importance to identify those SNPs that predispose to disease. Their potential to facilitate diagnosis, prevention, and treatment could be enormous. The difficulty lies in how to identify them. Searching for SNPs that predispose to disease is quite a different task from identifying mutations responsible for single-gene disorders. A particular SNP is neither necessary nor sufficient for a particular disease and thus contributes only a small percentage of the predisposition to the disease. Inheriting several of these SNPs may have a cumulative effect, expressed in the phenotype of a polygenic disease. The diseases that ultimately must be understood are those due to multiple genes that interact significantly with the environment, such as cardiac diseases, cancer, and mental illness. In an effort to facilitate future studies identifying SNPs and their related phenotypes in polygenic diseases, a consortium was formed consisting of Canada, Japan, the United Kingdom, China, Nigeria, and the United States to sequence and identify SNPs. The overriding question was whether SNPs are coinherited in blocks (haplotypes), hence the name HapMap Project. The published results do indeed indicate that several of the SNPs are coinherited as blocks and exert a combined effect, so one can select SNPs that tag other SNPs in the same block, making it practical to scan the genome using 300,000 to 500,000 SNPs as opposed to several million.
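
The idea of selecting SNPs that tag other coinherited SNPs can be sketched with a simple greedy procedure. This is an illustration under stated assumptions, not the method used by the HapMap consortium: the haplotype data and SNP names are invented, coinheritance is measured with the squared correlation r² between two SNPs, and the threshold of 0.8 is an assumed cutoff.

```python
def r_squared(a, b) -> float:
    """Squared correlation (r^2) between two biallelic SNPs coded 0/1 per haplotype."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:  # a site without variation carries no information
        return 0.0
    return (pab - pa * pb) ** 2 / denom


def select_tag_snps(snps: dict, threshold: float = 0.8) -> dict:
    """Greedy tagging: repeatedly pick the SNP that covers the most untagged SNPs.

    Returns a mapping from each chosen tag SNP to the SNPs it represents.
    """
    untagged = set(snps)
    tags = {}
    while untagged:
        def covered(candidate):
            return {s for s in untagged
                    if s == candidate
                    or r_squared(snps[candidate], snps[s]) >= threshold}
        best = max(sorted(untagged), key=lambda c: len(covered(c)))
        tags[best] = sorted(covered(best))
        untagged -= set(tags[best])
    return tags


# Invented alleles (0/1) at five SNPs across eight haplotypes.
HAPLOTYPES = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp3": [1, 1, 1, 1, 0, 0, 0, 0],
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp5": [0, 1, 0, 1, 0, 1, 0, 1],
}

print(select_tag_snps(HAPLOTYPES))
```

Here two tag SNPs stand in for all five, because the first three SNPs are always inherited together and so are the last two. The same logic, applied genome-wide, is what allows a few hundred thousand markers to represent several million SNPs.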
While each human being has only 3 million SNPs, it is estimated that there are about 17 million in the general population. It now appears that chips carrying 500,000 SNPs can be used for genome-wide scans, which significantly decreases the cost compared with having to type 2 or 3 million SNPs. One challenge that remains is the low frequency at which many of these SNPs occur. Many SNPs appear to occur at a frequency of less than 5%, which makes detection by current technology very difficult. Common SNPs that occur with a frequency of 5 or 10% can, however, be detected in genome-wide scans with 500,000 SNPs as markers. Probably only 50,000 to 100,000 SNPs are responsible for significant change in humans, since most SNPs do not affect coding regions, although the percentage of SNPs in noncoding promoter regions that may markedly influence transcription remains to be determined. See also under "Genetic variations: SNPs and CNVs".
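
The scale of a 500,000-SNP scan can be put in perspective with the figures quoted in this chapter. The sketch below is plain arithmetic on those figures; the variable names are illustrative, and the assumption that chip SNPs are evenly spaced is a simplification.

```python
# Figures quoted in this chapter; evenly spaced markers are an assumption.
GENOME_BASES = 3164.7e6     # total nucleotide bases in the human genome
SNPS_PER_GENOME = 3e6       # SNPs carried by each human being
SNPS_IN_POPULATION = 17e6   # estimated SNPs in the general population
CHIP_SNPS = 500_000         # markers on a genome-wide SNP chip


def chip_summary(chip_snps: int = CHIP_SNPS) -> dict:
    """Return rough coverage figures for a chip with the given number of SNPs."""
    return {
        "mean_spacing_bp": GENOME_BASES / chip_snps,
        "fraction_of_population_snps": chip_snps / SNPS_IN_POPULATION,
        "genome_snps_per_marker": SNPS_PER_GENOME / chip_snps,
    }


for name, value in chip_summary().items():
    print(f"{name}: {value:,.3f}")
```

On these numbers a chip marker falls on average every 6,300 bases or so, and each marker has to stand in for about six of an individual's SNPs, which is only feasible because SNPs are coinherited in blocks.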
