MOLECULAR & CELLULAR NEUROBIOLOGY 
Master Course Cognitive Neuroscience - Radboud University, Nijmegen

 


 

Chapter 4: Genomics


 

 

Genome-wide association studies (GWAS)

Genome-wide association studies (GWAS) involve genotyping hundreds of thousands of common DNA variants (single-nucleotide polymorphisms, SNPs) spread throughout the genome in large numbers of individuals with illness and a similar number of comparison individuals with a low prevalence of illness ('controls'). By comparing the frequencies of genetic variants between individuals with and without disease, GWAS can point directly to the causal variants of disease or to variants that are in strong linkage disequilibrium with the causal variants. This approach enables the systematic characterization of DNA variation over the entire genome and in whole populations. The emergence of this technology has revolutionized our ability to apply GWAS approaches to many human diseases, with more than 200 loci now identified and replicated for Crohn's disease (at least 30 different risk genes involved; Figure 1), type 1 and type 2 diabetes, inflammatory bowel disease, serum lipid levels, prostate cancer, breast cancer, colorectal cancer, rheumatoid arthritis, age-related macular degeneration, obesity, celiac disease, multiple sclerosis, atrial fibrillation, coronary disease, glaucoma, gallstones, asthma, restless leg syndrome and more than 50 other human diseases, as well as various individual traits (height, hair color, eye color, freckles, and HIV viral set point). The power of approaches such as GWAS therefore lies in their ability to identify the genetic causes of disease, which can be used to predict disease risk and to elucidate signalling pathways associated with disease, information that is of use in drug discovery.
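The case-control comparison described above can be illustrated for a single SNP. The following is a minimal sketch, assuming a simple allelic test (a Pearson chi-square test with 1 degree of freedom on a 2x2 table of allele counts); the function name and the allele counts are hypothetical and are not taken from any study mentioned here.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row = [sum(r) for r in table]
    col = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio

# Hypothetical counts: 2000 cases and 2000 controls, i.e. 4000 alleles per group.
chi2, p_value, odds_ratio = allelic_chi_square(1000, 3000, 800, 3200)
print(f"chi2 = {chi2:.2f}, P = {p_value:.2e}, OR = {odds_ratio:.2f}")
```

In a real GWAS a test of this kind is repeated for every genotyped SNP, which is why very stringent significance thresholds are needed.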

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 1. GWAS for Crohn's disease. (A) Manhattan plot. Significance level (P value on a -log10 scale) for each of the 500,000 SNPs tested across the genome. SNP locations reflect their positions across the 23 human chromosomes. SNPs with significance levels exceeding 10^-5 (corresponding to 5 on the y axis) are colored red; the remaining SNPs are in blue. Ten regions with multiple significant SNPs are shown, labeled by their location or by the likely disease-related gene (e.g., IL23R on chromosome 1). (B) Close-up of the region around the IL23R locus on chromosome 1. The first part shows the significance levels for SNPs in a region of ~400 kb, with colors as in (A). The highest significance level occurs at a SNP in the coding region of the IL23R gene (causing an Arg381 -> Gln change). The light blue curve shows the inferred local rate of recombination across the region. There are two clear hotspots of recombination, with SNPs lying between these hotspots being strongly correlated in a few haplotypes. The second part shows that the IL23R locus harbors at least two independent, highly significant disease-associated alleles. The first site is the Arg381 -> Gln polymorphism, which has a single disease-associated haplotype (shaded in blue) with a frequency of 6.7%. The second site is in the intron between exons 7 and 8; it tags two disease-associated haplotypes with frequencies of 27.5% and 19.2%.
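The y axis and coloring rule of the Manhattan plot in panel (A) can be expressed in a few lines. This is only a sketch of the transformation described in the caption; the function name, the SNP records and their positions are hypothetical.

```python
import math

def manhattan_points(snps, threshold=1e-5):
    """Return (chromosome, position, -log10(P), colour) for each SNP."""
    points = []
    for chrom, pos, p in snps:
        y = -math.log10(p)                       # P = 1e-5 maps to y = 5
        colour = "red" if p < threshold else "blue"
        points.append((chrom, pos, y, colour))
    return points

# Hypothetical SNP records: (chromosome, base-pair position, P value).
example = [(1, 67_400_000, 1e-12), (5, 40_000_000, 3e-4), (16, 49_300_000, 2e-7)]
points = manhattan_points(example)
```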

Various lessons have already emerged from genetic mapping by GWAS:

1.    GWAS work. Before 2006, only about two dozen reproducible associations had been discovered. By early 2008, more than 150 relationships were identified between common SNPs and disease traits. In most diseases studied, GWAS have revealed multiple independent loci, although some traits have not yet yielded associations that meet stringent thresholds (e.g., hypertension). It is not clear whether this reflects inadequate sample size, phenotypic definition, or a different genetic architecture.

2.    Effect sizes for common variants are typically modest. In a few cases, common variants with effects of a factor of ≥2 per allele have been found, e.g. APOE4 in Alzheimer's disease. In the vast majority of cases, however, the estimated effects are much smaller—mostly increases in risk by a factor of 1.1 to 1.5 per associated allele.

3.    The power to detect associations has been low. Given the effect sizes now known to exist, and the need to exceed stringent statistical thresholds, the first wave of GWAS provided low power to discover disease-causing loci. For example, achieving 90% power to detect an allele with 20% frequency and a factor of 1.2 effect at a statistical significance of 10^-8 requires 8600 samples. Thus, although it is unlikely that common alleles of large effect have been missed, GWAS of hundreds to several thousand cases have necessarily identified only a fraction of the loci that can be found with larger sample sizes. This prediction has been empirically confirmed in type 2 diabetes, serum lipids, Crohn's disease, and height. Increasing the power by pooling the samples to perform meta-analysis and replication genotyping has increased this yield to more than 100 replicated loci for these four conditions.

4.    Association signals have identified small regions for study but have not yet identified causal genes and mutations. Genetic mapping is a double-edged sword: local correlation of genetic variants facilitates the initial identification of a region but makes it difficult to distinguish causal mutation(s); luckily, whereas family-based linkage methods typically yield regions of 2 to 10 Mb in span, GWAS typically yield more manageable regions of 10 to 100 kb. These regions have yet to be scrutinized by fine-mapping and resequencing to identify the specific gene and variants responsible. Even when a locus is identified by SNP association, the causal mutation itself need not be a SNP. For example, the IRGM gene was associated with Crohn's disease on the basis of GWAS. Subsequent study suggests that the causal mutation is a deletion upstream of the promoter affecting tissue-specific expression.

5.    A single locus can contain multiple independent common risk variants, e.g. intensive study has already identified seven independent alleles at 8q24 for prostate cancer and two at IL23R for Crohn's disease. Multiple distinct alleles with different frequencies and risk ratios may well be the rule.

6.    A single locus can harbor both common variants of weak effect and rare variants of large effect. In recent GWAS, studies of common SNPs identified 19 loci as influencing low- or high-density lipoprotein (LDL, HDL) or triglyceride levels. Nine of these 19 were already known to carry rare Mendelian mutations with large effects, such as the locus for the LDL receptor (LDLR), which is mutated in familial hypercholesterolemia (FH). Similarly, the genes encoding Kir6.2, WFS1, and TCF2 are all known to carry both rare mutations that cause Mendelian syndromes including type 2 diabetes and common SNPs with modest effects.

7.    Because allele frequencies vary across human populations, the relative roles of common susceptibility genes can vary among ethnic groups. One example is the association of prostate cancer at 8q24: SNPs in the region play a role in all ethnic groups, but the contribution is greater in African Americans. This is not because the risk alleles yet found confer greater susceptibility in African Americans, but because they occur at higher frequencies, contributing to the higher incidence among African American men than among men of European ancestry.
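The sample-size argument in lesson 3, and the dependence on allele frequency noted in lesson 7, can be illustrated with a rough calculation. The sketch below assumes a two-sided allelic test with a normal approximation for the difference between two proportions, equal numbers of cases and controls, and a per-allele odds ratio; the source does not state which formula lies behind its figure of 8600, so the function names and the method are assumptions made for illustration.

```python
from statistics import NormalDist

def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)

def cases_needed(control_freq, odds_ratio, alpha=1e-8, power=0.90):
    """Approximate number of cases (with as many controls) for an allelic test."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)    # two-sided significance threshold
    z_power = norm.inv_cdf(power)
    variance = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    chromosomes = (z_alpha + z_power) ** 2 * variance / (p1 - p0) ** 2
    return chromosomes / 2.0                     # two chromosomes per person

n_common = cases_needed(0.20, 1.2)   # allele frequency 20%, effect 1.2
n_rarer = cases_needed(0.05, 1.2)    # same effect, allele frequency 5%
print(round(n_common), round(n_rarer))
```

Under these assumptions the first call gives a number of the same order as the 8600 quoted above, and the second shows that the same effect at a lower allele frequency requires a several-fold larger sample. This is one reason why a variant can be easier to detect in a population where it happens to be more frequent.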

 

Lessons have also emerged about the functions and phenotypic associations of genes related to common diseases:

1.    A subset of associations involve genes previously related to the disease, e.g. of 19 loci meeting genome-wide significance in a recent GWAS of LDL, HDL, or triglyceride levels, 12 contained genes with known functions in lipid biology.

2.    Most associations do not involve previous candidate genes. In some cases, GWAS results immediately suggest new biological hypotheses—for example, the role of FGFR2 in breast cancer, and CDKN2A and CDKN2B in type 2 diabetes.

3.    Many associations implicate non–protein-coding regions. Although some associated noncoding SNPs may ultimately prove attributable to linkage disequilibrium (LD) with nearby coding mutations, many are sufficiently far from nearby exons to make this outcome unlikely. Examples include the region at 8q24 associated with prostate, breast, and colon cancer, 300 kb from the nearest gene.
A role for noncoding sequence in disease risk is not surprising: comparative genome analysis has shown that 5% of the human genome is evolutionarily conserved and thus functional; less than one-third of this 5% consists of genes that encode proteins. Noncoding mutations with roles in disease susceptibility will likely open new doors to understanding genome biology and gene regulation. Regulatory variation also suggests different therapeutic strategies: modulating levels of gene expression may prove more tractable than replacing a fully defective protein or turning off a gain-of-function allele.

4.    Some regions contain expected associations across diseases and traits. Crohn's disease, psoriasis, and ankylosing spondylitis have long been recognized to share clinical features; the association of the same common polymorphisms in IL23R with all three diseases points to a shared molecular cause. Similarly, multiple variants associated with type 2 diabetes are associated with insulin secretion defects in nondiabetic individuals, highlighting the role of β-cell failure in the pathogenesis of type 2 diabetes (T2D).

5.    Some regions reveal surprising associations. For example, unexpected connections have emerged among type 2 diabetes, inflammatory diseases (two loci), and cancer (four loci). A single intron of CDKAL1 was found to contain a SNP associated with type 2 diabetes and insulin secretion defects, and another with Crohn's disease and psoriasis.

 

From common SNPs to the full allelic spectrum

The current HapMap provides reliable proxies for the vast majority of SNPs at frequencies above 5%, but its coverage declines rapidly for lower-frequency alleles. Such lower-frequency alleles may be particularly important: alleles with strong deleterious effects are constrained by natural selection from becoming too common. We divide these alleles into two conceptually distinct classes:

1.    Common variants with frequencies below 5%. "Common" here refers to variants that occur at sufficient frequency to be cataloged in studies of the general population and measured (directly, or indirectly through LD) in association studies. In practice, this class may include allele frequencies of 0.5% and above.

2.    Rare variants. Most Mendelian diseases involve rare mutations that are essentially never observed in the general population. Rare mutations likely also play an important role in common diseases. Because they are numerous and individually rare, it is not possible to create a complete catalog in the general population. Instead, they must be identified by sequencing in cases and controls in each study. Moreover, because each variant is too rare to provide statistical evidence of association on its own, the mutations must be aggregated as a class to compare their overall frequency in cases versus controls. GWAS of rare variants are already under way for large structural variants through the use of microarray analysis. A recent GWAS of autism revealed that a highly penetrant, recurrent microdeletion and microduplication of a 593-kb region in 16p11.2 explain about 1% of cases. Moreover, several recent studies report that patients with autism and schizophrenia may have an excess of rare deletions across the genome relative to unaffected controls. Although these studies did not identify specific loci (none of the novel loci were observed more than once), they suggest that the universe of rare structural changes contributing to each disease may be as large and diverse as that of common SNPs.
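The idea of aggregating rare variants as a class can be sketched as a simple "burden" comparison: each person is scored as a carrier if any rare variant in the region is present, and carrier frequencies are then compared between cases and controls. The example below assumes a one-sided Fisher's exact test for that comparison; the function names and the genotype data are hypothetical, and this is only one of several ways such an aggregate test could be built.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(case carriers >= a) under the hypergeometric null.

    a = case carriers, b = case non-carriers,
    c = control carriers, d = control non-carriers.
    """
    n_cases, n_controls = a + b, c + d
    carriers = a + c
    denom = comb(n_cases + n_controls, carriers)
    upper = min(carriers, n_cases)
    tail = sum(comb(n_cases, k) * comb(n_controls, carriers - k)
               for k in range(a, upper + 1))
    return tail / denom

def burden_test(case_genotypes, control_genotypes):
    """Collapse variants: a person is a carrier if any rare variant is present."""
    a = sum(1 for person in case_genotypes if any(person))
    c = sum(1 for person in control_genotypes if any(person))
    return fisher_one_sided(a, len(case_genotypes) - a,
                            c, len(control_genotypes) - c)

# Hypothetical data: four rare variants in one gene, 12 cases and 12 controls.
# Each tuple marks presence (1) or absence (0) of each variant in one person.
cases = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0),
         (0, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, 0)] + [(0, 0, 0, 0)] * 6
controls = [(0, 0, 1, 0)] + [(0, 0, 0, 0)] * 11
p_burden = burden_test(cases, controls)
print(f"burden test P = {p_burden:.4f}")
```

No single variant in this toy data set is seen more than twice, so none could show an association alone; only the pooled carrier count differs visibly between the groups.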


The genetic architecture of common disease

Variants so far identified by GWAS together explain only a small fraction of the overall inherited risk of each disease (for example, ~10% of the variance for Crohn's and ~5% for type 2 diabetes). Where is the remaining genetic variance to be found? There are several answers:

1.    At disease loci already identified by GWAS, the locus-attributable risk will often be higher than currently estimated. This is because marker SNPs used in GWAS will typically be imperfect proxies for the actual causal mutation that led to the association signal. The causal gene will often contain additional mutations not tagged by the initial marker SNPs, both common and rare. Determining the contribution of each gene will require intensive studies of variants at each locus.

2.    Many more disease loci remain to be identified by GWAS. GWAS to date have had low statistical power and thus necessarily missed many loci with common variants of similar and smaller effects. The first studies did not have proxies for common structural variants and have failed to capture lower-frequency common variants (0.5 to 5%). Moreover, the vast majority of studies have been performed only in samples of European ancestry. Larger, more comprehensive, and more diverse GWAS will reveal many more loci.

3.    Some disease loci will contain only rare variants. Such loci (if not already found by Mendelian genetics) cannot be identified by study of common variants alone. They will require systematic resequencing of all genes in large samples.

4.    Current estimates of the variance explained are based on simplifying assumptions. Because the genotype-phenotype correlation has yet to be well characterized, the estimates assume that the variants interact in a simple additive manner. Yet gene-gene and gene-environment interactions play important roles in disease risk. Although searches have not yet found much evidence for epistasis, this may simply reflect limited power to assess the many possible modes of interaction, including pairwise interactions and threshold effects. Once patterns of association and interaction are understood, effects of specific genes and environmental exposures on each phenotype may be larger.
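The additive assumption mentioned in point 4 can be made concrete for the simpler case of a standardized quantitative trait. The sketch below assumes Hardy-Weinberg proportions and independent loci, so that a variant with allele frequency p and per-allele effect beta (in phenotypic standard deviations) contributes 2p(1-p)beta^2 to the variance, and contributions simply add up. The frequencies and effect sizes are hypothetical, and estimates for disease risk involve further modelling that is not shown here.

```python
def variance_explained(loci):
    """Sum of 2p(1-p)*beta^2 over independent loci (additive model, HWE)."""
    return sum(2.0 * p * (1.0 - p) * beta ** 2 for p, beta in loci)

# Hypothetical loci: (allele frequency, per-allele effect in SD units).
loci = [(0.30, 0.05), (0.20, 0.08), (0.50, 0.03)]
total = variance_explained(loci)
print(f"fraction of trait variance explained: {total:.4%}")
```

Any gene-gene or gene-environment interaction breaks the assumption that contributions can simply be summed, which is the caveat raised in point 4.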

For the above reasons, it is premature to make inferences about the overall genetic architecture of common disease. Only by systematically exploring each of these directions over the coming years will a general picture emerge—with the likely outcome being that different diseases will each be characterized by a different balance of allele frequencies, interactions and types. Although the proportion of genetic variance explained is certain to grow in the coming years, it is unlikely to approach 100% because of practical limitations, such as the difficulty of detecting common variants with extremely small effects, genes harboring rare variants at very low frequency, and complex interactions among genes and with the environment.


Disease risk versus disease mechanism

The primary value of genetic mapping is not risk prediction, but providing novel insights about mechanisms of disease. Knowledge of disease pathways (not limited to the causal genes and mutations) can suggest strategies for prevention, diagnosis, and therapy. From this perspective, the frequency of a genetic variant is not related to the magnitude of its effect, nor to the potential clinical value that may be obtained.

The path ahead

Given the long-standing success of genetic mapping in providing new insights into biology and disease etiology, and the recent proof that systematic association studies can identify novel loci, our aim should be nothing less than identifying all pathways at which genetic variation contributes to common diseases. To achieve this goal:

(i) Expanding clinical studies. Current studies are underpowered for the types of SNP alleles that we now know exist, and available evidence indicates that increasing sample size will yield substantial returns. Nearly all GWAS to date have been performed in populations of European ancestry. Even if a variant has the same effect in all ancestry groups, it may be more readily detected in one population simply because it happens to have higher frequency. Genetic effects will likely vary across groups because of modification by environment and behavior, which may vary more across groups than does genotype. Many important diseases remain to be studied by GWAS. Disease-related intermediate traits can also offer substantial insight, particularly in conjunction with clinical endpoints. Correlations between genetic variants and phenotypes are limited by the accuracy with which each is measured. The ability to measure genotype now far exceeds our ability to measure phenotype. Continuous ambulatory monitoring, imaging methods, and comprehensive ("-omic") approaches to biological samples all have promise in improving the accuracy of phenotype measurement. Environmental exposures play a larger role in human phenotypic variation than does genetic variation, but environmental exposures are fundamentally more difficult to measure. DNA is stable throughout life, with a single physical chemistry that enables generic approaches to measurement. Environmental exposures are heterogeneous and may be fleeting. Improved methods for measuring environmental exposures, perhaps based on epigenetic marks they leave, are sorely needed.

(ii) Expanding the range of genetic variation. The lowest-hanging fruit will be to resequence loci that have been definitively implicated in disease by Mendelian genetics or by GWAS. Initially, resequencing of coding exons will be easiest to interpret. Rare coding mutations with large effect will be especially valuable, because physiological studies of mutation carriers can help illuminate the biological basis of the disease, and because coding mutations of large effect are more straightforwardly transferred to cellular and animal models for mechanistic studies. Extending GWAS to include structural variants and lower-frequency common variants will require comprehensive catalogs of genomic variation, as well as characterization of LD relationships. With new massively parallel sequencing technologies, an accurate map of all alleles with a frequency of 1% or higher (both single-nucleotide and structural) should be achievable. A "1000 Genomes Project" was recently launched toward this end. Multiple instances of de novo coding mutations at a locus (found by comparing affected individuals with their parents) could provide particularly powerful association information, because the human mutation rate is so low (in the range of 10^-8 per base pair per generation). But identifying de novo mutations without being overwhelmed by false positives will require extraordinary sequencing accuracy (far better than finished genome sequence). Because such studies will be expensive at first, priority should go to disorders with high heritability, where there is an unmet medical need, and for which other approaches have met with limited success. Psychiatric disorders might represent one such target. Eventually, it will become practical to resequence entire genomes from thousands of cases and controls. The problem of interpretation will be much harder for noncoding functional elements, because it is unclear how to aggregate elements to achieve a large enough target size and how to recognize function-altering changes.
New statistical methods will be required to combine evidence from rare and common alleles at a locus and across multiple loci, phenotypes, and nongenetic exposures. A particular challenge will be to identify mutations in regions without known function or evolutionary conservation. There may be inherent limits to our ability to relate phenotypic variation and genotypic variation. To the extent that disease is influenced by tiny effects at hundreds of loci or highly heterogeneous rare mutations, it may be impractical to assemble sufficiently large samples to give a complete accounting.

Implications for biology, medicine, and society

Genetic mapping is only a first step toward biological understanding and clinical application. Creation of disease models, both in human cell culture and nonhuman animals, will be key. Physiological studies in patients classified by genotype may inform disease processes and lead to useful nongenetic biomarkers. Given the limits of human clinical research, rare alleles of strong effect may be more useful than common alleles of weak effect. The extent to which genetic information will figure in "personalized medicine" will depend on whether predictive accuracy beyond conventional measures can be attained, and whether there are interventions whose effectiveness is improved by knowledge of a genetic test. Knowledge of a common variant that increases type 2 diabetes risk by 20% may eventually lead to new understanding and therapeutic strategies, but whether an increase in absolute risk (from 8% to 10%) is useful for patients remains to be seen. Although it is tempting to think that knowledge of individual risk might promote greater adherence to a healthy lifestyle, human behavior is complex and risk estimates are challenging to interpret. Even where genotype can predict response to a drug with a narrow therapeutic window, it cannot be assumed that genetic testing will necessarily lead to improved clinical outcomes. It will be a challenge for the public to understand the difference between relative and absolute risk, and to figure into their thinking the larger component of genetic and environmental factors not yet captured by today's technologies. We must constantly remind ourselves that although genes play a role (and can lead us to new biological insight), our traits are powerfully shaped by the environment, and the solutions to important problems will often lie outside our genes.
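The difference between relative and absolute risk in the type 2 diabetes example above comes down to a single multiplication. The sketch below assumes that the 8% baseline applies to people without the variant and that the 20% increase acts multiplicatively on it; the function name is hypothetical.

```python
def absolute_risk(baseline_risk, relative_risk):
    """Absolute risk in carriers, given baseline risk and relative risk."""
    return baseline_risk * relative_risk

baseline = 0.08                                 # 8% risk without the variant
carrier_risk = absolute_risk(baseline, 1.20)    # 20% relative increase
absolute_increase = carrier_risk - baseline
print(f"carrier risk = {carrier_risk:.1%}, increase = {absolute_increase:.1%}")
```

A 20% relative increase on an 8% baseline gives 9.6%, which the text rounds to 10%. The absolute change is under two percentage points even though the relative figure sounds substantial.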

GWAS in psychiatric illness

Genome-wide significant genetic associations for bipolar disorder and schizophrenia have been shown. Studies of approximately 10,000 individuals have shown strong evidence for association with susceptibility to bipolar disorder at variants within two genes involved in ion channel function: ANK3 (encoding the protein ankyrin-G) and CACNA1C (encoding the alpha-1C subunit of the L-type voltage-gated calcium channel). A similar study in close to 20,000 individuals has shown strong evidence for association with susceptibility to schizophrenia at a variant within ZNF804A (encoding a zinc finger transcription factor). The study of even larger samples may identify additional reliable associations and thereby extend knowledge of the proteins and biological pathways involved in illness.

Importance of phenotype definition

The importance of phenotype definition and selection on genome-wide association findings is demonstrated strikingly by work on type 2 diabetes where the gene FTO was robustly associated with illness in a collaborative meta-analysis. However, association at FTO was not present at all in one of the three samples in the meta-analysis although it was highly significant in one of the other samples of similar size. The difference was caused by phenotypic heterogeneity: in the sample showing no association, cases were not included if the individuals were obese. No such exclusion criterion was present in the sample with the strong effect. Subsequent work showed that FTO influences diabetes risk through an effect on body mass. This demonstrates that phenotype variation can be critical to the ability to identify susceptibility variants. Furthermore, taking account of phenotype variation across samples can provide critical information about the mode of action of a susceptibility locus.

Psychiatric scenarios that might produce results similar to the obesity-diabetes story include presence or absence of prominent psychotic features in bipolar disorder or prominence of anxiety in recurrent depression.

Genetic dissection of psychiatric phenotypes

What about psychiatric phenotypes? Psychiatric diagnoses can be considered 'the weak component of modern research', defined solely by descriptive, usually behavioural, criteria. Although these phenotype definitions are highly heritable, and hence are valid and sensible starting points for genetic research, it is generally agreed that the most useful biological categories and/or dimensional definitions and measures are still unknown. The strikingly high level of co-occurrence of different diagnoses within the same individual (comorbidity) almost certainly reflects a substantial overlap in the underlying biology of currently defined syndromes. For example, deletion of chromosome 22q11 has been associated with childhood autism and ADHD as well as adult mood disorders and psychosis.

Molecular genetics will not provide a simple, gene-based classification of psychiatric illness (as it will not for other common familial illnesses). The notion that there is a gene for one or more psychiatric disorders is inappropriate and unhelpful. Rather, there is a complex relationship between genotype and phenotype that involves multiple genes and environmental factors, together with stochastic variation. Nonetheless, molecular genetic findings can be expected to help delineate the relationship between specific biological pathways/systems and broad patterns, or domains, of psychopathology.

Types of analyses that may be relevant

A range of analytic approaches may be relevant to understanding the relationship between genotype and phenotype for psychiatric traits, including approaches designed both to discover new pathologically relevant genetic variants and to characterise the phenotypic spectrum associated with robustly associated variants. First, we can explore whether individual genetic variants increase risk across multiple diagnostic categories. For example, genes may exist that alter risk for both schizophrenia and autism, or for schizophrenia and bipolar disorder, or for bipolar illness and recurrent depression. Second, we can attempt to identify risk genes for psychosis, depressed mood or some other domain of psychopathology regardless of the syndrome in which they occur. Third, we can look for disease-modifying effects. For example, genes may exist which do not influence risk for the diagnostic category of schizophrenia but which, when an individual has this diagnosis, alter the probability that they have auditory hallucinations or early onset. Fourth, instead of starting with phenotypes and then looking at genotypes, we could reverse the order. We might start with a single gene or genotype of interest and study its phenotypic profile. Fifth, we could apply one of a range of advanced statistical tools to define novel diagnostic entities (whether they are categories or dimensions) that would 'make more sense' from a genetic perspective. Sixth, instead of focusing on single genetic variants, we could consider a large set of polymorphisms (perhaps tens of thousands) and use aggregate measures of their overall contribution to phenotypic susceptibility to seek to define 'signatures' of genetic variants, the patterns of which could be compared across phenotypes. This approach, which will be particularly useful if psychiatric phenotypes are highly polygenic (i.e. many risk genes, each of small effect on risk), has recently been used to demonstrate a substantial overlap in polygenic contribution to schizophrenia and bipolar disorder.
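The sixth approach, an aggregate measure over many polymorphisms, is often implemented as a weighted sum of risk-allele counts. The sketch below assumes that form; the weights stand in for per-allele effect estimates (for example log odds ratios) from a separate discovery sample. The function name, weights and genotypes are hypothetical, and the studies referred to above may have used a different construction.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 per SNP) for one person."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is needed per SNP")
    return sum(count * weight for count, weight in zip(allele_counts, weights))

# Hypothetical per-allele weights estimated in a discovery sample.
weights = [0.10, 0.05, -0.02]

# Hypothetical genotypes (risk-allele counts) for two people in a target sample.
score_a = polygenic_score([2, 1, 0], weights)
score_b = polygenic_score([0, 1, 2], weights)
print(score_a, score_b)
```

If scores built from weights estimated in one disorder also differ between cases and controls of a second disorder, that suggests a shared polygenic contribution, which is the kind of overlap described above.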

The ongoing major investments of time and money in genome-wide association studies for psychiatric disorders have the potential to contribute to the identification of pathways involved in illness and to help psychiatry move towards approaches to diagnosis and treatment that are grounded in a better understanding of pathogenesis.

Types of analysis to delineate the relationship between genotype and phenotype

Phenotype refers to the measurable clinical characteristics of individuals, which may be considered at several levels (e.g. disorder, syndrome, factors or domains of psychopathology, or individual symptom items), and genotype refers to the measured genetic variation, which may also be considered at several levels (e.g. individual allele, individual polymorphism (SNP), gene, gene family, biological pathway, or other large set of polymorphisms, including a polygenic 'signature'). It is possible to: (a) start with phenotype(s) and seek associated genotype(s) (traditional approach); (b) start with genotype(s) and seek correlated phenotype(s) ('reverse phenotyping' or 'phenotype refinement' approach); or (c) consider all phenotype and genotype data together and seek patterns of genotype-phenotype correlation (an approach that makes minimal prior assumptions about both nosology and pathogenesis).

Phenotype -> genotype
a.    Seek susceptibility across traditional diagnostic categories (uses combinations of disorders v. controls).
b.    Seek susceptibility to specific domains of psychopathology (uses cases with specific clinical features v. controls).
c.    Seek modifier genes for specific clinical features (uses cases with specific clinical features v. cases without specific clinical feature).
d.    Look for patterns (or signatures) of large numbers of associated SNPs that can then be compared across samples or diagnoses.

 

Genotype -> phenotype
a.    Identify the phenotypic spectrum associated with a specific genotype of interest.

 

Genotype <-> phenotype
Look for patterns of correlation in data with minimal prior assumptions (i.e. seek novel, genetically valid diagnostic entities).

 

