University of Phoenix Translational Bioinformatics Presentation

Create your PowerPoint presentation with speaker notes that critically address each of the following elements. (Remember that your presentation slides should have short, bullet-pointed text with your speaker notes including the bulk of the information provided in the following list.)

  • Summarize the Human Genome Project.
  • Evaluate how genome mapping can explain the cause of and prevention of one disease.
  • Explain how bioinformatics will alter the path of health informatics.
  • Evaluate the role of precision medicine and the funding provided for this initiative within health care.

CHAPTER 30 Bioinformatics
    Molecular biology produces vast quantities of data about the genetic and functional state of organisms.
    Advancements in genetic sequencing methods quickly moved from the study of individual genes to the
    whole genomes of organisms, with decoding of their structure and function. The small genome of the
    pathogenic bacterium Haemophilus influenzae was the first whole genome of a living organism to be
    completely sequenced. Only small viral genomes and parts of other genomes had been sequenced
    before that point. Inspired by this success, scientists embarked on the task of sequencing and analyzing
much larger and more complex genomes of multicellular organisms such as the fruit fly Drosophila
melanogaster, the roundworm and, subsequently, the human genome.
Bioinformatics (or computational biology) was born out of such genome sequencing projects. It was
    originally limited to developing algorithms that automated the analysis of gene sequences, but it has
    expanded to supporting the analysis of many other elements of cell biology (Box 30.1). Indeed,
    bioinformatics is now an essential part of the biomedical ‘workbench’, providing computational tools for
    the storage of biological data and for data integration, visualization and analysis. Biology and
    bioinformatics now advance in synergy with each other, and together they have dramatically increased
    our ability to observe and analyze the living world at the molecular level. We now recognize just how
    different each human, each cancer and each infectious pathogen are and can exploit these individual
    molecular signatures to treat disease more effectively.
    Initially, bioinformatics and health informatics were seen as separate disciplines because their research
    subjects and communities did not overlap. However, they use very similar methodologies, and both
    require access to phenotypic data stored in the patient record. Bioinformatics-enabled tests are now
    part of the investigation, treatment or prevention of many diseases, and gene or protein data appear in
    the patient record.
    In previous chapters, we reviewed technologies for decision support, data sharing, data mining and
    machine learning. In this chapter, we examine the critical role that these technologies play in the
    success of computational biology and also explore the inherent constraints that come when working
    with biomedical data.
    Box 30.1 The Bioinformatics Family
    Genomics – The study of the DNA sequence of genes and their variations among individuals. This
includes several specialties. Comparative genomics studies similarities and differences between
genomes of evolutionarily similar and distant organisms; functional genomics focusses on identifying the
    functional role of these genes; and metagenomics applies genomic techniques to whole communities of
    different organisms in their natural environment, thus bypassing the need for isolation and cultivation of
    individual species.
    Proteomics – The study of the proteins actually expressed within a cell because some DNA sequences
    are transcribed and proteins are manufactured, whereas others are not. This includes determining the
    multidimensional structure of proteins, the degree and timing of their expression and protein
    interactions within the cell and across the cell membrane.
    Transcriptomics – The study of messenger RNA (mRNA) molecules involved in the transcription of DNA
codes into proteins and their transport from the nucleus to the cytoplasm.
    Metabolomics – The computational study of chemical pathways for molecular synthesis or degradation
    in support of the functioning of cellular metabolism.
    Pharmacogenomics – The identification of genetic markers associated with differential responses to
    drug therapy, including adverse drug effects, to improve the quality and safety of therapeutic
    interventions, as well as influence drug design.
    30.1 Bioinformatics can answer many questions about the role of genes in human disease
    Now that many human genomes have been sequenced, we are said to be in a post-genomic era.
    Knowing the human genome, we can begin systematically to deconstruct how this genetic programme
    influences biochemistry, human physiology and disease (Box 30.2). Many kinds of biological information
    are available to help in this process of discovery including DNA sequences, gene expression or activity
    measures, protein structures, folding and interaction effects. We also have rich collections of data on
    the human phenotype, and indeed every health record contains such data (e.g. blood pressure, height).
    The fundamental task in bioinformatics is to use computational methods to connect these biological
    measures with phenotypic measures, so that we understand how changes in one lead to changes in the
    other. The core informatics tasks underpinning such analyses thus include methods that:
1. Rapidly measure gene sequences and the degree to which they are actually expressed (turned on) in a
given tissue, as well as measure other biological processes such as protein behaviour or metabolic activity.
    2. Collect clinical (phenotypic) data that measure human disease states.
    3. Correlate these biological and phenotypic data sets, to identify important causal associations.
    4. Store and disseminate these biological data, such as gene sequence data sets.
    Functional genomics is the name given to the overall enterprise of deconstructing the genome and
    assigning biological function to genes. Functional genomics aims to use genome-to-phenotype
    associations to answer broad questions about the role of specific genes in health and disease. Examples
    of some of the questions that can be asked are as follows:
    Box 30.2 A Gene Biology Primer
All functions of living organisms are regulated by sets of instructions encoded in their deoxyribonucleic acid
(DNA) molecules. The entire complement of DNA molecules of each organism stored in its
    chromosome(s) is known as its genome. The overall function of the genome is to drive the generation of
    specific molecules, mostly proteins, which then regulate the metabolism of a cell and its interaction with
    other cells.
    Structure of DNA
    Each molecule of DNA is assembled from a pair of DNA strands. The strands are composed of varying
    sequences of the nucleotide base molecules adenine (symbolized by the code A), thymine (T), cytosine
    (C) and guanine (G) (Table 30.1). Each base is able to bond with a different base, with A binding to T and
C binding to G. The two DNA strands are joined by these pairings, which means that one strand is the
direct complement of the other: their base sequences must match according to the pairing rules.
The physical properties of the bonds that form between the bases of two complementary
    strands of bases create the twisting helix of the DNA molecule. This complementarity between
    nucleotides also means that you can automatically deduce the sequence of one strand if you know the
    sequence of nucleotide bases along the other strand.
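The complementarity rule described above lends itself to a simple computation. A minimal Python sketch (Python is used here purely for illustration; the function names are ours):

```python
# Deduce one DNA strand from its complement using the A-T / C-G pairing rules.
PAIRING = str.maketrans("ATCG", "TAGC")

def complement(strand: str) -> str:
    """Return the complementary strand, base for base."""
    return strand.translate(PAIRING)

def reverse_complement(strand: str) -> str:
    """Return the complement read in the conventional 5'->3' direction."""
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```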
DNA sequence sizes are often given in base pairs (bp). Thus the DNA sequence in Figure 30.1 is 15 bp
    long. Larger units, such as kb (1000 or kilo-bp) and Mb (Mega-bp) are also used. Microbial organisms
    generally have a single, circular genome between 0.5 and 13 Mb long. However, the human genome is
stored across 46 chromosomes and is roughly 3 Gigabases (billions of bases) long. It is estimated to
contain approximately 20 000 protein-coding genes. Genes range in size from small (1.5 kb for a globin gene) to large (about 2000 kb
    for the Duchenne muscular dystrophy locus).
    DNA can be copied, which allows a cell to divide into two daughter cells, each with the same DNA copy,
    through the co-ordinated action of many molecules (i.e. enzymes), including DNA polymerases
    (synthesizing new DNA), DNA gyrases (unwinding the molecule) and DNA ligases (concatenating
    segments together).
    Transcription of DNA into RNA
    For the DNA (or genome) to direct or effect changes in a cell, a transcriptional programme needs to be
    activated to create new proteins in the cell, based upon the instructions in the DNA. DNA itself remains
    in the nucleus of the cell, but most proteins are needed in the cytoplasm of the cell, where most of a
    cell’s functions are performed. Thus, DNA must be copied into a transportable version called ribonucleic
    acid (RNA). A gene is a single segment of DNA that is transcribed into RNA.
    The RNA is generated from the DNA template by a process called transcription. The RNA sequence of
base pairs corresponds to that in the DNA, except that the nucleotide uracil (U) is substituted for thymine (T).
    A ribosome need not start reading mRNA at the beginning of a sequence or only in one direction. This
    means that the actual protein built depends on where one starts to read the code and on the direction
    in which it is read. Depending on whether one starts at the first, second or third base pair of a triplet,
    there are three different reading frames possible in one direction (Figure 30.2). The translation of a DNA
    code into amino acids, and hence the resulting protein, can be completely different in each of the three
    cases. A gene often specifies its reading frame by starting with a start codon for amino acid methionine,
or AUG, to initiate translation. In much the same way, the codons TGA, TAA and TAG (UGA, UAA and UAG in mRNA) act as stop
codons. Thus, if we know where a protein-coding region starts (i.e. its start codon) in a DNA sequence, we
    can build bioinformatics-enabled techniques to translate this DNA sequence into a corresponding amino
    acid sequence.
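The reading-frame and start/stop-codon logic above can be sketched in code. A minimal Python illustration (the example sequence and function names are hypothetical; a real gene caller is far more elaborate):

```python
# Split a DNA sequence into codons in each of the three forward reading
# frames, and locate a start codon (ATG) followed by an in-frame stop
# codon (TGA, TAA or TAG), per the conventions described above.
STOPS = {"TGA", "TAA", "TAG"}

def codons(seq: str, frame: int):
    """Codons of `seq` in reading frame 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def find_orf(seq: str):
    """Return (frame, start_index) of the first ATG with an in-frame stop, else None."""
    for frame in range(3):
        cs = codons(seq, frame)
        for i, c in enumerate(cs):
            if c == "ATG" and any(c2 in STOPS for c2 in cs[i + 1:]):
                return frame, frame + 3 * i
    return None

seq = "CCATGAAATGA"    # hypothetical sequence with ATG at index 2
print(codons(seq, 0))  # ['CCA', 'TGA', 'AAT']
print(find_orf(seq))   # (2, 2)
```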
Table 30.2 Codes for amino acids (excerpt)
One-letter code   Three-letter code   Amino acid name
D                 Asp                 Aspartic acid
E                 Glu                 Glutamic acid
    Figure 30.2 Open reading frames. There are three possible ways to read a sequence of DNA into
    different codons, and each leads to translation into one of three different proteins.
    Processing of amino acid chains
    Once a protein is formed, it has to find the right place to perform its function. It may be a structural
protein that forms part of a cell’s own cytoskeleton, a cell membrane receptor or a hormone that is
    secreted by the cell. A complex cellular apparatus determines this translocation process. One
determinant of location and handling of a polypeptide comes from the structure of its initial or signal
    peptide. The cellular translocation machinery recognizes this header sequence of amino acids. The
    ribosomal-mRNA complex can be directed to stop temporarily, move to a specific location and then
    resume protein assembly. Alternatively, some proteins are delivered after they are fully assembled, and
chaperone molecules can prevent premature folding until the protein reaches its correct destination.
    Epigenetic factors alter gene expression in normal development and disease
    Whether a gene is expressed in an organism (i.e. is transcribed and leads to the creation of a protein) is
    partly regulated by the genetic structure itself, with structures such as promoter sites shaping whether a
    particular gene is active. There are also epigenetic processes (external to DNA) that can affect gene
    expression and are potentially inheritable. For example, DNA methylation (the addition of a methyl
    group to cytosine or adenine) affects whether a gene can be transcribed. Changing patterns of
    methylation are a normal part of the process of organism development, and they help explain how the
    same genome can differentiate into functionally different tissues. Methylation can also be altered by
    environmental factors, and aberrant methylation patterns may underpin the abnormal behavior of some
cancer cell lines, independent of any mutation in the underlying genetic sequence (Craig and Wong,
2011).
    30.3 Gene sequence alignment methods are needed to assemble DNA fragments into plausible
    sequences as well as to compare sequences
    Neither conventional Sanger sequencing nor high-throughput sequencing can decode a whole genomic
    sequence at once. Both produce pieces or ‘reads’ of DNA that have to be joined together. Because the
    various sequencing techniques rely on different biochemical methods, they produce different types of
    raw data. Sanger sequencing produces high-quality reads about 1000 bp in length. Newer, more rapid
    methods produce much shorter and slightly lower-quality reads of 100 to 500 bp. Many billions of such
    short sequences can be produced during the sequencing of a single genome. Such massively parallel
    sequencing is a very cost-effective way to decode large numbers of genomes. Given these variations in
    measurement, different computational tools are needed for each. In particular, the assembly of huge
    numbers of short reads into longer and high-quality genomes is not a trivial task.
    Sequence assembly has a number of basic steps (Scheibye-Alsing et al., 2009). First, the degree of
    alignment or overlap between every fragment needs to be calculated. Next, we recognize that the
    process of creating the gene fragments involves breaking up and then reading multiple copies of the
    same genome. This means that the same information is repeated across different fragments, but each
    fragment may start and end in a different place. Thus, we choose two candidate fragments with the
    largest degree of overlap and assume that they both come from the same sequence, and we merge
    them, assuming the overlapping portion is a single and common sequence. The process is repeated until
    we have assembled what is known as the shortest common super-sequence.
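The greedy merge procedure just described can be sketched as follows (a toy illustration only; production assemblers use far more robust algorithms and must cope with sequencing errors and repeats):

```python
# Toy greedy assembly: repeatedly merge the pair of fragments with the
# largest suffix/prefix overlap, approximating the shortest common
# super-sequence described above.
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags):
    frags = list(frags)
    while len(frags) > 1:
        # Find the ordered pair (i, j) with maximal overlap and merge them.
        n, i, j = max(((overlap(a, b), i, j)
                       for i, a in enumerate(frags)
                       for j, b in enumerate(frags) if i != j),
                      key=lambda t: t[0])
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]

# Three overlapping reads of the hypothetical sequence ATTAGACCTG
print(greedy_assemble(["ATTAGAC", "AGACCTG", "TAGACCT"]))  # ATTAGACCTG
```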
    Figure 30.5 Global and local alignments of two hypothetical protein sequences. Vertical bars between
sequences indicate the presence of identical amino acids. Dashes indicate sequences not included in the alignment.
    Sequence alignment is the basic process that underpins assembly, and it can involve comparing two
    (pair-wise alignment) or more (multiple sequence alignment) DNA or protein sequences and searching
    for regions of similarity. Sequence alignment is useful not just for assembly, but also for discovering
    functional and structural information from sequences. Sequences that are similar probably have the
    same function or have a common ancestor. There are two forms of pair-wise alignment – global and
    local (Figure 30.5). With global alignment, an attempt is made to align the entire sequence. This may
    include gaps either in the middle of the alignment or at either end of one or both sequences. Sequences
that are similar and of similar length are most suitable for this type of comparison. In contrast, local
    alignment looks for shorter stretches of two sequences with the highest density of matches.
    Global and local alignments are accomplished by different algorithms. In local alignment, the alignment
    process stops at the end of regions of strong similarity, and higher priority is given to finding these local
    regions. Sequence similarity scores are calculated from the sum of the number of identical matches or
    conservative (high-scoring) substitutions, divided by the total number of aligned sequence characters.
    Gaps are usually ignored, but some algorithms employ penalties for gaps. Percent identity is the
    proportion of aligned positions in which sequence characters are identical.
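The percent-identity definition above can be computed directly for a pair of already-aligned sequences. A minimal sketch (gap and scoring conventions vary between tools):

```python
# Percent identity for two aligned sequences of equal length, with gaps
# written as '-', following the definition given above.
def percent_identity(a: str, b: str) -> float:
    assert len(a) == len(b), "sequences must be aligned to the same length"
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("GATTACA", "GACTACA"))  # 6 of 7 positions match
```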
Typical examples of global and local alignment algorithms are the Needleman-Wunsch and Smith-Waterman algorithms, respectively. The computational cost of such algorithms is of order O(nm),
meaning that the amount of time required to complete the alignment grows linearly with the product of
the two sequence lengths n and m. This makes the task of searching for exact matches in a database
    containing hundreds of billions of bases of DNA sequence such as GenBank rather unwieldy. Not
    surprisingly, heuristic methods of approximate alignment that may not guarantee an optimal solution
but deliver faster results are often favoured. The classic approximate alignment method is BLAST, or
Basic Local Alignment Search Tool. BLAST looks for common patterns shared by two sequences
    and tries to extend these to obtain longer matches. It can deliver a high speed of search because it does
    not search all the sequence space, nor does it aim to find the optimal alignment at all costs. Many public
databases, including GenBank, have automated servers allowing BLAST searches against stored sequences.
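As an illustration of the O(nm) dynamic programming involved, here is a minimal sketch of the Needleman-Wunsch global alignment score, assuming a simple +1 match, -1 mismatch, -1 gap scheme (real tools use substitution matrices and affine gap penalties):

```python
# Needleman-Wunsch global alignment score. The O(n*m) table filled here
# is what makes exhaustive search of very large databases unwieldy.
def nw_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]  # row for the empty prefix of a
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
        prev = curr
    return prev[m]

print(nw_score("GATTACA", "GATTACA"))  # 7 (perfect match)
```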
    Alignment is a critical step for several bioinformatics techniques. First, it enables mapping of sequence
    reads onto a template genome. Second, as we saw earlier, it enables genome assembly. Genome
    assembly is the process of combining sequence reads into stretches of DNA called contigs, based on
sequence similarity among fragments. The consensus sequence for contigs is based on the highest-quality nucleotide in any given read at each position. Several computational algorithms have been
    developed to assist with the assembly of fragments of sequence reads by finding overlapping regions. In
    particular, very short reads generated by high-throughput sequencing are assembled using de Bruijn
    graphs, which are a way to transform sequence data into a network structure (Pevzner and Shamir,
    2011). Sequencing errors are more common with modern or next-generation sequencing methods and
    tend to occur at the end of fragments, and short overlaps between short fragments can be missed.
    Another important application for alignment is the assembly of metagenomic data that come from
    several distinct organisms with different genome structures and properties. Finally, alignment methods
    also support the task of gene prediction or annotation. Gene prediction (or gene calling) is the task of
    trying to identify which genes are present in a DNA sample. This is typically done using an alignment
    method such as BLAST to compare the sample sequence against known gene sequences in databases.
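The de Bruijn construction mentioned above can be sketched briefly: nodes are (k-1)-mers, and each k-mer in a read contributes an edge from its prefix to its suffix (a toy illustration; real assemblers add error correction and graph simplification):

```python
# Transform reads into a de Bruijn graph: each k-mer yields an edge
# from its (k-1)-mer prefix to its (k-1)-mer suffix.
from collections import defaultdict

def de_bruijn(reads, k: int):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return dict(graph)

g = de_bruijn(["ATGGC", "TGGCA"], k=3)
print(g)  # {'AT': ['TG'], 'TG': ['GG', 'GG'], 'GG': ['GC', 'GC'], 'GC': ['CA']}
```

Repeated edges (here TG→GG and GG→GC) record the fact that the same overlap was seen in more than one read, which is what assembly algorithms exploit.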
    The alternative approach to looking up databases of known genes is to try to identify possible coding
    regions in DNA by using first principles (or ab initio gene calling). Such algorithms may use machine
    learning to train on the features of known genes from the same or related organisms. For example, one
    could learn a multinomial sequence model that assumes the nucleotides in a sequence are independent
    of each other and are randomly distributed. In other words, the probability of observing any of the four
nucleotides (P) is the same and does not depend on sequence position, i.e.:
P(A) = P(C) = P(G) = P(T) = 1/4
    Markov models can also be built, and in these sequence models, the probability of observing a given
    nucleotide now does depend on the nucleotide preceding it in sequence.
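The two sequence models can be compared concretely. A minimal sketch, with hypothetical probabilities chosen purely for illustration:

```python
# Log-likelihood of a sequence under the two models described above:
# a multinomial model (each base independent, P = 1/4) and a first-order
# Markov model whose transition probabilities depend on the previous base.
import math

def multinomial_loglik(seq: str) -> float:
    return len(seq) * math.log(0.25)

def markov_loglik(seq: str, initial, transition) -> float:
    ll = math.log(initial[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        ll += math.log(transition[prev][curr])
    return ll

initial = {b: 0.25 for b in "ACGT"}
# Hypothetical transitions biased towards repeating the previous base
transition = {b: {c: (0.4 if b == c else 0.2) for c in "ACGT"} for b in "ACGT"}
print(multinomial_loglik("AAAA"))                  # 4*ln(0.25)
print(markov_loglik("AAAA", initial, transition))  # ln(0.25) + 3*ln(0.4)
```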
    Different file formats have been introduced for sequence data to assist in alignment and assembly.
    FASTA is a standard format for recording nucleotide or peptide sequences. Nucleotides and amino acids
    are represented using single-letter codes (see Tables 30.1 and 30.2). FASTQ is an extension of the FASTA
    format. It stores an additional numeric quality score (PHRED) for every nucleotide in a sequence, which
    records the likelihood that a nucleotide read is an error. Unfortunately, there is no uniform standard for
    encoding these quality scores, and different PHRED scales are in common use.
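Decoding quality characters is straightforward once an offset is assumed. A minimal sketch using the common Sanger/Illumina 1.8+ offset of 33 (other scales exist, which is exactly the standardization problem noted above):

```python
# Convert a FASTQ quality line into numeric PHRED scores, and a PHRED
# score into the error probability it encodes.
def phred_scores(quality_line: str, offset: int = 33):
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q: int) -> float:
    """A PHRED score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

print(phred_scores("II5!"))   # [40, 40, 20, 0]
print(error_probability(20))  # 0.01, i.e. a 1-in-100 chance the base is wrong
```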
    The SAM (Sequence Alignment Map) format is a file format used for storing information about sequence
    alignments. Although SAM files are readable by humans, a binary representation called the BAM format
    is used for efficient storage and computation. VCF (Variant Call Format) files are used to store data
    about gene sequence variations (single nucleotide polymorphisms [SNPs]). They are efficient because
    only one full reference genome needs to be stored, and other genomes need to record only their
    variations with respect to the reference.
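The storage saving can be illustrated with a toy substitution-only example, loosely in the spirit of VCF (real VCF records are far richer, covering insertions, deletions, genotypes and metadata):

```python
# Reconstruct a sample sequence from a reference plus a list of
# (position, alt_base) substitutions; only the variants need be stored.
def apply_snps(reference: str, snps):
    seq = list(reference)
    for pos, alt in snps:  # pos is 0-based here for simplicity
        seq[pos] = alt
    return "".join(seq)

print(apply_snps("ACGTACGT", [(1, "T"), (5, "A")]))  # ATGTAAGT
```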
    30.4 Bioinformatics is dependent on maintaining publicly accessible research and clinical databases
    Scientists not only publish their research results in biomedical journals, but also must add any new
    genetic sequences arising from the research to public databases, so that these sequences are accessible
    by other researchers. Without such data sets, it would be extremely difficult to build up a rich picture of
    the extent of variation in biological function that exists, and our ability to associate disease with such
    variation would be very limited. There are two broad types of genetic databases:
    • Databases of known genetic sequences that can be used to identify unknown sequences obtained
    from patients – BLAST alignment methods allow for an efficient search over the sequence data stored in
    genetic databases. A search involves comparing a query nucleic acid or protein sequence against each
    sequence in the database. If a clear match is found, then gene function may be predictable without
    further laboratory experiments.
    • Comprehensive descriptions of genetic disorders – Growing numbers of databases contain genetic
    data of direct clinical relevance, including the Online Mendelian Inheritance in Man (OMIM) database,
    which provides information on genetic disorders.
The main resources for the storage and distribution of sequences are the members of the International
Nucleotide Sequence Database Collaboration, a consortium comprising GenBank (sponsored by the National
Center for Biotechnology Information, NCBI), the European Molecular Biology Laboratory (EMBL) database
at the European Bioinformatics Institute, and the DNA Database of Japan (DDBJ) from the National Institute
    of Genetics, Japan. These databases contain the same sequence entries, and they exchange data on a
    daily basis, but they do have slight differences in the way in which the information is presented. The
    explanations, descriptions and classifications are in ordinary English, whereas the structure is systematic
    enough to allow computer programs easily to read, identify and manipulate the various types of data.
    SWISS-PROT is a major database that collects confirmed protein sequences with their annotations.
Public databases provide reference genomes to assist with the identification of gene sequence variants
    Public databases contain reference genomes that are well defined, functionally annotated if possible
    and ideally are of high quality with minimal risk of sequencing errors. Typically, reference genomes are
    manually curated and annotated by experts, with the intention of being a gold standard. These
    properties make reference genomes useful templates to guide the assembly of newly sequenced
    genomes or to compare with newly assembled genomes of unknown function.
    Such comparisons involve the alignment of reference and new sequences to identify any differences,
    similarities and the distance between them. A genetic distance is a measure of how many changes are
    needed to go from one sequence to another. The notion is related to the computer science concept of
    an edit distance, which is a measure of how many symbols need to be changed in one string of symbols
    to reach another known string. Distance measures produced by alignment studies enable closely related
    sequences to be clustered together (e.g. to differentiate different cancer lines), or they can suggest how
    closely related different organisms are, such as viruses or bacteria. Reference genomes can also help
    minimize data storage needs because new sequences need record only the sequence elements that are
    different from the reference standard. Such differences may represent less than 0.5 per cent of the full
    sequence data.
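The edit-distance concept referred to above is commonly implemented as the Levenshtein distance. A minimal sketch:

```python
# Levenshtein edit distance: the minimum number of single-symbol
# substitutions, insertions or deletions needed to turn `a` into `b`.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # delete x
                            curr[j - 1] + 1,        # insert y
                            prev[j - 1] + (x != y)))  # substitute if needed
        prev = curr
    return prev[-1]

print(edit_distance("GATTACA", "GACTATA"))  # 2 substitutions apart
```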
    As the number of sequenced genomes has grown, it has become very clear that the early reference
    standard sequences do not necessarily represent the most ‘typical’ sequences. It may thus be preferable
    at any point to look at all the known sequences associated with a particular tissue, disease or organism
    and to select from them one that is best suited to being the gold standard. This can be achieved by
finding the sequence that is most similar to the majority of other sequences. As we saw earlier, high-quality sequencing depends on the sequencing technology used, as well as the quality of the alignment
    and assembly algorithms. This means that our understanding of a genome improves with time, as
    multiple studies zero in on the most likely composition of a sequence. In recognition of this process,
    submissions to public databases go through a number of stages. Draft genomes usually contain only
    minimally filtered data and are likely to have regions of poor quality. Despite such shortcomings, draft
    genomes are inexpensive to produce and can still possess useful information. In contrast, finished
    genomes contain fully annotated sequences that have been manually edited for more than 90 per cent
    of the genome’s length and have less than 1 error per 100 000 base pairs. Such sequences are the
    source of high-quality reference genomes. Given the great difference in cost and effort involved in going
    from draft to finished sequences, it should be no surprise that there is an ever-widening gap between
    the number of finished and draft genomes.
    Sequence analysis has become the cornerstone of modern biology, and many informatics tools and
    genome browsers have been developed to make data analysis and visualization more intuitive and
    effective for scientists and clinicians. Most importantly, genome sequencing provides a reference point
    onto which other types of biomedical data can be layered.
    The clinical relevance of many genomic variants remains uncertain and may require additional gene or
    protein functional analysis
    We are interested in identifying genetic variants that may influence disease, treatment selection and
    dosing or adverse drug events. Such variants fall into three broad categories:
• Those with a clear clinical interpretation, e.g. when a disease is caused by a single gene (monogenic disorders).
    • Those associated with disease but with unknown causal role.
    • Those with no currently known association with any disease.
    As discoveries are made and new genotype–phenotype associations are uncovered, genetic variations
    will move among these categories. Several strategies can help in the analysis of genes known to be
    associated with disease and can explore whether a gene is truly related to pathogenesis:
    • Focussing on the very stable or conserved parts of the genome, where gene variants are more likely to
    be significant.
    • Identification of variants resulting from insertions, deletions and frame-shift mutations leading to the
    modification or elimination of target proteins.
    • Studies of the expression of the target RNA or protein suspected of being modified.
    • Analysis of biochemical networks, metabolic pathways and signalling pathways associated with altered
    genes, which can sometimes be done by simulation.
    Currently, most gene identification and analysis algorithms infer the relationships among genes by using
    sequence similarity. The concept that genes are likely to be linked to similar biological processes if they
    have similar sequences (‘guilt-by-association’) is one of the key assumptions in bioinformatics. However,
    two structurally very similar molecules may still behave quite differently biochemically.
    30.5 Microarrays have enabled the large-scale measurement of gene activity
    In the past, geneticists identified the genes associated with a disease by obtaining DNA from a
    peripheral blood sample or a buccal smear from patients known to be at risk of the disease. However,
    the effects of some genetic patterns on disease are so small that it is very difficult to detect statistically
    significant correlations between the genes and disease through such population studies alone. This is
    especially true when many genes are involved in a disease. In such cases, rather than looking for
    evidence through statistical associational studies across a population, the alternative is to conduct
    functional genomic studies. Here we directly measure the activity of the genes in a particular
    biochemical mechanism. There are many technologies that perform such quantitative measurements of
    gene expression. Microarray measurement systems, which simultaneously measure the activity of many
    genes from a single tissue sample on a chip, have enabled large-scale functional testing.
    Microarrays, also known as ‘DNA chips’, are composed of a high-density array of DNA probes, which are
    placed on a glass or nylon grid. Each probe binds only to a pre-defined DNA pattern, and a single array
    can contain many hundreds of thousands of such different probes. This quantitative change in the scale
    of gene activity measurement has led to a major change in our ability to understand regulatory
    processes occurring at the cellular level.
    In microarray studies, one takes a sample of biological tissue. Its mRNA is then extracted, and a
    fluorescence-tagged complementary DNA (cDNA) copy of this mRNA is made. This tagged cDNA copy,
    called the sample probe, is then hybridized to a slide containing a grid of single-stranded cDNAs called
    probes. The probes are segments of DNA in which the sequences are already known and that have been
    built or placed in specific locations on this grid. A DNA sample probe hybridizes only with its cDNA
    probe. Because we know what DNA has been placed at each probe, we are therefore able to infer the
    type of DNA that has hybridized with it. Next, any unhybridized sample probes that are left are washed
    off, and the microarray is lit under a laser light and scanned. To determine the actual level of expression
    or concentration of DNA at a given probe, a digital image scanner records the brightness level at each
    grid location on the microarray. Studies have demonstrated that the brightness level is correlated with
    the absolute amount of RNA in the original sample and, by extension, the expression level of the gene
    associated with this RNA (Figure 30.6).
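As an illustrative sketch of this final quantification step, the following fragment (with invented probe intensities and a hypothetical two-channel test/control design) converts scanned brightness values into background-corrected log-ratios of expression:

```python
import math

# Hypothetical scanned intensities for a two-channel (test vs. control)
# microarray: each probe has foreground readings and a local background.
probes = {
    "TP53":  {"test": 5200.0, "control": 1300.0, "background": 200.0},
    "BRCA1": {"test": 900.0,  "control": 850.0,  "background": 200.0},
    "MYC":   {"test": 400.0,  "control": 3100.0, "background": 200.0},
}

def log_ratio(p):
    """Background-subtract each channel, then take the log2 test/control
    ratio: positive means up-regulated in the test sample."""
    test = max(p["test"] - p["background"], 1.0)
    control = max(p["control"] - p["background"], 1.0)
    return math.log2(test / control)

expression = {gene: round(log_ratio(p), 2) for gene, p in probes.items()}
# TP53 comes out strongly up-regulated, MYC down-regulated, BRCA1 unchanged.
```

In a real pipeline the background estimate, normalization between channels and replicate handling are considerably more involved; the sketch shows only the core brightness-to-expression inference.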
    A characteristic of microarray technologies is that they enable the comprehensive measurement of the
    expression level of many genes simultaneously. For example, the level of RNA expression of a biological
    system can be measured under different control and test experimental conditions. Similarly, one can
    compare the expression profile of two such biological systems (two different tissues from one individual,
    or perhaps two different strains of an infectious organism) under one or several conditions.
    30.6 Machine learning methods can assist in the discovery of associations between genetic and
    phenotypic data
    Microarray data generate potentially many thousands of gene expression measurements from a few
    tissue samples, in contrast to typical clinical studies, which measure few variables over many cases
    (Figure 30.7). This high dimensionality of gene expression data poses problems for standard analysis
because not enough data may be available to solve for all the ‘variables’ in the data set, thus leaving it relatively underdetermined. For example, to solve a linear equation in one variable, 4x = 5, we need only one equation to find the value of the variable. To solve for two unknowns, e.g. x and b in y = 4x + b, two equations are required. If we have tens of thousands of variables, but only hundreds of equations,
    then there will be thousands of potentially valid solutions. This is the essence of what constitutes an
    underdetermined system. In this context, we must use techniques that inform us most about the
relationships among the variables of interest (and find out which ones are of interest). High-dimensional data sets are well known in machine learning, so it is not surprising that these techniques
    have found their way into functional genomics.
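The following minimal sketch illustrates underdetermination with the simplest possible case: a single equation in two unknowns admits many exact solutions, just as a few tissue samples cannot pin down weights for thousands of genes:

```python
# One 'equation' but two unknowns: 4x + y = 5.
# Any pair (x, 5 - 4x) satisfies it, so the data alone cannot single
# out one solution -- the system is underdetermined.
def satisfies(x, y):
    return abs(4 * x + y - 5) < 1e-9

candidates = [(0.0, 5.0), (1.0, 1.0), (10.0, -35.0)]
all_valid = all(satisfies(x, y) for x, y in candidates)
# Three distinct (x, y) pairs all fit the single equation exactly,
# just as many gene-weight combinations can fit a handful of samples.
```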
    Figure 30.7 Although clinical studies typically measure few variables over many patients, gene studies
    typically measure many variables over few cases.
    Machine learning techniques can be either supervised learning or unsupervised learning techniques (see
    Chapter 27). In supervised learning, the goal is to find a set of variables (e.g. expressed genes measured
with a microarray) that can best be used to categorize data as belonging to a class of interest (e.g. a
    given disease). In unsupervised learning, the typical application is to ‘data mine’ either to find a
    completely novel cluster of genes with a hypothesized common function or, more commonly, to obtain
    a cluster of genes that appear to have patterns of expression that are similar to an already known gene
    (Figure 30.8). These techniques are thus distinguished by the presence or absence of external labels on
    the data being examined:
    • Supervised learning – Before applying a supervised learning technique, each datum needs to be
    assigned a label. For example, we would label a tissue as coming from a case of acute myeloid leukaemia
    or acute lymphoblastic leukaemia before trying to learn which combinations of variables predict these
    diagnoses. Neural networks and decision tree learning systems fall into this category.
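A minimal sketch of supervised learning on this kind of labelled data, using a simple nearest-centroid classifier over invented expression vectors (real studies use richer models, such as the neural networks and decision trees mentioned above):

```python
# Toy labelled training data: each sample is a short expression vector,
# labelled AML or ALL (all values are invented for illustration).
train = [
    ([5.1, 0.9, 0.2], "AML"),
    ([4.8, 1.1, 0.4], "AML"),
    ([0.7, 4.9, 3.8], "ALL"),
    ([0.9, 5.2, 4.1], "ALL"),
]

def centroid(vectors):
    """Mean expression vector of a group of samples."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit(data):
    """Learn one centroid per class label."""
    by_label = {}
    for vec, label in data:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(centroids, vec):
    """Assign an unseen sample to the class with the closest centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], vec))

model = fit(train)
label = predict(model, [5.0, 1.0, 0.3])  # unseen sample -> "AML"
```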
    • Unsupervised learning – Here, data patterns must be analyzed without labels. For example, we may
    wish to find those genes that are co-regulated across all tissue samples. The variables (or features) can
    include measures of clinical outcome, gene expression, gene sequence, drug exposure and proteomic
    measurements. Unsupervised techniques include relevance networks, dendrograms and self-organizing
    maps that analyze every possible pair of genes to determine whether a functional relationship exists
    between them. The end result of such analysis may be a ranked list of hypotheses about which pairs of
    genes work together. However, the strengths of relationships found by clustering algorithms are not all
    necessarily novel or illuminating. A case could include several thousand gene expression measurements
    and several hundred phenotypic measurements, such as blood pressure or the response to a
    chemotherapeutic agent (Figure 30.9). A clustering algorithm may reveal significant relationships
    between non-genomic rather than genomic variables. For example, if one looks at the effect of different
    drugs on a cancer cell line, then these drug effects may be tightly clustered around drugs that are
    derived from one another. Similarly, phenotypic features that are highly interdependent, such as height
    and weight, also cluster together. These obvious clusters can dominate those that contain gene
    expression measurements.
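The pairwise scoring behind a relevance network can be sketched as follows, using invented expression profiles: every gene pair is scored by Pearson correlation and the pairs are ranked, yielding the ranked list of hypotheses described above:

```python
import math

# Hypothetical expression profiles across five tissue samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 4.2, 5.9, 8.1, 10.0],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],    # unrelated
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Score every gene pair and rank by |correlation|: a ranked list of
# hypotheses about which pairs may work together.
genes = sorted(profiles)
pairs = [(g1, g2, pearson(profiles[g1], profiles[g2]))
         for i, g1 in enumerate(genes) for g2 in genes[i + 1:]]
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
# The co-regulated pair (geneA, geneB) rises to the top of the list.
```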
    Figure 30.9 Clustering studies can involve phenotypic and genetic data sets. A full-fledged functional
    genomic study with clinical relevance involves multiple and quite heterogeneous data types, with
    missing data at different time points.
    30.7 Gene networks are responsible for many biological processes
    Sometimes we can identify that alteration in a single gene is responsible for a particular disease. More
    frequently, however, genes do not work in isolation, but rather interact with each other to shape
    biological processes. Indeed, scientists now talk about the interactome, which represents all the
    different genetic and molecular network interactions that shape a biological process (Ito et al., 2001).
    It is not unusual in a genetic study to find that large numbers of genes are associated with a specific
    disease. The challenge is to understand in what way each gene may contribute to the end state.
    There are several different reasons that two genes may be highly correlated with a disease:
• Structural co-location exists – ‘Driver’ mutations, which directly alter biochemistry and cause disease, may be co-located with more common ‘passenger’ mutations, which are often found with driver
    mutations but appear to have no biochemical role in disease. They are fellow travellers but not
    functionally related.
    • Both are members of a gene regulatory network – One gene may create a protein that binds to the
    promoter site of another gene and turns it on.
    • Both create gene products that are part of a shared biochemical pathway – Many biochemical
    reactions involve multiple molecules, and as a result several genes will need to be active for a
    biochemical pathway to be active.
    NP computability – Some problems are more difficult for computers to solve than others. The time it
    takes for a computer to solve a problem can be characterized by the shape of the ‘time it takes to solve
    it’ function. Some problems can be solved in linear time, thus making them easy to solve. More difficult
    problems are said to take polynomial time because the time to solve them is a function of polynomial
form. A further class of potentially intractable problems has, as far as we know, only super-polynomial or exponential time solutions. Among these are the non-deterministic polynomial time (NP) problems: a proposed answer can be checked quickly, but for the hardest of them no efficient method of finding an answer is known, so it cannot be said ahead of time how long a search will take (although by luck you may solve a particular instance quickly).
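The flavour of exponential growth can be illustrated with a brute-force search for the classic subset-sum problem (a known NP-complete problem): checking every subset doubles the work with each extra element:

```python
from itertools import combinations

def subset_sum_exists(values, target):
    """Brute-force subset sum: check every subset. The number of subsets
    doubles with each extra element, so worst-case running time grows
    exponentially with input size."""
    checked = 0
    for r in range(len(values) + 1):
        for combo in combinations(values, r):
            checked += 1
            if sum(combo) == target:
                return True, checked
    return False, checked

found, work = subset_sum_exists([3, 9, 8, 4, 5, 7], 35)
# An unreachable target forces the search through all 2**6 = 64 subsets;
# one more element would double that to 128, and so on.
```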
    Bioinformatics methods can be used to try to infer likely gene networks from expression data (Hecker et
    al., 2009). Biological experiments provide both initial data about gene co-expression and additional data
    that come from experiments that try to understand the nature of gene relationships. Perturbations of
    the gene set can test to see which genes are related to a process and which are not. For example, gene
deletion or ‘knock-out’ studies can be conducted to see how removal of a gene affects the expression
    and function of others. Typically, such experiments can be carried out only on model organisms such as
    yeast or mice, and they are not suitable for studying human biology.
    An alternate experimental approach is to carry out expression studies over a period of time because a
    time series of gene studies can show evolving changes in gene expression as a biological pathway is
    activated. Different genes and their products become active at different times, and temporal ordering of
    such events gives us strong clues about the causal relationships among the biological entities being
    measured. When genes are annotated according to biological function, for example, using the Gene
Ontology (GO), then the nature of functional relationships in a network can become clearer (Figure 30.10).
    Machine learning methods can be used to infer the networks generated by such experimental data.
    Boolean networks (in which entities are either ‘on’ or ‘off’), Bayesian networks and Markov models are
    all used for gene network representations. Given the large number of possible genes and proteins that
    may be candidates for a network, the task of feature selection (deciding which biological entities to
    include in the model) has a major impact on the accuracy of any model learned (Hecker et al., 2009).
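A Boolean network can be sketched in a few lines. The three-gene network below is entirely hypothetical: each gene is ON (True) or OFF (False), and hand-written update rules (a self-sustaining gene, an activation, and a repression) stand in for rules that would in practice be inferred from expression data:

```python
# A minimal hypothetical Boolean gene network.
rules = {
    "A": lambda s: s["A"],                  # A sustains its own expression
    "B": lambda s: s["A"],                  # A activates B
    "C": lambda s: s["B"] and not s["A"],   # B activates C, but A represses it
}

def step(state):
    """Apply every update rule synchronously to get the next state."""
    return {gene: rule(state) for gene, rule in rules.items()}

def run(state, steps):
    """Return the trajectory of states over a number of time steps."""
    trajectory = [state]
    for _ in range(steps):
        state = step(state)
        trajectory.append(state)
    return trajectory

traj = run({"A": True, "B": False, "C": False}, 3)
# A turns B on at the first step, but C stays off because A, which is
# always on, represses it; the network settles into a stable state.
```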
    30.8 Bioinformatics analyses are limited by our understanding of biological contexts as well as
    computational complexity
    There is little doubt that one of the tremendous accomplishments of the Human Genome Project is that
    it has enabled a rigorous computational approach to identifying many questions of interest to the
    biological and clinical community at large, and it has led to new ways to create therapies. However, the
    danger of this success is a kind of computational triumphalism, which believes that all the problems of
    disease will now rapidly be solved. Such a view is founded on several risky assumptions, mostly based
    upon a misunderstanding of the nature of models and, in particular, the complexity of biological models.
    The first misunderstanding is that of genetic or sequence level reductionism. Most bioinformaticians
    understand that physiology is the product of the genome and its interaction with the environment. In
    practice, however, a computationally oriented investigator often assumes that all regulation can be
    inferred from a DNA sequence. It is assumed that one can predict whether a change in nucleotide
    sequence will result in different physiology. However, cellular metabolic pathways are substantially
    more complicated than this and involve many levels of interaction among molecules.
    Figure 30.10 A portion of a gene-network from patients with amyotrophic lateral sclerosis. Genes are
    connected by an edge if the correlation of their expression profiles is significant, and genes are colour
    coded according to their biological function by using the Gene Ontology.
    (Adapted from Saris et al., 2009.)
    The second dubious assumption is the computability of complex biochemical phenomena. One of the
    oldest branches of bioinformatics seeks to model molecular interactions such as the thermodynamics of
    protein folding. Studies by computer scientists suggest that the protein-folding problem is ‘NP hard’.
That is, the computational task belongs to a class of problems that are believed to be computationally intractable, with no known algorithm that can solve large instances in a practical amount of time. Therefore, it seems overly ambitious to imagine that we will be able to generate robust
    models that can accurately predict the interactions of thousands of molecules from the transcription of
RNA through to protein manufacture and function. We could call this ambition ‘interactional reductionism’.
    The final questionable assumption is the closed-world assumption. Both sequence-level reductionism
    and interactional reductionism are based upon the availability of a reliable and complete mechanistic
    model. That is, if a fertilized ovum can follow the genetic programme to create a full human being after
    9 months, then surely a computer program should be able to follow the same genetic code to infer all
    the physiological events determined by the genetic code. Indeed, there have been several efforts, such
    as the E-cell effort (Takahashi et al., 2003), which aim to provide robust models of cellular function
    based on the known regulatory behaviour of cellular systems. Although such models are important, our
    knowledge of all the relevant parameters for these models is still grossly incomplete. We now know, for
    example, that environmental factors can alter epigenetic controls (see Box 30.2), and they have a
    significant influence on the risk of disease. In recognition of this knowledge, the interaction of biological
    and environmental factors to cause disease is sometimes called the diseasome (Barabási, 2007).
