Genetomic Promototypes: High-throughput, Computational Design of Synthetic Promoter Regions

by

Mirkó Palla

Clarkson University

Genetomic Promototypes: High-throughput, Computational Design of Synthetic Promoter Regions

A Thesis by

Mirkó Palla

Department of Mechanical and Aeronautical Engineering

Harvard Medical School, Genetics Department – Church Laboratory

Submitted in partial fulfillment of the requirements for a

Bachelor of Science Degree with

University Honors

April 2007

Accepted by the Honors Program

______________________________ Dana Pe’er, Advisor Date

______________________________ James Schulte, Honors Reader Date

______________________________ David Craig, Honors Director Date

Contents

Executive Summary

1 Introduction
1.1 Biological Background
1.2 Oligonucleotide Synthesis
1.3 BASHER – Computational Design Tool

2 Background Information
2.1 DNA Discovery and Chromosomes
2.2 From DNA to Protein
2.3 From DNA to RNA
2.4 The Transcriptional Machinery
2.5 Basics of Genetic Switches
2.6 Eukaryotic Transcriptional Regulation

3 Methodology
3.1 The Rise of Synthetic Biology
3.2 Overall Design of Hypothesis Testing
3.3 Designing Promoter Variations

4 Results
4.1 BASHER Preliminaries
4.2 BASHER Structure
4.2.1 Configuration File
4.2.2 Mutagenesis Module
4.2.3 Combinatorial Module
4.2.4 Promoter Visualization
4.2.5 Regulatory Combinatory
4.2.6 Regulatory Combinatory
4.2.7 Pair Mutagenesis
4.2.8 Overlapping Pairs
4.2.9 Module Mutagenesis

5 Discussion

6 Conclusion
6.1 Future Prospects
6.2 Project Barriers
6.3 Project Reflection

Acknowledgements

I would like to thank my mentors Dana Pe’er, Aimee Dudley and Noel Goddard at Harvard University for giving me the unique opportunity to work with them on the Polypromoter Project. It was truly a life-changing experience, which gave me a glimpse of what it takes to be a real research scientist. I would also like to thank Prof. George Church for letting me work in his innovative and well-respected laboratory; I felt privileged to interact with such a bright group of researchers. Many thanks go to the people in the laboratory for their enormous support from day one; they were never too busy to spend some quality time with a “greenhorn” in the field of genomics.

This thesis would not have been possible without the support of the Honors Program. Prof. Craig and Prof. Shen offered me great mentorship during my academic career here at Clarkson University, which gave me a tremendous opportunity to blossom individually and intellectually. I would like to express my gratitude to Prof. Craig, who helped me through difficult and sometimes even stressful times. I feel that along the way I have grown a lot personally and will enter the “real world” with experiences I could not have gotten anywhere other than the Honors Program at Clarkson.

Finally, I must express my gratitude to my mother, who sacrificed a lot so that I could come to study in the United States. She blindly supported me during my Odyssey-like journey, giving me the strength and confidence to believe in myself in every situation I encountered. I dedicate this thesis to my ever-young and energetic grandma, who updates me regularly with her hand-written letters, which keep me going here. She is always with me, in my thoughts, even though an ocean separates us. I would also like to thank my dad and two brothers, and their beautiful families, for their continuous support; this work could not have been done without their encouragement.

Executive Summary

Over the last decade, most of molecular biology has concentrated on analyzing naturally occurring DNA sequences as revealed by large-scale sequencing efforts. In contrast, synthetic biology aims to write new genetic information, thereby designing non-natural DNA regions, genes, proteins, biological processes and entire organisms [1]. Unlike in the past, protein and DNA sequences are now easier to obtain electronically through databases than physically from library clones. At the same time, gene synthesis technologies have developed to a level of high reliability. For this reason, direct synthesis of DNA regions is swiftly becoming the most efficient way to make functional genetic constructs and enables applications such as gene, protein and genome engineering [2].

Synthetic biology is the junction of molecular biology and engineering principles, supported by efficient technologies for creating full-length genes, promoters and even genomes [3]. Mutating DNA segments in the upstream regulatory region of a gene has often been shown to increase gene expression levels drastically [4]. Central to such efforts is the ability to design the genetic constructs as easily as possible while considering multiple design parameters in parallel. For example, the degree of sequence identity to homologs and the presence or absence of specific regulatory sites or motifs must all be considered simultaneously. Current sequence manipulation packages are typically very feature-rich, with graphical user interfaces and multiple integrated tools that allow for a seamless workflow. They are primarily built to analyze sequence data, however, and leave very little freedom for designing and fine-tuning genetic information.

Our software BASHER, on the other hand, is built to integrate promoter region manipulation with all the tools necessary to design, write and edit sequence information within one unifying interface. The software enables the quick, reliable and robust creation of predefined and custom genetic building blocks, a process essential for systems biologists seeking to understand how gene expression is controlled by the cell, which is the main objective of this research project. The project title, genetomic promototypes, reflects this design approach. Just as atoms make up all elements in nature, genetic building blocks (small functional DNA sequences) are used to create new types of synthetic regulatory regions; hence the name genetomic promototypes.

Chapter 1

Introduction

Transcriptional regulation plays a vital role in all living organisms. It influences

development, complexity, diversity, homeostasis and other important biological functions [5]. Transcription is the first stage in the universal information flow from genome, where

all genetic programs are stored, to proteome, through which these programs are executed.

Thus, understanding the complex mechanism behind the control of the transcription machinery constitutes one of the fundamental goals of quantitative biology. At the most basic level, transcription is controlled by the combinatorial interplay of cis-regulatory elements (or motifs) present in the gene’s promoter region¹ and associated regulatory proteins (or transcription factors) present in the cytoplasm [6]. Because all

transcription factors are gene products themselves, this mechanism is regulated by a set

of motifs present in the particular gene’s promoter. Thus, the elementary principles

governing transcription can be understood by a quantitative description of how the

motif’s influence on gene expression depends on promoter context. In spite of major

efforts aimed at identifying motifs in different species using a variety of approaches and

analyzing their precise influence on gene expression, little is known about the principles

by which a gene’s motifs translate into an expression level [7]. In other words, the

quantitative effect of motifs on gene expression as a function of their promoter context is

still poorly understood.

1.1 Biological Background

Modern molecular biology has brought research scientists many new tools as well as an expanding database of genomes and new genes for study. Of particular use in the analysis of these genes is the synthetic promoter region, a 600-1000 base pair nucleotide sequence designed to the specifications of the investigator, which controls the transcription machinery. Synthetic promoters control the same product as the gene of interest, but the bioengineered nucleotide sequence regulating that protein may express it differently under various environmental conditions. Designing synthetic promoters by hand is a time-consuming and error-prone process that may involve several computer programs. For this reason, an integrated bioengineering tool (design software called BASHER) is under development; it combines many modules to provide a platform for high-throughput design of synthetic promoter regions for multi-kilobase sequences. Of all sequenced genomes, that of the yeast Saccharomyces cerevisiae has gained the most attention due to the availability of multiple yeast genomes and high-quality mRNA data [8]. For this reason, this yeast species was chosen as our core model in the genomic analysis.

1 See Figure 4 on page 17 for hypothetical gene control mechanism.

1.2 Oligonucleotide Synthesis

The power and flexibility of oligonucleotide synthesis are increasingly being recognized in the bioengineering community. Traditional applications of promoter region synthesis include facilitation of site-directed mutagenesis, structural analysis and investigation of transcription regulation. The new theory of promoter variant design takes combinatorial and spatial effects of cis-binding sites² into account and incorporates them into the modeling process. Since binding sites can act as activators or inhibitors and can form modules (sets of cis-elements) whose interactions produce linear, epistatic, synergistic or switch effects, a deep combinatorial analysis is needed to decipher the governing regulatory logic. Previous studies show that the spatial organization of these regulatory elements has functional and mechanistic implications [9]. There are physical interactions between them, as certain transcription factor binding sites overlap, implying the possibility of protein complex formation. Also, in the higher-order chromatin structure, there are regions of three-dimensional occlusion that block protein binding to regulatory motif sequences. Motif positioning relative to the transcription start site plays a significant role in the transcription regulatory mechanism, so synthetic DNA segment² insertions might reveal some functionality. Finally, the distance between cis-elements plays a major role in regulation; certain motif pairs occur only at a particular base-pair distance from each other, and some pairs occur more frequently than others in the promoter. It was also shown that motif orientation and order have regulatory effects, i.e., a regulatory module will only influence gene expression in the right spatial combination (orientation, order) [9]. To decipher the governing regulatory logic, combinations of elements must first be removed or replaced with new synthetic motif sequences, so that the resulting gene expression profile can be analyzed under various environmental conditions. Further logical design steps include randomly moving a binding site to other locations, making small changes to cis-elements, and adding new motifs based on new statistical data. These design steps are performed by BASHER, resulting in a set of systematic promoter variants produced in a high-throughput manner.

2 See Figure VII on page 61 in Appendix for an example of cis-binding sites of YCL027W promoter
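The design steps described above (removing combinations of motifs, then relocating a binding site) can be illustrated with a short Python sketch. This is not BASHER's code: the `Site` record, the function names, the toy promoter and the use of `N` as a neutral filler base are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, NamedTuple, Sequence


class Site(NamedTuple):
    """Hypothetical binding-site record; coordinates are 0-based, end exclusive."""
    name: str
    start: int
    end: int


def knock_out(promoter: str, sites: Sequence[Site], filler: str = "N") -> str:
    """Overwrite each site with a one-letter filler so spacing is preserved."""
    seq = list(promoter)
    for site in sites:
        seq[site.start:site.end] = filler * (site.end - site.start)
    return "".join(seq)


def knockout_variants(promoter: str, sites: Sequence[Site]) -> Dict[str, str]:
    """Build one variant for every non-empty combination of removed sites."""
    variants = {}
    for k in range(1, len(sites) + 1):
        for subset in combinations(sites, k):
            label = "+".join(site.name for site in subset)
            variants[label] = knock_out(promoter, subset)
    return variants


def move_site(promoter: str, site: Site, new_start: int, filler: str = "N") -> str:
    """Blank the motif at its original position and write it at new_start."""
    motif = promoter[site.start:site.end]
    if new_start < 0 or new_start + len(motif) > len(promoter):
        raise ValueError("motif does not fit at the requested position")
    seq = list(knock_out(promoter, [site], filler))
    seq[new_start:new_start + len(motif)] = motif
    return "".join(seq)


if __name__ == "__main__":
    # Toy promoter with two made-up motifs (CGCG and GATAAG).
    promoter = "AAAACGCGTTTTGATAAGCCCC"
    sites = [Site("m1", 4, 8), Site("m2", 12, 18)]
    for label, variant in knockout_variants(promoter, sites).items():
        print(label, variant)
    print("m1 moved", move_site(promoter, sites[0], 18))
```

With n annotated sites this enumeration yields 2^n - 1 knockout variants, which is why a high-throughput tool is needed once a promoter carries more than a handful of motifs. Note that `move_site` simply overwrites whatever lies at the target position, including another motif; a real design tool would have to check for such collisions.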

1.3 BASHER – Computational Design Tool

In the past, researchers used many different programs to address the requirements of the

separate steps of synthetic promoter design [10]. Alternatively, they sent off their

requirements to a black box provided by a gene synthesis company and let it use its

proprietary programs to design nucleotide sequences of interest. To facilitate the use of

synthetic promoter regions in both traditional and high-throughput applications, new and

more flexible solutions are required. BASHER is a useful tool for investigators who wish

to optimize protein expression and/or redesign their promoter of interest for detailed

structure/function studies (e.g., mutagenesis) [11]. The objective of this research project

is to create a command-line program that is able to perform all of the functions outlined

above for promoter design in a directed, step-wise manner. It accepts as input both

ortholog promoter sequences and global transcription factor binding site maps of the

organism of interest and allows users to move through the process of design in a series of

modules that address practical issues surrounding oligonucleotide design. Users can

follow the main “design a promoter” path or use the modules individually as needed.

Chapter 2

Background Information

Life depends on the cell’s ability to store, retrieve and translate the genetic information needed to create and maintain a living organism. At cell division this hereditary information is passed on from a cell to its daughter cells, and from one generation to the next through the organism’s reproductive mechanism. Every living cell contains these instructions in the form of genes³, the information-containing regions of the DNA (deoxyribonucleic acid) that determine the hereditary characteristics of a distinct species and of the individuals within it.

At the beginning of the twentieth century, when genetics started to emerge as a scientific field of its own, researchers set out to understand the biochemical structure of genes and cell functions in general. They knew that the hereditary information in genes is copied and transmitted from cell to cell many times during the life cycle of a multi-cellular organism. They also realized that the genetic code remains essentially unchanged during this process. At that time, they did not know what type of molecule was capable of virtually unlimited replication with such accuracy while directing the development and daily life of a living cell. The next logical step was to figure out what type of instructions the genetic code contains and how the genetic information responsible for the development and maintenance of even the simplest living organism is physically organized.

In the 1940s, the discovery that genetic information consists primarily of instructions for making proteins shed some promising light on these questions. Proteins are macromolecules that perform the majority of cellular functions: they enable cells to move and to communicate with each other, they serve as building blocks for cellular structures, they form the enzymes that catalyze the chemical reactions inside the cell, and they regulate gene expression.

3 See Figure I and II in Appendix for definition of gene and further details on gene architecture.

2.1 DNA Discovery and Chromosomes

The other crucial discovery of this era was the identification of DNA⁴ as the most probable carrier of genetic information [12]. But the mechanism whereby the hereditary code is transmitted unaltered from cell to cell, and how the synthesis of proteins is directed by the instructions encrypted in the DNA, remained unknown. In 1953 this mystery was solved when two molecular biologists, James D. Watson and Francis Crick, determined the chemical and geometric structure of DNA. The structure immediately suggested how the information in this molecule might be replicated and also provided insight into how a DNA molecule might encode the instructions for making proteins.

In the nineteenth century, biologists had also recognized that genes are carried on chromosomes, which are threadlike structures⁵ in the nucleus. Later, they discovered that chromosomes consist of both DNA and protein. As discussed previously, the hereditary information of the cell is encoded in the DNA; the protein components of chromosomes, in contrast, play a vital role in packaging and controlling the enormously long DNA molecules so that they fit inside cells and can easily be accessed by them.

Despite its molecular simplicity, DNA has a structure and chemical properties that make it an excellent raw material for genes. Every gene of every cell on Earth is made of DNA, and insights into the relationship between DNA and genes have come from experiments in a wide variety of organisms. It is crucial to understand how genes and other important regions of DNA are arranged in three dimensions on the DNA molecules present in chromosomes. It is also fundamental to comprehend fully how eukaryotic cells fold these long DNA molecules into compact chromosomes, which can then be correctly replicated and divided between two daughter cells at cell division. Furthermore, more understanding must be gained about the enzymatic repair of chromosomal DNA and about the specialized proteins that direct the expression of the DNA’s many genes.

4 For DNA double-helix architecture see Figure III in Appendix.

5 For eukaryote chromosome structure see Figure IV in Appendix.

2.2 From DNA to Protein

When the structure of DNA was discovered, it became clear how the hereditary information is encoded in DNA’s sequence of nucleotides. Over the past forty years, scientific progress has been astonishing. We now have complete genome sequences for many organisms, and thus we know the maximum amount of information required to produce a complex organism like ourselves. Since the hereditary information has finite limits constrained by the biochemical and structural features of the cell, it is now clear that biology has finite complexity.

At this stage, we still have a great deal to discover about how the genome directs the development of a simple, unicellular organism with about 500 genes, let alone a human with approximately 30,000 genes. A vast number of questions remain to be answered, posing great challenges for the next generation of bioengineers. But, as we now know, much of the DNA-encoded information present in the genome is used to specify a linear amino acid order⁶ for every protein of the organism. The amino acid sequence in turn determines how each protein folds into a distinct three-dimensional molecular shape, which gives it unique chemical characteristics. When a specific protein is produced by the cell, the corresponding genome region must be precisely decoded. Additional sequence information in the DNA of the genome determines exactly when in the life of the cell, and in which cell types, each gene will be expressed into protein [13]. Since proteins are the main components of living cells, the decoding of the genome determines the mechanical configuration, the biochemical properties and the distinctive features of species on Earth.

Even though we might expect the genome information to be arranged in an orderly manner, the genomes of most multi-cellular organisms are unexpectedly disordered. Small sections of coding DNA are interspersed with large blocks of seemingly meaningless DNA. Some sections of the genome contain multiple genes and others lack genes altogether. It is common for proteins that cooperate in the cell to have their genes located on different chromosomes, and adjacent genes usually encode proteins that do not interact at all [15]. Thus, deciphering genomes is not a simple task. Even with the help of powerful computers, it is still very difficult to locate with certainty the beginning and end of genes in the DNA sequences of complex genomes, and to predict when each gene is expressed in the life cycle.

6 For the schematic depiction of a portion of chromosome 2 from the genome of the fruit fly see Figure V.

RNA, an intermediate molecule, directs protein synthesis, rather than the DNA itself. When the cell needs a specific protein, the DNA sequence of the corresponding portion of the chromosome is first copied into RNA, a process called transcription. These RNA copies of the DNA sequence are then used directly as templates to synthesize protein, a process called translation. The genetic information flow in cells is therefore from DNA to RNA to protein. All cells on Earth, from unicellular to complex multi-cellular organisms, express their genetic information in this way, a principle termed the central dogma of molecular biology because of its universality (see Figure 1).

Figure 1 – The central dogma. Figure 2 – Different gene expression efficiencies.

Left (Figure 1): The flow of genetic information from DNA to RNA (transcription) and from RNA to protein (translation) occurs in all living cells. Right (Figure 2): Genes can be expressed with different efficiencies. Gene A is transcribed and translated much more efficiently than gene B. This allows the amount of protein A in the cell to be much greater than that of protein B [15].
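The DNA-to-RNA-to-protein flow can be made concrete with a minimal Python sketch. It assumes the standard genetic code and an input given as the coding (non-template) strand read 5' to 3'; the function names and the example sequence are illustrative only and are not taken from the thesis software.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Transcription: the mRNA matches the coding strand with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTTTTAA")
    print(mrna)             # AUGGCCUUUUAA
    print(translate(mrna))  # MAF
```

The sketch ignores everything that makes real gene expression variable (processing, regulation, efficiency differences such as those in Figure 2) and only shows the direction of information flow.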

Despite the generality of the dogma, there are major variations in the way information flows from DNA to protein. One of the most important variations in eukaryotic cells is that the RNA transcripts undergo a series of processing steps in the nucleus before they are allowed to exit and be translated into protein. These steps can fundamentally change the functionality of the RNA molecule and are therefore vital to understanding how the eukaryotic genome is deciphered. It is also interesting to point out that for some genes RNA, not protein, is the final product. Many of these RNAs fold into defined three-dimensional structures that have structural and catalytic roles in the cell.

2.3 From DNA to RNA

Since the primary focus of this research is RNA transcription, it is crucial to understand in detail the process by which an RNA molecule is produced from the DNA of a gene. Transcription is the means by which cells read out the genetic code in their genes. Because many identical RNA copies can be produced from the same gene, and each RNA molecule can direct the synthesis of many identical protein molecules, cells can synthesize a large amount of protein when necessary. Also, each gene can be transcribed and translated at a different rate⁷, allowing the cell to control the quantity of each protein finely. Furthermore, the cell can control its gene expression by adjusting RNA production according to its momentary needs [16].

The first step a cell takes in retrieving the needed part of its genetic instructions is to copy a specified portion of its DNA nucleotide sequence, a gene, into an RNA sequence. The information copied from DNA to RNA, although in another chemical form, is still written as a nucleotide sequence, hence the name transcription. Like DNA, RNA is a linear polymer made of four different types of nucleotide subunits linked together by phosphodiester bonds. But it differs from DNA chemically in two ways: (1) the nucleotides in RNA are ribonucleotides rather than deoxyribonucleotides; (2) although both nucleic acids contain the bases adenine (A), guanine (G) and cytosine (C), RNA contains the base uracil (U) instead of the thymine (T) found in DNA. In RNA, G pairs with C, and A pairs with U by hydrogen bonding. It is not uncommon, however, to find other types of base pairs: for example, G occasionally pairs with U.

7 For gene efficiency control see Figure 2 on page 12.

Despite these minor chemical differences, DNA and RNA differ quite significantly in overall structure. Whereas DNA always forms a double-stranded helix, RNA is single-stranded. RNA chains can therefore fold up into a variety of complex three-dimensional shapes that provide the structural and catalytic functions mentioned before.

The transcription process begins with the opening and unwinding of a small section of the double helix to expose the bases on each strand, as in DNA replication. One of the two strands then acts as a template for the synthesis of an RNA molecule. The nucleotide sequence of the RNA chain is determined by complementary base-pairing between incoming nucleotides and the DNA template (see Figure 3). When an appropriate match is found, the incoming ribonucleotide is covalently linked to the growing RNA chain in an enzymatically catalyzed reaction. The transcript is thus elongated one nucleotide at a time and is exactly complementary to the strand of DNA used as the template.

Figure 3 - DNA transcription produces a single-stranded RNA molecule that is complementary to

one strand of DNA [15].
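The base-pairing rule that fixes the RNA sequence can be written out as a small Python sketch. It assumes the template strand is supplied in the order the polymerase reads it (3' to 5'), so that the RNA comes out 5' to 3'; the function name and example sequence are hypothetical.

```python
# Pairing between a DNA template base and the incoming ribonucleotide.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_3_to_5: str) -> str:
    """Grow an RNA chain one nucleotide at a time along a DNA template.

    The template is given 3'->5'; the returned RNA is written 5'->3'.
    """
    rna = []
    for base in template_3_to_5.upper():
        if base not in PAIRING:
            raise ValueError(f"not a DNA base: {base!r}")
        rna.append(PAIRING[base])  # one covalent addition per template base
    return "".join(rna)


if __name__ == "__main__":
    print(transcribe_template("TACCGGAAA"))  # AUGGCCUUU
```

Each loop iteration stands for one elongation step: the next template base is exposed, the matching ribonucleotide is selected, and the chain grows by one.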

However, transcription differs from replication in several ways. Unlike a newly synthesized DNA strand, the RNA strand does not remain hydrogen-bonded to the DNA template. Just behind the site where ribonucleotides are added, the RNA chain is displaced and the DNA helix re-forms. Thus, RNA molecules are single-stranded when released from the DNA template. Also, since they are copied from only a specific region of the DNA, the resultant RNA molecules are much shorter than DNA. Most RNAs in the human body are no more than a few thousand nucleotides long, and many are considerably shorter.

The enzymes that perform transcription are called RNA polymerases. RNA polymerases catalyze the formation of the bonds that link the nucleotides together to form the linear RNA chain. The polymerase moves stepwise along the DNA strand, opening the helix just ahead of the active site for polymerization to expose a new region of the template DNA for complementary base-pairing. Thus, the growing RNA chain is elongated one nucleotide at a time in the 5’-to-3’ direction. The immediate release of the RNA sequence from the DNA as it is transcribed means that many RNA copies can be made from the same gene in a relatively short time. When RNA polymerase molecules follow closely behind one another in this way, over a thousand transcripts can be synthesized in an hour from a single gene.

It is important to point out that the majority of genes carried in a cell's DNA specify the amino acid sequence of proteins; the RNA molecules copied from these genes during transcription are called messenger RNA (mRNA) molecules. To transcribe a gene precisely, RNA polymerase must identify where on the genome to start and where to finish. The initiation of transcription is an essential step in gene expression because it is the point at which the cell regulates which proteins are to be produced and at what rate.

2.4 The Transcriptional Machinery

Bacterial RNA polymerase is a multi-subunit complex in which the sigma (σ) factor is largely responsible for recognizing where to begin transcribing [17]. Initially, RNA polymerase molecules hold on only weakly to the bacterial DNA. The polymerase molecule typically slides swiftly along the DNA molecule until it arrives at a region called a promoter, a special sequence of nucleotides indicating the starting point for RNA synthesis. There it binds tightly to the promoter DNA and opens up the double helix to expose a short stretch of nucleotides on each strand. With the DNA unwound, one of the two exposed DNA strands acts as a template for complementary base-pairing with incoming ribonucleotides. After approximately the first ten nucleotides of RNA have been synthesized, the σ factor relaxes its firm hold on the polymerase and eventually dissociates from it. RNA chain elongation continues until the enzyme encounters a second signaling region in the DNA, called the terminator, where the polymerase halts and releases both the DNA template and the new RNA chain. After its release at a terminator, the polymerase reassociates with a free σ factor and searches for a new promoter, where the transcription cycle can start again.

As described above, the processes of transcription initiation and termination involve a complex sequence of structural transitions in protein, DNA and RNA molecules. Thus, the signals encoded in DNA that specify these critical transitions are difficult for researchers to identify. On the one hand, comparison of many bacterial promoters reveals that they are heterogeneous. On the other hand, all of them contain related sequences, reflecting the fact that they are recognized directly by the σ factor. These common features are often summarized in the form of a consensus sequence. In general, a consensus nucleotide sequence is derived by comparing many sequences with the same basic function and tallying the most common nucleotide found at each position. It therefore serves as a summary or “average” of a large number of individual nucleotide sequences.
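The tallying procedure just described translates directly into code. The sketch below assumes the input sequences are already aligned and of equal length; the five example hexamers are made up for illustration, and the alphabetical tie-breaking rule is an arbitrary choice of this example.

```python
from collections import Counter
from typing import Sequence


def consensus(sequences: Sequence[str]) -> str:
    """Return the most common nucleotide at each position of aligned sequences."""
    seqs = [s.upper() for s in sequences]
    if not seqs:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*seqs):  # one column per alignment position
        counts = Counter(column)
        # Ties are broken alphabetically (an arbitrary choice for this sketch).
        result.append(max(sorted(counts), key=counts.__getitem__))
    return "".join(result)


if __name__ == "__main__":
    # Made-up promoter-like hexamers, each differing from the others at one position.
    examples = ["TATAAT", "TATGAT", "TACAAT", "GATAAT", "TATACT"]
    print(consensus(examples))  # TATAAT
```

Note that the consensus need not equal any single input sequence, and it discards how strongly each position is conserved; a position-specific count matrix would keep that information.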

One of the reasons bacterial promoters differ in composition is that the specific sequence determines the strength of the promoter (the number of initiation events per unit time). In other words, evolution has tuned each promoter to initiate as often as needed and has created a wide array of promoters. Promoters for genes that code for widely used proteins are much stronger than those associated with genes encoding rare proteins, and their nucleotide sequences are responsible for these differences. Like bacterial promoters, transcription terminators also include a wide variety of sequences; in some cases, the potential to form a simple RNA structure is the most important common feature [18]. Since an almost unlimited number of nucleotide sequences have this potential, terminator sequences are much more heterogeneous than those of promoters.

Although a great deal is known about bacterial promoters, terminators and their

consensus sequences, their variability makes it difficult for researchers to locate

them with certainty simply by inspecting the nucleotide sequence of a genome. When analogous

sequences are encountered in eukaryotes, the problem of locating them is even more

complicated. Often, additional information, some of it from direct experimentation, is

needed to accurately locate the short DNA signals contained in genomes.

All promoter sequences are asymmetric, which plays an important role in their

arrangement in genomes. Since DNA is double-stranded, in theory two different RNA

molecules could be transcribed from any gene, using each of the strands as a template.

However, a typical gene has only a single promoter, and because the nucleotide sequences

of promoters are asymmetric, the RNA polymerase can bind in only one orientation.

Since the polymerase can synthesize RNA in the 5’ to 3’ direction only, the template

DNA strand for each gene is determined by the location and orientation of the promoter.

Analysis of genome sequences revealed that the DNA strand used as the template for

transcription varies from gene to gene [19].

In contrast to bacteria, eukaryotic nuclei have three RNA polymerases, called RNA

polymerase I, RNA polymerase II, and RNA polymerase III. They are structurally similar

to one another and also to the bacterial enzyme. They share some common subunits and

many structural features, but they transcribe different types of genes. Transfer RNAs,

ribosomal RNAs, and other small RNAs are transcribed by RNA polymerases I and III,

while the vast majority of genes, including those that encode proteins, are transcribed by RNA polymerase

II. Despite its many structural similarities to the bacterial polymerase, eukaryotic RNA

polymerase II differs from it in two important functional respects.

1. Eukaryotic RNA polymerases require general transcription factors (a set of specific

proteins), which must assemble at the promoter with the polymerase before the

polymerase can begin transcription (see Figure 4).

2. Eukaryotic transcription initiation must deal with the packing of DNA into

nucleosomes and higher order forms of chromatin structure.

Figure 4 - Gene control mechanism for gene X [15].

The general transcription factors help to position the RNA polymerase correctly at the

promoter, aid in pulling the two strands of DNA apart to allow transcription to begin, and

release the polymerase from the promoter into the elongation phase once transcription has begun.

2.5 Basics of Genetic Switches

In the previous section, the basic components of genetic switches - regulatory proteins

and the DNA sequences (motifs) that these proteins recognize - were identified. To

understand how these components operate to turn genes on and off in response to a range

of signals, consider how the bacterium E. coli responds when the

composition of its growth medium is changed. This is an example of one of the

simplest control mechanisms in gene regulation: an on-off switch that responds to a

single signal [20].

The chromosome of the bacterium E. coli, a single-celled organism, consists of a

single circular DNA molecule, which encodes approximately 4300 proteins. The

expression of these genes is regulated according to the available food in the surrounding

environment. This is demonstrated by five E. coli genes that code for enzymes that

manufacture the amino acid tryptophan. These genes are clustered together on the same

chromosome and are transcribed as one mRNA molecule from a single promoter

(an operon). When tryptophan is present in the medium, however, the cell shuts off their

production, since it no longer needs these enzymes. The molecular basis for this switch is

understood in extensive detail. If the level of tryptophan is low, the polymerase binds to

the promoter and transcribes the genes of the tryptophan operon. If the level of

tryptophan is high, the tryptophan repressor is activated and binds to the operator, where it blocks the

binding of RNA polymerase to the promoter. When the level of tryptophan drops, the

repressor releases its tryptophan and becomes inactive, allowing the polymerase to begin

transcribing these genes (see Figure 5). Thus the tryptophan repressor and operator form

a simple device that switches production of the tryptophan enzymes on and off according

to the availability of free tryptophan. Because the active, DNA-binding form of the

protein serves to turn genes off, this mode of gene regulation is called negative control

and the gene regulatory proteins that function in this way are called transcriptional

repressors.

Figure 5 – Tryptophan negative control in E. coli [15].

In some cases, a bacterial promoter is only marginally functional on its own, either because it is poorly

recognized by the RNA polymerase or because the polymerase has difficulty opening the

DNA double helix at initiation. In both cases, the promoter can be rescued by so-called

gene regulatory proteins that bind to a nearby site on the DNA and contact the

RNA polymerase so that the probability of transcription increases drastically. This form of

gene regulation is termed positive control, since a DNA-binding protein switches the

gene on. For this reason the gene regulatory proteins that function in this manner are

known as transcriptional activators. In some cases, gene activator proteins aid RNA

polymerase binding to the promoter by providing extra surface for attachment. In other

cases, they assist the initial DNA-bound polymerase to transition to the active

transcription phase.

For example, the bacterial activator protein CAP (catabolite activator protein)

activates genes that enable E. coli to use other carbon sources when glucose is not

available [21]. When the glucose level falls, there is an increase in the intracellular

signaling molecule cyclic AMP, which binds to the CAP protein, enabling it to bind near

its target promoters and thereby switch on the corresponding genes. Thus, the expression of a

target gene is turned on or off, depending on whether the cyclic AMP level in the cell is

high or low, respectively (see Figure 6).

Figure 6 – Positive and negative gene control by regulatory proteins in prokaryotes [15].

Note that the addition of an "inducing" ligand can turn on a gene either by removing a

gene repressor protein from the DNA (upper left panel) or by causing a gene activator

protein to bind (lower right panel). Likewise, the addition of an "inhibitory" ligand can

turn off a gene either by removing a gene activator protein from the DNA (upper right

panel) or by causing a gene repressor protein to bind (lower left panel).

Positive and negative controls can be combined to form more complicated genetic

switches [22]. An example is the lac operon in E. coli, which is controlled

by both negative and positive regulatory mechanisms, mediated by the lac repressor protein and

CAP, respectively (see Figure 7). The lac operon codes for proteins required for lactose transport and

breakdown, while CAP enables the bacterium to use alternative carbon sources when glucose

is scarce. CAP should not induce lac operon expression if lactose is not present,

and the lac repressor ensures that the lac operon is off when lactose is

absent. This circuitry allows the lac operon to respond to and integrate

two distinct signals, so that it is expressed only when two conditions are met:

lactose must be present and glucose must be absent.
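The dual control of the lac operon can be written as a small truth table. This is a deliberately simplified Boolean sketch of the logic described above; real expression levels are graded rather than strictly on or off, and the function name is hypothetical.

```python
def lac_operon_on(lactose_present: bool, glucose_present: bool) -> bool:
    """Boolean abstraction of lac operon control (simplified illustration)."""
    # Negative control: the lac repressor sits on the operator unless lactose is present.
    repressor_bound = not lactose_present
    # Positive control: low glucose raises cyclic AMP, which lets CAP bind near the promoter.
    cap_bound = not glucose_present
    return cap_bound and not repressor_bound


for lactose in (False, True):
    for glucose in (False, True):
        state = "ON" if lac_operon_on(lactose, glucose) else "OFF"
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> operon {state}")
```

Only the combination of lactose present and glucose absent switches the operon on, which is the AND-like behavior the text describes.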

Figure 7 - Dual control of the lac operon [15].

The logic of this simple genetic switch first attracted the attention of biologists over 50

years ago. As explained above, the molecular basis of the switch was uncovered by a

combination of genetics and biochemistry, providing the first insight into how gene

expression is controlled. Although the same basic strategies are used to control gene

expression in higher organisms, the genetic switches that are used are usually much more

complex.

2.6 Eukaryotic Transcriptional Regulation

Transcriptional regulation in eukaryotes differs in three important ways from that

found in bacteria. First, in eukaryotes there are gene regulatory proteins that can act

even when they are bound to DNA thousands of nucleotide pairs away from the promoter

that they influence. Second, the eukaryotic RNA polymerase II requires general

transcription factors, which must be assembled at the promoter before transcription can

be initiated. This assembly process can be regulated by signals, so that transcription

initiation can be speeded up or slowed down. Third, the packaging of DNA into

chromatin provides opportunities for special regulation not available to bacteria.

Eukaryotes use gene regulatory proteins (activators and repressors) to regulate the

expression of their genes just like bacteria. Binding of eukaryotic gene activators to DNA sites

close to the promoter increases the rate of transcription; such sites are termed enhancers. To great surprise,

in 1979, scientists discovered that these activators can act thousands of base pairs away

from the promoter. Moreover, they can influence transcription when bound either

upstream or downstream from it. In this case the DNA between the enhancer and the

promoter loops out to allow the activator proteins bound to the enhancer to come into

contact with proteins bound to the promoter (see Figure 8).

Figure 8 - Transcription initiation by an activator from a distance in a eukaryotic cell [15].

In eukaryotes, the DNA control regions are often spread over a long stretch of DNA,

since some regulatory proteins control gene expression from a distance. For this reason

the gene control region should be defined as the whole DNA stretch involved in

transcriptional regulation, i.e. it should include the promoter, the location of general

transcription factor assembly, and all regulatory sequences to which regulatory proteins

bind to control the rate of the assembly at the promoter.

There are thousands of different gene regulatory proteins, some of which regulate

gene expression by recognizing specific DNA sequences via DNA-binding motifs.

Others do not recognize DNA directly but instead assemble on other DNA-bound

proteins (see Figure 4 and Figure 10). Because different cell types contain different sets of these proteins,

the genes of an organism are turned on or off in a cell-type-specific manner, producing the unique

gene expression pattern that gives each cell type its own characteristics. It is also interesting

to point out, that each gene in a eukaryotic cell is regulated differently from every other

gene. Therefore, given the number of genes and the sheer complexity of regulatory logic,

it has been almost impossible to formulate general rules for gene regulatory

mechanisms.

Most gene regulatory proteins have two domains with distinct functions. One

of the domains contains the motifs that recognize a specific regulatory DNA sequence,

while the other influences the rate of transcription initiation. As shown by biochemists,

the main function of activators is to bind, position, and modify the general transcription

factors and the polymerase at the promoter. This is accomplished in two ways: 1) by acting

directly on the transcription machinery itself, or 2) by changing the chromatin structure

around the promoter region.

As pointed out earlier, eukaryotic gene activators can influence several steps of

transcription initiation, and this has important consequences when

they work together. In many cases the joint effect of several regulators

is the product, rather than the sum, of the effects of the individual regulators. So, if factor X

increases the rate of a certain process 10-fold and another factor Y increases

it 10-fold in a different way, then together they produce a 100-fold

overall increase. In a similar manner, when activators X and Y both help recruit

proteins to some reaction site, their effects multiply. This

behavior of gene activator proteins is called transcriptional synergy [23].
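The arithmetic behind transcriptional synergy can be checked with a short sketch that contrasts a multiplicative combination with a purely additive one. The two helper functions are illustrative assumptions, not a model taken from the cited work.

```python
def synergistic_fold_change(*fold_changes):
    """Combine independent fold changes multiplicatively (synergy)."""
    total = 1.0
    for fold in fold_changes:
        total *= fold
    return total


def additive_fold_change(*fold_changes):
    """Combine fold changes additively, for comparison only."""
    # Each factor contributes (fold - 1) on top of the basal rate of 1.
    return 1.0 + sum(fold - 1.0 for fold in fold_changes)


x, y = 10.0, 10.0
print("synergistic:", synergistic_fold_change(x, y))
print("additive:   ", additive_fold_change(x, y))
```

With two 10-fold activators the multiplicative rule gives a 100-fold increase, whereas an additive rule would give only 19-fold, which is why synergy produces switch-like responses.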

Figure 9 – Transcriptional synergy [15].

Transcriptional synergy is observed both between different upstream-bound activator proteins and

between multiple DNA-bound molecules of the same activator. Therefore, with this collaborative

switch-like mechanism, multiple gene regulatory proteins - each binding to a different

regulatory motif - are responsible for the transcriptional rate control of a eukaryotic gene

(see Figure 9). In summary, a regulatory protein must be

bound to DNA to influence its target promoter, and the rate of transcription depends on

the precise arrangement of regulatory proteins bound upstream and downstream of the

transcription start site.

Up to this point, eukaryotic regulatory proteins have been treated as individual

components of the control mechanism. In reality, though, most are building blocks of

complexes composed of several polypeptides, each with its own function (see Figure 10).

These complexes often require a sequence-specific DNA binding site. In some well-studied

cases, for example, two gene regulatory proteins that each have a weak affinity for DNA on their own

bind cooperatively to DNA with sufficient combined affinity [24]. A particular regulatory

protein usually participates in more than one type of complex, acting neither as an activator nor as a

repressor on its own, but as a component of a regulatory complex whose function is

determined by its final assembly. This assembly depends both on the arrangement of the control region

sequence and on the variety of regulatory proteins present in the cell. In

summary, the assembly of larger complexes of regulatory proteins provides a second

mechanism of combinatorial control, offering a new dimension of

opportunities.

Figure 10 – Eukaryotic regulatory protein complex formation [15].

In Drosophila (the fruit fly), regulatory proteins have been shown to be

positioned at multiple sites along long stretches of DNA, forming multi-component

complexes [25]. They influence the chromatin structure and the recruitment and assembly

of the general transcription machinery at the promoter. With these small cooperative

modules present in the control region, there are vast opportunities for the control

of eukaryotic gene transcription.

Another interesting type of regulatory control is based on the combinatorial

interplay of regulatory proteins bound at the promoter [26]. As an example,

the Drosophila gene ‘eve’ is regulated by two gene activators (Bicoid and

Hunchback) and two repressors (Krüppel and Giant). The relative concentrations of these

four proteins determine whether protein complexes forming at the stripe 2 module turn on

transcription of the ‘eve’ gene. Seven combinations of regulatory proteins activate eve

expression, while many other combinations keep the stripe elements silent. This is an

exciting example of combinatorial control, where a single gene can respond to an

enormous number of combinatorial inputs.

Chapter 3

Methodology

The motivation of this research project was to fundamentally understand transcriptional

gene regulation in the model organism yeast Saccharomyces cerevisiae. The genome of

this organism was selected for initial hypothesis testing, since a great deal is known about

its genetic regulatory mechanism. This is a very complex area of active research, since

first the cis-regulatory elements on the DNA must be accurately mapped out providing

the physical locations of regulatory protein binding sites. Second, it has to be described

how these gene regulatory elements affect expression under different environmental

conditions. Third, since many gene regulatory sites act as complexes, it must be known

which other regulatory proteins bind to them to form functional subunits. Fourth, it must

be accurately described how these regulatory elements/units interact in a combinatorial

manner, i.e. what kinds of general regulatory circuitry and universal combinatorial logic

exist in nature. Finally, it is important to point out that the expression of a gene is the

product of the design of its regulatory region, environmental condition and the abundance

of its regulators, which gives a 3-dimensional scope to the problem (see Figure 11).

Figure 11 – Transcriptional gene regulation as a 3-dimensional space

3.1 The Rise of Synthetic Biology

In the last decade, molecular biology focused on reading and analyzing naturally

occurring DNA sequences as the result of world-wide sequencing efforts. In contrast, our

innovative research project aimed to write new genetic information, thereby creating non-

natural DNA sequences, proteins and biological processes in order to prove a biological

hypothesis. Since protein and DNA sequences have become easier to obtain

electronically through databases than physically from library clones, direct synthesis of

DNA regions of interest is rapidly becoming the most efficient way to make functional

genetic constructs, which enables applications such as genetic engineering. Thus, high-

throughput computational promoter design in a systematic manner provides new means

of dissecting the underlying mechanism of transcriptional gene regulation.

As mentioned earlier, mapping transcription factor binding sites in the regulatory regions of

DNA is a very active challenge in computational biology, which has given rise to dozens of

methods and hundreds of papers, with limited success [27]. Their predictions suffer from high false

positive and false negative rates, they have contributed little to the understanding

of the functional role of regulators, and combinatorial examples are too scarce to support solid

theoretical conclusions. Even though many pioneering mathematical and computational

models were developed, the common position-specific scoring matrix (PSSM) and

comparative phylogenetic conservation methods provide predictive, but not confirmatory,

power in hypothesis testing [7, 28].
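To make the idea of a position-specific scoring matrix concrete, the following minimal sketch builds a log-odds PSSM from a handful of aligned sites and scans a sequence for the best-scoring window. It is a generic textbook construction, not the specific method of the cited references; the example sites, the uniform background frequency of 0.25 and the pseudocount of 1 are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return a list of per-position dicts mapping base -> log2 odds score."""
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(BASES)
        pssm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return pssm


def best_match(sequence, pssm):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pssm)
    scored = [
        (sum(pssm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical aligned binding sites (illustrative only)
example_sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
example_pssm = build_pssm(example_sites)
score, offset = best_match("AAGCTTGACTCATCG", example_pssm)
print(f"best window starts at offset {offset} with score {score:.2f}")
```

A high score only indicates similarity to known sites; as the text notes, such a prediction still needs experimental confirmation.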

Another difficulty to cope with is the magnitude and complexity of the existing

problem. Previous researchers invested great effort in exhaustive promoter

mutagenesis (mutating virtually every base pair in the region of interest), but these studies turned out

to be very time consuming [29, 30]. Many years of work had to be devoted to the

analysis of a single promoter, and even then the results were not as comprehensive as

desired. They only pinpointed a unique regulatory region on the DNA, providing no

global insight into the regulatory mechanism.

Therefore, in order to fully comprehend the transcriptional regulation of genes, we must

understand the mechanisms of regulation, their evolution, and the

role and interaction of contributing factors on a high-throughput, comprehensive scale. It

is essential to know the physical locations of the regulatory binding sites (cis-elements)

on the DNA, their effect on expression under different control conditions, the combinatorial

interplay between gene regulatory proteins bound to control regions, the principles of

promoter spatial organization and finally the functionality of upstream transcription

factors, signaling molecules and chromatin modifiers controlling gene regulation. This is

the ultimate “wish-list” that BASHER, the promoter designer software tries to address by

constructing a series of synthetic promoter regions with mutations based on the well-

studied regulatory principles for hypothesis testing. It is important to point out, that

BASHER is only one of the basic elements in the pipeline of analysis, since it contributes

only to computational promoter design, which results in a list of synthetic DNA

sequences specific to the hypothesis testing of interest in a text file. Each of these

promoters is labeled with a unique non-coding DNA segment (bar-code), which is

produced by one of BASHER’s algorithms. In this way, when the promoters are introduced into yeast and

tested under different stress conditions, their gene expression patterns can be uniquely

identified by these bar-codes in the pool of mRNAs.
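The text does not describe how BASHER generates its bar-codes, so the following is only a sketch of one plausible approach: draw random DNA words and keep those that differ from every previously accepted bar-code at a minimum number of positions, so that a few sequencing errors cannot turn one bar-code into another. The length of 12, the minimum Hamming distance of 3 and all function names are assumptions.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def make_barcodes(n, length=12, min_distance=3, seed=0, max_attempts=100_000):
    """Greedily collect n random DNA bar-codes that are pairwise well separated."""
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    barcodes = []
    for _ in range(max_attempts):
        if len(barcodes) == n:
            return barcodes
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(hamming(candidate, b) >= min_distance for b in barcodes):
            barcodes.append(candidate)
    if len(barcodes) == n:
        return barcodes
    raise RuntimeError("could not find enough bar-codes; relax the constraints")


codes = make_barcodes(20)
print(codes[:3])
```

A production design would probably add further filters (for example, avoiding homopolymer runs or sequences resembling known motifs), which are omitted here.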

3.2 Overall Design of Hypothesis Testing

BASHER's output, a list of synthetic promoter regions of interest (in a text file), is

decomposed into an array of 30-mers with unique flanking regions, each of which is

complementary to the next segment of its construct. These oligonucleotide

segments can then be synthesized industrially using traditional solid-phase array technology

as a collection of microscopic DNA spots attached to a solid surface. The

oligonucleotide chip obtained in this way contains all segments of the synthetic promoters

for the particular experimental design of interest. In the next step, the polymerase chain

reaction (PCR) is used to amplify the pool of 30-mers and join them into their

distinctive promoter constructs [31]. In theory, the synthetic promoter regions are

obtained when the 30-mers line up at their overlapping flanking ends

and are joined together. Once the promoter regions have been constructed successfully, they are

distributed into different yeast cell pools, where, through homologous recombination, the

synthetic promoters are transferred into the genome of the host organism. These

pools are then subjected to different conditions (rich media, amino acid

starvation, osmotic stress, mating factor, etc.), which results in the expression of different

genes in the organism. These gene expression characteristics are directly correlated with

the mRNA content of the cytoplasm, which can be measured by the polony sequencing

technique developed in the Church laboratory [32]. Since the synthetic promoters were

each labeled with a bar-code, the mRNA can be uniquely identified and thus qualitative

and quantitative conclusions can be made about some particular gene regulatory

mechanisms for each promoter construct.
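The decomposition of a designed promoter into overlapping 30-mers, and its reassembly from them, can be sketched as follows. This is a simplified, single-stranded illustration: real assembly uses complementary strands and PCR, and the 10-nucleotide overlap, the function names and the random example sequence are assumptions rather than details given in the text.

```python
import random


def tile(sequence, size=30, overlap=10):
    """Split a sequence into tiles of at most `size` nt that share `overlap` nt."""
    step = size - overlap
    tiles = []
    start = 0
    while True:
        tiles.append(sequence[start:start + size])
        if start + size >= len(sequence):
            break  # the last tile reaches the end (it may be shorter than `size`)
        start += step
    return tiles


def assemble(tiles, overlap=10):
    """Rejoin ordered tiles by merging their shared flanking regions."""
    result = tiles[0]
    for segment in tiles[1:]:
        if result[-overlap:] != segment[:overlap]:
            raise ValueError("adjacent tiles do not share the expected overlap")
        result += segment[overlap:]
    return result


rng = random.Random(42)
promoter = "".join(rng.choice("ACGT") for _ in range(100))  # hypothetical construct
oligos = tile(promoter)
print(len(oligos), "oligos;", "reassembled correctly:", assemble(oligos) == promoter)
```

In practice the overlaps must also be unique within the pool so that segments from different constructs do not join with each other, which this sketch does not check.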

Paired with the gene deletion strains available for Saccharomyces cerevisiae, an

even more informative comparison of gene regulation is possible [30]. As with the

synthetic promoter constructs, naïve (unmodified) bar-coded yeast cells are mixed with

the yeast deletion cell line, which yields, through homologous recombination, a new yeast deletion cell line labeled with the

particular bar-code for each gene of interest. As

before, these are grown under different stress conditions and their mRNA fingerprint is

captured by polony sequencing. Therefore, in the final stage of the comparative analysis

we have two mRNA profiles in hand, one for the synthetic cell line and another for the

yeast deletion cell line. Comparing these profiles makes it possible to validate gene regulatory mechanisms,

which gives the most informative picture of regulatory logic available today. In this

way, we are able to pinpoint synergistic transcription factor cooperation or even

requirements for complex formation at the regulatory region of the DNA⁸.

As a result of this comparative analysis, we obtain lists of cis-regulatory elements and trans-regulatory

proteins, together with their biochemical interactions in gene control under particular cell states.

From these data, we can infer the regulatory logic of the gene(s) of interest, and the data may be

of sufficient quality to support biophysical modeling. By following up on interesting

constructs or interactions, new regulatory mechanisms can be discovered, which can lead

to gene regulatory network reconstruction. Combining results across a number of genes

promises to clarify gene regulatory mechanisms and the basic principles of

cis-regulation. These results can ultimately guide better computational “motif finding” and thus give a

clearer picture of gene regulatory proteins.

8 See flow chart of hypothesis testing in Figure VI in Appendix.

3.3 Designing Promoter Variations

It is vital, when designing synthetic promoter regions, to properly model cis-regulatory

elements, since they are the fundamental building blocks of transcriptional gene

regulation. First, we must establish that a particular DNA segment is a binding site for

a regulatory protein. This is usually done by computational methods validated

by ChIP-on-chip⁹ experiments, which give a list of functionally important elements by

location relative to the promoter start site [33]. To determine what effect a cis-binding site

has, we must consider its location and neighboring sequences by randomly moving

binding sites to different locations on the regulatory DNA sequence. Also, the already

mapped binding motifs should be modified by small base pair substitutions, partial or

even full motif deletions. Finally, in order to understand the comprehensive gene

regulatory protein control, the design should include the addition of new motifs based on

PSSM data available from previous experimental efforts [28].
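The three kinds of modification described above (base substitutions within a motif, motif deletion and relocation of a binding site) can be sketched with simple string operations. These helpers are illustrative assumptions and are not BASHER's actual routines; the promoter and the site coordinates are hypothetical.

```python
def point_mutants(promoter, start, length):
    """All single-base substitutions inside the site at [start, start + length)."""
    variants = []
    for pos in range(start, start + length):
        for base in "ACGT":
            if base != promoter[pos]:
                variants.append(promoter[:pos] + base + promoter[pos + 1:])
    return variants


def delete_site(promoter, start, length):
    """Remove the site completely, shortening the promoter."""
    return promoter[:start] + promoter[start + length:]


def move_site(promoter, start, length, new_start):
    """Cut the site out and reinsert it at new_start (index in the shortened sequence)."""
    site = promoter[start:start + length]
    rest = delete_site(promoter, start, length)
    return rest[:new_start] + site + rest[new_start:]


# Hypothetical promoter with one 6-nt site at position 6
wild_type = "AAAAAA" + "TGACTC" + "AAAAAAAAAA"
print(len(point_mutants(wild_type, 6, 6)), "point mutants")
print("deleted:", delete_site(wild_type, 6, 6))
print("moved:  ", move_site(wild_type, 6, 6, 0))
```

A partial deletion is the same as `delete_site` applied to a sub-range of the motif, and adding a new motif is the insertion half of `move_site`.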

To cover gene regulatory logic, we must decipher how cis-elements combine to

determine gene expression. It is important to know if the regulatory elements act as

activators or inhibitors and if individual elements form bigger, more complex regulatory

modules. If they do, the modules can combine according to various

regulatory modes; for example, they can be in linear, epistatic, synergistic, or switch

relationships with each other. In a modular regulatory arrangement, one needs to know

which element to modify or remove completely from the complex in order to change the effect of the

regulatory mechanism or even turn it off completely.

Another important aspect of synthetic promoter design is the spacing of

regulatory elements on the active DNA segment. As shown previously, different

spatial organizations of transcription factors imply functional and mechanistic

characteristics in gene regulation [34]. The first dimension of control is based on physical

interactions of regulatory elements, which is embodied in the overlap of closely

spaced transcription factor binding sites. This means that some regulatory

proteins, even though they have different functional roles, might share exactly the same or

similar sequence motifs for biochemical binding to the DNA. But regulatory proteins

9 ChIP-on-chip is a technique that combines chromatin immunoprecipitation (ChIP) with microarray technology (chip). It is used to investigate interactions between proteins and DNA in vivo.

might have an effect on the target promoter, which does not necessarily contribute to the

transcriptional layer of regulation, but rather influences the positioning of higher

chromatin structure by generating 3-dimensional occlusions. This transcriptional layer

provides the second dimension of regulatory control. “Blocking” proteins or protein

complexes close or even far away from the regulatory region might activate or inhibit

RNA polymerase initiation and determine the rate of transcription with mechanisms

described in the theory section. In the spatial organization of regulatory elements the

relative distance to the promoter start site also plays a major role. Thus, when inserting

large DNA segments to modify the regulatory hotspot on the promoter, we might infer

protein interaction patterns related to RNA polymerase II positioning. On the other hand,

when inserting small segments, mutating a smaller portion of the regulatory region, we

can draw conclusions regarding nucleosome positioning (see Figure 12).

Figure 12 - Nucleosomes are the fundamental repeating subunit of all eukaryotic chromatin [15].

The modification of nucleosomes by remodeling factors can open up a new DNA region

for transcription if the new cell state requires the expression of different genes. In this way,

nucleosomes play an essential role in the efficiency of RNA polymerase transcription, since they

prevent the polymerase from unnecessarily accessing the promoter regions of genes that are not needed

by the cell in that state.

As pointed out before, there are many transcription factor binding sites on the 5’

upstream region of promoters, which interact according to the cell’s needs of gene

transcription. Beer and Tavazoie [9] showed that the base pair distances

between these regulatory elements and the order and orientation of protein binding play a

vital role in the regulatory mechanism. Their key findings were that there is a great

deal of redundancy in the modes of transcriptional regulation (OR logic), many factors

require at least one partner to be functional (AND logic) and one mode of combinatorial

regulation is the absence of a factor that would cause a different mode of regulation

(NOT logic). Therefore, systematically moving pairs of regulatory sites closer to or farther

away from each other influences gene expression patterns under different stress

conditions of the cell. Also, when the orientation of some of the transcription factor binding sites on

the DNA was inverted, a different mRNA production response was obtained. Similarly,

changing the relative order of two or more regulatory elements produced

significant deviations from the original expression levels.
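Variation of spacing, orientation and order for a pair of binding sites can be sketched as below. Inverting a site means replacing it with its reverse complement. The two motifs, the poly-A spacer and the function names are assumptions made for illustration; a real design would place the pair within its native promoter context.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def build_pair(site_a, site_b, spacing, flip_b=False, swap=False, spacer_base="A"):
    """Arrange two sites with a given spacing, optional inversion and optional swap."""
    if flip_b:
        site_b = reverse_complement(site_b)  # invert the orientation of site B
    first, second = (site_b, site_a) if swap else (site_a, site_b)
    return first + spacer_base * spacing + second


# Hypothetical motifs; generate a small spacing series in both orientations
site_a, site_b = "CACGTG", "TGACTC"
for spacing in (0, 5, 10):
    for flip in (False, True):
        print(spacing, flip, build_pair(site_a, site_b, spacing, flip_b=flip))
```

Each printed construct differs from the others in exactly one design variable, which makes it possible to attribute a change in expression to spacing or orientation alone.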

Figure 13 – Gene regulatory logic inferred from motif sequence and expression pattern

Chapter 4

Results

My main contribution to the overall collaborative Harvard project was to develop

BASHER, the fundamental software for synthetic promoter region design, which

incorporates multiple built-in functionalities for DNA sequence modification based on

previous research results. It was written in the freely available, open-source programming

language Perl (version 5.8.6) because of its strong string manipulation capabilities. The

synthetic promoters described in this thesis were designed on a 1.0 GHz Intel Pentium III

PC with 512 MB of memory, running the Fedora 1.0 Linux operating system. BASHER

uses multiple databases as input: ortholog promoter sequences and global transcription

factor binding site maps of the organism of interest and allows users to move through the

process of design in a series of modules that address practical issues surrounding

oligonucleotide design (see Figure 14). BASHER is a useful tool for computationally

experienced investigators who wish to optimize protein expression and/or redesign their

promoter of interest for detailed structure or functional studies.

Figure 14 – Flow chart of BASHER architecture for input/output variants

4.1 BASHER Preliminaries

The objective of BASHER is to provide well-designed promoter sequences for the gene

of interest (YFG). Combinatorially, this is a very complicated task, since the total number

of possible sequence variants of a single 600 base pair long promoter is 4^600 when

every position is allowed to carry any nucleotide. This number is infeasibly large and

exceeds the computational capacity of even the most powerful computers of today. Thus, the

solution to this problem is to develop an interactive tool operated by biologists, which

automatically provides “smart” promoter designs founded on the results of preliminary

research on transcriptional regulation.
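The size of this design space is easy to verify with exact integer arithmetic. The short sketch below contrasts the number of all possible 600-nucleotide sequences with the much smaller number of variants that differ from the original at exactly one position; the variable names are illustrative.

```python
promoter_length = 600

# Every position can hold any of 4 nucleotides: 4^600 sequences in total.
all_sequences = 4 ** promoter_length

# Exactly one position changed to one of the 3 other nucleotides.
single_substitutions = 3 * promoter_length

print("digits in 4^600:", len(str(all_sequences)))
print("single-substitution variants:", single_substitutions)
```

Exhaustive single-site mutagenesis is therefore tractable in principle, but any design that varies many positions at once must be guided by prior knowledge, which is the role of the "smart" designs mentioned above.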

There are six major sources of motifs available in the literature, two of which are used

as input into BASHER. The motif compendiums came from various collaborators

involved in transcription factor mapping in the Saccharomyces cerevisiae and other yeast

genomes. The first compendium was obtained by computational, motif discovering

techniques based on genome-wide chromatin immunoprecipitation data by Fraenkel [34].

The second was produced by the National Laboratory of Protein Engineering and Plant

Genetic Engineering in Germany with the database named TRANSFAC [35]. The third

compendium is based on Kellis’ research of genome-wide comparatrative analysis of

three, closely related yeast species [36]. The fourth motif compendium was compiled by

automated, comprehensive analyses of promoter regulatory motifs based on expression

coherence by Lapidot [37]. The fifth database was generated by p-value statistical

analysis derived from probabilistic graph theory developed by Friedman [38]. Finally,

the sixth database was obtained by Tanay using bicluster analysis of enriched motifs of

previously published results from heterogeneous experimental techniques [39].

Unfortunately, these motif databases are far from reliable and complete, thus – as an

initial step - it is necessary to filter out and choose between redundant motifs.

Even though we have these comprehensive motif compendiums available, we

must select the one that most realistically models regulatory binding sites at the

upstream regions of the promoters. The motif deciphering poses a fundamental barrier,

since there are no accurate and systematic computational / experimental validating

methods available in the scientific community. Therefore, our research group decided to

use motif compendiums [34] and [38] as defaults in BASHER, since they provided the

most comprehensive motif coverage and a unique probabilistic method showing great

potential. Based on user’s configuration parameters, BASHER is able to run its design

functionalities on each individual and combined data set. In this way, the investigator is

able to compare and combine transcription factor binding locations founded on

independent theoretical techniques. It must be noted, that in the evolution of BASHER

these motif compendium repertoires were updated multiple times following the latest

developments in motif discovery. With this needed flexibility in mind, BASHER was

designed in a way that in the future compendium upgrades are easily implementable.

As mentioned before, BASHER also needs an ortholog promoter sequence library in

order to use them as templates for synthetic promoter region manipulation. We used the

Saccharomyces Genome Database (SGD) project data [40] which collects information

and maintains an up-to-date database of the molecular biology of the yeast

Saccharomyces cerevisiae. Thus, about 8000 ortholog promoter sequences – permanently

stored in a sub-directory of BASHER - were available for input in standard FASTA

format in each experimental promoter design. This simple, text-based format contains

promoter sequences, in which base pairs are represented with single-letters (A, C, T, and

G). The format also allows for sequence names and comments to precede the sequences,

which makes it easy to manipulate and parse sequences using the Perl scripting language.
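The FASTA layout described above can be read with a few lines of code. The following is a minimal Python sketch rather than BASHER's own Perl parser; the function name `parse_fasta` and the two example records (with made-up sequences) are assumptions for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record and opens a new one.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped and in mixed case.
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records

# Made-up example records; the sequences are not real promoter data.
example = """>FUS1 made-up example sequence
ACGTTGCA
ggctaacg
>STE12 made-up example sequence
TTGACCGA
"""
promoters = parse_fasta(example)
```

Only the first word of each header line is kept as the record name; any trailing comment on the header line is ignored in this sketch.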

4.2 BASHER Structure

BASHER’s structure reflects the data sources at hand. That is, it has a blindly

functional part that manipulates the promoter sequences without prior biological

knowledge of the regulatory regions and transcription factors involved in transcriptional

regulation. To perform synthetic promoter region design the software does not need to

have an input which specifies the cis-regulatory elements, thus performing systematic,

combinatorial string manipulation.

The second structural level of BASHER is based on biological data already

discovered by collaborating research groups on transcriptional regulation in yeast. The

software must use these PSSM compendiums – allocated in a specific library in the main

frame – to perform data mining of cis-regulatory element binding sites on the promoter.

In other words, from raw computational / experimental data, it constructs a cis-regulatory

map (one for each promoter of interest), which is used in all functionalities built into this

layer of computational promoter design. Using these regulatory maps, one of

BASHER’s unique features is the visualization of transcription factor binding

sites on the promoter of choice in a graphical interface. According to the current literature, this

has not been done before, which places BASHER in a novel category of design software for

computational biologists interested in transcriptional regulation. This second structural

level, with prior knowledge of cis-regulatory elements, performs various

mutations on the promoter region based on regulatory logic and cis-element pair analysis.

These mutation algorithms result in a set of newly created synthetic promoter regions in

text format, which are the basis of our hypothesis testing as described before. Since

BASHER is itself a software prototype, it was designed so that it can be easily

extended with other capabilities using the same or different resources. Thus, the

flexible main frame is easily upgradeable to incorporate updated biological data and/or

function libraries.

4.2.1 Configuration File

In order to run BASHER a configuration file [config.txt] must be specified and placed in

the main directory, which contains the configuration data required by all scripts. These

configuration parameters can be changed by the investigator, so each promoter

sequence is designed to the investigator's specifications. The only input to BASHER

is therefore the configuration file, which after evaluation results in a set of modified, synthetic

promoter sequences written to a text file (see Figure 15).

Figure 15 – Flow chart of BASHER input/output requirements

For each promoter variant, the mutation steps and locations are documented. This way,

when the investigator finds an interesting construct with a unique gene expression

pattern, they can pinpoint exactly which change caused that response. Then, with another

slight modification of that combinatorial mutation, new regulatory mechanisms can be

retrieved from the expression data of synthetic regulatory regions.

At the beginning of every program data retrieval occurs from ‘config.txt’. If required

by the procedure, the particular keyword – surrounded by brackets [KEYWORD] – is

pattern-matched and checked for input validity. This means each input field in use must

have a specific data type; otherwise an error message will be generated prompting for

correct input format. After this step, the next field is read, checked again and stored in an

input hash corresponding to that keyword. The configuration file typically contains fields

specifying certain directory locations (base directory, promoter library, PSSM matrix

library, etc.), mutation algorithm arguments (kmer length, cis-element distance threshold,

overlap gap, etc.) and running mode designators (unit definition modes, module

definition modes) as shown below in Figure 16. Thus, the newly created input hash will

act as a memory module for any scripts to retrieve input configurations for the promoter

design of interest. This way, input has to be read in only once from the text file via I/O.

Figure 16 – Partial list of configuration file parameters.
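The read, validate and store cycle described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl code of BASHER: the one-line-per-field layout `[KEYWORD] value`, the `#` comment convention, and every keyword in `SCHEMA` except `Overlap_gap` (which the text names) are hypothetical.

```python
import re

# Hypothetical schema: keyword -> type used to validate and convert its value.
SCHEMA = {
    "BASE_DIR": str,
    "KMER_LENGTH": int,
    "Overlap_gap": int,
    "LOG_THRESHOLD": float,
}

# Assumed line layout: a bracketed keyword followed by its value.
LINE_PATTERN = re.compile(r"^\[(\w+)\]\s*(.+?)\s*$")

def read_config(text, schema=SCHEMA):
    """Return a dict (the 'input hash') of validated configuration values."""
    config = {}
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE_PATTERN.match(stripped)
        if match is None:
            raise ValueError(f"line {number}: expected '[KEYWORD] value'")
        keyword, raw = match.groups()
        if keyword not in schema:
            raise ValueError(f"line {number}: unknown keyword [{keyword}]")
        try:
            config[keyword] = schema[keyword](raw)
        except ValueError:
            expected = schema[keyword].__name__
            raise ValueError(
                f"line {number}: [{keyword}] needs a value of type {expected}"
            ) from None
    return config
```

Because every script can receive the returned dictionary, the text file only has to be read once, which mirrors the role of the input hash described above.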

4.2.2 Mutagenesis Module

The goal of the first structural module of BASHER is to pinpoint all active cis-elements

and elements with functional relevance in the promoter. One of the scripts mimics a

typical genetic technique called scanning mutagenesis, used in the laboratory by

microbiologists when trying to determine the transcriptional hotspots in the DNA region

of interest. The method behind this technique is to remove kmers (short segments of

DNA) and ultimately detect a change in the expression pattern of the gene under investigation.

When systematically performing mutations, sliding window iteration is used, where the

kmer length and sliding window size are adjustable (see Figure 17). As noted before, these

algorithm parameters (or arguments) can be modified as the investigator wishes

in the configuration file, thus providing great flexibility in high-throughput synthetic

promoter generation.

Figure 17 – Partial list of configuration file parameters.

There are three types of sequence mutations available in BASHER’s repertoire of

algorithms. The first mutation is called permutation, which randomly permutes a given

length of kmer in the promoter, i.e., k-length sub-sequences are randomly shuffled using

the sliding window method described above. This “weak” mutation conserves the GC

content¹⁰ of the promoter region and thus, in theory, does not cause major disturbances in the genome.

This computational script mimics naturally occurring DNA

replication discrepancies, providing a simple modeling option for promoter design. [For

detailed algorithm description see ‘permute.pl’ in the Manual.]
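The permutation mutation can be illustrated with a short Python sketch. It is not a transcription of ‘permute.pl’; the function names, the `step` parameter for the window offset and the fixed random seed are assumptions made for this example.

```python
import random

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def permute_window(promoter, start, k, rng):
    """Shuffle the k bases starting at 'start'; the rest stays unchanged."""
    window = list(promoter[start:start + k])
    rng.shuffle(window)
    return promoter[:start] + "".join(window) + promoter[start + k:]

def permutation_variants(promoter, k, step, seed=0):
    """One variant per sliding-window position, as (start, variant) pairs."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    variants = []
    for start in range(0, len(promoter) - k + 1, step):
        variants.append((start, permute_window(promoter, start, k, rng)))
    return variants

# Made-up 20 bp example sequence.
promoter = "ACGTTGCAGGCTAACGTTAG"
variants = permutation_variants(promoter, k=6, step=3)
```

Since shuffling only reorders the bases inside the window, the base composition, and with it the GC-content, is identical for every variant. Note that a shuffle can occasionally return the window unchanged, for example when all bases in the window are the same.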

10 Guanine-cytosine content (GC-content) is a characteristic of the genome of an organism or a piece of DNA. Usually expressed as a percentage, it is the proportion of GC base pairs in the genome of interest. The remaining fraction provides the AT-content (adenine-thymine content). For example, 58% GC-content = 42% AT-content.

Figure 18 – Partial permutation script output for gene FUS1 with specified changes.

Before introducing the second mutation type, the proper mathematical definition of

position-specific scoring matrix (PSSM) is necessary, a commonly used representation of

motifs (patterns) in biological sequences. The PSSM is a matrix of score values that gives

a weighted match to any given substring of fixed length. It has one row for each symbol

of the alphabet, and one column for each position in the pattern. The score assigned by a

PSSM to a substring $s = (s_j)_{j=1}^{N}$ is defined as $\sum_{j=1}^{N} m_{s_j,j}$, where $j$ represents position in

the substring, $s_j$ is the symbol at position $j$ in the substring, and $m_{\alpha,j}$ is the score in row $\alpha$,

column j of the matrix. In other words, a PSSM score is the sum of position-specific

scores for each symbol in the substring. [http://en.wikipedia.org/wiki/PSSM]

A PSSM assumes independence between positions in the pattern, as it calculates

scores at each position independently from the symbols at other positions. The score of a

substring aligned with a PSSM can be interpreted as the log-likelihood of the substring

under a product multinomial distribution. The PSSM scores can also be interpreted in a

physical framework as the sum of binding energies for all nucleotides aligned with the

PSSM.
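The scoring rule defined above translates directly into code. The sketch below is a Python illustration in which the matrix is stored as one list of column scores per symbol; the toy matrix values are invented and do not come from the compendium of [34].

```python
def pssm_score(pssm, substring):
    """Sum of the position-specific scores m[s_j][j] over the substring."""
    width = len(pssm["A"])
    if len(substring) != width:
        raise ValueError("substring length must equal the PSSM width")
    return sum(pssm[symbol][j] for j, symbol in enumerate(substring))

def best_window(pssm, sequence):
    """Return (score, offset) of the highest-scoring alignment in a sequence."""
    width = len(pssm["A"])
    scored = [
        (pssm_score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Toy 3-column matrix with invented scores that favours the substring "ACG".
toy_pssm = {
    "A": [1.0, -1.0, -2.0],
    "C": [-1.0, 2.0, -1.0],
    "G": [-2.0, -1.0, 1.5],
    "T": [0.0, -0.5, -1.0],
}
```

Here `pssm_score(toy_pssm, "ACG")` adds the entries 1.0, 2.0 and 1.5. Each position contributes independently of its neighbours, which is the independence assumption stated above.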

In our model, the PSSM matrices were obtained from one of our collaborators at

MIT, which represent a collection of 204 cis-regulatory binding motifs. These regulatory

sequences were computationally derived as demonstrated in [34]. These matrices can be

graphically represented as sequence logos, where at each position the size of each residue is

proportional to its frequency in that position compared to background frequency (see

Figure 19).

Figure 19 – Graphical representation of a motif.

The second mutation is called randomization, which randomly generates a log-value

proof kmer, i.e., a short (4-10 base pair) DNA sequence that is checked against all

PSSM matrices to satisfy a constant log-value threshold of 0.005 (default) or as specified

by the user in the configuration file. This is meant to ensure that a functionally neutral sequence

is created, which acts as a “strong” mutation when systematically inserted into the

promoter region using the sliding window method, similarly to the “weak” mutation (see

Figure 19). The GC-content is not conserved in this case, so the small region is

completely erased while the original promoter length stays the same. This computational

script mimics a procedure frequently used in microbiology, as in yeast deletion strains,

blocking the functional region out completely. [For detailed algorithm description see

‘randomize.pl’ in the Manual.]
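A rejection-sampling sketch of the randomization step is given below in Python. It is an interpretation, not the algorithm of ‘randomize.pl’: it assumes that a kmer counts as neutral when every full alignment of every PSSM inside the kmer scores below the threshold, and it skips matrices that are longer than the kmer. All function names and the toy matrix are hypothetical.

```python
import random

def pssm_score(pssm, substring):
    """Sum of position-specific scores for a substring of PSSM width."""
    return sum(pssm[symbol][j] for j, symbol in enumerate(substring))

def max_motif_score(pssm, kmer):
    """Best score over all full alignments of the PSSM inside the kmer."""
    width = len(pssm["A"])
    if width > len(kmer):
        return None  # assumption: matrices longer than the kmer are skipped
    return max(
        pssm_score(pssm, kmer[i:i + width])
        for i in range(len(kmer) - width + 1)
    )

def is_neutral(kmer, pssms, threshold):
    """True if no PSSM alignment inside the kmer reaches the threshold."""
    for pssm in pssms:
        score = max_motif_score(pssm, kmer)
        if score is not None and score >= threshold:
            return False
    return True

def random_neutral_kmer(k, pssms, threshold=0.005, seed=0, max_tries=10000):
    """Draw random kmers until one passes the neutrality check."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        kmer = "".join(rng.choice("ACGT") for _ in range(k))
        if is_neutral(kmer, pssms, threshold):
            return kmer
    raise RuntimeError("no neutral kmer found within max_tries")

def insert_kmer(promoter, start, kmer):
    """Overwrite the promoter at 'start' with the kmer; length is preserved."""
    return promoter[:start] + kmer + promoter[start + len(kmer):]

# Toy matrix with invented scores that favours "ACG".
toy_pssm = {
    "A": [1.0, -1.0, -2.0],
    "C": [-1.0, 2.0, -1.0],
    "G": [-2.0, -1.0, 1.5],
    "T": [0.0, -0.5, -1.0],
}
```

Because the kmer overwrites the window instead of being spliced in, the promoter keeps its original length, as the text requires.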

Figure 19 – Partial randomize script output for gene FUS1 showing random kmer insertion.

Therefore, when performing both mutations in parallel in an experimental promoter

design we expect four possible outcomes with different biological meanings. 1) Case (--):

If neither of the mutations had any effect on the gene expression pattern, then the

promoter region must be a non-functional sequence. 2) Case (++): If both mutations had

an effect on the mRNA levels due to transcriptional changes, then it shows that the

promoter region is a possible hotspot for cis-regulatory element locations. 3) Case (-+): If

permutation does not have an effect, but on the other hand randomization changes gene

expression, then we conclude that the randomly generated kmer is a newly discovered

motif, since it plays a regulatory role. 4) Case (+-): This is a highly unlikely

scenario, since the “strong” mutation does not show a regulatory effect while the “weak”

one does. This is very unrealistic in biological systems and should be treated as a

low-probability (or even non-occurring) scenario.
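The four cases above amount to a small lookup table. The Python sketch below encodes them as stated in the text; the names `INTERPRETATION` and `interpret` are illustrative and not part of BASHER.

```python
# Keys are (permutation had an effect, randomization had an effect).
INTERPRETATION = {
    (False, False): "non-functional sequence",
    (True, True): "possible hotspot for cis-regulatory elements",
    (False, True): "inserted random kmer may be a newly discovered motif",
    (True, False): "highly unlikely; treat as a low-probability scenario",
}

def interpret(permutation_effect, randomization_effect):
    """Map the two observed outcomes to the interpretation given in the text."""
    return INTERPRETATION[(bool(permutation_effect), bool(randomization_effect))]
```

For example, `interpret(True, True)` corresponds to case (++).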

The third built-in mutation type available in BASHER is ‘scan’, which was

implemented because of a well-known combinatorial transcriptional mechanism

present in yeast. It has been shown that there are promoters in which four

copies of the STE12 motif (each varying by 1 base pair) are present, two of which are

enough for a transcriptional response, while three active sites produce the full response.

Because of the presence of this combinatorial control, we scan the promoter regions with

a sliding window and map the locations of similar kmer occurrences. The strength of

similarity can be refined by the user in the configuration file specifying any base pair

differences in the motif. After the mapping algorithm is finished, we randomize a kmer

corresponding to the motif length of interest via the randomization script described

above. Then we replace these frequently occurring motifs with the same random

sequence safeguarding against STE12-like combinatorial control (see Figure 20). With

this type of promoter mutation the investigator has the option to filter cis-regulatory

complexes that were not caught during the first two filtration processes. [For detailed

algorithm description see ‘scan.pl’ in the Manual.]
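The two steps of ‘scan’, mapping similar kmers and then replacing them with one common sequence, can be sketched in Python as follows. This is an illustration under assumptions rather than the content of ‘scan.pl’: similarity is measured here as the number of mismatching positions, and the promoter and the replacement kmer are made up. The query TGAAACA is the commonly cited STE12 recognition sequence.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_similar(promoter, motif, max_mismatches):
    """Start offsets of all windows within max_mismatches of the motif."""
    k = len(motif)
    return [
        i for i in range(len(promoter) - k + 1)
        if hamming(promoter[i:i + k], motif) <= max_mismatches
    ]

def replace_occurrences(promoter, positions, replacement):
    """Overwrite every listed occurrence with the same replacement kmer."""
    seq = list(promoter)
    k = len(replacement)
    for start in positions:
        seq[start:start + k] = replacement
    return "".join(seq)

# Made-up promoter containing one exact and two 1-mismatch copies of the query.
motif = "TGAAACA"
promoter = "CC" + "TGAAACA" + "GG" + "TGAAACG" + "TT" + "AGAAACA" + "CC"
hits = find_similar(promoter, motif, max_mismatches=1)
# "CTCTCTC" stands in for a kmer produced by the randomization script.
mutated = replace_occurrences(promoter, hits, "CTCTCTC")
```

If two hits overlap, the later replacement overwrites part of the earlier one in this sketch; a fuller implementation would have to decide how to resolve such cases.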

Figure 20 – Combinatorial control demonstrated by STE12 via promoter visualization.

4.2.3 Combinatorial Module

The second structural level of BASHER is based on biological data already discovered by

collaborating research groups on transcriptional regulation in yeast. The software uses the

cis-regulatory element maps and the ortholog promoter sequences to perform data mining

of regulatory binding sites in the promoter. Both of these sources are needed to generate

the synthetic DNA regions of S. cerevisiae, since the cis-regulatory maps are based only

on the relative motif start site location in the corresponding promoter. Thus, first we need

to retrieve the motif coordinates from the maps then locate them on the ortholog

promoters for further sequence manipulation. This mapping procedure is performed by

‘factors’, which is the fundamental data mining script contained in all sequence

modifying algorithms in the combinatorial module. So, the purpose of this program is to

output transcription factor binding site data given cis-element maps and promoter regions

as inputs (see Figure 21).

Figure 21 – Flow chart of BASHER’s ‘factors’ data mining I/O requirements

The output results in a text file with specific transcription factor binding site parameters

like name, orientation, relative offset, location on chromosome, P-value, motif sequence

(see Figure 22 & 23). These parameters are used in the string manipulation part of the

algorithm to locate the motif on the promoter of interest.

Figure 22 – Partial list of transcription factors bound by gene STE12.

Figure 23 – Visualization script showing partial STE12 promoter with transcription factors

4.2.4 Promoter Visualization

Using the result of ‘factors’, one of BASHER’s unique features is the

visualization of transcription factor binding sites in the promoter of choice in a graphical

interface. The Perl Tk module was used for the graphical features of BASHER, so it

must be installed from CPAN if it is not already present.

Depending on user filtering selection in the configuration file based on evolutionary

conservation and/or log-value threshold, the corresponding regulatory motifs can be

displayed via this script. The bound transcription factors (TF) are shown in the same

color as the cis-element coloring in the sequence and are aligned with the starting point of

the site (see Figure 24).

Figure 24 – Major differences in the cis-regulatory maps for the same FUS1 promoter based on

two techniques of motif mapping: (left) Friedman group’s regulatory map based on statistical

analysis, (right) Fraenkel group’s map based on regulatory protein binding strength

The factors are denoted by their name and designated orientation (e.g., YAP6-) on the DNA

strand, where plus denotes the Watson strand and minus the Crick strand of the helix. If

two or more TF binding sites have the same offset, the TF names are displayed in the same

color under each other, like DIG1 and STE12 on line [1-80] highlighted in purple. If two

or more TF binding sites overlap, the bound TFs are displayed in separate rows under

each other in their assigned colors, like PHO4-RTG3-CBF1-INO2 {blue-gold-purple-

blue} complex on line [161-240]. At the beginning of each promoter segment, the relative

offset location is denoted in brackets.

This open-source script was frequently used by my research group members to take

an initial look at the promoter region and its potential control regions, thus contributing to

the “smart” series of the synthetic promoter design.

4.2.5 Regulatory Combinatory

In this BASHER module, we developed algorithms investigating the fundamental

combinatorial interplay between cis-regulatory elements in the upstream region of the

promoters. We wish to find out what types of functions promoters “calculate” and how

these “calculations” are performed, resulting in a particular gene expression pattern. The

relationships that exist between cis-regulatory elements must be mapped out based on

the biological data available. It must be noted, however, that there is also a complex control

mechanism between cis- and trans-regulatory elements, which should be looked at as a

future extension of the software. As shown by previous research efforts [41], there are

different types of interactions between cis-elements. In one case, under one

environmental condition there is no interaction between elements, but in another state of the

cell the elements take an active part in regulatory control. Functional relationships can be

embodied in additive or opposing logic. As part of opposing logic, inhibiting and/or

occluding factors can play a major role when they are bound to the regulatory region of the

DNA. Other types of regulatory logic might include epistatic (OR) and/or synergistic

(AND) interactions between elements; after detailed analysis, a regulatory network

can be constructed as a visual representation of these interactions (see Figure 25). When

deciphering combinatorial interplay, it should also be known how many cis-elements

(pairs, triplets, or beyond) interact with each other as part of this regulatory circuitry.

Since regulatory modules exist that are composed of a set of transcription factor binding sites, it

is important to break them down to the smallest regulatory blocks and compare them to

unique combinations of pair-wise analysis. In this way, fundamental regulatory rules can

be deduced and used later in the synthetic promoter region design as optional mutation

parameters.

Figure 25 – Simple, Boolean regulatory logic of cis-elements [template.bio.warwick.ac.uk]

4.2.6 Defining Cis-Modules

In BASHER computational methods are used to define cis-regulatory modules in the

promoter, which target clusters of potentially interacting elements during gene regulation.

An algorithm has been developed, which gives some flexibility in the module definition

specified by the user in the configuration file. There are three running modes of cis-

module definition: 1) every copy of a motif, 2) repeated copies of the same motif, and 3)

copies of the same motif within a certain base pair distance from each other are treated as a distinct

unit (see Figure 26). These unit definitions are based on copies of the same motif, which

model biological modules already discovered by molecular biologists. Extending

module definition capabilities, BASHER can define a module as a combination of motifs

specified by the investigator. As described, this software has a great ability for flexible

cis-regulatory module definition, which is essential in the efforts of understanding gene

regulation.

Figure 26 – Unit definition flexibility cis-element TEC1 in UME6 gene regulatory region
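The three unit definition modes can be expressed compactly in code. The Python sketch below follows the numbering of the list above; the function name `define_units` is hypothetical, and for mode 3 it assumes that the distance is measured between the start sites of consecutive copies, so that a chain of nearby copies forms one unit.

```python
def define_units(positions, mode, max_distance=None):
    """Group the start offsets of one motif's copies into units.

    mode 1: every copy is its own unit
    mode 2: all copies of the motif together form one unit
    mode 3: consecutive copies at most max_distance bp apart share a unit
    """
    positions = sorted(positions)
    if not positions:
        return []
    if mode == 1:
        return [[p] for p in positions]
    if mode == 2:
        return [positions]
    if mode == 3:
        if max_distance is None:
            raise ValueError("mode 3 requires max_distance")
        units = [[positions[0]]]
        for p in positions[1:]:
            if p - units[-1][-1] <= max_distance:
                units[-1].append(p)  # close enough: extend the current unit
            else:
                units.append([p])    # too far: start a new unit
        return units
    raise ValueError("mode must be 1, 2 or 3")
```

For instance, with copies at offsets 10, 25, 120 and 300 and a distance of 20, mode 3 groups the first two copies and leaves the other two as singleton units.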

4.2.7 Pair Mutagenesis

When investigating cis-regulatory element interactions, the most fundamental and

combinatorially simplest base comparison is the pair-wise analysis. The regulatory motifs

can be located in various geometrical positions in the upstream regions, one of which is

the overlapping motif pair scenario. In this case, the strategy is to remove one element

of the pair while keeping the other unchanged, and vice versa. Another built-in

pair mutagenesis algorithm removes each possible cis-regulatory pair in

the promoter according to the mutation types available in BASHER. Finally, the

rest of the scripts in this module deal with the relative distances and orientations of the elements

of pairs in the control regions. They look for a peaked distribution of distances, or peaked

co-expression at a specified distance, for the pairs under investigation. Similarly to regulatory

element distances, orientation and order patterns can be investigated in the same fashion

providing another unique synthetic promoter design option for the user of this software.

4.2.8 Overlapping Pairs

This algorithm first scans the promoter region of interest and determines the exact

locations of overlapping transcription factor pairs according to ‘overlap’ definition. There

are three overlap type definitions available in BASHER, which are handled differently in

the script. The potential transcription factor overlap positioning can occur as follows:

1. TYPE: Fake (overlap)

Def.: If two motifs lie within a certain distance – [Overlap_gap]¹¹ – of each other, then

consider them as a type fake overlap.

2. TYPE: Real-> I (overlap)

Def.: If two motifs overlap such that there is one overlapping and only one non-overlapping

fragment (from either motif’s point of view), then consider them as

a type real->I overlap.

11 Overlap gap parameter [Overlap_gap] can be modified in the configuration file by the user.

3. TYPE: Real-> II (overlap)

Def.: If two motifs overlap such that there is only one overlapping and two non-overlapping

fragments (from the point of view of at least one of the two motifs), then consider

them as a type real->II overlap.
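One way to read the three definitions above is in terms of intervals: count how many non-overlapping flanks each motif keeps outside the shared region. The Python sketch below implements this reading. It is an interpretation rather than BASHER's code; motifs are given as half-open (start, end) intervals, and the labels "none" and "identical" for cases the definitions do not cover are assumptions.

```python
def classify_overlap(a, b, overlap_gap):
    """Classify two motif intervals (start, end), end exclusive."""
    a_start, a_end = a
    b_start, b_end = b
    ov_start = max(a_start, b_start)
    ov_end = min(a_end, b_end)
    if ov_start >= ov_end:
        # No shared bases: a 'fake' overlap if the gap is small enough.
        gap = ov_start - ov_end
        return "fake" if gap <= overlap_gap else "none"

    def flanks(start, end):
        # Number of non-overlapping fragments this motif has (0, 1 or 2).
        return int(start < ov_start) + int(end > ov_end)

    most = max(flanks(a_start, a_end), flanks(b_start, b_end))
    if most == 2:
        return "real->II"  # one motif extends past the other on both sides
    if most == 1:
        return "real->I"   # staggered overlap with a single flank
    return "identical"     # same interval; not covered by the definitions
```

Under this reading, a staggered pair such as (0, 10) and (5, 15) is a real->I overlap, while a motif at (5, 12) lying strictly inside one at (0, 20) gives a real->II overlap.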

Then, in every overlapping motif pair, motif A is “knocked out” using PSSM values, while

motif B is conserved [as it was observed originally], and vice versa (see Figure 27).

The “knock out” algorithm, called ‘remove motif’, replaces the overlapping region with

the most extreme motif mutation, which has the lowest log-value possible for that

particular case, i.e., at each PSSM position in the non-overlapping motif fragment the

base with the smallest entry value is chosen, while in the overlapping fragment the best base-pair

entry combination is chosen¹². This way, in each overlap mutation

direction one cis-element is damaged while the other is conserved, since it is in theory

left untouched. This script automatically generates a log file in PDF format, where all overlap

mutation steps are documented for each cis-element pair of interest. It includes

before/after motif sequence modification statistics with sequence and log-value changes.

It also contains the sequence logo for the particular cis-elements based on the default PSSM

matrices (see Figure 28). This log file is very useful for the investigator, since a visual

representation of the modifications is provided via sequence logos instead of matrices,

which are hard to read.
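A simplified version of the knock-out can be sketched in Python. It is not the algorithm of ‘remove_overlap.pl’: in the non-overlapping fragment of motif A it picks the lowest-scoring base of each PSSM column, as described above, but in the fragment shared with motif B it simply leaves the observed bases unchanged, which is a simplifying assumption standing in for the "best base-pair entry combination". The function name and the toy matrix are hypothetical.

```python
def knock_out(promoter, pssm_a, a_start, b_start, b_end):
    """Damage motif A outside its overlap with motif B.

    pssm_a: dict base -> list of column scores for motif A
    a_start: offset of motif A in the promoter
    b_start, b_end: half-open interval occupied by motif B
    """
    width = len(pssm_a["A"])
    seq = list(promoter)
    for j in range(width):
        pos = a_start + j
        if b_start <= pos < b_end:
            continue  # shared with motif B: keep the observed base
        # Lowest-scoring base for column j of motif A's matrix.
        seq[pos] = min("ACGT", key=lambda base: pssm_a[base][j])
    return "".join(seq)

# Toy matrix with invented scores that favours "ACG".
toy_pssm_a = {
    "A": [1.0, -1.5, -2.0],
    "C": [-1.0, 2.0, -1.0],
    "G": [-2.0, -1.0, 1.5],
    "T": [0.0, -0.5, -1.0],
}
# Motif A occupies offsets 2-4 ("ACG"); motif B is assumed to occupy 4-6.
damaged = knock_out("TTACGTT", toy_pssm_a, a_start=2, b_start=4, b_end=7)
```

In this toy case the score of motif A drops from 4.5 to -2.0, while every base inside motif B's interval is left as observed.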

12 For more detailed algorithm steps see ‘remove_overlap.pl’ in Manual.

Figure 27 – SPT23-STE12 cis-element overlap and matrix entry “knock out” schematics

Figure 28 – Statistics of PHD1-SKN7 cis-element overlap removal

4.2.9 Module Mutagenesis

As described before, regulatory modules play a major role in transcriptional regulation

[43]. Thus, it is crucial to understand their functional role as singleton elements or a list

of interacting regulatory complexes. A particular module of BASHER was developed for

the investigation of these cis-regulatory complexes, which has multiple built-in promoter

mutation algorithms available.

The first focuses on the functional role of the module. 1) The program removes all

copies of the same transcription factor motif if it occurs more than once in the upstream region of

the promoter (see Figure 29 & 30). 2) It removes all elements of a unit (see definition

above) if the number of elements of the unit is more than one but not all. 3) It can also

remove all cis-regulatory elements except the module itself, focusing on the individual unit's

contribution to regulation. These removal techniques can be specified in the configuration

file as desired.

The second algorithm deals with the relationships within the module. It removes all

pair-wise combinations of elements within a module. The third investigates the

relationships between the modules, i.e., it removes pairs of entire modules and removes

representatives from pairs of (possibly more) modules. Finally, we can also vary the

distance of a module to the transcription start site. In all of the above algorithms, by removal we

mean the “strong” mutation [randomization], so that the removed regions become

functionally insignificant.
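The pair-wise removal within a module can be enumerated with a standard combinations routine. The Python sketch below is illustrative only: elements are given as hypothetical (name, start, length) tuples, and the removed sites are filled with the placeholder character "N" where BASHER would insert a randomized kmer.

```python
from itertools import combinations

def remove_sites(promoter, sites, filler="N"):
    """Overwrite each (name, start, length) site with a placeholder filler."""
    seq = list(promoter)
    for _name, start, length in sites:
        seq[start:start + length] = filler * length
    return "".join(seq)

def pairwise_removal_variants(promoter, module):
    """One variant per unordered pair of module elements, both removed."""
    variants = []
    for pair in combinations(module, 2):
        names = tuple(name for name, _start, _length in pair)
        variants.append((names, remove_sites(promoter, pair)))
    return variants

# Made-up module of three elements on a made-up 20 bp promoter.
module = [("STE12", 2, 3), ("DIG1", 8, 3), ("TEC1", 14, 3)]
variants = pairwise_removal_variants("A" * 20, module)
```

A module with n elements yields n(n-1)/2 variants, so the number of constructs grows quadratically with module size.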

Figure 29 – Removal of all copies of DIG1 motif in the upstream region of gene DIG1

Figure 30 – Removal of all copies of DIG1 motif in sequence format via permutation

Chapter 5

Discussion

Here, we describe a user-friendly, advanced software package called BASHER for the

design of synthetic promoter regions in a high-throughput manner. The software was

developed in Perl to design synthetic promoters in Saccharomyces cerevisiae, to be made by the

PCR assembly of short oligonucleotides. It should, however, be adaptable to other yeast

genomes (e.g., S. paradoxus, S. bayanus, S. hanseii) which are closely related to S.

cerevisiae in the phylogenetic tree.

It provides a powerful and flexible tool for hypothesis testing of regulatory logic in

the eukaryotic yeast cell. Besides traditional promoter region approaches such as site-directed

mutagenesis, structural analysis and investigation of transcription regulation, it

incorporates the option of the new theory of promoter variant design. It takes

combinatorial and spatial effects of cis-binding sites into account and integrates them

into the modeling process. It provides novel algorithms that consider local and global

binding site geometry, with functional and mechanistic implications for gene

regulation. It also considers the physical interactions between cis-elements with linear,

epistatic, synergistic or switch-like effects as a result of their interaction. Novel mutation

algorithms were developed using the latest PSSM compendiums available in the

literature. Based on this data, BASHER is capable of visualizing

transcription factor binding sites in the promoter of choice in a graphical interface. The

software also contains a module designed to analyze combinatorial interactions between

regulatory complexes. The algorithms design synthetic promoter regions, which might

lead to an understanding of the functional roles of modules, the relationships within modules,

the relationships between modules and the spatial relevance of modules to the transcription

start site.

BASHER is a useful tool for computationally trained investigators who wish to

optimize protein expression and/or redesign their promoter of interest in a step-wise

manner for detailed structure/function studies. It accepts as input both ortholog promoter

sequences and global transcription factor binding site maps of the organism of interest

and allows users to move through the process of design in a series of modules that

address practical issues surrounding oligonucleotide design. Users can follow the main

“design a promoter” path or use the modules individually as needed. The design software

is freely available for download from the author upon request. The software is provided

“as is” with no guarantee or warranty of any kind for non-commercial use.

Chapter 6

Conclusion

The proposed software for high-throughput synthetic promoter region design has been

developed to the level of a practical tool for scientific investigators interested in

genome engineering. As with all software packages, however, it can always be updated with new data

libraries and with new functionality resulting from fundamental genetic

research. Since scientific efforts in cis-element discovery are ongoing, the motif

compendium should be periodically updated with more reliable data sources. The

software uses internally standardized formats, so new data can be smoothly

incorporated using some of the supporting scripts written for data mining from

FASTA and other standardized formats¹³ used in bioinformatics.

6.1 Future Prospects

There are many more combinatorial possibilities involved in pair-wise cis-element

analysis, which were not implemented because of the practically limitless algorithm options in

the subject area. Since I was the only software developer, we had to draw a realistic line at which to

consider BASHER a finished product. Even though we focused on some of the major

combinatorial interactions and geometric constraints of regulatory elements, available

mRNA expression data was never incorporated as a guiding biological result in the

design. With this data in hand we could have selected an interesting set of genes with

similar gene expression patterns. The co-occurrence of these sets could have been

investigated in all promoter regions or all hyper-geometric biclusters. This could have

provided another interesting module in BASHER, which could strictly modify the

upstream region of the promoter based on these experimental results obtained under

various experimental conditions.

Another interesting idea came up during the research project to incorporate other

yeast species from the phylogenetic tree. In this way, other genetically closely related

13 For supporting data mining Perl script descriptions see the Manual’s data miner section.

genomes could have been compared to each other, thus from the homologous promoter

regions evolutionary conservation of transcriptional regulatory logic could have been

inferred [44, 45].

As of now, the software requires some computational knowledge since a graphical user

interface (GUI) was not developed for user interaction at the time of my departure. I feel

that the fundamental algorithms have been completed, which is the research part of the

computational project. The GUI development is just “icing-on-the-cake”, which does not

require any theoretical understanding of the topic of gene regulation or design. Therefore

it can be easily implemented by software engineers if needed, so that the software can be widely used

even by those with limited programming experience.

6.2 Project Barriers

It is also important to point out that, since the “Polypromoter Project” was a collaborative

effort between computational and experimental biologists at the Church laboratory, we

relied on each other’s results. From BASHER’s output, a list

of synthetic promoter regions of interest could be printed onto an industrially obtained

oligonucleotide chip [46]. The promoters are decomposed into an array of 30-mers on the

chip surface with unique flanking regions, which are complementary to the next segment,

for each construct. From the computational design phase, the project then moves into

the experimental phase.
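The decomposition step can be sketched in a few lines. The block below is a simplified illustration under stated assumptions, not the BASHER implementation: the function names, the 10-nucleotide overlap, and the toy sequence are hypothetical, and all tiles are taken from a single strand, whereas a real assembly design would typically alternate strands so that neighbouring flanks are complementary.

```python
def tile_promoter(seq, oligo_len=30, overlap=10):
    """Split seq into oligos of at most oligo_len nucleotides.

    Consecutive oligos share `overlap` nucleotides (assumed value),
    which play the role of the flanking regions used for assembly.
    """
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be between 1 and oligo_len - 1")
    step = oligo_len - overlap
    tiles = []
    start = 0
    while True:
        tiles.append(seq[start:start + oligo_len])
        if start + oligo_len >= len(seq):
            break  # the last tile reached the end of the sequence
        start += step
    return tiles


def assemble(tiles, overlap=10):
    """Rejoin tiles in order, checking that each flank matches the next."""
    seq = tiles[0]
    for tile in tiles[1:]:
        if seq[-overlap:] != tile[:overlap]:
            raise ValueError("flanking regions do not match")
        seq += tile[overlap:]
    return seq


# Toy 85-nt sequence, not a real promoter.
promoter = "ACGT" * 5 + "TTGACA" * 5 + "GATTACA" * 5
tiles = tile_promoter(promoter)
assert assemble(tiles) == promoter
print(len(tiles), "oligos; first:", tiles[0])
```

In a real design the overlaps would also have to be checked for uniqueness across the whole pool, since a flank shared by two constructs could join the wrong segments.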

In the next step, the polymerase chain reaction (PCR) is used to amplify the pool of 30-mers and connect them into their distinct promoter constructs. In theory, the synthetic promoter regions are obtained when the 30-mer components line up by their overlapping flanking ends and are joined. Unfortunately, this part of the project did not work, even though a post-doctoral candidate spent a tremendous number of hours optimizing the protocol. I was involved in troubleshooting the unsuccessful assembly and developed scripts to analyze DNA segments obtained from the failed experimental trials.
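The troubleshooting scripts themselves are not reproduced here. As a hedged illustration of the kind of check involved, the sketch below counts how often each designed oligo is found among recovered sequences and summarizes how evenly the pool is represented. The function names, the exact-substring matching, and the toy data are all assumptions; real reads would need error-tolerant matching.

```python
from statistics import mean, pstdev


def oligo_counts(designed_oligos, reads):
    """Count the reads that contain each designed oligo (exact match)."""
    counts = {oligo: 0 for oligo in designed_oligos}
    for read in reads:
        for oligo in designed_oligos:
            if oligo in read:
                counts[oligo] += 1
    return counts


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean count.

    0.0 means every oligo was recovered equally often; larger values
    indicate a more uneven pool.
    """
    values = list(counts.values())
    average = mean(values)
    return pstdev(values) / average if average else float("inf")


# Toy data: short stand-ins for 30-mers and sequenced fragments.
designed = ["AAAC", "GGGT", "CCTA"]
reads = ["TTAAACTT", "AAACGG", "GGGTAA"]
counts = oligo_counts(designed, reads)
print(counts, round(coefficient_of_variation(counts), 3))
```

Oligos with a count of zero, or far below the mean, would be the first candidates to inspect when an assembly fails.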

As a result, we concluded that certain oligonucleotides amplified from the oligo-chip more successfully than others, resulting in an uneven distribution in the PCR solution. For this reason, the desired promoter regions were not constructed at the uniform concentration we expected. This could have had two causes. One could have been defective industrial DNA spotting of the oligo-chip; the other had to do with an unsuccessful replication of the large-scale oligo-assembly protocol described by [cite]. Consequently, the hypothesis testing of the constructs could not be carried out. In a sense, the research group was ahead of current technology, since large-scale oligo-assembly does not yet exist in an industrial setting, only in certain research laboratories. Once the assembly technology becomes available, BASHER can be revisited and routinely used in promoter design to investigate the underlying principles of gene regulation.

6.3 Project Reflection

In conclusion, I had a marvelous intellectual experience at Harvard Medical School, thanks to the generosity of the Honors Program and to my advisor, who gave me the opportunity to be part of such a well-respected laboratory. I learned a great deal not only about computational genomics and bioengineering, but also about how a cutting-edge research institution works and what it takes to be a good research scientist. I was exposed to bioengineering, an area of career interest in which I had no particular training, and I feel that this opportunity challenged me in every aspect of life. I learned a new programming language and the tiniest details of software development while at Harvard. I also learned a tremendous amount of biology and genetics by reading hundreds of journal articles assigned regularly by my advisor, attending lab meetings twice a week, and simply talking with my experienced lab mates every day. Participating in a graduate-level course in biophysics also helped me on this intellectual journey and confirmed my decision to continue my career in the field of biomedical engineering. This unique experience also gave me confidence in my ability to learn a completely new topic to which I had never been exposed. Thinking about a problem and making my own design decisions every day helped develop my logical thought process, which improved a lot while I was there. Even though it was a challenging experience, it was great fun too: I met many bright individuals with whom I made lifelong friendships, and my eyes were opened to a direction in which I would like to head in the near future.


Bibliography

[1] S.A. Benner, A.M. Sismour, “Synthetic biology,” Nature Reviews Genetics, 6: 533–543 (2005).

[2] C. Gustafsson, S. Govindarajan, J. Minshull, “Putting engineering back into protein engineering: bioinformatic approaches to catalyst design,” Current Opinion in Biotechnology, 14: 366–370 (2003).

[3] S.J. Kodumal, K.G. Patel, R. Reid, H.G. Menzella, M. Welch, D.W. Santi, “Total synthesis of long DNA sequences: synthesis of a contiguous 32-kb polyketide synthase gene cluster,” Proc. Natl. Acad. Sci., 101: 15573–15578 (2004).

[4] C. Gustafsson, S. Govindarajan, J. Minshull, “Codon bias and heterologous protein expression,” Trends in Biotechnology, 22: 346–353 (2004).

[5] E.H. Davidson, “Genomic Regulatory Systems: Development and Evolution,” San Diego: Academic Press, 2001.

[6] F. Jacob, J. Monod, “Genetic regulatory mechanisms in the synthesis of proteins,” Journal of Molecular Biology, 3: 318–356 (1961).

[7] A.M. McGuire, G.M. Church, “Predicting regulons and their cis-regulatory motifs by comparative genomics,” Nucleic Acids Research, 15: 4523–4530 (2000).

[8] F.P. Roth, J.D. Hughes, P.W. Estep, G.M. Church, “Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantization,” Nature Biotechnology, 16: 939–945 (1998).

[9] M.A. Beer, S. Tavazoie, “Predicting gene expression from sequence,” Cell, 117: 185–198 (2004).

[10] D.M. Hoover, J. Lubkowski, “DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis,” Nucleic Acids Research, 30, e43 (2002).

[11] G. Giaever et al., “Functional profiling of the Saccharomyces cerevisiae genome,” Nature, 418: 387–391 (2002).

[12] J.D. Watson, “The Double Helix: A Personal Account of the Discovery of the Structure of DNA,” Touchstone, 2001.

[13] C.T. Harbison et al., “Transcriptional regulatory code of a eukaryotic genome,” Nature, 431: 99–104 (2004).


[14] R.K. Mortimer, D.C. Hawthorne, “Genetic mapping in yeast,” Methods in Cell Biology, 11: 221–233 (1975).

[15] B. Alberts, A. Johnson, J. Lewis, K. Roberts, M. Raff, and P. Walter, Molecular Biology of the Cell. Garland, 2002.

[16] K. Gausing, “Efficiency of protein and messenger RNA synthesis in bacteriophage T4-infected cells of Escherichia coli,” Journal of Molecular Biology, 7: 529–545 (1972).

[17] W.G. Haldenwang, “The sigma factors of Bacillus subtilis,” Microbiology Review, 59: 1–30 (1995).

[18] D.F. Browning, S.J.W. Busby, “The regulation of bacterial transcription initiation,” Nature Reviews Microbiology, 2004.

[19] A.I. Lamond, A.A. Travers, “Stringent control of bacterial transcription,” Cell, 41: 6–8 (1985).

[20] T. Denis et al., “From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli,” BioEssays, 5: 433–440 (1998).

[21] O. Soutourina et al., “Multiple Control of Flagellum Biosynthesis in Escherichia coli: Role of H-NS Protein and the Cyclic AMP-Catabolite Activator Protein Complex in Transcription of the flhDC Master Operon,” Journal of Bacteriology, 24: 7500–7508 (1999).

[22] B.L. Wanner, R. Kodaira, F.C. Neidhart, “Physiological regulation of a decontrolled lac operon,” Journal of Bacteriology, 130: 212–222 (1977).

[23] M. Carey, “The Enhanceosome and Transcriptional Synergy,” Cell, 92: 5–8 (1998).

[24] S. Tavazoie et al., “Systematic determination of genetic network architecture,” Nature Genetics, 22: 281–285 (1999).

[25] E.B. Lewis, “A gene complex controlling segmentation in Drosophila,” Nature, 276: 565–570 (1978).

[26] M. Hoch, E. Seifert, H. Jäckle, “Gene expression mediated by cis-acting sequences of the Krüppel gene in response to the Drosophila morphogens bicoid and hunchback,” EMBO Journal, 10: 2267–2278 (1991).

[27] J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church, “Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae,” Journal of Molecular Biology, 296: 1205–1214 (2001).


[28] K.D. MacIsaac et al., “A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data,” Bioinformatics, 22: 423–429 (2006).

[29] Hagen et al., “Pheromone response elements are necessary and sufficient for basal and pheromone-induced transcription of the FUS1 gene of Saccharomyces cerevisiae,” Molecular and Cellular Biology, 11: 2952–2961 (1991).

[30] A.M. Dudley et al., “A global view of pleiotropy and phenotypically derived gene function in yeast,” Molecular Systems Biology, 1: 2005.0001 (2005).

[31] C.A. Heid et al., “Real time quantitative PCR,” Genome Research, 6: 986–994 (1996).

[32] J. Shendure et al., “Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” Science, 309: 1728–1732 (2005).

[33] F. Gao, B.C. Foat, H.J. Bussemaker, “Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data,” BMC Bioinformatics, 5: 31 (2004).

[34] K.D. MacIsaac, T. Wang, D.B. Gordon, D.K. Gifford, G. Stormo, E. Fraenkel, “An Improved Map of Conserved Regulatory Sites for Saccharomyces cerevisiae,” BMC Bioinformatics, 7: 113 (2006).

[35] E. Wingender et al., “TRANSFAC: an integrated system for gene expression regulation,” Nucleic Acids Research, 28: 316–319 (2000).

[36] M. Kellis et al., “Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery,” Journal of Computational Biology, 11: 319–355 (2004).

[37] M. Lapidot, Y. Pilpel, “Comprehensive quantitative analyses of the effects of promoter sequence elements on mRNA transcription,” Nucleic Acids Research, 31: 3824–3828 (2003).

[38] Y. Barash, G. Elidan, T. Kaplan, N. Friedman, “CIS: Compound importance sampling method for protein-DNA binding site p-value estimation,” Bioinformatics, 2004.

[39] A. Tanay et al., “Integrative analysis of genome-wide experiments in the context of a large high-throughput data compendium,” Molecular Systems Biology, 1: 2005.0002 (2005).

[40] J.M. Cherry et al., “Genetic and physical maps of Saccharomyces cerevisiae,” Nature, 387: 67–73 (1997).


[41] Y. Pilpel, P. Sudarsanam, G.M. Church, “Identifying regulatory networks by combinatorial analysis of promoter elements,” Nature Genetics, 29: 153–159 (2001).

[42] E. Segal et al., “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Nature Genetics, 34: 166–176 (2003).

[43] A.M. McGuire, J.D. Hughes, G.M. Church, “Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes,” Genome Research, 10: 744–757 (2000).

[45] M. Kellis et al., “Sequencing and comparison of yeast species to identify genes and regulatory elements,” Nature, 423: 241–254 (2003).

[46] J. Tian et al., “Accurate multiplex gene synthesis from programmable DNA microchips,” Nature, 432: 1050–1054 (2004).


APPENDIX

[Index of Tables and Figures]


Figure I - Gene in global view [Wikipedia].

As shown in Figure I, each functional unit corresponds to a single protein or RNA (ribonucleic acid) and encompasses coding sequences, non-coding regulatory DNA sequences, and introns. In most genes, exons contain the part of the open reading frame (ORF) that codes for a specific portion of the protein, while introns are regions that are removed (spliced out) after transcription but before the RNA is used. Contrary to a common misconception, exons comprise not only the coding sequences for the final protein but also some non-coding sequences that play a major role in the translation phase.

Figure II – Gene in local view [Wikipedia].

Figure II depicts an unedited mRNA transcript, or pre-mRNA. Both the sequences that code for amino acids (red) and the untranslated stretches (grey) are classified as exons. Regions of unused sequence, called introns (blue), are spliced out, and the exons are joined together to form the final functional mRNA. The untranslated regions are vital for efficient translation of the transcript and for control of the translation rate.


Figure III - The structure of part of a DNA double helix [Wikipedia].


Figure IV - A representation of a condensed eukaryotic chromosome, as seen during cell division [Wikipedia].

Figure V - Schematic depiction of a portion of chromosome 2 from the genome of the fruit fly Drosophila melanogaster [15].


This figure represents approximately 3% of the total Drosophila genome, arranged as six

contiguous segments. The symbols are as follows. Rainbow-colored bar: G-C

base-pair content; black vertical lines: locations of transposable elements; colored boxes:

genes coded on one strand of DNA. The color of each gene box (see color code in the

key) indicates whether a closely related gene is known to occur in other organisms. For

example, MWY means the gene has close relatives in mammals, in the worm

Caenorhabditis elegans, and in the yeast Saccharomyces cerevisiae. MW indicates the

gene has close relatives in mammals and the worm but not in yeast.

Figure VI – Flow chart of hypothesis testing using synthetic promoter constructs and the yeast deletion strain.


Figure VII – Transcription factor binding sites for promoter YCL027W.
