Microbacterium Tuberculosis

8/13/2019 Microbacterium Tuberculosis


Prediction of Co-regulated Genes in Mycobacterium tuberculosis Using Microarray Expression Profiles

Neha Gupta and D. Prasad*

Department of Biotechnology, Sharda University, Greater Noida-201603, (U.P.)

*NCIPM, LBS Centre, I.A.R.I. Pusa Campus New Delhi- 110012

1. Introduction

The rapid advance of genome-scale sequencing has driven the development of methods to exploit this information by characterizing biological processes in new ways. Knowing the coding sequences of virtually every gene in an organism, for instance, invites the development of technology to study the expression of all of them at once, because studying gene expression one gene at a time has already provided a wealth of biological insight.

To this end, a variety of techniques have evolved to monitor, rapidly and efficiently, transcript abundance for all of an organism's genes. A natural basis for organizing gene expression data is to group together genes with similar patterns of expression. The first step is to adopt a mathematical description of similarity. For any series of measurements, a number of sensible measures of similarity in the behavior of two genes can be used, such as the Euclidean distance, the angle, or the dot product of the two n-dimensional vectors representing a series of n measurements.
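The three similarity measures named above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the two example profiles are hypothetical and do not come from the data used in this study.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dot_product(x, y):
    """Sum of the element-wise products of two profiles."""
    return sum(a * b for a, b in zip(x, y))

def angle(x, y):
    """Angle in radians between two profiles treated as vectors."""
    norm_x = math.sqrt(dot_product(x, x))
    norm_y = math.sqrt(dot_product(y, y))
    cosine = dot_product(x, y) / (norm_x * norm_y)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, cosine)))

# Hypothetical profiles of two genes over n = 4 measurements.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print(euclidean_distance(gene_a, gene_b))
print(dot_product(gene_a, gene_b))
print(angle(gene_a, gene_b))
```

In this example the two profiles differ in magnitude but have the same shape, so the angle is close to zero while the Euclidean distance is not; which measure is appropriate depends on whether magnitude or shape of the profile matters.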

There are three basic challenges in bioinformatics today: (i) finding the genes; (ii) locating their coding regions; and (iii) predicting their functions. DNA chip technology enables the study of gene expression on a large scale. Large-scale gene expression experiments are used to determine drug targets, identify co-regulated genes, and study the response to environmental conditions and the effect of a single gene on the entire genome. Co-regulated genes may share similar expression profiles, may be involved in related functions, or may be regulated by common regulatory elements. There are different approaches to analyzing large-scale gene expression data, but the essence is to identify gene clusters. For example, one can start by clustering the expression profiles and then identify the functions of genes with similar expression patterns; conversely, for genes with related functions, one can study their expression patterns.



Gene expression and regulation are complex biological processes. Genes involved in the same metabolic pathway or in related functions tend to have similar expression patterns. It is important to understand what expression patterns are associated with a specific function. Clustering studies on genes sharing regulatory elements may provide clues on issues such as the conditions under which those elements are active, their roles in activation and repression, and their interactions with each other. Since each approach focuses on a different aspect of the genome, these approaches are equally important. We tested the above conditions using data from the hypoxic condition that makes Mycobacterium tuberculosis latent in the human body.

Gene prediction is necessary to identify the genes involved in the disease. The majority of newly identified genes in the human genome and in other genomes show little or no significant sequence similarity to genes with currently known function, so we need alternatives to sequence analysis. Gene expression data are available via expression microarrays; expression data may be readily collected for 10,000 genes with a single array. Expression data provide an alternative to sequence data for identifying genes that may be candidate drug targets.

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns to characterise protein families; to detect or demonstrate homology between new sequences and existing families of sequences; and to help predict the secondary and tertiary structures of new sequences.


2. Microarray

This technology enables the monitoring of expression levels for thousands of genes simultaneously. As the scale of experiments increases, it becomes common to use the same type of microarray in different laboratories or hospitals. It is therefore important to analyze such microarray data together to derive a combined conclusion after accounting for the differences between them.

One of the main objectives of a microarray experiment is to identify differentially expressed genes among the different experimental groups.

The generation of large amounts of microarray data and the need to share these data bring challenges for both data management and annotation, and highlight the need for standards. MIAME (Minimum Information About a Microarray Experiment) specifies the minimum information needed to describe a microarray experiment, and the MicroArray Gene Expression Object Model (MAGE-OM) and the resulting MAGE-ML provide a mechanism to standardize data representation for data exchange; however, a common terminology for data annotation is needed to support these standards.

Today, microarrays are widespread in genomic research and have a diverse range of applications in biology and medicine. A few recent applications include microbe identification, tumor classification, evaluation of the host cell response to pathogens, and analysis of the endocrine system. Following the commercialization of microarray technology, many researchers have abandoned the manufacturing of their own arrays. On the whole, the emphasis for the researcher has shifted away from manufacturing toward data analysis, which involves image acquisition and quantification. Image acquisition pertains to scanning the array, and quantification refers to the conversion of images into numerical data, which are stored in a spreadsheet. This is where biologists start to get interested; however, we will backtrack a little and discuss the microarray platforms used to generate the raw data, known as the image file. The production and hybridization of slides is just one step in a pipeline of many steps necessary to gain meaningful information from microarray experiments. Because of the vast amount of data produced by a microarray experiment, sophisticated software tools are used to normalize and analyze the data. First, the scanned images are analyzed using image analysis software, which evaluates the expression of a gene by quantifying the ratio of the fluorescence intensities of a spot. The quantified intensities provide information about the activity of a specific gene in a studied cell or tissue: high intensity means high activity, and low intensity indicates low or no activity.
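As a small illustration of the intensity ratio mentioned above, the following sketch computes a log (base 2) ratio for one spot. It assumes a two-channel (red/green) array, as suggested by the R/G ratios used in the data retrieval section; the function name and the intensity values are hypothetical.

```python
import math

def log2_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot.

    Returns None when either intensity is not positive, because the
    ratio is then undefined and the spot is treated as missing.
    """
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical background-corrected intensities for three spots.
print(log2_ratio(800.0, 200.0))   # red channel four times brighter
print(log2_ratio(100.0, 400.0))   # green channel four times brighter
print(log2_ratio(0.0, 100.0))     # no usable signal in the red channel
```

Working on the log scale makes induction and repression symmetric around zero: a four-fold increase and a four-fold decrease give values of equal size and opposite sign.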

The next step is to extract the fundamental patterns of gene expression inherent in the data by a mathematical process called clustering, which organizes the genes into biologically relevant clusters with similar expression patterns (co-expressed genes). There are three reasons for interest in co-expressed genes.

First, there is evidence that many functionally related genes are co-expressed. For example, genes coding for elements of a protein complex are likely to have similar expression patterns. Hence, grouping genes with similar expression levels can reveal the function of those that were previously uncharacterized.

Second, co-expressed genes may reveal much about regulatory mechanisms. For example, if a single regulatory system controls two genes, then the genes are expected to be co-expressed. In general, there is likely to be a relationship between co-expression and co-regulation.

Third, gene expression levels differ in various cell types and states. The interest is in how gene expression is changed by various diseases or compound treatments.

Figure 1: Basic steps of a microarray experiment

2.1 Basic Steps of Microarray

1. Print and cross-link DNA clones (probes) onto a glass slide.

2. Reverse transcribe mRNAs from the sample tissues into cDNAs (targets) and label them with different fluorescent dyes.

3. Hybridize the targets to the probes.

4. Compare the images of fluorescence emission to find differentially expressed genes.



2.2 Organism (Mycobacterium tuberculosis)

Mycobacteria are non-motile, pleomorphic, acid-fast rods related to the Actinomyces. They are usually grouped with the Gram-positive bacteria, although their lipid-rich cell wall stains poorly with the Gram method. Most mycobacteria are found in habitats such as water or soil. Mycobacterium tuberculosis is the causative agent of tuberculosis, a disease that, together with human immunodeficiency virus (HIV) infection and malaria, is one of the main causes of mortality due to an infectious agent. According to the WHO, one-third of the world's population is infected asymptomatically with M. tuberculosis, representing a large reservoir of infection. To block further transmission and reactivation in the already-infected population, it is necessary to develop improved intervention strategies, which require a better understanding of the host-pathogen interaction. Each member of the TB complex is pathogenic, but M. tuberculosis is pathogenic for humans while M. bovis is usually pathogenic for animals.

Mycobacterium tuberculosis is the bacterium that causes most cases of tuberculosis.

2.3 Tuberculosis

Tuberculosis is an infectious disease that has plagued humans since Neolithic times. Two organisms cause tuberculosis: Mycobacterium tuberculosis and Mycobacterium bovis. M. tuberculosis continues to kill millions of people worldwide every year. In 1995, 3 million deaths from TB occurred. Up to 8 million new cases of TB develop each year. More than 90% of these cases occur in developing nations that have poor resources and high numbers of people infected with HIV. In the United States, the incidence of TB began to decline around 1900 because of improved living conditions. TB cases have increased since 1985, most likely due to the increase in HIV infection. Tuberculosis continues to be a major health problem worldwide.




2.3.1 Tuberculosis Causes

All cases of TB are passed from person to person via droplets. When someone with TB infection coughs, sneezes, or talks, tiny droplets of saliva or mucus are expelled into the air and can be inhaled by another person. Once infectious particles reach the alveoli, the small sacs in the lungs, an immune cell called the macrophage engulfs the TB bacteria. The bacteria are then transmitted to the lymph system and bloodstream and spread to other organs. They multiply further in organs that have high oxygen pressures, such as the upper lobes of the lungs, the kidneys, the bone marrow, and the meninges (the membrane-like coverings of the brain and spinal cord). When the bacteria cause clinically detectable disease, the person has TB. People who have inhaled the TB bacteria but in whom the disease is controlled are referred to as infected. They have no symptoms and frequently have a positive skin test, yet cannot transmit the disease to others.

Risk factors for TB include HIV infection, low socioeconomic status, alcoholism, diseases that weaken the immune system, and migration from a country with a high number of cases.

Symptoms of tuberculosis include fever, night-time sweating, loss of weight, persistent cough, constant tiredness, and loss of appetite.

2.3.2 Tuberculosis Treatment

Standard therapy for active TB consists of a 6-month regimen:

2 months with Rifater (isoniazid, rifampin, and pyrazinamide)

4 months of isoniazid and rifampin (Rifamate, Rimactane)

Ethambutol (Myambutol) or streptomycin is added until the drug sensitivity is known.

2.4 Multiple sequence alignment

One of the cornerstones of modern bioinformatics is the comparison or alignment of protein sequences. With the aid of multiple sequence alignments, biologists are able to study the sequence patterns conserved through evolution and the ancestral relationships between different organisms. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). The most widely used programs for global multiple sequence alignment are from the Clustal series of programs. The third generation of the series, ClustalW, incorporated a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties and the automatic choice of a suitable residue comparison matrix at each stage in the multiple alignment. In addition, the approximate word search used for the pre-comparison step was replaced by a more sensitive dynamic programming algorithm, and the dendrogram construction by UPGMA was replaced by neighbor joining (NJ).

The steps of ClustalW are:

1. Determine all pairwise alignments between sequences and determine the degree of similarity between each pair.

2. Construct a "rough" similarity tree.

3. Combine the alignments, starting from the most closely related groups and proceeding to the most distantly related groups, while maintaining the "once a gap, always a gap" policy.

These steps can be understood with an example. Given k sequences {s1, s2, …, sk}, an alignment of these k sequences has to be found.

Step 1: Determine all pairwise alignments between sequences and determine the degree of similarity between each pair.

a. Compute the pairwise alignments.

b. Use these pairwise alignments to compute a "distance" between all pairs of sequences. One method of assigning distances is the following: for each pairwise alignment, look at the non-gapped positions and count the number of differences per site.

QKL-MN

-KL-VN

A sample alignment of 2 sequences with one mismatch
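The distance described in step 1b can be sketched as follows. This is a minimal illustration of "differences per non-gapped site", not the exact distance computed by ClustalW; the function name is hypothetical.

```python
def alignment_distance(seq1, seq2, gap="-"):
    """Fraction of non-gapped aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # skip any position where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no non-gapped positions to compare")
    return differences / compared

# The sample alignment shown above: four non-gapped sites, one mismatch (M/V).
distance = alignment_distance("QKL-MN", "-KL-VN")
print(distance)  # 0.25
```

Computing this value for every pair of sequences gives the distance matrix from which the tree in step 2 is built.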

Step 2: Construct a "rough" similarity tree

We now construct a tree that is based on the above distance matrix. The exact details of tree construction are not covered here. The ClustalW software uses the neighbor joining (NJ) method to compute this tree.



Step 3: Combine the alignments, starting from the most closely related groups and proceeding to the most distantly related groups, while maintaining the "once a gap, always a gap" policy. Pairwise alignments are combined first, and gaps introduced at an earlier stage are kept in all later alignments. Each pair of sequences is aligned by the Needleman-Wunsch method with an affine gap penalty; that is, a smaller penalty is charged for continuing a gap than for opening one. As with pairwise alignments, groups are aligned by dynamic programming, but here the score in each cell of the similarity matrix is the average of all pairwise scores between the residues of the two sets of sequences in the two alignments. For example, suppose the following two alignments are present, one of 2 sequences and the other of 4 sequences:

Alignment 1: ATA
             CCA

Alignment 2: TCAFE
             TAT-E
             TATF-
             AGTFD

The first column of the first alignment is scored against the second column of the second alignment using:

score = 1/8 (score(A, C) + score(A, A) + score(A, A) + score(A, G) +
             score(C, C) + score(C, A) + score(C, A) + score(C, G))

Here score(A, C) is the score of aligning A against C; the other scores are assigned similarly.
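The averaging in this example can be sketched in Python. Two assumptions are made here: the four sequences of Alignment 2 are read as TCAFE, TAT-E, TATF- and AGTFD, and `simple_score` is a hypothetical match/mismatch scheme standing in for a real residue comparison matrix.

```python
def column_score(column_1, column_2, score):
    """Average of all pairwise scores between the residues of two alignment columns."""
    total = sum(score(a, b) for a in column_1 for b in column_2)
    return total / (len(column_1) * len(column_2))

def simple_score(a, b):
    # Hypothetical scoring scheme: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

alignment_1 = ["ATA", "CCA"]
alignment_2 = ["TCAFE", "TAT-E", "TATF-", "AGTFD"]

first_column = [seq[0] for seq in alignment_1]    # A, C
second_column = [seq[1] for seq in alignment_2]   # C, A, A, G

value = column_score(first_column, second_column, simple_score)
print(value)  # (0 + -2) / 8 = -0.25
```

The eight terms summed inside `column_score` are the same eight pairs listed in the formula above; only the scoring function is a placeholder.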

Sequence Weighting. By giving each sequence equal weighting we are not taking into account

any evolutionary relationships. Two sequences that are closely related should receive less

weight than two sequences that are less closely related. The closely related sequences contain

duplicate information so we should not give too much weight to this type of data.

2.5 Microarray-related work that has been done on Mycobacterium tuberculosis

Regulation of the Mycobacterium tuberculosis hypoxic response gene encoding α-crystallin

Since early in the 20th century, latency has been linked to hypoxic conditions within the host, but the response of M. tuberculosis to a hypoxic signal remains poorly characterized. The M. tuberculosis α-crystallin (acr) gene is powerfully and rapidly induced at reduced oxygen tensions, providing us with a means to identify regulators of the hypoxic response.

Inhibition of respiration by nitric oxide induces a Mycobacterium tuberculosis

dormancy program.

An estimated two billion persons are latently infected with Mycobacterium tuberculosis. The

host factors that initiate and maintain this latent state and the mechanisms by which M.

tuberculosis survives within latent lesions are compelling but unanswered questions. One such

host factor may be nitric oxide (NO), a product of activated macrophages that exhibits

antimycobacterial properties.

Mycobacterium tuberculosis gene expression during adaptation to stationary phase and

low-oxygen dormancy.

The innate mechanisms used by Mycobacterium tuberculosis to persist during periods of non-

proliferation are central to understanding the physiology of the bacilli during latent disease. We

have used whole genome expression profiling to expose adaptive mechanisms initiated by M.

tuberculosis in two common models of M. tuberculosis non-proliferation.

Rv3133c/dosR is a transcription factor that mediates the hypoxic response of

Mycobacterium tuberculosis.

Among M. tuberculosis genes induced by hypoxia is a putative transcription factor,

Rv3133c/DosR. We performed targeted disruption of this locus followed by transcriptome

analysis of wild-type and mutant bacilli. Nearly all the genes powerfully regulated by hypoxia

require Rv3133c/DosR for their induction.

3. Clustering

Cluster analysis groups similar objects so that the objects in one group are more similar to each other than to the objects of other groups. Here the objects are genes. Clustering can be defined as the process of separating a set of objects into several subsets on the basis of their similarity. The aim is generally to define clusters that minimize intracluster variability while maximizing intercluster distances, i.e. to find clusters whose members are similar to each other but distant from members of other clusters in terms of gene expression, based on the similarity measure used. Two clustering strategies are possible: supervised (based on existing knowledge) or unsupervised.

Figure 1: Supervised and unsupervised data analysis. In the unsupervised case (left) we are given data points in n-dimensional space (n = 2 in the example) and we are trying to find ways to group together points with similar features. For instance, there are three natural clusters in the example, each consisting of data points close to each other in the sense of Euclidean distance.

3.1 Hierarchical Clustering

Hierarchical clustering is an unsupervised procedure that transforms a distance matrix, which is the result of pairwise similarity measurements between the elements of a group, into a hierarchy of nested partitions. The hierarchy can be represented with a tree-like dendrogram in which each cluster is nested into the next cluster. Hierarchical algorithms can be further categorized into two kinds:

(1) Agglomerative procedures: This procedure starts with n clusters (each object forms a cluster containing only itself) and iteratively reduces the number of clusters by merging the two most similar objects or clusters until only one cluster remains (n → 1).

(2) Divisive procedures: This procedure starts with 1 cluster and iteratively splits a cluster so that the heterogeneity is reduced as far as possible (1 → n).

If it is possible to find a reasonable distance definition between clusters, agglomerative procedures are less computationally expensive than divisive procedures: in one agglomerative step, two out of at most n elements have to be chosen for merging, whereas in divisive procedures essentially all subsets have to be analyzed, so that divisive procedures have an algorithmic complexity on the order of O(2^n).




Algorithm

The procedures of agglomerative hierarchical clustering execute the following basic steps:

(1) Calculate the distance between all objects and construct the distance matrix. Each object represents one cluster, containing only itself.

(2) Find the two clusters r and s with the minimum distance to each other.

(3) Merge the clusters r and s and replace r with the new cluster. Delete s and recalculate all distances that have been affected by the merge.

(4) Repeat steps (2) and (3) until the total number of clusters becomes one.
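The four steps above can be sketched in plain Python. This is a simplified illustration rather than the implementation used by Genesis: it assumes Euclidean distance and average linkage, and for brevity it recomputes cluster distances in every round instead of updating a stored distance matrix. The example profiles are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(cluster_r, cluster_s, points):
    """Mean distance over all pairs with one member in each cluster."""
    total = sum(euclidean(points[i], points[j]) for i in cluster_r for j in cluster_s)
    return total / (len(cluster_r) * len(cluster_s))

def agglomerative(points):
    """Return the merges as (cluster_r, cluster_s, distance), in merge order."""
    clusters = [(i,) for i in range(len(points))]       # step 1: one cluster per object
    merges = []
    while len(clusters) > 1:                             # step 4: repeat until one cluster
        best = None
        for r in range(len(clusters)):                   # step 2: find the closest pair
            for s in range(r + 1, len(clusters)):
                d = average_linkage(clusters[r], clusters[s], points)
                if best is None or d < best[0]:
                    best = (d, r, s)
        d, r, s = best
        merges.append((clusters[r], clusters[s], d))
        clusters[r] = clusters[r] + clusters[s]          # step 3: replace r with the merged cluster
        del clusters[s]                                  #         and delete s
    return merges

# Hypothetical two-dimensional expression profiles of four genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = agglomerative(profiles)
for cluster_r, cluster_s, d in merges:
    print(cluster_r, cluster_s, round(d, 3))
```

The sequence of merges is exactly the information a dendrogram displays: each merge is one node of the tree, drawn at the height of its distance.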

Figure 2: Hierarchical clustering Dialog

3.1.1 k-means clustering

K-means is a commonly used clustering method because it is based on a very simple principle and provides good results. It is unsupervised, very similar to self-organizing maps (SOM), and can be seen as a Bayesian (maximum likelihood) approach to clustering.

The basic idea is to maintain two estimates:

(1) an estimate of the center location for each cluster, and

(2) a separate estimate of the partition of the data points, that is, which data point goes into which cluster.

One estimate can be used to refine the other. If we have an estimate of the center locations, then (with reasonable prior assumptions) the maximum likelihood solution is that each data point should belong to the cluster with the nearest center. Hence, we can compute a new partition from a set of center locations.




Algorithm

The essence of the k-means clustering algorithm is now to minimize the cost function of all

clusters by executing the following steps:

(1) Put each vector xi of X in one of the k clusters.

(2) Calculate the mean for each of the k clusters.

(3) Calculate the distance between an object and the mean of a cluster.

(4) Allocate an object to the cluster whose mean is the nearest to the object.

(5) Re-calculate the mean of the clusters affected by the reallocation.

(6) Repeatedly perform the operations (3) to (5) until no more reallocations occur.
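A minimal sketch of steps (1) to (6) follows. Several choices here are assumptions made for illustration: the initial assignment in step (1) is a deterministic round-robin (random initialization is more common), the distance is Euclidean, and a cluster that becomes empty simply keeps its previous mean. The example data are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def k_means(X, k, max_iter=100):
    if k > len(X):
        raise ValueError("k must not exceed the number of vectors")
    # Step 1: put each vector into one of the k clusters (round-robin, an assumption).
    labels = [i % k for i in range(len(X))]
    centers = [None] * k
    for _ in range(max_iter):
        # Steps 2 and 5: (re)calculate the mean of each cluster.
        for c in range(k):
            members = [x for x, label in zip(X, labels) if label == c]
            if members:  # an emptied cluster keeps its previous mean
                centers[c] = mean(members)
        # Steps 3 and 4: allocate each object to the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda c: euclidean(x, centers[c])) for x in X]
        # Step 6: stop when no more reallocations occur.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical two-dimensional expression profiles of four genes.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = k_means(X, k=2)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [5.0, 5.5]]
```

Because the result depends on the initial assignment, k-means is usually run several times with different starting partitions and the solution with the lowest cost is kept.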

Figure 3: K-means dialog

3.2 Tool used

GENESIS is a platform-independent Java suite that integrates tools for analyzing gene expression data. Fluorescence ratios are first imported and can then be normalized in several ways to obtain the best possible representation of the data for further statistical analysis. Cluster analysis of fluorescence ratios from multiple experiments can be used to identify co-expressed genes, retrieve meaningful patterns of gene expression, and point out similarities and/or differences between the analyzed conditions. The imported data can be clustered using all common distance and similarity measures and the following methods: hierarchical clustering, k-means, self-organizing maps, principal component analysis, and support vector machines.




3.2.1 Steps of Genesis

1. Start Genesis from the Start menu.

Figure 4: Home page of Genesis

2. Go to the File menu and open the text document containing your data.

Figure 5: Retrieval of text file

3. Perform hierarchical clustering (HCL) by clicking the HCL option; in the dialog that appears, set the linkage with which you want to perform HCL.

Figure 6: Hierarchical clustering

4. The results of HCL are obtained in the form of a tree, as shown below.

Figure 7: Result of HCL

5. Then perform k-means clustering by clicking K-means; the following dialog appears.

Figure 8: K-means clustering

6. Compare the centroid and expression views of the clusters obtained from HCL and k-means, and look for clusters that show similar views.
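The comparison in step 6 is done visually in Genesis. As a complementary sketch, the gene membership of the two sets of clusters can also be compared programmatically. The code below uses the Jaccard index, which is a choice made here for illustration and not a method described in this work; the cluster names and gene identifiers are placeholders, not results.

```python
def jaccard(genes_a, genes_b):
    """Size of the intersection divided by the size of the union of two gene sets."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_matches(hcl_clusters, kmeans_clusters):
    """For each HCL cluster, find the k-means cluster sharing the most membership."""
    matches = []
    for h_name, h_genes in hcl_clusters.items():
        k_name = max(kmeans_clusters, key=lambda k: jaccard(h_genes, kmeans_clusters[k]))
        shared = sorted(set(h_genes) & set(kmeans_clusters[k_name]))
        matches.append((h_name, k_name, shared))
    return matches

# Placeholder cluster memberships (not taken from the study).
hcl = {"H1": ["geneA", "geneB", "geneC", "geneD"], "H2": ["geneE", "geneF"]}
kmeans = {"K1": ["geneD", "geneE", "geneF"], "K2": ["geneA", "geneB", "geneC"]}

matches = best_matches(hcl, kmeans)
for h_name, k_name, shared in matches:
    print(h_name, k_name, shared)
```

Genes that appear in the shared list of a well-matched pair of clusters are the candidates that both clustering methods agree on.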

3.3 Microarray Data Retrieval, Missing Value and Data Filtration:



Figure 9

1. Microarray data for Mycobacterium tuberculosis can be obtained from a microarray database, the Stanford Microarray Database (SMD).

2. From all the Excel files of the raw data, extract the data that show the expression ratio [Log (base 2) of R/G Normalized Ratio (Mean)] of the genes at various levels of laser power intensity in each Excel sheet.

3. Merge all the data and create a new Excel file.

4. Calculate the missing values: consider how many spots have values in each row.

5. If more than 80% of the values in a row (which represents a gene) are missing, ignore that row by deleting it.

6. Convert this Excel file to a text file so that it can be imported into the Genesis tool for analysis.
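The missing-value rule and the conversion to a text file can be sketched as follows. This is an illustration under stated assumptions: the data are held in a plain dictionary rather than read from Excel, `None` marks a missing spot, and the tab-delimited layout is a generic one that may need adjusting to the exact input format Genesis expects. All names and values are hypothetical.

```python
def filter_rows(rows, max_missing_fraction=0.8):
    """Drop genes whose fraction of missing spots exceeds the threshold."""
    kept = {}
    for gene, values in rows.items():
        missing = sum(1 for v in values if v is None)
        if not values or missing / len(values) > max_missing_fraction:
            continue  # more than 80% missing: ignore this gene
        kept[gene] = values
    return kept

def to_tab_delimited(rows, column_names):
    """Render the table as tab-delimited text, leaving missing cells empty."""
    lines = ["\t".join(["gene"] + list(column_names))]
    for gene, values in rows.items():
        cells = ["" if v is None else f"{v:.2f}" for v in values]
        lines.append("\t".join([gene] + cells))
    return "\n".join(lines)

# Hypothetical log2(R/G) ratios for three genes over five arrays.
rows = {
    "geneA": [0.50, None, -1.20, 0.30, 0.10],
    "geneB": [None, None, None, None, 0.70],   # exactly 80% missing: kept
    "geneC": [None, None, None, None, None],   # 100% missing: dropped
}

kept = filter_rows(rows)
text = to_tab_delimited(kept, ["s1", "s2", "s3", "s4", "s5"])
print(text)
```

Note that a row with exactly 80% missing values is kept, because the rule stated above removes only rows with more than 80% missing.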




3.4 Multiple sequence alignment

After database searching, multiple sequence alignment (MSA) was performed for the genes that were common to the clusters with similar expression. MSA was done to determine the evolutionary relationships between the genes.

Figure 10: Flow chart of MSA

1. Clusters with similar expression were taken and searched for genes common to them.

2. Sequences of these genes were taken from NCBI.

3. These sequences were then given as input to ClustalW for MSA.

4. Results were obtained and analysed.




S.no. | Gene name | Locus tag | Gene type | Location of the gene | General protein information
------|-----------|-----------|-----------|----------------------|----------------------------
1 | lprN | Rv3495c | protein coding | C3914504-3914472; PS00013 Prokaryotic membrane lipoprotein lipid attachment site. | Description: ABC-type transport system involved in resistance to organic solvents, periplasmic component. Category: METABOLISM. Group: Secondary metabolites biosynthesis, transport and catabolism. Lipoprotein which belongs to the 24-membered Mycobacterium tuberculosis Mce protein family.
2 | dnaE2 | Rv3370c | protein coding | Not available | DNA polymerase involved in damage-induced mutagenesis and translesion synthesis.
3 | lipT | Rv2045c | protein coding | c2290650-2290603; PS00122 Carboxylesterases type-B serine active site | Description: Carboxylesterase type B. Group: Lipid transport and metabolism. Probable lipT.
4 | lprJ | Rv1690 | protein coding | 1915599-1915631; PS00013 Prokaryotic membrane lipoprotein lipid attachment site | Contains possible signal sequence and prokaryotic membrane lipoprotein lipid attachment site.
5 | lppM | Rv2171 | protein coding | 2432993-2433025; PS00013 Prokaryotic membrane lipoprotein lipid attachment site | Probable lppM conserved lipoprotein. Prokaryotic membrane lipoprotein lipid attachment site.
6 | sugC | Rv1238 | protein coding | 1381086-1381130; PS00211 ABC transporters family signature. | Description: ABC-type sugar transport systems, ATPase components. Group: Carbohydrate transport and metabolism. Probable sugC, sugar-transport ATP-binding protein ABC transporter.

Table 1. After comparison of the information obtained from NCBI, the genes lprJ and lppM have similar functional significance. lprN and lipT are involved in metabolism. The gene lprN is also involved in pathogenesis.

4. Conclusion



Different clustering methods were applied to find co-expressed genes. Co-expressed genes showed similar function. The clustering results show that most of the genes are common to specific clusters from hierarchical clustering and clusters from k-means clustering, and the expression patterns of the clusters found by both types of clustering are also the same. This comparative analysis supports the co-expression of these genes, which are similar in function. They can be treated as potential drug targets to prevent tuberculosis.

Genes such as dnaE2, lprN, lprJ and lppM were common to the clusters from both methods. Information obtained from NCBI about these genes showed that lprN and lprJ had similar functions. The gene lprN was responsible for pathogenesis (UniProtKB/TrEMBL O53540).

Further studies can be done to find out which transcription factor affects tuberculosis and which protein is mainly affected.

