Www.cse.Buffalo.edu Faculty Azhang Teaching Index

7/31/2019 Www.cse.Buffalo.edu Faculty Azhang Teaching Index


Bioinformatics Data Set

Project 1

Schema Design for Biomedical Data Warehouse

Biomedical data are being generated at an explosive rate, ranging from clinical test results to microarray gene expression profiles. The scale and complexity of these data sets give rise to substantial challenges in data management and analysis. Data warehouse and on-line analytical processing (OLAP) technologies have been developed for business applications. It is highly desirable that these technologies be applied to biomedical data integration and mining. The major difficulty lies in capturing and modeling diverse biological objects and their complex relationships. Various logical data models have been proposed to specify biomedical data in databases, including relational data models, object-oriented data models, and multidimensional data models. However, it is not yet clear which approach is best for modeling and analyzing data in biomedical data warehouses.

Please refer to hw1.pdf for details.

Project 2

Mining Association Rules from Gene Expression Data

Problem

1. Implement the Apriori algorithm to find all frequent itemsets.

2. Generate association rules based on the templates you specify.

Please refer to hw2.pdf for details.
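The two steps of the problem can be sketched as follows. This is a minimal illustration of the standard Apriori join and prune steps plus confidence-based rule generation, not the reference solution from hw2.pdf. The function names `apriori` and `rules`, the use of an absolute support count, and the `min_confidence` parameter are assumptions made for the example; the rule templates mentioned in the assignment are not modeled here.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate itemset,
        # then keep only the candidates that reach min_support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for each qualifying rule."""
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for head in combinations(itemset, r):
                head = frozenset(head)
                # Every subset of a frequent itemset is itself frequent,
                # so its support count is already in the dictionary.
                confidence = support / frequent[head]
                if confidence >= min_confidence:
                    yield head, itemset - head, confidence
```

A template filter (for example, "the rule body must contain a given item") could be applied to the tuples that `rules` yields, without changing the mining step.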

Please download Homework2.rar, which includes the instructions and the data set.

data.txt

Gene expression data ("up" regulated or "down" regulated) for 100 samples and 100 time points. Each row represents one sample. Each column from the 2nd to the 101st represents one time point. The 102nd column shows the disease for each sample.

sample_output.txt

Sample output file of frequent itemset detection with minimum support 40. The 1st column shows the support for the frequent itemset, which is listed from the 2nd column onward. The format of a frequent item is "sample id number" + "up (U) or down (D)".
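As a rough sketch of how the file layout described above might be turned into itemsets, the helper below converts one row of data.txt into a set of items and formats one output line. Several details are assumptions rather than facts from the assignment: the tab separator, the exact cell spellings ("up"/"down"), and the choice of the column position as the numeric part of an item label. Check these against data.txt and sample_output.txt before relying on them. The names `row_to_items` and `format_itemset` are hypothetical.

```python
def row_to_items(line, sep="\t"):
    """Turn one row of data.txt into (sample_id, items, disease)."""
    fields = line.rstrip("\n").split(sep)
    # First field: sample id; last field: disease; the rest: time points.
    sample_id, values, disease = fields[0], fields[1:-1], fields[-1]
    items = set()
    for position, value in enumerate(values, start=1):
        # Keep only the first letter: "up" -> "U", "down" -> "D".
        items.add(f"{position}{value.strip()[0].upper()}")
    return sample_id, items, disease


def format_itemset(support, itemset, sep="\t"):
    """One output line: the support first, then the items of the itemset."""
    def by_position(item):
        # Sort items numerically by the part before the trailing U or D.
        return int(item[:-1])
    return sep.join([str(support)] + sorted(itemset, key=by_position))
```

Each row then yields one transaction, and the list of transactions can be passed to any frequent-itemset routine.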

Project 3

Biomedical Data Warehouse/OLAP System

In this project, you are asked to implement a clinical and genomic data warehouse based on your schema design using the Oracle system. A good data warehouse should satisfy the following requirements: 1) support regular and statistical OLAP operations; 2) be robust to potential changes in the future; and 3) support knowledge discovery.

Please refer to project1.pdf for details.

Please download Project1.rar, which includes the instructions and the data set.


16-08-2012 http://www.cse.buffalo.edu/faculty/azhang/Teaching/index.html



All the data you need for Project 3 are included in Project1.rar. All the files are tab-delimited. You may open them in Excel for a better view. Please note that the file structures may be slightly different from what is listed in the project handout.

For some entities, we removed some attributes that are hard to understand and not essential to this project.

For some entities, we missed some important attributes in the handout and have now added them to the files. The first row of each file describes the file structure; please read it carefully.

If a patient has multiple samples, use the average value of those samples as the patient value when you do the OLAP operations, unless otherwise specified in the project handout.

If a sample was tested by multiple experiments, use the average value of those experiments as the sample value, unless otherwise specified.

If a gene corresponds to multiple microarray probes, use the average value of those probes as the gene value, unless otherwise specified.
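The three averaging rules above can be illustrated with a small in-memory sketch. In the project itself this aggregation would normally be done with queries against the warehouse; the Python below only shows the logic. The record field names (`patient`, `sample`, `experiment`, `gene`, `value`) are hypothetical, and the order in which the rules are applied (probes first, then experiments, then samples) is an assumption, since the text lists the rules without fixing an order.

```python
from collections import defaultdict
from statistics import mean


def average_by(records, key_fields):
    """Group dict records by key_fields and average their 'value' field."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in key_fields)].append(r["value"])
    return [dict(zip(key_fields, k), value=mean(v)) for k, v in groups.items()]


def patient_gene_values(records):
    """Collapse probe-level records to one value per (patient, gene)."""
    # Probes -> gene value, within each experiment.
    per_experiment = average_by(
        records, ["patient", "sample", "experiment", "gene"])
    # Experiments -> sample value.
    per_sample = average_by(per_experiment, ["patient", "sample", "gene"])
    # Samples -> patient value.
    return average_by(per_sample, ["patient", "gene"])
```

Averaging level by level is not the same as taking one flat average over all probe readings when the groups have different sizes, so the order of the steps matters.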

To save looking up the t-statistic table, we make the following assumptions:

if the t-statistic value of a gene is greater than or equal to 1.0, the gene is regarded as an "informative gene";

if the t-statistic value on the correlations is greater than or equal to 5.0, the patient is classified as "ALL".

You are asked to classify new patients based on the informative genes you find. The microarray data for the new patients are recorded in the file "test_samples.txt". The first row lists the names of the patients, while the first column lists the UIDs of the genes. Each of the other cells represents the expression value of the corresponding gene in the corresponding patient. You do not need to load this file into your data warehouse. Moreover, when you classify the new patients, you can read the "test_samples.txt" file directly. For all other data, you have to retrieve them from your data warehouse.
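The two thresholds above can be sketched in code as follows. The exact test procedure is defined in the project handout, which is not reproduced here, so this example rests on stated assumptions: a two-sample t-statistic with pooled variance, Pearson correlation between a new patient and each known patient over the informative genes, and a second t-statistic comparing the correlations with the ALL group against those with the other group. The function names and the "not ALL" label are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (equal-variance assumption)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den


def informative_genes(all_expr, other_expr, threshold=1.0):
    """Both arguments map gene UID -> list of expression values per patient."""
    return [g for g in all_expr
            if t_statistic(all_expr[g], other_expr[g]) >= threshold]


def classify(new_patient, all_patients, other_patients, threshold=5.0):
    """Every vector holds expression values over the same informative genes."""
    r_all = [pearson(new_patient, p) for p in all_patients]
    r_other = [pearson(new_patient, p) for p in other_patients]
    return "ALL" if t_statistic(r_all, r_other) >= threshold else "not ALL"
```

The sketch compares the signed t-statistic with the threshold, as the text is worded; it does not guard against zero variance, which would need handling on real data.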

Project 4

Microarray Data Analysis

In the past few years, microarray technology has become one of the foremost tools in biological research. The emergence of this technology has empowered researchers in functional genomics to monitor the gene expression profiles of thousands of genes (perhaps even an entire genome) at a time. However, mining microarray data also presents great challenges to bioinformatics research. This project will acquaint you with several basic approaches to analyzing microarray data from beginning to end. You will apply the techniques introduced in class to real-world microarray data sets and learn how to discover useful knowledge from them. This project will also help you understand the challenges in microarray data analysis and motivate you to develop novel approaches to addressing those challenges.

Please refer to project2.pdf for details.

Please download Project2.rar, which includes the instructions and the data set.

Clustering

For clustering, you can use 'cho.txt' and/or 'iyer.txt', which contain the expression values for each gene at each time point. The first row gives the number of genes and the number of time points. Each row from the second onward represents one gene. The first column holds the gene_id, and the second column holds the ground_truth cluster. (You can compare it with your results; -1 marks outliers.) Each column from the third onward represents one time point.
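A minimal sketch of reading this layout and comparing a clustering result with the ground truth is shown below. It assumes whitespace-separated fields and integer cluster labels, and it uses the Rand index as the comparison measure; the handout may ask for a different measure. Genes whose ground truth is -1 are skipped in the comparison, which is one possible way of handling outliers. The names `parse_expression` and `rand_index` are hypothetical.

```python
from itertools import combinations


def parse_expression(text):
    """Parse the layout described above into (gene_ids, truth, profiles)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # First row: number of genes and number of time points.
    n_genes, n_points = (int(x) for x in lines[0].split()[:2])
    gene_ids, truth, profiles = [], [], []
    for ln in lines[1:]:
        fields = ln.split()
        gene_ids.append(fields[0])
        truth.append(int(fields[1]))
        profiles.append([float(v) for v in fields[2:]])
    # Sanity-check the body against the counts declared in the first row.
    assert len(profiles) == n_genes
    assert all(len(p) == n_points for p in profiles)
    return gene_ids, truth, profiles


def rand_index(truth, predicted):
    """Share of gene pairs on which both clusterings agree, skipping outliers."""
    kept = [(t, p) for t, p in zip(truth, predicted) if t != -1]
    pairs = list(combinations(kept, 2))
    # A pair agrees when it is together in both clusterings or apart in both.
    agree = sum((t1 == t2) == (p1 == p2) for (t1, p1), (t2, p2) in pairs)
    return agree / len(pairs)
```

The file contents can be passed in with `parse_expression(open("cho.txt").read())`, after which `profiles` is ready for any clustering routine.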

Classification

For class prediction, you can use 'golub_*.txt', which contain normalized expression values for each gene and each sample. 'golub_training.txt' is the training data. Each row represents one gene, and each column represents one sample. (The training data has 38 samples in total.)



'golub_test.txt' is the test data. Each row represents one gene, and each column represents one sample. (The test data has 34 samples in total.) 'golub_truth.txt' is the ground truth for all 72 samples (38 training samples + 34 test samples).
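Because the files store genes as rows and samples as columns, most classifiers will need the matrix transposed so that each sample becomes one feature vector. The sketch below shows that step, a split of the 72 ground-truth labels, and a simple accuracy check. It works on in-memory lists because the exact file layout (headers, separators, label encoding) is not described here. It also assumes that the truth labels list the 38 training samples first and the 34 test samples second, which should be verified against the handout. All function names are hypothetical.

```python
def samples_from_gene_rows(gene_rows):
    """Transpose a gene-by-sample matrix into one expression vector per sample."""
    return [list(column) for column in zip(*gene_rows)]


def split_truth(truth, n_train=38, n_test=34):
    """Split the label list into training and test parts (assumed order)."""
    if len(truth) != n_train + n_test:
        raise ValueError("expected %d labels, got %d"
                         % (n_train + n_test, len(truth)))
    return truth[:n_train], truth[n_train:]


def accuracy(predicted, truth):
    """Share of samples whose predicted class matches the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must cover the same samples")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

With these helpers, a classifier trained on the 38 training vectors can be scored by calling `accuracy` on its predictions for the 34 test vectors and the second half of the truth labels.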

Send comments: [email protected]

