Add that to my notebook: Reporting my Systems Biology Experiments
Saloni Rao
Supervised by Carole Goble
Draft: 30th March, 2016
A report submitted in part fulfillment of the degree of BEng(Hons) Computer System Engineering




Abstract

SEEK for Science is gaining acclaim for allowing non-technical users, such as biologists, to enhance their experimental capabilities with computer programs. For its evaluations SEEK uses Wolfram Mathematica, an advanced symbolic computation program. The drawbacks are that Mathematica is neither portable nor free of charge, which is why only authenticated users can contribute to the shared resources. The idea of this project is to attempt the same computational tasks with iPython Notebook, which is free and portable for everyone in all parts of the world. The expected challenge is dealing with iPython's numeric, rather than symbolic, approach to computation.


Acknowledgement

I would like to thank Professor Carole Goble for her guidance and support throughout my project. Her contribution was not limited to this project but extended to my whole third-year performance. I admire the lessons she gave me on computer science and life, which I will hold onto for the rest of my life.

I would also like to thank Alan R. William for his technical and moral support throughout the course of the project. Finally, I thank the Student Support Office and the Ethics, Department of Computer Science for their clear guidelines on all the sections and on the ethics of development.


Contents

1. Introduction
   1.1 Context
   1.2 Main Objective of The Project
   1.3 Motivations
   1.4 Report Structure
2. Theory Behind The Development
   2.1 What is System biology Experiments
   2.2 What is SEEK?
   2.3 Summary
3. Computer Algebra Systems
   3.1 Mathematica
      3.1.1 Advantages of Mathematica
      3.1.2 Popularity of Mathematica
   3.2 iPython Notebook
      3.2.1 iPython Advantages
   3.3 Summary
4. Development
   4.1 Developing a computation package for SEEK
   4.2 Curve Fitting
   4.3 Summary
5. Evaluations and Testing
   5.1 User Interface Considerations
   5.2 Final Comparisons
   5.3 Challenges with iPython
   5.4 Summary
6. Conclusion and Reflection
   6.1 Future Works
   6.2 Personal Shortcomings
   6.3 What have I gained?
   6.4 Final Remarks
7. References
8. Appendices
   Appendix 1
   Appendix 2


Chapter 1: Introduction

1.1 Context

The purpose of this chapter is to introduce the project and discuss the motivation behind it. It then gives a brief overview of the main objectives and the essential requirements of the project, followed by an outline of all the sections of the report and a description of its final outcomes.

1.2 Main Objective of the Project

The main objective of the project is to integrate SEEK with the Jupyter Electronic Lab Notebook[1] for a more interactive document experience, or to export an investigation of a SEEK experiment[2].

Deliverables:

1. A specification of the export format and content of a SEEK project
2. An exploration of the possibility of integrating Jupyter iPython with SEEK and replacing Mathematica[3]

Figure 1: In this figure, the arrow represents the project's objectives, that is, replacing Mathematica with iPython in launching biology system experiments in SEEK.

1. Jupyter Electronic Lab Notebook: https://ipython.org/
2. SEEK4science: http://www.seek4science.org/
3. Wolfram Mathematica: http://www.wolfram.com/mathematica/


Figure 1 shows the already established capability of SEEK to interface with Wolfram Mathematica. The aim of the project is to explore the same capability in the Jupyter iPython notebook, which has exciting advantages over Mathematica that are discussed later in this report. In SEEK, the biology system files can be launched in Wolfram Mathematica using the Mathematica notebooks and data files provided for a particular experiment. The experiment is then displayed in the form of formatted outputs.

1.3 Motivation

I have always been intrigued by projects like SEEK for Science, which combine various fields of applied science with computer systems. Systems biology was an entirely new topic for me and I wanted to explore it. This project does not focus on the biological side of the SEEK experiments but on the computer programs used to execute the data analysis, display the values and make the results more customizable for scientists. Throughout the project term, I benefited from the exceptional guidance of my supervisor Carole Goble and tried my best to make use of her important feedback.

1.4 Report Structure

Section 2: This section provides an understanding of the important concepts that constitute the backbone of this project. It details the features of systems biology and SEEK experiments, and describes the files these experiments consist of.

Section 3: This section is about the two major technologies explored in the project, the Wolfram Mathematica Notebook and the Jupyter iPython Notebook. The similarities and differences between these two computation environments are discussed.

Section 4: This section details the development part of the project: how the target was visualized and acted upon to obtain the desired results. It also discusses the differences between the two approaches, Wolfram Mathematica and Jupyter iPython.

Section 5: This section examines the obtained results and other considerations such as the user interface and feedback from other developers working on SEEK.

Section 6: This section closes the report with a conclusion and a reflection on the whole process of working on this project. It also sums up the project and personal achievements.


Chapter 2: Theory Behind the Development

2.1 What is a System Biology Experiment?

Systems biology is the development of computational and mathematical models of complex biological systems. In other words, biological systems are represented by mathematical formulae and their data. It is an emerging engineering approach that can be applied to the available biological research. It is an interdisciplinary field of study, based on biology, that concentrates on complex interactions within biological systems, using a multi-dimensional approach to research. (Sysbio.harvard, 2016)

This field is driven by technology that allows us to look much deeper and wider into how microelements act when subjected to experimental perturbations. This has allowed us to build ever more detailed quantitative models of biological functions, which has given us insight into applications ranging from biotechnology to human diseases. (Systembiology.org, 2016)

Figure 2: This figure is an illustration of a system biology experiment, which brings together the most advanced biology, technology and computation techniques.


It is collaborative, because it integrates many scientific disciplines[4] (biology, computer science, engineering, bioinformatics, physics and others) to predict the behavior of systems that change over time and under different conditions, and it has increased the rate at which solutions are developed for the world's most pressing health and environmental issues. (Systembiology.org, 2016)

One of the most fundamental ideas in systems biology is that solving complex biological problems always needs new technological developments to explore new attributes of data. As data types change, so do the specific analytical tools. Computation is driven by the required technology, which in turn is driven by biology. This virtuous cycle exists only in a cross-disciplinary environment, consisting of biologists and computer scientists who come together to find solutions to these challenges. One of the most popular examples of applied systems biology is the Human Genome Project[5], which has caused these scientists to collaborate in various ways to implement computations on complicated problems in the field of genetics.

The aim of systems biology is to discover emergent properties and model them. This approach generally involves the development of mechanistic models and the reconstruction of dynamic systems from the quantitative properties of their elementary building blocks. A cellular network is modeled mathematically using methods coming from physical kinetics and chemical theories. To compute the large number of parameters, variables and constraints in cellular networks, we require numerical and computational techniques.

2.2 What is SEEK?

SEEK is a storage platform designed to facilitate the storage and sharing of heterogeneous data and models across multi-group scientific projects. SEEK was developed as part of SysMO, a pan-European initiative to record and describe dynamic molecular processes in unicellular organisms, from laboratory to mathematical models. (The University Of Manchester, 2016) SEEK is growing with the requirements of different projects.

4. https://www.systemsbiology.org/about/what-is-systems-biology/
5. https://en.wikipedia.org/wiki/Human_Genome_Project


Figure 3: This is an exemplary system biology model for TPI characteristics Kinetics. It also has a Mathematica notebook containing the expressions and constraints for the experiment and a data file containing all the experimental values.

SEEK has gained acclaim as the central hub for the systems biology community due to its capability to store and share different varieties of data, ranging from collections to publications, aiding both laboratory and computational experiments. Figure 3 shows an experiment in SEEK for TPI Kinetic Characteristics. Along with this, a Mathematica notebook and a data file are provided.

Key Features:

1. Data Catalogue: SEEK's most significant feature is its data catalogue, which consists of Experiments, Publications, raw Datasets, Models, Presentations and Standard Operating Procedures (SOPs) (SEEKfeatures, 2015). The catalogue is indexed according to projects and the researchers associated with them. SEEK supports data in various formats to allow flexibility, which encourages scientists to share their work with other people.

2. One of the most popular features of SEEK is viewing models within the browser, which means we don't need to download the model files and data spreadsheets.

3. It is a dynamic service that aims to expand the computation functionality it provides for various data types and formats as requirements arise. Whenever a user comes across something SEEK does not appear to support, such as a data type or format, a request can be placed to extend SEEK for this data.

4. ISA and Interlinking: Any study that forms part of work on a wider scale can be linked to its overall investigation. This is possible in SEEK only because its developers have used the ISA structure as the backbone of SEEK.


• I: Investigation refers to the high-level examination of the experiments undertaken by researchers. Investigations lay out the main objectives of the projects in SEEK.

• S: Study represents all the models and contributions made by various researchers, generally used to provide reasoning for various questions during the experiments.

• A: Assay stands for a particular experiment, allowing the researchers to establish relations between data files, computation program files such as Mathematica notebooks, and all the essential technology required for performing the experiment.

5. It is believed that most aspects of the ISA framework are apt for the description of experiments, which is what makes the framework so flexible and extensive.

6. The SEEK software is open source. It is distributed under the BSD license and the whole source code is available on GitHub. (The University of Manchester, 2008)
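The ISA hierarchy from feature 4 can be pictured as nested records. The following is only a minimal sketch of that idea, not SEEK's actual data model: the class fields, the `all_files` helper and the file names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A particular experiment with its linked files (hypothetical fields)."""
    title: str
    data_files: List[str] = field(default_factory=list)
    notebooks: List[str] = field(default_factory=list)


@dataclass
class Study:
    """Groups the assays that answer one question (hypothetical fields)."""
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """Top level of the hierarchy (hypothetical fields)."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_files(self) -> List[str]:
        """Collect every file linked anywhere below this investigation."""
        files = []
        for study in self.studies:
            for assay in study.assays:
                files.extend(assay.data_files)
                files.extend(assay.notebooks)
        return files


# Invented example content, only to show how the levels nest.
assay = Assay("TPI kinetics", data_files=["tpi_data.csv"], notebooks=["tpi.nb"])
investigation = Investigation("Example investigation",
                              [Study("Example study", [assay])])
print(investigation.all_files())
```

Walking the hierarchy from the top in this way is one possible starting point for an export of an investigation, since every linked file can be reached from the investigation record.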

Data formats in SEEK

Data files in SEEK can be in various formats. They cover:

• Data generated during high-throughput experiments
• Data arising from low-throughput, cumulative experiments in the form of:
  1. Raw data, i.e. parts of non-replicated original data which is not quantified
  2. Results of an experiment
  3. Data generated from computations; this involves proportionality with raw data
  4. Data files containing images
• Data generated by biological modeling
• Models generated from various approaches in experimentation
• Data such as parameters and constraints for the experiment
• Validation data
• Metadata which is required to define the functionality of computation programs
• Data files containing the processes used for designing the experiments and generating all the required data files (Seek4Science.org)


2.3 Summary

SEEK for Science is a catalogue of systems biology experiments where researchers share their observations with other researchers and carry out experiments on a much wider scale than before. The chapter also detailed what types of files constitute these experiments. Generally, the user is provided with a Mathematica notebook and a data spreadsheet. The Mathematica notebook contains the mathematical expressions for the experiment and the data files contain the values for the variables. The next chapter focuses on these notebooks.


Chapter 3: Computer Algebra Systems

Figure 4: This figure provides a general illustration of a computer algebra system.

A computer algebra system (CAS) is a software program that performs computation in a way similar to traditional manual methods, evaluating mathematical expressions in the conventional manner. Computer algebra systems are divided into two classes:

• Specialized CAS
• General-purpose CAS

Specialized CAS are used for computations on highly specific mathematical problems, for instance number theory or elementary mathematics. General-purpose computer algebra systems can be used in any field that requires general manipulation of mathematical expressions. They are not tied to one area and can be used for many different projects. Biological system models comprise experimental data and multiple mathematical expressions. These expressions support the theoretical descriptions of the life sciences and help in proving biological laws. They are polynomials in multiple variables, combined with trigonometry, calculus and other parts of functional mathematics, and they require optimization, simplification and so on. The supported numeric domain could be real or imaginary, rational or irrational, integral or algebraic. A CAS is organized into packages, which can be used to manipulate objects in algebra and symbolic mathematics. They also provide a language in which to implement these manipulations and an environment in which to use that language.


These packages provide the CAS with the required user interface and graphics capability, along with large libraries of efficient algorithms, data structures and a fast kernel.

3.1 What is Mathematica?

Mathematica is a computer algebra program used in many scientific, engineering, mathematical and computing fields. It is a symbolic mathematical computation program developed by Wolfram Research of Champaign, Illinois. It is programmed in the Wolfram Language.

Figure 5: This is a Mathematica Notebook for TPI characteristics Kinetics, provided by my supervisor.

Mathematica can be explained in terms of its two parts, the kernel and the front end. The kernel interprets the expressions and symbols of the notebook, which is written in Wolfram Language code, and generates results.

3.1.1 Advantages and Disadvantages of Mathematica

Mathematica has numerous advantages over other mathematical computation programs because of its advanced set of packages, which keep its computation techniques close to traditional mathematics (Wolfram/Mathematica, 2010). Here are the most important advantages:


• Fast programming: Small simulations are quicker to code in a Mathematica notebook, which reduces time and effort. The code is very elegant.

• User-friendly output: The output is well presented thanks to the customization features of the UI packages.

• Easy debugging: Errors can be spotted easily. Syntax errors are harder to fix, but functional errors are easy to handle.

• Displaying data efficiently: Unlike in many other programming languages, we can choose the format for displaying the data. We can view the results in multiple ways without altering the computation part of the notebook.

Alongside the advantages stated above, Mathematica has some significant disadvantages:

• Slower computation: Computation in Mathematica is slower than in other available systems such as Matlab[6].

• Syntax needs to be learned: Mathematica has a rigid syntax and its error messages are confusing. Issues related to the syntax of Mathematica code are hard to understand and solve.

• Tricky user interface: End users without any background in Mathematica sometimes face difficulty with the user interface options and with the process of using them.

• It is a proprietary software program: Because it is a paid CAS, it is protected by elaborate online registration and password schemes. It is expensive for organizations to provide it for their members.

6. http://uk.mathworks.com/products/matlab/?requestedDomain=uk.mathworks.com


3.1.2 Popularity of Mathematica

Why is Mathematica gaining popularity?

Figure 6: Mathematica user share is shown in this figure. It is accepted in almost all scientific fields, mainly engineering, computer science, life sciences, mathematical sciences and physical sciences.

Built on powerful core principles, Mathematica's scope has expanded at an increasing rate over two decades, making it a system of unique breadth and depth (as in figure 6).

• Mathematica's unified architecture allows it to integrate a remarkable range of areas normally served only by disparate specialized software systems, and to make connections between them that define whole new levels of functionality.

• With its broad approach, Mathematica provides a very high level of computation by allowing users to manipulate formulas and equations symbolically, spanning numerical precision, automatically selecting optimal algorithms, and integrating active computation and interfaces to all computations.

• Recently, Mathematica's unified design has allowed the creation of major new generations of algorithms, which have greatly extended the range of computations that can be done and brought new levels of efficiency even to seemingly straightforward problems.

• Mathematica's extensive programming language (the Wolfram Language), its support for modern web and grid designing, its dynamic interface creation features and its collaboration with other available systems make it even better suited to big development environments that need a stable long-term platform for existing and future projects. (Mathematica25)

3.2 iPython Notebook

iPython Notebook acts as a command shell mainly used for interactive computing in many programming languages. It offers introspection, rich media, shell syntax, tab completion and command history.

Figure 7: This figure is an exemplary iPython Notebook. Each notebook has a separate kernel and each cell generates outputs as shown above.

IPython features are as follows:

• Interactive shells
• A browser-based notebook for code, text, mathematical expressions and other media


• It provides support for interactive data visualization and the use of GUI toolkits.
• It is flexible and embeddable.
• It also has tools for parallel computing. (Ipython, wikipedia)

The IPython notebook is a web-based computation environment that makes IPython programming interactive. A notebook is basically a JSON document containing an ordered list of input and output cells. These cells can contain text, code, mathematical expressions, plots and rich media. Each notebook is connected to a notebook kernel, also called the IPython kernel. According to the recent version release, there are 49 IPython-compatible kernels for programming languages including Python, R, Haskell and Julia. (IPython, Wikipedia)

3.2.1 iPython Advantages

The most important advantages iPython users have are:

• Supports many languages: The user is not required to have a background in Python and can write code in the language of their choice.

• Sharing of notebooks: iPython is a portable platform and files can be shared among people in different places and using different languages.

• Interactive widgets: The widgets available in iPython make the output more interactive for the user. These widgets can add features that are not available in other CAS.

• Integration of Big Data: Some mathematical and scientific libraries are available to aid manipulation of big data files and evaluate complex mathematical expressions and produce results.

• iPython is BSD licensed and can be downloaded and used by anybody. It is free of cost for everybody.
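Because a notebook is a JSON document with an ordered list of cells, it can be inspected with nothing more than a JSON parser. The sketch below assumes the version 4 notebook layout (a top-level `cells` list whose entries carry `cell_type` and `source`); the notebook content is hand-written for illustration, and `summarize_cells` is a hypothetical helper, not part of IPython.

```python
import json

# A tiny hand-written notebook in the version 4 layout; the contents are made up.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# TPI kinetics"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result", "execution_count": 1,
                      "metadata": {}, "data": {"text/plain": ["2"]}}]},
    ],
})


def summarize_cells(text):
    """Return (cell_type, source) pairs in the order the cells appear."""
    notebook = json.loads(text)
    return [(cell["cell_type"], "".join(cell["source"]))
            for cell in notebook["cells"]]


for cell_type, source in summarize_cells(notebook_text):
    print(cell_type, "->", source)
```

The same approach works in the other direction: a program that emits this structure produces a file that a notebook server can open, which is what makes automated conversion into notebooks feasible.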

3.3 Summary

Mathematica is an advanced symbolic computation system that gives the user a simple way of producing output, but iPython has some remarkable benefits, the most significant being that it is portable and completely free for everybody, everywhere. Other benefits are better user interaction and support for multiple languages.


Chapter 4: Development

4.1 Chapter Overview

The development process consisted of:

1. Converting a real Mathematica notebook for a systems biology project to an iPython notebook, using Python and Javascript
2. Porting the built-in functions into iPython packages
3. Developing a dedicated package for curve fitting, which can be reused for all similar problems
4. Devising a methodology for such conversions for other people to follow

4.2 Developing A Computation Package for SEEK

The systems biology projects in SEEK consist of various model files and data files. The model files are based on the theoretical formulas of the experiments, which are evaluated using the data provided in the data files. These formulas are polynomial formulas with many variables and constraints. Constraints are generally the physical conditions under which the experiments are executed. Using all the provided conditions and values, the notebooks are evaluated in Mathematica. These Mathematica notebooks are programmed for a specific requirement. Whenever the requirements change, it becomes harder for the user to work with the Mathematica notebooks.

Figure 8: This chart depicts the notebook conversion in parts, which is further explained below.


Every notebook can be divided into three parts:

• Reading the data from data files
• Selecting the desired values
• Producing the results

None of these steps is very user friendly in Mathematica, because the system barely interacts with the user during evaluation. In Mathematica, the data is read from a specific file whose path is embedded in the code. A user without any programming background will face difficulty in changing the code to read different data. Figure 9 shows how data is fetched in a Mathematica notebook.

Figure 9: ErrorBarPlot is the name of the package, and the Import function is used to store the path of the data file in the variable dataTPI.

In iPython, by contrast, we can use widgets to implement a much more user-friendly environment in which users select the data file they want without needing to change the code. By making a few changes to the widget code, we can also turn it into a file upload method for a specific task or make it read files in whatever format we want (figure 10). In the translated iPython notebook, the user can select a data file and evaluate the notebook with that data.

Figure 10: iPython's file upload widget
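The idea of separating the data source from the notebook code can be sketched without any widget machinery. The snippet below is a minimal illustration, not the notebook's actual code: `load_table` is a hypothetical helper that accepts either a path or an open text file object (for example, content handed over by an upload widget), and the sample values are made up.

```python
import csv
import io


def load_table(source):
    """Read delimited text into a list of rows.

    `source` may be a path (str) or any open text file object, such as the
    content handed over by an upload widget.
    """
    if isinstance(source, str):
        with open(source, newline="") as handle:
            return list(csv.reader(handle))
    return list(csv.reader(source))


# Stand-in for a user-selected file; the values are made up.
uploaded = io.StringIO("substrate,rate\n0.5,1.2\n1.0,2.1\nn/a,missing\n")
table = load_table(uploaded)
print(table[0])
```

Because the function never mentions a fixed file name, swapping in a different data file requires no change to the code that follows it.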

The next step in notebook evaluation is selecting the desired values for the calculations. The data files have numerous rows and columns, of which only a few are required to replace the variables in the expressions given in the notebook. In the Mathematica notebook, the row and column numbers are embedded in the code of the notebook and cannot be changed according to the user's wishes.

Figure 11: This code filters the essential data from the unwanted data values, and it is embedded in the code of this Mathematica notebook.

Since iPython is an interactive notebook, we can add interactive features to make this step more user oriented and easier to change. Some of these features are buttons, checkboxes, radio buttons and selection boxes. Widgets are special features that add an element of interaction, helping to build a better and more user-friendly graphical user interface for the notebook. A widget can take the form of a button or a scroll bar; such controls are software components that the user manipulates directly to produce the required results in a cleaner manner. Widgets give us user-computer interaction that appears as part of the interface of a particular notebook and is backed by the kernel of that notebook. The theme of the notebook can be seen as the center of its visual design and is responsible for giving the notebook a coherent feel. The most significant advantage of developing or using a widget toolkit is that it allows code to be reused for different tasks and maintains consistency throughout the whole notebook. Generally, HTML and Javascript are used extensively to develop such widgets. They facilitate storing the output and displaying the results within the notebook. The list of widgets can be seen in figure 12.


Figure 12: This is the list of widgets provided by iPython.html. These are the most commonly used widgets.

Some methods are required to filter the required data from the rest of the rows and columns in the data files. These methods perform such operations according to what is needed to produce the results. For instance, the FindString() method allows the user to select the columns required for the variables used in the given mathematical expression, and returns the numbers of the rows that do not contain string values. This prevents undefined values from being produced and exceptions from being raised during calculations. Other functions of a similar kind remove certain values, and so on.
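The source does not show how FindString() is implemented, so the following is only a hypothetical sketch of the behavior described above: given a table of text cells and a list of column indices, find the rows in which every selected cell parses as a number. The names `numeric_rows` and `select_values` and the sample data are assumptions made for illustration.

```python
def numeric_rows(table, columns):
    """Return indices of rows whose selected columns all parse as numbers."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    return [index for index, row in enumerate(table)
            if all(is_number(row[col]) for col in columns)]


def select_values(table, columns):
    """Extract the chosen columns as floats, skipping non-numeric rows."""
    return [[float(table[i][col]) for col in columns]
            for i in numeric_rows(table, columns)]


# Made-up data: a header row and one row containing text instead of numbers.
table = [["substrate", "rate"], ["0.5", "1.2"], ["n/a", "missing"], ["1.0", "2.1"]]
print(numeric_rows(table, [0, 1]))
print(select_values(table, [0, 1]))
```

Filtering by content rather than by fixed row numbers means the header row and any annotation rows are skipped automatically, whichever file the user selects.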

Figure 13: This part of the Mathematica notebook contains the mathematical expression for the experiment. It calculates the result value and generates a curve based on the provided constraints.

The final step in notebook evaluation is to produce the result. It can be further divided into four steps:

• Replacing the variables in the expression and producing a curve fit
• Minimizing and maximizing according to the constraints
• Computing confidence intervals
• Plotting
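To make the curve fit and confidence interval steps concrete, here is a minimal sketch. The source does not state which model or fitting method the notebook uses, so a straight line fitted by ordinary least squares stands in for it; the interval uses a normal approximation (z = 1.96), which is an assumption and is only rough for small samples. The constrained minimization and plotting steps are not covered, and all names and data values are illustrative.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, slope_standard_error).
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals, used to estimate the noise level.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope_se


def confidence_interval(estimate, standard_error, z=1.96):
    """Approximate interval; z = 1.96 assumes a normal approximation."""
    return estimate - z * standard_error, estimate + z * standard_error


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]   # made-up measurements
intercept, slope, slope_se = fit_line(xs, ys)
low, high = confidence_interval(slope, slope_se)
print(intercept, slope, (low, high))
```

A nonlinear model would need an iterative optimizer in place of the closed-form formulas, but the overall flow (fit, estimate the parameter uncertainty, report an interval) stays the same.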


Every mathematical expression in such notebooks has to go through all these steps to produce and display the final result. To implement these steps in iPython, Python methods were written using some of the packages available in online repositories. PyPI[7], the Python Package Index, is very popular among the Python community. It is a repository of software for the Python programming language (figure 14). There are currently 794464 packages here.[8]

Figure 14: This is PyPI, the most popular iPython packages repository. [8]

Some of the useful packages and libraries for scientific data analysis are:

1. Tk Package

This library offers message boxes and pop-up dialogues for the user interface. Additionally, Tkinter can help us write simple dialogues that interact with the user and ask him/her for input in the form of integers, floating-point values and strings. One of the most useful modules is tkMessageBox, which provides an interface for message dialogues. The simple dialogue module exposes the public symbols Dialog, askinteger, askfloat and askstring. (wiki python, 2015)

7. https://pypi.python.org/pypi/ipython

8. https://pypi.python.org/pypi

2. Numpy Package

Numpy is one of the fundamental Python packages for scientific computing. It provides a powerful N-dimensional array object and many sophisticated functions for computation. The package also contains tools for integrating C/C++ and Fortran code. Numpy is immensely useful for linear algebra and other numerical capabilities. It can also be used as a multi-dimensional container of generic data in which users can define arbitrary data types, which allows it to integrate readily with a wide variety of databases. (Numpy developers, 2016)

3. Scipy Package

Scipy is an open-source Python library used by scientists and others for scientific computing on large data sets. It contains efficiently programmed modules for special functions, FFT (fast Fourier transform), interpolation, signal and image processing and so on. (Scipy developers, 2016)

4. Matplotlib library

Matplotlib is a Python 2-D plotting library that produces high-quality figures in a variety of hardcopy formats. It can be used across many platforms and environments, the most important of which are Python scripts and the IPython shell. It is popular for making hard things easier, mainly producing plots for different types of functions. Plots such as bar charts, histograms, error charts, power spectra and scatterplots can be generated with only a few lines of code. (Matplotlib)

5. Astropy packages

Astropy is a collection of software packages written in Python and designed specifically for astronomy. It is a single, free core package for manipulating astronomical quantities and expressions. It has become popular with the growing use of Python among astronomers. It is very useful for unit and physical quantity conversions and for coordinate and time transformations. (Astropy.org)

6. Inspect module

This module has several useful functions that give information about live objects such as modules, classes, methods, functions, tracebacks, frame objects and code objects. For instance, the user may wish to obtain the path of the notebook currently running, analyze its contents, retrieve the code of a method, format the argument list of a function, or obtain all of this information for display. In summary, its important functions are checking data types, retrieving source code, inspecting functions or classes and obtaining the interpreter stack. (Python.org/inspect)

7. Os module

This module gives the user a portable way to use operating system functions. Its most useful functionalities are reading and writing files, manipulating paths, reading the lines of a particular file from the command line, and creating and deleting temporary files and directories. (Python.org/os)

4.2 Curve Fitting

4.2.1 Significance of Fitting a curve

Curve fitting is the process of constructing a curve, or mathematical function, that best fits a series of data points, possibly subject to constraints. Two techniques can be employed to fit a curve:

• Interpolation - where the curve fits the data points perfectly

• Smoothing - where the curve fits the data points approximately

When the data are observed with random error, regression analysis is becoming popular. It focuses on questions of statistical inference, such as how much uncertainty is present in the fitted curve. The figure below provides a common example of fitting a Gaussian curve in Ipython.

Figure 15: This is an exemplary curve fitting; it gives a smooth curve over a range of uneven inputs.
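The difference between interpolation and smoothing can be sketched with a few lines of standard-library Python. This is only an illustration under stated assumptions: piecewise-linear interpolation stands in for interpolation in general, a centred moving average stands in for smoothing (it is not the Gaussian fit of figure 15), and the data values are invented.

```python
def interpolate(xs, ys, x):
    """Piecewise-linear interpolation: the curve passes through every point."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range of the data")


def smooth(ys, window=3):
    """Centred moving average: the curve follows the points only approximately."""
    half = window // 2
    result = []
    for i in range(len(ys)):
        neighbours = ys[max(0, i - half):i + half + 1]
        result.append(sum(neighbours) / len(neighbours))
    return result


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 2.0, 1.0, 3.0, 2.0]   # invented, deliberately uneven values

# Interpolation reproduces the measured values at the measured positions.
print([interpolate(xs, ys, x) for x in xs])
# Smoothing trades exactness for a less jagged curve.
print(smooth(ys))
```

The interpolated curve returns the original values at every data point, while the smoothed values differ from the data, which is the distinction drawn in the two bullets above.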

Why are curves fitted?

• Fitted curves are used for better data visualization.

• To infer result values for inputs that are not available.

• A fitted curve also summarizes the relationships among the variables in the expression.

On the basis of the degree of the expression, curve fitting can be divided into the following sub-classes (Minitab, 2016):

• Linear curve fitting

• Polynomial curve fitting

• Non-linear curve fitting

Examining the type of the mathematical expression and its properties, such as its degree, the number of unknown variables and the size of the available data, is important for choosing the most suitable technique to fit the curve. Regression analysis is the technique in which one or more independent variables and a dependent variable are examined and a relationship is established between them, with the best-fit parameters estimated using the least-squares method. In mathematics, there are two basic types of regression (figure 16) (Minitab, 2016):

• Linear - where the model is linear in its parameters, so the least-squares normal equations can be solved directly.

• Non-linear - where the model is non-linear in its parameters, so the least-squares fit has to be found iteratively.

Figure 16: This figure illustrates the different types of regressive curves: linear and non-linear.
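For the linear case, the least-squares method has a closed-form solution. The sketch below fits a straight line by solving the two normal equations directly; the function names and the data are invented for illustration and are not taken from the project notebook.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept.

    Solves the two normal equations of the linear model in closed form.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("at least two points are needed")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """The quantity that the least-squares method minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # invented data, roughly following y = 2x + 1

slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))
print(sum_squared_residuals(xs, ys, slope, intercept))
```

A non-linear model has no such closed form in general, which is why the packages in the appendix rely on iterative minimization.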

All this theory is significant for understanding the type of mathematical computation involved in the experiment. The list of packages that were explored, and that can be used for various types of data, is provided in the appendix.

4.2.2 Curve-fit method

The Mathematica notebook provided to me contained a systems biology experiment on TPI Kinetic distribution, built around a single mathematical expression with about six variables. The data was given as a spreadsheet with nine columns and one hundred and twenty-five rows. The output was to be displayed as a curve that fits all these input values, preferably by regression. The curve fitting packages listed in Appendix 1 were tried to produce the desired output, but each of them has its limitations. Finally, a fairly simple method was written to work for cases like this, with multiple variables and data in the form of a multidimensional array. The activity flow of the method is shown in figure 17.

Figure 17: These are the sequential steps for the exclusively developed curve fitting method for the provided Mathematica notebook.
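To give a feel for fitting an expression with several input columns, here is a minimal sketch that uses a brute-force grid search over the parameters. It is not the method of figure 17, whose details are not given in the text. The rate expression `model`, the parameter names `vmax` and `km`, and the data are all hypothetical.

```python
import itertools


def model(row, vmax, km):
    """Hypothetical two-variable rate expression, for illustration only."""
    substrate, inhibitor = row
    return vmax * substrate / (km + substrate + inhibitor)


def grid_fit(model, rows, observed, grids):
    """Try every parameter combination and keep the one with the smallest
    sum of squared residuals. Combinations that cannot be evaluated
    (for example a zero denominator) are skipped instead of aborting."""
    best_params, best_error = None, float("inf")
    for params in itertools.product(*grids):
        try:
            error = sum((y - model(row, *params)) ** 2
                        for row, y in zip(rows, observed))
        except (ZeroDivisionError, ValueError):
            continue
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Invented data generated from vmax = 2.0 and km = 0.5.
rows = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.5), (4.0, 0.5), (0.0, 0.0)]
observed = [model(row, 2.0, 0.5) for row in rows]

vmax_grid = [1.0, 1.5, 2.0, 2.5]
km_grid = [0.0, 0.5, 1.0]   # km = 0.0 divides by zero on the last row
best, error = grid_fit(model, rows, observed, [vmax_grid, km_grid])
print(best, error)
```

A grid search is slow for six variables, but it needs no derivatives and makes the handling of bad parameter values explicit, which is the difficulty reported for the packages that were tried.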

4.3 Summary

The theory in this chapter aims at developing a common ground for all systems biology experiments and forming a larger picture, so that methods that have already been developed can be reused in more than one experiment, either in the form of a package or by simply importing a notebook.

Chapter 5 Evaluation and Testing

5.1 User Interface Consideration

The cells in the iPython notebook are executed in sequential order to give the desired user interaction with the system. The system evaluates every cell, and each result is then used in the evaluation of the following cells in the notebook. This ensures that the user can select the arguments for the various methods that finally compute the results. In this respect, iPython offers more user interaction than the Mathematica notebook. The interface is kept simple and understandable, considering that users have different technical backgrounds. Every notebook has its own kernel, which ensures that two notebooks running simultaneously do not interfere with each other. The user can always add more features depending on the requirements of the systems biology experiment being worked on. Consistency is maintained throughout the code so that it can be reused without changing the methods.

5.2 Final Comparison

The Jupyter notebook allows the user to change the values of the variables, or the variables themselves, without any changes to the code. The final notebook was compared empirically with the results of the Mathematica notebook for the same systems biology experiment. Each method in the Mathematica notebook was redeveloped in iPython so that it generates the same result. The following table lists those methods and the final outcome.

Table 1

Mathematica | Ipython | Comparisons

1. Need: Loads a file where the filename is embedded in the code | FileUploadWidget: Can load any file; browsing in the folder is allowed | In Ipython, a better, more interactive option; data files can be chosen for any experiment

2. Import: importing the path of the notebook | Inspect, OS: modules used to obtain the path of the notebook | Does the exact same thing

3. [[column]]/Data file: embedded in code | retrieveColumn(path, wsname) + added a widget to select the columns | Added user interaction in Ipython for user comfort

4. Position(data, String): which can be used for various data types | FindStringPosition: particularly for strings; outputs the position for users to verify | In Mathematica, the method can be used for various other data types

6. Mathematical Expression, written in a traditional way | Method required for computation chosen | Complexity of the method increases with the number of multiplications and divisions in the expression

7. Model Fit + Minimizing + Confidence Interval | Various methods can be chosen; the most suitable needs to be chosen | Curve fit module which handles the exceptions carefully

8. Plot | matplotlib package | The curve is empirically compared to the desired curve
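Row 2 of Table 1 relies on the inspect and os modules. The sketch below shows the kind of calls involved, written for Python 3. The function `curve`, the helpers `describe` and `data_path`, and the file names are hypothetical; how the project notebook actually obtained its own path is not shown in this report, so the path is passed in as an argument here.

```python
import inspect
import os


def curve(x, a, b=1.0):
    """Hypothetical model function used only as something to inspect."""
    return a * x + b


def describe(func):
    """Collect the name, parameter names and docstring of a function."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "parameters": list(signature.parameters),
        "doc": inspect.getdoc(func),
    }


def data_path(notebook_file, data_name):
    """Build the path of a data file stored next to the notebook."""
    folder = os.path.dirname(os.path.abspath(notebook_file))
    return os.path.join(folder, data_name)


info = describe(curve)
print(info["name"], info["parameters"])
print(inspect.isfunction(curve), inspect.isclass(curve))
print(data_path("experiments/tpi.ipynb", "tpi_data.csv"))
```

Building the data path from the notebook location, rather than embedding a fixed file name, is what lets the same notebook be reused with different data files.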

Each element of the additional interactive computing was tested with users, and some considerations were made:

• Users were asked to provide feedback on the interactive features added to the notebook.

• They were asked to upload different data files to test the same notebook.

• They were asked about the quality of the result display at the end.

5.3 Challenges of working with iPython

• Values can cause many different types of errors.

When a data file is provided, values that are not in the required format are likely to cause exceptions. These interrupt the evaluation process repeatedly, and the user has to change the values in the data files before the mathematical expressions in the model can run smoothly. This creates a very uncomfortable scenario. The solution is to catch all the possible exceptions for all the variables. The numerator and denominator of the expression should also be handled with caution, because they can cause a ZeroDivisionError. Such errors are runtime errors and hence can be found only by evaluating the notebook repeatedly. Other exceptions likely to occur are ValueError, incompatible type errors and so on.

• iPython is a numerical computation system, not a symbolic computation system.

Mathematica, in contrast, is a very advanced symbolic computation system. It allows users to work with expressions containing variables that have no given value and to manipulate them symbolically, which makes a comfortable environment for complex mathematical expressions. In iPython, data can be handled as a multi-dimensional array, but more complex mathematical operations such as curve fitting and minimizing/maximizing cause trouble. Some of these end up raising too many data exceptions.

• Finding the most suitable library/package

For iPython, a very large number of libraries and packages are available in online repositories. This is an advantage, because they are completely reusable and can be used by anyone, anywhere. But it adds to the trouble of finding the right package for the mathematical expression in a specific biology experiment. The most suitable library changes with the number of variables, the size of the data files and the mathematical operations involved in the expression. These considerations are important because they can make the mathematical part of the notebook very easy or very hard.

• Multiple-variable computation is complex in iPython

As stated above, the computation depends highly on the complexity of the mathematical expression the user has to deal with. As the number of variables increases, so do the complexity and degree of the expression, which demands a lot of patience and understanding. Mathematica, in contrast, is designed for such advanced mathematics and allows the user to work with ease.

5.4 Summary

IPython can implement all the equivalent Mathematica operations, provided that all the challenges are met during development. The developed methods can be built into a package, used repeatedly and shared with other developers, by virtue of iPython being portable and free of cost.
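The first challenge in Section 5.3, values that raise exceptions during evaluation, can be illustrated with a short sketch. The helper `evaluate_rows`, the expression `ratio` and the rows are invented for illustration. The idea is to record which rows failed, and why, instead of letting the first bad value interrupt the notebook.

```python
def evaluate_rows(expression, rows):
    """Evaluate `expression` on every row, collecting failures instead of
    stopping at the first bad value."""
    results, failures = [], []
    for index, row in enumerate(rows):
        try:
            values = [float(cell) for cell in row]
            results.append((index, expression(*values)))
        except (ValueError, TypeError, ZeroDivisionError) as error:
            failures.append((index, type(error).__name__))
    return results, failures


def ratio(numerator, denominator):
    """Hypothetical expression with a denominator that may be zero."""
    return numerator / denominator


# Invented rows: a text cell, a zero denominator and a missing value.
rows = [["4", "2"], ["1.5", "abc"], ["3", "0"], ["6", None], ["9", "3"]]

results, failures = evaluate_rows(ratio, rows)
print(results)    # rows that could be evaluated
print(failures)   # rows the user should correct, with the exception name
```

Reporting the failing rows in one pass spares the user from re-running the notebook once per bad value.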

Section 6 Reflections And Conclusion

6.1 Future Works

This project mainly involved exploring the possibilities of integrating Jupyter Ipython Notebook with SEEK and replacing Mathematica. Future work can extend the outcomes of this project by:

• Developing more methods in Ipython that are exactly equivalent to those in Mathematica.

• Developing packages that are compatible with more than one kind of experiment, to reduce the user's effort of looking into many of them.

• Sharing the developed packages online with the Ipython community, so that others can build on them.

• Adding a separate GUI interface to each cell of the Jupyter notebook, depending on the requirements of the project.

6.2 Personal Shortcomings

• The initial planning was a lot more time consuming than expected, because understanding the theme of the project was a challenge.

• Everything worked fine during the process and all the deadlines were met, in both semesters.

• Comparison was also quite challenging, considering there is no easy way other than comparing the huge data values and comparing the curves empirically.

6.3 What have I gained?

• I have been able to come across and learn about SEEK and popular computer algebra systems like Mathematica and iPython, which are used for huge data integration and computations.

• Time management is a skill emphasized continuously during my time at university. A lot more could have been attempted if I had worked more efficiently.

• The whole project was loosely based on making things easier for scientists with almost negligible technical knowledge, hence other SEEK developers' feedback and considerations were extremely important.

• The development process was unique. I couldn't follow the same etiquettes, which are used to work on any software development project.

• I have gained proficiency in Python language during this project.

6.4 Final Remarks

Examining the definition of the project is an important part of initial strategizing and grasping the objective of the whole theme. This includes identifying functional and non-functional requirements and the phase of requirement gathering is a continuous phase throughout the development process. As a computer scientist, it is vital to expand your programming skills by making yourself familiar with new languages and technologies. Furthermore, it is essential to recognize useful feedback from experienced programmers as well as the target users.

References

1. Sysbio.harvard, (2016) Department of System Biology, HMS, [Online] Accessed On 22nd March, 2016, Available from: http://sysbio.med.harvard.edu/

2. Systembiology.org, (2016) Institute of System Biology, [Online] Accessed On 25th March, 2016, Available from: https://www.systemsbiology.org/about/what-is-systems-biology/

3. SEEK.org, (2008) SEEK4Science, About Seek, [Online] Accessed On 31st March, 2016, Available from: http://www.seek4science.org/about

4. SEEK features, (2008) SEEK4Science, Seek FAQ, [Online] Accessed On 17th April, 2016, Available from: http://www.seek4science.org/faq

5. The University Of Manchester, (2008) Data in SEEK, [Online] Accessed On 22nd March, 2016, Available from: http://www.seek4science.org/about_us

6. Wolfram/Mathematica, (2010) Modern Technical Computing: Wolfram Mathematica, [Online] Accessed On 5th April, 2016, Available from: http://www.wolfram.com/mathematica/

7. Mathematica25, The distribution of Mathematica users in 1990, [Online] Accessed On 1st April, 2016, Available from: http://www.mathematica25.com/

8. IPython, Wikipedia, Jupyter Ipython Notebook [Online] Accessed On 1st December, 2015, Available from: https://en.wikipedia.org/wiki/IPython

9. Numpy Developers, (2016) Numpy [Online] Accessed On 2nd February, 2016, Available from: http://www.numpy.org/

10. Scipy developers, (2016) Scipy.org [Online] Accessed On 1st November, 2015, Available from: http://scipy.org/

11. Matplotlib, Matplotlib [Online] Accessed On 1st November, 2015, Available from: http://matplotlib.org/

12. Astropy.org, Astropy [Online] Accessed On 22nd December, 2015, Available from: http://www.astropy.org/

13. Python.org/inspect, Inspect – inspect live Objects [Online] Accessed On 1st April, 2016, Available from: https://docs.python.org/2/library/inspect.html

14. Python.org/Os, Os – Miscellaneous Operating System Interfaces [Online] Accessed On 31st March, 2016, Available from: https://docs.python.org/2/library/os.html

15. Minitab, (2010) Curve Fitting with Linear and Nonlinear Regression [Online] Accessed On 25th April, 2016, Available from: http://blog.minitab.com/blog/adventures-in-statistics/curve-fitting-with-linear-and-nonlinear-regression

16. Minitab, (2010) Types of regression analyses [Online] Accessed On 1st April, 2016, Available from: http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/basics/types-of-regression-analyses/

Appendix 1 Packages studied for Curve Fitting

• scipy.optimize.leastsq

Minimizes the sum of squares of a set of equations.

1. 1-D data array
2. Does not support more than 1 variable to fit

• scipy-data fitting module

This package intentionally does not specify dependency versions.

1. A polynomial fit only for 2-D graph plotting

• Lmfit package

Least-Squares Minimization with Bounds and Constraints. The user writes a function to be minimized as a function of these Parameters, and the scipy.optimize methods are used to find the optimal values for the Parameters. The n-dimensional arrays needed to be flattened into 1-D to implement fitting.

• Kmpfit

1. Most suitable package for multi-variable formulas, but
2. Does not handle ValueError and ZeroDivisionError in ipython
3. Not very suitable for long decimal numbers and complex formulas

• PyModelFit

PyModelFit focuses on the simpler tasks of 1D curve-fitting, including a GUI interface to simplify interactive work.

• Symfit

Focuses on fitting a special type of curve, like Gaussian curves only. Works for Fit (LeastSquares):

1. (Non)LinearLeastSquares
2. Likelihood
3. Minimize/Maximize

• DataPipeline 1.2

For desktop and command line applications (python)

Appendix 2 Mathematica Notebook Result

Figure 18: This figure is a screenshot of the results generated by the Mathematica Notebook for the TPI Kinetics experiment.

IPython Notebook Results

Figure 19: This figure shows the resultant curves for the TPI Kinetics experiment, which were empirically compared to the Mathematica results.