Build and evaluate an interactive dashboard to visualise lung
adenocarcinoma data
A study submitted in partial fulfilment
of the requirements for the degree of
MSc Data Science
at
THE UNIVERSITY OF SHEFFIELD
by
Hailing Lu
September 2016
Abstract
Background
A substantial number of web-based portals have been built to offer various types of visualisations and analyses for exploring genomics datasets via interactive dashboards. These visualisations support not only scientists with cancer-related expertise, but also those without a professional computational or bioinformatics background. However, few studies have evaluated whether these visualisations are useful enough for cancer research scientists and how helpful they are.
Aim
This research aims to create an interactive dashboard for visualising lung adenocarcinoma data and then develop measurable approaches for evaluating its effectiveness in order to identify potential problems of existing visualisations for future dashboard development.
Methods
The dashboard was created by visualising genomics datasets from The Cancer Genome Atlas (TCGA). Code packages provided by TCGA were run in RStudio, and the dashboard was published on the RShiny application platform. The dashboard was then assessed through an evaluation form completed by 17 participants from the departments of Oncology & Metabolism and Biomedical Science and the MSc Data Science programme at the University of Sheffield.
Results
Academic research (or working) experience, domain knowledge and familiarity with visualisation each affect how visualisations and the dashboard are interpreted. Participants with data science expertise were better at locating answers, while those with oncological or biomedical knowledge thought beyond searching for answers and considered how the dashboard's functions could be developed for further research use. Annotations and help notes also proved important for dashboard comprehension.
Conclusions
When designing a dashboard for visualising genomics datasets, design principles should
focus more on meeting the requirements of users with different domain knowledge. This
study could be improved by using larger datasets and evaluation samples, a wider range of
analysis and visualisation methods, and indicators and evaluation approaches that cover
more dimensions.
Acknowledgement
I would like to gratefully acknowledge various people who have offered support and
inspired me when I worked on this dissertation. First, I would like to thank all the lecturers
who made contributions to the courses of MSc Data Science programme, especially my
supervisor, Dr Gianluca Demartini and the programme coordinator, Professor Paul Clough. I
have learned a lot during this year. Second, I would like to thank my dear parents and sister,
friends and classmates who have encouraged me when I felt frustrated. Third, I would like
to thank Phil Chapman and Ketaki B Patil, who provided useful suggestions for my
dashboard development. Fourth, I would like to thank all the students, academic staff and
professors who assessed the dashboard for me. Finally, I would like to thank Marcin Kosiński,
who designed the useful RTCGA packages for visualisations and patiently answered my
questions on his blog.
Contents
Abstract
Acknowledgement
List of Figures
List of Tables
1.0 Introduction and Context
2.0 Research aim and Objectives
2.1 Aim
2.2 Objectives
3.0 Literature review
3.1 Heat maps
3.2 Networks
3.3 Survival Plots
3.4 Genomic Coordinates
3.5 Scatter Plots and Box Plots
3.6 Bar Charts
3.7 Evaluation
3.7.1 Case Studies
3.7.2 Evaluation Methods
3.8 Techniques and Data
4.0 Methodology
4.1 Research plan summary
4.2 Dataset Description
4.2.1 Term Explanation
4.2.2 General Description
4.2.3 Sub-dataset Description
4.3 Dashboard Development
4.3.1 Platform and Packages
4.3.2 Visualisation Creating
4.3.3 Dashboard Publishing and Updating
4.3.4 User Interface Designing
4.4 Design Evaluation and Questionnaire
4.4.1 Background Information Section
4.4.2 Evaluation Section & Dashboard usage impression Section
4.4.3 Select participants and Conduct assessment
4.5 Research Ethics
4.5.1 Questionnaire and Participants
4.5.2 Datasets and Data storage
5.0 Results
5.1 Participant Background Information
5.2 Dashboard description and Evaluation results
5.2.1 Survival Plot Section 1: “Different cancer types”
5.2.2 Survival Plot Section 2: “Comparison in different genes”
5.2.3 Box Plot Section
5.2.4 Heat map Section
5.2.5 The whole impression of this dashboard for participants
5.2.6 Further validation
6.0 Discussion and Dashboard design recommendations
6.1 Participants
6.2 Academic research (working) experience impacts
6.3 Domain knowledge familiarity impacts
6.4 Dashboard Comprehension
6.5 Dashboard design recommendations
7.0 Conclusions and Recommendations
7.1 Conclusions
7.2 Recommendations for future research
7.2.1 For data samples and participants
7.2.2 For the interactive dashboard
7.2.3 For the evaluation methods
8.0 References
9.0 Appendix
9.1 Mutated genes in lung adenocarcinoma
9.2 Template of the email for inviting participants for dashboard evaluation
9.3 Research Ethics
9.3.1 Ethics Application Form
9.3.2 Ethics Approval Letter
9.3.3 Consent Form
9.4 Questionnaire
9.5 Details of participants’ current profession
List of Figures
Figure 4.1 Case distributions of LUAD and LUSC in Lung Cancer Project of TCGA
Figure 4.2 Demographic distributions of cases within LUAD
Figure 4.3 Portal interface of TCGA (GDC) Data Portal
Figure 4.4 Portal interface of cBioPortal
Figure 4.5 Interface of the dashboard built in this study
Figure 4.6 Question examples of visualisation creating experience
Figure 4.7 Question examples for case study 1
Figure 4.8 Question examples for case study 2
Figure 4.9 Question examples for case study 3
Figure 4.10 Question examples for evaluating the graph itself
Figure 4.11 Question examples for evaluating the dashboard utility
Figure 5.1 Distribution of age, education level, profession year experience
Figure 5.2 Frequency of creating data visualisations
Figure 5.3 Preference of using visualisations and text
Figure 5.4 First page of the dashboard
Figure 5.5 Help note agreement
Figure 5.6 “Different cancer types” section: Two cancer type comparison (left) and three cancer type comparison
Figure 5.7 Answer locating distribution of “Three cancer type comparison”
Figure 5.8 Easy level agreements of graphs in the “Different cancer types” section
Figure 5.9 Agreement of too much information for graphs in the “Different cancer types” section
Figure 5.10 Possible elements that could cause erroneous answers of the “Different cancer types” section
Figure 5.11 Agreement of comparison among different cancers
Figure 5.12 Insight gaining agreement of the graphs in the “Different cancer types” section
Figure 5.13 “Comparison in different genes” section: MET gene (left) and TP53 gene
Figure 5.14 Agreement of comparison between mutation type and wild type
Figure 5.15 Answer locating distribution of “TP53” survival plot
Figure 5.16 Easy level agreements of graphs in the “Comparison in different genes” section
Figure 5.17 Agreement of too much information for graphs in the “Comparison in different genes” section
Figure 5.18 Insight gaining agreement of the graphs in the “Comparison in different genes” section
Figure 5.19 Possible elements that could cause erroneous answers of the “Comparison in different genes” section
Figure 5.20 Box Plot Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA
Figure 5.21 Answer locating distribution of LUAD & LUSC & BRCA & ACC box plot
Figure 5.22 Easy level agreements of graphs in the Box Plot section
Figure 5.23 Utility agreements of graphs in the Box Plot section
Figure 5.24 Insight gaining agreement of the graphs in the Box Plot section
Figure 5.25 Possible elements that could cause erroneous answers of the Box Plot section
Figure 5.26 Heat map Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA
Figure 5.27 Utility agreements of graphs in the Heat map section
Figure 5.28 Insight gaining agreement of the graphs in the Heat map section
Figure 5.30 Answer locating distribution of LUAD & LUSC & BRCA & BLCA heat map
Figure 5.31 Easy level agreements of graphs in the Heat map section
Figure 5.32 Agreement of too much information for graphs in the Heat map section
Figure 5.33 Possible elements that could cause erroneous answers of the Heat map section
Figure 5.34 Familiarity of the similar dashboard
Figure 5.35 Willingness of using the dashboard in the future works (or study)
Figure 5.36 Interpretation measures of different visualisation types
Figure 5.37 Interpretation measures of participants with different profession experience
Figure 5.38 Interpretation measures of participants with different levels of visualisation familiarity
Figure 5.39 Interpretation measures of participants with different domain knowledge
Figure 5.40 Willingness of participants with different domain knowledge
Figure 5.41 Willingness of participants with different frequencies of creating visualisations
List of Tables
Table 3.1 Main visualisations in the selected portals
Table 4.1 Term Dictionary
Table 4.2 Summary of datasets building process
Table 4.3 Summary of main packages used for building the dashboard
1.0 Introduction and Context
Changes to the DNA letters in normal cells can produce cancer cells, and the cancer genome
is made up of those changed genetic letters in chromosomes (The Cancer Genome Atlas,
2016). Because genomics data are multidimensional, large-scale cancer genomic projects,
such as the International Cancer Genome Consortium (ICGC), the Cancer Cell Line
Encyclopaedia (CCLE) and The Cancer Genome Atlas (TCGA), have been set up to gather
and study cancer genome data. Improvements in genomics technologies make it possible to
analyse and interpret these data (Chin, Hahn, Getz & Meyerson, 2011).
These computational and statistical advances in analysing cancer genomics data drive the
evolution of visualisation techniques and tools for this type of data.
A number of web-based portals have been built to visualise genomics data for those cancer
genomic projects (Klonowska et al., 2015). These portals offer various types of visualisations
and present them via interactive visual dashboards. By visualising and exploring genomics
data and the associated clinical information, cancer researchers can gain a better
understanding of the mechanisms and character of specific genes or cancer types
(Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013). Moreover, Schroeder, Gonzalez-Perez and
Lopez-Bigas (2013) emphasise that these visualisations help the process of discovering
potential cancer drug targets.
These visualisations support not only cancer scientists from different cancer-related fields,
but also those without a professional computational or bioinformatics background
(Klonowska et al., 2015). However, it is argued that some bioinformatics and programming
knowledge is still necessary to use these visualisations efficiently (Schroeder,
Gonzalez-Perez & Lopez-Bigas, 2013). Although some portals have made great improvements in
providing helpful analyses and visualisations, few studies have evaluated whether these
visualisations are useful enough for cancer research scientists and how helpful they are. As
a result, it cannot be assumed that all the existing visualisations on those portals are
useful or effective for cancer researchers. The knowledge barrier between cancer scientists
and experts with a computational and bioinformatics background is not easy to reduce. Hence,
this research tries to reduce this barrier by creating and then assessing an interactive
dashboard for visualising cancer genome data and its clinical data. Specifically, after
existing visualisations have been compared, potential visualisations will be created before
the assessment.
In order to reduce bias in comparison and evaluation, this research will implement more
than one visualisation based on a single cancer type, lung adenocarcinoma. Before the
potential visualisations are implemented, several oncological research questions based on
lung adenocarcinoma will be designed as case studies. Visualisations will then be created to
support these case studies. For evaluation, a questionnaire will be designed and then
completed by research staff and students from the departments of Oncology &
Metabolism and Biomedical Science and by students from the MSc Data Science programme at the
University of Sheffield. This evaluation is designed to find out which aspects are likely to
influence the use of an interactive dashboard, in order to provide suggestions for
the development of online genomics research portals.
2.0 Research aim and Objectives
2.1 Aim
This research aims to create an interactive dashboard for visualising lung adenocarcinoma
data and then develop measurable approaches for evaluating its effectiveness in order to
identify potential problems of existing visualisations for future dashboard development.
2.2 Objectives
Explore different visualisations for visualising genomics datasets
Build an interactive dashboard by integrating those selected visualisations
Assess the effectiveness of different visualisations based on specific cancer research case studies
Identify probable issues of the interactive dashboard from the evaluation results
Provide suggestions for this type of dashboard development
3.0 Literature review
Lung cancer can be categorised into two main types, small cell lung cancer (SCLC) and
non-small cell lung cancer (NSCLC), which account for around 12% and 87% of lung cancers,
respectively (Cancer Research UK, 2016). In 2015, lung cancer was reported to be
the leading cause of cancer-associated mortality in the USA (Siegel, Miller & Jemal, 2015).
Lung adenocarcinoma is one of the frequently diagnosed subtypes of NSCLC (Travis, 2002).
Hence, lung adenocarcinoma is chosen as the research target for this dashboard building and
evaluation, with the aim of supporting further research.
To better interpret multidimensional oncogenomics data, a number of user-friendly web-based
portals have been built to support scientists from various cancer-related fields
(Klonowska et al., 2015). Most of these portals offer a variety of visualisation tools for
oncogenomics analyses, which include mutations, gene/protein expression, copy number
variation (CNV) and survival analyses (Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013; Gao
et al., 2013). To visualise these genetic events, different types of graphical representation
are used, including heat maps, networks, survival plots, scatter plots, genomic coordinates
and volcano plots (Klonowska et al., 2015; Gao et al., 2013; Schroeder, Gonzalez-Perez &
Lopez-Bigas, 2013). The sections below review each commonly used visualisation,
describing its principles and uses in cancer and oncology research.
The main visualisation types of different cancer research portals are summarised in Table 3.1,
based on their versions from 2016.
Table 3.1 Main visualisations in the selected portals
Visual Types | Portals (Links)
Heat maps | cBioPortal (http://www.cbioportal.org/); Tumorscape (http://www.broadinstitute.org/); UCSC Genome Bioinformatics (https://genome.ucsc.edu/); COSMIC (http://cancer.sanger.ac.uk/cosmic); Oncomine (https://www.oncomine.org/resource/login.html)
Networks | cBioPortal; Oncomine
Survival Plots | cBioPortal; UCSC Genome Bioinformatics; PPISURV (http://www.bioprofiling.de/GEO/PPISURV/ppisurvD.html); Oncomine
Genomic coordinates | cBioPortal; Tumorscape; UCSC
Scatter Plots | cBioPortal; Oasis (http://www.oasis-genomics.org/); Oncomine
Box Plots | cBioPortal; Oasis; Oncomine
Bar Plots | COSMIC; Oasis; IntOGen (https://www.intogen.org/search); ICGC (https://dcc.icgc.org/)
Volcano Plots | Oasis
Tree maps | cBioPortal
3.1 Heat maps
The clustered heat map, described as the most popular visual display of genomics
data (Rajaram & Oono, 2010; Wilkinson & Friendly, 2012), can “compact large amounts of
information into a small space to bring out coherent patterns in the data” (Weinstein, 2008).
When genomics data are represented as a matrix, the columns and rows of a heat map can be
clustered hierarchically without being limited by the original order of the data (Schroeder,
Gonzalez-Perez & Lopez-Bigas, 2013). The patches of colour in a heat map show different
levels of the value that relates the vertical axis (e.g. genes) to the horizontal axis (e.g.
tumour samples) (Weinstein, 2008). This makes it possible to group and
compare genomes in distant loci (Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013).
Most of the portals selected in Table 3.1 use heat maps for similar purposes, for instance
to represent various types of alterations in genes as well as somatic mutations across
tumour types (Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013; Stephens et al., 2009).
Additionally, clustering expression data or copy number
variations can capture cancer subtypes or reveal subtype interactions (Vaske et al., 2010;
Czubak et al., 2015). Furthermore, molecular profile features can also be visualised with
heat maps (Cancer Genome Atlas Research Network, 2014).
In some portals, the circle map, another type of heat map, is used to show relationships
among genes in a circular plot rather than a rectangular one. A summary of the copy number
variations of lung adenocarcinoma can be shown in circular plots (Klonowska et al., 2015). In
addition, each circle map can represent one gene, and circle maps representing
different genes can make up a network layout showing the connections among them (Schroeder,
Gonzalez-Perez & Lopez-Bigas, 2013).
However, a prominent limitation of the heat map still needs to be considered. Nonlinear
relationships among genes are difficult to extract, which may result in
misunderstandings (Weinstein, 2008). Some portals combine heat maps with other graphs to
display the hidden structural relationships among genes, for example by using circle maps as
nodes to create a network diagram or by constructing pathway diagrams to derive functional
relationships among genes (Vaske et al., 2010).
3.2 Networks
Network diagrams present possible connections between targets (nodes) without being
restricted by the order of the data. Unlike heat maps, network diagrams can show functional
relationships among different entities, such as genes and protein
expressions (Cline et al., 2007). In gene networks, neighbouring genes are ranked by
genomic alteration frequency before being represented, which reduces the complexity of
multidimensional genomics data (Gao et al., 2013). In addition, networks can help to
discover potential cancer drug targets and cancer drivers (Schroeder,
Gonzalez-Perez & Lopez-Bigas, 2013).
3.3 Survival Plots
Kaplan-Meier analyses can be visualised as survival plots if clinical data are available
(Klonowska et al., 2015). Survival plots can be used to compare the impacts of different
entities, such as gene mutations, copy number and gene expression. Vaske et al. (2010)
combined survival plots with a heat map to identify which clusters obtained from the
heat map had the most influence on patient survival. Survival comparisons
can be made within a single plot to gain more insight into different oncogene-driven lung
cancers (Arcila et al., 2015). The overall survival outcome for a specific gene in breast cancer
has been compared between mutated cases and wild-type cases in order to learn more about the
impact of that gene on breast cancer (Ciriello, Cerami, Sander & Schultz, 2012).
In addition, survival plots and boxplots can be combined to answer the same
question. Czubak et al. (2015) compared the effects of the copy number and expression of
different genes by using survival analysis, and presented the results of a correlation
analysis of copy number types and expression level via boxplots. Gao et al. (2013) queried gene
mutations in cancer and showed the comparison between patients with and without
such mutations in survival plots.
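The Kaplan-Meier estimate behind such survival plots can be sketched in a few lines. The example below is a minimal illustration in Python rather than the R code used for the dashboard; the function name `kaplan_meier` and the follow-up times are hypothetical and are not taken from TCGA data. At each time with an observed death, the running survival probability is multiplied by the fraction of at-risk patients who survive that time.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed death.

    durations: follow-up time for each patient
    events:    1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up times in months for two groups (illustration only).
mutated = kaplan_meier([5, 8, 12, 12, 20], [1, 1, 0, 1, 0])
wild_type = kaplan_meier([10, 15, 24, 30, 36], [1, 0, 1, 0, 0])
for label, curve in (("mutated", mutated), ("wild type", wild_type)):
    print(label, [(t, round(s, 2)) for t, s in curve])
```

Plotting the two step curves on the same axes gives the kind of mutated versus wild-type comparison described above.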
3.4 Genomic Coordinates
The genomic coordinate, which illustrates alterations attached to their genomic positions, is
another common visualisation for oncogenomics data (Schroeder, Gonzalez-Perez &
Lopez-Bigas, 2013). The genomic coordinate is suitable for showing the loci and
frequency of all alterations in a single graph, making the best use of horizontal space (Gao
et al., 2013; Yin, Cook & Lawrence, 2012). In cBioPortal, the genomic coordinate is used as one of
the summary graphs to show where alterations are located within a specific gene (Klonowska et
al., 2015). Cancer Genome Atlas Research Network (2014) combined chromosome
coordinates and pie charts to identify potential candidate driver genes. Similarly, based on a
novel cancer gene, somatic mutations of non-small cell lung carcinoma (NSCLC) were
visualised in a genomic coordinate to show its gene expression (Yin et al., 2014). In addition,
OASIS, an open-access cancer research portal, combines a genomic coordinate and a table to
display tumour samples with somatic mutations and copy number alteration (CNA) data
(Fernandez-Banet et al., 2016). On Tumorscape, a cancer program resource gateway, the
genomic coordinate is used to visualise chromosomal regions and is then combined with a
heat map to show genomic relationships based on their specific location (Gao et al., 2013).
3.5 Scatter Plots and Box Plots
A comparison of the boxplots on different cancer project portals shows that the
boxplot is commonly used to present the basic distribution of gene expression classified by
different entities. For instance, the expression of a specific cancer gene can
be grouped into normal and tumour samples or into different cancer types (Wu et al., 2015;
Fernandez-Banet et al., 2016; Czubak et al., 2015; Yin et al., 2014). Correlation analysis
between copy number alterations and gene expression for a specific cancer type or gene can
also be visualised with box plots (Czubak et al., 2015; Klonowska et al., 2015; Gao et al.,
2013). Box-and-whisker plots have also been used to show and compare the distribution of
mRNA expression for each subtype (Jordi et al., 2012).
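The distribution that a box plot draws can be summarised by five numbers per group. The sketch below, written in Python for illustration only, computes them with the standard library; the function name `five_number_summary` and the expression values are hypothetical. Quartile conventions differ between tools, so the numbers may not match a particular plotting package exactly.

```python
import statistics

def five_number_summary(values):
    """Minimum, quartiles and maximum: the values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

# Hypothetical expression values of one gene in two cancer types (not TCGA data).
expression = {
    "LUAD": [2.1, 3.4, 3.9, 4.2, 5.0, 5.8, 7.1],
    "LUSC": [1.0, 1.8, 2.2, 2.9, 3.1, 3.6, 4.0],
}
for cancer_type, values in expression.items():
    print(cancer_type, five_number_summary(values))
```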
Scatter plots are useful for displaying discrete genetic events, such as correlations between
genes, gene expressions or copy number alterations (Gao et al., 2013). Both Suh et al. (2014)
and Yin et al. (2014) presented the correlations between mRNA expressions in scatter plots,
which were also used by Klonowska et al. (2015) to show the correlations between copy
number alterations and mRNA expression. Gene expressions and copy number ratios can be
the entities in scatter plots (Fernandez-Banet et al., 2016). Moreover, a combination plot
of scatter plots and boxplots has been created to visualise both the correlations and the
distribution of gene expressions (Jordi et al., 2012).
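The correlation that such a scatter plot conveys is often reported as a Pearson coefficient. The following Python sketch shows the calculation; the function name `pearson` and the paired values are hypothetical, and the cited studies may have used other correlation measures.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical copy number ratios and mRNA expression for the same samples.
copy_number = [0.8, 1.0, 1.1, 1.5, 2.0, 2.4]
expression = [3.1, 3.0, 3.6, 4.4, 5.2, 6.0]
print(round(pearson(copy_number, expression), 3))
```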
3.6 Bar Charts
Given the features of oncogenomics data and clinical information, bar plots are suited to
categorising genetic data (e.g. gene expression or copy number alterations) based upon
specific cancer questions. For example, bar plots can compare gene expression
patterns classified by different categories (e.g. male and female, normal and tumour
samples) (Fernandez-Banet et al., 2016; Suh et al., 2014). Cancer Genome Atlas Research
Network (2014) used a bar plot to classify mutation patterns across different samples in order
to identify new candidate driver genes. Additionally, bar plots are useful for indicating
alterations such as mutation, deletion and amplification of a specific gene across different
cancer types (Fernandez-Banet et al., 2016; Gao et al., 2013; Klonowska et al., 2015;
Stephens et al., 2009). Moreover, the proportion or frequency of mutated genes in a specific
cancer type can also be visualised with bar plots (Cancer Genome Atlas Research Network, 2014;
Devarakonda, Morgensztern & Govindan, 2015).
3.7 Evaluation
3.7.1 Case Studies
To implement and evaluate the visualisations more efficiently and practically, three case
studies were designed on the basis of previous case studies in cancer research
and lung adenocarcinoma research.
As mentioned above, cancers arise because of mutations in genes that lead to an abnormal
increase of cells. However, not all mutations contribute to tumour development,
because some of them are “passenger” mutations rather than “driver” mutations (Ciriello,
Cerami, Sander & Schultz, 2012). Additionally, although over 400 such “cancer genes” have
been identified and recorded, potential cancer genes still remain to be detected (Futreal
et al., 2004). Identifying the drivers more precisely is therefore of vital importance in
oncogenomics research (Gonzalez-Perez & Lopez-Bigas, 2012). However, distinguishing
cancer drivers is both a solution and a challenge in capturing the mutation patterns of
cancers. Because experimental validation of mutations cannot keep pace with
the soaring capacity for mutation identification, computational approaches to distinguishing
cancer drivers are needed (Gonzalez-Perez & Lopez-Bigas, 2012).
A common way to identify significantly mutated genes is to detect recurrently
mutated genes. Mutual exclusivity of mutations is one useful method that applies this
principle to distinguish driver genes (Klonowska et al., 2015). In a mutual exclusivity
analysis, the probability of a gene alteration occurring (amplification or deletion) is
calculated across a substantial number of tumour samples. The features of gene expression
(gene product: RNA or protein) are then also considered to filter candidate genes (Ciriello,
Cerami, Sander & Schultz, 2012).
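The counting step behind this idea can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the statistical procedure of the cited work: the function names, gene labels and samples are all hypothetical. It computes how often a gene is altered across samples, and how many samples carry alterations in both genes of a pair compared with exactly one of them. A pair that is frequently altered but rarely altered together is a candidate for mutual exclusivity.

```python
def alteration_frequency(samples, gene):
    """Fraction of tumour samples in which `gene` is altered."""
    return sum(gene in altered for altered in samples.values()) / len(samples)

def overlap_counts(samples, gene_a, gene_b):
    """Count samples with both genes altered and with exactly one altered."""
    both = only_one = 0
    for altered in samples.values():
        hits = (gene_a in altered) + (gene_b in altered)
        if hits == 2:
            both += 1
        elif hits == 1:
            only_one += 1
    return both, only_one

# Hypothetical samples: sample id -> set of altered genes (not TCGA data).
samples = {
    "S1": {"GENE_A"},
    "S2": {"GENE_B"},
    "S3": {"GENE_A"},
    "S4": {"GENE_B", "GENE_C"},
    "S5": set(),
}
print(alteration_frequency(samples, "GENE_A"))
print(overlap_counts(samples, "GENE_A", "GENE_B"))
```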
Survival plots and box plots can be used to support these two processes. In a survival
plot, survival analysis can represent the survival characteristics of amplified genes,
illustrating the survival differences between genes and thereby helping to identify driver
genes (Ciriello, Cerami, Sander & Schultz, 2012). A box plot can represent the
features of gene expression across different cancers. Driver genes can be overexpressed,
which means that their expression is outside the normal range (Santarius, Shipley, Brewer,
Stratton & Cooper, 2010). Hence, it is important to show the gene expression
features of different cancers in order to detect driver genes across various cancers more
precisely.
Cancer profiling research has been shown to contribute significantly, in numerous ways, to
understanding the biology of cancers (Cowin, Anglesio, Etemadmoghadam & Bowtell, 2010).
Consequently, on the basis of the discussion above, Case studies 1, 2 and 3 were designed
to support the procedure of identifying cancer drivers and to reveal the molecular profile
characteristics of lung adenocarcinoma.
Case study 1: To support the detection of lung adenocarcinoma drivers
The mutational landscape of lung adenocarcinoma (a type of non-small-cell lung cancer)
differs markedly from that of small cell lung cancer (Devarakonda, Morgensztern & Govindan,
2015). Therefore, by comparing its mutational features with those of other cancer types, the
specific mutation patterns of lung adenocarcinoma can be recognised. As survival
characteristics are one of the mutation features, the survival analysis results of
lung adenocarcinoma and other cancer types can be compared, which can
help the procedure of identifying driver genes of lung adenocarcinoma.
In addition to comparing survival characteristics among different cancer types, it is also
common to compare different genes. The TP53 and MET genes have been
identified as significantly mutated genes in lung adenocarcinoma (Devarakonda,
Morgensztern & Govindan, 2015) (details shown in Appendix 9.1). Consequently, the
survival characteristics of these two genes can be shown in survival plots to distinguish
their differences.
Case study 2: To support the identification of overexpressed genes of lung adenocarcinoma
Gene amplification sometimes occurs together with gene overexpression (Prelich,
2012). This is one of the reasons why gene expression should be used to filter
candidate genes when identifying driver genes (Ciriello, Cerami, Sander &
Schultz, 2012). In addition, correctly establishing the gene expression
level is of vital importance to gene therapy strategies (Prelich, 2012). This implies that
identifying overexpressed genes can not only help with the recognition of driver
genes, but also support the design of gene therapy strategies.
For lung adenocarcinoma, identifying epidermal growth factor receptor (EGFR)
mutations can lead to therapeutic opportunities (Landi, Minuti & Salvini, 2013). Additionally,
the MET gene has been identified as one of the important genes in EGFR resistance, which
points to potential tactics for overcoming one of the challenges in curing lung
adenocarcinoma (Landi, Minuti & Salvini, 2013). Furthermore, MET overexpression is frequent
in NSCLC, occurring in 25% to 75% of cases (Olivero et al., 1996). Thus, the MET expression
of lung adenocarcinoma can be visualised to illustrate its MET
expression features.
Case study 3: To visualise molecular profile features of lung adenocarcinoma
Lung cancer has consistently been described as a “molecularly heterogeneous disease”
(Devarakonda, Morgensztern & Govindan, 2015, paragraph 8). Tumours are also
so complex that they might be diagnosed differently from a molecular viewpoint
(Gonzalez-Perez & Lopez-Bigas, 2012). Consequently, considering these points
as well as the significantly different mutational landscape of lung adenocarcinoma
(Devarakonda, Morgensztern & Govindan, 2015), understanding the molecular
profile of lung adenocarcinoma is beneficial to its identification (Cancer Genome Atlas
Research Network, 2014).
The heat map is an important visualisation tool for profiling the cancer genome. Six molecular
subtypes of ovarian cancer were identified in a heat map by correlating tumour samples
(x-axis) with genes (y-axis) and gene expression (patches of colour) (Cowin, Anglesio,
Etemadmoghadam & Bowtell, 2010). Similarly, molecular subtypes
of lung adenocarcinoma can be revealed by heat maps (Cancer Genome Atlas Research
Network, 2014). Moreover, circle maps representing the alteration landscape, similar to
those used for breast cancer visualisations, can also be tried for lung adenocarcinoma
(Vaske et al., 2010). Based on the previous practices mentioned above, a heat map presenting
the molecular profile features of lung adenocarcinoma should be built in this case study.
3.7.2 Evaluation Methods
The wide variety of visualisation tools aims to support cancer
researchers, especially those without computational and bioinformatics expertise.
However, the evaluation of these visualisations does not seem to have kept pace
with their development. Specific means of evaluating genetics
visualisations are limited.
Controlled experiments are an existing methodology for measuring the effectiveness
of visualisations (Saraiya, North & Duca, 2005). Although they share the common restriction
that predefined tasks narrow the view of evaluators, controlled experiments are still
commonly used because of their ability to test informative tasks (Saraiya, North & Duca, 2005).
Thus, this research will use the methodology of controlled experiments by designing a
questionnaire based on the three case studies presented above.
As traditional procedures require, the environment and background of evaluators will be
controlled (Isenberg, Zuk, Collins & Carpendale, 2008), and evaluation time and scores will be
calculated as the measures of effectiveness (Saraiya, North & Duca, 2005). There are many
methods for evaluating an information retrieval platform that presents
visualisations (Morse, Lewis & Olsen, 2000). It is stated that the visual taxonomy of the
platform will influence how effectively the graphs are understood. Consequently, this
evaluation will try to design measurable questions for assessment so that different
variables can be compared.
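As a minimal illustration of how scores and time can serve as measures of effectiveness, the following Python sketch computes accuracy and mean completion time per participant group. It is written for illustration only: the response records, group names and field layout are hypothetical and are not data from this study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical responses: (participant group, answered correctly?, seconds taken).
responses = [
    ("data science", True, 40),
    ("data science", True, 50),
    ("oncology", False, 70),
    ("oncology", True, 90),
]

def summarise(records):
    """Return {group: (accuracy, mean time in seconds)}."""
    grouped = defaultdict(list)
    for group, correct, seconds in records:
        grouped[group].append((correct, seconds))
    return {
        group: (
            mean(1 if correct else 0 for correct, _ in rows),
            mean(seconds for _, seconds in rows),
        )
        for group, rows in grouped.items()
    }

summary = summarise(responses)
for group, (accuracy, seconds) in summary.items():
    print(f"{group}: accuracy {accuracy:.0%}, mean time {seconds:.0f} s")
```

Keeping the group label alongside each response is what allows different variables, such as background or visualisation familiarity, to be compared afterwards.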
3.8 Techniques and Data
This research will mainly use RStudio and RShiny to visualise lung adenocarcinoma data,
while SPSS and Tableau will be used as supporting tools. Using RShiny, it is convenient to
visualise statistical analysis if the code is well designed. As a variety of
genetics-related R packages for visualisation are available online, suitable and similar
references can be found for most of the visualisations introduced above. R packages for
visualising heat maps, networks, survival plots, scatter plots, box plots and bar plots can
be accessed from RPubs (RPubs, 2015). The code for creating circle maps and genomic
coordinates is summarised in the ggbio package (Yin, Cook & Lawrence, 2012). Moreover,
websites such as Github and Gitools provide instructions and examples for visualising genomic
data.
For data, this research will use the cancer genomic data sets of The Cancer Genome Atlas
(TCGA) portal, which covers various types of cancers as well as genes. So far,
there are 39 cancer types comprising 14,531 cases in total, of which 585 are
lung adenocarcinoma cases (The Cancer Genome Atlas, 2016). As this research only focuses
on lung adenocarcinoma, only lung adenocarcinoma data will be accessed for visualisations.
4.0 Methodology
4.1 Research plan summary
In order to make the dashboard more practical for cancer researchers and create an
effective evaluation process, the following steps need to be completed in this research.
Step 1: To explore potential visualisations based on existing studies;
Step 2: To design suitable case studies for researching lung adenocarcinoma;
Step 3: To clean data sets for visualising;
Step 4: To conduct potential visualisations based on the three case studies;
Step 5: To develop and publish the interactive dashboard;
Step 6: To design the questionnaire and then evaluate the interactive dashboard.
As a number of portals for visualising and analysing cancer projects are available online,
the principles and features of existing visualisations have been summarised in the literature
review. To gain a better summary, the principles of each type of visualisation used on those
portals should be understood first. Then, by combining other visualisations presented in
previous literature, all of the potential visualisations can be summarised to meet the
requirements of current cancer research.
Graphical expression aims to convey the characteristics of data more directly, which can
support the solving of practical problems. Therefore, three case studies were designed
according to current lung adenocarcinoma research problems. These three case studies can not
only be used to visualise the chosen dataset more practically, but also offer benchmarks for
later evaluation.
Before building the dashboard, some calculation and integration of the data sets extracted
from TCGA had to be done. During the building and updating process, great help was received
from “Github”, “RPubs”, Phil Chapman (a staff member at Cancer Research UK) and fellow student
Ketaki B Patil.
After controlling for potential influences, a questionnaire testing the utility of the
visualisations and the dashboard was designed for the different case studies. After the
evaluation, the results will be analysed and compared to identify the effectiveness of the
dashboard.
4.2 Dataset Description
4.2.1 Term Explanation
The table below explains the term abbreviations used in the dashboard; some of this
information is also shown in the dashboard as the help note. Note that the information on
LUAD and LUSC is derived from the National Cancer Institute (2016) and the information
about genes is derived from Gene Cards (2016).
Table 4.1 Term Dictionary
Term abbreviation | Term | Note
LUAD | Lung adenocarcinoma | It is one of the non-small cell lung cancers, accounting for about 40% of all lung cancers.
LUSC | Lung squamous cell carcinoma | Another one of the non-small cell lung cancers, accounting for about 30% of all lung cancers.
BRCA | Breast cancer | Information derived from TCGA.
BLCA | Bladder carcinoma | Information derived from TCGA.
ACC | Adenoid cystic carcinoma | Information derived from TCGA.
MET | MET gene | It is a protein coding gene.
TP53 | Tumour protein p53 | It is a protein coding gene.
Mut.disease | Mutant disease | Information derived from RTCGA.
WILDorNOINFO | Wild type disease or no information | Information derived from RTCGA.
ZNF500 expression | Zinc finger protein 500 | It is a protein coding gene.
LUAD.rnaseq | RNA sequence of Lung adenocarcinoma |
4.2.2 General Description
As mentioned, this study used the cancer genomic data sets of TCGA, in which there are
1093 lung cancer cases. As Figure 4.1 shows, there are 585 lung adenocarcinoma cases,
representing 54%, and Figure 4.2 shows demographic information for the lung adenocarcinoma
cases. It can be seen that the numbers of male and female cases are nearly equal and most
donors are still alive. Of the donors, 75% are not Hispanic or Latino, 24% did not report and
only 1% are Hispanic or Latino. In terms of race, white donors represent 75%, black or
African American donors rank third with 10% and only 2% are Asian.
Figure 4.1 Case distributions of LUAD and LUSC in Lung Cancer Project of TCGA
(Lung Cancer Project: LUAD 585 cases, 54%; LUSC 504 cases, 46%.)
Figure 4.2 Demographic distributions of cases within LUAD
(Vital status: alive 64%, dead 36%. Gender: female 54%, male 46%. Ethnicity: not Hispanic or Latino 75%, not reported 24%, Hispanic or Latino 1%. Race: white 75%, not reported 13%, black or African American 10%; the Asian and other categories are shown without percentage labels.)
4.2.3 Sub-dataset Description
As mentioned in the literature review, there are 39 cancer types in the TCGA data sets,
covering 2011 to 2016. In this study, five cancer types across these six years were used:
LUAD, LUSC, BRCA, BLCA and ACC. For these five cancer types, RNA sequence
datasets, clinical datasets and mutation datasets were extracted.
After extracting these three kinds of sub-datasets for the five cancer types, some
calculation and dataset integration were done before developing the dashboard. The steps are
listed below and the final results are presented in Table 4.2. After the datasets were
combined, specific columns (gene type or expression type) were extracted when needed.
Step 1: The Clinical datasets of LUAD and LUSC were integrated;
Step 2: The Clinical datasets of LUAD, LUSC and BRCA were integrated;
Step 3: The Mutation datasets of LUAD and ACC were integrated;
Step 4: The Clinical datasets of LUAD and ACC were integrated;
Step 5: The RNA sequence datasets of LUAD, LUSC, BRCA and ACC were integrated;
Step 6: The RNA sequence datasets of LUAD, LUSC, BRCA and BLCA were integrated.
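The integration steps above amount to stacking the records of several cancer types into one table while keeping a column that records which cancer type each row came from. A minimal Python sketch of that idea follows. It is an illustration only, not the R code used for the dashboard, and the patient identifiers, field names and values are hypothetical.

```python
# Hypothetical clinical records for two cancer types (not real TCGA data).
luad_clinical = [
    {"patient": "P1", "days": 300, "status": 1},
    {"patient": "P2", "days": 900, "status": 0},
]
lusc_clinical = [
    {"patient": "P3", "days": 450, "status": 1},
]

def integrate(cohorts):
    """Stack cohort tables row-wise, tagging each row with its disease code."""
    combined = []
    for disease_code, rows in cohorts.items():
        for row in rows:
            combined.append({**row, "disease_code": disease_code})
    return combined

luad_lusc = integrate({"LUAD": luad_clinical, "LUSC": lusc_clinical})
for row in luad_lusc:
    print(row)
```

With the disease code stored on every row, later plots can group or colour the combined data by cancer type.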
Table 4.2 Each step of datasets building process
Final data sets | Separated data sets | Notes
“luadlusc.survinfo” | LUAD.clinical + LUSC.clinical | Integrated by column “admin.disease_code”
“luadluscbrca.survinfo” | LUAD.clinical + LUSC.clinical + BRCA.clinical | Integrated by column “admin.disease_code”
“luad_acc.clinical_mutations” | LUAD.clinical + ACC.clinical + LUAD.mutations + ACC.mutations | Integrated by column “admin.disease_code”; Filter by “MET” or “TP53” gene; Filter by “Mutated” or “Wild” type
“luad_lusc_brca_acc.rnaseq” | LUAD.rnaseq + LUSC.rnaseq + BRCA.rnaseq + ACC.rnaseq | Integrated by column “admin.disease_code”; Filter by “MET” and “ZNF500”
“luad_lusc_brca_blca.rnaseq” | LUAD.rnaseq + LUSC.rnaseq + BRCA.rnaseq + BLCA.rnaseq | Integrated by column “admin.disease_code”; Filter by “MET” and “ZNF500”
“luad_lusc_brca_acc.1rnaseq” | LUAD.rnaseq + LUSC.rnaseq + BRCA.rnaseq + ACC.rnaseq | Integrated by column “admin.disease_code”; Filter by “MET”; Calculate log1p (MET)
“luad_lusc_brca_acc.2rnaseq” | LUAD.rnaseq + LUSC.rnaseq + BRCA.rnaseq + ACC.rnaseq | Integrated by column “admin.disease_code”; Filter by “MET”; Calculate log1p (MET)
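The log1p calculation listed in Table 4.2 computes log(1 + x), which compresses large expression values while mapping a value of zero to zero. A short Python sketch with hypothetical MET expression values (not taken from TCGA) shows the effect:

```python
import math

# Hypothetical raw MET expression values for a handful of samples.
met_expression = [0.0, 9.0, 99.0, 4999.0]

# log1p(x) = ln(1 + x); it is defined at x = 0, unlike ln(x).
log_met = [math.log1p(value) for value in met_expression]

for raw, logged in zip(met_expression, log_met):
    print(f"{raw:>8.1f} -> {logged:.3f}")
```

The transformation preserves the ordering of the samples, so the cancer type with the highest expression remains the highest after it is applied.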
4.3 Dashboard Development
4.3.1 Platform and Packages
To build the dashboard, R Shiny was used and all code was implemented in
“server.R” and “ui.R”. Seven main packages were used, which are summarised in Table
4.3. During the building process, the tutorials on RPubs and the instructions on Github
were of great help.
Table 4.3 Summary of main packages used for building the dashboard
Packages | Description | Accessed from
RTCGA, RTCGA.clinical, RTCGA.mutations, RTCGA.rnaseq | Nine specific functions were used in the study: boxplotTCGA, checkTCGA, datasetsTCGA, heatmapTCGA, infoTCGA, kmTCGA, mutationsTCGA, readTCGA, survivalTCGA | All of these packages are offered by TCGA and contain various cancer-related analyses. Authors: Marcin Kosinski, Przemyslaw Biecek, Witold Chodor
shiny | “Web application framework for R” | Rstudio
ggvis | “Visualise a data set with a ggvis graphic” |
rsconnect | “Deployment interface for R Markdown documents and Shiny applications” |
4.3.2 Visualisation Creating
Using the data sets and R packages described above, along with design guidelines on
“Github” and tutorials on “RPubs”, and considering the requirements of the three case studies
designed in the literature review, three kinds of visualisations were chosen for the final
interactive dashboard.
For case study 1, a survival plot was chosen to visualise the results of Kaplan-Meier survival
analysis, in order to demonstrate the survival characteristics of lung
adenocarcinoma.
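For readers unfamiliar with the method, the Kaplan-Meier estimate multiplies together, at each observed death time, the fraction of at-risk patients who survive that time. The Python sketch below illustrates the estimator on hypothetical follow-up data; it is not the R code used in the dashboard.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with a death.

    times  -- follow-up time for each patient
    events -- 1 if the patient died at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical cohort: five patients, two of them censored (event = 0).
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

Censored patients never trigger a drop in the curve, but they do count in the at-risk group until their last follow-up, which is why the curve steps down only at death times.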
For case study 2, even though heat maps, scatter plots, networks and circle maps are
able to represent gene expression patterns, as stated and summarised in the literature review,
the box plot was finally used in this dashboard. There are two reasons for this decision.
First, heat maps, networks and circle maps are multi-dimensional visuals. They can represent
gene expression patterns and the correlation with one more variable at the same time, which
means that a lot of information is contained in one graph. This may make the visualisations
time consuming and more difficult for users to understand. Second, because scatter plots and
box plots have fewer dimensions, both can visualise data more directly.
The scatter plot is regularly used to emphasise the correlation between two variables, such as
gene and gene expression (Gao et al., 2013), whereas the box plot highlights the distribution
of the results, for instance the median and the lower and upper quartiles.
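The quantities that a box plot displays can be computed directly. The following Python sketch derives the median, quartiles, whisker ends and outlying points for hypothetical values. It assumes the common Tukey convention of whiskers reaching at most 1.5 times the interquartile range beyond the box; the plotting package used for the dashboard may apply a slightly different quartile rule.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the statistics a box plot draws, using Tukey's 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical log-expression values for one cancer type.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

In this convention the individual points drawn beyond the whiskers are the outliers, and whisker lengths differ between groups because they end at the most extreme value inside the fences.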
Thus, by visualising the MET expression of lung adenocarcinoma alongside the other comparison
cancer types, the cancer with the highest MET expression can be recognised, which could
potentially indicate gene overexpression. In addition, by doing so, the results of case
study 2 can be compared with the results demonstrated in case study, so that they
can be used to validate previous research on gene driver detection.
For case study 3, a heat map was finally used to visualise the molecular profile features of
lung adenocarcinoma. It is noted that
“The cancer genome is characterized by point mutations, aberrant methylation and
gene expression patterns, and altered chromosomal and genomic structure including
changed DNA copy number and gene fusions” (Cowin, Anglesio, Etemadmoghadam
& Bowtell, 2010, chapter 2, paragraph 1)
This reveals that cancer profiling covers a variety of features. In order to maintain
consistency and allow comparison among the different case studies, visualisations of gene
expression patterns were produced.
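One way a gene expression heat map can be laid out is as a grid whose cells hold a summary of one gene's expression for each combination of cancer type and expression band of a second gene. The Python sketch below builds such a grid for the MET and ZNF500 columns named in Table 4.2. The sample values, the single MET cut-off and the use of the median are hypothetical choices made for illustration and are not taken from the dashboard code.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples: (cancer type, MET expression, ZNF500 expression).
samples = [
    ("LUAD", 5200, 410), ("LUAD", 800, 350), ("LUAD", 6100, 430),
    ("LUSC", 700, 300), ("LUSC", 900, 320), ("LUSC", 5400, 380),
]

def met_band(met, cutoff=1000):
    """Assign a sample to a low or high MET band (the cut-off is illustrative)."""
    return "high" if met >= cutoff else "low"

def heat_map_grid(rows):
    """Median ZNF500 expression for each (cancer type, MET band) cell."""
    cells = defaultdict(list)
    for cancer, met, znf500 in rows:
        cells[(cancer, met_band(met))].append(znf500)
    return {cell: median(values) for cell, values in cells.items()}

grid = heat_map_grid(samples)
for (cancer, band), value in sorted(grid.items()):
    print(f"{cancer:5s} MET {band:4s} median ZNF500 = {value}")
```

Each cell value would then be mapped to a colour, so a reader has to decode three variables at once, which is consistent with the heat map being the hardest of the three graphs to read.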
4.3.3 Dashboard Publishing and Updating
During the dashboard building process, problems were solved through email exchanges with Phil
Chapman, a cancer research scientist at Cancer Research UK, and with direct help from
fellow student Ketaki B Patil. In addition, a meeting with Phil Chapman was arranged to
further improve the initial dashboard. This meeting lasted for over an hour and
numerous observations and feedback were received. The resulting amendments are listed as
follows:
1. The technical terms can be explained in a note before the dashboard evaluation;
2. Some “radio button” controls in the user interface could be changed into “select
box” controls, because more potential choices can be added in future or for a
larger-scale study, and hiding the options gives a cleaner user interface.
3. The dashboard could be published using R Markdown, which makes
it accessible and more convenient for participants to evaluate the dashboard via the
Internet.
4. The datasets could be downloaded and then added into the dashboard before
running the code, which would enable the dashboard to run faster.
4.3.4 User Interface Designing
In order to create a more user-friendly interface, the current interfaces of the TCGA Data
Portal and cBioPortal were compared. The main sidebar of the TCGA Data Portal is located
on the left of the screen, while in cBioPortal it is located at the top as a
navigation menu, as the screenshots depict:
Figure 4.3 Portal interface of TCGA (GDC) Data Portal
Figure 4.4 Portal interface of cBioPortal
For the dashboard implemented here, few selections need to be offered in the sidebar and
the graphs to be shown are not as big as those in the TCGA Data Portal. Hence, combining
the advantages of the sidebar and the navigation menu with the suggestions from Phil Chapman
and Ketaki B Patil, the final dashboard is shown below. As this interactive dashboard has
been published via shinyapps.io (a platform for publishing Shiny web applications), it can
be accessed at the following link: http://tcgaplotappluad.shinyapps.io/TCGA_Plot_Validation
Figure 4.5 Interface of the dashboard built in this study
4.4 Design Evaluation and Questionnaire
As mentioned in the literature review and in the dashboard development section, three case
studies were built and all of the visualisations were created to achieve the aims of each
case study. Thus, some of the questions in the evaluation questionnaire are also related to
the case studies. The questionnaire contains three sections: background information,
dashboard evaluation and impression of using the dashboard. The parts below elaborate on
the questions in each section (the questionnaire is attached in Appendix 9.4).
4.4.1 Background Information Section
This section aims to gather the participants’ demographic information (e.g. gender),
profession and experience of using visualisations. Because the dashboard built
here aims to support cancer research scientists and students in their research, their related
knowledge would influence their judgement and assessment criteria. The example
below is one of the questions about visualisation experience.
Figure 4.6 Question examples of visualisation creating experience
4.4.2 Evaluation Section & Dashboard usage impression Section
These two sections aim to find out whether the dashboard can help participants locate the
answer to a specific question, whether the graphs are clear enough for them to understand,
whether the graphs help them to know more about lung adenocarcinoma and, finally, whether the
dashboard could help their further study. Two kinds of questions are therefore
included: questions evaluating the utility of the dashboard for supporting the case studies
and questions evaluating the graphs themselves.
As three types of visualisations were built to support the three case studies respectively,
the evaluation section consists of three parts: “Survival Plot”, “Heatmap” and “Box
Plot”. The following three questions are examples designed according to each case study
and the other two are used for indicating the graph and dashboard utility.
Figure 4.7 Question examples for case study 1
Figure 4.8 Question examples for case study 2
Figure 4.9 Question examples for case study 3
The three figures above are example questions for finding out whether the graphs achieve the
aims of each case study. The questions were not designed to be too difficult for
participants to answer, but because different graphs are not equally difficult to
understand, the questions also differ in difficulty.
Figure 4.7 is a question about one of the survival plots, aiming to find out whether the
participants could recognise the survival features of TP53, which links back to the
requirement of case study 1. Figure 4.8 relates to the box plot section, aiming to indicate
whether the participants could identify the differences in MET expression between cancers,
which links back to case study 2. The question in Figure 4.9 is more difficult as it concerns
the heat map, which is used to visualise the relationship among the MET gene, ZNF500
expression and each cancer RNA sequence, linking back to case study 3.
The two figures below are examples assessing the utility of each graph and of the dashboard.
Most of the questions in these two sections were designed as linear scales, which illustrate
the level of participants’ agreement more precisely and objectively than fixed
answers. Also, the scores of each question can be measured and compared.
Figure 4.10 Question examples for evaluating the graph itself
Figure 4.11 Question examples for evaluating the dashboard utility
4.4.3 Select participants and Conduct assessment
As this dashboard is designed to help cancer research scientists and students,
participants with an oncology-related background or who are conducting cancer research were
the priority. In addition, data scientists or clinical data managers should also be targeted.
Moreover, a search of the University of Sheffield website showed that
some biomedical scientists could be contributing to cancer research. Consequently,
the sample was selected from the staff lists of the Department of Oncology & Metabolism and
the Department of Biomedical Science. Students of the MSc Data Science programme were
selected as well. In addition, one of the candidates is a student who graduated from the MSc
Data Science programme last year and now works at Cancer Research UK.
In order to improve the likelihood of a response, each email was tailored to the
research interests of the candidate and then sent as an invitation, together with the research
statement, the ethics approval letter and the links to the evaluation form and dashboard (the
template of the email is attached in Appendix 9.2).
4.5 Research Ethics
This study has been identified as “low risk” and approved on ethics grounds by the Information
School. The ethics application form, ethics approval letter and consent form are attached in
Appendix 9.3.
4.5.1 Questionnaire and Participants
Although the questionnaire collects demographic information about the participants, all the
information will be anonymised and no identifying information will be presented. Even
though this research involves human participants, the questions are not culturally or
politically sensitive.
4.5.2 Datasets and Data storage
This research, as described above, used data from TCGA, which offers open-access cancer
project data and donor sample data. All of the donor data are anonymous and show no
personal information, in order to protect confidentiality. As TCGA is a public website for
visualising and accessing data, signing a consent form is not necessary. All the
genomics datasets were downloaded from the TCGA website and stored in the online drive of
my university account. The questionnaires were sent out via a “Google Form” and the
response spreadsheet was stored in my university account as well. After the
dissertation is submitted, all the data will be sent to the database of the Information School.
5.0 Results
5.1 Participant Background Information
Most participants have rich research experience and a high level of education. Figure 5.1 presents the background information of
participants, namely the distributions of age, education level and years of professional experience. As this study mainly focuses on
participants with rich academic experience or a cancer research related background, most of them are over 26 years old (58.8%);
participants aged “26 - 35” and “36 - 45” account for the same proportion (17.6% each), while “46 - 55” and “56 - 65” also share the same
percentage (11.8% each). 41.2% of them have attained a postdoctoral degree and 52.9% have a master’s degree. The largest group
(41.2%) has been in their current profession for less than 1 year, and 29.4% have more than 10 years of professional experience. They were
also asked to write down their current profession (shown in Appendix 9.5). This further confirms that they have rich research experience in
their profession, which is important for obtaining valuable and helpful suggestions from this dashboard evaluation.
Figure 5.1 Distribution of age, education level, profession year experience
Participants show a strong need for data visualisations in their daily work. Figure 5.2 shows how often they need to
create visualisations (e.g. bar charts, box plots, survival plots) in their work or study. It indicates that 47.1% of participants always
need to do so and a further 47.1% often need to do so. When asked whether they rely more upon visualisations or text to present their
work, 41.2% rely more on visualisations and 29.4% stay neutral (Figure 5.3).
Figure 5.2 Frequency of creating data visualisations
Figure 5.3 Preference of using visualisations and text
5.2 Dashboard description and Evaluation results
Figure 5.4 First page of the dashboard
On the first page of the dashboard (Figure 5.4), the explanation of the term abbreviations used in this dashboard is presented as a help note so
that users gain a general understanding before exploring the dashboard. This help note is also shown below each graph.
When asked whether they agree that the note helps them to better understand the graphs (Figure 5.5), 70.6%
agree, including 35.3% who completely agree.
Figure 5.5 Help note agreement
Figure 5.6 “Different cancer types” section: Two cancer type comparison (left) and three cancer type comparison
5.2.1 Survival Plot Section 1: “Different cancer types”
The Survival Plot section is divided into two subsections, “Different cancer types” and “Comparison in different genes”, and Figure 5.6 presents two of the graphs in the “Different cancer types” section. The first graph compares the survival patterns of LUAD and LUSC; because both belong to the non-small cell lung cancers, it is meaningful to find their differences or similarities. The right-hand graph compares LUAD, LUSC and BRCA, which means that a totally different type of cancer, breast cancer, was added to the comparison. It shows that LUAD and LUSC have a much lower survival probability than breast cancer, which is also generally accepted in medical research (Siegel, Miller & Jemal, 2015).
For case study 1, the task is to visualise the survival characteristics of LUAD, so what has been presented here achieves the aim of
case study 1. However, in the evaluation result for the right-hand graph (Figure 5.7), only 41.2% of participants could recognise that LUAD has
the lowest survival probability compared with the other two types of cancer, and 23.5% were unsure which answer was right.
This may be why 35.3% of participants stayed neutral when asked whether they agree that the graphs are easy enough to understand (see
Figure 5.8 below). Nevertheless, it should be emphasised that more than half (59%) of participants think that the graphs
are not difficult to comprehend, including 23.5% who completely agree. Moreover, only 23.6% think there is too
much information in these graphs (Figure 5.9 below), while 47% disagree. Hence, the evidence indicates that the survival plot itself
is not the main problem leading to the confusion.
Figure 5.7 Answer locating distribution of “Three cancer type comparison” (top right)
Figure 5.8 Easy level agreements of graphs in the “Different cancer types” section (graphs shown in Figure 5.6)
Figure 5.9 Agreement of too much information for graphs in the “Different cancer types” section (graphs shown in Figure 5.6)
In order to find out what may cause confusion or erroneous answers, a multiple choice question was asked and its results are presented in
Figure 5.10. Most participants (52.9%) agree that the legend and labels of these graphs could result in errors.
Specifically, one of the participants explained that a fixed time frame should be defined when making the comparison, as LUAD and LUSC survival
changes over time. In addition, the shaded regions representing the standard error of the mean (SEM) ought to be defined in the legend of the
graph as well. For the labels, the unit of the time axis should be given to avoid misunderstandings. In addition to the legend and labels,
unfamiliarity with the medical terms is mentioned as one of the causes, because it takes time to match the terms in the graphs with the help
note at the bottom in order to locate the answer.
Figure 5.10 Possible elements that could cause erroneous answers of the “Different cancer types” section
There are three graphs in this section in order to find out whether it is more useful for cancer researchers to
compare more types of cancer at one time. From Figure 5.11, 64.7% of participants agree that it is useful to make comparisons among different
cancers. However, Phil Chapman stated that this sometimes depends on what kind of problem needs to be solved, which may be why
23.5% stay neutral and 11.8% disagree. This implies that existing online portals for visualising
genomics data sets should provide various choices to meet the different needs of scientists. In this case, it may be useful to offer a function
that can present graphs both with and without comparison between different cancers.
(Answer options shown in Figure 5.10: the legend of the graph; the title of the graph; the labels of the graph; none of them; other.)
Figure 5.11 Agreement of comparison among different cancers
When asked whether these graphs gave them insight into researching LUAD, about half of participants agree (29.4% chose 4 and 23.5% chose 5 on the scale), while 23.5% disagree (5.9% chose 1 and 17.6% chose 2) (Figure 5.12).
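Because the evaluation form was completed by 17 participants, each reported percentage corresponds to a whole number of respondents. The Python sketch below shows the arithmetic for the insight question. The per-score counts are inferred from the reported percentages rather than taken from the raw responses, so they should be read as an assumption.

```python
# Counts per linear-scale score (1 = lowest agreement, 5 = highest agreement),
# inferred from the reported percentages with 17 respondents (an assumption).
counts = {1: 1, 2: 3, 3: 4, 4: 5, 5: 4}
total = sum(counts.values())

def percent(n, out_of):
    """Percentage rounded to one decimal place."""
    return round(100 * n / out_of, 1)

per_score = {score: percent(n, total) for score, n in counts.items()}
agree = percent(counts[4] + counts[5], total)
disagree = percent(counts[1] + counts[2], total)

print(per_score)
print(f"agree: {agree}%, disagree: {disagree}%")
```

Working from counts also explains small discrepancies elsewhere in the results: adding two already-rounded percentages can differ by 0.1 from the percentage computed on the combined count.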
Figure 5.12 Insight gaining agreement of the graphs in the “Different cancer types” section
5.2.2 Survival Plot Section 2: “Comparison in different genes”
Figure 5.13 “Comparison in different genes” section: MET gene (left) and TP53 gene
Similarly, the second section of “Survival Plot” aims to represent the survival patterns of LUAD, but this time by gene (Figure 5.13).
Of the two graphs, one shows the survival patterns for the MET gene in LUAD and ACC (left) and the other shows
those for the TP53 gene (right). Both genes have been identified as significantly mutated genes in LUAD, though TP53
showed a higher significance (Devarakonda, Morgensztern & Govindan, 2015). Thus, it may be meaningful to know how their
survival patterns differ in LUAD. In addition, the gene status of both LUAD and ACC was divided into mutated type and wild type.
As Figure 5.14 shows, 64.7% of participants consider it useful to compare mutated type and wild type, though 29.4%
stay neutral.
Figure 5.14 Agreement of comparison between mutation type and wild type
Figure 5.15 Answer locating distribution of “TP53” survival plot
Compared with the first question in the last section, Figure 5.15 demonstrates a higher accuracy for this question: 70.6% of participants
(compared with 41.2%) could identify that ACC with mutated TP53 has the lowest survival probability. After comparing individual responses, it is
interesting to find that 6 participants who chose the wrong answer in the last section gave the correct answer in this section. In addition, more
participants (82.3% compared with 58.8%) agree that the graphs in this section are easy enough to read (Figure 5.16). Moreover, only 5.9% agree
that these graphs present too much information, even though more than half stay neutral (Figure 5.17). It should be highlighted that no one
considers these graphs to be of no use for deriving insight into researching LUAD, and 70.5% indicate that they are helpful
(Figure 5.18). All this evidence demonstrates that participants comprehended this section better.
Figure 5.16 Easy level agreements of graphs in the “Comparison in different genes” section (graphs shown in Figure 5.13)
Figure 5.17 Agreement of too much information for graphs in the “Comparison in different genes” section (graphs shown in Figure 5.13)
Figure 5.18 Insight gaining agreement of the graphs in the “Comparison in different genes” section
There may be two reasons. First, in this case, the four survival curves do not overlap or cross much, while in the last section the survival
curves of LUAD and LUSC cross once during the whole time period. This may result in less confusion. Second, participants may have become more
familiar with the survival plot after exploring the last section.
Although more accurate answers were given, elements that may result in erroneous answers still exist. Among them, the labels of the graph are
mentioned most (52.9%) and the legend comes second (35.3%) (Figure 5.19).
Figure 5.19 Possible elements that could cause erroneous answers of the “Comparison in different genes” section
(Answer options shown in Figure 5.19: the legend of the graph; the title of the graph; the labels of the graph; none of them; other.)
5.2.3 Box Plot Section
Figure 5.20 Box Plot Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA
In the “Box Plot” section, there are two graphs and each of them compares the logarithm of MET expression across the RNA sequence data of
four types of cancer at one time. As stated in the methodology, this section aims to achieve the task of case study 2: the box plot
is used to visualise the MET expression of LUAD and of the compared cancer types. With this visualisation, the cancer
with the highest MET expression can be recognised, which could potentially indicate gene overexpression (Prelich, 2012). In Figure 5.20, the
x-axis label “logarithm of MET” stands for MET expression. As “LUAD.rnaseq” has the highest median logarithm of MET, this
implies that the MET gene is more likely to be overexpressed in LUAD than in the other cancer types included here.
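The comparison described above reduces to finding the group with the highest median of log1p(MET). A minimal Python sketch with hypothetical expression values (not TCGA data) follows; the data set names mirror the labels used in the dashboard, but the numbers are invented for illustration.

```python
import math
from statistics import median

# Hypothetical raw MET expression values per RNA-seq data set.
met_by_cancer = {
    "LUAD.rnaseq": [4200, 5100, 6100],
    "LUSC.rnaseq": [1500, 2100, 2600],
    "BRCA.rnaseq": [600, 900, 1300],
}

# Median of log1p(MET) for each cancer type, the value marked inside each box.
median_log_met = {
    name: median(math.log1p(v) for v in values)
    for name, values in met_by_cancer.items()
}
highest = max(median_log_met, key=median_log_met.get)
print(highest, round(median_log_met[highest], 2))
```

A higher median alone is only suggestive; as the text notes, it potentially indicates overexpression rather than demonstrating it.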
Compared with the first “Survival Plot” section, the accuracy of the case-associated question is higher in this section. As Figure 5.21 shows,
those who chose the right answer, “LUAD”, account for 70.6%, compared with 41.2% in the first “Survival Plot” section. Additionally, no one
disagreed that these graphs are easy enough to understand (Figure 5.22). Together, these results imply that, in this dashboard, the box plot
is to a degree easier for participants to read than the survival plot.
Figure 5.21 Answer locating distribution of LUAD & LUSC & BRCA & ACC (left) box plot
Figure 5.22 Easy level agreements of graphs in the Box Plot section (graphs shown in Figure 5.20)
To find out how useful the box plot presented here is, related questions were asked. 76.5% of respondents agree that it is useful to visualise the gene feature in this way (Figure 5.23). In addition, 76.4% of respondents agree that they could gain insight into researching LUAD from these graphs (Figure 5.24), which implies that the box plots have achieved the task of case study 2, though some elements still need to be improved (Figure 5.25).
Figure 5.23 Utility agreements of graphs in the Box Plot section
Figure 5.24 Insight gaining agreement of the graphs in the Box Plot section
Figure 5.25 Possible elements that could cause erroneous answers of the Box Plot section
Among those who got the wrong answer or were not sure about the answer, most consider the labels (52.9%) of these graphs the element most
likely to lead to a wrong answer. Specifically, some participants mentioned that the font of the labels is too small to read. In addition, one
of the researchers advised that the points and the differing lengths of the lines in the box plots should be explained. Furthermore, one
participant, a research scientist with over 10 years of experience, suggested that the suffix “.rnaseq” is not necessary in the labels. Even
though the data used here is RNA sequencing data, a label such as “LUAD.rnaseq” is still only used to represent the MET expression feature of
LUAD. It may also confuse users about the difference in meaning between “LUAD” and “LUAD.rnaseq”. Therefore, it is more direct to use “LUAD”
instead of “LUAD.rnaseq” for better understanding.
(Answer options shown in Figure 5.25: the legend of the graph; the title of the graph; the labels of the graph; none of them; other.)
5.4 Heat map Section
Figure 5.26 Heat map Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA (right)
As in the previous section, the “Heat map” section consists of two graphs, each of which contains data from four different cancer types. Of
the three dimensions of data, the y-axis shows four intervals of the MET expression data, the x-axis shows the RNA sequencing data of the four
cancer types, and the expression of the other gene, ZNF500, is presented as the colour of each patch. Hence, the heat map here is used to
reveal the correlation between ZNF500 expression and MET expression among the four cancer types. As stated above, the task of case study 3 is
to visualise molecular profile features of LUAD, and visualising gene expression patterns was the approach finally chosen. In this case, the
gene expression correlation of each cancer type is one of its molecular profile features.
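How three dimensions of data map onto the patches of such a heat map can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the four MET intervals were chosen or how ZNF500 values are combined within a patch, so equal-width intervals and a simple mean are assumed here, and both the function name `heatmap_cells` and the records are hypothetical.

```python
from statistics import mean


def heatmap_cells(records, n_bins=4):
    """Group (cancer_type, met, znf500) records into heat map cells.

    Returns {(interval_index, cancer_type): mean ZNF500 value}, where
    interval_index 0..n_bins-1 is an equal-width interval of MET expression.
    Assumes the MET values are not all identical.
    """
    met_values = [met for _, met, _ in records]
    low, high = min(met_values), max(met_values)
    width = (high - low) / n_bins
    cells = {}
    for cancer_type, met, znf500 in records:
        # The maximum MET value would land in interval n_bins, so clamp it.
        index = min(int((met - low) / width), n_bins - 1)
        cells.setdefault((index, cancer_type), []).append(znf500)
    # The colour of each patch corresponds to the mean ZNF500 value in the cell.
    return {key: mean(values) for key, values in cells.items()}


# Made-up records: (cancer type, MET expression, ZNF500 expression).
records = [
    ("LUAD", 0.0, 1.0), ("LUAD", 1.0, 3.0), ("LUAD", 4.0, 5.0),
    ("BLCA", 7.5, 9.0), ("BLCA", 8.0, 7.0),
]
cells = heatmap_cells(records)
print(cells)
```

Reading a correlation from this layout means checking whether the colour changes consistently as the MET interval rises within one column, which requires more steps than reading a median from a box plot.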
To validate whether the heat map used in the dashboard is useful for visualising the molecular profile features of LUAD, related questions
were asked. As Figure 5.27 shows, 64.7% of respondents agree that it is helpful for understanding LUAD molecular profile patterns. In
addition, more than half (58.8%) of them agree that these visualisations provide insight into researching LUAD.
Figure 5.27 Utility agreements of graphs in the Heat map section
Figure 5.28 Insight gaining agreement of the graphs in the Heat map section
Figure 5.29 shows that the accuracy of the case-related question in this section is much lower than in the previous section. Only one
respondent could identify that the ZNF500 expression of BLCA has the highest correlation with the MET gene. In addition, 29.4% of respondents
found the graphs difficult to understand, which is much higher than in the “Box Plot” section (0%). Furthermore, 17.6% of them agree that the
graphs present too much information, which is also more than in the “Box Plot” section (11.8%).
Figure 5.30 Answer locating distribution of LUAD & LUSC & BRCA & BLCA (right) heat map
Figure 5.31 Easy level agreements of graphs in the Heat map section (graphs shown in Figure 5.26)
Figure 5.32 Agreement of too much information for graphs in the Heat map section (graphs shown in Figure 5.26)
Most of them (52.9%) think that the labels of the graph could lead to wrong answers, while the legend and the title rank second (23.5%) and
third (17.6%), as shown in Figure 5.33. It should be stressed that no one mentioned the title as a problem in the previous two sections.
Hence, the title and legend of these graphs could be the main cause of errors in this case. Specifically, one of the respondents, an academic
researcher with 6 to 10 years of experience, stated that it is difficult to work out what kind of information the heat map is intended to
present and why.
Figure 5.33 Possible elements that could cause erroneous answers of the Heat map section
(Answer options shown in Figure 5.33: the legend of the graph; the title of the graph; the labels of the graph; none of them; other.)
5.5 The whole impression of this dashboard for participants
Figure 5.34 shows that more than half (52.9%) of the participants were quite familiar with similar dashboards (visualisations) before using
the one presented here, while 29.4% of them were not familiar with them. After using this dashboard, 68.8% of them stated that they would be
willing to use it in their future work or study if needed, while those who stayed neutral and those who would not account for 25% and 6.3%,
respectively (Figure 5.35). Some of them explained their reasons, and all the responses can be summarised as follows.
1. This kind of dashboard could be a useful learning tool. It not only simplifies the process of interpretation, but also shows users, as a
reference, what types of visualisations and analyses can be applied to these types of datasets. This is especially helpful for
beginners learning to analyse genomic data.
2. This dashboard needs to be more flexible and sophisticated. One of the respondents stressed that this dashboard can do the job as a
basic one with limited data, but it would not be needed again due to its limited information and functions. In addition, it was suggested
that the dashboard would be more functional if users could import external data for comparison. Thus, it would be more useful with a
function that allows users to import their own data and visualise it with the survival plot, heat map and box plot options.
3. Much more annotation and guidance are required. As mentioned above, the legends, labels and titles of the graphs in this dashboard
need to be improved for better understanding. In addition, one respondent highlighted that it takes time for beginners to recognise the
different cancer types and gene types when locating answers. This means that the annotation and guidance on the dashboard are not
sufficient.
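As a small arithmetic check on the willingness percentages reported before this list, the sketch below converts answer counts into percentages. The counts are an assumption: the text does not state them, but 11, 4 and 1 out of 16 answers is one split that reproduces the reported 68.8%, 25% and 6.3%. The function name `percentages` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def percentages(counts):
    """Convert answer counts to percentages rounded half-up to one decimal place."""
    total = sum(counts.values())
    return {
        answer: (Decimal(100 * n) / Decimal(total)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        for answer, n in counts.items()
    }


# Assumed counts (not stated in the text): 16 answers split 11 / 4 / 1.
willingness = percentages({"willing": 11, "neutral": 4, "not willing": 1})
print(willingness)
```

`Decimal` with `ROUND_HALF_UP` is used because Python's built-in `round` rounds exact ties to the nearest even digit, so 6.25 would become 6.2 rather than the 6.3 reported here.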
Figure 5.34 Familiarity of the similar dashboard
Figure 5.35 Willingness of using the dashboard in the future works (or study)
5.6 Further validation
As mentioned above, the box plot was interpreted better by participants. To validate and compare this further in a quantitative way, scores
were calculated. In the questionnaire, the same set of questions was asked for each section of graphs. Thus, the average scores for each of
the three questions were calculated for comparison. The specific questions are:
1. Which level do you agree that these graphs are easy enough to read (or understand)?
2. Which level do you agree that these graphs present too much information?
3. Which level do you agree that these graphs have derived insight into researching lung adenocarcinoma?
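The averaging described above can be sketched as follows. This is a minimal illustration and not the study's actual calculation: the five answer labels and their coding from 1 to 5 are assumptions (the reported scores only show that they lie between 1.00 and 5.00), and the 17 answers are made up.

```python
# Assumed coding of a five-point agreement scale (labels are hypothetical).
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def average_score(responses):
    """Mean numeric score of a list of agreement labels, rounded to 2 decimals."""
    scores = [LIKERT[label] for label in responses]
    return round(sum(scores) / len(scores), 2)


# Made-up answers from 17 participants to one question (not the study's data).
responses = (
    ["Strongly agree"] * 3 + ["Agree"] * 10 + ["Neutral"] * 3 + ["Disagree"] * 1
)
print(average_score(responses))
```

With only 17 participants, a single answer moving by one scale point changes the average by about 0.06, so small differences between visualisation types should be read with caution.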
Figure 5.36 Interpretation measures of different visualisation types
(Average scores shown in the figure, in the order easy level agreement, too much information, insight gaining agreement.)
Survival 1: 3.76, 2.65, 3.47
Survival 2: 3.88, 2.47, 3.88
Heat map: 3.24, 2.76, 3.29
Box Plot: 3.88, 2.65, 3.94
This figure provides further and more direct validation that participants find it easier to understand and gain insight from the box plot and
the second survival plot. In the first measure, “Easy level agreement”, the second survival plot and the box plot have the same score (3.88),
which is slightly higher than the first survival plot (3.76) and much higher than the heat map (3.24). In the second measure, “Too much
information”, the second survival plot received the lowest score (2.47), the box plot and the first survival plot come next (2.65), and the
heat map received the highest score (2.76), which means that participants agreed most that the heat map contains too much information. In the
last measure, which asks how much insight was gained, the box plot has the highest score (3.94) and the heat map the lowest (3.29). This could
indicate that the difficulty of interpreting the graph itself has a direct impact on how well the information is conveyed.
Figure 5.37 Interpretation measures of participants with different profession experience
(Average scores shown in the figure, in the order easy level agreement, too much information, insight gaining agreement.)
More than 10 years: 3.72, 2.64, 3.77
6 to 10 years: 3.25, 2.50, 1.75
3 to 6 years: 4.00, 2.53, 3.75
1 to 3 years: 3.77, 2.80, 3.61
Less than 1 year: 3.68, 2.75, 3.68
Figure 5.37 uses the same measures as the previous figure, but with different categories. As participants were asked how long they had been in
their current profession, a comparison can be made to find out whether professional experience influences their understanding of the
visualisations. Notably, those with less than 1 year of professional experience show lower scores in “Easy level agreement” and “Insight
gaining agreement” and higher scores in “Too much information”. Those who have worked for more than 10 years gave the highest scores in
insight gaining but still found that the graphs contain too much information when compared with those who have been in their profession for
3 to 10 years. This means that, even though they could not interpret the visualisations completely, they could still use their professional
experience to understand the graphs. It is interesting to find that people with 3 to 6 years of professional experience obtained better
interpretation results overall.
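A grouped comparison of this kind can be sketched as follows. This is an illustration with made-up rows, not the study's data, and the function name `group_means` and the field names are hypothetical. The sketch also returns the size of each group, since an average based on a single participant is easy to over-read.

```python
from collections import defaultdict


def group_means(rows, group_key, measure):
    """Return {group: (mean of measure rounded to 2 decimals, group size)}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[measure])
    return {
        group: (round(sum(values) / len(values), 2), len(values))
        for group, values in grouped.items()
    }


# Made-up rows, one per participant (not the study's data).
rows = [
    {"experience": "Less than 1 year", "easy": 4},
    {"experience": "Less than 1 year", "easy": 3},
    {"experience": "3 to 6 years", "easy": 4},
    {"experience": "6 to 10 years", "easy": 3},
]
summary = group_means(rows, "experience", "easy")
print(summary)
```

Reporting the group size next to each mean makes it visible when a category, such as a group with one respondent, cannot support a general conclusion.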
Figure 5.38 Interpretation measures of participants with different levels of visualisation familiarity
(Average scores shown in the figure for each scale value in the legend, in the order easy level agreement, too much information, insight gaining agreement.)
1: 4.25, 1.00, 5.00
2: 3.69, 2.63, 3.65
3: 3.68, 2.75, 3.68
4: 4.00, 2.53, 3.75
5: 3.77, 2.80, 3.61
To find out how familiarity with data visualisations affects the interpretation of this dashboard, the measures were grouped by participants’
answers to a scaled question on this topic. On the scale of this question, “1” in the legend stands for those who rely more on visualisations,
while “5” represents those who prefer to use text. Although there is no obvious linear trend in any measure, some findings are still worth
stressing here. The participant who relies completely on visualisations at work shows the highest score (5.00) in insight gaining and the
lowest score (1.00) in the “Too much information” measure. On the other hand, those who completely prefer to use text for presenting have the
lowest score (3.61) in insight gaining and agreed most (2.80) that the graphs contain too much information. For the “Easy level agreement”
measure, the strongest preference for visualisations also shows the highest score (4.25).
Figure 5.39 Interpretation measures of participants with different domain knowledge
(Average scores shown in the figure, in the order easy level agreement, too much information, insight gaining agreement.)
Academic staff & students: 3.69, 2.63, 3.65
Data Science: 4.00, 2.59, 3.72
Others: 4.00, 2.00, 4.00
In Figure 5.39, the measures are categorised by participants’ domain expertise. The category “Academic staff & students” stands for
participants selected from the staff lists of the Department of Oncology & Metabolism and the Department of Biomedical Science. “Data Science”
represents the participants from the MSc Data Science programme, while “Others” contains only one participant, a data scientist who has worked
at Cancer Research UK for a year. This comparison is used to find out how different domain knowledge influences the interpretation of these
visualisations. In Figure 5.39, “Others” shows the best comprehension, with the highest scores in both “Easy level agreement” (4.00) and
“Insight gaining agreement” (4.00) and the lowest score (2.00) in the “Too much information” measure. However, it should be emphasised that,
as this represents the results of only one person, it is not statistically significant.
Moreover, it is important to note that participants with data science expertise show better understanding than those with cancer or
biomedical knowledge. This may be because, in this dashboard, users were not required to explore the genomics datasets. They could use the
help note for basic understanding, so they did not need to comprehend the original datasets when interpreting the visualisations. In
addition, as they are familiar with data visualisation and data analysis, they could first recognise what kind of analysis was used in the
graphs in order to understand what is expressed in the visualisations. This may also be why they found the graphs easier to interpret (4.00)
and gained slightly more insight (3.72) than those with cancer or biomedical knowledge.
However, the recommendation comments show that most suggestions from participants with cancer or biomedical knowledge focus on functional
improvement rather than only on improving the visualisations. This means that these participants think beyond the visualisations and the
dashboard itself, and also consider the datasets and the development of related functions.
Figure 5.40 Willingness of participants with different domain knowledge
Willingness to use the dashboard in future work was compared among participants with different domain knowledge. Participants with data
science expertise gave higher scores (4.63 > 4.19) than those from the Department of Oncology & Metabolism or Biomedical Science. This means
that the former are more willing than the latter to use this type of dashboard when necessary.
(Average willingness to use the dashboard in the future, as shown in Figure 5.40: Academic staff & students 4.19; Data Science 4.63; Others 3.00.)
Figure 5.41 Willingness of participants with different frequencies of creating visualisations
Willingness was also measured among participants who create data visualisations with different frequencies, which they rated on a numeric
scale. All the participants involved in the study need to create data visualisations regularly, as the minimum frequency reported is “3”. As
Figure 5.41 shows, those who create data visualisations most frequently (5) gave the highest score (4.19) in usage willingness, which is
slightly higher (4.19 > 4.13) than those in the second rank of frequency (4). Those who create visualisations less often in their work gave
the lowest score (3.00) in usage willingness.
(Average willingness to use the dashboard in the future, as shown in Figure 5.41, by frequency rating: 3: 3.00; 4: 4.13; 5: 4.19.)
6.0 Discussion and Dashboard design recommendations
6.1 Participants
The backgrounds of the participants involved in this dashboard evaluation cover a broad and
balanced range of attributes, which improves the diversity of this study. In terms of age,
apart from the largest group of 18-25 (41.2%), the other groups, ranging from 26 to 65 years
old, have similar numbers of respondents. In addition, students and people who are working
represent the same proportion (47.1%). The education level of participants ranges from
master’s to postdoctoral. Their professions include data science student, data scientist,
and professor, academic staff or research technician from the Oncology & Metabolism
department and the Biomedical Science department. Moreover, the time they have spent in
their profession ranges from less than 1 year to more than 10 years.
6.2 Academic research (working) experience impacts
Academic research experience or working experience shows a positive impact not only on
interpreting the visualisations and using the dashboard, but also on gaining insight.
As shown in Figure 5.37, those with less professional experience demonstrate a lower level
of agreement on gaining insight from the visualisations than those who have worked for
more than 10 years. In addition, those with rich research experience expressed less
agreement that the visualisations contain too much information.
However, the gap between groups is small, except for the single respondent who has
worked for 6 to 10 years. This may be because, even though participants come from
different disciplines, all of them possess part of the knowledge required to interpret those
visualisations. Additionally, the help note compensated for gaps in some participants’
knowledge to a certain extent. Furthermore, the limited sample size may also affect the
significance of the findings.
6.3 Domain knowledge familiarity impacts
It is interesting to find that participants with a data science background show slightly more
positive results than those who are experts in cancer or biomedicine. This result, to some
extent, supports the statement of Klonowska et al. (2015) that the visualisations on
genomics research portals are able to help those without a bioinformatics background.
This is also the ambition of the existing cancer research portals. However, further
exploration shows that those with only a data science background cannot completely
understand the meaning of the data even if they could choose the right answers.
What should be emphasised here is that the one participant who possesses both domains of
knowledge (“Others” in Figure 5.39) shows a much better interpretation result. This, at the
same time, provides evidence for the argument of Schroeder, Gonzalez-Perez, & Lopez-Bigas
(2013) that programming and bioinformatics knowledge are both necessary to use the
visualisations efficiently. Consequently, interpreting this dashboard completely requires
not only general knowledge of interpreting visualisations, but also knowledge related to
oncology, genetics, cancer or biomedicine.
Additionally, visualisation usage is another kind of domain knowledge that could affect the
comprehension of the visualisations. As shown in Figure 5.38, the one participant who
completely prefers to use visualisations when presenting work shows clearly higher insight
gaining agreement, much less agreement that there is too much information and higher
agreement on easy interpretation. Further exploration shows that this participant is a
professor of bioinformatics imaging, which indicates frequent usage of visualisations. This
can be linked to the study of Ellard (2015), which states that both familiarity with the
dataset and experience of dashboard usage could influence users’ confidence in referencing
the information gained from the dashboard. That is why the participants involved in this
study were selected from the departments of Oncology & Metabolism, Biomedical Science and
Data Science, so that the gap in domain knowledge could be recognised, to a degree.
6.4 Dashboard comprehension
In addition to domain knowledge and associated technical expertise, other elements could
contribute to comprehending the dashboard more efficiently. In the evaluation, participants
commented on the unclear annotation and limited guidance. Even those with expertise in
cancer or biomedical research still need clear annotation to interpret the dataset and the
analysis methods used in the visualisations, not to mention beginners. One of the students
commented that she did not understand the information in the visualisations, and another
mentioned that it was time consuming to identify the labels of cancer types and gene types
when locating the answers.
Furthermore, the annotation and guidance (help note) should be designed according to the
different types of graphs. In this dashboard, the second survival plot and the box plot
show better interpretation results and the heat map the worst. This means that more
explanation of the heat map should be provided for users in this case.
The improved comprehension of the second survival plot may also result from learning by
exploring. Elias (2012) states that the design principles of a dashboard influence the
learning process of users, because the interactive dashboard is a storytelling tool. Hence,
when arranging the visualisations and designing the annotation or guidance of an
interactive dashboard, the potential links and relationships between them should be
considered.
6.5 Dashboard design recommendations
The annotation and guidance (help note) should be clear and precise. In this dashboard, due
to the unclear legends and labels of the graphs, participants found it difficult to quickly
locate the answers to the questions and to understand the visualisations. Specifically, as
discussed above, the annotation and guidance should cater for users with different domain
knowledge and technical expertise, both of which influence visualisation comprehension.
An interactive dashboard is more than just a combination of visualisations. It can be used
as a self-learning tool to simplify the process of extracting information from large,
multidimensional data (Few, 2007), such as genomics data. Participants commented that the
dashboard created here not only offers help for interpreting this type of dataset, but also
shows approaches for visualising and analysing genomics datasets.
In addition, some participants mentioned that the dashboard should be more functional and
intelligent. For example, it could be developed into a tool that is able to import data from
existing genomics research portals such as ICGC and cBioPortal, as well as data from users.
In this way, users could apply the visualisations designed in this dashboard to make
comparisons and explore the data themselves, which would improve the level of
comprehension.
The design of a dashboard for visualising genomic data should meet the different
requirements of scientists from different domains. As presented in Figure 5.40,
participants with different domain knowledge expressed different levels of willingness
to use this kind of dashboard in their future study or work. Those with a data science
background are more willing to use this dashboard. This may be because, for those
without cancer or bioinformatics expertise, using the interactive dashboard to
process data and access the visualisations is more direct and requires less specialist
knowledge.
7.0 Conclusions and Recommendations
7.1 Conclusions
Various existing online portals for analysing and visualising genomics datasets offer a
substantial number of attractive and sophisticated interactive visualisations. This
indicates that great progress has been made in visualising multidimensional genomics data,
which has been helping cancer research scientists to make decisions more efficiently.
However, little research has been conducted to evaluate whether these visualisations or
portals are useful enough for scientists, whether their usage efficiency varies between
different types of scientists and how helpful they are. Thus, this study used the genomics
datasets of TCGA to create an interactive dashboard containing several visualisations that
aim to help with researching lung adenocarcinoma. A systematic assessment was then set up
to evaluate the efficiency of the dashboard. Some issues with the dashboard were
identified, and they could be used for developing and improving existing online cancer
research portals.
Academic research experience or working experience is identified as having a positive
impact on comprehending the visualisations as well as on operating the interactive
dashboard. For instance, those with rich professional experience showed less agreement
that the visualisations contain too much information. In addition, compared with
participants with oncological or biomedical expertise, those from the MSc Data Science
programme performed better at locating the right answers. However, participants with
oncological or biomedical knowledge thought beyond improving the dashboard interface and
provided suggestions on how to make the dashboard more intelligent and practical.
In addition to domain knowledge, previous visualisation usage could also affect the
comprehension of the visualisations. The type of graph also leads to different
interpretation results. Furthermore, annotation and guidance play a significant role in
visualisation understanding, because the indistinct legends and labels in the dashboard
resulted in inefficient dashboard usage. Consequently, given the impact of domain knowledge
and technical expertise, the annotation and guidance (help note) should be designed
according to the different types of graphs and users.
It is interesting to find that, during the dashboard assessment, this interactive dashboard
was used as more than a visualisation tool: it also served as a self-learning tool for
learning cancer-related knowledge as well as approaches for analysing and visualising
genomics datasets. For this reason, some participants could see further functional
development for this type of dashboard and suggested that being able to import data for
more exploration and to make comparisons with existing datasets (e.g. ICGC, TCGA) could be
an active area. Other participants, for example, said that it offers potential methods for
visualising and analysing these types of datasets as a reference.
7.2 Recommendations for future research
7.2.1 For data samples and participants
The sample size should be increased to obtain more solid results and to allow statistical
significance testing, as the number of participants involved in this study was limited.
This, to some extent, limits the reliability of the findings in this study. Larger datasets
and many more types of cancer data should be included, as this dashboard focused on only
several cancer types and gene types.
7.2.2 For the interactive dashboard
A broader range of analysis approaches could be included in the evaluation, which also
means that a wider variety of visualisations should be created and validated by users.
Before evaluation, more functions could be designed and added for a more functional and
intelligent dashboard.
7.2.3 For the evaluation methods
The approach to evaluating the dashboard should be improved by designing and using more
reliable and multidimensional assessment indicators. In addition to a questionnaire, other
assessment approaches (e.g. interviews, real-time monitoring) could be used to improve the
measurement process.
8.0 References
Arcila, M. E., Drilon, A., Sylvester, B. E., Lovly, C. M., Borsu, L., Reva, B., ... & Ladanyi, M.
(2015). MAP2K1 (MEK1) mutations define a distinct subset of lung adenocarcinoma
associated with smoking. Clinical Cancer Research, 21(8), 1935-1943. Retrieved from:
http://clincancerres.aacrjournals.org/content/21/8/1935.short
Cancer Genome Atlas Research Network. (2014). Comprehensive molecular profiling of lung
adenocarcinoma. Nature, 511(7511), 543-550. Retrieved from:
http://www.nature.com/nature/journal/v511/n7511/abs/nature13385.html
Ciriello, G., Cerami, E., Sander, C., & Schultz, N. (2012). Mutual exclusivity analysis identifies
oncogenic network modules. Genome research, 22(2), 398-406. Retrieved from:
http://genome.cshlp.org/content/22/2/398.short
Cline, M. S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., ... & Hanspers, K.
(2007). Integration of biological networks and gene expression data using Cytoscape. Nature
protocols, 2(10), 2366-2382. Retrieved from:
http://www.nature.com/nprot/journal/v2/n10/abs/nprot.2007.324.html
Cowin, P. A., Anglesio, M., Etemadmoghadam, D., & Bowtell, D. D. (2010). Profiling the
cancer genome. Annual review of genomics and human genetics, 11, 133-159. Retrieved
from: http://www.annualreviews.org/doi/full/10.1146/annurev-genom-082509-141536
Czubak, K., Lewandowska, M. A., Klonowska, K., Roszkowski, K., Kowalewski, J., Figlerowicz,
M., & Kozlowski, P. (2015). High copy number variation of cancer-related microRNA genes
and frequent amplification of DICER1 and DROSHA in lung cancer. Oncotarget, 6(27),
23399-23416. Retrieved from: http://europepmc.org/abstract/med/26156018
Devarakonda, S., Morgensztern, D., & Govindan, R. (2015). Genomic alterations in lung
adenocarcinoma. The Lancet Oncology, 16(7), e342-e351. Retrieved from:
http://www.sciencedirect.com/science/article/pii/S1470204515000777
Elias, M. (2012). Enhancing User Interaction with Business Intelligence Dashboards (Doctoral
dissertation, Ecole Centrale Paris). Retrieved from:
https://tel.archives-ouvertes.fr/file/index/docid/969170/filename/elias_thesis.pdf
Fernandez-Banet, J., Esposito, A., Coffin, S., Horvath, I. B., Estrella, H., Schefzick, S., ... &
Roberts, P. (2016). OASIS: web-based platform for exploring cancer multi-omics data. Nature
methods, 13(1), 9-10. Retrieved from:
http://www.nature.com/nmeth/journal/v13/n1/full/nmeth.3692.html
Few, S. (2007). Dashboard confusion revisited. Perceptual Edge, 1-6. Retrieved from:
http://perceptualedge.com/articles/03-22-07.pdf
Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., ... & Stratton, M. R.
(2004). A census of human cancer genes. Nature Reviews Cancer, 4(3), 177-183.
Retrieved from: http://www.nature.com/nrc/journal/v4/n3/full/nrc1299.html
Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... & Cerami, E.
(2013). Integrative analysis of complex cancer genomics and clinical profiles using the
cBioPortal. Science signaling, 6(269), pl1. Retrieved from:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4160307/
Gonzalez-Perez, A., & Lopez-Bigas, N. (2012). Functional impact bias reveals cancer drivers.
Nucleic acids research, gks743. Retrieved from:
http://nar.oxfordjournals.org/content/early/2012/08/14/nar.gks743.short
Isenberg, P., Zuk, T., Collins, C., & Carpendale, S. (2008, April). Grounded evaluation of
information visualizations. In Proceedings of the 2008 Workshop on BEyond time and errors:
novel evaLuation methods for Information Visualization (p. 6). ACM. Retrieved from:
http://dl.acm.org/citation.cfm?id=1377974
Jordi, B., Giordano, C., Nicolas, S., Kavitha, V., Adam, A.M., Sungjoon, K., ... & Levi, A.G.
(2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug
sensitivity. Nature, 483 (7391), 603-607. Retrieved from:
http://www.nature.com/nature/journal/v483/n7391/full/nature11003.html
Klonowska, K., Czubak, K., Wojciechowska, M., Handschuh, L., Zmienko, A., Figlerowicz,
M., ... & Kozlowski, P. (2015). Oncogenomic portals for the visualization and analysis of
genome-wide cancer data. Oncotarget. Retrieved from:
http://europepmc.org/abstract/med/26484415
Landi, L., Minuti, G., & Salvini, J. (2013). MET overexpression and gene amplification in
NSCLC: a clinical perspective. Dovepress, 4, 15-25. Retrieved from:
https://www.dovepress.com/met-overexpression-and-gene-amplification-in-nsclc-a-clinical-perspect-peer-reviewed-article-LCTT
Morse, E., Lewis, M., & Olsen, K. A. (2000). Evaluating visualizations: using a taxonomic
guide. International Journal of Human-Computer Studies, 53(5), 637-662. Retrieved from:
http://www.sciencedirect.com/science/article/pii/S1071581900904129
Olivero, M., Rizzo, M., Madeddu, R., Casadio, C., Pennacchietti, S., Nicotra, M. R., ... &
Comoglio, P. M. (1996). Overexpression and activation of hepatocyte growth factor/scatter
factor in human non-small-cell lung carcinomas. British journal of cancer, 74(12), 1862.
Retrieved from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2074802/
Prelich, G. (2012). Gene overexpression: uses, mechanisms, and interpretation. Genetics,
190(3), 841-854. Retrieved from: http://www.genetics.org/content/190/3/841
Rajaram, S., & Oono, Y. (2010). NeatMap-non-clustering heat map alternatives in R. BMC
bioinformatics, 11(1), 1. Retrieved from:
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-45
RPubs. (2015). Visualisations Vignette for Cancer Cell Lines Package. Retrieved December 2,
2015, from http://rpubs.com/chapmandu2/120228
Santarius, T., Shipley, J., Brewer, D., Stratton, M. R., & Cooper, C. S. (2010). A census of
amplified and overexpressed human cancer genes. Nature Reviews Cancer, 10(1), 59-64.
Retrieved from: http://www.nature.com/nrc/journal/v10/n1/full/nrc2771.html
Saraiya, P., North, C., & Duca, K. (2005). An insight-based methodology for evaluating
bioinformatics visualizations. Visualization and Computer Graphics, IEEE Transactions on,
11(4), 443-456. Retrieved from:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1432690&tag=1
Schroeder, M. P., Gonzalez-Perez, A., & Lopez-Bigas, N. (2013). Visualizing multidimensional
cancer genomics data. Genome Med, 5(1), 9. Retrieved from:
http://repositori.upf.edu/handle/10230/23185
Siegel, R. L., Miller, K. D., & Jemal, A. (2015). Cancer statistics, 2015. CA: a cancer journal for
clinicians, 65(1), 5-29. Retrieved from:
http://onlinelibrary.wiley.com/doi/10.3322/caac.21254/
Stephens, P. J., McBride, D. J., Lin, M. L., Varela, I., Pleasance, E. D., Simpson, J. T., ... &
Greenman, C. D. (2009). Complex landscapes of somatic rearrangement in human breast
cancer genomes. Nature, 462(7276), 1005-1010. Retrieved from:
http://www.nature.com/nature/journal/v462/n7276/full/nature08645.html
Suh, S. S., Yoo, J. Y., Cui, R., Kaur, B., Huebner, K., Lee, T. K., ... & Croce, C. M. (2014). FHIT
suppresses epithelial-mesenchymal transition (EMT) and metastasis in lung cancer through
modulation of microRNAs. PLoS Genet, 10(10), e1004652. Retrieved from:
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004652
Travis, W. D. (2002). Pathology of lung cancer. Clinics in chest medicine, 23(1), 65-81.
Retrieved from: http://www.sciencedirect.com/science/article/pii/S0272523103000613
Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... & Stuart, J. M. (2010).
Inference of patient-specific pathway activities from multi-dimensional cancer genomics
data using PARADIGM. Bioinformatics, 26(12), i237-i245. Retrieved from:
https://bioinformatics.oxfordjournals.org/content/26/12/i237.full
Weinstein, J. N. (2008). Biochemistry. A postgenomic visual icon. Science (New York, NY),
319(5871), 1772-1773. Retrieved from:
http://science.sciencemag.org/content/319/5871/1772
Wilkinson, L., & Friendly, M. (2012). The history of the cluster heat map. The American
Statistician. Retrieved from:
http://cda.psych.uiuc.edu/multivariate_fall_2013/lincoln/heatmap.pdf
Wu, Y., Liu, H., Shi, X., Yao, Y., Yang, W., & Song, Y. (2015). The long non-coding RNA HNF1A-
AS1 regulates proliferation and metastasis in lung adenocarcinoma. Oncotarget, 6(11), 9160.
Retrieved from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4496209/
Yin, T., Cook, D., & Lawrence, M. (2012). ggbio: an R package for extending the grammar of
graphics for genomic data. Genome Biol, 13(8), R77. Retrieved from:
http://download.springer.com/static/pdf/
Yin, S., Yang, J., Lin, B., Deng, W., Zhang, Y., Yi, X., ... & Zhao, G. (2014). Exome sequencing
identifies frequent mutation of MLL2 in non–small cell lung carcinoma from Chinese
patients. Scientific reports, 4. Retrieved from:
http://www.nature.com/articles/srep06036?WT.ec_id=SREP-631-20140819
9.0 Appendix
9.1 Mutated genes in lung adenocarcinoma (source: Devarakonda, Morgensztern & Govindan, 2015)
9.2 Template of the email for inviting participants for dashboard evaluation
9.3 Research Ethics
9.3.1 Ethics Application Form
9.3.2 Ethics Approval Letter
9.3.3 Consent Form
9.4 Questionnaire
9.5 Details of participants’ current profession
Information School.
Access to Dissertation
A Dissertation submitted to the University may be held by the Department (or School) within which the Dissertation was undertaken and made available for borrowing or consultation in accordance with University Regulations.
Requests for the loan of dissertations may be received from libraries in the UK and overseas. The Department may also receive requests from other organisations, as well as individuals. The conservation of the original dissertation is better assured if the Department and/or Library can fulfill such requests by sending a copy. The Department may also make your dissertation available via its web pages.
In certain cases where confidentiality of information is concerned, if either the author or the supervisor so requests, the Department will withhold the dissertation from loan or consultation for the period specified below. Where no such restriction is in force, the Department may also deposit the Dissertation in the University of Sheffield Library.
To be completed by the Author – Select (a) or (b) by placing a tick in the appropriate box
If you are willing to give permission for the Information School to make your dissertation available in these ways, please complete the following:
(a) Subject to the General Regulation on Intellectual Property, I, the author, agree to this dissertation being made immediately available through the Department and/or University Library for consultation, and for the Department and/or Library to reproduce this dissertation in whole or part in order to supply single copies for the purpose of research or private study
(b) Subject to the General Regulation on Intellectual Property, I, the author, request that this dissertation be withheld from loan, consultation or reproduction for a period of [ ] years from the date of its submission. Subsequent to this period, I agree to this dissertation being made available through the Department and/or University Library for consultation, and for the Department and/or Library to reproduce this dissertation in whole or part in order to supply single copies for the purpose of research or private study
Name Hailing Lu
Department Information School
Signed Hailing Lu Date 31/8/2016
To be completed by the Supervisor – Select (a) or (b) by placing a tick in the appropriate box
(a) I, the supervisor, agree to this dissertation being made immediately available through the Department and/or University Library for loan or consultation, subject to any special restrictions (*) agreed with external organisations as part of a collaborative project.
*Special restrictions
(b) I, the supervisor, request that this dissertation be withheld from loan, consultation or reproduction for a period of [ ] years from the date of its submission. Subsequent to this period, I agree to this dissertation being made available through the Department and/or University Library for loan or consultation, subject to any special restrictions (*) agreed with external organisations as part of a collaborative project.
Name
Department
Signed Date
THIS SHEET MUST BE SUBMITTED WITH DISSERTATIONS BY DEPARTMENTAL REQUIREMENTS.