
CAVER: Algorithms for Analyzing Dynamics of Tunnels in …caver.cz/fil/publications/TCBB.pdf · 2015. 10. 2. · The CAVER 3.0 tool described in this paper falls into this last category,

CAVER: Algorithms for Analyzing Dynamics of Tunnels in Macromolecules

Antonin Pavelka, Eva Sebestova, Barbora Kozlikova, Jan Brezovsky, Jiri Sochor and Jiri Damborsky

Abstract—The biological function of a macromolecule often requires that a small molecule or ion is transported through its structure. The transport pathway often leads through void spaces in the structure. The properties of transport pathways change significantly in time; therefore, the analysis of a trajectory from molecular dynamics rather than of a single static structure is needed for understanding the function of pathways. The identification and analysis of transport pathways are challenging because of the high complexity and diversity of macromolecular shapes, the thermal motion of their atoms, and the large number of conformations needed to properly describe the conformational space of a protein structure. In this paper, we describe the principles of the CAVER 3.0 algorithms for the identification and analysis of properties of transport pathways in both static and dynamic structures. Moreover, we introduce the improved clustering solution for finding tunnels in macromolecules, which is included in the latest CAVER 3.02 version. Voronoi diagrams are used to identify potential pathways in each snapshot of a molecular dynamics trajectory, and clustering is then used to find the correspondence between tunnels from different snapshots. Furthermore, the geometrical properties of pathways and their evolution in time are computed and visualized.

Index Terms—tunnel, pore, channel, pathway, macromolecule, molecular dynamics, CAVER, Voronoi diagram, Delaunay triangulation, average link hierarchical clustering


1 INTRODUCTION

Biological macromolecules, such as proteins and nucleic acids, play essential roles in life processes. An understanding of their structure and function is crucial for uncovering the principles of life [1], the development of new drugs [2] and applications in industry [3]. Computational analysis and visualization of macromolecules are important because of their complex structure and the high cost and limitations of laboratory analyses.

The function of a macromolecule often requires that a small molecule or ion is transported through its structure. The transport pathway usually leads through void spaces in a structure [4]. Properties of transport pathways change significantly in time; therefore, the analysis of an ensemble of structures (e.g., a molecular dynamics trajectory) rather than of a single static structure is needed for understanding tunnel function [5]–[9].

• A. Pavelka, B. Kozlikova and J. Sochor are with the Human Computer Interaction Laboratory, Department of Computer Graphics and Design, Faculty of Informatics, Botanická 68a, Brno, 602 00, Czech Republic. E-mail: [email protected] (AP); kozlikova@fi.muni.cz (BK); sochor@fi.muni.cz (JS)

• E. Sebestova, J. Brezovsky and J. Damborsky are with the Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Faculty of Science, Masaryk University, Kamenice 5/A13, Brno, 625 00, Czech Republic. E-mail: [email protected] (ES); [email protected] (JB); [email protected] (JD)

We would like to thank David Bednar (Masaryk University, Czech Republic) and Lukas Daniel (Masaryk University, Czech Republic) for testing; Jan Stourac (Masaryk University, Czech Republic) and Jaroslav Bendl (Brno University of Technology, Czech Republic) for providing the scripts for visualization of pathway dynamics in the VMD and PyMOL software; and the Laboratory of Advanced Network Technologies (Masaryk University, Czech Republic) for providing a computer for computational time measurements.

Previously, we released the CAVER 3.0 application, and in [9] we discussed its biochemical relevance and briefly described the principles of its calculations. This paper aims to provide a detailed description of the CAVER 3.0 algorithms, which has not been published before, to enable reproduction, comparison, and further development of these approaches. Furthermore, we present an improved clustering solution, which is available in CAVER 3.02 and allows the analysis of very large sets of conformations containing hundreds of thousands of tunnels.

The paper first focuses on the input structures and the basic properties of their transport pathways and briefly introduces the Voronoi diagrams. Then the related work is discussed. The core of the paper is formed by the detailed description of all steps of our CAVER approach. Next, the properties of the detected tunnels and their visualization are discussed. Finally, the limitations of our approach are discussed and the future work is outlined.

1.1 Structures of Biological Macromolecules

The majority of the experimentally determined three-dimensional structures of proteins are archived in a single repository, the Protein Data Bank (PDB) [10]. The coordinates of individual atoms can be retrieved for most of the structures stored in the PDB. Structures containing thousands of atoms are the most common, but some entries contain hundreds of thousands of atoms. The rate at which structures are added to the PDB keeps increasing: at the end of 2014, the PDB held 105,400 structures; 9,651 new structures were deposited during that year, compared with 9,381 in 2013.

Fig. 1: Tunnels $T_1$ and $T_2$ connecting buried site $A$ in macromolecule $M$ with an exterior space.

A single conformation of a molecular structure is frequently modeled as a union of balls in $\mathbb{R}^3$. The dynamical behavior of a molecular system can be represented as a time series of such unions of balls. Typically, the atoms remain the same and only their coordinates change in time. Ensembles of structures composed of hundreds of thousands of snapshots are not unusual.

1.2 Transport Pathways in Macromolecular Structures

Macromolecules can contain different types of inner pathways. The terms tunnel, channel, or pore denote the pathways in macromolecular structures that are used for the transport of small molecules and ions inside or through the structure. The terms channel and pore are most often used to describe transport pathways through biological membranes [9], [11]. The term tunnel can have two meanings. In the first, it denotes a pathway connecting two buried sites in a macromolecular structure [12] (e.g., different active sites in multi-functional enzymes). In the second, it denotes a pathway connecting a buried site (e.g., the active site of an enzyme) with the exterior solvent. The CAVER algorithms presented in this paper focus on the second type of tunnel. A simplified example involving a macromolecular structure $M$, a buried site $A$ and two tunnels $T_1$ and $T_2$ is depicted in Figure 1. The centerline of tunnel $T_1$ connects the buried site $A$ with the point $B$ on the molecular surface. For the purpose of geometry-based analysis of tunnels, we define the tunnel as follows.

Let $M$ be the set of balls representing a macromolecular system (its atoms). Then, the tunnel connecting points $A$ and $B$ is a union of so-called empty balls such that: (i) the centers of these balls form a curve connecting $A$ and $B$, (ii) the radius of each ball is the maximum possible such that the ball does not intersect any atom from $M$, and (iii) the centerline between $A$ and $B$ lies within the volumetric boundary of $M$ (e.g., the convex hull). An example of a tunnel is depicted in Figure 2. The minimum of the radii of all empty balls forming the tunnel is called the tunnel bottleneck radius. To compute these tunnels, we use the Voronoi diagram introduced in the following section.

Fig. 2: The tunnel connecting points $A$ and $B$. The dashed line represents the tunnel centerline, the blue discs denote the balls of the tunnel and the violet discs represent atoms from $M$.
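The definition above can be illustrated with a minimal Python sketch. It assumes that the centerline is already available as a list of sampled points and that atoms are given as (center, radius) pairs; the function names and the toy coordinates are hypothetical and are not taken from the CAVER implementation.

```python
import math

def empty_ball_radius(point, atoms):
    """Largest radius of a ball centered at `point` that intersects no atom.

    atoms: list of (center, radius) pairs. A non-positive result means the
    point lies on or inside an atom.
    """
    return min(math.dist(point, center) - radius for center, radius in atoms)

def tunnel_bottleneck_radius(centerline, atoms):
    """Minimum empty-ball radius over the sampled centerline points."""
    return min(empty_ball_radius(p, atoms) for p in centerline)

# Toy data (hypothetical): three atoms of radius 1.0 around a straight centerline.
atoms = [((0.0, 2.0, 0.0), 1.0), ((0.0, -2.0, 0.0), 1.0), ((3.0, 1.5, 0.0), 1.0)]
centerline = [(0.5 * i, 0.0, 0.0) for i in range(9)]  # x = 0.0, 0.5, ..., 4.0

bottleneck = tunnel_bottleneck_radius(centerline, atoms)
```

Because the centerline is only sampled, the value is an estimate that depends on the sampling density.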

1.3 Voronoi Diagrams

Let the atoms be represented by $M = \{M_1, \ldots, M_k\}$, a set of balls $M_i = (C_i, r_i)$ in $\mathbb{R}^3$ with centers $C_i$ and radii $r_i$. We assume that no ball is completely contained inside another ball, even though intersections between balls are allowed. Then, the signed distance of a point $X \in \mathbb{R}^3$ to a ball $M_i$ is defined as $d(X, M_i) = \mathrm{dist}(X, C_i) - r_i$, where $\mathrm{dist}$ is the Euclidean distance. The Voronoi region for a ball $M_i$ is the set of points

$$VR_i = \{X \in \mathbb{R}^3 \mid \forall M_j \in M,\ j \neq i : d(X, M_i) \le d(X, M_j)\}.$$

The additively weighted Voronoi diagram (AWVD) for $M$ is the set of Voronoi regions $\{VR_1, \ldots, VR_k\}$. Furthermore, a point shared by four or more Voronoi regions is called a Voronoi vertex and a curve shared by three Voronoi regions is called a Voronoi edge [13]. If all the balls have the same radius, we speak about the ordinary Voronoi diagram (VD), or Voronoi diagram of points, which are in this case the centers of the balls. More information about the properties of VDs and algorithms for their construction can be found in [13]–[16].
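To make the difference between the AWVD and the ordinary VD concrete, the following Python sketch evaluates the signed distance and reports which Voronoi region contains a query point. It is a brute-force illustration of the definitions only; the function names and the two-ball example are assumptions made for this sketch.

```python
import math

def signed_distance(x, ball):
    """d(X, M_i) = dist(X, C_i) - r_i for a ball given as (center, radius)."""
    center, radius = ball
    return math.dist(x, center) - radius

def voronoi_region_index(x, balls):
    """Index i of a Voronoi region VR_i containing x (ties go to the lowest index)."""
    return min(range(len(balls)), key=lambda i: signed_distance(x, balls[i]))

# Two balls of different radii (hypothetical values).
weighted = [((0.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), 2.0)]
# The same centers with equal radii, i.e. the ordinary VD of the centers.
ordinary = [((0.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), 1.0)]

x = (1.8, 0.0, 0.0)
region_weighted = voronoi_region_index(x, weighted)
region_ordinary = voronoi_region_index(x, ordinary)
```

The query point is closer to the center of the first ball but closer to the surface of the second, larger ball, so the two diagrams assign it to different regions.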

1.4 Related Work

In this section we present several existing techniques for tunnel computation.

Geometry-based tunnel identification algorithms can be used for several different purposes. First, a single pathway going through the whole structure and connecting two surface sites can be analyzed. For this purpose, the tools HOLE [17] and PoreWalker [18] were developed.

Second, the aim can be to identify all putative pathways in a structure. This type of analysis can be performed by Chunnel [19], which performs topological analysis of the structure using a grid, or by the tool presented by Lindow et al. [20], which uses the AWVD. The latter tool simplifies the graph corresponding to the AWVD by applying a series of filters. The paths connecting two user-specified points can then be computed on the filtered graph. The utilization of the AWVD for the analysis of biomolecules was also proposed in [13], [21], [22] and implemented in the tools BetaTunnel, BetaVoid and BetaMol. Another implementation of the AWVD is the awVoronoi project, which follows the approach described in [23].

The third possible usage of the geometry-based algorithms is the detection of pathways connecting a user-specified buried site with an exterior space. The CAVER 3.0 tool described in this paper falls into this last category, along with CAVER 1.0 [4], [24], CAVER 2.0 [25], MOLE 1.2 [26], MOLE 2.0 [27] and MolAxis [28]. CAVER 1.0 finds tunnels using a grid, while CAVER 2.0, MOLE 1.2 and MOLE 2.0 use the VD of atom centers. MolAxis models large atoms by multiple balls for a better approximation of the AWVD of atoms of different radii. However, only a few tools and algorithmic approaches provide means to analyze these pathways in dynamical structures. In the rest of this section, we therefore describe the existing solutions to this task.

MOLE 1.2 [26] offers the functionality most similar to the work presented in this paper. It uses clustering to find the correspondence between tunnels from different conformations. The similarity of tunnels is computed by comparing the sets of atoms lining the tunnels, but no information about the clustering algorithm is available. Experiments revealed that the clustering depends on the ordering of the tunnels [9]. The clustering of tunnels performed by MOLE 1.2 was not able to clearly separate tunnels into clusters corresponding to known transport pathways [9]. Furthermore, the identification of tunnels in MOLE 1.2 is based on the assumption that the differences in the radii of different atoms are negligible and uses the ordinary VD of atom centers, which can lead to the underestimation of the tunnel bottleneck radius by as much as 50% of its actual value [9].

The choice of a suitable algorithm for the clustering of tunnels was investigated by Benes et al. [29]. The geometrical similarity of two tunnels was computed using a distance function derived from the Hausdorff distance. However, their clustering experiments were performed on hundreds of molecular conformations, which is too few with respect to the length of current simulations, which reach tens of thousands of conformations. Tracing the shape of a selected pathway from a single snapshot through time has also been described by Benes et al. [30]. Their method searches in the closest neighborhood of a single fixed pathway. However, it overlooks the second closest tunnel, which can be nearly identical to the initial tunnel but wider than the tunnel actually found in a given snapshot. A completely different approach to geometry-based tunnel detection in dynamical structures was proposed by Benes et al. in 2011 [31]. The pathways are assembled from individual cavities that were detected in consecutive snapshots and overlap geometrically. A similar principle was also used to analyze and visualize the dynamics of pathways and cavities by Lindow et al. [32], [33].

2 ALGORITHMS

This section describes the algorithms implemented in the CAVER 3.02 tool for tunnel discovery and analysis. The essential inputs are a single structure or an ensemble of its conformations and the position of a buried site that should be connected with an exterior space via the detected tunnels. The workflow of tunnel discovery first constructs the Voronoi diagram, which is used for the identification of all tunnels fulfilling the input parameters. These tunnels are then postprocessed and clustered in order to study their dynamics. The outputs of the process consist of the geometry of the detected tunnels, their visualization, and their properties. The individual steps of the workflow are described in the following sections.

2.1 Construction of the Voronoi Diagram

This section first focuses on the construction of the approximate AWVD using the ordinary VD. Within this process a special case can occur, which is described as well. Then the derivation of the VD from the Delaunay triangulation is discussed.

2.1.1 Approximation of the Additively Weighted Voronoi Diagram

For the representation of a macromolecule, the AWVD is the most appropriate structure. However, implementations of the ordinary VD for points are more readily available, which led us to approximate the AWVD by the ordinary VD.

Using the ordinary VD of atom centers directly leads to a considerable error. To take into account that atoms have different radii and to limit this error, we identify the smallest atom, with radius $r$, and approximate all larger atoms by multiple balls of radius $r$.

All such atoms are approximated by 13 balls of radius $r$. An icosahedron is placed at the center of the atom, 12 balls are centered at the vertices of the icosahedron and one ball is placed at the center of the atom. The size of the icosahedron is the maximum such that all 12 balls lie inside the atom. The result for atoms commonly present in biomolecules is shown in Figure 3. The idea was inspired by the approximation approach used by MolAxis, which is described in detail by Yaffe [34].

Fig. 3: Atoms approximated by balls of equal radius. The first line depicts atoms of hydrogen, oxygen, nitrogen, carbon, sulfur and phosphorus. The second line shows their approximated representations by balls of radii corresponding to the hydrogen atom. The third line shows the approximation of the same atoms by balls with the radius of oxygen, which is the smallest atom in many available structures that do not include information about hydrogen atoms.
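The construction can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the icosahedron is kept in a fixed canonical orientation, atoms not larger than $r$ are kept as a single ball, and the function names are hypothetical rather than CAVER identifiers.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio

def icosahedron_directions():
    """Unit vectors pointing to the 12 vertices of a regular icosahedron."""
    raw = []
    for a in (-1.0, 1.0):
        for b in (-PHI, PHI):
            # cyclic permutations of (0, +-1, +-phi)
            raw += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    norm = math.sqrt(1.0 + PHI * PHI)
    return [(x / norm, y / norm, z / norm) for x, y, z in raw]

def approximate_atom(center, atom_radius, r):
    """Approximate one atom by balls of radius r (r = radius of the smallest atom)."""
    if atom_radius <= r:
        return [(center, r)]
    # Largest icosahedron whose vertex balls still lie inside the atom:
    # the vertex balls touch the atom surface from the inside.
    offset = atom_radius - r
    balls = [(center, r)]  # one ball at the atom center
    for dx, dy, dz in icosahedron_directions():
        position = (center[0] + offset * dx,
                    center[1] + offset * dy,
                    center[2] + offset * dz)
        balls.append((position, r))
    return balls

# Sulfur-sized atom (1.8) approximated by hydrogen-sized balls (1.2), values from the text.
balls = approximate_atom((0.0, 0.0, 0.0), 1.8, 1.2)
```

Each atom larger than $r$ is replaced by 13 generators, which is the source of the higher time and memory cost compared with a VD of atom centers alone.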

The maximum difference between the surface of an atom with a radius of 1.8 Å (van der Waals radius of sulfur) and its approximation by balls with a radius of 1.2 Å (van der Waals radius of hydrogen) is identical for the icosahedron and the dodecahedron. The difference is smaller than 0.18 Å. As the construction of the Voronoi diagram takes almost half the time with the icosahedron, we decided to use this Platonic solid.

The advantages of our approach in comparison with approaches based on corrections of the ordinary VD by shifting its planes [35] are its simplicity and the fact that the difference between the exact and the approximated atom surface is limited by a small constant. On the other hand, representing an atom by several balls results in higher computational time and memory costs. For a structure of the usual size of 7,000 atoms, the computation took 6 seconds and required 600 MB of RAM. For what is probably the largest available structure with a biological tunnel, the human ribosome containing over 230,000 atoms, the computation took 20 minutes and required 28 GB of RAM (tested on an Intel Core i7-4960X at 3.60 GHz).

2.1.2 Special Case Precautions

When constructing the Voronoi diagram, usually no five centers of balls lie on a common empty sphere. Then each Voronoi vertex is connected by Voronoi edges to at most four other Voronoi vertices (see Figure 4; in 2D, the vertices have three neighbors). Otherwise, a special case may occur and a Voronoi vertex can belong to more than four Voronoi edges. Due to the limited accuracy of floating point numbers, even inexact special cases can be problematic. Some special case occurred in nearly every structure. Because dealing with special cases complicates both the algorithms for Voronoi diagram construction and the data structures for its representation, we implemented several precautions, discussed below, to make the occurrence of the special case practically negligible. After these precautions were implemented, no special case was detected in the test set of hundreds of structures.

Fig. 4: a) Illustration of the 2D special case of a Voronoi vertex with four neighbors. b) If the disc $M_i$ representing an atom is moved so that its center is no longer co-circular with the centers of the other atoms, two vertices with three neighbors each replace the one with four neighbors.

The first precaution is the placement of a ball at the center of each icosahedron, which removes the special Voronoi vertex at this center. The second one is choosing a pseudorandom rotation of each Platonic solid. This precaution also makes the error of the approximation independent of the choice of the coordinate system. As the third and most important precaution, the coordinates of each ball are changed by a small value, smaller than 0.001 Å (the coordinates in PDB files contain at most three digits after the decimal point). When computing VDs for molecular dynamics, the seed for generating the pseudorandom numbers is deterministically derived from the PDB identifier of each structure.

2.1.3 Triangulation

As a result of the previous step, we obtain a set of balls of equal radii as an input for VD construction. The Quickhull algorithm [36] is used to construct the Delaunay triangulation (DT). The VD is then constructed by exploiting its duality to the DT. Voronoi vertices are constructed for each tetrahedron as the centers of spheres circumscribed to the tetrahedra (see Figure 5). Every two Voronoi vertices from the neighboring tetrahedra are connected by a Voronoi edge. Each edge is represented by its vertices and by the empty ball that is chosen arbitrarily as one of the four balls nearest to the edge (e.g., the edge $E_1$ and the disc $M_i$ in Figure 5). This empty ball will be later used to estimate how far the points on the edge are from the neighboring atoms.
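The dual step for a single tetrahedron can be sketched in Python as below. The sketch does not build the Delaunay triangulation itself (the paper uses Quickhull for that); it only computes the circumscribed-sphere center of one tetrahedron of ball centers by solving a 3x3 linear system with Cramer's rule. The function names are hypothetical, and the radius of the empty ball at the vertex follows from the equal-radius assumption of the previous step.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def circumcenter(p0, p1, p2, p3):
    """Center of the sphere through four non-coplanar points.

    Solves 2 (p_i - p_0) . X = |p_i|^2 - |p_0|^2 for i = 1, 2, 3.
    """
    others = (p1, p2, p3)
    rows = [[2.0 * (p[k] - p0[k]) for k in range(3)] for p in others]
    rhs = [sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)) for p in others]
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("degenerate (nearly coplanar) tetrahedron")
    center = []
    for col in range(3):
        m = [row[:] for row in rows]
        for i in range(3):
            m[i][col] = rhs[i]  # Cramer's rule: replace one column by the rhs
        center.append(det3(m) / d)
    return tuple(center)

def voronoi_vertex(tetrahedron, ball_radius):
    """Voronoi vertex of a Delaunay tetrahedron and its maximal empty-ball radius."""
    vertex = circumcenter(*tetrahedron)
    return vertex, math.dist(vertex, tetrahedron[0]) - ball_radius

tetra = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0))
vertex, q = voronoi_vertex(tetra, 1.2)
```

In a full implementation this function would be applied to every tetrahedron of the triangulation, and vertices of tetrahedra sharing a face would be linked by Voronoi edges.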

Fig. 5: Construction of the Voronoi diagram from the Delaunay triangulation. The dashed circle is circumscribed to the bold triangle and the Voronoi vertex $V_1$ forms the center of this circle. $E_1$ is an example of a Voronoi edge. The violet discs represent atoms.

2.2 Geometrical Properties of a Voronoi Edge

Tunnels are identified as the cheapest paths in a VD graph. The cost of a path is defined as the sum of the costs of its edges. Let $AB$ be the Voronoi edge connecting Voronoi vertices $A$ and $B$. The cost of the edge $AB$ is defined as

$$\mathrm{cost}(AB) = \int_{0}^{\mathrm{dist}(AB)} r(l)^{-z}\,\mathrm{d}l. \qquad (1)$$

The value $r(l)$ is the maximum radius of the ball of the tunnel that does not collide with atoms and that is centered on $AB$ at the distance $l$ from $A$. The parameter $z$ is a non-negative real number that allows users to choose the desired geometrical properties of tunnels. When $z$ is set to 0 for all edges on the path, only the length of the path is taken into account and the shortest path is reported first. On the other hand, when $z$ is set to its maximum allowed value of 100, the length of the path becomes practically negligible in comparison with $r(l)$, i.e., wider paths are preferred over short ones. By default, the value is set empirically to $z = 2$ to give priority to edges forming paths that are both wide and short [4], [24], [26], [28].

The integral is evaluated numerically using the trapezoidal rule with a uniform grid. During the process, the minimum $r(l)$ is stored for each edge and later used as an estimate of the maximum radius of the empty ball that can travel along the edge without intersecting any atom. This minimum is called the edge bottleneck radius.
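A minimal Python sketch of this evaluation is given below. It assumes a straight edge, a fixed number of uniform samples, and that $r(l)$ is measured against the single ball stored with the edge; the function name, the `samples` parameter and the example coordinates are hypothetical.

```python
import math

def edge_cost_and_bottleneck(a, b, ball, z=2.0, samples=8):
    """Trapezoidal estimate of the integral of r(l)^(-z) along the edge AB.

    ball: (center, radius) of one generating ball of the edge, used to
    measure r(l). Returns (cost, edge bottleneck radius).
    """
    center, ball_radius = ball
    length = math.dist(a, b)
    radii = []
    for i in range(samples + 1):
        t = i / samples
        point = tuple(a[k] + t * (b[k] - a[k]) for k in range(3))
        radii.append(math.dist(point, center) - ball_radius)
    bottleneck = min(radii)
    if bottleneck <= 0.0:
        return math.inf, bottleneck  # the edge touches or enters an atom
    values = [rad ** (-z) for rad in radii]
    step = length / samples
    cost = step * (0.5 * values[0] + sum(values[1:-1]) + 0.5 * values[-1])
    return cost, bottleneck

a, b = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
ball = ((1.0, 3.0, 0.0), 1.0)

cost_length_only, bottleneck = edge_cost_and_bottleneck(a, b, ball, z=0.0)
cost_default, _ = edge_cost_and_bottleneck(a, b, ball)  # z = 2
```

With $z = 0$ the integrand is constant and the cost reduces to the edge length, which matches the behavior described above for the shortest-path setting.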

2.3 Finding a Starting Vertex

Most often, tunnels lead from or to the site where a chemical reaction can occur. Atoms are usually less densely packed at this site, which allows a better fit of the reactants. Users can specify the position of this reaction site. However, the specified position often does not match the ideal position, which is both in the vicinity of the user-specified site and as far from the protein atoms as possible. In other words, the ideal position corresponds to the center of the largest possible sphere that fits into the void space containing the user-defined position. Starting from a position that is not ideal can lead to the underestimation of the radius of the tunnel at its start. To detect the ideal position, we find the Voronoi vertex that fulfills the two requirements stated above. This vertex is then used as the first point of the tunnel centerline. The following algorithm is used to find this point.

As an input of the algorithm, the coordinates of the point $S_{initial}$ have to be provided by the user, either directly or in the form of a set of atoms. In the latter case, $S_{initial}$ is computed in each conformation separately as the centroid of the centers of the specified atoms.

Then let $q_i$ be the maximal radius of the ball centered at a Voronoi vertex $V_i$ such that the ball does not intersect any atom. The starting vertex $S_{start}$ is identified as the Voronoi vertex $V_i$ such that $\mathrm{dist}(V_i, S_{initial})$ is smaller than the user-defined parameter $d$ and $q_i$ is larger than another user-defined value; if several such vertices exist, the one closest to $S_{initial}$ is chosen. If no such vertex exists, which can happen when the user sets too small a value of $d$, the same criteria are applied with $d = d_{default}$. If even then no such vertex exists, the user-defined parameters are not used at all and the vertex closest to $S_{initial}$ is chosen as the starting vertex $S_{start}$.
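The selection rule with its two fallbacks can be sketched in Python as follows. The sketch assumes that each Voronoi vertex is available as a (position, $q_i$) pair; the parameter names `d`, `min_radius` and `d_default` are illustrative and are not the names of CAVER settings.

```python
import math

def find_starting_vertex(vertices, s_initial, d, min_radius, d_default):
    """Pick the starting vertex among `vertices`, a list of (position, q) pairs."""
    def closest(candidates):
        return min(candidates, key=lambda v: math.dist(v[0], s_initial))

    # First try the user-defined distance d, then the default distance.
    for max_dist in (d, d_default):
        candidates = [v for v in vertices
                      if math.dist(v[0], s_initial) < max_dist and v[1] > min_radius]
        if candidates:
            return closest(candidates)
    # Last resort: ignore both parameters and take the nearest vertex.
    return closest(vertices)

# Hypothetical vertices: (position, maximal empty-ball radius q)
vertices = [((0.5, 0.0, 0.0), 0.4), ((1.5, 0.0, 0.0), 1.2), ((5.0, 0.0, 0.0), 2.0)]
s_initial = (0.0, 0.0, 0.0)

start_default_used = find_starting_vertex(vertices, s_initial, d=1.0,
                                          min_radius=1.0, d_default=2.0)
start_nearest = find_starting_vertex(vertices, s_initial, d=1.0,
                                     min_radius=5.0, d_default=2.0)
```

In the first call no vertex within `d` is wide enough, so the default distance is used; in the second call no vertex satisfies the radius requirement at all, so the nearest vertex is returned.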

2.4 Vertices Stopping Search

The geometrical identification of transport pathways ceases to be meaningful at the interface between the macromolecule and the exterior solvent, where the geometrical constraint of space by atoms is no longer the major factor limiting the movement of small molecules. In this section, several sets of Voronoi vertices are defined, which are used to avoid the detection of tunnels in space that is open to the exterior solvent.

First, Voronoi vertices with at most three neighbors in 3D, hereafter referred to as border vertices, are identified. For example, every vertex except for $V_2$ in Figure 5 (2D example) is a border vertex.

The set of vertices entirely excluded from tunnel computation is called outer vertices (see the vertices marked with triangles in Figure 7). They are identified as those that can be reached by a spherical probe of radius $r_S$ (default value 3 Å) from border vertices. This is achieved by a depth-first search performed on the graph composed of the vertices and edges of the VD, starting from the set of border vertices.
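The search can be sketched in Python as shown below. The sketch assumes that the VD graph is stored as an adjacency dictionary and that an edge bottleneck radius is known for every edge; whether an edge whose radius exactly equals $r_S$ is passable is an assumption of this sketch, as are all names.

```python
def find_outer_vertices(neighbors, edge_radius, probe_radius=3.0):
    """Vertices reachable from border vertices by a probe of radius `probe_radius`.

    neighbors: dict vertex -> list of adjacent vertices
    edge_radius: dict frozenset({u, v}) -> edge bottleneck radius
    """
    # Border vertices have at most three neighbors in 3D.
    border = [v for v, adjacent in neighbors.items() if len(adjacent) <= 3]
    outer = set(border)
    stack = list(border)
    while stack:  # iterative depth-first search
        v = stack.pop()
        for w in neighbors[v]:
            if w not in outer and edge_radius[frozenset((v, w))] >= probe_radius:
                outer.add(w)
                stack.append(w)
    return outer

# Toy graph (hypothetical): "c" and "i" have four neighbors, the rest are border vertices.
neighbors = {
    "a1": ["c"], "a2": ["c"], "a3": ["c"],
    "c": ["a1", "a2", "a3", "i"],
    "i": ["c", "b1", "b2", "b3"],
    "b1": ["i"], "b2": ["i"], "b3": ["i"],
}
edge_radius = {frozenset((u, w)): 1.0 for u in neighbors for w in neighbors[u]}
edge_radius[frozenset(("a1", "c"))] = 3.5  # the only edge wide enough for the probe

outer = find_outer_vertices(neighbors, edge_radius)
```

Vertex "c" becomes outer because the probe can enter through the wide edge from "a1", while "i" stays excluded from the outer set since all edges leading to it are too narrow.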

The macromolecular surface contains many clefts that are open to the exterior space but too narrow to be reached by the probe of radius $r_S$. Consider the situation depicted in Figure 6. The tunnel centerline $T_4$ should not be considered a valid tunnel because it goes through a cleft that is open to the exterior space. Furthermore, if the identification of this type of tunnel were not prevented, such tunnels would prevail in the results, making it difficult to separate them from the relevant tunnels $T_1$, $T_2$, and $T_3$.

Fig. 6: Tunnels close to the surface. The tunnel centerline $T_4$ should not be considered a valid tunnel, in contrast to the centerlines $T_1$, $T_2$, and $T_3$, which all begin at $S_{start}$.

This motivates the introduction of a new set of vertices called the shallow vertices (see the squared vertices in Figure 7), which is an extension of the set of outer vertices. The construction of the set of shallow vertices is described by Algorithm 1 and in the remainder of this paragraph. The neighborhood of each outer vertex $V$ is examined, and an empty ball of the maximum radius such that it does not intersect any atom is centered at $V$. The ball radius is then increased by the user-specified value depth (default 4 Å). Then, the space inside the enlarged ball is explored as follows. Each vertex that can be reached from $V$ by a spherical probe of radius $r_B$ (default 0.9 Å) is added to the set of shallow vertices.
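The prose description can be turned into the following Python sketch. It is a reconstruction from the paragraph above and not a transcription of Algorithm 1: it assumes that "inside the enlarged ball" is tested on vertex positions, that edges with a bottleneck radius of at least $r_B$ are passable, and that the graph is stored in the hypothetical dictionaries shown.

```python
import math

def find_shallow_vertices(positions, vertex_radius, neighbors, edge_radius,
                          outer, depth=4.0, r_b=0.9):
    """Extend the set of outer vertices to the set of shallow vertices.

    positions: dict vertex -> (x, y, z)
    vertex_radius: dict vertex -> maximal empty-ball radius at the vertex
    neighbors: dict vertex -> list of adjacent vertices
    edge_radius: dict frozenset({u, v}) -> edge bottleneck radius
    """
    shallow = set(outer)
    for v in outer:
        limit = vertex_radius[v] + depth  # radius of the enlarged ball
        seen = {v}
        stack = [v]
        while stack:
            u = stack.pop()
            for w in neighbors[u]:
                if w in seen:
                    continue
                if edge_radius[frozenset((u, w))] < r_b:
                    continue  # too narrow for the probe
                if math.dist(positions[v], positions[w]) > limit:
                    continue  # outside the enlarged ball
                seen.add(w)
                stack.append(w)
        shallow |= seen
    return shallow

# Toy data (hypothetical): one outer vertex "o" with an empty ball of radius 3.
positions = {"o": (0.0, 0.0, 0.0), "s1": (5.0, 0.0, 0.0),
             "s2": (8.0, 0.0, 0.0), "n": (0.0, 2.0, 0.0)}
vertex_radius = {"o": 3.0, "s1": 1.0, "s2": 1.0, "n": 0.6}
neighbors = {"o": ["s1", "n"], "s1": ["o", "s2"], "s2": ["s1"], "n": ["o"]}
edge_radius = {frozenset(("o", "s1")): 1.0, frozenset(("s1", "s2")): 1.0,
               frozenset(("o", "n")): 0.5}

shallow = find_shallow_vertices(positions, vertex_radius, neighbors, edge_radius, {"o"})
```

Here "s1" is added because it lies within 3 + 4 = 7 of "o" and is reachable by the probe, "s2" lies outside the enlarged ball, and "n" is cut off by an edge that is too narrow.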

Each vertex that is neither outer nor shallow is called an inner vertex. The inner vertices (see the vertices represented by circles in Figure 7) are located in the interior of the macromolecule, while the shallow vertices are closer to the exterior.

2.5 Tunnel Identification

Now we can proceed with the identification of tunnels in a single structure. Only Voronoi edges with a bottleneck radius greater than or equal to the user-specified minimal tunnel bottleneck radius $r_B$ (default 0.9 Å) are used. We will use two graphs, $(V, E)$ and $(W, F)$, where $V$ and $W$ are sets of vertices and $E$ and $F$ are sets of edges. All four sets are defined below and illustrated in Figure 8.

V is a set of vertices that are either inner vertices or connected to an inner vertex. E represents a set of edges, where each edge connects two vertices of V where at least one is an inner vertex. W stands for a set of vertices that are either shallow but not outer, or outer and connected to a shallow vertex. Finally, F is a set of edges, where each edge connects two vertices of W where at least one is not an outer vertex.

The shallow vertex connected by a Voronoi edge to at least one inner vertex is called shallow boundary vertex (double squared vertices in Figure 8). The outer vertex connected by a Voronoi edge to at least one shallow vertex is called outer boundary vertex (double triangled vertices in Figure 8).

Fig. 7: An example of a run of Algorithm 1 showing different types of vertices. The dashed curved line represents the molecular surface, atoms are displayed as violet discs, full black lines represent Voronoi edges of radius above a user-specified threshold value, dashed straight lines represent edges below the threshold, and Voronoi vertices are denoted as dots (inner vertices), squares (shallow vertices) and triangles (outer vertices).

The tunnels are then identified by a two-step procedure. In the first step, the Dijkstra algorithm is used to find the lowest cost path (edges valued using equation 1) in graph (V, E) from a starting vertex Sstart to every reachable shallow boundary vertex. In the second step, each path H is extended separately – the Dijkstra algorithm is used to find the single lowest cost path in the graph (W, F) connecting the last vertex of H with an outer boundary vertex. Then, only if an outer boundary vertex was reached, the path is considered to be a centerline of a tunnel.
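
As an illustration only, the two-step search can be sketched in Python as follows. The sketch assumes that the edge costs of equation 1 have already been computed and stored in adjacency dictionaries; the names `adj_ve`, `adj_wf`, `dijkstra` and `find_centerlines` are hypothetical and do not come from the CAVER source code.

```python
import heapq

def dijkstra(adj, source):
    """Lowest cost paths from source; adj maps vertex -> [(neighbor, edge_cost)]."""
    cost = {source: 0.0}
    pred = {source: None}
    heap = [(0.0, source)]
    while heap:
        c, u = heapq.heappop(heap)
        if c > cost[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            if c + w < cost.get(v, float("inf")):
                cost[v] = c + w
                pred[v] = u
                heapq.heappush(heap, (c + w, v))
    return cost, pred

def path_to(pred, target):
    """Rebuild the path ending at target from the predecessor map."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]

def find_centerlines(adj_ve, adj_wf, start, shallow_boundary, outer_boundary):
    """Step 1 in graph (V, E), step 2 in graph (W, F); returns (cost, path) pairs."""
    cost1, pred1 = dijkstra(adj_ve, start)
    centerlines = []
    for sb in shallow_boundary:
        if sb not in cost1:
            continue  # not reachable from the starting vertex
        # Step 2: extend the path by the single cheapest route to the surface.
        cost2, pred2 = dijkstra(adj_wf, sb)
        exits = [o for o in outer_boundary if o in cost2]
        if not exits:
            continue  # no outer boundary vertex reached, so no tunnel
        best = min(exits, key=lambda o: cost2[o])
        path = path_to(pred1, sb) + path_to(pred2, best)[1:]
        centerlines.append((cost1[sb] + cost2[best], path))
    return sorted(centerlines)
```

Running a separate search in (W, F) for each shallow boundary vertex keeps the sketch short; a production implementation would likely bound or share these searches.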

The second step prevents reporting tunnel variants that branched from the tunnel of lower cost near to the protein surface. This can be viewed as an elaboration of the approach used in MolAxis [28], where the branching of tunnels beyond the sphere centered at the starting point is prohibited. The shapes of many structures are not spherical and the starting point is often far from the protein center. Thus there are many structures where users have no choice but to either miss the variants on one side of the structure or to be overwhelmed by many nearly identical tunnels on the other side, which are closer to the starting point. This is solved by our approach as it prohibits branching beyond the surface that follows the shape of the analyzed structure.

The Dijkstra algorithm cannot identify a path that joins a cheaper path before both paths reach shallow vertices. This so-called overshadowing problem has been previously described in [28]. Such paths can be identified by manually increasing the depth parameter, but in the future, it would be desirable to develop a sufficiently efficient algorithm (with the execution time ranging from seconds to a few minutes per structure) for the fully automated identification of all paths that are significantly dissimilar.

Algorithm 1: Identify Shallow Vertices

g – number of vertices in Voronoi diagram
Outer – a set of indices of outer vertices
V[g][4] – an array of integers, V[a][b] is the index of the b-th neighboring vertex of the a-th vertex
Redge[g][4] – an array of floating point numbers, Redge[a][b] is the bottleneck radius of the edge connecting vertices with indices a and V[a][b]
Rvertex[g] – an array of vertex radii, Rvertex[i] is the radius of the maximum empty ball centered at the vertex i
depth – a parameter limiting the distance of a shallow vertex from a maximum empty ball centered at an outer vertex (default 4 Å)
rB – minimal tunnel bottleneck radius (default 0.9 Å)
Shallow[g] – an output array, Shallow[i] = true ⇐⇒ the i-th vertex is a shallow vertex

 1: for i from 0 to g − 1 do
 2:     Shallow[i] = false
 3: end for
 4: for v ∈ Outer do
 5:     Q ← empty FIFO queue
 6:     Q.enqueue(v)
 7:     while Q not empty do
 8:         w ← Q.dequeue()
 9:         Shallow[w] = true
10:         for i ∈ {0, 1, 2, 3} do
11:             x ← V[w][i]
12:             if Shallow[x] = false and dist(v, x) < Rvertex[v] + depth − rB and rB ≤ Redge[w][i] then
13:                 Q.enqueue(x)
14:             end if
15:         end for
16:     end while
17: end for
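
For readers who prefer executable code, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the CAVER implementation: the function name `identify_shallow`, the argument names, and the explicit `coords` array of vertex coordinates are assumptions made for this example. Neighbors are read from the dequeued vertex w, which is the vertex whose edges are tested.

```python
from collections import deque
import math

def identify_shallow(outer, neighbors, r_edge, r_vertex, coords,
                     depth=4.0, r_b=0.9):
    """Breadth-first search from every outer vertex (Algorithm 1).

    neighbors[a][b] is the index of the b-th neighbor of vertex a and
    r_edge[a][b] is the bottleneck radius of the edge leading to it.
    """
    shallow = [False] * len(neighbors)
    for v in outer:
        # Vertices further than this from v lie outside the enlarged ball.
        reach = r_vertex[v] + depth - r_b
        queue = deque([v])
        while queue:
            w = queue.popleft()
            shallow[w] = True
            for i, x in enumerate(neighbors[w]):
                if (not shallow[x]
                        and math.dist(coords[v], coords[x]) < reach
                        and r_b <= r_edge[w][i]):
                    queue.append(x)
    return shallow
```

As in the pseudocode, a vertex may be queued more than once before it is marked; this costs a little time but does not change the result.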

2.6 Tunnel Postprocessing

Each tunnel is now represented by a polyline composed of Voronoi edges, which is called the centerline. The centerline is transformed into a sequence of empty balls called the profile balls using the following procedure. Points are placed on the centerline at regular intervals and the maximal empty balls are placed at each point. Finally, the empty balls from the end of the tunnel are removed one by one until an empty ball with a radius smaller than or equal to rS (the radius of the probe for determining outer vertices) is reached. Thus, the center of the last empty ball of each tunnel lies at a practically negligible distance from the solvent accessible (Lee-Richards) surface of a protein structure determined by the probe of radius rS [37], [38].

Fig. 8: Illustration of graphs (V, E) and (W, F) used in the tunnel search.
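
The trimming of the profile can be illustrated by a short Python sketch. It assumes that the radii of the profile balls are already available in order from the tunnel start to its end; the function name `trim_profile` is hypothetical.

```python
def trim_profile(radii, r_s):
    """Drop balls from the tunnel end until one with radius <= r_s remains last."""
    balls = list(radii)
    while balls and balls[-1] > r_s:
        balls.pop()
    return balls
```

For example, with a probe radius of 3.0 the profile `[1.0, 1.5, 2.0, 2.9, 3.5, 6.0]` is shortened to its first four balls.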

2.7 Similarity of Tunnels

Two tunnels are considered to be similar if at least some portions of them lead through the same regions of the structure. Our measure of similarity of two tunnels is based on the Euclidean distance between pairs of points derived from the centerlines of the tunnels. This measure is used for two reasons. The first reason is to remove nearly identical tunnels within one conformation. The second is to find the correspondence between tunnels from different conformations by clustering (see Figure 9). This can be very time consuming because the evaluation of as many as n · (n − 1)/2 distances can be required during the clustering phase, where n is the number of all tunnels in all conformations. Therefore, the distance function should be calculated efficiently. It is also beneficial if the distance is a metric, because the metric properties can potentially be used to accelerate the clustering process. A distance function meeting both these requirements, as well as the process of identification of points capturing the tunnel centerline geometry and the algorithm for the evaluation of the metric of tunnel similarity, is described in the following sections.

2.7.1 Representation of Tunnel Geometry

In order to evaluate the similarity between two tunnels efficiently, the centerline geometry of each tunnel is characterized by h representative points (default h = 10). The distances between corresponding points are measured, i.e., when comparing tunnels T and U, the Euclidean distances dist(AT1, AU1), dist(AT2, AU2), . . . , dist(ATh, AUh) are evaluated (see Figure 10). The rest of this section describes how the representative points are computed; the computation of tunnel similarity from these points is described in the following subsection.

Fig. 9: Five top-ranking clusters of tunnels identified in 20,000 snapshots of the molecular dynamics simulation of haloalkane dehalogenase DhaA. Tunnels are displayed by lines representing their axes.

Fig. 10: Computation of tunnel similarity. The polylines represent tunnel axes, and Saverage is the averaged starting point. The aggregated points X1, X2 were used for the derivation of the radius rT and the points X2, X3 for the derivation of the radius rU. The values d1, d2, d3 are the distances between the points AT1 and AU1, AT2 and AU2, AT3 and AU3, which are used to compute the similarity of axes of tunnels T and U. KT0, KT1, KT2, KT3 are spheres, where KT0 with zero radius degenerates to the point Saverage.

Even a small change in position of surface atoms can cause significant changes in the solvent accessible surface and thus the length of a tunnel. To reduce the effect of these changes on the tunnel length when evaluating the tunnel-tunnel similarity, a sphere is assigned to each tunnel T. The tunnel will later be considered to end at the surface of this sphere for the purpose of tunnel similarity estimation. The center of this sphere is positioned at the centroid Saverage of the starting vertices from all conformations. The radius of the sphere is derived by averaging the distances between Saverage and the tunnel ends in the vicinity of the end of tunnel T. For this purpose, we designed Algorithm 2, which is also described in the next paragraph.

First, a set of tunnel endpoints is computed, where each endpoint is identified as the point on the tunnel centerline lying at the greatest distance from Saverage. Next, for each endpoint, we find all endpoints in its proximity. To make the search more efficient, we transform the set of all endpoints into a smaller set called aggregated points. This is performed using the procedure described in Algorithm 2. The purpose of the algorithm is to replace each group of points that are close together by a single point that is located reasonably near to their centroid. The points A and B are close if the angle ASaverageB is smaller than a threshold value. The weight of each representative point equals the number of points it represents. If the point being added is close enough to an already existing point, both points are merged using the function AVERAGE. The direction and the distance of the new point are computed (Algorithm 2, lines 5 and 6). Both direction and distance take into account the weights a and b of the points merged, so that the new point lies closer to the point which represents more points.

The set of aggregated points is constructed iteratively by taking the endpoints one by one. The angles between the endpoint, Saverage and the aggregated points are considered. If the smallest angle is below the threshold value, the endpoint and the aggregated point are merged. The weighted average of their positions is used, where the weight of the aggregated point equals the number of points it represents. Otherwise, the endpoint is added to the aggregated point set.
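
The aggregation of tunnel ends (Algorithm 2) can be sketched in Python as follows. This is an illustrative sketch only: the associative array with a nearest-neighbor query is replaced by a plain list with a linear scan, which is an assumption made to keep the example self-contained, and all function names are hypothetical. Endpoints coinciding with S are not handled.

```python
import math

def unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def direction(p, s):
    """Unit vector pointing from s to p."""
    return unit(tuple(a - b for a, b in zip(p, s)))

def average(s, a, wa, b, wb):
    """Weighted merge of points a and b as seen from s (AVERAGE in Algorithm 2)."""
    ua, ub = direction(a, s), direction(b, s)
    c = unit(tuple(wa * x + wb * y for x, y in zip(ua, ub)))
    radius = (wa * math.dist(a, s) + wb * math.dist(b, s)) / (wa + wb)
    return tuple(o + radius * x for o, x in zip(s, c)), wa + wb

def angle(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding

def aggregate(centerlines, s, a_max):
    """Return a list of (aggregated point, weight) pairs."""
    points = []
    for line in centerlines:
        a = max(line, key=lambda p: math.dist(p, s))  # tunnel endpoint
        ua = direction(a, s)
        nearest = min(range(len(points)),
                      key=lambda i: angle(ua, direction(points[i][0], s)),
                      default=None)
        if nearest is None or angle(ua, direction(points[nearest][0], s)) > a_max:
            points.append((tuple(a), 1))
        else:
            b, wb = points[nearest]
            points[nearest] = average(s, a, 1, b, wb)
    return points
```

The linear scan makes each insertion O(number of aggregated points); a spatial index over the unit vectors would serve the same purpose for large inputs.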

For each tunnel centerline T with endpoint P, a subset of aggregated points is identified such that each point Xi from the subset satisfies the condition that the angle PSXi is small enough (threshold value 5°), where S abbreviates Saverage. The value rT is then computed as the average of the distances of the points from this subset to S (see the left side of Figure 10). Next, the sequence of spheres K1, . . . , Kh (e.g., the spheres KT1, KT2, KT3 in Figure 10), having the common center S and radii rT/h, 2rT/h, . . . , rT, is constructed. Finally, each representative point ATi, i ∈ {1, . . . , h}, is computed as the centroid of the intersection of the tunnel centerline and the space between spheres Ki−1 and Ki.
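
One possible way to compute the representative points is sketched below in Python. The sketch approximates the centroid of each centerline portion numerically by sampling the polyline densely; this sampling, the `step` parameter and the function name `representative_points` are assumptions of this example, not a description of the CAVER code. Shells that contain no part of the centerline yield `None`.

```python
import math

def representative_points(centerline, s, r_t, h=10, step=0.05):
    """Length-weighted centroid of the centerline inside each spherical shell."""
    shell_width = r_t / h
    # Per shell: length-weighted x, y, z sums and the accumulated length.
    sums = [[0.0, 0.0, 0.0, 0.0] for _ in range(h)]
    for p, q in zip(centerline, centerline[1:]):
        length = math.dist(p, q)
        n = max(1, math.ceil(length / step))
        piece = length / n
        for k in range(n):
            t = (k + 0.5) / n  # midpoint of the k-th piece
            m = [a + t * (b - a) for a, b in zip(p, q)]
            shell = int(math.dist(m, s) / shell_width)
            if shell < h:  # parts beyond radius r_t are ignored
                for axis in range(3):
                    sums[shell][axis] += m[axis] * piece
                sums[shell][3] += piece
    return [tuple(v / acc[3] for v in acc[:3]) if acc[3] > 0 else None
            for acc in sums]
```

For a straight centerline from the origin to (10, 0, 0) with S at the origin, rT = 10 and h = 5, the representative points lie close to x = 1, 3, 5, 7 and 9.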

Algorithm 2: Aggregate Tunnel Ends

Centerlines – set of tunnel centerlines (polylines)
S – abbreviation for the centroid of starting vertices Saverage
amax – threshold angle for point aggregation
Points – associative array, keys are unit vectors, values are weighted points
A, Au, B, Bu, C, Cu, D – points; a, b – scalar values

 1: function UNIT(A)
 2:     return A/|A|
 3: end function
 4: function AVERAGE(S, A, a, B, b)
 5:     C ← a · UNIT(A − S) + b · UNIT(B − S)
 6:     D ← (a · |AS| + b · |BS|) · UNIT(C)/(a + b)
 7:     return (S + D, a + b)
 8: end function
 9: function AGGREGATE(Centerlines, S)
10:     for X ∈ Centerlines do
11:         A ← point on X most distant to S
12:         Au ← UNIT(A − S)
13:         if Points is empty then
14:             Points.insert(Au, (A, 1))
15:         else
16:             (B, b) ← Points.nearest(Au)
17:             Bu ← UNIT(B − S)
18:             if amax < arccos(Au · Bu) then
19:                 Points.insert(Au, (A, 1))
20:             else
21:                 (C, c) ← AVERAGE(S, A, 1, B, b)
22:                 Points.delete(Bu)
23:                 Cu ← UNIT(C − S)
24:                 Points.insert(Cu, (C, c))
25:             end if
26:         end if
27:     end for
28:     return Points
29: end function

2.7.2 Metric for Evaluating Tunnel Similarity

The distance between tunnels T and U with sequences of representative points AT1, . . . , ATh and AU1, . . . , AUh is defined as

$$\mathrm{dist}(T, U) = \frac{1}{h} \sum_{i=1}^{h} w\!\left(\frac{i-1}{h-1}\right) \mathrm{dist}(A_{Ti}, A_{Ui}), \tag{2}$$

where w is the linear function w(x) = k1x + k2. The coefficients k1 and k2 are set so that the ratio between the weight of the last point w(1) and the weight of the first point w(0) is equal to the parameter q (default 1), and w(0.5) equals 1. Setting q to values smaller than 1 emphasizes the importance of the beginning of the tunnel, while values of q larger than 1 emphasize the end of the tunnel.
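
Solving the two stated conditions w(1)/w(0) = q and w(0.5) = 1 for the linear function gives k2 = 2/(q + 1) and k1 = 2(q − 1)/(q + 1). The following Python sketch evaluates equation 2 with these coefficients. The function names are hypothetical, and the representative points are assumed to be given as equally long lists of coordinates with h > 1.

```python
import math

def weight_coefficients(q=1.0):
    """k1, k2 of w(x) = k1*x + k2 with w(1)/w(0) = q and w(0.5) = 1."""
    k2 = 2.0 / (q + 1.0)
    k1 = (q - 1.0) * k2
    return k1, k2

def tunnel_distance(points_t, points_u, q=1.0):
    """Weighted mean distance between corresponding representative points."""
    h = len(points_t)
    if h != len(points_u) or h < 2:
        raise ValueError("both tunnels need the same number (>= 2) of points")
    k1, k2 = weight_coefficients(q)
    total = 0.0
    # The zero-based index i here plays the role of (i - 1) in equation 2.
    for i, (a, b) in enumerate(zip(points_t, points_u)):
        w = k1 * i / (h - 1) + k2
        total += w * math.dist(a, b)
    return total / h
```

With the default q = 1 all weights equal 1 and the distance reduces to the plain mean of the point-to-point distances.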

This geometry-based metric is an alternative to the metric based on the comparison of sets of atoms lining the tunnel [26]. The geometric approach allows users to emphasize the importance of either the end or the beginning of tunnels for similarity estimation. On the other hand, the atom-based approach is more general, as the geometric approach assumes that the tunnel centerline continually increases its distance to the starting point; otherwise the ability of the metric to distinguish between dissimilar tunnels decreases.

2.8 Removal of Redundant Tunnels in One Conformation

Several highly similar tunnel centerlines can be identified in one static structure. To remove such redundant tunnels, the following iterative procedure is employed. The lowest cost tunnel T is selected and all tunnels within the user-specified distance from T are discarded. The procedure is repeated with the next remaining lowest cost tunnel, until all tunnels are either selected or discarded. The purpose of tunnel removal for each conformation separately is to reduce the data size in order to accelerate the subsequent clustering of all tunnels and to make the results of the computation on a static structure more comprehensible.
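
The iterative removal can be sketched in a few lines of Python. The tunnel representation is left abstract: `cost` and `dist` are caller-supplied functions (hypothetical names), and "within the distance" is read as less than or equal to the threshold, which is an assumption.

```python
def remove_redundant(tunnels, cost, dist, threshold):
    """Greedy selection: keep the cheapest tunnel, discard its close neighbors."""
    remaining = sorted(tunnels, key=cost)
    kept = []
    while remaining:
        best = remaining.pop(0)  # lowest cost tunnel still available
        kept.append(best)
        remaining = [t for t in remaining if dist(best, t) > threshold]
    return kept
```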

2.9 Tunnel Clustering

All remaining tunnels from all conformations are collected and clustered to allow the statistical analysis of the properties of corresponding tunnels, i.e., to allow the study of tunnel dynamics. The lowest cost pathway of a cluster is selected from each conformation, providing the information about all changes of the locally most significant pathway in time.

2.9.1 Average Link Clustering

The number of tunnels, i.e., the elements to cluster, typically ranges from tens to hundreds of thousands and it can be expected to grow with the increasing length of molecular dynamics simulations [39]. Because tunnel identification in multiple conformations can be distributed among many computers, the clustering phase is the most important computational bottleneck in the workflow. In CAVER 3.0, we used the memory-constrained average link hierarchical clustering algorithm (C1) [9], [40] in the hope that it would allow for the efficient clustering of hundreds of thousands of tunnels, even with a computer not equipped with a large amount of RAM. However, its long computational time made clustering impractical. Therefore, the algorithm with the optimal worst-case computational complexity O(n²) was implemented instead (C2) according to the paper [41] and the corresponding source code from the author's home page.

As O(n²) memory complexity imposes a limit on the maximal size of the dataset due to the limited available RAM, we developed a modification of the algorithm using O(n) memory and O(n³) worst case time complexity (C3). The difference is that instead of maintaining the cluster-cluster similarity matrix in the memory, the distances are computed on the fly. This slower algorithm is used until the number of clusters decreases to the value that allows the whole matrix to be stored in memory. From this point, the efficient O(n²) algorithm is used to finalize the cluster joining.
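
The trade of time for memory can be illustrated by a deliberately naive Python sketch of average link clustering that stores no distance matrix and recomputes every cluster-cluster average on the fly. It is not the algorithm of [41] nor the CAVER implementation, and its running time is worse than the bounds quoted above. The stopping rule (the smallest average distance between two clusters exceeds the threshold) and all names are assumptions of this example.

```python
from itertools import combinations

def average_link(items, dist, threshold):
    """Agglomerative average link clustering without a stored distance matrix."""
    clusters = [[x] for x in items]

    def average(c1, c2):
        # Mean of all pairwise distances between the two clusters.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), average(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > threshold:
            break  # the closest pair is already too far apart
        clusters[i] += clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters
```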


Algorithm 3: Preclustering

threshold – size of clusters specified by users
Tunnels – set of tunnels
dist – measure of tunnel similarity according to equation 2, dist(T, T) = 0, dist(T, U) = dist(U, T)
Preclustering – set of sets of tunnels

 1: Preclustering ← {}
 2: while Tunnels not empty do
 3:     T ← the lowest cost tunnel from Tunnels
 4:     Cluster ← {}
 5:     for U ∈ Tunnels do
 6:         if dist(T, U) ≤ threshold then
 7:             remove U from Tunnels
 8:             add U to Cluster
 9:         end if
10:     end for
11:     add Cluster to Preclustering
12: end while
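
A compact Python sketch of the preclustering follows. The `cost` and `dist` arguments are caller-supplied functions and the function name `precluster` is hypothetical. Because dist(T, T) = 0, the seed tunnel always ends up in its own precluster.

```python
def precluster(tunnels, cost, dist, threshold):
    """Group tunnels around successive lowest cost seeds (Algorithm 3)."""
    remaining = list(tunnels)
    preclusters = []
    while remaining:
        seed = min(remaining, key=cost)
        cluster, rest = [], []
        for u in remaining:
            # Tunnels close to the seed join its precluster, others wait.
            (cluster if dist(seed, u) <= threshold else rest).append(u)
        preclusters.append(cluster)
        remaining = rest
    return preclusters
```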

2.9.2 Approximation

To allow for the clustering of very large datasets, two techniques for the reduction of the size of the data are used: subsampling and preclustering.

First, a user-defined percentage of the tunnels (e.g., 20%) is randomly subsampled into set A, whereas the remaining tunnels are placed into set B. Tunnels in set A are preclustered using Algorithm 3. The resulting set of small clusters is then used as an input for the average link clustering, which aggregates these small clusters into a set of full-sized clusters. Then each tunnel in set B is assigned to one of the full-sized clusters using Algorithm 4.

Algorithm 3 (preclustering), Algorithm 4 (subsampling) and average link clustering have the same asymptotic time complexity O(n²). However, the subsampling is faster than the preclustering and average link clustering (the computational time is proportional to the subsample percentage). Both preclustering and subsampling have O(n) memory complexity. They reduce the number of elements to be clustered by the average link clustering, thus overcoming its major limitation – high memory requirements. The limitation of subsampling is that it may destroy small clusters. However, this can usually be tolerated, because the biologically relevant tunnels appear in a large number of conformations.
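
The assignment test of Algorithm 4 can be sketched in Python as follows. Several details are assumptions of this example rather than statements about CAVER: the clusters are not modified while tunnels are being assigned, the nearest neighbor of a tunnel inside its own cluster excludes the tunnel itself, and a tunnel that fails the test is labeled `None`. All names are hypothetical.

```python
def cluster_statistics(cluster, dist):
    """Return (avgC, distC) of Algorithm 4 for one cluster."""
    size = len(cluster)
    avg_c = max(sum(dist(y, z) for z in cluster) / size for y in cluster)
    dist_c = max(
        min((dist(y, z) for j, z in enumerate(cluster) if j != i), default=0.0)
        for i, y in enumerate(cluster)
    )
    return avg_c, dist_c

def assign_tunnels(clusters, tunnels, dist):
    """For each tunnel return the index of its cluster, or None if rejected."""
    stats = [cluster_statistics(c, dist) for c in clusters]
    labels = []
    for t in tunnels:
        # Nearest clustered tunnel x and the index of its cluster.
        index, x = min(
            ((i, y) for i, c in enumerate(clusters) for y in c),
            key=lambda pair: dist(t, pair[1]),
        )
        cluster = clusters[index]
        avg_tc = sum(dist(t, y) for y in cluster) / len(cluster)
        avg_c, dist_c = stats[index]
        accepted = avg_tc < avg_c and dist(t, x) < dist_c
        labels.append(index if accepted else None)
    return labels
```

Precomputing the per-cluster statistics once avoids repeating the quadratic work for every tunnel of set B.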

An example of clusters obtained by the average link clustering and by this approximation is given in the following subsection.

Algorithm 4: Cluster Assignment

Clusters – clustered tunnels, i.e., set of sets of tunnels
Tunnels – set of non-clustered tunnels
size(C) – number of tunnels in cluster C
dist – measure of tunnel similarity according to equation 2
avgTC – the average of distances between T and tunnels from the cluster C
avgC – the maximal value of the previous average for all tunnels in the cluster C
distTC – the distance between the tunnel T and its nearest neighbor from the cluster C
distC – the maximum of all nearest neighbor distances for tunnels in the cluster

 1: for T ∈ Tunnels do
 2:     X ← the tunnel from Clusters such that dist(T, X) is minimum
 3:     C ← cluster from Clusters to which X belongs
 4:     avgTC = (Σ_{Y∈C} dist(T, Y)) / size(C)
 5:     avgC = max_{Y∈C} (Σ_{Z∈C} dist(Y, Z)) / size(C)
 6:     distTC = dist(T, X)
 7:     distC = max_{Y∈C} min_{Z∈C, Z≠Y} dist(Y, Z)
 8:     if avgTC < avgC and distTC < distC then
 9:         assign tunnel T to cluster C
10:     end if
11: end for

2.9.3 Measurements

To illustrate how fast the tunnel clustering is in practice, we provide the computational times for the above mentioned algorithms. All of the times were obtained by clustering tunnels computed on 20,000 snapshots of a molecular dynamics simulation of the haloalkane dehalogenase enzyme DhaA [9], which is available at www.caver.cz. The default clustering threshold 3.5 Å was used (i.e., the average distance of tunnels in the cluster is smaller than 3.5 Å). In the case of algorithm C3, the preclustering threshold 1.0 Å was used (i.e., each tunnel in the precluster is at most 1.0 Å far from the lowest cost tunnel in the precluster) and 20% of all tunnels were put into the set A. All computations were performed by running CAVER 3.02 on Java OpenJDK Runtime Environment 1.7.0 with 14 GB maximum Java heap size on a computer equipped with an Intel Xeon E5-1620 v2 3.70 GHz processor and 16 GB RAM. The computational times are summarized in Table 1.

Experiment                    Time (min)
C1 - old algorithm            576
C2 - all in quick phase       12
C2 - half in quick phase      38
C2 - all in slow phase        289
C3 - approximation            3.6

TABLE 1: Summary of computational times of clustering algorithms.

First, we focused on the comparison of algorithms C1 and C2 on 58,766 tunnels computed using the default bottleneck radius 0.9 Å. The clustering by the algorithm C1 previously used in CAVER 3.0 took 576 minutes. The algorithm C2 working in the O(n²) quick phase all the time finished in 12 minutes. When it was forced to work in the O(n³) phase all the time, it took 289 minutes. However, in practice, both phases are used. To test this, the quick phase was activated when the number of clusters dropped to half, which resulted in a computation time of 38 minutes. This setting has 1/4 of the memory cost compared to using only the quick phase, because a matrix for half as many clusters has a quarter of the entries. The computation time for the approximative algorithm C3 was 3.6 minutes. However, the primary purpose of this algorithm is not to be faster than C2, but rather to allow computations with O(n²) time complexity in case there is not enough memory available to run C2 on large data.

Second, we tested the quality of the clustering produced by the approximative algorithm C3 by comparing it with the output of C2. 99.6% of all tunnels are contained within the seven largest clusters of the C2 clustering. We compared each of those clusters with the most similar cluster from the C3 clustering, where the similarity of a pair of clusters is measured as the number of tunnels they share divided by the number of tunnels in the union of both clusters. The average cluster-cluster similarity was 93%. Six clusters were replicated very precisely (with greater than 96% similarity), but the similarity for the fifth largest cluster was only 58%. The difference can be explained by the instability of average link clustering itself rather than by an error of the approximation. We found that the content of clusters varies significantly when different randomly selected subsets of tunnels are clustered by the average link algorithm. The similarity for the fifth cluster remained low (56%) even when just 10% of randomly sampled tunnels were removed. This suggests that the neighborhood of the fifth cluster cannot be unambiguously clustered at the given threshold and a different threshold value should be used for analysis of tunnels in this region.
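
The cluster-cluster similarity used in this comparison (shared tunnels divided by the tunnels in the union) is the Jaccard index. A minimal Python sketch, assuming that tunnels are represented by hashable identifiers and using hypothetical function names, is:

```python
def cluster_similarity(a, b):
    """Number of shared tunnels divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def most_similar(cluster, candidates):
    """Candidate cluster with the highest similarity to the given cluster."""
    return max(candidates, key=lambda c: cluster_similarity(cluster, c))
```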

Finally, we tested the capability of the approximation to process very large data by clustering 711,517 tunnels obtained using bottleneck radius 0.6 Å. The computation took 117 minutes.

The new clustering solutions allow the user to analyze dynamical tunnels faster and to process larger sets of tunnels. Thus, long molecular dynamics trajectories can be analyzed, which is essential for the observation of slow dynamical events (e.g., loop movements), while maintaining the level of detail essential to also monitor fast events (e.g., rotation of side chains) in such data.

2.10 Tunnel Properties and Visualization

Along with the tunnel calculation, several numerical characteristics of the tunnel geometries are also calculated. Moreover, the tunnel can be visualized within the context of its molecule. These issues are discussed in this section.

Fig. 11: Set of tunnels computed by CAVER and visualized using CAVER Analyst [42]. The protein molecule is shown as a gray cartoon representation.

2.10.1 Tunnel Properties

There are several properties which can be computed and serve as additional descriptors of the computed tunnel. After the previously described phases of the calculation, each tunnel is represented as a sequence of empty balls and belongs to a certain conformation and cluster. Typically, each conformation contains one or very few tunnels from each cluster. The smallest ball of the tunnel and its radius – the tunnel bottleneck – are frequently used values in tunnel analysis. It can be expected, however, that the suitability of the tunnel for transport of a small molecule is also influenced by its secondary bottlenecks and the tunnel length. Therefore, the cost (see equation 1 from section 2.2) of the tunnel should be a more appropriate predictor of tunnel relevance. However, because the distribution of costs for the evolution of one tunnel in time is asymmetric, the transformed value, called the throughput, is computed as e^(−cost), where e is the base of the natural exponential function.

Clusters are then ordered by their priority, which is derived from the tunnel throughputs. The throughput in each cluster is averaged over all conformations, using zero value for conformations without any tunnel, and the greatest throughput value in conformations containing several tunnels.
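
The throughput and its averaging over conformations can be written down in a few lines of Python. The sketch assumes that the priority of a cluster is exactly the averaged throughput described above, and that the tunnel costs of one cluster are given as one list per conformation (an empty list for a conformation without a tunnel); the function names are hypothetical.

```python
import math

def throughput(cost):
    """Throughput of a tunnel, e raised to the negated cost."""
    return math.exp(-cost)

def cluster_priority(costs_per_conformation):
    """Average over conformations of the best throughput (0 if no tunnel)."""
    best = [max((throughput(c) for c in costs), default=0.0)
            for costs in costs_per_conformation]
    return sum(best) / len(best)
```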

The length of the tunnel is computed as $\sum_{i=1}^{m-1} \mathrm{dist}(Z_i, Z_{i+1})$, where Z1, Z2, . . . , Zm is the sequence of the profile balls of the tunnel. A value called the curvature is computed as the ratio of the distance dist(Z1, Zm) and the tunnel length. Atoms lining the tunnel are computed as atoms within the user-specified distance from any ball of the tunnel. The tunnel and cluster properties are saved and can be further analyzed using a text viewer or spreadsheet editor.
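
The two quantities entering the curvature can be computed as in the following Python sketch, which assumes that the centers of the profile balls are given in order from the tunnel start to its end; the function names are hypothetical. The curvature is then the ratio of these two quantities as defined above.

```python
import math

def tunnel_length(centers):
    """Sum of distances between consecutive profile ball centers."""
    return sum(math.dist(p, q) for p, q in zip(centers, centers[1:]))

def end_to_end_distance(centers):
    """Straight-line distance between the first and the last ball center."""
    return math.dist(centers[0], centers[-1])
```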


2.10.2 Tunnel Visualization

There are several possibilities for the visualization of clusters and individual tunnels. To allow the comparison of the radii profiles of multiple tunnels, heat maps are generated. A heat map is an image representing the values of a matrix by colors. Thus, the evolution of the radii profile of a whole tunnel in time can be captured in a single image, or an average tunnel radii profile can be constructed for each cluster, allowing several different clusters to be compared.

Exporting the computed tunnels to existing molecular visualization applications allows for the visualization of tunnels in the context of their corresponding molecular system. Tunnels can be visualized as sequences of densely sampled spheres, or these spheres can be covered with a smooth surface to provide a cleaner visualization. These two representations are suitable for viewing the tunnels in individual conformations. To visualize clusters of tunnels from many conformations, showing each tunnel as a line representing its centerline provides a more comprehensible visualization (see Figure 9). The ability of the molecular visualization software PyMOL [43] and VMD [44] to display these representations is enabled in the exports of the CAVER results.

In all of these cases, tunnels are rendered using the same method as is used for molecular rendering, i.e., atoms are used to draw a profile ball and chemical bonds to draw a line. This approach is quite simple to implement and can be used to export the data to almost any molecular browser. On the other hand, this solution can cause performance problems when a large number of tunnels is visualized and the properties of tunnels cannot be displayed interactively. These limitations motivated the development of a novel software tool called CAVER Analyst [42]. The tool integrates CAVER and provides an easy to use graphical user interface for computation, comprehensive visualization (see Figure 11), and interactive exploration and evaluation of tunnels and their properties both on static and dynamic structures. Users can analyze tunnels interactively together with features such as heat maps or detailed information about the tunnel surroundings. Furthermore, it provides graphical methods suitable for the visualization of the shape of tunnels. CAVER Analyst introduces the asymmetric representation of the tunnel surface, which aims to represent the tunnel shape more precisely.

3 CONCLUSIONS AND FUTURE WORK

This paper introduces several algorithms for analysis of tunnels in biomacromolecules. Tunnels in individual structures are detected using the Voronoi diagram. Moreover, we introduce an algorithm for positioning the origin of tunnels into an empty space in the proximity of the user-specified point. Then, the algorithms for the demarcation of the surface in macromolecules are also presented. The metric for the efficient computation of the geometrical similarity of two tunnels is described. We utilize the clustering approach to find the correspondence between tunnels from different snapshots and to allow the analysis of changes of tunnel shape in time. An approximate clustering algorithm is available for processing of a large number of tunnels. Altogether, the algorithms allow the identification and analysis of tunnels in both static and dynamic structures.

The presented algorithms have some limitations, which can be addressed in future research. First, it must be noted that the geometry-based tunnel identification is approximate in principle, since it does not consider the tunnel opening induced by the transported molecule, nor does it see energy barriers caused by effects other than steric clashes. However, taking the mentioned effects into account is significantly more time consuming and requires considerable experience and knowledge of the transported molecule. Therefore, geometry-based methods are a valuable complement to more complex simulations.

The inaccuracies caused by approximating larger atoms by multiple smaller balls can be expected to be smaller than the above mentioned limitations. However, we plan to utilize the additively weighted Voronoi diagram in the future to improve accuracy, efficiency, and the deterministic behavior of the algorithm.

The metric for tunnel-tunnel similarity can suffer from inaccuracy in distinguishing the parts of tunnels whose centerlines would follow a sphere centered at the starting point over a large portion of the length of the tunnels. It would be possible to measure the similarity of two tunnel centerlines using mutual minimal distances from one path to the other and vice versa, at the cost of the computational time. However, we did not implement such a solution, because the evaluation of tunnel similarity is a computational bottleneck and also because we have not encountered biologically relevant tunnels that would behave in the above mentioned way.

The previously mentioned overshadowing problem refers to the inability to identify multiple tunnels sharing a common exit. Even though almost all such tunnels could be identified by the repeated computation with different parameters, a more general and efficient solution would be desirable. We consider this to be the most significant limitation of the currently available algorithms.

The workflow contains two computational bottlenecks: Voronoi diagram construction and clustering. Voronoi diagram construction can be performed in parallel by distributing structures among several computers. In that case, the computational times of Voronoi diagram construction and clustering become similar. A distributed version of clustering is probably not a good solution, because computational times are already within a few hours for large data and running the computation on multiple computers would require additional effort from the users. However, the acceleration of the search for similar tunnels on lines 4 - 10 of Algorithm 3 by exploiting the properties of metric space could be useful.

Our primary motivation was to develop a tool for the analysis of tunnels in enzymes with narrow tunnels, because these narrow tunnels are hard to identify and analyze without a dedicated tool. However, our tool can be used for the detection of tunnels in different biological systems. It can be utilized even for the analysis of pores.

The program CAVER 3.02 and the academic version of CAVER Analyst 1.0 are freely available at www.caver.cz, together with the molecular dynamics simulation used for testing our new clustering solution and the PyMOL session corresponding to Figures 9 and 11. This simulation can also be useful for the development and testing of any other algorithms and software tools for the analysis of molecular dynamics trajectories.

ACKNOWLEDGMENTS

The work of AP was supported by the Ph.D. Talent Scholarship provided by Brno City Municipality. The work of JB was supported by the project Employment of Best Young Scientists for International Cooperation Empowerment (CZ.1.07/2.3.00/30.0037), co-financed from the European Social Fund and the state budget of the Czech Republic. The Ministry of Education is acknowledged for financial support (LO1214, CZ.1.05/3.2.00/08.0144). MetaCentrum is acknowledged for providing access to computing facilities, supported by the Ministry of Education of the Czech Republic (LM2010005). CERIT-SC is acknowledged for providing access to their computing facilities under the program Center CERIT Scientific Cloud (CZ.1.05/3.2.00/08.0144).

REFERENCES

[1] R. Berisio, F. Schluenzen, J. Harms, A. Bashan, T. Auerbach,D. Baram, and A. Yonath, “Structural Insight into the Role ofThe Ribosomal Tunnel in Cellular Regulation.” Nature Struc-tural Biology, vol. 10, no. 5, pp. 366–70, 2003.

[2] A. Wlodawer and J. Vondrasek, “Inhibitors of HIV-1 Protease:a Major Success of Structure-assisted Drug Design.” AnnualReview of Biophysics and Biomolecular Structure, vol. 27, pp. 249–84, 1998.

[3] O. Kirk, T. V. Borchert, and C. C. Fuglsang, “Industrial EnzymeApplications,” Current Opinion in Biotechnology, vol. 13, no. 4,pp. 345–351, 2002.

[4] M. Petrek, M. Otyepka, P. Banas, P. Kosinova, J. Koca, andJ. Damborsky, “CAVER: a New Tool to Explore Routes fromProtein Clefts, Pockets and Cavities.” BMC Bioinformatics,vol. 7, p. 316, 2006.

[5] M. Otyepka, J. Skopalik, E. Anzenbacherova, and P. Anzen-bacher, “What Common Structural Features and Variationsof Mammalian P450s Are Known to Date?” Biochimica etBiophysica Acta, vol. 1770, no. 3, pp. 376–89, 2007.

[6] M. Karplus and J. A. McCammon, “Molecular DynamicsSimulations of Biomolecules.” Nature Structural Biology, vol. 9,no. 9, pp. 646–52, 2002.

[7] M. Klvana, M. Pavlova, T. Koudelakova, R. Chaloupkova,P. Dvorak, Z. Prokop, A. Stsiapanava, M. Kuty, I. Kuta-Smatanova, J. Dohnalek, P. Kulhanek, R. C. Wade, andJ. Damborsky, “Pathways and Mechanisms for Product Releasein the Engineered Haloalkane Dehalogenases Explored UsingClassical and Random Acceleration Molecular Dynamics Sim-ulations.” Journal of Molecular Biology, vol. 392, no. 5, pp. 1339–56, 2009.

[8] F. M. Ho, “Uncovering Channels in Photosystem II by Com-puter Modelling: Current Progress, Future Prospects, andLessons from Analogous Systems.” Photosynthesis Research,vol. 98, no. 1-3, pp. 503–22, 2008.

[9] E. Chovancova, A. Pavelka, P. Benes, O. Strnad, J. Brezovsky,B. Kozlikova, A. Gora, V. Sustr, M. Klvana, P. Medek, L. Bie-dermannova, J. Sochor, and J. Damborsky, “CAVER 3.0: ATool for the Analysis of Transport Pathways in DynamicProtein Structures.” PLoS Computational Biology, vol. 8, no. 10,p. e1002708, 2012.

[10] J. L. Sussman, D. Lin, J. Jiang, N. O. Manning, J. Prilusky,O. Ritter, and E. E. Abola, “Protein Data Bank (PDB): Databaseof Three-Dimensional Structural Information of BiologicalMacromolecules,” Acta Crystallographica Section D, vol. 54, no.6 Part 1, pp. 1078–1084, 1998.

[11] J. Brezovsky, E. Chovancova, A. Gora, A. Pavelka, L. Bie-dermannova, and J. Damborsky, “Software Tools for Identi-fication, Visualization and Analysis of Protein Tunnels andChannels.” Biotechnology Advances, vol. 31, no. 1, pp. 38–49,2012.

[12] Z. Prokop, A. Gora, J. Brezovsky, R. Chaloupkova,V. Stepankova, and J. Damborsky, Protein EngineeringHandbook. Weinheim: Wiley-VCH, 2012, ch. Engineering ofProtein Tunnels: Keyhole-lock-key Model for Catalysis by theEnzymes with Buried Active Sites, pp. 421–464.

[13] D. Kim and Y. Cho, “Euclidean Voronoi Diagram of 3DBalls and Its Computation Via Tracing Edges,” Computer-AidedDesign, vol. 37, no. 13, pp. 1412–1424, 2005.

[14] A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu, Spatial Tessel-lations: Concepts and Applications of Voronoi diagrams, 2nd ed.,ser. Probability and Statistics. NYC: Wiley, 2000, 671 pages.

[15] F. Aurenhammer, “Voronoi Diagrams - A Survey of a Fun-damental Data Structure,” ACM Computing Surveys (CSUR),vol. 23, no. 3, pp. 345–405, 1991.

[16] N. N. Medvedev, V. P. Voloshin, V. A. Luchnikov, and M. L.Gavrilova, “An Algorithm for Three-Dimensional Voronoi S-Network,” Journal of Computational Chemistry, pp. 1–3, 2006.

[17] O. S. Smart, J. G. Neduvelil, X. Wang, B. Wallace, and M. S.Sansom, “HOLE: A Program for the Analysis of the PoreDimensions of Ion Channel Structural Models,” Journal ofMolecular Graphics, vol. 14, no. 6, pp. 354–360, 1996.

[18] M. Pellegrini-Calace, T. Maiwald, and J. M. Thornton, “Pore-Walker: a Novel Tool for the Identification and Characteriza-tion of Channels in Transmembrane Proteins from Their Three-dimensional Structure.” PLoS Computational Biology, vol. 5,no. 7, p. e1000440, 2009.

[19] R. G. Coleman and K. a. Sharp, “Finding and CharacterizingTunnels in Macromolecules with Application to Ion Channelsand Pores.” Biophysical Journal, vol. 96, no. 2, pp. 632–45, 2009.

[20] N. Lindow, D. Baum, and H.-C. Hege, “Voronoi-based Extrac-tion and Visualization of Molecular Paths.” IEEE Transactionson Visualization and Computer Graphics, vol. 17, no. 12, pp. 2025–34, 2011.

[21] D. S. Kim, Y. Cho, J. K. Kim, and K. Sugihara, “Tunnelsand Voids in Molecules via Voronoi Diagrams and Beta-Complexes,” Lecture Notes in Computer Science (including sub-series Lecture Notes in Artificial Intelligence and Lecture Notes inBioinformatics), vol. 8110, pp. 92–111, 2013.

[22] J. K. Kim, Y. Cho, R. a. Laskowski, S. E. Ryu, K. Sugihara, andD. S. Kim, “BetaVoid: Molecular Voids via Beta-Complexesand Voronoi Diagrams,” Proteins: Structure, Function and Bioin-formatics, no. October 2013, pp. 1829–1849, 2014.

[23] M. Manak and I. Kolingerova, “Fast Discovery of VoronoiVertices in the Construction of Voronoi Diagram of 3D Balls,”2010 International Symposium on Voronoi Diagrams in Science andEngineering, pp. 95–104, 2010.

Page 14: CAVER: Algorithms for Analyzing Dynamics of Tunnels in …caver.cz/fil/publications/TCBB.pdf · 2015. 10. 2. · The CAVER 3.0 tool described in this paper falls into this last category,

14

[24] J. Damborsky, M. Petrek, P. Banas, and M. Otyepka, “Iden-tification of Tunnels in Proteins, Nucleic Acids, InorganicMaterials and Molecular Ensembles.” Biotechnology Journal,vol. 2, no. 1, pp. 62–7, 2007.

[25] P. Medek, P. Benes, and J. Sochor, “Computation of Tunnelsin Protein Molecules Using Delaunay Triangulation,” Journalof WSCG, University of West Bohemia, Pilsen, vol. 15(1-3), pp.107–114, 2007.

[26] M. Petrek, P. Kosinova, J. Koca, and M. Otyepka, “MOLE:a Voronoi Diagram-based Explorer of Molecular Channels,Pores, and Tunnels.” Structure, vol. 15, no. 11, pp. 1357–63,2007.

[27] D. Sehnal, R. Svobodova Varekova, K. Berka, L. Pravda,V. Navratilova, P. Banas, C.-M. Ionescu, M. Otyepka, andJ. Koca, “MOLE 2.0: Advanced Approach for Analysisof Biomacromolecular Channels.” Journal of Cheminformatics,vol. 5, no. 1, p. 39, 2013.

[28] E. Yaffe, D. Fishelovitch, H. J. Wolfson, D. Halperin, andR. Nussinov, “MolAxis: Efficient and Accurate Identificationof Channels in Macromolecules.” Proteins, vol. 73, no. 1, pp.72–86, 2008.

[29] P. Benes, P. Medek, and J. Sochor, “Tracking Single Channel inProtein Dynamics,” WSCG Communication Papers Proceedings,pp. 109–114, 2010.

[30] P. Benes and P. Medek, “Computation of Dynamic Channelsin Proteins,” The Third International Conference on Bioinformatics,Biocomputational Systems and Biotechnologies, pp. 78–83, 2011.

[31] P. Benes, O. Strnad, and J. Sochor, “New Path Planning Methodfor Computation of Constrained Dynamic Channels in Pro-teins,” WSCG Full Papers Proceedings, pp. 81–88, 2011.

[32] N. Lindow, D. Baum, A.-N. Bondar, and H.-C. Hege, “Dy-namic Channels in Biomolecular Systems: Path Analysis andVisualization,” 2012 IEEE Symposium on Biological Data Visual-ization (BioVis), pp. 99–106, 2012.

[33] N. Lindow, D. Baum, A. N. Bondar, and H. C. Hege, “Explor-ing cavity dynamics in biomolecular systems.” BMC Bioinfor-matics, vol. 14, no. S-19, p. S5, 2013.

[34] E. Yaffe, “Efficient construction of pathways in the comple-ment of the union of balls in R3,” Tel Aviv, 2007.

[35] A. Goede, R. Preissner, and C. Frommel, “Voronoi Cell: NewMethod for Allocation of Space Among Atoms: Eliminationof Avoidable Errors in Calculation of Atomic Vvolume andDensity,” Journal of Computational Chemistry, vol. 18, no. 9, pp.1113–1123, 1997.

[36] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa, “The QuickhullAlgorithm for Convex Hulls,” ACM Transactions on Mathemat-ical Software, vol. 22, no. 4, pp. 469–483, 1996.

[37] N. R. Voss and M. Gerstein, “3V: Cavity, Channel andCleft Volume Calculator and Extractor.” Nucleic acids research,vol. 38, no. Web Server Issue, pp. W555–62, 2010.

[38] B. Lee and F. Richards, “The Interpretation of Protein Struc-tures: Estimation of Static Accessibility,” Journal of MolecularBiology, vol. 55, no. 3, pp. 379–IN4, 1971.

[39] J. L. Klepeis, K. Lindorff-Larsen, R. O. Dror, and D. E. Shaw,“Long-timescale Molecular Dynamics Simulations of ProteinStructure and Function,” Current Opinion in Structural Biology,vol. 19, no. 2, pp. 120–127, 2009.

[40] Y. Loewenstein, E. Portugaly, M. Fromer, and M. Linial, “Effi-cient Algorithms for Accurate Hierarchical Clustering of HugeDatasets: Tackling the Entire Protein Space.” Bioinformatics,vol. 24, no. 13, pp. i41–9, 2008.

[41] F. Murtagh, “Complexities of Hierarchic Clustering Algo-rithms: the State of the Art,” Computational Statistics Quarterly,vol. 1, pp. 101–113, 1984.

[42] B. Kozlikova, E. Sebestova, V. Sustr, J. Brezovsky, O. Strnad,L. Daniel, D. Bednar, A. Pavelka, M. Manak, M. Bezdeka,P. Benes, M. Kotry, A. Gora, J. Damborsky, and J. Sochor,“CAVER Analyst 1.0: Graphic Tool for Interactive Visual-ization and Analysis of Tunnels and Channels in ProteinStructures.” Bioinformatics, vol. 30, no. 18, pp. 2684–5, 2014.

[43] Schrodinger, LLC, “The PyMOL Molecular Graphics System,Version 1.7,” January 2014, software.

[44] W. Humphrey, A. Dalke, and K. Schulten, “VMD – VisualMolecular Dynamics,” Journal of Molecular Graphics, vol. 14,pp. 33–38, 1996.

Antonin Pavelka received his MS degree in Bioinformatics from the Faculty of Informatics at Masaryk University in 2008 and is currently working there toward the PhD degree. He was awarded the Brno PhD Talent scholarship in 2011. His research interests include bioinformatics and machine learning.

Eva Sebestova received the PhD degree in environmental chemistry from the Faculty of Science, Masaryk University in 2011. Her main research interests include applied bioinformatics and rational design of proteins.

Barbora Kozlikova received the PhD degree in computer graphics from the Faculty of Informatics at Masaryk University in 2011. Currently, she is an assistant professor at the Department of Computer Graphics and Design at the same institution. Her research interests include visualization, computational geometry and bioinformatics.

Jan Brezovsky is a leader of the theoretical team of Loschmidt Laboratories at the Department of Experimental Biology and the Research Centre for Toxic Compounds in the Environment, Masaryk University, Brno, Czech Republic. He received his PhD in Environmental Chemistry from Masaryk University in 2011. His main research topics are computational engineering of proteins, virtual screening and development of computational tools for protein engineering.

Jiri Sochor is an Associate Professor of computer science at the Faculty of Informatics at Masaryk University, Brno, Czech Republic. His research interests include computer graphics, virtual reality and human-computer interaction. He received his Ph.D. in digital computers from Czech Technical University in 1981. He is a co-author of the book Modern Computer Graphics (in Czech).

Jiri Damborsky is the Josef Loschmidt Chair Professor of Chemistry and Professor of Biochemistry at the Faculty of Science at Masaryk University in Brno, Czech Republic. Research in his group is focused on protein and metabolic engineering. His group develops new concepts and software tools for protein engineering (CAVER, HOTSPOT WIZARD, PREDICTSNP), and uses them for the rational design of enzymes and bacteria with improved properties for biocatalysis, biodegradation and biosensing. He has published over 160 original articles, 14 book chapters, and has filed 5 international patents. He is a co-founder of the first biotechnology spin-off from Masaryk University, Enantis Ltd.