Annual Review Johns Hopkins University | Institute for Data Intensive Engineering and Science October 1, 2018 - September 30, 2019


MESSAGE FROM THE DIRECTOR

At IDIES, faculty and students work together to solve amazing data-intensive problems, from genes to galaxies, including new projects in materials science and urban planning. Over the last six years, our members have successfully collaborated on many proposals related to Big Data, and we have hired new faculty members, all working on different aspects of data-driven discoveries. Together, we are successful in building a collection of unique, large data sets, giving JHU a strategic advantage in making new discoveries.

Making decisions based on facts, relying on sound data foundations, is more important than ever. To create and grow diverse datasets, we have active partnerships with six divisions, and several institutes and large projects within JHU. Emerging collaborations include the School of Advanced International Studies, School of Education, the Mathematical Institute for Data Science (MINDS), the multi-institutional PARADIM project in materials science (HEMI), the 21st Century Cities Initiative (21CC), cancer immunotherapy projects through the Bloomberg Kimmel Institute, and various large-scale studies in genomics.

Over the past year, IDIES has looked beyond JHU to establish new partnerships with outside organizations. We have active collaborations with the Lieber Institute for Brain Development and the Kennedy Krieger Institute. We are involved with various space science projects: collaborating with the Space Telescope Science Institute on the WFIRST Space Telescope Data Archive; providing the data system for a new European X-ray satellite, eRosita; and soon we will host High Energy Astrophysics data from the Goddard Space Flight Center. Most recently, we started a new project with the National Institute of Standards and Technology (NIST) to build an aggregator for data from Puerto Rico about the effects of Hurricane Maria.

In August, IDIES hosted an inaugural symposium, “Urban Spaces in Baltimore: Data Science in the City”, in collaboration with 21CC and the Carey Business School. Building on the success of this workshop, we look to invite members of the Baltimore City administration and various NGOs to work with us on how to use a larger variety of data to make our city a better place.

Postdocs and graduate students are working with IDIES faculty on AI-related projects, from materials science to astronomy and cancer biology. The Schmidt Family Foundation awarded $6M to IDIES, jointly with Princeton, to develop AI tools to select targets for the PFS project. Machine learning, in particular Deep Learning, has revolutionized how industry handles Big Data. IDIES and MINDS have decided to work together towards these very important emerging goals, joining our expertise to lead to greater new discoveries. To further this collaboration, next year the two institutes will hold their annual meeting together, at Homewood.

IDIES aims to accelerate, grow and become more relevant across the University by providing more intensive help in launching and sustaining data-intensive projects in all disciplines. We seek new ideas and new directions but cannot do this alone: we need your help and initiative. Please send us your ideas, big or small, on how we can improve our engagement with your research community.

On the cover: The cover image is from a study of multiplex imaging of cancer cells done at the Bloomberg-Kimmel Center for Cancer Immunotherapy. The image was acquired as part of a collaborative project with IDIES.


SYMPOSIUM
Agenda
Keynotes
Speakers

NEWS
News
Announcements

IDIES
Seed Fund Updates
Urban Spaces
IDIES in Numbers
About IDIES


2019 IDIES ANNUAL REVIEW | SYMPOSIUM


AGENDA


KEYNOTES

ELIZABETH JAFFEE, MD
Deputy Director, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins
Professor of Oncology
Johns Hopkins University

Dr. Jaffee is an internationally recognized expert in cancer immunology and pancreatic cancer. She is Deputy Director of the Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Co-Director of the Skip Viragh Pancreatic Cancer Center and Associate Director of the Bloomberg Kimmel Institute for Cancer Immunotherapy. Her research focus is on developing novel immunotherapies for the treatment and prevention of pancreatic cancer. Dr. Jaffee is a Past President of AACR. She has served on a number of committees at the National Cancer Institute, including as co-chair of the Blue Ribbon Panel that provided scientific advice to Vice President Biden’s Moonshot Initiative. She currently serves as chair of the National Cancer Advisory Board and Chief Medical Advisor to the Lustgarten Foundation for Pancreatic Cancer Research. She is the inaugural Director of the new Convergence Institute at Johns Hopkins.


RENE VIDAL, PhD
Herschel L. Seder Professor of Biomedical Engineering
Director, Mathematical Institute for Data Science
Johns Hopkins University

Rene Vidal is the Herschel Seder Professor of Biomedical Engineering and the Inaugural Director of the Mathematical Institute for Data Science at The Johns Hopkins University. Vidal’s research focuses on the development of theory and algorithms for the analysis of complex high-dimensional datasets. His current major research focus is understanding the mathematical foundations of deep learning and its applications in computer vision and biomedical data science. His lab has pioneered the development of methods for dimensionality reduction and clustering, such as Generalized Principal Component Analysis and Sparse Subspace Clustering, and their applications to face recognition, object recognition, motion segmentation and action recognition. His lab creates new technologies for a variety of biomedical applications, including detection, classification and tracking of blood cells in holographic images, classification of embryonic cardiomyocytes in optical images, and assessment of surgical skill in surgical videos. Dr. Vidal has been Associate Editor in Chief of TPAMI and CVIU, Program Chair of ICPR, ICCV and CVPR, co-author of the book “Generalized Principal Component Analysis” (2016), and co-author of more than 250 articles in machine learning, computer vision, biomedical image analysis, hybrid systems, robotics and signal processing. He is an IEEE Fellow, an IAPR Fellow, a Sloan Fellow, and has received numerous awards for his work, including the 2017 D’Alembert Faculty Award, 2012 J.K. Aggarwal Prize, 2009 ONR Young Investigator Award, as well as best paper awards in machine learning, computer vision, controls, and medical robotics.


SPEAKERS

Tom Haine is Professor of Oceanography in the Department of Earth & Planetary Sciences at Johns Hopkins. He studies ocean circulation and dynamics and the ocean’s role in climate, typically by synthesizing observations, theory, and numerical simulations.

Gregory Eyink is a professor of applied mathematics and statistics who focuses mainly on the phenomenon of turbulence in fluids and plasmas. He holds joint appointments in the departments of Physics and Astronomy, Mathematics, and Mechanical Engineering. Eyink’s research interests include mathematical physics, fluid mechanics, turbulence, dynamical systems, partial differential equations, non-equilibrium statistical physics, geophysics and climate, astrophysics, and plasma physics.

Roman Galperin, PhD, is an associate professor of Management and Organization at Johns Hopkins Carey Business School and an associate faculty member at IDIES. As a social scientist, he researches ways in which quality and expertise are signaled and perceived in markets and organizations, using a network-analytic approach to study the organization of knowledge work.

Mark Patton is a Senior Software Engineer in the Digital Research and Curation Center of the Sheridan Libraries. He is a JHU alumnus with a BS and MS in Computer Science. Projects he has worked on include the Archaeology of Reading and the Public Access Submission System.

Gerard Lemson has his PhD in theoretical cosmology and is currently a research scientist at Johns Hopkins University. He is associate director for science coordination in the NSF-funded SciServer project (www.sciserver.org) and assists in code development of that platform.

Christopher Cannon is Bloomberg Distinguished Professor of English and Classics and a medievalist. He is the author of four books on Middle English and is now general co-editor of Oxford Studies in Medieval Literature and Culture (a monograph series) and of the Oxford Chaucer (an edition in progress of all of Chaucer’s writing).

Jaime Combariza, PhD, is the director of the Maryland Advanced Research Computing Center (MARCC), a shared high-performance computing facility for Johns Hopkins University and the University of Maryland.

Rajat Mittal is Professor of Mechanical Engineering with a secondary appointment in the School of Medicine. His research interests include fluid mechanics, computing, biomedical engineering, biofluids and flow control.


Brian Camley works on the physics of cell biology - how physics constrains the ability of cells to sense their environment, move, and cooperate. These questions link soft matter physics and statistical inference. Brian is an Assistant Professor in the Departments of Physics & Astronomy and Biophysics.

Sarah Wheelan, MD, PhD, co-directs the Center for Computational Genomics and the Experimental and Computational Genomics Core in the School of Medicine. An associate professor of Oncology, Molecular Biology and Genetics, and Biostatistics (in the School of Public Health), her research focuses on creating computational methods to better understand complex genetic and epigenetic processes in cancer.

Aboozar Hadavand is a postdoctoral fellow in Biostatistics at the Bloomberg School of Public Health, currently researching Massive Open Online Course (MOOC) data and the effectiveness of a job training program in Baltimore. Dr. Hadavand is the lead instructor of the Cloud-Based Data Science (CBDS) program and has been instrumental in helping Dr. Jeff Leek launch and lead the program through the Johns Hopkins Data Science Lab (JHU DaSL).

Sayeed Choudhury is the Associate Dean for Research Data Management and Hodson Director of the Digital Research and Curation Center at the Sheridan Libraries of Johns Hopkins University. He is a member of the IDIES Executive Committee. He has been the PI for grants from the National Science Foundation, the Institute of Museum and Library Services, the Alfred P. Sloan Foundation, the Andrew W. Mellon Foundation, the Library of Congress, Microsoft Research, and a Maryland-based venture capital group. He is a President Obama appointee to the National Museum and Library Services Board. He is currently a member of the National Academies Committee for Forecasting Costs for Preserving, Archiving, and Promoting Access to Biomedical Data. He was formerly a member of the National Academies Board on Research Data and Information and the Blue Ribbon Task Force on Sustainable Digital Preservation and Access.

Janet Markle, Ph.D., is an Assistant Professor in the Department of Molecular Microbiology and Immunology, Johns Hopkins Bloomberg School of Public Health. Janet’s research group works to understand the genetic basis of rare diseases of the immune system, by using next-generation genome-wide sequencing technologies to find, and then functionally characterize, variant alleles in the genomes of patients with immunological diseases with unknown aetiologies. Janet’s research spans several disciplines including human genetics, bioinformatics, molecular biology and cellular immunology.

Alex Szalay is the founding director of IDIES, a Bloomberg Distinguished Professor, Alumni Centennial Professor of Astronomy, and a professor of Computer Science. As a cosmologist, he works on the use of big data in advancing scientists’ understanding of astronomy, physical sciences, and life sciences.

Scot Miller is an assistant professor in the Departments of Environmental Health and Engineering and of Earth and Planetary Sciences. His lab uses observations of greenhouse gases collected from airplanes, towers, and satellites to estimate emissions across the globe.


NEWS | 2019 IDIES ANNUAL REVIEW

NEWS

Data Intensive Scientific Computing (DISCO) at MARCC

[NSF 1920103] MRI: Acquisition of an Advanced Computing Instrument to Integrate Data-Driven Research and Data intensive computing at Johns Hopkins University

Jaime Combariza (Director of MARCC)

The generation of large amounts of data in climate modeling, turbulence models, atmospheric research, genomics, brain science and many other areas has greatly impacted the need for diverse yet integrated resources, so that researchers using advanced computing technologies can more effectively conduct transformational research. The factors that complicate these processes include the types and size of the data, network speed, and the different novel approaches for analysis and processing. No single solution exists to resolve all issues that arise from these new developments, and there is increasing demand for a common infrastructure that is both supportive of research trends and at the same time reduces cost and effort. That is, the integration of data-intensive computing and traditional high-performance computing creates a powerful environment conducive to advancing research computing.

Large data sets are a challenge for researchers, as there is a need to develop new sophisticated methodologies that may include combining observational data with simulations to accomplish the desired research in a short period of time, developing and/or combining new applications, or adapting existing applications to account for additional functionality. Research computing resources must adapt as data needs to be stored, shared and transferred to different facilities for further analysis. The Data Intensive Scientific Computing (DISCO) environment will provide APIs and containerized workflows that enable interoperability between local data centers and will coordinate data transfer processes when necessary or exploit existing high-speed connectivity between MARCC and several other campuses at Hopkins and Internet2. The DISCO ecosystem at Johns Hopkins University will enable interoperability between a traditional HPC and data-intensive system capable of processing and storing large amounts of data, an Open Storage Network (OSN) appliance to facilitate data sharing, and the SciServer, a groundbreaking Science Platform for the hosting, access, sharing, and processing of large scientific data sets.

Merging the above collaborative and data management capabilities of SciServer with the HPC resources of MARCC and the data storage services of the OSN will provide a powerful ecosystem for researchers to define, manage and execute large-scale, data-intensive projects covering all stages of the pipeline: project definition, data generation (simulation), data storage and transfer, postprocessing and analysis, and subsequent data sharing and publication. Research projects currently carried out at MARCC involve simulations on large amounts of data, which require the use of an integrated system where the infrastructure is presented as a whole rather than using isolated systems. Figure 1 illustrates the relationship between example focus areas with respect to an integrated environment based on advanced computing and the collaborations with existing NSF-sponsored projects and minority institutions.

An important aspect to highlight is the development of scientific tools in these focus areas. The proposed resource will be used as a sandbox to develop, test and benchmark these applications, which for the most part are distributed to different communities as open-source software. Likewise, it is envisioned that many of the datasets that will be created will be available through a SciServer web interface for the benefit of scientific communities. This proposed Data Intensive Scientific Environment represents an example of the evolution of research computing and will provide tools and resources to easily integrate the large HPC simulations needed to produce and store highly reliable datasets that can provide relevant information that complements existing experiments and field observables.

Figure 1. Relationship between the integrated advanced computing environment and each research focus area


Using the JHTDB and Machine Learning to Elucidate Fractal Scaling of Turbulent Spots in Transitional Boundary Layer Flow

Zhao Wu, Tamer A. Zaki, and Charles Meneveau

How an initially laminar viscous boundary layer becomes turbulent is one of the most fundamental and practically important phenomena in fluid dynamics, impacting disciplines ranging from aeronautics and transportation engineering to geophysics and biological fluid mechanics. Transition from laminar to turbulent flow over a solid surface is often accompanied by inception, growth and merger of turbulent spots. The evolution and scaling properties of the “skin” of these spots, the interface separating laminar from turbulent flow, are of particular interest. To study the scaling properties of these spots, extensive datasets are required.

We use data from a high-fidelity simulation of a zero-pressure-gradient, transitional boundary layer over a smooth plate with free-stream turbulence that are available from the Johns Hopkins Turbulence Databases (JHTDB, http://turbulence.pha.jhu.edu). This is one of the largest datasets curated by IDIES. The spot interfaces are detected using an unsupervised clustering method, specifically a self-organizing map (SOM) with the number of clusters set to 2. The method automatically identifies whether a point is in the turbulent state (which includes the fully turbulent downstream region and the turbulent spots in the transitional region) or whether it is in laminar state. Figure 1 shows one of these spots identified by the SOM clustering method (black line) on a horizontal plane parallel to the bottom surface. Figure 2 shows a snapshot of the entire interface colored by height above the surface. Several spots are visible before the transition is completed.
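To make the clustering step concrete, here is a minimal sketch of a two-node self-organizing map written with the Python standard library only. It is an illustration under stated assumptions, not the authors' implementation: the feature vectors, the learning-rate and neighborhood schedules, the min/max initialization, and all function names are hypothetical choices made for this example. Each sample is assigned to its best-matching node, which plays the role of the "laminar" or "turbulent" label.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the node whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(node, x)) for node in weights]
    return dists.index(min(dists))


def train_two_node_som(samples, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM with exactly two nodes (assumed schedules, see lead-in)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Assumed deterministic start: one node at the per-feature minimum,
    # the other at the per-feature maximum of the data.
    weights = [
        [min(s[k] for s in samples) for k in range(dim)],
        [max(s[k] for s in samples) for k in range(dim)],
    ]
    order = list(range(len(samples)))
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.05     # neighborhood width shrinks
            x = samples[idx]
            bmu = best_matching_unit(weights, x)
            for node in range(2):
                grid_dist = abs(node - bmu)          # 0 for the BMU, 1 for its neighbor
                h = math.exp(-(grid_dist ** 2) / (2.0 * sigma ** 2))
                for k in range(dim):
                    weights[node][k] += lr * h * (x[k] - weights[node][k])
            step += 1
    return weights


def label_points(weights, samples):
    """Assign every sample to its best-matching node (cluster 0 or 1)."""
    return [best_matching_unit(weights, x) for x in samples]
```

In use, `samples` would be a list of per-point feature vectors computed from the flow field; which flow quantities serve as features is not specified here and would have to be chosen by the analyst.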

In order to find the fractal dimension of the spot boundaries, we generate a scatter plot of spot surface areas plotted as a function of spot volume. The results are shown in Figure 3. The fractal dimension is three times the slope in such a log-log plot, i.e. we measure a fractal dimension, D=2.36 ± 0.01 with a scaling range visible over almost 5 decades of spot volume. The measured fractal dimension agrees very well with values known from fully turbulent flows, such as for cloud boundaries at much higher Reynolds numbers. The value is near 7/3, where the 1/3 excess value above 2 for smooth surfaces arises from classical Kolmogorov’s theory of turbulence. Hence, turbulent spots at very low Reynolds numbers already exhibit scaling properties similar to those of high-Reynolds number scaling behavior. The result is of considerable interest since finding high-Reynolds number behavior in nascent turbulence at low Reynolds numbers can lead to new developments in turbulence theory.
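The factor of three follows from a scaling argument: if a spot of linear size L has interface area A ~ L^D and volume V ~ L^3, then A ~ V^(D/3), so the slope of log A against log V is D/3. The sketch below shows that estimate with an ordinary least-squares fit, using only the Python standard library. The function names and the synthetic test data are assumptions introduced for illustration; the actual spot areas and volumes come from the JHTDB analysis described above.

```python
import math
import random


def loglog_slope(volumes, areas):
    """Least-squares slope of log10(area) against log10(volume)."""
    xs = [math.log10(v) for v in volumes]
    ys = [math.log10(a) for a in areas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def fractal_dimension(volumes, areas):
    """D = 3 * slope, because A ~ L^D and V ~ L^3 imply A ~ V^(D/3)."""
    return 3.0 * loglog_slope(volumes, areas)


def synthetic_spots(dimension, count=200, seed=0):
    """Hypothetical test data: spots spanning five decades of volume."""
    rng = random.Random(seed)
    volumes, areas = [], []
    for _ in range(count):
        size = 10 ** rng.uniform(0.0, 5.0 / 3.0)   # linear size L
        noise = 10 ** rng.gauss(0.0, 0.02)         # small multiplicative scatter
        volumes.append(size ** 3)
        areas.append(size ** dimension * noise)
    return volumes, areas
```

For example, calling `fractal_dimension(*synthetic_spots(7.0 / 3.0))` should recover a value close to 7/3, and perfectly smooth, self-similar shapes (area scaling as the two-thirds power of volume) give D = 2.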

Figure 1: Sample spot at a height of 20% of the incoming laminar boundary layer thickness on a horizontal plane. The background color shows ∂u/∂y while the black line shows the interface as identified by the unsupervised machine learning clustering algorithm.

Figure 3: Scatter plot of spot interface area versus volume. The solid line is the linear fit over the entire range and has a slope of 0.79±0.003 (95% confidence bounds), leading to D=2.36 ±0.01. For reference, the dashed lines have a slope of 2/3 (D=2, lower dashed line) and 1 (D=3, upper dashed line).

Figure 2: Interface in transitional boundary layer identified by unsupervised machine learning clustering method, colored by height above the wall. Data are from the JHTDB.


Cloud-Based Data Science Plus: Overcoming Barriers to Economic Mobility

The Johns Hopkins Data Science Lab (JHU DaSL) has taught massive open online courses for over six years; in that time they have reached more than 5 million learners interested in breaking into the data science industry, which boasts the top-rated jobs in America.

While the JHU DaSL has achieved incredible results through these training programs, they have also come to the realization that there are still significant barriers to entering the field of data science: people must know that the field of data science exists before learning can take place; associated training programs often assume that learners have advanced math or programming knowledge; data science training programs often require expensive computers; the training itself is oftentimes expensive; data science jobs are concentrated in tech hubs of major cities; and getting a job in data science requires networking and connections.

In the spring of 2018, the JHU DaSL launched Cloud-Based Data Science, a two-part training initiative created with the aspiration of overcoming these barriers. The first half of the program consists of a set of pay-what-you-want (including free) massive open online courses that can be completed on any computer with a web browser and an internet connection.

The second half of the program, Cloud-Based Data Science Plus (CBDS+), is a no-cost, 14-week tutoring program for young-adult high school and GED graduates in Baltimore City. Offered by the Johns Hopkins Bloomberg School of Public Health, in partnership with the Historic East Baltimore Community Action Coalition and Leanpub, CBDS+ aims to equip members of surrounding communities with the necessary skills and support required to work in the field of data science.

The features of our program include:

Free Access to the Online Content: Whereas a typical college program in data science costs, on average, $53,300, our online Johns Hopkins massive open online courses are offered for free on Leanpub.

Fellowship: CBDS+ pays scholars a stipend as they complete courses.

Free Laptops and Internet Access: All learners are provided with a free Chromebook laptop and offered funds to access the internet at home.

In-Person and Online Tutoring: In addition to the online courses, scholars are required to participate in in-person office hours and are provided additional online support outside of office hours.

Post-Graduation Career Mentoring: The training program emphasizes soft skills such as written and verbal communication and building a data science portfolio. Upon completion of the program, the CBDS+ team assists scholars in applying to a variety of data science jobs.

This initiative has been an exciting undertaking for the JHU DaSL thus far. The first two cohorts have graduated a total of six scholars who were all subsequently hired into data science jobs. The third cohort is nearing completion and will graduate two additional scholars. For more information about CBDS+, please visit https://www.clouddatascience.org.


Advances in Data-Driven Materials Science and Engineering

David Elbert, Hopkins Extreme Materials Institute

Materials science and engineering is in the midst of a revolution centered on leveraging Big Data from diverse sources, both experimental and computational. IDIES has taken on an influential role in accelerating that revolution by providing tools, infrastructure, and experience that integrate data-driven science into the materials discovery and design process.

The Center for Materials in Extreme Dynamic Environments (MEDE) is a multi-institution collaborative research alliance led by Johns Hopkins University and the Army Research Laboratory (ARL). The MEDE Data Science Cloud (MEDE-DSC) provides a customized environment giving MEDE immediate access to SciServer’s tools. The collaboration of David Elbert’s group with Dr. Brian Schuster at ARL has developed transformative approaches to extracting complex information from time-resolved, radiographic imaging of ballistic impacts. These experiments produce complex images of deformed projectiles and targets on a microsecond timeframe (Figure 1). Experimental evidence is a central driver of simulations to construct mechanistic understanding of such extreme deformation. Our collaboration has developed effective training of machine learning models combining transfer learning, simulated radiographs, and manually labeled experimental results (Figures 2A and 2B). By combining automated image processing and GPU-trained CNN machine learning models, a full day of data analysis has become instantaneous, providing a realistic path to the understanding of lightweight, protective materials like boron carbide (Figure 2C).

PARADIM is an NSF Materials Innovation Platform at Cornell, Johns Hopkins, and Clark Atlanta Universities. PARADIM is dedicated to accelerating the creation of novel interface materials with unprecedented properties for the next generation of electronic devices. IDIES has partnered with the PARADIM Data Collective (PDC) to develop a new data model that captures the complex processes of materials design and realization in a diverse, distributed facility. The data model will tie together work within individual projects while facilitating investigation of broader questions to guide future work. PARADIM’s development of a fully encrypted, streaming data pipeline is a highlight of this year’s work and an important example of the flexibility IDIES provides to allow creation of innovative ways to attack data-rich problems. The PDC now combines streaming data in an Apache Kafka-based pipeline hosted and managed by IDIES. Current data streams from PARADIM’s floating zone furnace lab allow deployment of the PDC machine learning model for melt-zone geometry during crystal synthesis. By deploying ML in a streaming platform, we can now provide real-time insight into the complex parameter space of single-crystal synthesis and give PARADIM users expertise previously gained only by years of experience. As the PDC streaming data pipeline grows, experimental data from Cornell and computational results from MARCC will all flow through one logical unit for efficient, flexible application to all phases from project planning to data analysis and publication.

Figure 2: A. Synthetic radiograph calculated with the Tonge-Ramesh model provides initial CNN training. B. Manually labeled, experimental radiographs are critical for accurate model training. C. Deployment of the ML model provides rapid, accurate capture of projectile physics across the full deformation event.

Figure 1. High-speed radiographs showing projectile (red arrow) impact between six and 41 microseconds after firing. Blue arrow indicates the boron carbide target. Note the initial extreme response of the projectile and the subsequent deformation and onset of destruction of the boron carbide.


ANNOUNCEMENTS

Eight IDIES members receive 2019 JHU Discovery Awards

IDIES member recipients include Dennice Gayme, Joshua Vogelstein, Rajat Mittal, Jung Hee Seo, Tim Mueller, Ben Langmead, Brian Caffo, and Johnathon Ehsani.

Congratulations to Gregory Eyink for winning The Simons Foundation Award

With $4 million in support from the New York-based Simons Foundation, an international team of researchers including Gregory Eyink of Johns Hopkins Engineering’s Department of Applied Mathematics and Statistics is embarking on a study aimed at understanding fluid turbulence from a physics perspective.

Congratulations to Rajat Mittal on receiving a Human Frontier Science Program Research Grant

Rajat Mittal received a Human Frontier Science Program Research Grant for his work studying the communicative properties of mosquito movements. Founded by a collective of scientists from around the world, the selective Human Frontier Science Program awards grants to international teams in an effort to “combine their expertise in innovative approaches to questions that could not be answered by individual laboratories.”

Congratulations to Mark Robbins on his award of the 2019 Simons Fellowship in Theoretical Physics

Mark Robbins was awarded a 2019 Simons Fellowship in Theoretical Physics. The Simons Fellows program extends academic leaves from one term to a full year, enabling recipients to focus solely on research for the long periods often necessary for significant advances. Mark is one of only nine theoretical physicists in the US to receive this honor this year.

Congratulations to Brice Menard on winning the Johns Hopkins University President’s Frontier Award

Brice Menard was awarded the President’s Frontier Award in recognition of his exceptional ability to develop new and potentially far-reaching approaches to big data sets. The award includes $250,000 to support his work analyzing very large astronomical data sets to make new discoveries about our galaxy and the universe beyond.

Congratulations to Sarah Preheim for winning the Johns Hopkins Catalyst Award

Sarah Preheim was one of the thirty-three faculty members selected to receive the Johns Hopkins Catalyst Award, an honor that is accompanied by a $75,000 grant, mentoring opportunities, and institutional recognition.

The Platform for the Accelerated Realization, Analysis, and Discovery of Interface Materials (PARADIM) used SciServer infrastructure for two educational events this year.

PARADIM worked with the National Institute of Standards and Technology (NIST) to host the first 2D Data Framework data workshop to train 26 graduate students and postdocs in methods of data science. The PARADIM 2019 Summer School on Materials Growth and Design focused on Discovery in the Era of Big (Materials) Data with sessions on data wrangling, visualization, and machine learning in a laboratory setting.


2019 IDIES ANNUAL REVIEW | NEWS

3D Visualization Hackathon, 2019

On the weekend of January 25th-26th, 2019, IDIES brought together researchers from around JHU and the world at our second annual Visualization Hackathon.

The 2019 edition of the hackathon featured around 20 researchers from many different JHU schools and departments, as well as astronomers from the Space Telescope Science Institute (STScI). The event was hosted by IDIES at the JHU Department of Physics and Astronomy.

The weekend started with attendees meeting each other and describing their research interests. The organizers then gave a quick introduction to IDIES’s flagship SciServer website (www.sciserver.org). SciServer, a science platform to share, visualize, and analyze big data online, served as the primary online space for attendees to build their research collaborations for the hackathon.

Next, attendees heard presentations about ongoing research projects being developed by IDIES affiliates using SciServer. Jordan Raddick presented a collaboration with JHU’s 21st Century Cities initiative looking at patterns of small business lending in Baltimore. Nick Carey, a researcher in the Department of Computer Science and IDIES affiliate, presented work analyzing EEG brain scans using neural nets.

Attendees then divided into three interdisciplinary teams, each with an exciting project. Each team spent the next day and a half working on their projects, in close proximity to SciServer researchers and developers, who served as a real-time helpdesk. At the end of the hackathon on Saturday afternoon, all three teams presented their work to the whole group of attendees.

One group, led by a JHU computer science student, used SciServer’s Recount project (http://sciserver.org/datasets/#recount) to create a Chaos Game representation of published RNA sequences, allowing for easier visualization of complex sequencing data.
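For readers unfamiliar with the technique, the following is a minimal sketch of a Chaos Game representation, not the team's actual code. It assumes one common convention in which each nucleotide is assigned a corner of the unit square and each step moves halfway from the current point toward the corner of the next nucleotide. The corner assignment and the function name `chaos_game_points` are illustrative assumptions.

```python
# Corner assignment is an assumed convention, chosen for illustration only.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def chaos_game_points(sequence, start=(0.5, 0.5)):
    """Return one (x, y) point per nucleotide of an RNA (or DNA) sequence.

    Each step moves halfway from the current point toward the corner
    assigned to the next nucleotide. T is treated as U; any other
    symbol (for example N) is skipped.
    """
    x, y = start
    points = []
    for base in sequence.upper().replace("T", "U"):
        if base not in CORNERS:
            continue  # skip ambiguous or unknown symbols
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for base, point in zip("GAUC", chaos_game_points("GAUC")):
        print(base, point)
```

Plotting the returned points as a scatter plot gives the familiar fractal-like image, in which the density of points in each sub-square reflects how often the corresponding short subsequence occurs.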

Another group, a collaboration between researchers from JHU’s Bloomberg School of Public Health and Space Telescope Science Institute, looked at ways to apply methods from one field to problems in the other. They studied how two-point correlation techniques from astronomy could predict smoking-related morbidity and mortality, and how clustering analysis techniques from public health could shed light on the structure and evolution of galaxies.
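As background on the astronomy side of this exchange, a two-point correlation function measures how much more often pairs of points occur at a given separation than they would in an unclustered sample. The sketch below uses the simple "natural" estimator, DD/RR - 1, on 2D points. It is an illustration under stated assumptions: the function names are hypothetical, and the report does not describe the estimator or data the team actually used.

```python
import itertools
import math


def pair_counts(points, edges):
    """Count point pairs whose separation falls in each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for p, q in itertools.combinations(points, 2):
        d = math.dist(p, q)
        for i in range(len(counts)):
            if edges[i] <= d < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def two_point_correlation(data, randoms, edges):
    """Natural estimator: normalized DD / RR - 1 in each separation bin.

    `data` holds the observed positions; `randoms` holds an unclustered
    comparison sample covering the same region. Bins with no random
    pairs are returned as NaN.
    """
    dd = pair_counts(data, edges)
    rr = pair_counts(randoms, edges)
    norm = (len(randoms) * (len(randoms) - 1)) / (len(data) * (len(data) - 1))
    return [norm * d / r - 1.0 if r else float("nan") for d, r in zip(dd, rr)]
```

Values above zero in a bin indicate an excess of pairs at that separation (clustering), values below zero a deficit. In a public health setting the points might be case locations, with the comparison sample drawn from the underlying population.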

The winning group, consisting of JHU undergraduates Ronan Perry and Darius Irani, looked at the frequency of calls to Baltimore City’s 911 emergency service. After geolocating the calls and plotting them on a map of the city, they compared the geographic distribution of 911 calls to police arrests, and grouped the data into Baltimore’s 200+ traditional neighborhoods. For their efforts at the hackathon, both received a $25 Amazon.com gift certificate.

For the first time, the IDIES hackathon featured a parallel session, organized at the same time by former IDIES researcher Mubdi Rahman at the Dunlap Institute of Astronomy and Astrophysics at the University of Toronto. There, Rahman helped several hackathon teams complete equally impressive projects. All the research teams at both sites produced outstanding results, and all attendees left with new research skills to engage with a wider variety of data-intensive projects.


IDIES Member, Greg Eyink: Untangling Fluid Turbulence

With $4 million in support from the New York-based Simons Foundation, an international team of researchers including Gregory Eyink of Johns Hopkins Engineering’s Department of Applied Mathematics and Statistics is embarking on a study aimed at understanding fluid turbulence from a physics perspective.

“Our hope is that this collaboration will help to generate fundamental new points of view and progress on understanding turbulence,” said Eyink.

The Simons support comes in the form of grants to four groups, each spearheaded by a researcher with distinct but overlapping expertise in fluid turbulence: Nigel Goldenfeld at the University of Illinois Urbana-Champaign; Björn Hof of Austria’s Institute of Science and Technology; Gregory Falkovich of Israel’s Weizmann Institute; and Eyink at Johns Hopkins. Work began on September 1st.

“Most fluids encountered in our everyday life are in a turbulent state: wakes behind speeding vehicles, pots of boiling water on a stove, or oceans stirred by wind currents and tides,” explains Eyink. “The enhanced dissipation from turbulent drag wastes huge amounts of energy, and engineers want to find ways to reduce it. The enhanced mixing of water density by deep ocean turbulence plays a crucial role in the global circulation and Earth’s climate. Our project will focus, in particular, on fluid flows interacting with walls, boundaries and obstacles, which is the most common form of terrestrial turbulence.”

The Simons Foundation’s mission is to advance the frontiers of research in mathematics and the basic sciences, and its Mathematics and Physical Sciences (MPS) division supports research in mathematics, theoretical physics and theoretical computer science by providing funding for individuals, institutions and science infrastructure.

“Our aim is to interact with and complement the existing efforts in this area of both engineers and mathematicians,” Eyink said. “This is especially the case here at JHU, where there is an extremely strong tradition of turbulence research, including our own Turbulence Database Group, which is a multi-department effort involving mechanical engineering, applied mathematics, computer science, and physics.”


2019 IDIES ANNUAL REVIEW | IDIES

Seed Fund Updates, Spring 2019

Each spring, the IDIES Seed Funding Program invites proposal submissions for $25,000 awards for Big Data pilot or seed projects. The goal of the Seed Funding initiative is to provide funding for data-intensive computing projects that (a) will involve areas relevant to IDIES and JHU institutional research priorities; (b) are multidisciplinary; and (c) build ideas and teams with good prospects for successful proposals to attract external research support by leveraging IDIES intellectual and physical infrastructure. Traditionally, IDIES awards four to five projects annually; however, in 2019 the JHU School of Medicine generously funded two additional awards. We would like to thank the JHU School of Medicine, and all of our sponsors, for their support of this worthwhile program.

Use of Whole Exome Sequencing to Find and Test Novel Candidate Genes in Very Early Onset Inflammatory Bowel Disease

Janet Markle (Department of Molecular Microbiology and Immunology, Johns Hopkins Bloomberg School of Public Health), Anthony Guerrerio (Pediatrics, Johns Hopkins School of Medicine)

This project aims to uncover genetic and immunological drivers of disease pathogenesis in children with a rare and devastating disease, very early onset inflammatory bowel disease (VEOIBD). This project involves the in-depth analysis of whole exome sequencing (WES) data generated using genomic DNA from a unique cohort of children with VEOIBD.

WES captures and sequences the full protein-coding portion of the human genome; therefore, this approach can be used to discover extremely rare and highly deleterious genetic variants that genome-wide association studies and other SNP-based approaches have missed. Thirty-nine families have been recruited to this study, and we have analyzed data from approximately 20 families to date (some analyses are still in progress) using a bioinformatic pipeline that was customized to detect single-gene inborn errors of immunity and implemented as a web-based application in the PI’s lab.

We are also working to publish and freely share this web-based tool so that other researchers may use it to aid and expedite their analyses of WES data generated from cohorts of patients with other monogenic diseases. To date, we have identified 3 candidate genes that fulfill our filtering criteria for potential disease causality.

To further test the role of these rare genetic variants in VEOIBD, we are now in the process of cloning and over-expressing the WT and mutant alleles of each gene in cellular assays to interrogate expression and functional consequences of each mutation. These genes include a chemokine (CCL25), an inflammasome sensor (NLRP2), and an inflammasome adaptor protein (ASC). All of these genes have known roles in the immune system but are not currently known to underlie VEOIBD.

These preliminary data suggest that we may have identified novel genetic lesions that directly impact the immune-mediated pathology that drives VEOIBD. Further molecular and cellular immunology experiments are needed to test this idea, and additional WES analyses are expected to reveal additional novel candidate genes.


The History of Meter and the History of English Grammar

Christopher Cannon (Departments of English and Classics), Mark Patton (Digital Research and Curation Center)

The history of the verse rhythms in English poetry before 1500 has been difficult to write, because we cannot tell from the way poetry was written down how it sounded. This is a long-standing problem, and Geoffrey Chaucer is the central figure in this story, since much turns on the fate (the sounding or silence) of what is usually called ‘final -e’ in his writing. To give one example, such -e can be seen at the end of the first line of fig. 1 (‘soote’ which means ‘sweet’). A scribe could just as easily have written ‘soot’ (rhyming with ‘root’ in the next line). Fig. 1 is an image from the beginning of one of the best (and one of the earliest) manuscript copies of Chaucer’s magnum opus, the Canterbury Tales. The -e’s I have just pointed to do not matter to Chaucer’s meter, but the excellent scribe who wrote this still sometimes seems to leave out -e’s we suspect were there when Chaucer composed his verse because they are necessary to make the rhythm of his poetry regular. We suspect, for example, that this is the case for ‘half’ where you see the arrow pointing in fig. 1.

For the verse rhythm of this line to be regular, we would expect ‘halfe’ so that the line can be read with a regular alternation of weak and stressed syllables with the ‘-e’ pronounced as a sound sort of like ‘uh’. This is, then, the regular rhythm we would expect (with unstressed syllables marked ‘x’ and stressed syllables marked ‘/’):

x / x / x / x / x /
Hath in the Ram his halfe cours yronne,

Without the -e in ‘halfe’ two stressed syllables would bump into one another:

x / x / x / / x /
Hath in the Ram his half cours yronne,
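The difference between the two scansions can be checked mechanically. The sketch below is only an illustration of the idea, using the ‘x’ and ‘/’ notation above; the function names are hypothetical and it is not the project's tagging software. It tests whether a scansion alternates strictly between unstressed and stressed syllables and reports where two stresses collide.

```python
def is_regular(scansion):
    """True if the marks strictly alternate, starting unstressed: x / x / ..."""
    marks = scansion.split()
    return all(mark == ("x" if i % 2 == 0 else "/") for i, mark in enumerate(marks))


def stress_clashes(scansion):
    """Return the 0-based positions of stressed syllables that directly follow another stress."""
    marks = scansion.split()
    return [i for i in range(1, len(marks)) if marks[i] == "/" and marks[i - 1] == "/"]


if __name__ == "__main__":
    with_final_e = "x / x / x / x / x /"   # reading 'halfe'
    without_final_e = "x / x / x / / x /"  # reading 'half'
    print(is_regular(with_final_e), stress_clashes(with_final_e))
    print(is_regular(without_final_e), stress_clashes(without_final_e))
```

For the reading with ‘halfe’ the line is regular and has no clashes; for the reading with ‘half’ the check fails, and the clash is reported at the syllable ‘cours’.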

But how to know if Chaucer meant to have that -e there? Most editors of Chaucer and historians of English have relied on manuscript copies and the preponderance of spellings for a given word, as well as historical theories about what is ‘normal’ in Chaucer’s grammar. In fact, ‘half’ could have an ‘-e’ historically, but almost all editors have deferred to the Ellesmere manuscript and printed ‘half’ and, thus, an irregular line, in the first sentence of Chaucer’s masterpiece.

This project seeks to define new norms for Chaucer’s grammar and verse rhythm digitally. Using a database created by Larry Benson 20 years ago which tagged every word in Chaucer’s writing for its grammatical form, the PI and a post-doc assistant are now tagging every word with its metrical profile, assuming that Chaucer’s verse is always metrically regular. After a ground truth has been established by scanning 3000-5000 lines of the 44,000 lines of Chaucer’s verse, we will work with computer scientists to devise a method for machine learning to complete the analysis. Combining the tagged text and the metrically correct text for all of Chaucer’s verse will make it possible to see what his grammar actually looked like (where did the -e’s belong, in which word forms, and in what positions). We also have a secondary data set (all of the English poems of John Gower, Chaucer’s contemporary, also tagged grammatically) which we will then scan as a control (checking to see how normative Chaucer’s grammar was). Since the late fourteenth century is the moment when English writers (such as Chaucer) were suddenly aware that English had a grammar, and Chaucer seems to have invented the English meter largely employed for the next 500 years, no turning point could be more significant for establishing these norms. It is not too much to say that by knowing the status and fate of those crucial final -e’s, in every case, both the history of English meter and the history of English grammar can be rewritten.

Fig. 1 San Marino, California, The Huntington Library, MS EL 26 C 9 (‘The Ellesmere Manuscript’), fol. 1r.


Fig. 1. StethoVest: a prototype wearable phonocardiographic (PCG) system for multisite recording of heart sounds and ECG.

Fig. 2. Heart sound (systolic valve murmur) simulated via computational hemoacoustic modeling. Left: sound signals on the chest surface. Right: time-frequency spectrogram of the murmur associated with a diseased heart valve.

Mapping the Cardiac “Acousteome” - Advanced Sensors, Modeling and Data Enabled Science

Rajat Mittal (Department of Mechanical Engineering), Jung-Hee Seo (Department of Mechanical Engineering), Andreas Andreou (Electrical and Computer Engineering), Reid W. Thompson (Department of Pediatrics, Division of Pediatric Cardiology), and Christos Sapsanis (Electrical and Computer Engineering).

The wearable sensor and tele-health revolutions have arrived. Wearable sensors are now able to automatically track and record our movements, exercise levels, pulse rate, O2 saturation, sleep and respiration rates. Longitudinal monitoring of this health data has the potential to revolutionize the management of health and wellness. Interestingly, however, this automated health monitoring revolution seems to have bypassed the one modality that has been, and continues to be, the mainstay for health monitoring: auscultation with a stethoscope! In our view, the primary reason for this void is the following: since heart sounds propagate through the thorax to the chest surface, the measurement is significantly affected by body habitus and gender, and we currently do not understand how to compensate for these factors.

Recently, our team has made significant progress in addressing the measurement issue by developing and testing the “StethoVest” (see Fig. 1), a prototype wearable phonocardiographic (PCG) system for simultaneous, multisite recording of heart sounds and ECG. The simultaneous multisite recordings provide redundancy to overcome loss or sub-optimality of signal from any sensor, and the system generates 5-dimensional (3D in space, time and frequency) maps of heart sound/vibration patterns, which can be used for source localization and identification. However, the effects of body habitus on measurements and the meaningful analysis of the complex signals remain open issues and are the focus of our team.

We are employing a unique approach to addressing the issue of body-habitus effects on phonocardiographic measurements: data-enabled analysis with in-silico biophysical models of virtual populations. This study employs high-fidelity computational hemoacoustic modeling (see Fig. 2) for the direct simulation of heart sound generation and propagation to quantify and characterize the effect of body habitus on the heart sounds. The data from these simulations will be compared to the StethoVest phonocardiographic measurements of patients. We will also consider redesigning the StethoVest (i.e., sensor placement) based on the findings of this study regarding the effect of body habitus on measurement accuracy.


China is Talking the Talk but Not Walking the Walk on Its Methane Emissions Regulations

Scot Miller (Whiting School of Engineering, Department of Environmental Health and Engineering), Darryn Waugh (Krieger School of Arts and Sciences, Department of Earth & Planetary Sciences)

China is the world’s largest emitter of human-caused greenhouse gases, notably emitting the most carbon dioxide and methane. Methane, in particular, is emitted from a wide variety of sources, both natural and anthropogenic, and yet coal mining in China is likely responsible for a plurality of the country’s methane emissions. Methane forms in coal seams over geological time scales, and is released to the atmosphere when the coal seam is mined. This methane is not only a potent greenhouse gas, but also a lost resource. Methane is the main component of natural gas, and methane can be used to generate electricity or heat buildings. Methane also poses a grave safety danger in coal mines, and causes explosions that kill thousands of Chinese coal workers per year. In response to these concerns, China passed ambitious coal mine methane regulations beginning in 2006 requiring that mine operators drain methane from coal mines and capture that methane such that it can be used to generate electricity or heat buildings (e.g., IEA 2009). These regulations should have curbed the country’s emissions.

However, methane emissions can be notoriously difficult to track. Traditionally, government agencies in China, the US, and elsewhere have used an accounting-type approach to inventory national emissions: count up the total amount of coal produced and multiply that number by the estimated amount of methane leaked per ton of coal mined. This inventory approach to greenhouse gases often undercounts emissions. Inventories can overlook misbehaving actors or make overly optimistic assumptions about policy or technology improvements.
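The accounting-type approach described above amounts to a single multiplication per source, summed over sources. The sketch below illustrates it with entirely hypothetical numbers and names (`REGIONS`, `inventory_estimate`, and `scale_factors` are assumptions for illustration, not values or tools from any real inventory). It also shows why an optimistic leak factor translates directly into an undercount.

```python
def inventory_estimate(regions):
    """Sum, over regions, coal mined (tonnes) times methane leaked per tonne of coal."""
    return sum(r["coal_tonnes"] * r["ch4_per_tonne"] for r in regions)


def scale_factors(regions, multiplier):
    """Return a copy of the regions with every leak factor multiplied by `multiplier`."""
    return [{**r, "ch4_per_tonne": r["ch4_per_tonne"] * multiplier} for r in regions]


# Hypothetical inputs, for illustration only.
REGIONS = [
    {"name": "region_a", "coal_tonnes": 2_000_000, "ch4_per_tonne": 0.010},
    {"name": "region_b", "coal_tonnes": 500_000, "ch4_per_tonne": 0.004},
]

if __name__ == "__main__":
    reported = inventory_estimate(REGIONS)
    # Suppose mines actually leak 50% more methane per tonne than the inventory assumes.
    actual = inventory_estimate(scale_factors(REGIONS, 1.5))
    print(f"reported: {reported:.0f} t CH4, actual: {actual:.0f} t CH4")
```

Because the estimate is linear in both inputs, any error in the production statistics or in the assumed leak rate carries through to the total unchanged, which is why an independent check on the result is valuable.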

Furthermore, China has recently made large revisions to its coal consumption statistics, further complicating efforts to estimate the country’s greenhouse gas emissions. For example, in 2015, the Chinese government revealed that the country had been burning 17% more coal per year than previously disclosed, equivalent to 70% of the total amount of coal mined in the US (e.g., Buckley 2015).

Fortunately, observations of methane from satellites can provide an independent means to evaluate emissions. GOSAT, a Japanese satellite, was one of the first satellites to observe methane and CO2 with sufficient accuracy to evaluate emissions of these gases. This satellite launched in 2009 and now provides a decade-long record of observations to estimate trends in emissions from countries like China. In a recent study, we combined observations of atmospheric methane levels collected from GOSAT with atmospheric and statistical modeling to estimate surface methane emissions (Miller et al. 2019). Using GOSAT, we found that methane emissions from

(Continued on the bottom of page 18)

FIGURE 1: Observations of atmospheric methane from the TROPOMI sensor on the Sentinel-5 Precursor satellite, averaged over April 2019. The observations show clear methane enhancements in oil and gas basins, coal mining regions, and highly populated regions of China (Figure credit: Leyang Feng, Johns Hopkins PhD student).


Expanding Data-intensive Teaching at Johns Hopkins University by Hosting the Practical Genomics Workshop on SciServer

Sarah Wheelan (Oncology, Johns Hopkins School of Medicine), Jai Won Kim (Institute for Data Intensive Engineering and Science, Department of Physics and Astronomy), Jonathan Pevsner (Neurology, Kennedy Krieger Institute and Johns Hopkins School of Medicine), Luigi Marchionni (Oncology, Johns Hopkins School of Medicine), Frederick Tan (Bioinformatics, Carnegie Institution)

Biological and medical research are advancing rapidly due to the continued development of DNA sequencing and related technologies, but the educational opportunities to acquire the skills needed to analyze these data have not kept pace with demand. The Practical Genomics workshop is a genomics training opportunity offered every summer since 2011 by a team of faculty from Johns Hopkins University and the Carnegie Institution for Science, with the goal of broadening the analysis expertise of non-computational scientists. Each summer, 50-60 graduate students, postdocs, and faculty arrive at the Carnegie Institution, Department of Embryology to spend four days receiving instruction in genomic data analysis.

Through this Seed Fund opportunity, the Practical Genomics faculty team are partnering with SciServer (sciserver.org) to “bring the analysis to the data” and transfer our training exercises to the SciServer compute environment. SciServer is a cloud-based science platform which, among other capabilities, allows server-side analysis using R and Python via the rich RStudio integrated development environment, along with access to preloaded software, curated datasets, specialized hardware (e.g. GPUs), and private storage space.

Partnering with SciServer will enable us to use realistic datasets and complex analyses in our training exercises, rather than small “toy” datasets that can be analyzed on participant laptops. It will introduce participants to SciServer, expanding the SciServer user community. It will also reduce the technical burden for both participants and instructors by allowing the analysis to take place on SciServer, rather than installing the software on participants’ laptops. This will minimize time spent on installation and will ensure participants are all operating within the same computational environment, removing difficulties arising from platform incompatibilities and software upgrades.

Realistic Data Analysis

Data
Before SciServer:
• 9 samples
• 2 modalities
• Subset to a single chromosome
After SciServer:
• 200 samples
• 4 modalities
• Full genome

Analysis
Before SciServer:
• Integration limited to overlapping final peak calls and gene sets
After SciServer:
• Integration performed across multiple modalities simultaneously
• Integration involves relatively unprocessed data

Reduced Technical Burden

Participants
Before SciServer:
• Install R, RStudio, R packages on laptops
• Download data files
After SciServer:
• Log in to SciServer

Instructors
Before SciServer:
• Support wide range of hardware and operating systems
• Distribute files and collect participant work via email, downloads, and screenshots
After SciServer:
• Support a single, customizable SciServer Compute container
• Distribute files and collect participant work via SciServer’s Courseware app

As a first step, we transferred existing lessons and exercises to SciServer. During our recent PG 2019 workshop, a workshop TA carried out the exercises on SciServer to identify any potential problems and ensure that they would function well in the workshop environment. We are pleased to report that SciServer made this an incredibly smooth process. SciServer allows processing of pre-loaded data and R scripts with little hindrance, and supports users in creating custom SciServer images complete with the R packages required for their specific data analysis needs. SciServer compute containers can continue running for up to 3 days if the user logs off, so analyses are not disrupted if the connection is dropped, and longer jobs can run while users shut down their computer or

(Continued on the bottom of page 19)

Figure 1 Advantages of using SciServer in the Practical Genomics workshop


(Continued from page 16, China is Talking...)

(Continued on the next page)

China have continued to increase along a business-as-usual trajectory, in spite of China’s ambitious coal mine methane regulations. We found no evidence that these regulations have had an impact on the country’s emissions. Furthermore, we found that the largest increases in the country’s emissions were in coal-mining regions.

Satellite-based monitoring of greenhouse gases has been improving by leaps and bounds over the past decade, and newly-launched satellites hold particular promise for tracking methane emissions. For example, the Sentinel-5 Precursor satellite launched by the European Space Agency observes methane and other gases with a far greater resolution and data density than any existing satellite. This satellite should provide an unprecedented window into methane emissions across the globe, particularly from difficult-to-monitor countries, and should allow the scientific community to help pinpoint which emissions sources are driving recent global increases in methane. However, these new satellites also bring an unprecedented “big data” and computing challenge. Satellites like the Sentinel-5 Precursor collect millions of observations per year, a bottleneck for statistical and atmospheric models that were originally designed for small numbers of ground-based observations. Hence, these new satellite observations could hold the key to understanding global methane emissions but necessitate innovative approaches to big data in order to crack this greenhouse gas puzzle.

Predicting Morphogenesis: Understanding the Role of Cell-to-Cell Variation in Collective Gradient Sensing

Brian Camley (Department of Physics & Astronomy; Department of Biophysics, Johns Hopkins University Krieger School of Arts and Sciences), Andrew Ewald (Department of Cell Biology; Departments of Oncology and Biomedical Engineering, Johns Hopkins University School of Medicine)

In developing organisms, groups of cells work together to sense chemical signals, sharing information to make measurements more precisely than any single cell can alone. In particular, cells can work together to understand a directional cue, like a signal that is graded over space, something we call “collective gradient sensing.” However, when groups of cells share information, not all cells are equally reliable. In fact, cells are highly variable, with different responses to signals and different protein concentrations for each cell. In the past, the Camley group made predictions for how a group should best combine the information from its variable cells in order to measure a graded signal.

The Camley and Ewald groups are currently working together to test this idea by studying an extreme example of variable cells – mammary organoids made of a mixture of active cells (which always believe they see a signal) and normal cells. Over time, these organoids develop branches, as during normal mammary development; we also know from earlier experiments by the Ewald group that these mammary organoids cooperate to sense gradients.

Our goal is to use the location of the active cells to predict the location of the branches (Fig. 1A), using the model arising from the Camley group’s predictions on collective sensing. We will also extend our results to infer which cells are most important directly from experimental data, allowing us to characterize whether cells near the core or periphery of the organoid play a larger role.

We have been able to generate preliminary data showing the time course of branching in these mosaic organoids (Fig. 1B), and are now accumulating data to test the predictive theory. We have also been developing new Bayesian tools to infer the underlying rules of how branching arises from the experimental data. Understanding how the pattern of activity is translated into branching will allow us to better understand how chemical signals are integrated across a group of cells.

Figures 1A and 1B


(Continued from page 17, Expanding Data Intensive...)

use it for other tasks.

We are now updating our lessons to take advantage of SciServer’s computational capacity and allow participants to engage with a realistic dataset using cutting-edge multi-factor analysis techniques. Modern genomic research is increasingly reliant on integrating multiple modalities of data (genome sequencing, RNA-seq, ChIP-seq, DNA methylation, chromatin capture, etc.) to capture insights only accessible through examining the interactions between these biological phenomena. Previously, our lessons have been constrained by the need for analyses to be performed in a reasonable time on a wide range of participant laptops. Consequently, we limited ourselves to a small (9-sample) dataset, and, for portions of the workshop, we reduced our analysis to a single chromosome (chr 20, which contains only ~2% of the human genome). We also included only limited comparisons between two data types (RNA-seq and ChIP-seq). We are now working on using SciServer to incorporate Multi-Omics Factor Analysis of a chronic lymphocytic leukemia dataset, following the example of Argelaguet et al. (2018). This analysis encompasses 200 samples with four different types of data (genome sequence, RNA-seq, DNA methylation, and drug response) and will allow participants to engage with the complexities of a real-world dataset.

Procuring grant support for the workshop’s faculty effort has been challenging, as most opportunities require substantial novelty, and teaching bioinformatics is not inherently novel. Incorporating SciServer is likely to boost our chances of winning funding, as the platform is novel and makes the workshop much more portable. The 2020 workshop will feature the work described here; we are excited about the impact that this will have on attendees and teachers alike.

(China is Talking...)

REFERENCES:

Buckley, C. (Nov 4 2015). China Burns Much More Coal Than Reported, Complicating Climate Talks. New York Times, A1, https://www.nytimes.com/2015/11/04/world/asia/china-burns-much-more-coal-than-reported-complicating-climate-talks.html.

International Energy Agency (2009). Coal mine methane in China: a budding asset with the potential to bloom. IEA information paper, https://www.iea.org/publications/freepublications/publication/china_cmm_report.pdf. Last access: 19 Mar 2018.

Miller, S. M., Michalak, A. M., Detmers, R. G., Hasekamp, O. P., Bruhwiler, L. M. P., & Schwietzke, S. (2019). China’s coal mine methane regulations have not curbed growing emissions. Nature Communications, 10(1), 303, doi:10.1038/s41467-018-07891-7.

Figure 2. A Practical Genomics lesson on SciServer.

Urban Spaces in Baltimore: Data Science in the City

On August 27, 2019, IDIES partnered with 21CC and the Carey Business School to host an inaugural symposium, “Urban Spaces in Baltimore: Data Science in the City,” at the Carey Business School. The organizing committee, consisting of IDIES faculty members (Tamas Budavari, Katalin Szlavecz, Benjamin Zaitchik), the 21CC Director (Matthew Kahn), and Baltimore City Housing officials (Michael Braverman, Robert Pipik), set out to organize an event that would bring together academic researchers and city officials with overlapping interests to create new, data-science-focused research collaborations that address city challenges.

The symposium kicked off with keynote addresses from Matthew Kahn (21CC Director, Bloomberg Distinguished Professor, JHU) and Alan Mallach (Senior Fellow, Center for Community Progress). Professor Kahn spoke about the increasing availability and importance of data that allow researchers to investigate the host of challenges and opportunities facing cities, and about the role he sees for 21CC at JHU in using these data effectively toward its goal of improving the quality of life and the social and economic outcomes of disadvantaged citizens of Baltimore and cities around the world. In his keynote, Mr. Mallach discussed the challenges and complications of creating research/practice collaborations that benefit both researchers and practitioners, and suggested some ideas to bridge that gap.

“The onus is really on the research and academic community,” he said, “to reach out to the world of practice and demonstrate that they can be useful and productive and trusted.”

The morning was rounded out by short talks from Jacky Jennings (Associate Professor, Pediatrics, JHU), Natalie Exum (Assistant Scientist, Environmental Health & Engineering, JHU), Lisa McNeilly (Director, Baltimore Office of Sustainability), and, jointly, Tamas Budavari (Associate Professor, Applied Math & Statistics, JHU) and Michael Braverman (Baltimore Housing Commissioner), which addressed the four themes of the symposium: health, infrastructure, green spaces, and occupancy & vacancy, respectively. Each talk described current research projects that highlight productive research partnerships between JHU and Baltimore City. The final talk, given by Professor Budavari and Mr. Braverman, focused in particular on what their collaboration has taught them about the expected benefits and outcomes, and the practical realities, of establishing and managing an ongoing relationship between city government and academia.

In the afternoon, over 100 symposium attendees had the opportunity to share their energy and interact with a diverse group of fellow attendees during breakout sessions. Small groups were established to identify areas within their assigned theme where new collaboration would be valuable, and to discuss the opportunities and challenges in creating a successful collaboration for their identified challenges.

Attendees at the Urban Spaces symposium included representatives of various groups within JHU, Baltimore City Government and Departments, Baltimore City organizations and non-profits, MICA, Towson University, and the Maryland State Assembly, as well as several local community members.

IDIES and the organizers were overwhelmed by the energy generated in response to this inaugural Urban Spaces symposium, and look forward to continuing and encouraging efforts toward new collaborations that address the challenges and opportunities facing Baltimore. Be sure to watch your inboxes for news of future events relating to the Urban Spaces efforts!

ABOUT IDIES

Mission Statement
To enhance the JHU mission of “Knowledge for the world” by providing intellectual leadership in data driven science:

• Research and Education in adaptive disruptive technologies
• Technical and domain expert guidance via collaboration and consultation
• Open and sustainable long-term access to high-value datasets

Vision Statement
IDIES will lead the translation of data science & technology to real world problems across the University

Values

Leadership
• Offer expertise on which solution is best for data needs (IDIES, MARCC, SciServer, etc)
• Guide those who want to integrate data science into research (disruptive technologies)

Vision
• Expand thinking – identify what needs to be done and help in the “how”

Incubator
• Foster collaborations across the university
• Build new tools

Agility
• Flexibility to adapt, respond, and integrate new disruptive technologies

Growth
• Continuing education and expertise on new and developing technologies; continually offering researchers guidance and expertise in leading edge technology

Development
• Design tools, approaches, and solutions to solve data problems

Management
• Storage resource
• Enable high performance computing with stable operations

IDIES is always accepting members. For more information about our membership categories, and to join today, please visit idies.jhu.edu/join.

To our generous sponsors

THANK YOU

The IDIES Executive Committee would like to extend our heartfelt gratitude to our affiliates, collaborators, contributors, editors, and staff, without whose continued support and cooperation IDIES would not be possible.

-Sayeed Choudhury, Jaime Combariza, Tara Hentgen, Charles Meneveau, Mark Robbins, Stephen Salzberg, Valerie Suslow, Alex Szalay, and Ani Thakar.

IDIES • Johns Hopkins University • 3400 N. Charles St • Baltimore, MD 21218