
Augusta University

Second Annual Data Science Workshop

Emerging Data Science Methods for Complex Biomedical and Cyber Data

October 14 – 15, 2021

Workshop Venue Georgia Cyber Center Hull McKnight Building, 100 Grace Hopper Lane, Augusta, GA

Banquet on Thursday, October 14, 2021 6:30 – 8:30 PM

The Pinnacle Club, 17th Floor, 699 Broad Street, Augusta, GA


Second Annual Data Science Workshop

Emerging Data Science Methods for Complex Biomedical and Cyber Data

The Division of Biostatistics and Data Science in the Department of Population Health Sciences (DPHS) in the Medical College of Georgia (MCG) and the School of Computer and Cyber Sciences at Augusta University (AU) are organizing this workshop on "Emerging Data Science Methods for Complex Biomedical and Cyber Data". The goal of the two-day workshop is to educate and empower graduate students, postdoctoral fellows, and early-career researchers and faculty members with emerging statistical methods for addressing the complex data arising from various fields, in particular from the biosciences and cyber science.

In the past decade we have witnessed an explosion of complex and big data from various disciplines, social media, cyber traffic, and the environment surrounding us. Data scientists and statisticians are blessed with a variety of data they have never seen before, yet they also face many challenges because of the complexity and massiveness of such data. The goal of this workshop fits society's demand for fostering collaborative research between data science/statistics and other scientific disciplines in order to meet the very hardest and most important data- and model-driven scientific challenges.

The workshop is funded by the National Science Foundation and the Augusta University Research Institute. Co-sponsors of the workshop include the American Statistical Association Georgia Chapter, Caucus for Women in Statistics, Institute of Mathematical Statistics, International Statistical Institute, Joint Committee on Women in the Mathematical Sciences, Journal of Applied Statistics, Southern Regional Council on Statistics, and the Statistical and Applied Mathematical Sciences Institute.

 

About the Department of Population Health Sciences

DPHS, established in 2017, strives to build a transdisciplinary research infrastructure, educational programs, and community partnerships to improve the health of all populations, in line with the 200-year history of MCG’s pursuit of better health. DPHS was shaped by expanding the Department of Biostatistics and Epidemiology with the renewed mission to understand, preserve and improve the health of human populations through research, training and community engagement, especially focusing on the health of Georgians.

The department currently has three divisions. The Division of Biostatistics and Data Science focuses on developing innovative methodologies in quantitative science to advance biomedical research; collaborating with biomedical and public health researchers on study design and data analyses; and training biomedical, public health and quantitative scientists. The Division of Epidemiology is engaged in research and educational endeavors for understanding the relationships between determinants and outcomes of health through epidemiological studies. Health Economics and Modeling, the third division of DPHS, is actively engaged in research related to health economic evaluations, healthcare modeling and health policy. The three divisions, while pursuing distinct but related goals, are well-integrated and work together for the overarching goal of improving the health of populations through research, education and community engagement. Together, we play a vital role in the multidisciplinary and translational research and education mission of MCG and AU, crossing the boundaries of academic colleges.


Second Annual Data Science Workshop

Emerging Data Science Methods for Complex Biomedical and Cyber Data

Planning Committee

Jie Chen, PhD (Co-Chair) Professor and Division Chief of Biostatistics and Data Science Department of Population Health Sciences

Varghese George, PhD (Co-Chair) Professor of Biostatistics and Data Science Chair, Department of Population Health Sciences

Santu Ghosh, PhD Assistant Professor of Biostatistics and Data Science Graduate Program Director of Biostatistics Department of Population Health Sciences

Darius Kowalski, PhD Professor of Computer Science School of Computer and Cyber Sciences

Steven Weldon, MS Director, Cyber Institute, School of Computer and Cyber Sciences

Hongyan Xu, PhD Professor of Biostatistics and Data Science Department of Population Health Sciences


 Emerging Data Science Methods for Complex Biomedical and Cyber Data

Program

 DAY ONE: THURSDAY, OCTOBER 14, 2021  REGISTRATION AND REFRESHMENTS

8:15 am - 9:00 am GCC Lobby

SESSION 1 Chair: Varghese George, PhD

9:00 am - 9:20 am Opening Remarks: Neil MacKinnon, PhD, Provost & Executive VP, AU

9:20 am - 10:10 am Medical Decision Making & Diagnostic Ambiguity in the AI Era Douglas Miller, MD, MBA, Augusta University, Augusta, GA

10:10 am - 10:20 am Q & A

10:20 am - 10:35 am Break GCC Lobby

SESSION 2 Chair: Darius Kowalski, PhD

10:35 am - 11:25 am Deep Learning Data-Driven Approaches for Epidemic Forecasting Aditya Prakash, PhD, Georgia Institute of Technology, Atlanta, GA

11:25 am - 11:35 am Q & A

11:35 am - 12:25 pm Reinforcement Learning with Attrition George Cybenko, PhD, Dartmouth College, Hanover, NH

12:25 pm - 12:35 pm Q & A

POSTER SESSION AND LUNCH

12:35 pm - 2:15 pm GCC Lobby

SESSION 3 Chair: Steven Weldon, MS

2:15 pm - 3:05 pm Innovative Statistical Methods for Manifold Data with Biological Applications Ashis SenGupta, PhD, Indian Statistical Institute, Kolkata, India

3:05 pm - 3:15 pm Q & A

3:15 pm - 4:05 pm Applying Data Science to Detect Malicious User Behavior Moazzam Khan, PhD, IBM Security Systems, Atlanta, GA

4:05 pm - 4:15 pm Q & A

4:15 pm - 4:30 pm Break GCC Lobby

INTEGRATED PANEL DISCUSSION I

4:30 pm - 5:15 pm Panel Discussion Leader: Gagan Agrawal, PhD, Augusta University, Augusta, GA

SOCIAL HOUR & BANQUET The Pinnacle Club 699 Broad Street, 17th Floor

6:15 pm - 7:00 pm Social Hour

7:00 pm - 9:00 pm Banquet Remarks: David Hess, MD, Dean, Medical College of Georgia, AU Alex Schwarzmann, PhD, Dean, School of Comp & Cyber Sci, AU


DAY TWO: FRIDAY, OCTOBER 15, 2021 REGISTRATION AND REFRESHMENTS

8:00 am – 8:30 am GCC Lobby

SESSION 4 Chair: Jie Chen, PhD

8:30 am - 9:20 am Personalized Treatments: Sounds Heavenly, but where did they find my Guinea Pigs?

Xiao-Li Meng, PhD, Harvard University, Cambridge, MA

9:20 am - 9:30 am Q & A

9:30 am - 10:20 am Differential Privacy for Dynamic Databases Rachel Cummings, PhD, Columbia University, New York, NY

10:20 am - 10:30 am Q & A

10:30 am - 10:45 am Break GCC Lobby

FIVE-MINUTE PRESENTATIONS Chair: Santu Ghosh, PhD

10:45 am - 11:45 am Presentations

LUNCH

11:45 am – 12:30 pm GCC Lobby

SESSION 5 Chair: Hongyan Xu, PhD

12:30 pm - 1:20 pm Predicting Disease Risk from Genomics Data Hongyu Zhao, PhD, Yale University, New Haven, CT

1:20 pm - 1:30 pm Q & A

1:30 pm - 2:20 pm A Joint Model for Biomarker Discovery in Heterogeneous Populations Elizabeth Slate, PhD, Florida State University, Tallahassee, FL

2:20 pm - 2:30 pm Q & A

2:30 pm - 2:45 pm Break GCC Lobby

INTEGRATED PANEL DISCUSSION II

2:45 pm - 3:30 pm Panel Discussion Leader: Jennifer Priestley, PhD, Kennesaw State University, Kennesaw, GA

AWARDS AND CLOSING REMARKS

3:30 pm – 3:45 pm Poster/Five-Minute Presentation Awards Workshop Organizing Committee

3:45 pm – 4:00 pm Closing Remarks Workshop Organizing Committee

    


Biography of Featured Speakers  

Rachel Cummings, PhD: Dr. Rachel Cummings is an Assistant Professor in the Fu Foundation School of Engineering and Applied Science at Columbia University. Her research interests lie primarily in data privacy, with connections to machine learning, algorithmic economics, optimization, statistics, and information theory. Her work has focused on problems such as strategic aspects of data generation, incentivizing truthful reporting of data, privacy-preserving algorithm design, impacts of privacy policy, and human decision-making. Dr. Cummings received her Ph.D. in Computing and Mathematical Sciences from the California Institute of Technology, her M.S. in Computer Science from Northwestern University, and her B.A. in Mathematics and Economics from the University of Southern California. She is the recipient of an NSF CAREER award, a Google Research Fellowship for the Simons Institute program on Data Privacy, a Mozilla Research Grant, the ACM SIGecom Doctoral Dissertation Honorable Mention, the Amori Doctoral Prize in Computing and Mathematical Sciences, a Caltech Leadership Award, a Simons Award for Graduate Students in Theoretical Computer Science, and the Best Paper Award at the 2014 International Symposium on Distributed Computing. Dr. Cummings also serves on the ACM U.S. Public Policy Council's Privacy Committee.

 

George Cybenko, PhD: Dr. George Cybenko is the Dorothy and Walter Gramm Professor at the Thayer School of Engineering at Dartmouth. Cybenko has made research contributions in machine learning, information security, and computational and adversarial behavioral analysis. He has advised dozens of PhD students, among them the former CTO of the FBI (Wayne Chung) and the current CISO of Barclays International (David Robinson). Cybenko was the Founding Editor-in-Chief of IEEE Security & Privacy, which is currently the largest professional society publication focused on security. Professor Cybenko is a Fellow of the IEEE, has served on the Defense Science Board and the US Air Force Science Advisory Board, and is presently on the US Army Cyber Institute Advisory Board. Prior to joining Dartmouth, he was Professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Dr. Cybenko received his BS (University of Toronto) and PhD (Princeton) degrees in Mathematics.

Moazzam Khan, PhD: Dr. Moazzam Khan is an IBM software engineer working on the QRadar-based User Behavior Analytics application. Before joining the development role, Moazzam was involved with the Watson for Cyber Security group as a researcher for another QRadar-based application called Watson Advisor. He has also worked with the engineering team for IBM's IPS and IDS solutions such as the G, GX, M and XGS series. He holds a doctorate from the Georgia Institute of Technology in Electrical and Computer Engineering. He regularly writes for SecurityIntelligence.com on topics related to cyber security and data science. He is also an adjunct faculty member at Kennesaw State University. In his leisure time, Moazzam is a tennis aficionado and participates in several Atlanta-based tennis leagues.

Xiao-Li Meng, PhD: Dr. Xiao-Li Meng is the Founding Editor-in-Chief of Harvard Data Science Review and is well known for his depth and breadth in research and his innovation and passion in pedagogy. His interests range from the theoretical foundations of statistical inference (e.g., the interplay among Bayesian, fiducial, and frequentist perspectives; frameworks for multi-source, multi-phase and multi-resolution inferences) to statistical methods and computation (e.g., posterior predictive p-values; the EM algorithm; Markov chain Monte Carlo; bridge and path sampling) to applications in the natural, social, and medical sciences and engineering (e.g., complex statistical modeling in astronomy and astrophysics, assessing disparity in mental health services, and quantifying statistical information in genetic studies). Dr. Meng served as Dean of the Graduate School of Arts and Sciences at Harvard. He was named the best statistician under the age of 40 by COPSS (Committee of Presidents of Statistical Societies) in 2001, and he is the recipient of numerous awards and honors for his more than 150 publications in at least a dozen theoretical and methodological areas, as well as in areas of pedagogy and professional development. He has delivered more than 400 research presentations and public speeches on these topics.


Doug Miller, MD: A medical graduate of McGill University, Dr. Miller was trained in cardiovascular medicine at Emory University and Harvard University. Having published over 200 original papers and book chapters in the field, Dr. Miller enjoys a global reputation as a cardiologist and academic medicine leader. He has served as the Dean of three research-intensive medical schools in the U.S. and Canada, and as a member of their national MD program accrediting bodies. He advises several global health policy and biomedical business organizations. An entrepreneur holding patents in the fields of new drug development, medical imaging and artificial intelligence (AI), his high-impact journal publications and invited lectures make him a leading authority on AI applications to healthcare and medical education.

B. Aditya Prakash, PhD: Dr. Aditya Prakash is an Associate Professor in the College of Computing at the Georgia Institute of Technology (Georgia Tech). He received a Ph.D. from the Computer Science Department at Carnegie Mellon University in 2012, and a B.Tech (in CS) from the Indian Institute of Technology - Bombay in 2007. He has published one book and more than 80 papers in major venues, holds two U.S. patents, and has given several tutorials at leading conferences. His work has also received multiple best-of-conference, best paper and travel awards. His research interests include data science, machine learning and AI, with emphasis on big-data problems in large real-world networks and time series, with applications to computational epidemiology/public health, urban computing, security and the Web. Tools developed by his group have been in use in many places including ORNL, Walmart and Facebook. He has received several awards, such as a Facebook Faculty Award and the NSF CAREER award, and was named one of 'AI Ten to Watch' by IEEE. His work has also won awards in multiple data science challenges (e.g., the Facebook COVID-19 Symptom Challenge) and has been highlighted by several media outlets and popular press such as FiveThirtyEight.com. He is also a member of the MIDAS infectious disease modeling network and core faculty at the Center for Machine Learning (ML@GT) and the Institute for Data Engineering and Science (IDEaS) at Georgia Tech.

Ashis SenGupta, PhD: Dr. Ashis SenGupta is Advisor/Consultant and former Head of the Applied Statistics Unit, Indian Statistical Institute, Kolkata, India. He is also Adjunct Professor, Augusta University, Georgia, USA, and Distinguished Professor, Middle East Technical University, Turkey. After receiving his PhD from Ohio State University, he held visiting professor positions worldwide, including at Stanford University; the University of California, Santa Barbara and Riverside; the University of Wisconsin, Madison; Concordia University, Montreal; the Institute of Statistical Mathematics, Tokyo; and the University of Malaya, Kuala Lumpur. His areas of research include big data analytics, directional statistics in the biosciences, high-volatility probability distributions, and multivariate analysis. He is author/co-author of 12 books and volumes/special issues, including Topics in Circular Statistics (World Scientific) and Probability Distributions on Manifolds (Wiley). His international and national recognitions include two Lifetime Achievement awards and a Distinguished Statistician Award. He is a former Editor-in-Chief of Environmental and Ecological Statistics (Springer). He is an elected member of the International Statistical Institute, a fellow of the National Academy of Sciences, India, a fellow of the Indian Society of Probability and Statistics, and a fellow of the American Statistical Association.

Elizabeth Slate, PhD: Dr. Elizabeth Slate is the Duncan McLean and Pearl Levine Fairweather Professor of Statistics and Distinguished Research Professor in the Department of Statistics at Florida State University. She received her PhD in Statistics from Carnegie Mellon University in Pittsburgh, PA and held faculty positions at Cornell University and the Medical University of South Carolina prior to joining FSU. She has held visiting positions at Stanford University, the Biometry Research Group of the National Cancer Institute in Bethesda, MD, and the Statistical and Applied Mathematical Sciences Institute in Raleigh, NC, and served as the David C. Jordan Visiting Scholar at AbbVie, Inc. At FSU, she directs the program in Statistical Data Science and is involved in several clinical trials, including a SMART study with the FSU Autism Institute. Elizabeth is a Fellow of the American Statistical Association.


Hongyu Zhao, PhD: Dr. Hongyu Zhao is the Ira V. Hiscock Professor and Chair of Biostatistics at Yale University. He received his BS in Probability and Statistics from Peking University in 1990 and his PhD in Statistics from UC Berkeley in 1995. His research interests are the development and application of statistical methods in molecular biology, genetics, drug development, and precision medicine. His current projects include the analysis of biobank samples with genomics, imaging, and wearable-device data, cancer multi-omics data, brain multi-omics data, and single-cell data. Dr. Zhao is a Co-Editor of the Journal of the American Statistical Association – Theory and Methods, and is the recipient of several honors, including the Mortimer Spiegelman Award, given by the American Public Health Association to a top statistician in health statistics, and the Pao-Lu Hsu Prize from the International Chinese Statistical Association.

Biography of Panel Discussion Leaders

Gagan Agrawal, PhD: Dr. Gagan Agrawal is a Professor and Associate Dean of Research in the School of Computer and Cyber Sciences at Augusta University. He received his MS and PhD degrees from the University of Maryland, College Park. He previously held faculty positions at the University of Delaware and Ohio State University. His research interests include high performance computing, big data analytics, cloud, edge and fog computing, and parallel machine learning. His work in these areas has resulted in more than 275 peer-reviewed publications, significant funding from the National Science Foundation and the Department of Energy, and 30 PhD graduates. He has served on the editorial boards of four journals and as program committee co-chair, area chair, or program committee member for many conferences. His notable research contributions include middleware systems for parallelizing data-analytics applications on clusters and other HPC architectures, techniques for managing scientific data, and parallel algorithms for data mining and machine learning.

Jennifer Lewis Priestley, PhD: Dr. Priestley is a Professor of Statistics and Data Science, the Executive Director of the Analytics and Data Science Institute, and an Associate Dean of the Graduate College at Kennesaw State University, Kennesaw, GA. She received a Ph.D. from Georgia State University, an MBA from The Pennsylvania State University, and a BS from the Georgia Institute of Technology. She architected the first Ph.D. program in Data Science, which was launched in February 2015. In 2012, the SAS Institute recognized Dr. Priestley as the Distinguished Statistics Professor of the Year. Datanami recognized Dr. Priestley as one of the top 12 "Data Scientists to Watch in 2016." Dr. Priestley has been a featured international speaker at the World Statistical Congress, the South African Statistical Association, Nelson Mandela University, the Federal Reserve Bank, SAS Global Forum, Big Data Week, the Technology Association of Georgia, Data Science ATL, the Atlanta CEO Council, Predictive Analytics World, INFORMS, and dozens of academic and corporate conferences, addressing issues related to the evolution of data science, women in data science, and ethical data science. Prior to receiving her Ph.D. in Statistics, Dr. Priestley worked in the financial services industry for 11 years. Her positions included Vice President of Business Development for VISA EU in London, Regional Vice President for MasterCard US, and senior consultant with Accenture's strategic services group.


ABSTRACTS: Featured Presentations

Rachel Cummings, PhD, The Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY
Differential Privacy for Dynamic Databases

Privacy concerns are becoming a major obstacle to using data in the ways we want. How can data scientists make use of potentially sensitive data, while providing rigorous privacy guarantees to the individuals who provided data? Over the last decade, differential privacy has emerged as the de facto gold standard of privacy-preserving data analysis. Differential privacy ensures that an algorithm does not overfit to the individuals in the database by guaranteeing that if any single entry in the database were to be changed, then the algorithm would still have approximately the same distribution over outputs. In this talk, we will focus on recent advances in differential privacy for dynamic databases, where the content of the database evolves over time as new data are acquired. First, we will see how to extend differentially private algorithms for static databases to the dynamic setting, with relatively small loss in the privacy-accuracy tradeoff. Next, we will see algorithms for privately detecting changes in data composition. We will conclude with a discussion of open problems in this space, including the use of differential privacy for other types of data dynamism. (Based on joint work with Sara Krehbiel, Kevin Lai, Yuliia Lut, Yajun Mei, Uthaipon (Tao) Tantipongpipat, Rui Tuo, and Wanrong Zhang.)
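
For readers new to the definition above, a minimal illustrative sketch (not code from the talk) of the classical Laplace mechanism for a counting query; such a query has sensitivity 1, so Laplace noise of scale 1/epsilon gives epsilon-differential privacy.

```python
import numpy as np

def private_count(data, predicate, epsilon, rng=np.random.default_rng()):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one record changes
    the true count by at most 1, so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(predicate(x) for x in data)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: privately count records above a threshold.
records = [3.2, 5.1, 4.8, 6.0, 2.9]
print(private_count(records, lambda x: x > 4.0, epsilon=0.5))
```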

George Cybenko, PhD, Dorothy and Walter Gramm Professor of Engineering, Thayer School of Engineering, Dartmouth College, Hanover, NH
Machine Reinforcement Learning with Attrition

While machine learning has shown remarkable progress in a variety of domains, those successes have been made in environments that are stochastically stationary. That is, the statistics of the environment do not change, which effectively assumes that an adversary is not adapting. This talk will review basic concepts and several recent relevant results, suggesting ways to analyze, learn, and operate in such environments.

Moazzam Khan, PhD, Software Engineer, IBM Security Systems, Atlanta, GA
Applying Data Science to Detect Malicious User Behavior

User behavior is a major indicator of the security status of a network. Malicious user behavior may range from the inadvertent, such as a drive-by download from an infected website, to intentional misuse such as unauthorized access or stealing proprietary information. Every user action on a network leaves a trail behind in the form of device logs, and we can apply data science to these logs to extract useful analytics about a user's behavior. In this talk we will discuss IBM QRadar User Behavior Analytics as a use case to see how we can apply data science to device log data and extract useful analytics about user behavior.


Xiao-Li Meng, PhD, Whipple V. N. Jones Professor of Statistics, Harvard University, Cambridge, MA

Personalized Treatments: Sounds heavenly, but where on Earth did they find my guinea pigs?

Are you kidding me? Surely no one should take personalized literally. Fair enough, but then how un-personalized is personalized? That is, how fuzzy should "me" become before there are enough qualified "me"s to serve as my guinea pigs? Wavelet-inspired multi-resolution (MR) inference (Meng, 2014, COPSS 50th Anniversary Volume) allows us to theoretically frame such a question, where the primary resolution level defines the appropriate fuzziness, very much like identifying the best viewing resolution when taking a photo. Statistically, the search for the appropriate primary resolution level is a quest for a sensible bias-variance trade-off: estimating more precisely a less relevant treatment effect versus estimating less precisely a more relevant treatment effect for "me." Theoretically, the MR framework provides a statistical foundation for transitional inference, an empirical concept rooted and practiced in clinical medicine since ancient Greece. Unexpectedly, the MR framework also reveals a world without the bias-variance trade-off, where the personal outcome is governed deterministically by potentially infinitely many personal attributes. This world without variance apparently prefers overfitting in the lens of statistical prediction and estimation, a discovery that might offer a clue to some of the puzzling success of deep learning and the like (Li and Meng, 2021, JASA).

Doug Miller, MD, Medical College of Georgia at Augusta University, Augusta, GA
Medical Decision Making and Diagnostic Ambiguity in the AI Era

Modern medicine is at the nexus of two unhealthy megatrends – growing administrative cost waste and exploding big data. New technologies are often advanced as solutions for these and other indurate healthcare complexities. But technology insertion into systems is never neutral, and often has unintended consequences. In this complexity-technology context, diagnostic ambiguity can emerge and cause poor patient outcomes, cost inefficiencies and medical errors. The challenges of system complexity and diagnostic ambiguity can now be purposefully addressed by harnessing the probability, data and computing sciences.

Medical decision-making is an application of probability science to the interrelated processes of testing, diagnosis and treatment. Decision-makers and diagnostic systems seek to learn from existing information to achieve faster, more accurate and reproducible solutions to problems, and ideally to avert them. Decision analysis is a mathematical approach helpful under circumstances of diagnostic ambiguity, by showing decision makers that a preferred treatment plan depends on knowledge, the care objective and decision criteria. Data science de-convolutes high-dimensional dynamic datasets and wrangles complex big data. Knowledge creation sorts data complexities in the evidence so that medical providers can make credible assumptions and draw logical conclusions that guide complex care. Knowledge representation makes querying of diverse data structures by humans and/or intelligent machines possible, in order to model and communicate solutions to complexity. The computing science trend of artificial intelligence (AI) uses algorithms in neural networks to 'learn' patterns in complex datasets, providing insights obscure to humans and predictive models opaque to standard statistics.

Most approved AI medical applications are diagnostic, but this narrow AI cannot disambiguate clinical uncertainties from messy datasets like EMRs. Meeting global AI challenges in other data-dense, context-uncertain domains like autonomous driving vehicles offers salient lessons for healthcare, where the probability of flawed human reasoning and bias is high and potentially life threatening. Next-wave broad AI technologies may be capable of helping doctors to disambiguate complex individual patient diagnoses in real time by improving clinical reasoning, mitigating biases and explaining or even averting medical errors. However, adopting difficult-to-interpret "black box" AI models based on suspect data quality for high-stakes medical decisions can worsen clinical ambiguities and healthcare inefficiencies.


Aditya Prakash, PhD, College of Computing, Georgia Institute of Technology, Atlanta, GA
Deep Learning Data-Driven Approaches for Epidemic Forecasting

The devastating impact of the currently unfolding global COVID-19 pandemic, and those of the Zika, SARS, MERS, and Ebola outbreaks over the past decade, have sharply illustrated our enormous vulnerability to emerging infectious diseases. Many questions are being studied by epidemiologists and public officials during these outbreaks. Building on our prior work, we have been pursuing multiple activities amidst the COVID-19 pandemic in the United States, collaborating with partners in academia, industry and public health agencies, from award-winning work on helping forecast pandemic trajectories (also shown on the CDC website) to designing more localized and less burdensome campus interventions. In this talk, I will briefly give an overview of our recent research in designing well-calibrated, robust, accurate and interpretable deep learning models for epidemic forecasting, illustrating the important role data science and machine learning have to play in pandemic prevention and prediction.

Ashis SenGupta, PhD, Professor Emeritus, Applied Statistics Unit, Indian Statistical Institute, Kolkata, India; Augusta University, Augusta, GA, USA; Middle East Technical University, Ankara, Turkey
Innovative Statistical Methods for Manifold Data with Biological Applications

We consider analysis of data on smooth manifolds, specifically directional data, where observations can be mapped onto circles and spheres. Due to the disparate topologies of the line and the circle, the usual statistical analyses for linear data are not valid for directional data. For example, the arithmetic mean is a nonsensical summary measure for circular or angular data. In the current era of big data analytics, dimension reduction has become a crucial aspect of data analysis, and this is true for directional data too; however, the usual PCA or ICA are no longer valid here. In the first part of the talk, we present a simple and elegant solution to this problem for circular and spherical data, illustrated through analysis of real-life gait data. In the second part of the talk, an elegant approach for generalizing MANOVA to multivariate directional data is presented and illustrated through several real-life biological examples.
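
A toy illustration (not from the talk) of why the arithmetic mean fails for angular data: for two directions of 350 and 10 degrees, the arithmetic mean is 180 degrees, while the circular mean, computed from the mean resultant vector, is 0 degrees.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Circular mean of angles in degrees via the mean resultant vector."""
    rad = np.deg2rad(angles_deg)
    return np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360

angles = [350.0, 10.0]
print(np.mean(angles))            # 180.0, a nonsensical summary for directions
print(circular_mean_deg(angles))  # 0.0, the sensible directional summary
```
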
Elizabeth Slate, PhD, Department of Statistics, Florida State University, Tallahassee, FL
A Joint Model for Biomarker Discovery in Heterogeneous Populations

Identification of valid, clinically relevant biomarkers for disease has the potential to provide less invasive diagnostic tools, to enhance understanding of initiation and progression at the cellular level, and to guide development of new therapeutic agents. When the biomarkers are binary, logic regression provides a means to discover Boolean combinations of the markers strongly associated with outcome. The interpretability of these Boolean marker combinations and, potentially, additional interactions with environmental and behavioral characteristics is appealing and can provide insight. However, complex diseases such as cancer that arise from multiple pathways and present at varying stages of development and progression can lead to hidden population heterogeneity in the biomarker-disease association. We describe an extension of logic regression for jointly modeling binary and continuous outcomes that uses a latent class structure to accommodate subpopulation heterogeneity. Estimation and inference are compared for two Bayesian semiparametric formulations using a variety of computational approaches.


Hongyu Zhao, PhD, Ira V. Hiscock Professor of Biostatistics, Professor of Genetics, and Professor of Statistics and Data Science, Yale University, New Haven, CT
Predicting Disease Risk from Genomic Data

Accurate disease risk prediction based on genetic and other factors can lead to more effective disease screening, prevention, and treatment strategies. Despite the identification of thousands of disease-associated genetic variants through genome-wide association studies in the past 15 years, the performance of genetic risk prediction remains moderate or poor for most diseases, largely due to the challenges in both identifying all the functionally relevant variants and accurately estimating their effect sizes. Moreover, as most genetic studies have been conducted in individuals of European ancestry, it is even more challenging to develop accurate prediction models in other populations. Furthermore, many studies only provide summary statistics instead of individual-level genotype and phenotype data. In this presentation, we will discuss a number of statistical methods that have been developed to address these issues through jointly estimating effect sizes (both across genetic markers and across populations), modeling marker dependency, incorporating functional annotations, and leveraging genetic correlations among different diseases. We will demonstrate the utility of these methods through their applications to a number of complex diseases/traits in large population cohorts, e.g., the UK Biobank data. This is joint work with Wei Jiang, Yiming Hu, Yixuan Ye, Geyu Zhou and Qiongshi Lu.


ABSTRACTS: Poster and Five-Minute Presentations

Sasanka Adikari, Old Dominion University, Norfolk, Virginia
Simultaneous Inferences for Multiple Utility in Time Choice Pairs under Copula-based Models

Discrete choice models (DCMs), or qualitative choice models, are applied in many fields and in the statistical modelling of consumer behavior. The construction of DCMs takes many forms, such as Binary Logit, Binary Probit, Multinomial Logit, Conditional Logit, Multinomial Probit, Nested Logit, Generalized Extreme Value Models, Mixed Logit, and Exploded Logit. Choice behaviors and their utilities arise in the social sciences, health economics, transportation research, marketing, and health systems research, and they often exhibit time-dependent behavior. In this manuscript, we extend the DCMs with emphasis on time-dependent best-worst choice and discrimination between choice attributes, using a flexible distribution function for the time dependence: the copula method. Here we fit a bivariate best-worst copula distribution for consumer choice by including parameters for customer feeling and the state of uncertainty. We use a conditional logit model to calculate initial utility, and expected utilities over time are obtained using a backward recursive method based on Markov decision processes. We use transition probabilities, derived using a copula method called the CO-CUB model, to predict the utilities in time (UiT). Using the covariates estimated by Flynn (2007), we illustrate the behavior of the UiTs and analyze their confidence/credible intervals. The properties of the transition probabilities are assessed in a bootstrap study; under the copula and bootstrap approach, the transition probabilities follow a Bessel sequence under sufficient conditions.

Shijia Bian, Emory University, Atlanta, Georgia
BSA to Define Important Features for the Interpretation of 99mTc-MAG3 Diuretic Scintigraphy

In recent years, 99mTc-Mercaptoacetyltriglycine (MAG3) diuretic renal scans have been widely used as a high-tech, non-invasive and cost-effective procedure to diagnose kidney obstruction. To facilitate accurate and timely interpretation of diuretic renal scans, renogram curves are generated at baseline and after furosemide injection, and their quantitative features, such as time to half-maximum counts, are derived to help evaluate possible kidney obstruction. Maximizing the utility of quantitative features requires a clear understanding of renal physiology and MAG3 pharmacokinetics. A large percentage of renal scans in the United States, however, are interpreted by general radiologists with limited experience or insufficient training, who tend to select and utilize quantitative features based on an ad hoc blending of intuition and past practice without proper guidance and scientific justification. In fact, a naïve and uninformed reliance on quantitative features is currently a leading cause of erroneous scan interpretation, inappropriate patient management and unnecessary renal surgery in the United States. As such, the goal of this article is to rigorously evaluate the diagnostic accuracy of various quantitative features of renogram curves that are currently used or newly considered, and to establish scientifically justified guidance regarding their selection and application. Our proposed approach is two-fold. First, we use a kernel smoothing method to address measurement error and discreteness in the renogram curves and obtain accurate estimates of various quantitative features that reflect important physiological mechanisms of kidney obstruction.
Second, we evaluate the diagnostic utility of the estimated quantitative features by assessing their alignment with scan interpretations (obstruction, equivocal finding, or no obstruction) provided by three nuclear medicine experts via the recently introduced concept of broad sense agreement (BSA). The top three quantitative features that show the highest BSA with the consensus rating of the three experts, and thus have the greatest diagnostic utility in detecting obstructed kidneys, are: 1) the ratio of the counts at the last time point of the furosemide renogram to the maximum counts of the baseline renogram; 2) the ratio of the counts at the first time point of the furosemide renogram to the maximum counts of the baseline renogram; and 3) the area under the first derivative of the baseline renogram over the scan time interval. Use of these features in practice can potentially help radiologists interpret studies at a faster rate and with a degree of accuracy that resembles that of nuclear medicine experts, and ultimately improve the quality and affordability of the clinical care of kidney obstruction.
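
A minimal sketch of the kind of computation described above, on a hypothetical renogram sampled on a regular time grid (the actual estimation in the abstract is more involved): Nadaraya-Watson kernel smoothing of the counts, followed by extraction of the time-to-half-maximum feature.

```python
import numpy as np

def kernel_smooth(t, counts, bandwidth=1.0):
    """Nadaraya-Watson smoother with a Gaussian kernel on a regular time grid."""
    diffs = (t[:, None] - t[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return (weights @ counts) / weights.sum(axis=1)

def time_to_half_max(t, counts):
    """First time after the peak at which the curve drops below half its maximum."""
    peak = np.argmax(counts)
    half = counts[peak] / 2.0
    below = np.where(counts[peak:] <= half)[0]
    return t[peak + below[0]] if below.size else np.nan

# Hypothetical 20-minute renogram sampled every minute.
t = np.arange(0.0, 20.0, 1.0)
raw = 100 * np.exp(-0.5 * ((t - 5) / 4) ** 2) + np.random.default_rng(0).normal(0, 3, t.size)
smooth = kernel_smooth(t, raw, bandwidth=1.5)
print(time_to_half_max(t, smooth))
```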


Maxime Bouadoumou, Georgia State University, Atlanta, Georgia
Jackknife Empirical Likelihood Inference of the Difference of Two Correlated Coefficients of Variation

The coefficient of variation (CV) is a unitless measure of variability used in various areas of applied statistics; it is the ratio of the standard deviation to the mean. In this paper, we propose jackknife empirical likelihood (JEL) and its related methods, particularly transformed jackknife empirical likelihood (TJEL), transformed adjusted jackknife empirical likelihood (TAJEL), adjusted jackknife empirical likelihood (AJEL), bootstrapped jackknife empirical likelihood (BJEL), and mean jackknife empirical likelihood (MJEL), to construct confidence intervals for the CV from paired samples for U-statistics using a profiling method. These novel jackknife approaches are used to overcome the under-coverage problem that some of the EL methods encounter. TJEL and TAJEL have better performance than the other EL methods in terms of coverage probability of confidence intervals for the normal distribution, while BJEL has better coverage probability for the Laplace and shifted exponential distributions. Overall, the JEL method studied in this paper gives the shortest confidence intervals. Two real data applications are used to illustrate our methods.
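
A small sketch of the jackknife building block that underlies JEL (the empirical-likelihood profiling itself is omitted, and the function names are illustrative): jackknife pseudo-values of the coefficient of variation and the resulting standard error.

```python
import numpy as np

def cv(x):
    """Coefficient of variation: standard deviation divided by the mean."""
    return np.std(x, ddof=1) / np.mean(x)

def jackknife_pseudo_values(x, stat):
    """Pseudo-values theta_i = n*stat(x) - (n-1)*stat(x without observation i)."""
    n = len(x)
    full = stat(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    return n * full - (n - 1) * loo

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=50)
pv = jackknife_pseudo_values(sample, cv)
est, se = pv.mean(), pv.std(ddof=1) / np.sqrt(len(pv))
print(f"jackknife CV estimate {est:.3f}, standard error {se:.3f}")
```
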
William Cocke, Augusta University, Augusta, Georgia
Word Frequency for Intermediate Students of Hindi and Urdu

This project uses web scraping to pull Hindi and Urdu language news articles and perform basic word frequency counting. Next, with an aim towards providing realistic frequency counts for intermediate-level students, we perform lemmatization with an emphasis on combining related words via grammatical function. The resulting lists provide resources for students of Hindi and Urdu as well as for speakers of one language who wish to quickly improve their proficiency in the other. Some of the political complexities surrounding Hindi and Urdu are uncovered by the resulting lists.

Ying Cui, Emory University, Atlanta, Georgia
An Efficient Model-based Approach for Clustering Two-dimensional Functional Data

Monitoring anemia with precise hemoglobin measurements is clinically significant, but the necessary invasive blood sampling procedure (phlebotomy) for its detection is known to cause various adverse effects, including fainting, nerve damage and hematoma. This work is motivated by the need to develop non-invasive tools for screening anemia and reduce the phlebotomy burden on patients. Specifically, we introduce a new clustering method that can non-invasively separate patients into low and high anemia risk groups using clinical pallor data extracted from patient-sourced fingernail photos. To increase the efficiency and accuracy of clustering, we propose a novel clustering algorithm based on a latent class functional mixed model that: (i) fully leverages the two-dimensional structural (pixel) information of clinical pallor data using an appropriate, flexible basis system; (ii) can be extended to simultaneously cluster multiple fingernail photos while controlling for effects of available covariates (e.g., image metadata) on cluster membership; and (iii) allows borrowing information across different subjects via random effects on basis coefficients. An EM algorithm is derived to estimate the model parameters and latent cluster memberships. We further introduce a data-driven approach for choosing the appropriate number of clusters based on the "distortion function" adapted to our setting. Our simulation study demonstrates that the proposed method outperforms other competing methods for clustering two-dimensional functional data with and without covariate information. The proposed method is applied to cluster patient-sourced fingernail photos collected at the Emory University Hospital and to identify patient groups with low and high risk for anemia. This application unveils useful subpopulation structures of fingernail photos whose relationships with the underlying physiological mechanism of anemia can be further delineated.


Adam J. Dugan, University of Kentucky, Lexington, Kentucky
A New Functional F-Statistic for Gene-Based Inference Involving Multiple Phenotypes

Genetic pleiotropy occurs when a single genetic variant influences multiple traits. Numerous statistical methods exist for testing for genetic pleiotropy at the variant level, but fewer methods are available for testing genetic pleiotropy at the gene level. In the current study, we derive an exact alternative to the Shen and Faraway functional F-statistic for functional-on-scalar regression models. Through extensive simulation studies, we show that this exact alternative performs similarly to the Shen and Faraway F-statistic in gene-based, multi-phenotype analyses and that both F-statistics perform better than existing methods in small-sample, modest effect size situations. We then apply all methods to real-world neurodegenerative disease data and identify novel associations.

Dilini Katukoliha Gamage, Old Dominion University, Norfolk, Virginia
Spatio-Temporal Modeling of Progression of the COVID-19 Pandemic

The recent novel coronavirus (COVID-19) pandemic has been the worst in recent history in terms of disease fatalities. It is therefore critical to find suitable statistical models that can be used to predict the progression of COVID-19. Such models allow us to monitor the spread of the virus in targeted locations and time periods. Data on new COVID-19 cases for four selected countries over time were collected from the Situation Reports published by the World Health Organization; data for the neighboring countries were also considered in the modeling. Bayesian conditional autoregressive (CAR) models are applied to the data to account for spatial dependency among the countries along with the temporal dimension of the disease. Moran measures were also computed to compare spatial trends of new COVID-19 cases within each country and block. Different blocks evidenced discrepancies that could be used to direct guidelines and regulations specific to those countries.

Xinyu Guo, Johns Hopkins University, Baltimore, Maryland
The ASSET-based Tissue-set Association Analysis

Genome-wide association studies (GWASs) have successfully detected numerous genetic variants (i.e., single-nucleotide polymorphisms, SNPs) that are associated with complex human traits and diseases. However, a large proportion of GWAS findings remain unexplained. Thus, statistical integration of genetic data is a valuable and crucial tool for understanding potential biological mechanisms underlying GWAS results. To integrate shared information across tissues, we developed an ASSET-based Tissue-set Association Analysis (ATAA) method. This method requires only summary-level GWAS statistics. By integrating large-scale transcriptome-wide association study (TWAS) results for all available tissues, ATAA uses an association analysis based on subsets (ASSET) to identify the subset of tissues with maximal evidence of association for each potential gene. We identified de novo tissue-disease relationships for Alzheimer's disease, type 2 diabetes, and bipolar disorder. We also conducted simulation studies for two selected genes, KLK2 and APOE, on chromosome 19. In the simulations, we compared ATAA with the traditional meta-analysis approach implemented in the TWAS-FUSION software. The results show that ATAA provides higher specificity for the selected causal tissue set. Moreover, compared to the omnibus test implemented in TWAS-FUSION, the subset analysis in ATAA has a lower type 1 error.


Daniel Linder, Augusta University, Augusta, Georgia
Bayesian Model Selection in the High-Dimensional Logistic Regression Setting

We describe a Bayesian hierarchical model, termed 'PMMLogit', for classification and model selection in high-dimensional settings with binary phenotypes. Posterior computation in the logistic model is known to be computationally demanding due to its non-conjugacy with common priors. We combine a Polya-Gamma based data augmentation strategy with recent results on Markov chain Monte Carlo (MCMC) techniques to develop an efficient and exact sampling strategy for the posterior computation. We use the resulting MCMC chain for model selection and choose the best combination(s) of genomic variables via posterior model probabilities. Further, a Bayesian model averaging (BMA) approach using the posterior mean, which averages across visited models, is shown to give superior prediction of phenotypes given genomic measurements.

Yiling Luo, Georgia Institute of Technology, Atlanta, Georgia
Directional Bias Helps SGD to Generalize in Nonparametric Regression

We study Stochastic Gradient Descent (SGD) and Gradient Descent (GD) algorithms in kernel regression. Our results reveal different directional biases of SGD and GD during the training process. Specifically, SGD with a moderate and annealing step size converges along the direction corresponding to the large eigenvalue of the Hessian, while GD with a moderate or small step size converges along the small eigenvalue. We show that this directional bias helps the SGD estimator to generalize better. This gives one way to explain how noise helps generalization when learning with a nontrivial step size, which may be useful for promoting further understanding of stochastic algorithms in deep learning. We provide numerical studies on simulated data and the FashionMNIST dataset to support our theory.

Wenjing Ma, Emory University, Atlanta, Georgia
Important Aspects in Supervised Cell Type Identification for Single-Cell RNA-Seq

Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of supervised methods relies heavily on several key factors: feature selection, the prediction method, and, most importantly, the choice of the reference dataset. In our real data experiments, we observed that a Multi-Layer Perceptron (MLP) classifier along with F-test feature selection consistently performed best, and that combining all individuals from the available datasets generated better results. We also examined how data preprocessing, discrepancies between reference and target, and pooling and purifying the reference datasets affect prediction performance.
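
A minimal sketch of the best-performing combination reported above, using scikit-learn on a hypothetical labeled expression matrix (an illustration of the pipeline, not the authors' code):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Hypothetical reference data: 300 cells x 2000 genes with 4 cell-type labels.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(300, 2000)).astype(float)
y = rng.integers(0, 4, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# F-test feature selection followed by a multi-layer perceptron classifier.
clf = make_pipeline(
    SelectKBest(f_classif, k=200),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```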


Sara Motlaghian, Georgia State University, Atlanta, Georgia
Nonlinear Functional Network Connectivity in fMRI Data

Previous studies of brain function in patients with schizophrenia have found both hyperconnectivity and hypoconnectivity between distinct brain regions in individuals. One of the most widely used methods for assessing functional changes across the brain is functional connectivity. Prior functional connectivity studies have used metrics such as correlation and thus focus on linear relationships; the assessment of explicitly nonlinear relationships is understudied. In this work, our focus is on nonlinear relationships. We introduce a technique based on normalized mutual information (MI) that quantifies the nonlinear correlation between different regions of the brain. We evaluate nonlinear functional network connectivity (FNC) in fMRI data by first removing the linear relationship and then evaluating the residual correlation of time courses with mutual information to model nonlinear effects. We first demonstrate this approach using simulated data, then apply it to a schizophrenia dataset previously studied by Damaraju, Allen et al. (2014). The resting-state eyes-closed fMRI data included 151 schizophrenia patients and 163 age- and gender-matched healthy controls. The data were first decomposed by group independent component analysis (ICA), yielding 47 functionally relevant intrinsic connectivity networks. Our analysis showed a modularized nonlinear relationship among brain functional networks, particularly noticeable in the sensory and visual cortex. Interestingly, the modularity was different from that revealed by the linear approach. Analysis of group differences identified significant differences in nonlinear dependencies between schizophrenia patients and healthy controls, particularly in the visual cortex. Some domains, such as cognitive control and default mode, appeared much less nonlinear, whereas links between the visual domain and other domains showed substantial modularity. Overall, these results suggest that considering nonlinear dependencies of F(N)C may provide a complementary and potentially important tool for improving our understanding of brain function and its links to behavior.
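
A simplified sketch of the two-step idea described above (regress out the linear relationship, then score the residual dependence with mutual information); the normalization and group-ICA steps used in the study are omitted, and the data here are synthetic:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def nonlinear_dependence(x, y, random_state=0):
    """Remove the best linear fit of y on x, then measure residual dependence with MI."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residual = y - (slope * x + intercept)
    # mutual_info_regression expects a 2-D feature array.
    return mutual_info_regression(x.reshape(-1, 1), residual, random_state=random_state)[0]

rng = np.random.default_rng(0)
t = rng.normal(size=2000)
linear_only = 0.8 * t + 0.2 * rng.normal(size=t.size)
nonlinear = np.abs(t) + 0.2 * rng.normal(size=t.size)   # nonlinear, near-zero linear correlation

print(nonlinear_dependence(t, linear_only))  # near zero once the linear part is removed
print(nonlinear_dependence(t, nonlinear))    # clearly positive: residual nonlinear dependence
```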

Brian Pidgeon, Georgia State University, Atlanta, Georgia
Jackknife Empirical Likelihood Methods for Testing Distributional Symmetry

In this poster, we consider a general k-th correlation coefficient between the density function and the distribution function of a continuous variable as a measure of symmetry and asymmetry. We make statistical inference on the k-th correlation coefficient by using jackknife empirical likelihood (JEL) and its variations to construct confidence intervals. The JEL statistic is shown to asymptotically follow a standard chi-squared distribution. We compare our methods to the previous empirical likelihood (EL) techniques of Zhang et al. (2018) and show that JEL possesses better small-sample properties. Simulation studies are conducted to examine the performance of the proposed estimators, and we also use our proposed methods to analyze two real datasets for illustration.


Khaled Bin Satter, Augusta University, Augusta, Georgia
DBU, A Classification Algorithm for High-Dimensional Genomics Data

Transcriptomic analyses, which are representative of metabolic status, are becoming standard in cancer studies. These high-dimensional data are used for classification and regression with unsupervised and supervised learning models, but they are resource hungry and computationally demanding, so a dimension reduction step before clustering can resolve many of these issues. Here we present DBU (Density-Based UMAP), a transcriptomic classification algorithm that integrates a dimension reduction algorithm, Uniform Manifold Approximation and Projection (UMAP), with a clustering algorithm, Density-Based Spatial Clustering (DBSCAN). DBU draws random subsets of the whole genome to form smaller sets of variables and applies dimension reduction with UMAP. The resulting output is a two-dimensional embedding of the smaller subset, which is then fed into DBSCAN to identify the groups. We apply DBU to the classification of chromophobe renal cancer and oncocytoma, two histologically similar tumors. DBU was run for 1,000 iterations, and the final classification was determined by plurality voting with a greater than 70% threshold. Our classification accuracy with DBU was 95.5%. These results show that DBU is a stable, robust transcriptomic classification algorithm that can be used to identify and differentiate tumors in a consistent manner.
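
A minimal sketch of one DBU-style iteration as described above (a random gene subset, UMAP to two dimensions, then DBSCAN), assuming the third-party umap-learn package and synthetic data; the repeated subsetting and plurality voting are left out:

```python
import numpy as np
from sklearn.cluster import DBSCAN
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 5000))   # hypothetical samples x genes matrix

# One iteration: draw a random subset of genes, embed to 2-D, then density-cluster.
genes = rng.choice(expression.shape[1], size=500, replace=False)
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(expression[:, genes])
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)

print("cluster labels found in this iteration:", np.unique(labels))
# In DBU, this step is repeated (e.g., 1,000 times) and the per-sample labels are
# combined by plurality voting to give the final classification.
```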

Yuyang Shi, Georgia Institute of Technology, Atlanta, Georgia
Efficient Algorithm for Repeated Assignment Problem

The assignment problem is to decide the optimal assignment between a number of agents and jobs that minimizes the total cost. It has many important real-world applications, such as job assignment in factories and rideshare scheduling. The problem has been well studied in the optimization literature when the costs are known; it is less studied when the costs are unknown and must be learned from data through repeated assignment tasks. In this work, motivated by the scenario of pairing students with mentors in the mentoring programs of many universities, we propose an efficient algorithm to learn the risk and minimize the overall risk of assignment. Our key idea is to combine a logistic regression model for estimating the risk with the Hungarian algorithm for deciding the assignment. Theoretical analysis and extensive numerical studies are conducted to illustrate the usefulness of our proposed method.
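
A short sketch of the key idea above (estimate pairwise risk with logistic regression, then assign with the Hungarian algorithm); the data and features here are hypothetical placeholders, not the authors' setup:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical history of past pairings: features of (student, mentor) pairs and a
# binary outcome indicating whether the pairing was unsuccessful (the "risk" event).
past_features = rng.normal(size=(500, 6))
past_outcomes = rng.integers(0, 2, size=500)
risk_model = LogisticRegression(max_iter=1000).fit(past_features, past_outcomes)

# New round: 10 students and 10 mentors; build a risk matrix from predicted probabilities.
n = 10
pair_features = rng.normal(size=(n, n, 6))
risk = risk_model.predict_proba(pair_features.reshape(n * n, 6))[:, 1].reshape(n, n)

# Hungarian algorithm: choose the assignment minimizing total predicted risk.
rows, cols = linear_sum_assignment(risk)
print(list(zip(rows, cols)), "total risk:", risk[rows, cols].sum())
```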

Paul Tran, Augusta University, Augusta, Georgia
The 3p21.31 Genetic Locus Promotes Type 1 Diabetes Progression through the CCR2/CCL2 Pathway

Multiple cross-sectional and longitudinal studies have shown that serum levels of the chemokine ligand 2 (CCL-2) are associated with type 1 diabetes (T1D), although the direction of effect differs. We assessed CCL-2 serum levels in a longitudinal cohort to clarify this association, combined with genetic data to elucidate the regulatory role of CCL-2 in T1D pathogenesis. The Diabetes Autoimmunity Study in the Young (DAISY) followed 310 subjects with high risk of developing T1D. Of these, 42 became persistently seropositive for islet autoantibodies but did not develop T1D (non-progressors); 48 did develop T1D (progressors). CCL-2 serum levels among the three study groups were compared using linear mixed models adjusting for age, sex, HLA genotype, and family history of T1D. Summary statistics were obtained from the CCL-2 protein quantitative trait loci (pQTL) and CCR2 expression QTL (eQTL) studies. The T1D fine mapping association data were provided by the Type 1 Diabetes Genetics Consortium (T1DGC). Serum CCL-2 levels were significantly lower in both progressors (p=0.004) and non-progressors (p=0.005), compared to controls. Two SNPs (rs1799988 and rs746492) in the 3p21.31 genetic locus, which includes the CCL-2 receptor, CCR2, were associated with increased CCR2 expression (p = 8.2e-5 and 5.2e-5, respectively), decreased CCL-2 serum level (p = 2.41e-9 and 6.21e-9, respectively), and increased risk of T1D (p = 7.9e-5 and 7.9e-5, respectively). The 3p21.31 genetic region is associated with developing T1D through regulatory control of the CCR2/CCL2 immune pathway.


Bo Wei, Emory University, Atlanta, Georgia
Tensor Response Quantile Regression with Neuroimaging Data

Evaluating the impact of clinical factors on neuroimaging phenotypes is often of interest in neuroimaging studies. To this end, we propose a tensor response quantile regression framework, where the neuroimaging phenotype is formulated as a tensor response and clinical factors are allowed to have flexible heterogeneous effects on the tensor response. We develop a computationally efficient estimation procedure for the regression coefficient tensor associated with the covariate effects by imposing a sensible low-rank structure for the coefficient tensor. This approach allows interpretable estimates of covariate effects regarding the underlying structure of the neuroimaging phenotype. We establish the asymptotic properties of the proposed estimators. Simulation studies demonstrate good finite-sample performance of the proposed method. We apply the proposed methods to investigate the association of post-traumatic stress disorder (PTSD) clinical assessments and fMRI resting-state functional connectivities in the Grady Trauma Project.

Nicholas Woolsey, University of South Carolina, Columbia, South Carolina
A New Perspective in Functional Errors-In-Variables Models

Regression is the method by which we fit a particular function to a data set; this function is meant to mimic a true relationship that exists between the variables represented in the data. In classical regression we have an overall error in the response, whereas in the errors-in-variables (EIV) model we can have error in each predictor independently. Oftentimes the EIV model will more accurately reflect specific cases, such as image processing. However, unlike in classical regression, there is no clear 'best' way to derive the relationship between the variables. In classical regression the least squares (LS) estimator is best, whereas in EIV the LS estimator is heavily biased. The maximum likelihood estimator (MLE) is also used; however, this estimator suffers from erratic behavior. Our goal here is to find a new estimator that does not suffer from these issues.

Qunzhi Xu, Georgia Institute of Technology, Atlanta, Georgia
Active Sequential Change-point Detection Under Sampling Control

The active sequential change-point detection problem is considered under the sampling constraint that only a subset of the complete data can be observed per time step. For the special case when there is only one affected data stream and we are allowed to access only one local stream per time step, we develop a first-order asymptotically optimal algorithm when the number of streams M is fixed. Moreover, we extend our results to a general case when Q ≥ 2 streams can be accessed per time step, and the dimension M grows to infinity.


Ye Yue, Emory University, Atlanta, Georgia
Assessing Reproducibility Among Multiple Two-Dimensional Images

Modern medical imaging technologies are increasingly producing novel biomarkers that allow non-invasive diagnosis of many diseases. Quantifying the intra-person reproducibility of repeated images taken under the same condition is a fundamental step in evaluating and establishing the validity of such image-based biomarkers. In this presentation, we consider a modern clinical setting where two-dimensional (pixel-specific) color intensity data from fingernail images taken by a smartphone are evaluated for use in non-invasive diagnosis of anemia. To facilitate evaluation of such image-based biomarkers, we propose a new agreement index for assessing intra-person reproducibility among multiple images. Specifically, the proposed agreement index quantifies the degree of concordance among multiple images based on their expected squared difference, generalized to a space of two-dimensional functional data, providing intuitive interpretation and achieving computational efficiency. Our simulation results demonstrate satisfactory finite-sample performance of the proposed agreement measure. The methods are illustrated by an analysis of a dataset in which several smartphone images are taken of the same subject.

Ruiwen Zhou, Duke University, Durham, North Carolina
A New Approach to Estimation of the Proportional Hazards Model Based on Interval-censored Data with Missing Covariates

This paper discusses the fitting of the proportional hazards model to interval-censored failure time data when some covariates may be missing. Many authors have discussed the problem when complete information on the covariates is available or when the missingness is completely at random; however, there does not seem to be an established method for the situation where the missingness is at random, the focus of this paper. For this problem, a sieve maximum likelihood estimation approach is proposed with the use of I-splines to approximate the unknown cumulative baseline hazard function. For the implementation of the method, we develop an EM algorithm based on a two-stage data augmentation. Furthermore, we show that the proposed estimators of the regression parameters are consistent and asymptotically normal. The proposed approach is then applied to a set of data on Alzheimer's disease that motivated this study.


Second Annual Data Science Workshop

Emerging Data Science Methods for Complex Biomedical and Cyber Data

We would like to express our gratitude to all those at Augusta University who provided behind-the-scenes support:

Michael Diamond, MD, Senior Vice President for Research, AU

Yolanda Dortch, Administrative Assistant, DPHS

David Hess, MD, Dean, Medical College of Georgia

Chelsey Lemons, MS, Research Manager, College of Nursing

Joy L. Lynn, Office Specialist, School of Computer & Cyber Sciences

Neil MacKinnon, PhD, Provost and Executive Vice President, AU

Komal Patel, BS, Program Manager, Biostatistics & Data Science, DPHS

Shayla Ricks, MBA, Department Administrator, DPHS

Alex Schwarzmann, PhD, Dean, School of Computer & Cyber Sciences

Lifang Zhang, MS, Biostatistician, Biostatistics & Data Science, DPHS


Second Annual Data Science Workshop

Emerging Data Science Methods for Complex Biomedical and Cyber Data

OUR SPONSORS