Division of Medical Data Informatics


Human Genome Center

Division of Medical Data Informatics

Professor Tetsuo Shibuya, Ph.D.
Project Assistant Professor Robert Daniel Barish, Ph.D.

The objective of the Division of Medical Data Informatics is to develop fundamental data informatics technologies for medical data, such as algorithm theory, big data technologies, artificial intelligence, data mining, and privacy-preserving technologies. Medical data, especially genome data, are increasing exponentially across medical science, from basic to clinical research. Our aim is to innovate the entire field of medical science with novel data informatics solutions.

1. Development of Artificial Intelligence Technologies for Biomedical Data

a. Natural Language Processing Methods for International Medical Text Databases

Arda Akdemir1, Tetsuo Shibuya1, Tunga Güngör2: 1Division of Medical Data Informatics, Institute of Medical Science, the University of Tokyo, 2Department of Computer Engineering, Boğaziçi University

Morphological information is important for many sequence labeling tasks in Natural Language Processing (NLP). Yet existing approaches rely heavily on manual annotations or external software to capture this information. We propose using subword contextual embeddings for languages with rich morphology. Evaluated on Dependency Parsing (DEP) and Named Entity Recognition (NER), two tasks that have been shown to benefit highly from morphological information, subword contextual embeddings consistently outperformed other approaches on all languages tested (Hungarian, Finnish, Czech and Turkish). In addition, the novel network architecture we propose, coupled with a Bayesian hyperparameter optimization suite, achieved state-of-the-art results on both tasks for Turkish. Finally, we experimented with different multi-task learning architectures to analyze the effect of jointly learning the two tasks.

Deep Neural Network (DNN) based machine learning models have achieved remarkable success in many fields of research. Yet many recent studies show the limitations of these approaches in generalizing to unseen examples and to new domains such as the biomedical domain. Moreover, supervised DNN models require a substantial amount of labeled data, which is not readily available for many tasks such as biomedical question answering. Transfer learning has been shown to mitigate these challenges by transferring information from auxiliary tasks to improve performance on a target task, and to be especially useful for low-resource tasks. These findings motivated us to investigate the effect of transfer learning and multi-task learning on biomedical question answering. We proposed a novel multi-task learning model that learns biomedical entities and questions simultaneously. In this work, we describe the three neural models with which we participated in the BioASQ 8B challenge. Our results showed that transferring information from the biomedical entity recognition task improves performance on the biomedical question answering task.


b. Artificial Intelligence Techniques for Molecular Data Analysis

Xiao Shaobin1, Robert Daniel Barish1, Tetsuo Shibuya1, Adnan Sljoka3: 3Center for Advanced Intelligence Project, RIKEN.

We are developing artificial intelligence techniques for analyzing various kinds of molecular data, including protein 3-D structures and metagenome next-generation sequencing (NGS) data. For metagenome data, we developed deep learning models that couple a CNN (Convolutional Neural Network) with an RNN (Recurrent Neural Network) to classify viruses from metagenome NGS data. Our model, called BiRNN_CNN, achieves an AUC of 0.891, outperforming previous state-of-the-art methods.
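To make the reported metric concrete, the following is a minimal sketch of how an AUC value can be computed from classifier scores. It uses the pairwise definition (the probability that a randomly chosen positive example is scored above a randomly chosen negative one). The function name, labels, and scores are made up for illustration and are not taken from the BiRNN_CNN experiments.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = viral read, 0 = non-viral read.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The quadratic pairwise loop is chosen for readability; for large read sets a rank-based computation would be the usual choice.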

2. Development of Privacy Preserving Technologies for Medical Data

a. Differential Privacy Methods for Medical Data

Tatsuki Koga1, Akito Yamamoto1, Robert Daniel Barish1, Tetsuo Shibuya1

Privacy-preserving machine learning is becoming increasingly important for a variety of applications in medical science because we often need to handle sensitive data, including personal information. Using differentially private empirical risk minimization (DP-ERM) algorithms is one of the most common approaches to obtaining privatized predictors in supervised learning. However, under class imbalance, minimizing the empirical risk with current algorithms may degrade classification performance as measured by metrics suited to imbalanced datasets. In the non-private setting, ERM with class-dependent weights is the procedure typically used in this case. We extend two fundamental DP-ERM algorithms to minimize the empirical risk with class-dependent weights so that they perform better on imbalanced datasets. We also propose an algorithm that tunes hyperparameters with respect to the area under the receiver operating characteristic curve (AUC) while maintaining the privacy guarantee. We show through experiments that when datasets are class-imbalanced and large enough, the proposed algorithms outperform the existing algorithms.
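The class-weighted objective mentioned above can be illustrated with a small sketch. It shows only the non-private ingredient: an empirical risk in which each example's loss is multiplied by a weight that depends on its class. The inverse-frequency weighting, the logistic loss, and all names and data below are assumptions chosen for illustration. The noise that a DP-ERM algorithm would add is deliberately omitted, so this is not one of the proposed algorithms.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights (one common choice, assumed here).

    Each class ends up contributing the same total weight to the risk.
    """
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def weighted_logistic_risk(w, b, xs, labels, weights):
    """Mean class-weighted logistic loss of the linear score w.x + b (labels 0/1)."""
    total = 0.0
    for x, y in zip(xs, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        margin = z if y == 1 else -z
        total += weights[y] * math.log1p(math.exp(-margin))
    return total / len(xs)


# Toy imbalanced dataset (hypothetical): four negatives, one positive.
xs = [[0.0], [0.1], [0.2], [0.3], [2.0]]
labels = [0, 0, 0, 0, 1]
weights = class_weights(labels)
risk = weighted_logistic_risk([1.0], -1.0, xs, labels, weights)
print(weights, round(risk, 4))
```

With these weights the single positive example counts as much as the four negatives combined, which is the effect that class-dependent weighting is meant to have on an imbalanced dataset.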

Analyses of datasets that contain personal genomic information are very important for revealing associations between diseases and genomes. Genome-wide association studies (GWAS), which are large-scale genetic statistical analyses, often involve tests with contingency tables. However, if the statistics obtained by these tests are made public as they are, sensitive information about individuals might be leaked. Existing studies have proposed privacy protection methods for statistics in the chi-squared test with 3 × 2 contingency tables, but they do not cover all the tests used in GWAS. In addition, existing methods for releasing p-values are not practical. In this work, we propose methods to release p-values in the chi-squared test with a 3 × 2 contingency table, chi-squared statistics and p-values in the chi-squared test with a 2 × 2 contingency table, p-values in Fisher's exact test, and chi-squared statistics and p-values in the Cochran-Armitage trend test, while preserving both personal privacy and utility. These statistical tests are used for comparative evaluation of allele frequencies and genotype frequencies, and Fisher's exact test is often applied when the entries of contingency tables are small. We provide theoretical guarantees by deriving the sensitivity of the above statistics based on the concept of differential privacy. In our experiments, we evaluate the utility of the proposed methods and show the appropriate thresholds for using the private statistics in statistical tests.
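The role that sensitivity plays in such a release can be sketched with the standard Laplace mechanism, which adds noise with scale sensitivity / epsilon to a statistic before publishing it. The sketch below applies it to the chi-squared statistic of a 2 × 2 table. The table counts, the function names, and in particular the value `ASSUMED_SENSITIVITY` are placeholders: the sensitivity bounds derived in this work are not reproduced here, and the Laplace mechanism is used only as a generic example of a differentially private release.

```python
import random


def chi_squared_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table sums to zero")
    return n * (a * d - b * c) ** 2 / denom


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_release(value, sensitivity, epsilon, rng):
    """Laplace mechanism: publish value plus noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical allele counts: rows = case/control, columns = allele A/a.
table = [[10, 20], [30, 40]]
stat = chi_squared_2x2(table)

# Placeholder only: NOT the sensitivity bound derived in the work described above.
ASSUMED_SENSITIVITY = 1.0
noisy = private_release(stat, ASSUMED_SENSITIVITY, epsilon=1.0, rng=random.Random(0))
print(f"exact = {stat:.4f}, released = {noisy:.4f}")
```

A smaller epsilon or a larger sensitivity widens the noise, which is why a tight sensitivity analysis matters for utility. The released value can fall below zero, and how to handle that is outside the scope of this sketch.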

b. RAM Simulator Data Structures for Privacy Preserving Computation

Taku Onodera4, Tetsuo Shibuya1: 4Department of Computer Science, University of Helsinki

Wear leveling, a technology designed to balance the write counts among memory cells regardless of the requested accesses, is important for security-critical applications. We completely determine the problem parameter regime for which Security Refresh, one of the most well-known existing wear leveling schemes for phase-change memory (PCM), is optimal by providing a positive result and a matching negative result. In particular, Security Refresh does not achieve optimality for the practically relevant regime of large-scale memory. We also propose a novel scheme that achieves an almost optimal lifetime, time/space overhead, and wear-free space for the relevant regime not covered by Security Refresh. Unlike existing studies, we give rigorous theoretical lifetime analyses, which are necessary to assess and control the security risk.
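The basic idea of wear leveling can be illustrated with a toy simulation. The sketch below maps each logical address to a physical cell by XOR with a random key and draws a fresh key at fixed intervals, then compares the worst-case cell wear with and without remapping under a workload that hammers a single address. It is a deliberately simplified illustration with hypothetical names and parameters: it only counts writes, it ignores the extra writes a real scheme spends moving data when the mapping changes, and it is neither Security Refresh nor the scheme proposed in this work.

```python
import random


def simulate_wear(num_cells, writes, refresh_interval, rng):
    """Return per-cell write counts for a sequence of logical write addresses.

    Physical address = logical address XOR key. A new random key is drawn
    every `refresh_interval` writes; `None` disables remapping entirely.
    """
    if num_cells <= 0 or num_cells & (num_cells - 1):
        raise ValueError("num_cells must be a positive power of two")
    wear = [0] * num_cells
    key = 0
    for i, logical in enumerate(writes):
        if refresh_interval and i % refresh_interval == 0:
            key = rng.randrange(num_cells)
        wear[logical ^ key] += 1
    return wear


# Adversarial toy workload: every write targets logical address 0.
hot_writes = [0] * 1600
plain = simulate_wear(16, hot_writes, None, random.Random(0))
remapped = simulate_wear(16, hot_writes, 10, random.Random(0))
print("max wear without remapping:", max(plain))
print("max wear with remapping:   ", max(remapped))
```

Without remapping, one cell absorbs all 1600 writes. With periodic re-keying the same writes are spread over many cells, which is the effect a wear leveling scheme must deliver with provable guarantees.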

3. Development of Biomedical Database Technologies

a. Development of Algorithms for Next Generation Sequencer Data

Kazushi Kitaya1, Tetsuo Shibuya1

Many bioinformatics tasks are performed using a set of k-mers and the de Bruijn graph it represents, for fast and space-saving processing. While much work has been done on how to efficiently compress and store a single set of k-mers or a de Bruijn graph, methods for compressing multiple k-mer sets have been less studied. We propose a data structure that can efficiently represent multiple k-mer sets constructed from genomic data and from which the original data can be efficiently reconstructed. In addition, the proposed data structure does not require reference data, i.e., no information other than the data to be compressed is needed. Given 3292 k-mer sets constructed from whole genome sequence data for E. coli, we reduced the amount of data by more than 50% compared to compressing the sets individually. This data structure is useful for researchers who want to use multiple k-mer sets or multiple de Bruijn graphs and for administrators of genome-related databases who want to store their data on disk efficiently.
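Why storing multiple k-mer sets jointly can save space is easy to see in a naive baseline: closely related genomes share most of their k-mers, so each distinct k-mer can be stored once together with a bitmask recording which sets contain it. The sketch below implements that baseline and checks that the original sets are recovered exactly. It is an illustration only, with made-up sequences and hypothetical function names, and it is not the data structure proposed in this work.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pack(kmer_sets):
    """Store each distinct k-mer once, with a bitmask of the sets that contain it."""
    packed = {}
    for idx, kmer_set in enumerate(kmer_sets):
        for km in kmer_set:
            packed[km] = packed.get(km, 0) | (1 << idx)
    return packed


def unpack(packed, num_sets):
    """Rebuild the original k-mer sets exactly from the packed representation."""
    return [
        {km for km, mask in packed.items() if (mask >> i) & 1}
        for i in range(num_sets)
    ]


# Made-up, closely related sequences standing in for related genomes.
genomes = ["ACGTACGTGA", "ACGTACGTTA", "ACGTACCTGA"]
sets = [kmers(g, 4) for g in genomes]
packed = pack(sets)
restored = unpack(packed, len(sets))

print("k-mers stored separately:", sum(len(s) for s in sets))
print("distinct k-mers stored jointly:", len(packed))
```

Like the structure described above, this baseline needs no reference data, but its per-k-mer bitmask grows with the number of sets, which is one reason more refined representations are worth studying.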

b. Integrating Viruses and Cellular Organisms for Pathway Maps

Mari Ishiguro-Watanabe1, Minoru Kanehisa5: 5Institute for Chemical Research, Kyoto University.

KEGG is a manually curated resource integrating eighteen databases categorized into systems, genomic, chemical and health information. It also provides KEGG mapping tools, which enable understanding of cellular and organism-level functions from genome sequences and other molecular datasets. KEGG mapping is a predictive method of reconstructing molecular network systems from molecular building blocks based on the concept of functional orthologs. Since the introduction of the KEGG NETWORK database, various diseases have been associated with network variants, which are perturbed molecular networks caused by human gene variants, viruses, other pathogens and environmental factors. The network variation maps are created as aligned sets of related networks showing, for example, how different viruses inhibit or activate specific cellular signaling pathways. The KEGG pathway maps are now integrated with network variation maps in the NETWORK database, as well as with conserved functional units of KEGG modules and reaction modules in the MODULE database. The KO database for functional orthologs continues to be improved, and virus KOs are being expanded for better understanding of virus-cell interactions and for enabling prediction of viral perturbations.

Publications

1. Arda Akdemir, Tetsuo Shibuya. Transfer Learning for Biomedical Question Answering. CLEF 2020, 22-25. 2020.

2. Arda Akdemir. Research on Task Discovery for Transfer Learning in Deep Neural Networks. Proc. 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 33-41, 2020.

3. Arda Akdemir, Tetsuo Shibuya, and Tunga Güngör. Subword Contextual Embeddings for Languages with Rich Morphology. ICMLA 2020, in press.

4. Taku Onodera, Tetsuo Shibuya. Wear Leveling Revisited. Leibniz International Proceedings in Informatics (LIPIcs) 181, 65:1-65:17, 2020.

5. Minoru Kanehisa, Miho Furumichi, Yoko Sato, Mari Ishiguro-Watanabe, and Mao Tanabe. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49(D1), in press.

