7
Challenges in Combating COVID-19 Infodemic - Data, Tools, and Ethics Kaize Ding Arizona State University [email protected] Kai Shu Illinois Institute of Technology [email protected] Yichuan Li Arizona State University [email protected] Amrita Bhattacharjee Arizona State University [email protected] Huan Liu Arizona State University [email protected] Abstract While the COVID-19 pandemic continues its global devastation, numerous accompanying challenges emerge. One important challenge we face is to eciently and eectively use recently gathered data and find computa- tional tools to combat the COVID-19 info- demic, a typical information overloading prob- lem. Novel coronavirus presents many ques- tions without ready answers; its uncertainty and our eagerness in search of solutions of- fer a fertile environment for infodemic. It is thus necessary to combat the infodemic and make a concerted eort to confront COVID-19 and mitigate its negative impact in all walks of life when saving lives and maintaining nor- mal orders during trying times. In this posi- tion paper of combating the COVID-19 info- demic, we illustrate its need by providing real- world examples of rampant conspiracy the- ories, misinformation, and various types of scams that take advantage of human kindness, fear, and ignorance. We present three key challenges in this fight against the COVID-19 infodemic where researchers and practitioners instinctively want to contribute and help. We demonstrate that these challenges can and will be eectively addressed by collective wisdom, crowd sourcing, and collaborative research. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 Introduction Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The World Health Or- ganization (WHO) recently declared the COVID-19 outbreak a Public Health Emergency of International Concern (PHEIC) and a pandemic due to its high morbidity and mortality rates. As of April 15, 2020, more than 2.04 million cases have been reported across 210 countries and territories, resulting in over 133,000 deaths 1 . These numbers are continuing to rise and the health systems in many countries are overwhelmed to provide treatment. Concomitant with the pandemic are many unknowns that create a conducive environ- ment for misinformation, fake news, political disinfor- mation campaigns, scams, etc. Those malicious con- tents instigate fears or anger, capitalize on human vulnerability, and exploit human emotion, kindness, and/or wishes for miracles. As the coronavirus spreads like fire in the world, disinformation machines also accelerate their cam- paigns on various fronts, rendering a new infodemic battlefield. Social media platforms such as Face- book/Instagram, Twitter, and Google/YouTube have been abused to disseminate erroneous contents. When the whole world is scrambling to fight the COVID-19 pandemic, governments and WHO also have to combat an infodemic, which is defined as “an overabundance of information — some accurate and some not—that makes it hard for people to find trustworthy sources and reliable guidance when they need it” [Don20]. The COVID-19 infodemic causes confusion, sows di- 1 https://en.wikipedia.org/wiki/Coronavirus_disease_ 2019 Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops October 19-20, Galway, Ireland" Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi

Challenges in Combating COVID-19 Infodemic - Data, Tools

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Challenges in Combating COVID-19 Infodemic

- Data, Tools, and Ethics

Kaize Ding

Arizona State University

[email protected]

Kai Shu

Illinois Institute of Technology

[email protected]

Yichuan Li

Arizona State University

[email protected]

Amrita Bhattacharjee

Arizona State University

[email protected]

Huan Liu

Arizona State University

[email protected]

Abstract

While the COVID-19 pandemic continues itsglobal devastation, numerous accompanyingchallenges emerge. One important challengewe face is to e�ciently and e↵ectively userecently gathered data and find computa-tional tools to combat the COVID-19 info-demic, a typical information overloading prob-lem. Novel coronavirus presents many ques-tions without ready answers; its uncertaintyand our eagerness in search of solutions of-fer a fertile environment for infodemic. It isthus necessary to combat the infodemic andmake a concerted e↵ort to confront COVID-19and mitigate its negative impact in all walksof life when saving lives and maintaining nor-mal orders during trying times. In this posi-tion paper of combating the COVID-19 info-demic, we illustrate its need by providing real-world examples of rampant conspiracy the-ories, misinformation, and various types ofscams that take advantage of human kindness,fear, and ignorance. We present three keychallenges in this fight against the COVID-19infodemic where researchers and practitionersinstinctively want to contribute and help. Wedemonstrate that these challenges can and willbe e↵ectively addressed by collective wisdom,crowd sourcing, and collaborative research.

Copyright © 2020 for this paper by its authors. Use permittedunder Creative Commons License Attribution 4.0 International(CC BY 4.0).

1 Introduction

Coronavirus disease 2019 (COVID-19) is an infectiousdisease caused by severe acute respiratory syndromecoronavirus 2 (SARS-CoV-2). The World Health Or-ganization (WHO) recently declared the COVID-19outbreak a Public Health Emergency of InternationalConcern (PHEIC) and a pandemic due to its highmorbidity and mortality rates. As of April 15, 2020,more than 2.04 million cases have been reported across210 countries and territories, resulting in over 133,000deaths1. These numbers are continuing to rise and thehealth systems in many countries are overwhelmed toprovide treatment. Concomitant with the pandemicare many unknowns that create a conducive environ-ment for misinformation, fake news, political disinfor-mation campaigns, scams, etc. Those malicious con-tents instigate fears or anger, capitalize on humanvulnerability, and exploit human emotion, kindness,and/or wishes for miracles.

As the coronavirus spreads like fire in the world,disinformation machines also accelerate their cam-paigns on various fronts, rendering a new infodemicbattlefield. Social media platforms such as Face-book/Instagram, Twitter, and Google/YouTube havebeen abused to disseminate erroneous contents. Whenthe whole world is scrambling to fight the COVID-19pandemic, governments and WHO also have to combatan infodemic, which is defined as “an overabundanceof information — some accurate and some not—thatmakes it hard for people to find trustworthy sourcesand reliable guidance when they need it” [Don20].The COVID-19 infodemic causes confusion, sows di-

1https://en.wikipedia.org/wiki/Coronavirus_disease_

2019

Title of the Proceedings: "Proceedings of the CIKM 2020Workshops October 19-20, Galway, Ireland"Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi

vision, incites hatred, promotes unproven cures, andprovokes social panic, which directly impacts emer-gency response, treatment, recovery, and financial andmental health during the di�cult time of self-isolation.Therefore, combating the COVID-19 infodemic is achallenging yet imperative task to solve.

In this paper, we first present some COVID-19 re-lated examples to illustrate the variety and range of in-fodemic cases in representative categories: conspiracytheories and misinformation, and scams and securityattacks to reinforce the urgency and need for address-ing the COVID-19 infodemic via scalable and timelysolutions. We then discuss the essential challenges indesigning and developing corresponding AI solutionsfrom three perspectives: data, computational tools,and ethics. The last challenge of ethics is particu-larly easy to overlook when we rush to confront theimmediate threats. Therefore, it is important to un-derstand unintended consequences when developing AIsolutions to ensure sustainable and healthy use and de-ployment. Last, we use some current e↵orts to demon-strate the feasibility of addressing the three challengesin combating the COVID-19 infodemic; Meanwhile,by understanding the challenges and what we have,we also appreciate the importance of collaborativeresearch for e↵ectively and e�ciently combating theCOVID-19 infodemic.

2 Examples of COVID-19 Infodemic

To illustrate what the COVID-19 infodemic looks like,how expansive, active, and devastating it is, and why itis important to thwart or mitigate its present threats,we first present various examples regarding conspir-acy theories and misinformation, and scam and othersecurity attacks.

2.1 Conspiracy theories and misinformation

With the spread of COVID-19 pandemic, the WorldHealth Organization (WHO) recently warned of an“infodemic” of rampant conspiracy theories about thecoronavirus. Those conspiracy theories have appearedin both social media and mainstream news outlets andare often intertwined with geopolitics. One example isabout how the new coronavirus originated: accordingto a Pew Research Center survey, nearly three-in-tenAmericans believe COVID-19 was a bio-weapon madein the lab. Some top 10 conspiracy theories includeSARS-CoV-2 virus was created as a biologic weaponfrom a lab, GMOs are the culprit, COVID-19 actu-ally doesn’t exist, and coronavirus is a plot by bigPharma [Lyn20].

Coronavirus misinformation is also flooding the in-ternet through social media, text messages, and prop-agated by celebrities, politicians, or other prominent

public figures. According to the report in [KG20],“among outlets that repeatedly share false content,eight of the top 10 most engaged-with sites are runningcoronavirus stories.” For instance, there are plenty ofsupposed “cures” on social media that will likely mis-lead people to risk their lives for quick fixes. Disre-garding the National Institutes of Health (NIH) warn-ing of many hearsay cures without evidence of curingbeing e↵ective, there are endless claims such as herbsand teas, or something of the sort that can prevent thecoronavirus. Recently, some wireless towers were dam-aged in the UK due to a false claim that radio wavessent by 5G technology are causing small changes topeople’s bodies that make them succumb to the virus.

2.2 Scam, spam, phishing, and malware

As more and more people start working or study-ing from home, cyber criminals recently shift focusto target remote workers. Di↵erent attacks such asscam, spam, phishing and malware, which prey onpeople’s willingness to help, fear of supply shortage,and moments of weakness, have become increasinglyactive. Researchers have found that the volume ofcoronavirus email scams nearly tripled in one week,with almost 3% of all global spam now estimated to beCOVID-19 related. During the coronavirus pandemic,as state governments and hospitals have scrambled toobtain masks and other medical supplies, scammers at-tempted to sell a fake stockpile of 39 million masks to aCalifornia labor union. According to The Hill [Mil20], “Hackers are taking advantage of the increased re-liance on networks to target critical organizations suchas health care groups and members of the public, steal-ing and profiting o↵ sensitive information and puttinglives at risk.”

3 Data, Tool, and Ethics Challenges

The scale, volume, and reach of the COVID-19 info-demic entails the reliance on AI and machine learn-ing (ML) algorithms to react promptly and respondrapidly. The success of AI and ML algorithms re-quires large amounts of multi-modal data for theire�ciency and e↵ectiveness, which introduces a datachallenge. Data extraction and curation from multi-source data needs di↵erent computational tools to ac-curately categorize and sort out various types of data,which presents a tool challenge. When we rush to dealwith present threats, we should be aware of poten-tial side-e↵ects, unexpected consequences, and biasesof our solutions, which suggests an ethic challenge. Inthis section, we will discuss these challenges in detail.

3.1 Data challenge

Though numerous COVID-19 data sources are avail-able online, their datasets are available on various web-sites for di↵erent needs. The major data challengeof isolated data sources is the awareness of their ex-istence. Another related issue is that they are col-lected from di↵erent sources or under di↵erent crawlsettings. For example, Allen Institute for AI (AI2)released the scholarly articles dataset2 collected fromPMC, medRxiv and bioRxiv; LitCovid [CAL20] col-lected the scientific information from PubMed. Com-bining di↵erent data sources leads to higher quality ofdata and better coverage.

To address the data challenges, we need to over-come some shortcomings: disorganization – most ofthem merely list all the collected datasets on their web-sites without information summarizing the relation-ships among them; specificity – data collected for aspecific topic, for example, Amazon provides the epi-demic dataset on cloud3 and COVID-19 GIS Hub4

only contain the academic findings and geospatial-related datasets respectively; and inconvenience –most sites merely provide the reference links to thesource datasets and do not provide data utility toolslike covid19datahub [GA20] for easy access.

3.2 Computational tool challenge

There are existing resources that can assist users toidentify malicious intent in websites. Google’s SafeBrowsing API, for instance, allows the user to enter aURL and check it against Google’s constantly updatedlists of unsafe web resources. Similar resources in-clude isitPhishing.org, malwareurl.com, and antivirussoftware, among many others. Additionally, users cancheck malicious domain lists through di↵erent sourcessuch as phishtank.com or the aforementioned Google’sSafe Browsing lists. As many malicious sites use URLshorteners to disguise themselves, to counteract po-tential attacks, it would be safe to first use URL ex-panders to figure out what they are before clickingthem. Despite the easy access of those computationaltools, they are not available conveniently in a singleplace where di↵erent tools can be called up wheneverneeded.

The awareness of these existing tools and e�cientuse of them for quick response is vital for combat-ing COVID-19. An associate issue is the requirementfor current and frequently updated black-lists [SLH17].As we know, it is infeasible to manually maintain a dy-

2https://allenai.org/data/cord-19

3https://aws.amazon.com/blogs/big-data/a-public-

data-lake-for-analysis-of-covid-19-data/

4https://coronavirus-disasterresponse.hub.arcgis.

com/

namically changing list of malicious URLs, with newsites being generated everyday. Therefore, it is neces-sary to develop AI/ML identifiers that can learn fromthe old malicious sites for estimating the threats ofnew ones.

3.3 Ethics challenge

The COVID-19 pandemic is ushering in a new era ofdigital surveillance since governments are employingtools that track and monitor individuals. South Ko-rea and Israel, for instance, have demonstrated the ef-fectiveness of harnessing di↵erent digital surveillancetools. However, such a new practice can breach dataprivacy in the meantime and may even remain in useafter the pandemic. In this section, we discuss the po-tential privacy concerns, trade-o↵s between stringentdisease monitoring and patient privacy and ethical is-sues behind the disruption of civil liberties.

Gauging the war-like severity of the coronaviruspandemic, academics, researchers, companies and non-profits alike have come forward to contribute in anypossible way. However, given the rapid nature of suchresponses and the subsequent lack of policy checks,these otherwise novel endeavors may have ethical loop-holes. In an attempt to provide a transparent view ofthe degree of infection and prevent community spreadof the virus, many counties and states in the UnitedStates have decided to publicly release data corre-sponding to cases, including the number of cases perzip-code [Mal20]. Smartphone applications with geo-locating capabilities have come out for users to logtheir symptoms. But the use of such applications hassignificant privacy concerns 5 [Wet20]. Contact trac-ing has been identified as an e↵ective way to controlthe spread of the virus in communities where the in-fection is not yet widespread or has slowed down sig-nificantly, and companies including Google and Appleare currently developing applications to make this pos-sible. Only when a su�cient number of people use theapplication and voluntarily report their cases can it beused as a reliable tool of tracking. In this situation,there is an obvious trade-o↵ between user health pri-vacy and data transparency and it is challenging toidentify well-defined ethical boundaries when it comesto public health during a pandemic. The success ofsuch an app requires a majority of the population todownload and use it.

4 Feasibility Discussion

In this section, we present some current e↵orts thataddress the aforementioned challenges and show that

5https://privacyinternational.org/examples/apps-and-

covid-19

the three challenges are solvable with collaborative re-search.

For the data challenge, we collect the publicly avail-able COVID-19 datasets and cluster them into severalgroups6. Under each group, researchers can referencecomplete datasets from di↵erent sources or settings.For example, in social media data, we gather availabletweet corpus on COVID-19 [BTW+20][CLF20] withdi↵erent query keywords and time spans. The hierar-chy cluster structure in Figure 1 helps the researchersto quickly locate the dataset. Lastly our data reposi-tory includes areas in academics, news, social media,and epidemic reports for multi-disciplinary research.For example, if a researcher wants to analyze the in-fluence of the news or academic findings on social me-dia like Twitter, s/he can use the data in academic ornews topics and social media.

COVID-19 Datasets

Social Media Twitter

Epidemic ReportResource Report

Case Report

Geo-Spatial MobilityNews

Rumor

Fact Checked

AcademicPublished

Pre-print

Figure 1: A taxonomy of collected datasets

To help a researcher easily access the datasets inthe repository, we build a data-loader7. It is a Pythonpackage with a pandas Dataframe [pdt20] by callingdata = DataLoader().download(url). This widely useddata format can help the downstream data analysis.

To tackle the tool challenge, we develop TellMe,a computational tool that provides an estimate if apiece of news or text is disinformation. Its input in-cludes URLs and text, and its output is a score basedon di↵erent functions of TellMe as shown in Figure 2:URL Checker, Fake News Classifier, Website Matcher,Credibility and Trusty. The Trusty [MYL09] andCredibility [AL13] scores are based on contents’ socialengagements that malicious users share more similar-ity than general users. The fake news score is returnedfrom a state-of-the-art fake news detector [SZL+20].The website matcher compares the input URL withwebsites that publish false information about the virusfound by NewsGuard [BC19]. In addition, we are alsoin the process of developing and integrating more com-ponents (e.g., advertisement tracker, source attribu-tor) and algorithms [DLBL19, DLL19, DLD+19] intothe TellMe system.

Now, we use fake news as an example to illustrateour attempts to learn with weak social supervisionto detect COVID-19 disinformation more e↵ectivelyand with explainability. First, for e↵ective fake news

6https://github.com/bigheiniu/awesome-coronavirus19-

dataset

7https://github.com/bigheiniu/COVID-19-Dataloaders

Figure 2: The current components of TellMe.

detection, we consider the relationships among pub-lishers, news pieces, and consumers, which is moti-vated by existing sociological studies on journalismon the correlation between the partisan bias of pub-lishers, the credibility of consumers, and the veracitydegree of news content; and explore various auxiliaryinformation from these relations to help detect fakenews [SWL19]. Second, for explainable fake news de-tection, we aim to derive explanation of prediction re-sults to help decision makers and practitioners; we at-tempt to explore user comments as a source and mineinformative and relevant pieces to help explain why apiece of news is predicted as fake, and pinpoint morefictional text in news text simultaneously [SCW+19].

To tackle the ethics challenge due to the increasein government surveillance and prevalence of smart-phone apps to collect and gather user/patient data,we need to take into account legitimate concerns re-garding privacy and the degree to which such a regimeof monitoring and enforcement will a↵ect democracyafter the pandemic ends. It requires us to understandand acknowledge the fact that there is a clear di↵er-ence between standard biomedical ethics versus pri-vacy concerns and ethics during a public health crisis.Governments and public health o�cials may need totake certain measures aimed at minimizing the dam-age caused by the virus and for the common goodduring this trying time, which under normal circum-stances might have been inappropriate. Nevertheless,measures could be taken to avoid potential misuse ofdata. One possible way to have better guarantees onuser privacy would be to make these contact tracingsmartphone applications communicate in an encryptedpeer to peer way rather than storing all the data in acentral server. These technologies should also be de-ployed in a way that is as transparent as possible, sothat the user is fully aware of what and how muchpersonal information he/she permits the applicationto use. Furthermore, there is significant ongoing dis-cussion among experts, researchers and policy-makersregarding a steady recovery into a normal function-ing society. For example, the ethics research group atHarvard University makes e↵orts at finding solutionswithout compromising user privacy to keep civil lib-erty and democracy at the forefront.

5 Looking Ahead

The significance of combating the COVID-19 info-demic lies at protecting people from falling victims tothe pandemic in this unexpected front and from dis-rupting otherwise already inconvenient daily routines

so as to improve our resilience in our fight to con-tain the pandemic. In this position paper, we showa good number of problems posed by the COVID-19infodemic, the vast amounts of data generated in theworld’s e↵ort to contain the pandemic, and the needfor concerted e↵orts at various levels to e�ciently ande↵ectively deal with current and future challenges inmedical and information fronts.

It is evident that (1) we face both immediate and fu-ture challenges in this unprecedented fight, (2) existingdata will grow fast, and existing computational toolsare insu�cient to contain and mitigate the COVID-19 infodemic, and (3) short-term solutions can havepotential long-term impact. Therefore, when we facehard choices, we need to resist the temptation to trade-o↵ so as to minimize long-term negative impact; whenwe search for solutions, we should consider those em-ploying crowdsourcing and take long views for fair-ness and responsibility; when we design methods, weshould rely on collective wisdom and diversity to aimfor robustness; and when we form teams, we shouldgive priority to multi-disciplinary collaboration andpreemptively address hidden biases. Our future willalways be uncertain, but with the advancement in sci-ence and technology and with our preparedness trainedand tested in our concerted e↵orts to contain the pan-demic in all fronts, our future will surely be brighterand healthier.

Acknowledgements

This work is, in part, supported by Global SecurityInitiative (GSI) at ASU and by NSF grants (2029044and 1614576). We would like to thank Denis Liu forhelping develop earlier versions of TellMe and for care-fully proofreading an earlier version of this paper.

References

[AL13] Mohammad-Ali Abbasi and Huan Liu.Measuring user credibility in social media.In SBP-BRiMS, 2013.

[BC19] S Brille and G Crovitz. Newsguard nowavailable on microsoft edge mobile apps forios and android, 2019.

[BTW+20] Juan M. Banda, Ramya Tekumalla,Guanyu Wang, Jingyuan Yu, Tuo Liu,Yuning Ding, and Gerardo Chowell. Alarge-scale covid-19 twitter chatter datasetfor open scientific research – an interna-tional collaboration, 2020.

[CAL20] Q. Chen, A. Allot, and Z. Lu. Keep upwith the latest coronavirus research. Na-ture, 2020.

[CLF20] Emily Chen, Kristina Lerman, and EmilioFerrara. Covid-19: The first public coron-avirus twitter dataset, 2020.

[DLBL19] Kaize Ding, Jundong Li, RohitBhanushali, and Huan Liu. Deep anomalydetection on attributed networks. InSDM, 2019.

[DLD+19] Kaize Ding, Jundong Li, Shivam Dhar,Shreyash Devan, and Huan Liu. Interspot:interactive spammer detection in socialmedia. In Proceedings of the 28th Inter-national Joint Conference on Artificial In-telligence, pages 6509–6511. AAAI Press,2019.

[DLL19] Kaize Ding, Jundong Li, and HuanLiu. Interactive anomaly detection on at-tributed networks. In Proceedings of theTwelfth ACM International Conference onWeb Search and Data Mining, pages 357–365, 2019.

[Don20] Joan Donovan. Here’s how so-cial media can combat the coron-avirus ‘infodemic’, 2020. https://www.technologyreview.com/s/615368/facebook-twitter-social-media-infodemic-misinformation/.

[GA20] Emanuele Guidotti and David Ardia.Covid-19 data hub, 04 2020.

[KG20] Kornbluh and Ellen P. Goodman.Safeguarding digital democracy, 2020.http://www.gmfus.org/publications/safeguarding-democracy-against-disinformation.

[Lyn20] Mark Lynas. Covid: Top 10 currentconspiracy theories, 2020. https://allianceforscience.cornell.edu/blog/2020/04/covid-top-10-current-conspiracy-theories/.

[Mal20] Laurel Mallory. Sc health o�cials up-date list of confirmed and estimatedcoronavirus cases by zip code, 2020.https://www.wtoc.com/2020/04/10/dhec-releases-number-confirmed-estimated-coronavirus-cases-by-zip-code/.

[Mil20] Maggie Miller. Virtual army rising up toprotect health care groups from hackers,2020. https://thehill.com/policy/cybersecurity/493997-virtual-army-

rising-up-to-protect-healthcare-groups-from-hackers/.

[MYL09] Sai T Moturu, Jian Yang, and Huan Liu.Quantifying utility and trustworthiness foradvice shared on online social media. InCSE, 2009.

[pdt20] The pandas development team. pandas-dev/pandas: Pandas, February 2020.

[SCW+19] Kai Shu, Limeng Cui, Suhang Wang,Dongwon Lee, and Huan Liu. defend: Ex-plainable fake news detection. In KDD,2019.

[SLH17] Doyen Sahoo, Chenghao Liu, andSteven CH Hoi. Malicious url detectionusing machine learning: A survey. arXivpreprint arXiv:1701.07179, 2017.

[SWL19] Kai Shu, SuhangWang, and Huan Liu. Be-yond news contents: The role of social con-text for fake news detection. In WSDM,2019.

[SZL+20] Kai Shu, Guoqing Zheng, Yichuan Li,Subhabrata Mukherjee, Ahmed HassanAwadallah, Scott Ruston, and Huan Liu.Leveraging multi-source weak social su-pervision for early detection of fake news.arXiv preprint arXiv:2004.01732, 2020.

[Wet20] Nicole Wetsman. Personal privacymatters during a pandemic — but lessthan it might at other times, 2020.https://www.theverge.com/2020/3/12/21177129/personal-privacy-pandemic-ethics-public-health-coronavirus/.