The Myths of Our Time: Fake News - arXiv · desires the same way myths, folk tales, and urban legends do (an example in footnote1). We generate fake news using Machine Learning (ML)

The Myths of Our Time: Fake News

Vıt Ruzicka, Eunsu Kang, David Gordon,Ankita Patel, Jacqui Fashimpaur, Manzil Zaheer

Carnegie Mellon [email protected], [email protected], [email protected],

[email protected], [email protected], [email protected]

Abstract

While the purpose of most fake news is misinformation andpolitical propaganda, our team sees it as a new type of myththat is created by people in the age of internet identities andartificial intelligence. Seeking insights on the fear and desirehidden underneath these modified or generated stories, we usemachine learning methods to generate fake articles and presentthem in the form of an online news blog. This paper aims toshare the details of our pipeline and the techniques used forfull generation of fake news, from dataset collection to presen-tation as a media art project on the internet.

Keywords: Fake news, Article generation, LSTM, RNN,Language model, Machine learning, AI, Media art, Internet

art, Web, Blog, Human-AI Co-Creation

IntroductionIs fake news a new type of myth that people are creating inthe age of internet and artificial intelligence? K. Shu et al saysfake news can have many definitions, and one narrow def-inition is “a news article that is intentionally and verifiablyfalse.” [12] While the purpose of fake news, according to K.Starbird, is disinformation and political propaganda [13], itoften gives us some insights into people’s hidden fears anddesires the same way myths, folk tales, and urban legendsdo (an example in footnote1). We generate fake news usingMachine Learning (ML) algorithms in an attempt to createnew myths of our time and share them in the form of an on-line blog, www.newsby.ml2. Our project is not designed tolure people into a specific perspective but to make a visiblestatement on this phenomenon in the context of art, which of-fers multi-layered provocations and interpretations. We plan

1News story [6], which is about a man who was run over by aparade car that was carrying dancing people at the Queer CultureFestival in Jeju city of Korea, spread quickly with a photo of a manunder a truck. The actual event details came out eventually: the manhad crawled under the car himself and he was mildly injured duringhis resistance against the policeman who was pulling him out for hisown safety. This fake news did not describe the actual event but whatthe writer wanted to see to be able to frame people of the festival asa danger to their society. While it stimulated confusion as intendedand strengthened the hate-cartel, it also vividly revealed their fearand worldview that had been hidden under their social masks.

2NewsBy.Ml blog url: www.newsby.ml

Figure 1: Snapshot of the website used to present gener-ated articles in a form of blog. Publicly available at www.newsby.ml.

to further develop this project and contribute to fake news de-tection research by providing a labeled dataset.

There have been other fake text generation projects such asgenerative reviews by Y.Yao et al [16] and generative HarryPotter books [14]. Our project, however, focuses on generat-ing longer articles from a dataset of texts with inconsistentwriting styles.

MethodThe developed pipeline is presented in Figure 2.

Dataset collection and filtrationIn order to generate fake news, we need to collect a textualdataset corresponding to real world news articles. We startedthis effort by scraping news articles from websites. We usedsearch terms from commonly used search APIs to get a largeamount of generic articles, and also specifically searched withsome topics in mind. This resulted in a large dataset of down-loaded articles with varying topics. More specifically, wehave assembled a dataset of 245,973 articles counting totally196,952,689 words.

arX

iv:1

908.

0176

0v1

[cs

.LG

] 5

Aug

201

9

www.newsby.ml

www.newsby.ml

www.newsby.ml

www.newsby.ml

ARTICLE- Title-

- Body

ARTICLE- Title-

- Body

original text dataset

(245973 articles)

topic specific dataset

LSTM

News by Misun Lean

keyword selection training LSTM models

hidden state size 128

LSTMhidden state size 128

generated output

keep creative sentences

Self-perpetuating circle, which covers anti-fake news on social media That’s reading the Fake News Networks, most of the people in this blog is not the topic of this blog and comments about Bob are off topic. The “Fakers” at CNN, NBC, ABC & CBS have done so much dishonest reporting that we are not allowed to know what the beginnings of national security, and will be the floodgates really

assembly articles

ARTICLE- Title- Excerpt- Body

- Misun Lean fake identity- blog design

CC Searchfor photo

TextBoxautomatic NLP

tagging

presentation



Figure 2: Fake news generation pipeline diagram

Table 1: Topics and keywords used to create specialized datasets.

topic keywords #articles #words

Asia Now: North Korea ’korea’, ’korean’, ’koreas’, ’koreans’, ’pyongyang’, ’nkorea’, ’jongun’, ’jongil’ 3064 13 572 577America Now: Politics ’trump’, ’america’, ’obama’, ’obamas’, ’american’, ’americas’, ’washington’, ’california’, ’fbi’, ’mexico’, ’florida’ 9000 9 033 277

Fake News and Journalism ’fake’, ’truth’, ’false’, ’wrong’, ’journalists’, ’intelligence’, ’zuckerberg’, ’illegal’, ’crime’, ’terror’, ’whistleblower’, ’fbi’,’cia’, ’journalist’, ’tweets’, ’instagram’, ’authorities’, ’reporter’, ’surveillance’, ’allegations’, ’wikileaks’, ’controversial’ 9481 7 128 962

We then chose to select a few subsets of this initial largedataset to create several smaller filtered and biased datasets.Using a Natural Language Processing (NLP) tool from Ma-chinebox Textbox [2], we performed an entity extraction andautomatic tag extraction on each article. We annotated ourdataset with multiple tags describing the topics each article isaddressing.

We manually created several categories of keywords,which are used to select subsets of the original large dataset.See these categories and their corresponding keywords in Ta-ble 1. If an article is tagged with at least one of the labels fromthe selected keyword set, it will be added to the correspondingsubset of articles.

This process creates unequally sized subsets of the originaldataset, which can overlap with each other, but which alsocorrespond to one selected topic.

Training LSTM language modelsHaving assembled several specialized datasets of articles, wetrain a language model for each one of the categories.

Language model tries to estimate the probability of se-quence of words w = (w1, . . . , wn) conditioned on thelearned model as p(w | model). Given the fact that each worddepends on the previous words in the sequence, we use thechain rule of probability and estimate the conditional proba-bility of the next word wi as:

n∏i=1

p(wi | w1, . . . , wi−1,model) (1)

We are using the Long-Short Term Memory (LSTM)[7, 10, 11] Recurrent Neural Network to estimate p(wi |w1, . . . , wi−1,model) while also using beam search [5] whenparsing the search tree. In this way we are creating specializedmodels, which are trained to learn the specific patterns in eachof these subsets. Given the fact that each of these datasetscontains a different number of articles and words in total, themodel architectures can also be different. Larger datasets pro-vide more data, and lend themselves to bigger models withmore parameters. Generally, we use an LSTM model withtwo layers and 128 LSTM units.

The whole dataset of articles is converted into one largecorpus of text, and is tokenized into vector representationsof words. For the model we are using a github repositoryhunkim/word-rnn-tensorflow implementation [9]. We also ex-perimented with converting each character into its vector rep-resentation (so called “character based” RNN), but empiri-cally we received worse results with this method.

Filtering generated textOnce LSTM models are trained to capture the underlyingpatterns of each individual dataset, we can generate a largeamount of text from each of these models. The resulting gen-erated text, however, contains samples which seem to be stuckin a loop or which mimic the original dataset perfectly with-out any innovation. See Figure 3 for an illustration of theseproblems. We have chosen to analyze the generated text’snovelty as compared with its original text dataset by compar-ing these two sets sentence by sentence. For each generatedsentence we used the Levenstein distance with every sentenceof the original dataset to obtain the closest match. We havedecided to keep only the sentences which are more than 30%dissimilar to their closest match in the original dataset. Thisgives us a large amount of generated and creative samples tochoose from for the task of assembling the final fake newsarticles.

Some generated paragraph examples are:

“A climate scientist at the RAND Corporation, said thatthe United States would prematurely withdraw from Syria,”Trump wrote in The Washington Post. ... Trump scoffed. Healso said he would not be able to comment on the notion. Itis unclear whether he was not reimbursed by the White Houseand the Department of Homeland Security. ... “Don’t worry,we are going to be able to get rid of the United States,” hewrites. “I think that we are going to be great,” Trump said in astatement that he was fired.”

(Category: America Now: Politics)

“Seventeen-year-old Kim Jong Il was initially sentenced to 24years in prison and fined 18 billion won (US$16.8 million) onTwitter. He also took a group photo with them from Tianan-men Square, according to South Korean pop stars visiting Py-ongyang.”

“Packs of wolves are coming from North Korea on Saturday tothe United States in exchange for a freeze on North Korean ex-ports of coal, iron, iron ore and seafood, oil and gas pipelines.”

(Category: Asia Now: North Korea)

“European Federation of Journalists, today condemned the im-portance of the digitalization of mass media. Sarah HuckabeeSanders lambastes fake news on Russia Fraud Human RightsWatch and Society of Journalists.”

“Comments about Bob are off topic. The “Fakers” at CNN,NBC, ABC, CBS have done so much dishonest reporting thatwe are not allowed to know what the beginnings of nationalsecurity.”

“We have a self-perpetuating circle, who covers anti-fake newson social media and media freedom in Croatia.”

(Category: Fake News and Journalism)

Article assembly

''It's OUT Is It "It’s not a lot of creativity. The Reverend Piepkorn told WND. President George W. Bush and Barack Obama is expected to attend the wedding of Amira Sajwani, the daughter of the Central Intelligence Agency, Geoffrey Berman, was convicted in the White House and the White House and the White House and the White House and the White House. Evident throughout his office in the United States and the White House and the rest of the country in the White House and the White House Correspondents’ Dinner. “The Chairman of the Reconnaissance General Bureau,” April 6, 2018 Ivanka Trump (C) chats with President-elect Donald Trump in Shah for the White House for Washington. "REUTERS/Carlos Barria Skye Kennedy (C), 5, Seamus Menefee (L), 7, and Jack Kennedy, 8, participate in a motorcade for the feasibility of the Interior Department said the $9.9 million hour-long Hotel in New York City, hotel room in Milwaukee, Wisconsin, Colorado, Arkansas, Arizona, Public Service activists, including Pennsylvania; Defeated .the “Order of Honor” in order to get rid of the United States and the United States and North Korea and the United States and North Korea and the United States and North Korea and the United States and North Korea and the United States and North Korea will be held in the United States and North Korea and the United States and North Korea as a whole." Jose Manuel Barroso, director of the Affairs of Bruegel, the American Lung Association for Health and Human Services Secretary Timothy M. Goodman of Connecticut (father of the Washington Post’s Jenna Johnson.

Marking: [ ] repeated snipets, [ ] seemingly loss of continuity, jumping from topic to topic

Figure 3: Sample of generated raw text with problems thatrequire manual corrections highlighted. We can see repeatedfragments of text and topical loss of focus.

Note that these selected and filtered creative generated textsamples still contain a lot of sentences which would not foola human judge. For example, there are sentences which havemany repetitions, some syntactic and grammatical errors, andmiss-spellings and also nonsense sentences. However, we dis-covered that these sentences are good enough to fool somefake news classifiers such as FakerFact [4] or Textbox [2].This shows how hard the problem of fake news detection is.

We are assembling a new news article from these generatedsentences for which we chose the following structure: (i) title,(ii) short excerpt used as a short description of the article andfinally (iii) the main body of the article. See example articlesin Figure 4.

We designed and used the below rules for a very limitedamount of intervention when assembling article excerpts.

• Around 50-100 words for the excerpts of eacharticle.

• We can delete or replace one letter from aword or one word from a sentence to fix obvi-ously redundant elements.

• We can delete sentences that do not work withthe overall subject of the paragraph.

• We can change the order of sentences to makethe article more legible.

Title: GPS-guidance system hovering over the Sea of Japan

Excerpt: 74 percent of Vladivostok, the vicinity of the United States and South Korea are now underway; Japan and brokers take care of the counterbattery role—its GPS-guidance system is now hovering over the Sea of Japan. A second official said that the United States would not surprise talks with North Korea and the United States to swallow.

Article main text: Kim Jong Un clapped his hands as he, along with his wife and hundreds of other citizens, community,” slain distinguish advised punishment, distinguish distinguish photographed unclassified comment, photographed seize confirmation, seize efforts alarmed prefers permit responsibility, prefers treating borrowed fronts undercover cheerleaders, to the cessation of a vanadium mine, bronze medalist in Snow-White-like glass sarcophagus.

You know, I am not and I have to say no,” the former Algeria boss, who hacked Sony at a tournament, and the injustice of Transnistria, and it has been a central part of the Korean War and has functionally incorporated into a polite threat, logistics, and conducts ranges of patriotism and FinCEN, that the North Korean madness may well be nearing its endgame.

Foto: “X-Band Radar headed out to sea.” by Daniel Ramirez is licensed under CC BY 2.0 Tags: KCNA NEWS AGENCY, KIM JONG UN, KOREAN PENINSULA, NORTH KOREAN EMBASSY, UNITED STATES

Title: There are no people — there are some of our features

Excerpt: Everyone was the most authentic? “Everybody spies”, the publisher of the Country Institute Zack Beauchamp Channel, founder of Inquiry revealed that he was killed in the 3.7 million federal bot anti-communist teams, as well as the Firm and a doll, the grandson of Hope.

There are no people — there are some of our features, such as forums. Religion and Mail in Switzerland, dwindled. But it is unclear whether he is not the only one with the Trump administration.

Article main text: Hand-selected phone addresses, waiting for the aircraft’s WALL or Kam retain ownership of the underbelly of the 1917 presidential century. The private-equity honor was extracted in the House… Against the program” of the United States of America and its allies are going to be guarding our border with the military,” Trump said in an interview with The Associated Press.

The Latest on the Theft of the United States would be a demonstration of moral strength. Prolong the conferences to ban atomic tests because the United States has agreed to suspend tests as long as negotiations are more likely to solve the nuclear deal with the United States and North Korea as a whole.” ”It’s OUT Is It “It’s not a lot of creativity.

Foto: “President Cox Visits Border Patrol Agents” by AFGE is licensed under CC BY 2.0 Tags: AIRCRAFT ’S WALL, FEDERAL BOT, HOMELAND SECURITY, SECRETARY KIRSTJEN NIELSEN, TRUMP, UNITED STATES

Figure 4: Excerpts from the publicly available generated fakenews articles, showing the article structure. Note that weshow only a part of the article and that the tags were auto-matically generated by [2].

For the longer article bodies, we further limited the amountof interventions, only allowing ourselves to throw away somecompletely broken sentences.

Article titles were chosen by the authors from the arti-cle body. The illustrative image was selected by searching adatabase of Creative Commons images [3] using our articletitles. We quote the author and the original name of the usedphotograph as required by the Creative Commons.

In both cases, the selection process was chosen to combatthe current shortcomings of generative LSTM models withthe least amount of intervention possible.

Tag generationAfter generating a number of articles for each thematic cate-gory, we present them on a newly built website. Finally weuse an automated NLP tagging solution from MachineboxTextbox [2] to get realistic tags for each of these articles.

ResultsFake news blog as an art projectOur project is presented in the form of a blog, which is one ofthe ways fake news is distributed online. The correspondent,Misun Lean, is a fake journalist identity created just for this

project. Her name comes as a word play on the abbreviation“ML” (hence the blog name “News by ML”) and her photo-graph shown in Figure 5 was generated using PGGAN [8], aML algorithm for generating high resolution portraits.

Figure 5: Portrait of a fake journalist Misun Lean generatedby PGGAN [8].

We select article topics often used for spreading fake news,with focus on popular global issues and regions of conflict.In particular, we have selected Asia now: North Korea andAmerica now: Politics. These topics are used either to fo-cus the attention of readership on issues occurring in othercountries, or to sway the opinions of masses during elections[15, 1]. We also selected the topic of Fake news and Journal-ism to serve as a mirror on how the problem of fake news isbeing covered by the media itself.

We designed the project website, shown in Figure 1, to fol-low the general structure of many news and opinion blogs.This website as a media art project proactively employs theculture and technology of current web environment to makean artistic statement on the phenomenon of fake news thatspreads through internet.

Article credibility analysisWe ran FakerFact [4], which is trained to classify the inten-tion of online texts, on our generated news articles. The anal-ysis did not detect any red flags, however some articles wereclassified as opinionated, sensational, or agenda driven. Forexample, the analysis said “Walt says it sounds the authormay be more focused on pushing an agenda than sharing thefacts” for our article “Israel continued to enlarge the ransack-ing of artists”, which was classified as Agenda Driven and“Hmm, Walt doesn’t see strong red flags, but it’s possible theauthor has a slight opinion” for our article “GPS-guidancesystem hovering over the Sea of Japan”, which was classifiedas Low Opinion. We consider this analysis to be consistentwith our intention to appear as an opinionated political blog.

ConclusionFor this project, we have created our own dataset, generatedevery integral part of a fake news blog including the corre-spondent’s identity, and presented it in the form of online blogwhich acts as an art project to provoke conversations aboutfake news and the human desires behind this phenomenon.In this paper, we have shared the overall pipeline and details

of our methods with a hope of helping other artists createmore projects that discusses the phenomenon of fake news inour society as well as giving the general audience an under-standing of the process of fake news generation. We share theproject at GitHub https://github.com/previtus/fake_news_generation_mark_I.

For our future work, we plan to generate a new dataset cat-egory about Artificial Intelligence. We want to capture how itis being presented online, since a lot of excitement and fearsurrounding it is based on false or exaggerated information.We are also considering creating a new dataset from “alter-native media” websites that contribute to the propagation offake news [13].

References[1] Albright, J. 2017. Faketube: Ai-generated news on

youtube.

[2] Box, M. 2018. Machine box - textbox.

[3] Commons, C. 2018. Creative commons search.

[4] FakerFact. 2017. Fakerfact.org.

[5] Graves, A. 2012. Sequence transduction with recurrentneural networks.

[6] Hankyoreh, h. 2018.제주퀴어축제차량이사람덮쳤다?‘가짜뉴스공장’또걸렸다 / jeju queer festival vehicle hashit people? ’fake news factory’ took another shot.

[7] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. 9:1735–80.

[8] Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2017.Progressive growing of gans for improved quality, stabil-ity, and variation. CoRR abs/1710.10196.

[9] Kim, S. 2017. Multi-layer recurrent neural networks(lstm, rnn) for word-level language models in pythonusing tensorflow. https://github.com/hunkim/word-rnn-tensorflow.

[10] Lipton, Z. C. 2015. A critical review of recurrent neuralnetworks for sequence learning. CoRR abs/1506.00019.

[11] Mikolov, T.; Karafiat, M.; Burget, L.; Cernocky, J.; andKhudanpur, S. 2010. Recurrent neural network based lan-guage model. volume 2, 1045–1048.

[12] Shu, K.; Sliva, A.; Wang, S.; Tang, J.; and Liu, H. 2017.Fake news detection on social media: A data mining per-spective. CoRR abs/1708.01967.

[13] Starbird, K. 2017. Examining the alternative mediaecosystem through the production of alternative narrativesof mass shooting events on twitter. In Proceedings of theEleventh International Conference on Web and Social Me-dia, ICWSM 2017, Montreal, Quebec, Canada, May 15-18, 2017., 230–239.

[14] Studios, B. 2018. Harry potter and the portrait of whatlooked like a large pile of ash.

[15] Tavernise, S. 2016. As fake news spreads lies, morereaders shrug at the truth.

https://github.com/previtus/fake_news_generation_mark_I

https://github.com/previtus/fake_news_generation_mark_I

https://github.com/hunkim/word-rnn-tensorflow

https://github.com/hunkim/word-rnn-tensorflow

[16] Yao, Y.; Viswanath, B.; Cryan, J.; Zheng, H.; and Zhao,B. Y. 2017. Automated crowdturfing attacks and defensesin online review systems. CoRR abs/1708.08151.

Authors BiographiesVıt Ruzicka received his B.Sc. and M.Sc. with Honors inComputer Sciences with specialization in Machine Learning,Computer Graphics and Interaction from the Czech Techni-cal University in Prague, Czech Republic in 2017. He spentnearly two exciting years of research internships at the Elec-trical and Computer Engineering department of CarnegieMellon University in USA (September 2017 - May 2018)and at the EcoVision lab of the Photogrammetry and RemoteSensing group at ETH Zurich in Switzerland (January 2019- July 2019). His research interest are Machine Learning, itsapplication to other disciplines, Computer Vision, CreativeAI and the intersections of Machine Learning and Art.Eunsu Kang is a Korean media artist who creates interac-tive audiovisual installations and AI artworks. Her current re-search is focused on creative AI and artistic expressions gen-erated by Machine Learning algorithms. Creating interdisci-plinary projects, her signature has been seamless integrationof art disciplines and innovative techniques. Her work hasbeen invited to numerous places around the world includingKorea, Japan, China, Switzerland, Sweden, France, Germany,and the US. All ten of her solo shows, consisting of individ-ual or collaborative projects, were invited or awarded. Shehas won the Korean National Grant for Arts three times. Herresearches have been presented at prestigious conferences in-cluding ACM, ICMC, ISEA, and NeurIPS. Kang earned herPh.D. in Digital Arts and Experimental Media from DXARTSat the University of Washington. She received an MA in Me-dia Arts and Technology from UCSB and an MFA from theEwha Womans University. She had been a tenured art profes-sor at the University of Akron for nine years and is currently aVisiting Professor with emphasis on Art and Machine Learn-ing at the School of Computer Science, Carnegie Mellon Uni-versity.David Gordon is an interdisciplinary artist and engineer liv-ing in the Los Angeles area. He has a specialty in simulationfor autonomous systems, and currently works in NVIDIA’sautonomous driving simulation division. He received a BCSA(Bachelors in Computer Science and Arts) from CarnegieMellon University in 2019.Manzil Zaheer earned his Ph.D. degree in Machine Learn-ing from the School of Computer Science at Carnegie MellonUniversity under the able guidance of Prof Barnabas Poczos,Prof Ruslan Salakhutdinov, and Prof Alexander Smola. Heis the winner of Oracle Fellowship in 2015. His research in-terests broadly lie in representation learning. He is interestedin developing large-scale inference algorithms for representa-tion learning, both discrete ones using graphical models andcontinuous with deep networks, for all kinds of data. He enjoylearning and implementing complicated statistical inference,data-parallelism, and algorithms in a simple way.

Documents

The Myths of Our Time: Fake News - arXiv · desires the same way myths, folk tales, and urban legends do (an example in footnote1). We generate fake news using Machine Learning (ML)