Sources, tools, datasets
(thanks in advance for helping me extend this by sending tools you've found!)

Bettina Berendt
Vienna Summer School 2015

(full slidesets to follow!)


Lecture 1

References

A good textbook on Text Mining:
• Feldman, R. & Sanger, J. (2007). The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.

An introduction similar to this one, but also covering unsupervised learning in some detail, and with lots of pointers to books, materials, etc.:
• Shaw, R. (2012). Text-mining as a Research Tool in the Humanities and Social Sciences. Presentation at the Duke Libraries, September 20, 2012. https://aeshin.org/textmining/

An overview of news and (micro-)blogs mining:
• Berendt, B. (in press). Text mining for news and blogs analysis. To appear in C. Sammut & G.I. Webb (Eds.), Encyclopedia of Machine Learning and Data Mining. Berlin etc.: Springer. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_encyclopedia_2015_with_publication_info.pdf

See http://wiki.esi.ac.uk/Current_Approaches_to_Data_Mining_Blogs for more articles on the subject.

Individual sources cited on the slides

• Mei, Q. & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. KDD 2005: 198-207.

• Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness. In Proc. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.6759

• Kirschenbaum, M. "The Remaking of Reading: Data Mining and the Digital Humanities." In NGDM 07: National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation. http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf

• Mueller, M. (2007). "Notes towards a user manual of MONK." https://apps.lis.uiuc.edu/wiki/display/MONK/Notes+towards+a+user+manual+of+Monk

• Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L. & Ducceschi, L. (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems, 3(1). http://csee.essex.ac.uk/poesio/publications/poesio_et_al_ACM_TIIS_13.pdf

• von Ahn, L. (2005). Human Computation. PhD Dissertation. Computer Science Department, Carnegie Mellon University. http://reports-archive.adm.cs.cmu.edu/anon/usr0/ftp/usr/ftp/2005/abstracts/05-193.html

• von Ahn, L. (2006). Games with a Purpose. IEEE Computer 39(6): 92-94.

More DH-specific tools

Overviews of 71 tools for Digital Humanists

• Simpson, J., Rockwell, G., Chartier, R., Sinclair, S., Brown, S., Dyrbye, A., & Uszkalo, K. (2013). Text Mining Tools in the Humanities: An Analysis Framework. Journal of Digital Humanities, 2(3). http://journalofdigitalhumanities.org/2-3/text-mining-tools-in-the-humanities-an-analysis-framework/

• See also the link collection on the Voyant documentation Web page

Tools (powerful, but require some computing experience)

• LingPipe
▫ linguistic processing of text including entity extraction, clustering and classification, etc.
▫ http://alias-i.com/lingpipe/

• OpenNLP
▫ the most common NLP tasks, such as POS tagging, named entity extraction, chunking and coreference resolution.
▫ http://opennlp.apache.org/

• Stanford Parser and Part-of-Speech (POS) Tagger
▫ http://nlp.stanford.edu/software/tagger.shtml

• NLTK
▫ Toolkit for teaching and researching classification, clustering and parsing
▫ http://www.nltk.org/

• OpinionFinder
▫ identifies subjective sentences, the source (holder) of the subjectivity, and words that are included in phrases expressing positive or negative sentiments.
▫ http://code.google.com/p/opinionfinder/

• Basic sentiment tokenizer plus some tools, by Christopher Potts
▫ http://sentiment.christopherpotts.net

• Twitter NLP and Part-of-speech tagging
▫ http://www.ark.cs.cmu.edu/TweetNLP/

Further tools (thanks for your suggestions!)

• ATLAS.ti: "Qualitative data analysis"
▫ http://atlasti.com/
▫ Commercial product, has free trial version

Gamergate sources

• Budac, A., Chartier, R., Suomela, T., Gouglas, S., & Rockwell, G. (2015) #GamerGate: Distant Reading Games Discourse. Paper presented at the CGSA 2015 conference at the HSSFC Congress at University of Ottawa, Ottawa, Ontario, June 2015.

• Rockwell, G. (2015). Appendix 1: Ethics of Twitter Gamergate Research.

• Rockwell, Geoffrey; Suomela, Todd, 2015, "Gamergate Reactions", http://dx.doi.org/10.7939/DVN/10253 V5 [Version].

Lecture 2

Tools directly for sentiment analysis

• SentiStrength (sentistrength.wlv.ac.uk)
• TheySay (apidemo.theysay.io)
• Sentic (sentic.net/demo)
• Sentdex (sentdex.com)
• Lexalytics (lexalytics.com)
• Sentilo (wit.istc.cnr.it/stlab-tools/sentilo)
• nlp.stanford.edu/sentiment

Lexicons

• Bing Liu's opinion lexicon
▫ http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

• MPQA subjectivity lexicon
▫ http://www.cs.pitt.edu/mpqa/

• SentiWordNet
▫ Project homepage: http://sentiwordnet.isti.cnr.it
▫ Python/NLTK interface: http://compprag.christopherpotts.net/wordnet.html

• Harvard General Inquirer
▫ http://www.wjh.harvard.edu/~inquirer/

• Disagree on some-to-many words (see Potts, 2013)

• SenticNet
▫ http://sentic.net
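
The note that these lexicons disagree on some-to-many words can be checked mechanically once two lexicons are loaded as word-to-polarity mappings. Below is a minimal sketch in Python, assuming each lexicon has already been reduced to a dict from word to "positive" or "negative". The function names and the two tiny lexicons are hypothetical and serve only as illustration; they are not taken from any of the resources listed above.

```python
def polarity_disagreements(lexicon_a, lexicon_b):
    """Return the shared words on which two polarity lexicons disagree."""
    shared = set(lexicon_a) & set(lexicon_b)
    return sorted(w for w in shared if lexicon_a[w] != lexicon_b[w])


def disagreement_rate(lexicon_a, lexicon_b):
    """Fraction of shared words with conflicting polarity (0.0 if no overlap)."""
    shared = set(lexicon_a) & set(lexicon_b)
    if not shared:
        return 0.0
    return len(polarity_disagreements(lexicon_a, lexicon_b)) / len(shared)


# Hypothetical toy entries, invented for this example only.
toy_a = {"good": "positive", "bad": "negative", "cheap": "positive", "fine": "positive"}
toy_b = {"good": "positive", "bad": "negative", "cheap": "negative", "odd": "negative"}

print(polarity_disagreements(toy_a, toy_b))
print(disagreement_rate(toy_a, toy_b))
```

Only words present in both lexicons are compared, so differences in coverage do not count as disagreement. With real lexicons, the loading step (file formats, graded scores versus binary labels) would need its own handling.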

(Some) datasets

From Potts (2013), p.5

● More on Twitter datasets, including critical appraisal: Saif et al. (2013)

More datasets

From Tsytsarau & Palpanas (2012)

More datasets

• SNAP review datasets: http://snap.stanford.edu/data/

• Yelp dataset: http://www.yelp.com/dataset_challenge/

• User intentions in image capturing: a dataset going beyond text
▫ Contributed by Desara Xhura – thanks!
▫ http://www.itec.uni-klu.ac.at/~mlux/wiki/doku.php?id=research:photointentionsdata
▫ Papers on this project: http://www.itec.uni-klu.ac.at/~mlux/wiki/doku.php?id=start

Surveys used for this presentation

Ronen Feldman: Techniques and applications for sentiment analysis. Commun. ACM 56(4): 82-89 (2013).

Bing Liu, Lei Zhang: A Survey of Opinion Mining and Sentiment Analysis. Mining Text Data 2012: 415-463.

Bo Pang, Lillian Lee: Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2): 1-135 (2007).

Potts (2013). Introduction to Sentiment Analysis. http://www.stanford.edu/class/cs224u/slides/2013/cs224u-slides-02-26.pdf

Mikalai Tsytsarau, Themis Palpanas: Survey on mining subjective data on the web. Data Min. Knowl. Discov. 24(3): 478-514 (2012)

My summary of these (an earlier and longer version of the present slides): Berendt, B. (2014). Opinion mining, sentiment analysis, and beyond. Lecture at the Summer School Foundations and Applications of Social Network Analysis & Mining, June 2-6, 2014, Athens, Greece. http://people.cs.kuleuven.be/~bettina.berendt/Talks/berendt_opinion_mining_summerschool_2014.pptx

Other references

Carenini, G., R. Ng, and E. Zwart. Extracting knowledge from evaluative text. In Proceedings of Third Intl. Conf. on Knowledge Capture (K-CAP-05), 2005.

Ding, X. and B. Liu. Resolving object and attribute coreference in opinion mining. In Proceedings of International Conference on Computational Linguistics (COLING-2010), 2010.

Reforgiato Recupero, D., Presutti, V., Consoli, S., Gangemi, A., & Nuzzolese, A.G. (2014). Sentilo: Frame-based Sentiment Analysis. Cognitive Computation, 7(2):211-225.

Gangemi, A., Presutti, V., & Reforgiato Recupero, D. (2014). Frame-Based Detection of Opinion Holders and Topics: A Model and a Tool. IEEE Comp. Int. Mag. 9(1): 20-30.

Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08). ACM, New York, NY, USA, 219-230.

R. Mihalcea, C. Banea, and J. Wiebe, “Learning multilingual subjective language via cross-lingual projections,” in Proceedings of the Association for Computational Linguistics (ACL), pp. 976–983, Prague, Czech Republic, June 2007.

Mihalcea, R. & Liu, H. (2006). A Corpus-based Approach to Finding Happiness. In Proc. AAAI Spring Symposium CAAW. http://www.cse.unt.edu/~rada/papers/mihalcea.aaaiss06.pdf

Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 309-319.

Popescu, A. and O. Etzioni. Extracting product features and opinions from reviews. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2005), 2005.

Qiu, G., B. Liu, J. Bu, and C. Chen. Expanding domain sentiment lexicon through double propagation. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI-2009), 2009.

Qiu, G., B. Liu, J. Bu, and C. Chen. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 2011.

E. Riloff and J. Wiebe, “Learning extraction patterns for subjective expressions,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.

Saif, H., Fernandez, M., He, Y. and Alani, H. (2013) Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold, Workshop: Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM) at AI*IA Conference, Turin, Italy.

Saif, H., Fernandez, M., He, Y. and Alani, H. (2014) SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twitter, 11th Extended Semantic Web Conference, Crete, Greece.

Tan, C., Lee, L., Tang, J., Jiang, L., Zhou, M., & Li, P. (2011). User-level sentiment analysis incorporating social networks. In Proc. 17th SIGKDD Conference (1397-1405). San Diego, CA: ACM Digital Library.

Thelwall, M. (2013). Heart and Soul: Sentiment Strength Detection in the Social Web with Sentistrength. In J. Holyst (Ed.), Cyberemotions (pp. 1–14). http://sentistrength.wlv.ac.uk/documentation/SentiStrengthChapter.pdf

J. M. Wiebe, T. Wilson, R. Bruce, M. Bell, and M. Martin, “Learning subjective language,” Computational Linguistics, vol. 30, pp. 277–308, September 2004.

H. Yu and V. Hatzivassiloglou, “Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.

Lecture 3

References

• Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired 16.07. Available at http://edge.org/3rd_culture/anderson08/anderson08_index.html

• Berendt, B. (2015). Big Capta, Bad Science? On two recent books on “Big Data” and its revolutionary potential. http://people.cs.kuleuven.be/~bettina.berendt/Reviews/BigData.pdf

• Berendt, B., Büchler, M., & Rockwell, G. (2015). Is it research or is it spying? Thinking-through ethics in Big Data AI and other knowledge sciences. Künstliche Intelligenz, 29(2), 223-232.

• boyd, d. & Crawford, K. (2012). Critical questions for Big Data. Information, Communication & Society, 15:5, 662-679, DOI: 10.1080/1369118X.2012.678878.

• De Wolf, R., Vanderhoven, E., Berendt, B., Pierson, J. & Schellens, T. (submitted). Self-reflection in privacy research on social network sites.

• Kitchin, R. (2014a). The Data Revolution. Big Data, Open Data, Data Infrastructures & Their Consequences. London: Sage.

• Kitchin, R. (2014b). Big Data, new epistemologies and paradigm shifts. Big Data & Society, April-June 2014,1-12.

• Kramer, A., Guillory, J., & Hancock, J. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111, 8788-8790. http://www.pnas.org/content/111/24/8788.full.pdf+html

• Moretti, F. (2005). Graphs, Maps, Trees. Abstract Models for Literary History. p.30 London: Verso (cited from the paperback published in 2007)

• Pauen, M. & Welzer, H. (2015). Autonomie: Eine Verteidigung [Autonomy: A Defence], Frankfurt am Main: S. Fischer Verlag

• Tufekci, Z. (2014). What Happens to #Ferguson Affects Ferguson: Net Neutrality, Algorithmic Filtering and Ferguson. https://medium.com/message/ferguson-is-also-a-net-neutrality-issue-6d2f3db51eb0