Keywords: text analysis, stylometry, topic modeling

Goals

This tutorial will introduce the basic components of natural language processing and give participants the tools to apply these techniques to their own data. We focus on explaining the why behind each component of the natural language pipeline, not only the how. We will also cover how to work with non-English languages.

Description and Outline

The tutorial will be based on a similar one given as a workshop at the Digital Humanities 2016 conference, itself based on the presenters’ text Humanities Data in R (Arnold and Tilton 2015). The tutorial uses the instructor’s package cleanNLP (Arnold 2016). Using a collection of short stories as a running example, we will introduce the elements of the natural language pipeline:

tokenization (Manning et al. 2014)
lemmatization
named entity recognition (Toutanova and Manning 2000)
part of speech tagging (Toutanova et al. 2003)
dependencies (De Marneffe et al. 2014)
coreference resolution (Lee et al. 2011)
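
As a rough illustration of the first step in the list above, the sketch below tokenizes a short text into one row per token, which is the kind of tidy, table-shaped output the cleanNLP title refers to. It is written in plain Python for illustration only; the tutorial itself works in R with cleanNLP. The function name `tokenize`, the column names, and the regular expressions are assumptions made for this example and are far simpler than the Stanford CoreNLP tokenizer (Manning et al. 2014).

```python
import re

def tokenize(doc_id, text):
    """Return one row (a dict) per token: document id, sentence id, token id, token."""
    rows = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sid, sentence in enumerate(sentences, start=1):
        # A token is a word (optionally with internal apostrophes)
        # or a single punctuation character.
        tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)
        for tid, token in enumerate(tokens, start=1):
            rows.append({"doc": doc_id, "sid": sid, "tid": tid, "token": token})
    return rows

# Hypothetical two-sentence "document" used only to exercise the function.
rows = tokenize("story1", "The cat sat. It didn't move!")
for row in rows:
    print(row["doc"], row["sid"], row["tid"], row["token"])
```

Each later stage of the pipeline (lemmas, part of speech tags, entities) can then be thought of as adding columns to, or joining against, this token table.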

In the final hour of the tutorial, these features will be applied to the following application areas:

stylometric analysis (Tweedie, Singh, and Holmes 1996)
document clustering (Steinbach et al. 2000)
topic detection
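
To make the first two application areas concrete, here is a minimal sketch of one common stylometric idea: describe each document by the relative frequency of a few function words, then compare documents by cosine similarity, which is also a typical input to document clustering. This is not the neural network method of Tweedie, Singh, and Holmes (1996), and it is written in plain Python rather than the R used in the tutorial. The word list, the example sentences, and the function names `profile` and `cosine` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real studies use many more words.
FUNCTION_WORDS = ["the", "of", "and", "to", "upon", "by"]

def profile(text):
    """Relative frequency of each function word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up sentences standing in for documents.
docs = {
    "a": "The power of the state depends upon the consent of the people.",
    "b": "Upon the whole, the strength of the union rests upon the law.",
    "c": "To err and to forgive and to forget is to live by grace.",
}
profiles = {name: profile(text) for name, text in docs.items()}

sim_ab = cosine(profiles["a"], profiles["b"])
sim_ac = cosine(profiles["a"], profiles["c"])
print("a vs b:", sim_ab)
print("a vs c:", sim_ac)
```

In this toy data, documents a and b share their function words and score much closer to each other than a does to c. A clustering method would group documents using exactly this kind of pairwise similarity.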

Our focus will be on a high-level, conceptual understanding of these techniques and on the potential benefits of using them over the models commonly employed for text analysis, in both exploratory and predictive settings.

Prerequisites

The tutorial will be accessible to those new to text analysis and unfamiliar with natural language processing. We expect participants to have a basic working knowledge of base R functions.

Instructor biographies

The instructors have a considerable history of successfully blending humanities scholarship and computational methods. Their NEH- and ACLS-funded project Photogrammar applied image, mapping, and textual analysis to the study of Depression-era photography, and Participatory Media is producing a digital platform for the curation of community film from the 1960s through the present day. They wrote the book Humanities Data in R (Arnold and Tilton 2015), which explores four core analytical areas applicable to data analysis in the humanities: networks, text, geospatial data, and images. They have also written research articles addressing the power of working at the intersection of statistics and the humanities. Full biographies can be found at their respective websites: Taylor Arnold and Lauren Tilton.

References

Arnold, Taylor. 2016. “cleanNLP: A Tidy Data Model for Natural Language Processing.”

Arnold, Taylor, and Lauren Tilton. 2015. Humanities Data in R. Springer.

De Marneffe, Marie-Catherine, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D Manning. 2014. “Universal Stanford Dependencies: A Cross-Linguistic Typology.” In LREC, 14:4585–92.

Lee, Heeyoung, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. “Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task.” In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, 28–34. Association for Computational Linguistics.

Manning, Christopher D, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. “The Stanford CoreNLP Natural Language Processing Toolkit.” In ACL (System Demonstrations), 55–60.

Steinbach, Michael, George Karypis, Vipin Kumar, and others. 2000. “A Comparison of Document Clustering Techniques.” In KDD Workshop on Text Mining, 400:525–26. 1. Boston.

Toutanova, Kristina, and Christopher D Manning. 2000. “Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger.” In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13, 63–70. Association for Computational Linguistics.

Toutanova, Kristina, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network.” In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 173–80. Association for Computational Linguistics.

Tweedie, Fiona J, Sameer Singh, and David I Holmes. 1996. “Neural Network Applications in Stylometry: The Federalist Papers.” Computers and the Humanities 30 (1). Springer: 1–10.

Introduction to Natural Language Processing with R

Taylor Arnold¹ and Lauren Tilton²

1. Department of Mathematics and Computer Science, University of Richmond
2. Digital Humanities, University of Richmond

