Bibliography

Weingart, S. (2011).  Demystifying Networks. the scottbot irregular.

In this scholarly blog post, Scott Weingart discusses the basic elements of network visualizations and how they relate to humanities research. Weingart carefully and clearly defines the main components of networks - nodes and edges - as well as the main types of networks - single modal, bimodal, and multimodal. Weingart employs concrete examples and simple visualizations to build a network from the ground up. As the first part in a continuing (now nine-part) series, this article allows Weingart to establish the basic concepts and introduce important questions to be tackled later in the series.
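Weingart's distinction between the components of a network (nodes and edges) and its modality can be sketched in a few lines of code. The following is an illustrative example using the Python networkx library; the node labels are invented and do not come from the post:

```python
import networkx as nx

# Single-modal network: one node type (people), with edges
# representing a single relationship (e.g. correspondence).
G = nx.Graph()
G.add_edge("Alice", "Bob")
G.add_edge("Bob", "Carol")

# Bimodal network: two node types (people and books); edges may
# only connect a node of one type to a node of the other.
B = nx.Graph()
B.add_nodes_from(["Alice", "Bob"], bipartite=0)  # people
B.add_nodes_from(["Book 1"], bipartite=1)        # books
B.add_edge("Alice", "Book 1")
B.add_edge("Bob", "Book 1")

# The two-mode structure can be verified programmatically.
print(nx.is_bipartite(B))
```

Collapsing a bimodal network into a single-modal one (e.g. linking people who share a book) is one of the transformations Weingart's series goes on to discuss.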

Stubbs, M. (1996).  Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture.

In this scholarly monograph, Stubbs asks "how can an analysis of the pattern of words and grammar in a text contribute to an understanding of the meaning of a text?" In order to answer this query, Stubbs draws on British linguistics and computer-assisted methods of text and corpus analysis. Stubbs examines the different agents of textual interpretation (text, producer, recipient) and interrogates who is given the social authority to interpret the meaning of texts. Through this study in computer-assisted text semantics, Stubbs offers his publication as both a pedagogical tool and a site of critical discussion.

Mimno, D. (2012).  The Details: Training and Validating Big Models on Big Data. Journal of Digital Humanities. 2.

In this video, David Mimno explains how topic modelling functions and how to remedy common topic modelling errors, and discusses the MALLET application. The presentation was recorded on 3 November 2012 as part of an NEH-sponsored workshop at the University of Maryland.

Graham, S., & Milligan I. (2012).  Review of MALLET, produced by Andrew Kachites McCallum. Journal of Digital Humanities. 2.

MALLET, or MAchine Learning for LanguagE Toolkit, is an application that performs topic modelling. Graham and Milligan argue that MALLET and topic modelling in general "have wide appeal across the digital humanities." Topic modelling can handle large collections of textual data yet is still able to produce a close analysis of the material. Graham and Milligan assert, however, that MALLET data outputs can be difficult to understand and that organizing results in a network may be more conducive to humanities interpretations. Graham and Milligan note that the biggest issues with MALLET are its steep learning curve and its lack of built-in documentation - both of which raise methodological concerns. To remedy these concerns, Graham and Milligan point to a MALLET-user community that provides support for novice users.

Curzan, A. (2012).  The electronic life of texts: insights from corpus linguistics for all fields of English. (Mair, C., Meyer C., & Oostdijk N., Ed.). Language and Computers. 9–21.

More English literary and nonliterary texts “go electronic” and often online every day, from literary projects like EEBO (Early English Books Online) to linguistics projects like ARCHER (A Representative Corpus of Historical English Registers), from lexicographic projects like the Oxford English Dictionary Online to projects so ambitious they are almost uncategorizable, like Google's digitization of entire university libraries. How should researchers and teachers of English best exploit these new electronic riches? Scholars in English corpus linguistics have been pushing the boundaries and addressing the challenges of working with collections of electronic texts for decades, in ways that can usefully inform all sub-disciplines of English literature and language study. This chapter focuses on the new research opportunities and lines of questioning that electronic text collections open in a variety of fields, on the wisdom gained in corpus linguistics on best practices for working with electronic texts, and on much-needed conversations between scholars in all sub-disciplines of English for how best to build electronic text collections so they can answer the questions we want to ask.

Ramsay, S. (2008).  Algorithmic Criticism. (Schreibman, S., & Siemens R., Ed.). Companion to Digital Literary Studies (Blackwell Companions to Literature and Culture).

Ramsay argues that, while text analysis has a history, algorithmic criticism "exists only in nascent form." While browsing, searching, and disseminating digital texts are accepted practices, transforming texts using algorithms is still relatively unexplored territory. Ramsay discusses how the humanities both fit and rebel against the scientific "frames" of algorithmic criticism. Ramsay argues that in order to "reap the benefits of speed, automation, and scale" available through computational practices, we must "accept the compromises inherent in such transformations." Ramsay continues his essay by performing an algorithmic case study on Virginia Woolf's novel "The Waves" in order to determine the frequent words and themes attached to each of the six main characters. Ramsay asserts that both the action (hermeneutics) and the method are key components to consider in this case. In conclusion, Ramsay argues that "[A]lgorithmic criticism seeks a new kind of audience for text analysis — one that is less concerned with fitness of method and the determination of interpretive boundaries, and one more concerned with evaluating the robustness of the discussion that a particular procedure annunciates."

Rhody, L. M. (2012).  Topic Modeling and Figurative Language. Journal of Digital Humanities. 2.

Rhody addresses the tension and complications of computing figurative language with topic modelling tools. Topic modelling, Rhody argues, fails when handling figurative language because topic modelling is unable to compute and preserve language's many possible meanings. However, Rhody asserts that these apparent "failures" are part of the reason topic modelling works on texts defined by their figurative language: "[S]omewhere between the literary possibility held in a corpus of thousands of English-language poems and the computational rigor of Latent Dirichlet Allocation (LDA), there is an interpretive space." Rhody recounts her experience using topic modelling to explore ekphrastic poetry. Rhody began her research by comparing topics generated by non-figurative texts with the topics generated from a collection of poetry. The result was that the thematic clarity apparent in non-figurative topics did not translate when analyzing a collection of poetry. Rhody asserts that "[T]opic models of poetry do have a form of comprehensibility, but our understanding of coherence between topic keywords needs to be slightly different in models of poetry than in models of non-fiction texts."

Schmidt, B. M. (2012).  Words Alone: Dismantling Topic Models in the Humanities. Journal of Digital Humanities. 2.

Schmidt argues that simplifying topic models of humanities data "creates an enormous potential for groundless — or even misleading — 'insights.'" The potential for erroneous insights comes from humanities scholars accepting assumptions about topic models that are only partially true: that they are coherent and that they are stable. When these assumptions hold, there is great opportunity to understand the words in massive corpora. However, Schmidt argues that topics must be interrogated because their results are often not as coherent as they seem. Schmidt suggests using graphics to represent topic model outputs because visualizations can simplify the data; he mentions Elijah Meeks' work on organizing topic models as word clouds. In conclusion, "Even when humanists understand the mechanics of LDA perfectly", Schmidt suggests that "they will not be able to engage with their fellow scholars about them effectively. That is a high price to pay." Schmidt therefore advocates simpler uses of data in research in order to broaden the potential audience.

Rockwell, G., Sinclair S. G., Ruecker S., & Organisciak P. (2010).  Ubiquitous Text Analysis. The Journal of the Initiative for Digital Humanities, Media, and Culture. 2.

Sinclair, Rockwell, Ruecker, and Organisciak open this article by identifying three main problems with the use of text analysis tools: many tools don’t work well when used together, many tools don’t properly integrate with digital content, and some of the best tools are the hardest to find. The authors argue that for text analysis tools to be successful the text must be privileged. Throughout the article, various tools and their ongoing development are discussed – TAPoR, DigitalTexts, TAToo, and others. Sinclair, Rockwell, Ruecker, and Organisciak assert that capitalizing on the social digital age is one important aspect of text analysis. In conclusion, the authors agree that integration – between tools, between tools and texts, and between tools and users – is fundamental.

Newman, D. J., & Block S. (2006).  Probabilistic topic decomposition of an eighteenth-century American newspaper. Journal of the American Society for Information Science and Technology. 57, 753–767.

This article begins by identifying that, due to the exponential growth in web documents, there is a growing need for systems that characterize, classify, and index this data. In this paper, Newman and Block explore three types of topic decomposition models designed to achieve this organized information retrieval. Throughout the article, Newman and Block use the Pennsylvania Gazette as their case study, and through this data they illustrate three different methodologies and compare their success in identifying and displaying topics from the newspaper. Newman and Block conclude that this style of textual analysis is important because it identifies hidden topics in the text rather than merely collating keywords.

Stewart, G., Crane G., & Babeu A. (2007).  A New Generation of Textual Corpora: Mining Corpora from Very Large Collections. Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. 356–365.

This article considers OCR programs developed for reading classical Greek. The authors work to show how computational correction practices can create text documents with accuracy ratings comparable to "hand-crafted corpora." Three challenges of Greek OCR documents are identified: exclusion of variant readings, exclusion of multiple editions, and inability to draw connections between texts that reference each other. The authors point to work done, individually, in all of these areas by citing different digital archives and versioning softwares. However, these three challenges have not been addressed by a single project. In order to grapple with this, the authors structure a multi-tiered approach to OCRing Greek texts. The authors discovered that simple error-correction techniques based on word lists and a morphological analyzer improve results, and that accuracy increases when multiple editions are included. In closing, the authors gesture towards trajectories of future work, including image quality, comparison errors, and the recognition of accents.

Olmos, R., León J. A., Jorge-Botana G., & Escudero I. (2012).  Using latent semantic analysis to grade brief summaries: A study exploring texts at different academic levels. Literary and Linguistic Computing. fqs065.

This article examines the use of LSA (latent semantic analysis) to evaluate student summaries. The authors argue that summary is a key component of student learning, and that summary writing and LSA share certain similarities that may make them compatible for evaluation. In the study outlined in this article, 786 students were required to write 50-word summaries of either a narrative or expository piece of writing. Both human evaluators and the LSA measure were then used to mark the effectiveness of each summary. The results showed that LSA achieved a reliability of 0.68 for the narrative summaries and 0.82 for the expository summaries. Given these results, the authors concluded that it was too early to determine whether LSA was effective in evaluating student summaries.
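The core operation behind LSA grading, projecting texts into a reduced semantic space and comparing them by cosine similarity, can be sketched in plain NumPy. The toy texts and the choice of k below are invented for illustration and are far simpler than the authors' actual procedure:

```python
import numpy as np

# Toy corpus: a reference passage, an on-topic summary, and an off-topic one.
docs = [
    "the cat sat on the mat and the cat slept",          # reference
    "a cat sat on a mat",                                # on-topic summary
    "stock prices rose today stock prices fell today",   # off-topic summary
]

# Build a term-document count matrix (terms as rows, documents as columns).
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# LSA: truncated SVD of the term-document matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (S[:k, None] * Vt[:k]).T  # one latent vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A summary is scored by its similarity to the reference text.
sim_on_topic = cosine(doc_vecs[0], doc_vecs[1])
sim_off_topic = cosine(doc_vecs[0], doc_vecs[2])
```

The on-topic summary scores far higher than the off-topic one even though neither repeats the reference verbatim, which is the property that makes LSA a plausible proxy for a human grader.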

Vashishtha, H., Smit M., & Stroulia E. (2010).  Moving Text Analysis Tools to the Cloud. 2010 6th World Congress on Services (SERVICES-1). 107–114.

This article investigates the processes, challenges, and rewards of migrating digital humanities tools to alternative platforms. The authors argue that many digital humanities tools are severely limited by the cost, time, scale, and parameters of their current platform. In order to increase a tool's ability to respond accurately and effectively to large data sets, a change in platform must be carried out. This article uses TAPoR and TAPoRware as a case study of this type of transition from web service to cloud computing. Some of the challenges identified are achieving greater functionality, deciding whether to wrap or re-implement services, and various technical barriers. The authors encourage future work in the areas of analysis context, file format support, index scaling, and the development of new and diverse tools.

Shi, L., Wei F., Liu S., Tan L., Lian X., & Zhou M. X. (2010).  Understanding text corpora with multiple facets. 2010 IEEE Symposium on Visual Analytics Science and Technology (VAST). 99–106.

This article responds to the rising "interest in analyzing complex text documents" that are composed of multiple and varied data fields. The authors argue that there is an overwhelming need for a visual analytics tool to support this type of analysis. While existing tools focus on revealing patterns and summarizing content, this new tool is the first to integrate "interactive visualization with a multi-faceted data model of the text corpora for effective visual representation, navigation, and analytics." The authors discuss both Latent Semantic Indexing and Latent Dirichlet Allocation, as well as the various visualization strategies employed to make complex data readable in a graphic representation. To illustrate the effectiveness of the tool, two case studies are provided.

Rhody, L. M. (2012).  Topic Model Data for Topic Modeling and Figurative Language. Journal of Digital Humanities. 2.

This entry contains the data generated from the research project Lisa Rhody discusses in her article "Topic Modeling and Figurative Language," featured in the same journal issue.

Graham, S., Weingart S., & Milligan I. (2012).  Getting Started with Topic Modeling and MALLET. The Programming Historian.

This web publication is devised as a how-to guide for working with MALLET. The goals of the lesson are: to "learn what topic modeling is and why you might want to employ it in your research", to "learn how to install and work with the MALLET" toolkit, and finally to gain "a good idea of how it can be used on a corpus of texts to identify topics found in the documents without reading them individually." Graham, Weingart, and Milligan begin by defining topic modelling as a tool that uncovers word patterns in a corpus of texts. Graham, Weingart, and Milligan make clear that topic modelling programs do not interpret the meaning of vocabulary in a text but are, rather, equipped to "mathematically decompose a text." This online tutorial leads users through a step-by-step guide for installing and using MALLET.
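To give a flavour of what the tutorial walks through, the two core MALLET commands look roughly like the following, run from the MALLET directory; the input path, output file names, and topic count here are placeholders, and the tutorial itself should be consulted for the exact invocations:

```shell
# Import a directory of plain-text files into MALLET's internal format,
# keeping word order and removing common stopwords.
bin/mallet import-dir --input path/to/texts --output corpus.mallet \
    --keep-sequence --remove-stopwords

# Train a topic model, writing the top key words for each topic
# and the topic composition of each document.
bin/mallet train-topics --input corpus.mallet --num-topics 20 \
    --output-topic-keys topic_keys.txt \
    --output-doc-topics topic_composition.txt
```

The resulting `topic_keys.txt` lists the most probable words per topic, which is the output most humanities readers inspect first.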