Beyond the e-Book: The New World of Electronic Reading
Computational reading puts us in touch with an exploratory way of engaging with language, with how we use words and how we arrive at their meanings. It is a deeply human enterprise.
The e-book is in the way. I don’t mean that we should go back to reading print (though it wouldn’t hurt). I mean that our love of books has led us to create digital objects that severely limit what we can do with them. In making electronic books look and act like books—in being guided by a notion of simulation rather than reinvention—we have constrained ourselves from taking advantage of the potential of electronic reading. Long live the book. But the e-book’s days have come and gone.
Most debates about the future of reading have turned around the question of whether or not to go electronic. Books good, Internet bad. Internet free, books bulky. But that debate is over. We have gone electronic, whether we like it or not. What we have not done is take advantage of this shift. We’ve moved backward, not forward in terms of reading.
Consider all the ways that e-books fall short of printed books. They are harder to navigate quickly, and they come with distracting bells and whistles when we want to pay attention. They are visually and tactilely impoverished compared to the history of illustrated and ornate books. They are harder to share than printed books and harder to hold onto—preserving e-books for future generations, well good luck with that. And for all the ways publishers have tried to build annotation tools into e-books, they’ve got nothing on the spatial memory of handwritten marginalia, dog-ears, stickies, and anything else you might want to stick in a book. Books are like containers, full of ideas and memorabilia. E-books are like little gated communities.
Skimming, holding, sharing, annotating, and focusing—these are just some of the many ways that e-books diminish our interactions with books. And yet they remain the default way we have thought about reading in an electronic environment. E-pub or Kindle, it doesn’t really matter. We have fallen for formats that look like books without asking what we can actually do with them. Imagine if we insisted that computers had to keep looking like calculators.
Escaping this rut will require not only a better understanding of history—all the ways reading has functioned in the past that have yet to be adequately re-created in an electronic world—but also a richer imagination of what lies beyond the book, the new textual structures (or infrastructures) that will facilitate our electronic reading other than the bound, contained, and pictorial objects that we have so far made available. Instead of preserving the sanctity of the book, whether in electronic or printed form, we need to think beyond the page and into that all too often derided thing called the data set: curated collections of literary data sold by publishers and made freely available by libraries. This is the future of electronic reading.
To support such a shift, we need to do a better job of bringing into relief the nonbookish things we can do with words and how this will add value to our lives as readers. We need a clearer sense of what reading computationally means beyond the host of names used to describe it today (text mining, distant reading, social network analysis). Thinking about reading in terms of data and computation isn’t about traversing the well-trodden field of open-access debates. It’s about rethinking what we mean by “access.”
In 2009 the University of Toronto English professor Ian Lancashire published a study on Agatha Christie’s late works. Using computational methods, he was able to find that her writing indicated symptoms comparable to what other researchers had found in Alzheimer’s patients, though Christie had never been diagnosed with the disease. The richness of her vocabulary declined, the number of repeated phrases increased, and, most tellingly, the use of indefinite words like “thing,” “something,” and “anything” rose dramatically. It was clear from her writing that some sort of severe cognitive decline had set in before the end of her life.
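To make those three measures concrete, here is a minimal Python sketch. It is not Lancashire's actual procedure: the function names, the three-word phrase length, and the short list of indefinite words (taken only from the examples above) are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: only the three examples named in the text; a real study
# would likely use a longer, carefully justified list.
INDEFINITE_WORDS = {"thing", "something", "anything"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_measures(text, phrase_len=3):
    """Return three rough indicators for one text sample."""
    tokens = tokenize(text)
    if not tokens:
        return {"type_token_ratio": 0.0, "repeated_phrases": 0, "indefinite_rate": 0.0}

    # Vocabulary richness: distinct words divided by total words.
    type_token_ratio = len(set(tokens)) / len(tokens)

    # Repeated phrases: distinct n-word sequences that occur more than once.
    phrases = Counter(
        tuple(tokens[i:i + phrase_len])
        for i in range(len(tokens) - phrase_len + 1)
    )
    repeated_phrases = sum(1 for count in phrases.values() if count > 1)

    # Indefinite words: share of all tokens drawn from the indefinite list.
    indefinite_rate = sum(1 for t in tokens if t in INDEFINITE_WORDS) / len(tokens)

    return {
        "type_token_ratio": type_token_ratio,
        "repeated_phrases": repeated_phrases,
        "indefinite_rate": indefinite_rate,
    }
```

Running `lexical_measures` on samples from an early and a late novel would give numbers to compare. Because the type-token ratio falls as a text gets longer, the samples should be of equal length for that comparison to mean anything.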
Lancashire’s research, which he continues to extend to ever more writers, is important not just for the literary discoveries it has to offer. It also belongs to what Rufus Pollock calls the revolution in small data, in which we use these kinds of tools to better understand ourselves. We are all writers today. Whether it is in the form of email, Facebook, Twitter, texting, or more formal writing like Word documents, we generate a great deal of writing over the course of our lives. It is easy to imagine a screening device that applies the same measures Lancashire used on Christie to our own language habits in order to anticipate warning signs of mental decline. By the time most cognitive diseases are diagnosed, they are often well advanced and thus more difficult to treat. In Lancashire’s work, his identification of the onset of a writer’s decline always predates the official diagnosis. An app that monitors our written speech could give us indications of our mental health. It feels creepy but also potentially useful.
Indeed, such apps already exist. There is a depression tracker called Ginger.io that takes into account not the content of what you say or write but how often you text or email and the length of those missives on average. It also tracks the range of your mobility using GPS. The onset of depression is often marked by increasing social isolation, so communicating and moving around less are very strong indicators that something might be wrong. If you opt in, these devices can communicate with your doctor, providing more diagnostic material and perhaps even prompting you to come in for an appointment. It, too, feels creepy and potentially useful.
Other tools are focusing on populations rather than individuals. Google Flu Trends, which monitors search queries for flu-related terms, has proven remarkably successful in monitoring the spread of outbreaks, something that the CDC has traditionally done using biological sampling, creating a significant time lag between diffusion and diagnosis. Researchers at the University of Vermont, on the other hand, aren’t interested in tracking negatives like dementia, depression, or E. coli. Instead, they have created a tool to monitor online happiness, a hedonometer, that tries to gauge the emotional content of social media. Are there certain geographical regions (countries, states, cities) that are marked by differing degrees of happiness, or different topics that lend themselves to greater or lesser degrees of satisfaction? I recently tried this tool, tracking keywords like “books,” “reading,” and “e-books” as they appeared on Twitter. It turns out that when we Tweet about these topics, our language is considerably happier than the average Tweet. Books, in any form, make us feel good. (Anyone who has listened in on debates about books versus e-books knows that they can make us very angry, too.)
These are just a few of the ways that the study of our writing habits can inform us about our mental as well as physical well-being. And they each involve the complicated trade-offs between sharing personal data and our waning sense of privacy. It’s not all NSA spying, but it is also not all that different, either. The question we have to ask ourselves in each case is not simply whether the exchange is worth it (security versus police surveillance, accurate diagnosis versus the anxiety caused by the constant monitoring of our well-being). We also need to understand in more sophisticated ways how language works and what it is telling us. We need to be better listeners to our linguistic ecologies. The ethics of textual interpretation have never been more pronounced.
This is all well and good, and useful, you might say, but what about literature? Surely such rudimentary tools have nothing to tell us about the lexical and syntactic light and magic shows we call novels and poetry. Just such an argument was recently put forth in the conservative Los Angeles Review of Books, by Stephen Marche. Marche, who spends the first half of his article insulting professors and then the second half imitating one (ca. 1950), invokes the “mystery of language” and the loss of “taste” and “refinement” that comes with treating literature as data. Never has such a spirited defense been given of literary criticism’s efficacy in producing middle-class aristocrats.
Such knowledge embargoes are most often driven by fear. They are invoked in the name of others (but what about the kids!), but they are really about preserving the speaker’s own authority. Only I have access to this higher meaning and, well, you wouldn’t understand anyway.
But literature is data. Unlike the fluidity of sports or economics, translating literary texts into quantifiable units is surprisingly straightforward. Literature is composed of discrete units called words, which are themselves composed of other discrete units called morphemes, phonemes, and letters, all of which appear according to differing degrees of regularity marked off by discrete units called punctuation. As linguists like Harald Baayen and others have shown, our acquisition and interpretation of language is incredibly probabilistic. The distributions and likelihoods of words and word forms are how we construct meaning.
Moving from data to “literature” is not unlike the problem of how mind emerges as a product of the brain. Arriving at complex notions like meaning, plot, or character from the raw material of words is remarkable and complicated, though no more so than how consciousness and feeling happen from a bunch of neurons. (Indeed, these two worlds are deeply intertwined, as meaning is ultimately a matter of how the linguistic networks of texts interact with the neurological networks of the brain.) But to say they have nothing to do with each other is just theology by a different name. Computation gives us new ways of thinking about these relationships just as it gives us new objects to think with. It puts us in touch with the cognitive steps through which we arrive at the meanings of texts and the linguistic techniques through which we weave our stories. As seen through the lens of computation, reading can be thought of as a refrain: the way language repeats itself in successive variations forms the foundation of meaning.
One of the most robust fields in computational literary study to date has been research on authorship attribution—establishing the authorial identity of anonymous or falsely attributed texts. Beyond its usefulness as a tool (did Shakespeare write all his plays?), it brings with it a striking insight about language. The words that are most indicative of a person—the language that most marks us as ourselves—are the family of insignificant words like articles, conjunctions, and prepositions (the, and, but, by, of, etc.). It turns out we are most unique in the way we use highly non-unique words. I find this a beautiful commentary on our humanity.
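The idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the statistical methods that real attribution studies use. The five function words come from the examples above, and the helper names (`profile`, `distance`, `closest_author`) are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny function-word list taken from the examples in the text.
# Real studies typically use many more such words.
FUNCTION_WORDS = ["the", "and", "but", "by", "of"]


def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(p, q):
    """Sum of absolute differences between two profiles (smaller means more alike)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(unknown_text, known_texts):
    """Name of the candidate whose function-word profile is nearest to the unknown text."""
    target = profile(unknown_text)
    return min(
        known_texts,
        key=lambda name: distance(target, profile(known_texts[name])),
    )
```

Here `known_texts` is a dictionary mapping each candidate author's name to a sample of their writing. With samples of realistic length, the profiles are built almost entirely from words no reader would ever notice, which is the point.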
As many commentators have pointed out, reading computationally changes the scale of our reading. In so doing, it can provide access to larger cultural trends. Ted Underwood at the University of Illinois, for example, using a sample collection of over four thousand works of writing from the eighteenth and nineteenth centuries, has shown how novels and poetry begin to distinguish themselves from other kinds of writing at the turn of the nineteenth century through a much higher recourse to older English words, words that would be designated as less learned and more “everyday” (that is, more Germanic and less Latinate). We have long talked about the distinction of literature as a unique social category that emerges in the nineteenth century. Underwood not only shows this distinction in action; he also shows how such distinctiveness was a product of a socially lower kind of language. Literature was special because it was more universal.
In my own work, I’ve been studying the history of the modern novel and its relationship to narratives of conversion. I’m interested in the idea of devotional rather than critical reading, the way we become deeply attached to the books we read (and whether this, too, changes in a computational environment). Looking at a corpus of several hundred works over the course of two centuries, I found that the novel, rather than a genre like autobiography, correlates most strongly with the language and narrative structure of conversion inherited from a classical archetype like Saint Augustine. To my surprise, the novel is the heir to a model of reading where we experience and presumably incorporate a profound sense of personal change.
Computational reading doesn’t have to take place at a broad macroanalytical level (a term coined by Matthew Jockers at the University of Nebraska, one of the leading researchers in the field). It also has much to tell us about the granularity of literary language. In a project on the nature of poets’ corpuses (whether the corpus, as distinct from the individual work, exhibits certain features that change over time), I was putting together a table of words to represent the corpus of Friedrich Hölderlin, one of the great German Romantic poets. When I asked the computer to keep only those words that appeared in 60 percent of the poems, a common step in text analysis, I found zero words. When I asked it to keep only those words that appeared in one-third of the poems, I found just two: “life” and “heart” (Leben and Herz). In other words, across a collection of 253 poems, only two words out of more than eleven thousand appeared in at least one-third of the poems. Those words happened to be two of the most elementary words in the German language.
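The filtering step described above amounts to counting, for each word, how many poems it appears in, and keeping the words that clear a threshold. A minimal Python sketch follows. The function names are hypothetical and the tokenization is a simplifying assumption, not the procedure used in the project.

```python
import re
from collections import Counter


def document_frequencies(poems):
    """Count how many poems each word appears in (not how often it occurs overall)."""
    df = Counter()
    for poem in poems:
        # A set ensures each word is counted at most once per poem.
        df.update(set(re.findall(r"\w+", poem.lower())))
    return df


def words_in_share_of_poems(poems, min_share):
    """Words that appear in at least `min_share` (0 to 1) of the poems."""
    df = document_frequencies(poems)
    cutoff = min_share * len(poems)
    return sorted(word for word, n in df.items() if n >= cutoff)
```

Calling `words_in_share_of_poems(poems, 0.6)` and then lowering the share to one-third mirrors the two steps above. A real pipeline would probably also reduce inflected forms to a common base, so that variants of the same German word count together.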
There is a basic point here: poems are often short, so the odds of words overlapping between them are lower, especially when not dealing with high-probability words like conjunctions, articles, or prepositions. But I was still taken aback by just how sparse Hölderlin’s poetic vocabulary was. No word of significance appears in over half of the poems, and only two appear in one-third of the corpus. You have to go down to 10 percent to get a list of words that exceeds one hundred. There are just 116 words in Hölderlin’s vocabulary that appear in 10 percent of his poetic corpus, or, in other words, about twenty-five poems.
It made me understand in a very visceral way something I had always felt but never been able to articulate: that I often go to read poetry to experience language in its singularity. I know this is a very modern way of thinking about poetry, but it is one that I seek out when I pick up a poem to read. The repetitions within the poem stand contrapuntally to the repetitions of language we experience in our everyday life. Poems mark out clearings in our cluttered linguistic lives.
This survey is just the tip of the iceberg of the numerous approaches to computational reading that exist today. For those who are profoundly skeptical of this practice, my hope is that these examples will indicate both the diversity as well as the creativity behind the process. It puts us in touch with an exploratory way of engaging with language, with how we use words and how we arrive at their meanings. From my perspective, it is a deeply human enterprise.
So what does this mean for readers? How will this kind of reading—for it is surely still reading—change our relationship to written texts? First, it will encourage readers to develop a more architectural relationship to reading. Books, at least since the nineteenth century, have come to us as ready-made objects (“Temples of the Mind,” in Thomas Carlyle’s words). Data sets are extremely amorphous. They have to be assembled, a process which requires a host of imaginative choices. They feel more like grains of sand in your hands than the sturdy walls of a church. But reading in this way will also make us more aware of the importance of context to reading. Books are amazing at closing themselves, and us, off from the rest of the world. As I’ve written elsewhere, books are the ultimate difference engines. Reading computationally puts us in touch with the linguistic environments in which writing originates and circulates. Indeed, it makes this boundary between text and context more fluid. We become entangled with a larger linguistic field, mindful of the circular ways that a text and its environment mutually shape one another.
If reading computationally will change the scale of our reading, allowing us to think more in terms of textual worlds, it will also make us more conscious of the small-scale, the individual units of reading. However paradoxical it may sound, reading computationally makes us more conscious of words. It forces us to consider how those singular things, and their many repetitions, coalesce into more grandiose notions like character, plot, form, and ideas. Learning to move between quantity and quality, between letter and number, will ultimately bring together two ways of thinking that have been diverging for far too long. In my most utopian moments, I imagine a time when literacy and numeracy will be indistinguishable from one another.
I finally realized how much we have shortchanged ourselves when it comes to electronic reading one day not too long ago when an announcement arrived in my in-box. It wasn’t another press release touting the e-book’s so-called “rise.” Such a meteoric ascent could only be followed by a fall. As Schiller said of Albrecht von Wallenstein, one of the great generals of the Thirty Years’ War: “No one could stand where he fell.” The announcement concerned instead the creation of the German Digital Library, one of a host of new national endeavors to create digital libraries as counterparts to the paper-based kind, including the recently established Digital Public Library of America.
These are highly significant ventures. They are revolutionizing our textual heritage in no uncertain terms. More people will have access to more texts, fulfilling the library’s historic mission as an information conduit. But “access” is still understood too narrowly. The German Digital Library is in fact a library of PDFs, really other libraries’ PDFs (it aggregates German libraries’ book scans). The German Digital Library interface presents you with an image of a book that is presented by the Bavarian State Library’s image of a book, underneath which is presumably a book. We have layers of visual representations of books piled on top of one another: image libraries, but not text libraries. Texts aren’t static pictures. They are things that you can do things with, like Augustine who played Virgilian lots with the letters of Paul as part of his conversion to Christianity or the medieval intellectual, Ramon Llull, who devised a spinning wheel to answer all the truths about the universe drawn from scripture.
There are more and more books in the world every day, and yet we can do increasingly less with them. We need to reverse this trend. Google Books made us think in terms of everything: digitizing every book ever, even though hundreds of thousands of books don’t exist anymore. This is too imprecise to tell us anything except in the most general terms, a point that also holds true for the error-laden texts that exist beneath the image scans. What we need now are well-curated selections of texts according to both broad and creatively imagined categories. The contemporary literature collection, the history of memoir collection, the collection of books about left-handed detectives. Like books, these collections will cost a not-inconsiderable sum of money to create, preserve, and distribute. But unlike e-books, they will allow for a great deal of personalization. Data is the new marginalia. There is tremendous commercial, social, and creative value lying around in all that data.
Yes, data. The raw material, the clay, the molecules and atoms of texts. Words aren’t everything. But they are too important to bury in a pretty picture.