The following text is copyright 2006 by Network World; permission is hereby given for reproduction, as long as attribution is given and this notice is included.

 

A few petabytes, congresscritters and lawyers to go.

 

By Scott Bradner

 

I started playing with digitized literature almost 25 years ago.  A lot has changed in the digital books biz since then.  Some of the history, current status, future possibilities and clashing business models in this area were recently explored in a cover "manifesto" in the New York Times Magazine by Wired writer Kevin Kelly.  Spoiler: it will all come out fine in the end, but the length of time you will have to wait depends on when Congress stops moving the copyright goal posts.

 

In the summer of 1982 a Classics graduate student working in the computer lab I ran in the Harvard Psychology Department got a copy of the Thesaurus Linguae Graecae, a large batch of classical Greek literature that had been typed into computers someplace outside of the US, with David Packard paying the bill.  I, along with people in the Harvard Classics and English departments, convinced the university administration to pay for a huge, for the time, 300 MB disk drive to store this text as well as a collection of Middle English literature.  Over the next few years the graduate student, Greg Crane (http://www.perseus.tufts.edu/About/grc.html), now a professor at Tufts University, put together the first version of what became the Perseus Project (http://www.perseus.tufts.edu/PerseusInfo.html), a web-like mixture of text and clickable links to other material (built many years before the web and search engines showed up).

 

This very well indexed, online text changed what sorts of things would make reasonable PhD thesis topics.  Before Greg's work a student could earn a thesis based on years of index-card investigations into how specific words were used in classical Greek; after his work that became a weekend task.

 

Kelly's Times Magazine article (http://www.nytimes.com/2006/05/14/magazine/14publishing.html) explores what happens in a future where you might have petabytes of digital material being attacked by cutting-edge search engines.  Kelly estimates that a 50 petabyte disk farm could hold all of the 32 million books, 750 million articles and essays, 25 million songs, 500 million images, 500,000 movies, TV shows and short films, and 100 billion public web pages.  Quite a bit of the material is already digitized: new books, DVD movies and CD music, for example.  The article describes multiple projects underway to try to catch up with digitizing older books and discusses the legal and access issues caused by Congress's endless extensions of the copyright term.
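
As a rough sanity check, Kelly's 50 petabyte figure holds up with plausible per-item sizes.  Here is a back-of-envelope sketch in Python; the average item sizes are my own assumptions, not Kelly's, but the total lands in the same ballpark.

# Back-of-envelope check of Kelly's 50 PB estimate.
# The average item sizes below are assumptions of mine, not Kelly's figures.
items = {
    # name: (count, assumed average size in bytes)
    "books":     (32_000_000,      1_000_000),      # ~1 MB of text each
    "articles":  (750_000_000,     100_000),        # ~100 KB each
    "songs":     (25_000_000,      5_000_000),      # ~5 MB of compressed audio
    "images":    (500_000_000,     1_000_000),      # ~1 MB each
    "films":     (500_000,         4_000_000_000),  # ~4 GB of compressed video
    "web pages": (100_000_000_000, 400_000),        # ~400 KB including media
}

total_bytes = sum(count * size for count, size in items.values())
print(f"total: {total_bytes / 1e15:.1f} PB")  # prints "total: 42.7 PB"

With these guesses the web pages dominate, and the total comes out within shouting distance of Kelly's 50 petabytes.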

 

A few years ago in this column I quoted a student who told me "if it is not on the web then it does not exist" ("How big is the world?" http://www.networkworld.com/archive/1999b/1018bradner.html).  The same point was reinforced last week when I suggested that a graduate student see if he could find some information on a particular topic in the library one floor down from my office; he admitted he had been in the library only once or twice, and never to look anything up.

 

Kelly paints a picture where physical libraries might not be needed, other than for books published by companies whose lawyers are not ready to embrace a searchable digital world.  In Kelly's future world, books are no longer individual items but parts of a vast relational database on steroids, where your biggest problem will be figuring out how to ask the question you want answered, and what is left that could make a good thesis topic.  All in all, a very good read.

 

disclaimer:  If physical libraries fade away, Harvard is going to wind up with a lot of prime real estate that will be bitterly fought over, but I did not ask the university library folk for their view of the NYT article, so the above is my own review.