DHQ: Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure
Winter 2009: v3 n1
Of particular interest, I think, for showing how transformational the mass digitisation of text will be, is
"Classics in the Million Book Library" by Gregory Crane and others.
Some excerpts [and look at the reference system!]
2. On October 27, 2008, the number of books available from the
Internet Archive exceeded 1,000,000. While the million books is only a
fraction of the size of the seven million that Google boast, the
million books available from the Internet Archive are freely
downloadable – anyone can analyze them and publish the results. The
collection available from the Internet Archive provides the foundation
for transparent services and, even more important, transparent
7. This paper begins by stressing that we have moved beyond islands of
digital content in a vast sea of print. Where our first generation
collections were autonomous, carefully curated, discipline-specific
islands, we now see emerging a world where we dynamically generate
collections of heterogeneous materials from vast and constantly
expanding digital libraries over which no individual discipline or
project exercises control.
8. Our discussion then moves to the services that humanists need to
exploit very large collections. These include not only advanced
services for information extraction, multilingual technologies, and
visualization but simple access to the scanned page images with which
to support domain-optimized document analysis. These services require
the rise of a new, fourth generation of digital corpora. Our first
digital corpora included accurate transcriptions with markup of
surface features (e.g., we simply indicate that a word is in italics).
A second generation began to add semantic markup (e.g., a phrase is in
italics because it is the title of a work or a Latin quotation). The
third generation created much larger collections by shifting the focus
of manual labor from carefully edited typing to industrial scanning of
page images. We need fourth-generation collections that can seamlessly
integrate image-books, accurate transcriptions, and machine actionable
knowledge in various formats.
9. These fourth generation collections are a qualitatively new
phenomenon. They allow us to design collections that are not only more
comprehensive but more diverse than we could ever produce in print
culture. These collections are unbounded and can include not only
texts but every category of data about their subjects – high
resolution images, three-dimensional models, geographic data sets, and
anything that we can represent in digital form. Even if we restrict
ourselves to linguistic data, fourth generation collections are a
qualitative advance over print: we can include not only images of
neatly printed modern books but non-print representations of language
such as three dimensional models of words engraved on stone and
digital sound recordings.
13. We might summarize the current situation as follows: Google has begun
creating on-line a digital collection that would be more comprehensive
than the greatest university libraries ever produced – and the
university libraries themselves control the resources needed to do the
job were Google to falter: our retrospective collections are being
digitized. The OCA has created a public, scalable infrastructure
whereby we can, in fact, build high quality collections within the
existing library infrastructure: if massive projects miss anything,
smaller efforts can fill in the gaps and create curated collections.
The US government, under a conservative, pro-business administration,
has made the most profitable monopolies on which publishers had
depended illegal and declared open access a condition of its most
generous funding agency: the richest publishers must learn to make
money under open access.
It's all very interesting. It means new PhDs will easily be able
to come up with entirely new approaches.