NovelTM co-investigator Ted Underwood has announced the release of a first version of his automatic page-level genre mapping, which identifies the pages of fiction in the HathiTrust Digital Library. Over the past two years, Underwood has developed a method for accessing digital collections based on genre classification: “a way of starting with a whole library, and dividing it by genre at the page level, using machine learning.” He explains that although digital libraries hold millions of books, it can be difficult to locate works of fiction or poetry by searching on genre, because many older volumes do not have that information attached. To complicate matters further, “volume-level information wouldn’t be enough to guide machine reading in any case, because genres are mixed up inside volumes.” As Underwood puts it, “One of the main obstacles to studying the full sweep of literary history is, quite simply, that it’s hard to find the texts. In a collection that may contain millions of volumes, how do we identify the volumes, and pages, that are actually prose fiction?” Underwood offers an initial solution, along with a collection of 102,349 volumes of prose fiction drawn from the HathiTrust Digital Library, 1700-1922.
Read the entire post here: