enowning

Wednesday, August 24, 2016

The Open Commons of Phenomenology has a B&T PDF, and other texts.

I've found with my Gesamtausgabe App that the problem is not getting the texts, but getting high-quality texts. There are a lot of PDFs circulating that contain OCRed scans of books. While the page scans in these PDFs are legible, the actual text content (obtained by OCRing the scans) is of very low quality. Most munge umlauts and Greek, even if they do a decent job with plain text. For a quality digital archive, one where searches find what you're looking for, the texts need to be cleaned up. And that's the medieval aspect of the enterprise. Using just the GA PDFs circulating today (over seventy volumes by my count), it would take a dozen monks or grad students years to clean up all the texts. With my GA App, I'm principally using texts from the publishers' ebooks. They usually have the original text, including authors' errata, although in some cases they still use odd encodings for the Greek, which requires coding a module to translate them to Unicode.
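To give a rough idea of what such a translation module involves, here is a minimal Python sketch. The mapping tables are hypothetical and only loosely follow Beta Code conventions (ASCII letters standing in for Greek letters, with punctuation marks after a letter for breathings and accents). The encodings actually used in any given publisher's ebook will differ, and the function name `legacy_greek_to_unicode` is made up for illustration.

```python
import re
import unicodedata

# Hypothetical table: ASCII stand-ins -> Greek base letters.
# A real module would need the full table for the specific legacy encoding.
BASE = {
    "a": "α", "g": "γ", "e": "ε", "h": "η", "q": "θ",
    "i": "ι", "l": "λ", "o": "ο", "s": "σ",
}

# Hypothetical diacritic markers that follow the letter they modify,
# mapped to Unicode combining characters.
MARKS = {
    ")": "\u0313",  # smooth breathing
    "(": "\u0314",  # rough breathing
    "/": "\u0301",  # acute accent
    "=": "\u0342",  # circumflex (perispomeni)
}


def legacy_greek_to_unicode(text):
    """Translate a span already known to be legacy-encoded Greek.

    Apply this only to spans marked as Greek; running it over ordinary
    Latin-script text would mangle that text.
    """
    out = []
    for ch in text:
        if ch in BASE:
            out.append(BASE[ch])
        elif ch in MARKS:
            out.append(MARKS[ch])
        else:
            out.append(ch)  # leave spaces and unknown characters alone
    # Compose base letters and combining marks into precomposed characters.
    composed = unicodedata.normalize("NFC", "".join(out))
    # Use final sigma when no word character follows.
    return re.sub(r"σ(?!\w)", "ς", composed)


print(legacy_greek_to_unicode("a)lh/qeia"))
print(legacy_greek_to_unicode("lo/gos"))
```

With these tables, `a)lh/qeia` is meant to come out as ἀλήθεια and `lo/gos` as λόγος. Normalizing to NFC matters for search: it makes sure an accented letter is stored the same way no matter how the source encoded it.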

¶ 7:41 PM

Comments:

So does Klostermann offer ebooks of the GA?

# posted by Richard P : 7:59 PM

Klostermann does not. I have an ebook/PDF of the Niemeyer edition of SuZ. I wrote a program to extract the pages from the PDF and write each page as an HTML file. I then added the files to the GA App and indexed them. So searching in the GA App searches the whole of SuZ. But I only have OCR scans of the GA volumes, so I have only added random GA pages that I've cleaned up manually.

# posted by enowning : 8:18 PM
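The page-per-file and indexing steps described in the comment above could look roughly like the following Python sketch. It is not the GA App's actual code. It assumes the page texts have already been pulled out of the PDF (that step needs a PDF library and is left out), and the names `write_pages`, `build_index`, and `search` are invented for illustration.

```python
import html
import re
from collections import defaultdict
from pathlib import Path


def write_pages(pages, out_dir):
    """Write each page's text to its own HTML file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        path = out / f"page_{number:04d}.html"
        body = html.escape(text)  # keep <, >, & from breaking the markup
        path.write_text(
            "<!DOCTYPE html>\n"
            f'<html><head><meta charset="utf-8"><title>Page {number}</title></head>\n'
            f"<body><pre>{body}</pre></body></html>\n",
            encoding="utf-8",
        )
        paths.append(path)
    return paths


def build_index(pages):
    """Build a simple inverted index: word -> set of page numbers."""
    index = defaultdict(set)
    for number, text in enumerate(pages, start=1):
        # casefold() so that searches ignore case
        for word in re.findall(r"\w+", text.casefold()):
            index[word].add(number)
    return index


def search(index, term):
    """Return the sorted page numbers on which the term occurs."""
    return sorted(index.get(term.casefold(), ()))


# Made-up sample pages standing in for extracted text.
pages = ["Das Dasein ist Sorge.", "Zeit und Sein", "Dasein und Zeit"]
index = build_index(pages)
print(search(index, "Dasein"))
```

An index like this is only as good as the text that goes into it, which is why OCR output with munged umlauts or Greek will miss hits that a clean publisher text finds.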
