Adventures in OCR and PDFs - January 21, 2024

Decided to read Frederick Copleston's A History of Philosophy. It is harder than usual to obtain an ebook version of this text. There is a scanned version on the Internet Archive, but it has a couple of volumes missing.

I managed to find a scan of the first volume but, as expected, it consists of images. The image format is JBIG2, which is common for scanners. Tried to OCR the PDF and extract the text using Tesseract. Tesseract doesn't support the JBIG2 format, so I installed OCRmyPDF (ocrmypdf). Thanks to Chocolatey, this was super simple.
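For reference, here is a rough sketch of what an ocrmypdf run like this can look like when driven from Python. It is not necessarily the exact command I used. The file names are made up, the `eng` language setting is an assumption, and the helper names `build_ocrmypdf_command` and `run_ocr` are just for illustration. The `--sidecar` option asks ocrmypdf to write the recognized text to a separate text file alongside the searchable PDF.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language="eng"):
    """Build the argument list for an ocrmypdf run (hypothetical helper).

    --language selects the Tesseract language pack (assumed "eng" here);
    --sidecar writes the extracted text to a plain text file.
    """
    return [
        "ocrmypdf",
        "--language", language,
        "--sidecar", str(sidecar_txt),
        str(input_pdf),
        str(output_pdf),
    ]


def run_ocr(input_pdf, output_pdf, sidecar_txt, language="eng"):
    """Run ocrmypdf as a subprocess and fail loudly if it is missing or errors."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    command = build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language)
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(command, check=True)


# Example call with made-up file names:
# run_ocr(Path("copleston-vol1.pdf"), Path("copleston-vol1-ocr.pdf"), Path("copleston-vol1.txt"))
```

Keeping the command builder separate from the subprocess call makes it easy to print the command and check it before committing to a long OCR run.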

After 15ish minutes (my sense of time is wack so it could have been much more, or less), I had a PDF that I could highlight and index! I also got a text file with the extracted text. That's a win!