Title: OCR Report
Authors: Skitalinskaya, Gabriella; Düpont, Nils
Dates: 2022-05-19; 2021-02
Handle: https://media.suub.uni-bremen.de/handle/elib/5912
DOI: 10.26092/elib/1517
URN: urn:nbn:de:gbv:46-elib59127
Language: en
License: CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives), https://creativecommons.org/licenses/by-nc-nd/4.0/
Keywords: optical character recognition; computational social sciences; software tools
Classification: 300
Type: Report

Abstract: Many social science researchers face the challenge of dealing with textual data that is available only on paper or in poorly scanned PDF files, and they need knowledge of image processing techniques and optical character recognition (OCR) software to obtain results good enough for further automated text post-processing. Based on sample scans from researchers at the Collaborative Research Center “Global Dynamics of Social Policy” (SFB 1342), we compare the results of several open-source and commercial OCR tools. We evaluate each tool’s performance across three tasks: extracting plain text, recognizing the text style and its structure (hOCR), and extracting tables, focusing not only on the ability to accurately retrieve data from each cell but also on the ability to properly capture the table layout. In this report, we summarize our findings and give recommendations to consider when planning OCR projects.