OCR Report
Publication date
2021-02
Authors
Abstract
Many social science researchers must work with textual data that is available only on paper or in poorly scanned PDF files. Obtaining results good enough for further automated text post-processing requires knowledge of image processing techniques and optical character recognition (OCR) software. Based on sample scans from researchers at the Collaborative Research Center “Global Dynamics of Social Policy” (SFB 1342), we compare the results of several open-source and commercial OCR tools. We evaluate each tool’s performance across three tasks: extracting plain text, recognizing text style and structure (hOCR), and extracting tables, where we consider not only how accurately the data in each cell is retrieved but also how well the table layout is captured. In this report, we summarize our findings and give recommendations to consider when planning OCR projects.
Keywords
optical character recognition; computational social sciences; software tools
Institution
Document type
Report
Series
Volume
7
Secondary publication
No
Language
English
Files
Name
WeSIS_Technical_Papers_No 07_Skitalinskaya et al.pdf
Size
1.59 MB
Format
Adobe PDF
Checksum
(MD5): 87a7ac9c5fb28f92dbb8c7d9114ff68c