Textricator is a tool for extracting text from computer-generated PDFs and generating structured data (CSV or JSON). If you have a bunch of PDFs with the same format (or one big, consistently formatted PDF) and you want to extract the data to CSV or JSON, Textricator can help! It can even work on OCR'ed documents! Note that Textricator is not itself an OCR tool and will not work on raster (scanned) documents: you must first process scanned documents with an OCR tool that gives good results for your documents, then run Textricator on the output.
Textricator is developed by Measures for Justice and was announced at the Code for America Summit 2018. We welcome feedback! Create an issue or pull request on GitHub or email us. Please let us know if you are using Textricator, or if you need help using it. If you are having trouble getting it to work with your documents, let us know; we will help if we can!
Textricator is deployed to Maven Central. A tgz binary distribution is included in addition to binary, source, and documentation jars.
Source code, documentation, and examples are available on GitHub.
Measures for Justice has used Textricator to collect thousands of pages of data. The tool doesn't require programming skills; instead, a user describes the structure of the document in a YAML file. Textricator can extract data from PDFs in almost any layout: not just tables, but also complex reports generated by tools like Crystal Reports. You tell Textricator the attributes of the fields you want to collect, and it chomps through the document, collecting them and writing out your records.
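To make the idea of "describe the fields by their attributes, then collect records" more concrete, here is a minimal conceptual sketch in Python. It is not Textricator's implementation and does not use Textricator's YAML format; the fragment attributes (`x`, `font_size`), the field names, and the helper functions are all hypothetical, chosen only to illustrate the general approach.

```python
import csv
import io

# Hypothetical text fragments, roughly as a PDF text extractor might report them.
FRAGMENTS = [
    {"text": "Case 17-001", "x": 50, "font_size": 12},
    {"text": "Jane Doe", "x": 200, "font_size": 10},
    {"text": "Page 1", "x": 300, "font_size": 8},  # matches no field, so it is ignored
    {"text": "Case 17-002", "x": 50, "font_size": 12},
    {"text": "John Roe", "x": 200, "font_size": 10},
]

# Hypothetical field descriptions: each field is identified by fragment attributes.
FIELD_RULES = {
    "case_id": {"x": 50, "font_size": 12},
    "name": {"x": 200, "font_size": 10},
}

# Assumption for this sketch: seeing this field starts a new record.
RECORD_START = "case_id"


def classify(fragment, rules):
    """Return the name of the first field whose attributes all match, else None."""
    for field, attrs in rules.items():
        if all(fragment.get(key) == value for key, value in attrs.items()):
            return field
    return None


def collect_records(fragments, rules, record_start):
    """Walk the fragments in order and group matched fields into records."""
    records = []
    current = None
    for fragment in fragments:
        field = classify(fragment, rules)
        if field is None:
            continue  # not a field we were asked to collect
        if field == record_start:
            if current is not None:
                records.append(current)
            current = {}
        if current is None:
            continue  # skip fields that appear before the first record starts
        current[field] = fragment["text"]
    if current is not None:
        records.append(current)
    return records


def to_csv(records, fieldnames):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = collect_records(FRAGMENTS, FIELD_RULES, RECORD_START)
csv_text = to_csv(records, list(FIELD_RULES))
print(csv_text)
```

In this sketch the rules are plain data rather than code, which mirrors the idea of describing a document declaratively instead of programming against it. See the Textricator documentation and examples for the actual configuration syntax.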
This is Textricator's mascot, Chompy: