Extract Text
The Extract Text action in PROCESIO extracts the text from your PDF documents so that you can use it in the rest of your process.
The result (the extracted text) contains 2 parts:
- text: all the text from the PDF, concatenated into a single string;
- words: each word from the PDF, extracted separately into a list of JSON objects, each with details such as ID, row number, position, font, etc.
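To illustrate how the two parts of the result might be used in a later step, here is a minimal Python sketch. It assumes a hypothetical result shape: the keys id, row, position, font and value are illustrative names chosen for this example, not PROCESIO's documented schema, so check the output of a real instance for the actual field names. The sketch groups the entries in words by row to rebuild the lines of the page.

```python
import json
from collections import defaultdict

# Hypothetical sample of the action's output. The field names ("id", "row",
# "position", "font", "value") are assumptions made for this illustration only.
sample_output = json.loads("""
{
  "text": "Invoice 2024 Total 100",
  "words": [
    {"id": 3, "row": 2, "position": 0, "font": "Arial", "value": "Total"},
    {"id": 1, "row": 1, "position": 0, "font": "Arial", "value": "Invoice"},
    {"id": 4, "row": 2, "position": 1, "font": "Arial", "value": "100"},
    {"id": 2, "row": 1, "position": 1, "font": "Arial", "value": "2024"}
  ]
}
""")


def words_by_row(result):
    """Group word entries by row and join each row into one line of text."""
    rows = defaultdict(list)
    # Sort by row first, then by position within the row.
    for word in sorted(result["words"], key=lambda w: (w["row"], w["position"])):
        rows[word["row"]].append(word["value"])
    return [" ".join(rows[row]) for row in sorted(rows)]


lines = words_by_row(sample_output)
print(lines)
```

The text part is enough when you only need the full content as one string; the words part is the one to use when layout matters, for example to pick out a value from a specific row.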
1. Create a process and give it a name.
2. Drag the Extract Text action to the canvas and link it to the other actions.
3. Create the variables needed for the configuration of the action, and then add them to the configuration panel.
4. Save, Validate and Run the process.
5. The process will ask for an input file (a .pdf file).
6. Click Run.
7. Click Check Instance to view the results.
You will see the extracted text in the Output variable.
Example:
This is the PDF we used for this example:
This is a preview of the extracted text: