
Extract Text




The Extract Text action in PROCESIO extracts the text from your PDF documents, both as a single string and as a list of individual words.



The result (the extracted text) contains two parts:

  • text: all the text from the PDF, concatenated into a single string;
  • words: a list of JSON objects, one per word in the PDF, each with details such as ID, row number, position and font.
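
To make the two parts of the result concrete, here is a minimal Python sketch of how such an output could be consumed once it has been exported as JSON. The key names (`text`, `words`, `id`, `row`, `position`, `font`) and the sample values are assumptions made for illustration; check the actual keys in your own process instance before relying on them.

```python
import json
from collections import defaultdict

# Hypothetical sample of the action's output. The key names and values
# below are assumptions, not the confirmed PROCESIO schema.
sample_output = """
{
  "text": "Invoice 2024 Total 100",
  "words": [
    {"id": 1, "row": 1, "text": "Invoice", "position": {"x": 10, "y": 20}, "font": "Arial"},
    {"id": 2, "row": 1, "text": "2024",    "position": {"x": 80, "y": 20}, "font": "Arial"},
    {"id": 3, "row": 2, "text": "Total",   "position": {"x": 10, "y": 40}, "font": "Arial"},
    {"id": 4, "row": 2, "text": "100",     "position": {"x": 80, "y": 40}, "font": "Arial"}
  ]
}
"""


def lines_from_words(words):
    """Group word objects by row and join them left to right."""
    rows = defaultdict(list)
    for word in words:
        rows[word["row"]].append(word)

    lines = []
    for row in sorted(rows):
        # Order the words of one row by their horizontal position.
        ordered = sorted(rows[row], key=lambda w: w["position"]["x"])
        lines.append(" ".join(w["text"] for w in ordered))
    return lines


result = json.loads(sample_output)
print(result["text"])           # the concatenated string
for line in lines_from_words(result["words"]):
    print(line)                 # one reconstructed line per row
```

The `text` part is enough when you only need the raw content. The `words` part is useful when layout matters, for example to rebuild lines or to locate a value by its row.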

How to configure the Extract Text action?

1. Create a process and give it a name.

2. Drag the Extract Text action to the canvas and link it to the other actions.

Document image


3. Create the variables needed for the configuration of the action, and then add them to the configuration panel:

Document image


4. Save, Validate and Run the process.

5. The process will ask for an input file (a .pdf file).

6. Click Run.

7. Click Check Instance to view the results.

You will see the extracted text in the Output variable.

Document image
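
Because the process expects a .pdf file as input, it can help to check a file locally before uploading it. The sketch below is an optional pre-check written in Python and is not part of PROCESIO; the function name `looks_like_pdf` is hypothetical. It relies on the fact that PDF files start with the header bytes `%PDF-`.

```python
PDF_MAGIC = b"%PDF-"  # header bytes at the start of a PDF file


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header.

    This is a strict check: a file with extra bytes before the header
    is rejected even though some PDF readers would still open it.
    """
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
```

Calling `looks_like_pdf("invoice.pdf")` (with a file name of your own) returns `True` or `False`. A file that has only been renamed to .pdf will fail the check.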


Example:

This is the PDF we used for this example:

Document image


This is a preview of the extracted text:

Document image





Updated 16 Feb 2024