Area Mapper

Overview

The Area Mapper action maps values found in specific areas of a document to corresponding keys. You define a configuration that specifies the areas, keys, and filters used for the mapping.

Inputs & Outputs

- \[i] Extracted Words (list\<object>): the output extracted from a PDF using the PDF Extract Text action.
- \[i] Area Configuration (list\<json>): a list of JSON objects containing configuration options for a set of areas.
- \[o] Mapped Areas (list\<object>): a key-value list containing the value maps for each area.

Configuration details

The area configuration JSON structure comprises the following properties:

- `maptokey` (mandatory) (string): the key under which the result appears in the output data, in the `key` field.
- `onpages` (optional) (list\<int>): the pages of the document to search. If not specified or <= 0, all pages are searched.
- `betweenxleft` (int), `betweenxright` (int), `betweenytop` (int), `betweenybottom` (int), `isbelowword` (string), `isaboveword` (string), `isontherightofword` (string), `isontheleftofword` (string), `regexcaseinsensitivepattern` (string), `regexcasesensitivepattern` (string): all optional, but at least two of these fields must be set for text to be extracted from the PDF.
- `maxnoofrowspertarget` (mandatory) (int): how many of the identified rows are treated as one result. The words identified on these rows are concatenated into a single text. For example, if the data in the PDF is written on 2 rows of text and you want both rows mapped to one key result, set this parameter to 2.
- `takefirstelements` (optional) (int): the maximum number of values to output for the `maptokey`. If null or <= 0, all elements are taken. The limit counts identified "instances", not words; each instance can contain one or more lines of text.

The images below present the configuration parameters visually, showing how they are used to extract data from a PDF.

Example 1: Area configuration JSON

```json
[
  {
    "maptokey": "vat number",
    "onpages": [1],
    "isbelowword": "c i f",
    "betweenxleft": 320,
    "betweenxright": 448,
    "betweenytop": 714,
    "betweenybottom": 696,
    "regexcasesensitivepattern": "^(ro)?[0-9]+",
    "maxnoofrowspertarget": 1,
    "takefirstelements": null
  },
  {
    "maptokey": "idfactura",
    "onpages": [1],
    "isbelowword": "numar",
    "betweenxleft": 320,
    "betweenxright": 448,
    "betweenytop": 755,
    "betweenybottom": 735,
    "isontheleftofword": "data ",
    "maxnoofrowspertarget": 1
  },
  {
    "maptokey": "issuedate",
    "onpages": [1],
    "betweenxleft": 417,
    "betweenxright": 505,
    "betweenytop": 755,
    "betweenybottom": 740,
    "maxnoofrowspertarget": 1,
    "regexcasesensitivepattern": "\\b\\d{2}\\ \\d{2}\\ \\d{4}\\b"
  },
  {
    "maptokey": "bank",
    "onpages": [1],
    "isbelowword": "bank ",
    "betweenytop": 43,
    "betweenxleft": 283,
    "betweenxright": 360,
    "betweenybottom": 37,
    "maxnoofrowspertarget": 1
  },
  {
    "maptokey": "bankaccount",
    "onpages": [1],
    "isbelowword": "konto ",
    "betweenytop": 43,
    "betweenxleft": 433,
    "betweenxright": 529,
    "betweenybottom": 37,
    "maxnoofrowspertarget": 1
  }
]
```

Example 2: Mapped areas output JSON

The JSON below is an example of the output produced for the area configuration in Example 1.

```json
[
  { "page": 1, "key": "vat number", "values": ["12345"] },
  { "page": 1, "key": "idfactura", "values": ["4444444444"] },
  { "page": 1, "key": "issuedate", "values": ["14 09 2022"] },
  { "page": 1, "key": "bank", "values": ["bank name fake"] },
  { "page": 1, "key": "bankaccount", "values": ["ro99 fake 0000 000x xx00 00000"] }
]
```

Important notes

- Ensure that the configuration accurately maps the desired areas to keys.
- Verify the output to confirm that the expected values are correctly mapped.
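To make the two notes above concrete, here is a minimal Python sketch that checks an area configuration against the rules stated on this page before it is passed to the action, and then collects the mapped output by key. It is not part of the Area Mapper action: `validate_area` and `mapped_areas_to_dict` are hypothetical helper names, the field names are spelled exactly as they appear on this page, and the check on `maxnoofrowspertarget` only verifies that an integer is present.

```python
# Sketch only: these helpers are hypothetical and not part of the Area Mapper action.

# Optional filter fields listed under "Configuration details";
# the page states that at least two of them must be set.
FILTER_FIELDS = (
    "betweenxleft", "betweenxright", "betweenytop", "betweenybottom",
    "isbelowword", "isaboveword", "isontherightofword", "isontheleftofword",
    "regexcaseinsensitivepattern", "regexcasesensitivepattern",
)


def validate_area(area):
    """Return a list of problems found in one area configuration object."""
    problems = []
    key = area.get("maptokey")
    if not isinstance(key, str) or not key:
        problems.append("maptokey is mandatory and must be a non-empty string")
    rows = area.get("maxnoofrowspertarget")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(rows, int) or isinstance(rows, bool):
        problems.append("maxnoofrowspertarget is mandatory and must be an integer")
    filled = [f for f in FILTER_FIELDS if area.get(f) not in (None, "")]
    if len(filled) < 2:
        problems.append("at least two filter fields must be set")
    return problems


def mapped_areas_to_dict(mapped_areas):
    """Collect output values per key, merging entries that share a key."""
    result = {}
    for entry in mapped_areas:
        result.setdefault(entry["key"], []).extend(entry["values"])
    return result


# First entry of the area configuration example on this page.
vat_area = {
    "maptokey": "vat number",
    "onpages": [1],
    "isbelowword": "c i f",
    "betweenxleft": 320,
    "betweenxright": 448,
    "betweenytop": 714,
    "betweenybottom": 696,
    "regexcasesensitivepattern": "^(ro)?[0-9]+",
    "maxnoofrowspertarget": 1,
    "takefirstelements": None,
}

# Two entries in the shape of the mapped areas output example.
sample_output = [
    {"page": 1, "key": "vat number", "values": ["12345"]},
    {"page": 1, "key": "idfactura", "values": ["4444444444"]},
]

if __name__ == "__main__":
    print(validate_area(vat_area))
    print(validate_area({"maptokey": "bank", "isbelowword": "bank "}))
    print(mapped_areas_to_dict(sample_output))
```

Running the validation before the action is called surfaces a missing mandatory field or an under-specified area early, and the flattened dictionary makes it easier to compare the mapped values against what you expect from the document.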
