Lists

DeDuplicate

Description

The DeDuplicate action allows Procesio users to identify duplicates within a simple list based on specified criteria and return a new list containing only the unique values.

Parameters

- Input: list to process (list\<object>).
- Input: duplicate identification JSON configuration. This should be a "Monaco editor" field that accepts JSON.
- Output: duplicates found (list\<object>).
- Output: duplicates correlation (list\<object>). The correlation between each duplicate found and the main item it is a duplicate of.
- Output: deduplicate list (list\<object>). The initial list minus the duplicates.

JSON configuration

The configuration structure should be as follows:

    {
        "matchtype": "exact",
        "fuzzyalgorithm": "levenstein",
        "fuzzythreshold": 0.7,
        "soundexdistance": 4,
        "similarfirstnwords": 1,
        "similarlastnwords": 1,
        "ignoredterms": [
            "ltd",
            "usa"
        ],
        "sort": "asc"
    }

- matchtype: specifies the matching type.
- fuzzyalgorithm: algorithm used for fuzzy matching.
- fuzzythreshold: threshold for fuzzy matching.
- soundexdistance: distance used for soundex matching.
- similarfirstnwords: number of first words to compare.
- similarlastnwords: number of last words to compare.
- ignoredterms: terms to ignore during matching.
- sort: specifies the sorting order.

matchtype types

- exact: the two values must match exactly (=).
- similar: compares two elements after removing special characters. Before comparing, special characters and space characters are replaced with an empty string. For example, "john doe" and "john doe" are a match.
  - Phone numbers: for instance, "376 323 1111" and "323 1111" are a match. If, after replacing special characters and spaces with an empty string, the string contains only numbers, the check becomes "field1 contains field2 or field2 contains field1".
  - Domains and URLs: values match regardless of whether "www" or "http(s)://" is present, which improves domain and URL comparison.
  - Emails: an address written with "@", " at " or "\[at]" should be a match.
- fuzzy: uses the algorithms available in the Similarity Search action to compare one string against the other. Read more details here: https://docs.procesio.com/how-to/similarity-search
- contains: checks if field1 contains field2 or field2 contains field1.
- soundex: checks if two words are similar when spoken. This matching algorithm evaluates the distance in "sounding" on a scale from 0 to 4, where 4 means the most similar "sounding" and 0 means that the words are very different.
- similarwordmatch: checks if the first n words in a string are the same as the first n words in another string, or if the last n words in a string are the same as the last n words in another string.

Properties

- ignoredterms: this parameter is of type list\<string>. The words listed in it are replaced with an empty string before the match assessment is made.
- fuzzyalgorithm and fuzzythreshold: only used when matchtype = fuzzy. For any other matchtype these properties are ignored, so they can be null or absent. fuzzyalgorithm must be one of the algorithms listed at https://docs.procesio.com/how-to/similarity-search, written exactly as it appears there. fuzzythreshold is a double between 0 and 1, where 0 means anything can pass and 1 means that 100% confidence is required. In our experiments, 0.7 was a good threshold to start with.
- soundexdistance: only used when matchtype = soundex. It can be set to 0, 1, 2, 3, or 4, where 4 means the words are very similar when spoken.
- similarfirstnwords and similarlastnwords: only used when matchtype = similarwordmatch, in which case at least one of them should be greater than 0. If both are 0, similarwordmatch is not executed at all, since there is nothing to compare. For any other matchtype these properties are ignored, so they can have any value or be omitted.

Important notes

- Matching algorithms work only on primitive fields.
- Matching is executed row by row, so execution slows down as lists grow.
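To make the "similar" and "similarwordmatch" rules and the three outputs more concrete, here is a minimal Python sketch of how such a row-by-row deduplication could behave on a list of strings. It is not the Procesio implementation: the function names (`normalize`, `similar_match`, `similar_word_match`, `deduplicate`), the `duplicate` / `duplicate_of` keys, the case-insensitive comparison, and the whole-word handling of ignored terms are all assumptions made for illustration. The URL and email rules are left out for brevity.

```python
import re


def normalize(value, ignored_terms=()):
    """Lowercase, drop ignored terms (assumed whole-word), strip special characters and spaces."""
    text = str(value).lower()
    for term in ignored_terms:
        text = re.sub(rf"\b{re.escape(term.lower())}\b", "", text)
    # \W matches anything that is not a letter, digit or underscore
    return re.sub(r"[\W_]", "", text)


def similar_match(a, b, ignored_terms=()):
    """Sketch of the 'similar' rule: equal after normalization, or containment for digit-only values."""
    na, nb = normalize(a, ignored_terms), normalize(b, ignored_terms)
    if na == nb:
        return True
    if na.isdigit() and nb.isdigit():
        return na in nb or nb in na
    return False


def similar_word_match(a, b, first_n=0, last_n=0):
    """Sketch of 'similarwordmatch': same first N words or same last N words."""
    if first_n <= 0 and last_n <= 0:
        return False  # nothing to compare, so the check is skipped
    wa, wb = str(a).lower().split(), str(b).lower().split()
    first_ok = first_n > 0 and wa[:first_n] == wb[:first_n]
    last_ok = last_n > 0 and wa[-last_n:] == wb[-last_n:]
    return first_ok or last_ok


def deduplicate(items, is_match):
    """Row-by-row pass that returns (unique list, duplicates found, correlation)."""
    unique, duplicates, correlation = [], [], []
    for item in items:
        original = next((kept for kept in unique if is_match(kept, item)), None)
        if original is None:
            unique.append(item)
        else:
            duplicates.append(item)
            correlation.append({"duplicate": item, "duplicate_of": original})
    return unique, duplicates, correlation


items = ["Acme Ltd", "acme", "376 323 1111", "323-1111", "Globex"]
unique, duplicates, correlation = deduplicate(
    items, lambda a, b: similar_match(a, b, ignored_terms=["ltd", "usa"])
)
print(unique)      # ['Acme Ltd', '376 323 1111', 'Globex']
print(duplicates)  # ['acme', '323-1111']
```

Because every item is compared against every item already kept, the work grows roughly quadratically with the list size, which is consistent with the note that matching slows down on larger lists.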