Find Duplicates

Overview

Use this action to identify duplicates within a list of complex JSON objects, based on specified criteria, and to return a new list containing only the unique values.

Parameters

Input

List to process: the list containing the objects to be analyzed for duplicates. This input is of type list\<object>, and each object must be of a complex type, such as a data model or a JSON structure.

Duplicate identification JSON configuration: this parameter uses a Monaco editor that accepts JSON input. The structure of the JSON configuration and guidance on its properties are provided below, in the JSON configuration section.

Output

Duplicates found: output list containing the duplicate objects found.

Duplicates correlation: output list correlating duplicates with their original items.

Deduplicate list: output list containing the initial list minus the duplicates.

JSON configuration

The JSON configuration for this action follows a structured format. The example below shows every matchtype with its relevant parameters. The ignoredterms parameter is optional and can be used with any matchtype.

```json
{
  "idfield": "id",
  "fieldmatches": [
    {
      "fieldname": "field1",
      "matchtype": "exact",
      "ignoredterms": ["ltd","usa"]
    },
    {
      "fieldname": "field2",
      "matchtype": "similar",
      "ignoredterms": ["ltd","usa"]
    },
    {
      "fieldname": "field3",
      "matchtype": "fuzzy",
      "ignoredterms": ["ltd","usa"],
      "fuzzyalgorithm": "levenstein",
      "fuzzythreshold": 0.7
    },
    {
      "fieldname": "field4",
      "matchtype": "contains",
      "ignoredterms": ["ltd","usa"]
    },
    {
      "fieldname": "field5",
      "matchtype": "soundex",
      "ignoredterms": ["ltd","usa"],
      "soundexdistance": 4
    },
    {
      "fieldname": "field6",
      "matchtype": "similarwordmatch",
      "ignoredterms": ["ltd","usa"],
      "similarfirstnwords": 1,
      "similarlastnwords": 1
    }
  ],
  "generalignoredterms": ["ltd","usa"],
  "alternatefields": [
    {
      "fieldname": "field1",
      "fieldalternatenames": ["field2","field3"]
    },
    {
      "fieldname": "field3",
      "fieldalternatenames": ["field4","field1","field2"]
    }
  ],
  "sort": [
    {"field": "field1","ordertype": "asc"},
    {"field": "field2","ordertype": "desc"}
  ]
}
```

matchtype types

exact: an exact match needs to be validated (=).

similar: compares two elements while disregarding special characters. Before comparing, special characters and space characters are replaced with an empty string, so two spellings of "john doe" that differ only in special characters or spaces are a match. The following cases are also handled:

Phone numbers: for instance, "376 323 1111" and "323 1111" are a match. If, after replacing special characters and spaces with an empty string, the string contains only digits, the assessment checks whether field1 contains field2 or field2 contains field1.

Domains and URLs: to improve domain and URL comparison, values are matched regardless of whether "www" or "http(s)://" is present.

Emails: if an email is written with "@", " at ", or "\[at]", it should be a match.

fuzzy: uses the algorithms available in the Similarity Search action to compare one string against the other. Read more details here: https://docs.procesio.com/how-to/similarity-search

contains: checks if field1 contains field2 or field2 contains field1.

soundex: checks if two words are similar when spoken. This matching algorithm evaluates the distance in "sounding" on a scale from 0 to 4, where 4 means the most similar sounding and 0 means that the words are very different.

similarwordmatch: checks if the first n words in a string are the same as the first n words in another string, or if the last n words in a string are the same as the last n words in another string.

Properties

ignoredterms: this parameter is of type list\<string>. The words listed in this property are replaced with an empty string before the match assessment is made.

fuzzyalgorithm & fuzzythreshold: only used when matchtype = fuzzy. If matchtype is something else, these properties are ignored, so they can be null or absent.

fuzzyalgorithm must be one of the algorithms listed at https://docs.procesio.com/how-to/similarity-search, written exactly as it appears there.
fuzzythreshold is a double between 0 and 1, where 0 means anything can pass and 1 means the match must have 100% confidence. From our experiments, 0.7 is a good threshold to start with.

soundexdistance: used only when matchtype = soundex. It can be set to 0, 1, 2, 3, or 4, where 4 means the words are very similar when spoken.

similarfirstnwords & similarlastnwords: only used when matchtype = similarwordmatch, in which case at least one of them should be > 0. If both are 0, the similarwordmatch is not executed at all, since it would not make sense. If matchtype is something else, these properties are ignored, so they can have any value or be absent.

generalignoredterms: used to add ignored terms for all matches. Setting a value in this property is equivalent to setting the same value for ignoredterms in every match. generalignoredterms and ignoredterms can be used together: the terms set in generalignoredterms are added on top of the words set in ignoredterms when evaluating the matches.

alternatefields: describes whether the data that normally resides in a certain field might be found in another field as well. When alternate fields are set for a field, the same match assessment is executed for the alternate fields too, and the results of all those alternate matches are combined with a logical OR to determine the result of the overall match.

Important: if field1 has the alternate fields field2 and field3, it does not mean that field2 or field3 has field1 as an alternate field.

sort: describes how the output list should be sorted after the matches are evaluated, based on the cumulative rules listed in this property. If no setting is present, no sorting is performed.

idfield: a field within the data structure that holds a unique value throughout the entire data set (e.g. the row identifier).

Important notes

The matching algorithms work only on fields of a primitive type, not on data structures. If the configuration names a field that is another JObject/JArray, the action will not work and will return an error.

The matching algorithm is executed row by row. For every row in the list, the algorithm is executed against the following rows until the end of the list. If a duplicate is found, it is removed from the list and added to the duplicates list. This means the action executes proportionally slower as the list gets bigger.

A row match is true only if all matches are true.
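The row-by-row procedure described in the notes above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the action's actual implementation: the function names (`strip_ignored`, `field_matches`, `find_duplicates`) are hypothetical, ignored terms are assumed to be removed as whole words and case-insensitively, only the exact and contains match types are covered, alternatefields and sort are left out, and the correlation output is assumed to be a list of (duplicate id, original id) pairs.

```python
import re

def strip_ignored(value, ignored_terms):
    """Replace each ignored term with an empty string (assumed: whole word, case-insensitive)."""
    text = str(value).lower()
    for term in ignored_terms:
        text = re.sub(r"\b" + re.escape(term.lower()) + r"\b", "", text)
    return " ".join(text.split())  # collapse whitespace left behind by removals

def field_matches(a, b, rule, general_ignored):
    """Evaluate one fieldmatches entry; only 'exact' and 'contains' are sketched."""
    # generalignoredterms are added on top of the rule's own ignoredterms
    ignored = list(rule.get("ignoredterms", [])) + list(general_ignored)
    left = strip_ignored(a[rule["fieldname"]], ignored)
    right = strip_ignored(b[rule["fieldname"]], ignored)
    if rule["matchtype"] == "exact":
        return left == right
    if rule["matchtype"] == "contains":
        return left in right or right in left
    raise ValueError("match type not covered by this sketch: " + rule["matchtype"])

def find_duplicates(rows, config):
    """Compare every row with the rows after it; move matching rows to the duplicates list."""
    general = config.get("generalignoredterms", [])
    id_field = config["idfield"]
    remaining = list(rows)
    duplicates, correlation = [], []
    i = 0
    while i < len(remaining):
        original = remaining[i]
        j = i + 1
        while j < len(remaining):
            candidate = remaining[j]
            # a row match is true only if all field matches are true
            if all(field_matches(original, candidate, rule, general)
                   for rule in config["fieldmatches"]):
                duplicates.append(remaining.pop(j))
                correlation.append((candidate[id_field], original[id_field]))
            else:
                j += 1
        i += 1
    return remaining, duplicates, correlation

# Hypothetical sample data, invented for this illustration
rows = [
    {"id": 1, "name": "Acme Ltd", "city": "Boston"},
    {"id": 2, "name": "ACME", "city": "Boston USA"},
    {"id": 3, "name": "Globex", "city": "Boston"},
]
config = {
    "idfield": "id",
    "fieldmatches": [
        {"fieldname": "name", "matchtype": "exact", "ignoredterms": ["ltd"]},
        {"fieldname": "city", "matchtype": "contains"},
    ],
    "generalignoredterms": ["usa"],
}
unique, duplicates, correlation = find_duplicates(rows, config)
print([r["id"] for r in unique])      # [1, 3]
print([r["id"] for r in duplicates])  # [2]
print(correlation)                    # [(2, 1)]
```

In this sketch, row 2 is a duplicate of row 1 because both field matches pass once "ltd" and "usa" are ignored, while row 3 is kept because its name match fails. The nested loop also shows why the runtime grows with the list: every remaining row is compared with all the rows after it.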
