Lists

Find Duplicates

Overview

Use this action to identify duplicates within a list of complex JSON objects, based on specified criteria, and to return a new list containing only the unique items.

Parameters

Input

  • List to process: The list containing objects to be analyzed for duplicates.

This input is of type list<object>. Each object must be a complex type, such as a data model or a JSON structure.

  • Duplicate Identification JSON Configuration: This parameter uses a Monaco editor that accepts JSON input.

The structure of the configuration and guidance on its properties are provided in the JSON Configuration section below.

Output

  • Duplicates found: Output list containing duplicate objects found.
  • Duplicates correlation: Output list correlating duplicates with their original items.
  • Deduplicate list: Output list containing the initial list minus the duplicates.

JSON Configuration

The JSON configuration for this action follows a structured format.

The example below shows every MatchType with its relevant parameters.

The IgnoredTerms parameter is optional and can be used with every MatchType.

JSON

MatchType Types

Exact

The two values must be exactly equal (=).

Similar

  • Compares two elements after normalizing them, so that the following cases are treated as matches:
    • Values that differ only in special characters. For example, "John-Doe" and "John Doe" are a match (before comparing, special characters and space characters are replaced with an empty string).
    • Phone numbers. For instance, "376-323-1111" and "323-1111" are a match (if, after special characters and spaces are replaced with an empty string, the string contains only digits, a "Field1 contains Field2 OR Field2 contains Field1" assessment is made).
    • Domains and URLs, regardless of whether "WWW" or "HTTP(s)://" is present.
    • Emails, whether the address is written with "@", " at ", or "[at]".
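The rules above can be sketched in Python roughly as follows. This is an illustration only, not the action's actual implementation: the function names `normalize` and `is_similar` are hypothetical, and the lowercasing step is an assumption that the source does not state.

```python
import re


def normalize(value: str) -> str:
    """Reduce a value to a bare comparison key (hypothetical helper)."""
    text = value.strip().lower()  # assumption: comparison is case-insensitive
    # Treat " at " and "[at]" as "@" so the email spellings line up.
    text = re.sub(r"\s+at\s+|\[at\]", "@", text)
    # Drop a leading "http://" or "https://", then a leading "www.".
    text = re.sub(r"^https?://", "", text)
    text = re.sub(r"^www\.", "", text)
    # Replace special characters and spaces with an empty string.
    return re.sub(r"[^a-z0-9]", "", text)


def is_similar(field1: str, field2: str) -> bool:
    """Sketch of the Similar match type under the assumptions above."""
    a, b = normalize(field1), normalize(field2)
    if not a or not b:
        return False
    if a.isdigit() and b.isdigit():
        # Digits only (e.g. phone numbers): containment in either direction.
        return a in b or b in a
    return a == b
```

With this sketch, `is_similar("John-Doe", "John Doe")` and `is_similar("376-323-1111", "323-1111")` both return `True`, mirroring the examples in the list above.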

Fuzzy

Uses the algorithms available in the Similarity Search action to compare one string against the other. For details, see the Similarity Search action documentation.

Contains

Checks if Field1 contains Field2 OR Field2 contains Field1.

Soundex

Checks if two words are similar when spoken. This matching algorithm evaluates the distance in "sounding" on a scale from 0 to 4, where 4 means the most similar "sounding" and 0 means that the words are very different.
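To make the 0 to 4 scale concrete, here is a minimal Python sketch. It uses the classic American Soundex code (one letter plus three digits) and counts how many of the four positions agree. The scoring rule, the function names, and the "score must be at least SoundexDistance" comparison are all assumptions; the action's real scoring may differ.

```python
# Letter groups of the classic American Soundex algorithm.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or "" if the word has no letters."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def soundex_similarity(a: str, b: str) -> int:
    """Assumed score: number of matching positions (0 to 4) in the two codes."""
    code_a, code_b = soundex(a), soundex(b)
    if not code_a or not code_b:
        return 0
    return sum(x == y for x, y in zip(code_a, code_b))


def soundex_match(a: str, b: str, soundex_distance: int) -> bool:
    """Assumed rule: match when the score reaches the configured value."""
    return soundex_similarity(a, b) >= soundex_distance
```

For example, "Robert" and "Rupert" both encode to R163, so they score 4 under this sketch.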

SimilarWordMatch

This checks if the first N words in a string are the same as the first N words in another string OR if the last N words in a string are the same as the last N words in another string.
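A rough Python sketch of this check follows. The function name and parameters are hypothetical. The case-insensitive comparison, and the rule that both strings need at least N words, are assumptions not stated in the source.

```python
def similar_word_match(a: str, b: str, first_n: int = 0, last_n: int = 0) -> bool:
    """Sketch of SimilarWordMatch: compare the first N or the last N words."""
    if first_n <= 0 and last_n <= 0:
        # With both values at 0 there is nothing to compare.
        raise ValueError("at least one of first_n / last_n must be > 0")

    words_a = a.lower().split()  # assumption: case-insensitive, whitespace-split
    words_b = b.lower().split()

    def same_slice(n: int, from_start: bool) -> bool:
        if n <= 0 or len(words_a) < n or len(words_b) < n:
            return False
        if from_start:
            return words_a[:n] == words_b[:n]
        return words_a[-n:] == words_b[-n:]

    # The two checks are combined with OR, as described above.
    return same_slice(first_n, True) or same_slice(last_n, False)
```

For instance, "Acme Trading Company Ltd" and "Acme Trading Co" match when `first_n=2`, because both start with "Acme Trading".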

Properties

IgnoredTerms

  • This parameter is of type list<string>. Each word listed in this property is replaced with an empty string in the compared values before the match assessment is made.

FuzzyAlgorithm & FuzzyThreshold

  • Only used when MatchType = Fuzzy. If MatchType is something else, these properties are ignored, so they can be null or absent.
  • FuzzyAlgorithm must be one of the algorithms supported by the Similarity Search action, written exactly as it appears in that action's documentation.
  • FuzzyThreshold is a double between 0 and 1, where 0 means anything can pass and 1 means that the match must have 100% confidence. In our experiments, 0.7 was a good threshold to start with.
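The effect of the threshold can be illustrated with a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library purely as a stand-in scorer; it is not one of the Similarity Search algorithms, and the function name and the lowercasing step are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Sketch: accept the pair when a 0..1 similarity score reaches the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Stand-in score in [0, 1]; the real action uses its own algorithms.
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= threshold
```

With a threshold of 0 every pair passes, and with a threshold of 1 only strings that score as identical pass, which matches the description of FuzzyThreshold above.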

SoundexDistance

  • Used only when MatchType = Soundex. It can be set to 0, 1, 2, 3, or 4, where 4 means the words are very similar when spoken.

SimilarFirstNwords & SimilarLastNwords

  • Only used when MatchType = SimilarWordMatch, in which case at least one of them should be > 0. If both are 0, SimilarWordMatch is not executed at all, since there would be nothing to compare. If MatchType is something else, these properties are ignored, so they can have any value or be absent.

GeneralIgnoredTerms

  • Used to add ignored terms for all Matches. Setting a value in this property has the same effect as setting that value for IgnoredTerms in every Match. GeneralIgnoredTerms and IgnoredTerms can be used together: the terms set in GeneralIgnoredTerms are added on top of the words set in IgnoredTerms when the matches are evaluated.
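How the two term lists combine can be sketched in Python as follows. The function name is hypothetical, and the whole-word, case-insensitive removal is an assumption; the source only says that the words are replaced with an empty string.

```python
import re
from typing import Iterable


def strip_ignored_terms(
    value: str,
    ignored_terms: Iterable[str],
    general_ignored_terms: Iterable[str] = (),
) -> str:
    """Remove per-match and general ignored terms before comparing values."""
    # General terms are added on top of the match-specific ones.
    terms = list(ignored_terms) + list(general_ignored_terms)
    for term in terms:
        # Assumption: remove whole words only, ignoring case.
        pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
        value = re.sub(pattern, "", value, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(value.split())
```

For example, `strip_ignored_terms("The Acme Company Inc.", ["Inc."], ["The", "Company"])` returns `"Acme"`, so "The Acme Company Inc." and "Acme" would then compare as equal under an Exact match.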

AlternateFields

  • This configuration describes whether the data that normally resides in a certain field might also be found in another field. When alternate fields are set for a field, the match assessment for that field is also executed against its alternate fields, and the results of all those matches are combined with a logical OR to determine the overall match.
  • IMPORTANT: If Field1 has the alternate fields Field2 and Field3, it does not mean that Field2 or Field3 has Field1 as an alternate field.
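One possible reading of this behavior is sketched below in Python. All names are hypothetical, and the choice to compare the first row's primary field against the second row's primary and alternate fields is an assumption; the source does not spell out which values are paired.

```python
from typing import Any, Callable, Mapping, Sequence


def match_with_alternates(
    row_a: Mapping[str, Any],
    row_b: Mapping[str, Any],
    field: str,
    alternate_fields: Mapping[str, Sequence[str]],
    compare: Callable[[Any, Any], bool],
) -> bool:
    """Run the same comparison on the field and its alternates, combined with OR."""
    candidates = [field, *alternate_fields.get(field, ())]
    value_a = row_a.get(field)
    return any(compare(value_a, row_b.get(candidate)) for candidate in candidates)


def equal_non_empty(x: Any, y: Any) -> bool:
    """Example comparison: equal and not empty."""
    return bool(x) and x == y


# Alternates are one-directional: Phone looks into Mobile, not the reverse.
alternates = {"Phone": ["Mobile"]}
row_1 = {"Phone": "323-1111", "Mobile": ""}
row_2 = {"Phone": "", "Mobile": "323-1111"}
phone_matches = match_with_alternates(row_1, row_2, "Phone", alternates, equal_non_empty)
mobile_matches = match_with_alternates(row_1, row_2, "Mobile", alternates, equal_non_empty)
```

Here `phone_matches` is `True` because the phone number of the first row is found in the second row's Mobile field, while `mobile_matches` is `False` because Mobile has no alternate fields of its own.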

Sort

  • This configuration specifies that, after the matches are evaluated, the output list is sorted according to the cumulative rules described here. If nothing is set here, no sorting is performed.

IdField

  • A field within the data structure whose value is unique across the entire data set (e.g. the row identifier).

Important Notes

  • The matching algorithms work only on fields of primitive type, not on data structures. If the configuration references a field that is itself a JObject or JArray, the action fails and returns an error.
  • The matching algorithm is executed row by row. Each row in the list is compared against all the rows after it, until the end of the list. If a duplicate is found, it is removed from the list and added to the duplicates list. Because every row is compared against the rows that follow it, the action runs more slowly as the list grows.
  • A row Match is TRUE only if all of its matches are TRUE.
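The row-by-row procedure described in these notes can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name, the `(field, compare)` pairs, and the shape of the correlation entries are hypothetical and do not reflect the action's actual output format.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Dict[str, Any]
Match = Tuple[str, Callable[[Any, Any], bool]]  # (field name, comparison function)


def find_duplicates(
    rows: Sequence[Row], matches: Sequence[Match], id_field: str
) -> Tuple[List[Row], List[Dict[str, Any]], List[Row]]:
    """Return (duplicates found, duplicates correlation, deduplicated list)."""
    if not matches:
        raise ValueError("at least one match is required")
    remaining = list(rows)
    duplicates: List[Row] = []
    correlation: List[Dict[str, Any]] = []
    i = 0
    while i < len(remaining):
        original = remaining[i]
        j = i + 1
        while j < len(remaining):
            candidate = remaining[j]
            # A row match is TRUE only if every configured match is TRUE.
            if all(compare(original[f], candidate[f]) for f, compare in matches):
                duplicates.append(remaining.pop(j))  # remove it from the list
                correlation.append(
                    {"original": original[id_field], "duplicate": candidate[id_field]}
                )
            else:
                j += 1
        i += 1
    return duplicates, correlation, remaining
```

Each row is compared only against the rows that come after it, so the first occurrence is kept as the original and later matching rows are moved to the duplicates list.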
