Duplicate identification will allow you to automate the way you work with lists that contain duplicates. In the following you will see an example in the form of a scenario.
As a PROCESIO user, I want to have an action that identifies duplicates in a List (the list must contain data models). The action will help me obtain a new list containing only the unique values.
The example requires a process with just the Duplicate identification action with the following configuration.
Step 1. Create a new process and give it a name.
Step 2. Drag the Duplicate identification action to the canvas and link it to the other actions.
Step 3. We will use in our example the employeelist data model.
Step 4. Add the needed variables.
We will use the data type created in the previous step to create the input variable <%input_list%> and the output variables <%duplicates%> and <%unique%>. The <%correlation%> output variable will use the preconfigured correlation data type that you can find in the Type dropdown.
Step 5. Click the Duplicate list action to access its configurations and click the button Configure.
- add the variables like in the following screenshot
- we used the following JSon configuration for our example. You can change it to better suit your needs:
Full/Complex JSon configuration explanation:
- FieldName - is the attribute in the data model (the header in the excel).
- IdField and FieldMatches - are used because the Duplicate action works with complex lists ( lists of type "data model" ).
- MatchType - pattern matching is made by using several algorithms, the values accepted (you can use several in a combined manner) are:
- Exact - an exact match needs to be validated (=). The Exact match type uses all configurations.
- Similar - it compares 2 elements to see if:
- special characters do not exist, E.g. John-Doe and John Doe are a match (before comparing, it replaces special characters and space characters with string.empty).
- phone numbers E.g. 376-323-1111 and 323-1111 are a match (if after replacing special characters and spaces with string.empty the string contains only numbers, a Field1 contains Field2 OR Field2 contains Field1 assessment is made).
- regardless of whether WWW, HTTP(s)://, exists to enhance domain or URL comparison.
- an email is written with @ or " at " or "[at]" (should be a match).
- Fuzzy - this uses the algorithms available in StringSimilarity Action to compare one string against the other (read this for more details: Actions | String similarity ).
- Contains - this checks if Field1 contains Field2 OR Field2 contains Field1.
- Soundex - this checks if 2 words are similar if you would speak them. See implementation example. This matching algorithm evaluates the distance in “sounding” on a scale from 0 to 4, where 4 means the most similar “sounding” and 0 means that the words are very different.
- SimilarWordMatch - This checks if the first N words in a string are the same as the first N words in another string OR if the last N words in a string are the same as the last N words in another string.
- IgnoredTerms is a optional setting that is used to replace the words in this property with string.empty before making the Match assessment.
- FuzzyAlgorithm and FuzzyThresholdproperties are only used when MatchType = Fuzzy. If MatchType is something else, those properties will be ignored so, they can be null or can even not be present.
- SimilarFirstNwords and SimilarLastNwords properties are only used when MatchType = SimilarWordMatch, in which case at least one should be >0. If both are =0 then the SimilarWordMatch will not be executed at all since it does not make sense. If MatchType is something else, those properties will be ignored, so, they can have any value or can even not be present.
- SoundexDistance - This is used only when the MatchType = Soundex and it can be set 0, 1, 2, 3 or 4, where 4 means that the words are very similar when spoken.
- The GeneralIgnoredTerms property should be used to add ignored terms for all Matches. If you set something in this property, it is as if you have settled the same value for IgnoredTerms in all Matches. GeneralIgnoredTerms and IgnoredTerms can be used together, as the terms set in GeneralIgnoredTerms will be added on top of the words set in IgnoredTerms when evaluating the matches. Some examples of usage:
- The AlternateFields setting describes if the data that normally resides on a certain field might be found in another field as well. The consequence of setting alternate fields for a particular field is that when evaluating a match for a certain field, the same match assessment will be executed for the alternate fields as well and the result of all those alternate matches will be evaluated with OR to determine the result of the overall match. For example, when executing the match for Field1 in the JSON example above ("MatchType":"Exact"), because for Fields1 we have alternate fields ("FieldAlternateNames":["Field2","Field3"]) the same match will be executed for Fields: Fields2 & Fields3 and, to determine if we have a field match for a certain Match setting, the following operation will be executed (pseudo-code):
IMPORTANT: If Fields1 has the alternate fields Field2 & Field3 does not mean that Field2/Field3 has Field1 as an alternate field.
- Sort is a optional setting that describes that after evaluating the matches, the output list should be sorted based on the cumulative rules described here. If no setting is present here, no sorting operation will be performed.
- The matching algorithms work only on Fields that are of type primitive and not on data structures (if the config gives a field that is another JObject/JArray, the Action was not tested in this conditions)
- a Row Match is TRUE only if all matches are TRUE, so the evaluation looks like this (pseudo-code for the JSON config from above):
Step 6. Save and Validate the process.
Step 7. Each time you will run the process you are expected to add the list you wish to check, for this example we used:
Step 8. Click Run and add the list.
Step 9. Click Check instance to see the results in the output section.