Thursday 26 June 2014

Deleting duplicate records from Relational Target

This is a typical interview question, however catch here is, its not asking you to eliminate the records but to delete from a table. Elimination of records is just a half way task and it can be possible by number of ways.. some of them are listed below
1. Fetch only distinct records from Source Qualifier itself
2. Use Aggregator transformation and aggregate using required ports for which you want to remove duplicates
3. Use Sorter transformation and select Distinct record property

Scenario -- Here, scenario is slightly different, as you have been asked to delete the records from the table, its not just elimination.

How -- How will you delete the records from the table then?

1. Pull required table from which you want to delete the duplicate records as an Informatica Source as well as Target.

create table emp
(
id number,
name varchar(20)
);



2. Next task is, you want to find out the duplicate records at first place. So, add a sorter transformation to sort incoming records.


3. Now the task is to tag the duplicate records.. there are number of ways by which you can tag duplicate records here..
    a. Simplest method is add Aggregator Transformation and take the count of the primary key column. If count is >1 means that record is duplicate

    b. Add Lookup Tranformation on Target, and use lookup over-ride which return you the count of incoming primary key. If count is >1 means that record is duplicate
    c. Implement Old Val, Current Val logic using Variable and Out-Put port at Expression and tag each row with its uniqueness.

4. Once you identify the duplicate rows from your source, filter / pass only those rows ahead which are duplicate and required for deletion. To filter such rows, use Filter Transformation

5. Now the main part is to delete the duplicate records, which can only be done by tagging the incoming row with Delete Type, which can be done by Update Strategy
    a. Make sure that "Treat Source Rows As" property at workflow will be "Data Driven", or you can also change that to "Delete" in such case you don't have to add Update Strategy in a mapping
    b. Change the property of Update Strategy to DD_DELETE

6. Connect the primary key port [on the basis of which you have identified the duplicate records] to target table; Also make sure that you have defined the Primary Key at Informatica Target for that column.

Below is the overall mapping structure of this implementation.

After running the workflow, result set is as below


Note -- the only problem with this setup is, it will delete all the duplicate records from target table. If you are expecting to keep 1 record out of available duplicates, then that won't be possible.
 

No comments:

Post a Comment