Thursday 26 June 2014

Deleting duplicate records from Flat File Target

In earlier post we have seen how can we delete the records from relational target
http://gyaankatta.blogspot.in/2014/06/deleting-duplicate-records-from.html

However, deleting a record from Flat File target is not that simple.

Why -- If we do not select option - Append If Exists - at Flat File target, informatica will delete the contents of the flat file at start of the mapping itself. So, if we do the implementation as we seen in our earlier post, it will read and insert 0 records and ultimately at file we will have 0 records.

Also, if we select the option - Append If Exits - it will just append all incoming records, instead of updation or deletion.

Apart from that, if you try to use Update Strategy with DD_Delete, still it won't work on Flat File as a target and all records will always get inserted.

How -- To achieve the desired output, we will read the content of the file in some dummy file and again process the dummy file for deleting the duplicate records.

Source Records before running the workflow

[oracle@gogate Informatica]$ cat emp.txt
id,name
1,mandar
2,sudhir
3,ashish
4,sakshi
3,ashish
3,ashish
3,ashish
3,ashish
3,ashish
3,ashish
3,ashish
2,sudhir
3,ashish
4,sakshi
1,mandar


1. Using command task, we will first create a dummy file of an actual source file from where we have to delete the duplicate records.


 Below unix command  we will specify at command task to create a dummy file at same directory where the original file resides
less /u01/app/oracle/Informatica/emp.txt > /u01/app/oracle/Informatica/emp_bkp.txt
2. We will create a second session, which will read dummy file instead of an original file.
     a. Here, to duplicate record removal logic will be different that of relational table. As in a flat file, all records will get deleted first place [at start of a session], we just have to insert the unique records.
   
     b. To eliminate the duplicate records from target we have number of ways as we discussed in earlier post. However in this case we will go with Aggregator Transformation, and will select Group By ports for both columns.

     c. Aggregator Transformation itself will remove all duplicate rows and we will directly connect all incoming ports to Flat File target.

Records after executing the workflow

[oracle@gogate Informatica]$ cat emp.txt
2,sudhir
3,ashish
4,sakshi
1,mandar


Below is the whole mapping and workflow structure..



Deleting duplicate records from Relational Target

This is a typical interview question, however catch here is, its not asking you to eliminate the records but to delete from a table. Elimination of records is just a half way task and it can be possible by number of ways.. some of them are listed below
1. Fetch only distinct records from Source Qualifier itself
2. Use Aggregator transformation and aggregate using required ports for which you want to remove duplicates
3. Use Sorter transformation and select Distinct record property

Scenario -- Here, scenario is slightly different, as you have been asked to delete the records from the table, its not just elimination.

How -- How will you delete the records from the table then?

1. Pull required table from which you want to delete the duplicate records as an Informatica Source as well as Target.

create table emp
(
id number,
name varchar(20)
);



2. Next task is, you want to find out the duplicate records at first place. So, add a sorter transformation to sort incoming records.


3. Now the task is to tag the duplicate records.. there are number of ways by which you can tag duplicate records here..
    a. Simplest method is add Aggregator Transformation and take the count of the primary key column. If count is >1 means that record is duplicate

    b. Add Lookup Tranformation on Target, and use lookup over-ride which return you the count of incoming primary key. If count is >1 means that record is duplicate
    c. Implement Old Val, Current Val logic using Variable and Out-Put port at Expression and tag each row with its uniqueness.

4. Once you identify the duplicate rows from your source, filter / pass only those rows ahead which are duplicate and required for deletion. To filter such rows, use Filter Transformation

5. Now the main part is to delete the duplicate records, which can only be done by tagging the incoming row with Delete Type, which can be done by Update Strategy
    a. Make sure that "Treat Source Rows As" property at workflow will be "Data Driven", or you can also change that to "Delete" in such case you don't have to add Update Strategy in a mapping
    b. Change the property of Update Strategy to DD_DELETE

6. Connect the primary key port [on the basis of which you have identified the duplicate records] to target table; Also make sure that you have defined the Primary Key at Informatica Target for that column.

Below is the overall mapping structure of this implementation.

After running the workflow, result set is as below


Note -- the only problem with this setup is, it will delete all the duplicate records from target table. If you are expecting to keep 1 record out of available duplicates, then that won't be possible.
 

Sunday 22 June 2014

Parametrizing Username and Password of Relational Connection ..

Why --It's always advisable to parametrize  as much as values you can, as it will be helpful for maintainability and also to accommodate the future changes. However, parametrization of Relational Connection is not straight forward.
If you have relational connection created already with pre-defined Schema and Password at connection itself, your job is quite easy. However, hard coding of password at connection level is not advisable and you will find Infra Architecture always reluctant for it.

How -- When you have relational connection [Source or Target] always parametrize 3 values
1. Connection Name
2. User / Schema Name
3. Password

Lets create a simple mapping - m_ClobToBlob - which will have both source and target as relational table. We will parametrize the source connection and keep target connection hardcoded at workflow level.

We have connection created as ORCL which is pointing to HR schema having password as magogate.

















Let's create a corresponding parameter file ClobToBlob.param for this session with entries as below

[MyWork.s_m_ClobToBlob]
$DBConnection_Source=ORCL
$Param_Passwd=magogate

Specify the parameter file path at session properties as below

Now, once you specify parameter file path , validate the workflow / mapping and run the workflow.

Your workflow will get completed successfully. Just verify if the parameter $DBConnection_Source got initialized with corresponding values or not.

Now, change the ORCL connection at workflow and change Password from hardcoded value to Parameter value as below


Also, edit the corresponding parameter file entries as below and again run the workflow
[MyWork.s_m_ClobToBlob]
$DBConnection_Source=ORCL
$Param_Passwd=1exNvZTHYXEznb8c/ckAUA==

Even though you have specified the password value at your parameter file, surprisingly your workflow will fail giving below error

Severity    Timestamp    Node    Thread    Message Code    Message
ERROR    5/6/2014 9:09:06 AM    node01_gogate    DIRECTOR    TM_6279    The session instance [s_m_ClobToBlob] encountered the following run-time validation error: [Required Password is missing for Connection ORCL.

Verified byte code for the Java transformation. The transformation is valid.
Verified byte code for the Java transformation. The transformation is valid.
].


Now, instead of specifying direct password value in parameter file, you convert your password using below command at Unix and give encrypted value at parameter file against $Param_Passwd .




 Let's run the workflow after doing these changes and you will find that it will get succeed.


Verify the corresponding session log, and you will find that the password reflected in session log is encrypted and same as what we have specified in parameter file. So informatica internally converts encrypted password to the original one.

Also, if you parametrize the password you always have to specify encrypted password in parameter file.