Count newly inserted or updated rows in Pentaho Data Integration

I am new to Pentaho Data Integration; I need to integrate one database into another as an ETL job. I want to count the number of inserts / updates during the ETL job and write this count into another table. Can someone help me with this?

+5
3 answers

I do not think PDI has a built-in function today to return the number of rows affected by the Insert / Update step.

However, most database vendors do provide a way to get the number of rows affected by a given operation.

In PostgreSQL, for example, it could look like this:

/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ... VALUES ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ... SET ... WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;

However, since you are trying to do this from a PDI job, I suggest restructuring it so that the insert / update is done by an SQL script you control.

Suggestion: dump the source data into a file on the target database server, then use that file, possibly with a bulk load function, to do the insert / update, and capture the number of affected rows back in PDI. Note that you may need to use the Execute SQL script step at the job level.

EDIT: the implementation is a matter of design choice, so the proposed solution is one of many. At a very high level, you could do something like the following.

  • Transformation I - retrieve data from the source
    • Retrieve data from the source, be it a database or something else
    • Prepare it for output so that it matches the structure of the target table
    • Save it as a CSV file using the Text file output step
  • Parent job
    • If the PDI server is the same machine as the target database server:
      • Use the Execute SQL script step to:
        • Read the data from the file and do the INSERT / UPDATE
        • Record the number of affected rows in a table (ideally this table also has a timestamp column for the operation, so you can track things)
    • If the PDI server is NOT the same machine as the target database server:
      • Upload the source data file to that server, e.g. with the FTP / SFTP file upload steps
      • Use the Execute SQL script step to:
        • Read the data from the file and do the INSERT / UPDATE
        • Record the number of affected rows in the table
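The body of that Execute SQL script step could be sketched roughly as below (PostgreSQL syntax; the file path, `staging_table`, `target_table`, and the `etl_row_stats` audit table are all hypothetical placeholders, not anything PDI provides):

```sql
-- Load the CSV dumped by the transformation into a staging table
-- (file path and table names are placeholders)
COPY staging_table FROM '/tmp/source_dump.csv' WITH (FORMAT csv, HEADER true);

-- Insert into the target and record the affected-row count in one statement
WITH inserted_rows AS (
    INSERT INTO target_table
    SELECT * FROM staging_table
    RETURNING 1
)
INSERT INTO etl_row_stats (operation, affected_rows, run_at)
SELECT 'insert', count(*), now() FROM inserted_rows;
```

The same pattern with an UPDATE ... RETURNING CTE covers the update count, and the `run_at` timestamp is the tracking column mentioned above.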

EDIT 2: another proposed solution

As suggested by @user3123116, you can use the Compare Fields step (if it is not part of your environment, check the PDI Marketplace).

The only drawback I see is that you have to query the target database before inserting / updating, which is of course less efficient.

In the end it might look like this (note that this shows only the comparison and counting part): [screenshot: field comparison transformation]

Also note that you can split the source data into two streams (COPY, not DISTRIBUTE) and do your insert / update from one of them, but that stream must wait for the field-comparison stream to finish querying the target database, otherwise you may get incorrect statistics.

+4

The Compare Fields step takes two streams as input for comparison, and its output is four different streams for Identical, Modified, Added, and Deleted records. You can read these four streams and then process the Modified, Added, and Deleted records with an Insert / Update step.
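For intuition, the classification that step performs on the two streams is roughly what this SQL sketch does with a full outer join (`source_table`, `target_table`, and their `id` / `payload` columns are hypothetical examples):

```sql
-- Classify rows the way the Compare Fields step does
-- (table and column names are placeholders)
SELECT
    CASE
        WHEN t.id IS NULL          THEN 'added'      -- only in source stream
        WHEN s.id IS NULL          THEN 'deleted'    -- only in target stream
        WHEN s.payload = t.payload THEN 'identical'
        ELSE                            'modified'
    END AS change_type,
    count(*) AS row_count
FROM source_table s
FULL OUTER JOIN target_table t ON s.id = t.id
GROUP BY 1;
```

This is why the approach needs a query against the target before the insert / update: the target side of the join has to be read first.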

+3

You can do this with the Logging option in the transformation settings. Follow these steps:

  • Click the Edit menu → Settings.
  • Go to the Logging tab.
  • Choose Step from the menu on the left.
  • Provide the Log Connection and the Log table name (say, StepLog).
  • Select the required fields for logging (LINES_OUTPUT for the inserted count and LINES_UPDATED for the updated count).
  • Click the SQL button and create the table by clicking Execute.
  • Now all the steps will be recorded in the log table (StepLog); you can use it for further processing.
  • Enjoy
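Once the transformation has run, the counts can be pulled from that step log table with a simple query (StepLog is the table name chosen above; the transformation name in the WHERE clause is a placeholder):

```sql
-- Read insert / update counts from the PDI step log table configured above
SELECT STEPNAME,
       LINES_OUTPUT  AS inserted_count,
       LINES_UPDATED AS updated_count
FROM StepLog
WHERE TRANSNAME = 'my_etl_transformation'   -- placeholder transformation name
ORDER BY ID_BATCH DESC;
```

A Table input step reading this query can then feed the counts into whatever statistics table you like.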
+2

Source: https://habr.com/ru/post/1234087/
