Count newly inserted or updated rows in Pentaho Data Integration

I am new to Pentaho Data Integration; I need to integrate one database into another as an ETL job. I want to count the number of inserts / updates during the ETL job and write this count into another table. Can someone help me with this?

+5
3 answers

I do not think PDI has a built-in function today to return the number of rows affected by the Insert / Update step.

However, most database vendors do provide a way to get the number of rows affected by a given operation.

In PostgreSQL, for example, it could look like this:

/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ... VALUES ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ... SET ... WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;

However, since you are trying to do this from a PDI job, I suggest restructuring it so that the insert / update is done by an SQL script you control.

Suggestion: dump the source data into a file on the target database server, then use that file, possibly with a bulk load function, to do the insert / update, and capture the number of affected rows back in PDI. Note that you may need to use the Execute SQL script step at the job level.

EDIT: the implementation is a matter of design choice, so the proposed solution is one of many. At a very high level, you could do something like the following.

  • Transformation I - retrieve data from the source
    • Retrieve data from the source, be it a database or something else
    • Prepare it for output so that it matches the structure of the target table
    • Save it as a CSV file using the Text file output step
  • Parent job
    • If the PDI server is the same machine as the target database server:
      • Use the Execute SQL script step to:
        • Read the data from the file and do the INSERT / UPDATE
        • Record the number of affected rows in a table (ideally this table also has a timestamp column for the operation, so you can track things)
    • If the PDI server is NOT the same machine as the target database server:
      • Upload the source data file to that server, e.g. with the FTP / SFTP file upload steps
      • Use the Execute SQL script step to:
        • Read the data from the file and do the INSERT / UPDATE
        • Record the number of affected rows in the table
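The body of that Execute SQL script step could be sketched roughly as below (PostgreSQL syntax; the file path, `staging_table`, `target_table`, and the `etl_row_stats` audit table are all hypothetical placeholders, not anything PDI provides):

```sql
-- Load the CSV dumped by the transformation into a staging table
-- (file path and table names are placeholders)
COPY staging_table FROM '/tmp/source_dump.csv' WITH (FORMAT csv, HEADER true);

-- Insert into the target and record the affected-row count in one statement
WITH inserted_rows AS (
    INSERT INTO target_table
    SELECT * FROM staging_table
    RETURNING 1
)
INSERT INTO etl_row_stats (operation, affected_rows, run_at)
SELECT 'insert', count(*), now() FROM inserted_rows;
```

The same pattern with an UPDATE ... RETURNING CTE covers the update count, and the `run_at` timestamp is the tracking column mentioned above.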

EDIT 2: another proposed solution

As suggested by @user3123116, you can use the Compare Fields step (if it is not part of your environment, check the PDI Marketplace).

The only drawback I see is that you have to query the target database before inserting / updating, which is of course less efficient.

In the end it might look like this (note that this shows only the comparison and counting part): [screenshot: field comparison transformation]

Also note that you can split the source data into two streams (COPY, not DISTRIBUTE) and do your insert / update from one of them, but that stream must wait for the field-comparison stream to finish querying the target database, otherwise you may get incorrect statistics.

+4

The Compare Fields step takes two streams as input for comparison, and its output is four different streams for Identical, Modified, Added, and Deleted records. You can read these four streams and then process the Modified, Added, and Deleted records with an Insert / Update step.
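For intuition, the classification that step performs on the two streams is roughly what this SQL sketch does with a full outer join (`source_table`, `target_table`, and their `id` / `payload` columns are hypothetical examples):

```sql
-- Classify rows the way the Compare Fields step does
-- (table and column names are placeholders)
SELECT
    CASE
        WHEN t.id IS NULL          THEN 'added'      -- only in source stream
        WHEN s.id IS NULL          THEN 'deleted'    -- only in target stream
        WHEN s.payload = t.payload THEN 'identical'
        ELSE                            'modified'
    END AS change_type,
    count(*) AS row_count
FROM source_table s
FULL OUTER JOIN target_table t ON s.id = t.id
GROUP BY 1;
```

This is why the approach needs a query against the target before the insert / update: the target side of the join has to be read first.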

+3

You can do this with the Logging option in the transformation settings. Follow these steps:

  • Click the Edit menu → Settings.
  • Go to the Logging tab.
  • Choose Step from the menu on the left.
  • Provide the Log Connection and the Log table name (say, StepLog).
  • Select the required fields for logging (LINES_OUTPUT for the inserted count and LINES_UPDATED for the updated count).
  • Click the SQL button and create the table by clicking Execute.
  • Now all the steps will be recorded in the log table (StepLog); you can use it for further processing.
  • Enjoy
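Once the transformation has run, the counts can be pulled from that step log table with a simple query (StepLog is the table name chosen above; the transformation name in the WHERE clause is a placeholder):

```sql
-- Read insert / update counts from the PDI step log table configured above
SELECT STEPNAME,
       LINES_OUTPUT  AS inserted_count,
       LINES_UPDATED AS updated_count
FROM StepLog
WHERE TRANSNAME = 'my_etl_transformation'   -- placeholder transformation name
ORDER BY ID_BATCH DESC;
```

A Table input step reading this query can then feed the counts into whatever statistics table you like.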
+2

Source: https://habr.com/ru/post/1234087/
