How to create multiple output files in Talend based on column from SQL input

I need to create several output files based on the value of a column from a SQL input in Talend Open Studio.

My tMSSQLInput returns about 50,000 rows, one of whose columns is creation_name, with values such as:

Building A, Building B, Building C, ...

Thus, all rows with the value “Building A” should go into an Excel file named “buildingA.xls”, all rows with “Building B” into a file named “buildingB.xls”, and so on.

I'm trying to use tLoop or tForEach along with tIterateToFlow, but I'm not sure I know how to implement it.

Thanks in advance.

+5
3 answers

Gabriele's answer looks pretty good to me.

However, if you find yourself in a situation where you have a huge amount of data across many buildings, to the extent that you could hold any single building's rows in memory but not all of them at once, then I would be inclined to use a slightly different approach.

In this example I use the MySQL database components only because I have a local MySQL database to hand, but everything about this job applies equally to Oracle or MS SQL Server:

Job layout

At the very beginning we open a database connection, using the tMySqlConnection component in this case. The other two database components (tMySqlInput and tMySqlRow) then reuse that shared connection.

We start by capturing the list of buildings in the database using the following query in tMySqlInput:

"SELECT DISTINCT building FROM filesplittest" 

This returns each distinct building.

Then we iterate over all of the buildings, which allows us to hold only the records for the current building in memory at any point for the rest of the job.

Next we use the tMySqlRow component to pull the data for the building of the current iteration using a prepared statement. The example query I use looks like this:

 "SELECT building, foo, bar FROM FileSplitTest WHERE building = ?" 

Then we configure the prepared statement in the advanced settings:

tMySqlRow advanced settings for prepared statement

Here I have specified that the first parameter (Parameter Index = 1) is the building value we extracted earlier, which tFlowToIterate has helpfully put into the globalMap for us, so we retrieve it from there using ((String)globalMap.get("row6.building")) in this case (this is the building column that came through the row6 flow).
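For readers who find plain JDBC easier to follow than the Talend screenshots, this part of the job is roughly equivalent to the sketch below. This is not Talend's generated code: the connection URL and credentials are placeholders, and only the table and column names (FileSplitTest, building, foo, bar) come from the example above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PerBuildingQuery {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; in the Talend job this is the shared tMySqlConnection.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/test", "user", "password");
                 // The prepared statement run once per building (the tMySqlRow step).
                 PreparedStatement perBuilding = conn.prepareStatement(
                     "SELECT building, foo, bar FROM FileSplitTest WHERE building = ?")) {

                // The distinct buildings (what tMySqlInput feeds into tFlowToIterate).
                try (Statement distinct = conn.createStatement();
                     ResultSet buildings = distinct.executeQuery(
                         "SELECT DISTINCT building FROM FileSplitTest")) {

                    while (buildings.next()) {
                        String building = buildings.getString("building");

                        perBuilding.setString(1, building); // Parameter Index = 1 in the advanced settings
                        try (ResultSet rows = perBuilding.executeQuery()) {
                            while (rows.next()) {
                                // ... hand each row on to the per-building output step ...
                            }
                        }
                    }
                }
            }
        }
    }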

When using a prepared statement you need to retrieve the data as a record set object, so you will want to set the tMySqlRow schema up like this:

tMySqlRow schema

We then parse it using the tParseRecordSet component:

tParseRecordSet component

With a schema matching this example:

tParseRecordSet schema

Then we need to iterate over this data set, appending it to the appropriate CSV. To do this we use another tFlowToIterate component and take a slightly annoying detour through a tFixedFlowInput component to read the record data back out of the globalMap before passing it to tFileOutputDelimited:

tFixedFlowInput configuration to read data in from the globalMap

Finally, we append it to a CSV named after the building:

tFileOutputDelimited append and dynamic file name from globalMap

Please note that the append flag is set, otherwise each iteration of the job would overwrite the previous one. We also name the file after the value in the building column.
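Outside of Talend, the append step boils down to opening a writer in append mode on a file whose name is derived from the building value. A minimal sketch, assuming a semicolon-delimited CSV and a file name taken directly from the building column (both assumptions, not settings from the job above):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class AppendToBuildingCsv {
        // Appends one delimited row to a CSV named after the building, e.g. "Building A.csv".
        static void appendRow(String building, String foo, String bar) throws IOException {
            // The second FileWriter argument (true) is the append flag; without it each
            // iteration would overwrite the file, just like tFileOutputDelimited without "Append".
            try (PrintWriter out = new PrintWriter(new FileWriter(building + ".csv", true))) {
                out.println(String.join(";", building, foo, bar));
            }
        }

        public static void main(String[] args) throws IOException {
            appendRow("Building A", "foo1", "bar1");
            appendRow("Building B", "foo2", "bar2");
        }
    }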


As Gabriele said: if your data comfortably fits in memory all at once, you can simplify the job by simply reading your data into a tHashOutput component and then filtering the in-memory hash:

Simplified job layout with hash and keeping everything in memory

We start by reading all the data into a tHashOutput component, which then holds the data in memory for the rest of the job. Talend hides these components by default for some odd reason, but you can enable them again under Project Properties → Designer → Palette settings:

How to re-enable the tHash components

Next, we read the data back out of the hash using a tHashInput component (linked to the previous tHashOutput component; do not forget to give the tHashInput component the same schema), and then use a tAggregateRow component, grouping on the building column, to effectively get the distinct building values:

tAggregateRow settings

Then we iterate over the distinct building values using tFlowToIterate, and filter the hash (which is being read a second time) on the building value of the current iteration:

tFilterRow configuration

And finally, we once again append to a file named after the value in the building column.

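If everything really does fit comfortably in memory, the whole tHashOutput / tHashInput / tAggregateRow / tFilterRow combination collapses to a simple group-by in code. A rough sketch of that idea (the Row record, its field names, and the sample data are invented for illustration):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SplitInMemory {
        // Invented stand-in for one row of the input flow.
        record Row(String building, String foo, String bar) {}

        public static void main(String[] args) throws IOException {
            List<Row> allRows = List.of(
                    new Row("Building A", "foo1", "bar1"),
                    new Row("Building B", "foo2", "bar2"),
                    new Row("Building A", "foo3", "bar3"));

            // Roughly what tAggregateRow + tFilterRow achieve: one bucket per distinct building.
            Map<String, List<Row>> byBuilding = allRows.stream()
                    .collect(Collectors.groupingBy(Row::building));

            // One output file per building, named after the building value.
            for (Map.Entry<String, List<Row>> entry : byBuilding.entrySet()) {
                try (PrintWriter out = new PrintWriter(
                        Files.newBufferedWriter(Paths.get(entry.getKey() + ".csv")))) {
                    for (Row r : entry.getValue()) {
                        out.println(String.join(";", r.building(), r.foo(), r.bar()));
                    }
                }
            }
        }
    }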

+7

I think it is better to do this in a job with two subjobs:

  • First you get the list of files to be created.
  • Then you route each row to its file.

I would design a job like this:

 tMSSQL_Input_1------>tCacheOut_1
     |
     | OnSubjobOk
     |
     v
 tCacheIn_1------->tAggregateRow------>tFlowToIterate
                                       /
                                      /  (iterate)
                                     /
   +---------------------------------+
   |
   v
 tCacheIn_1------->tFilterRow-------->tFileOutDelimited

Let me explain what happens.

  • In the first subjob you dump the table into a memory buffer (tCacheOut, available on Talend Exchange, is a good component, but the stock tHashInput / tHashOutput will do the job, too). This is so you query the database only once; if performance is not a concern, you can run multiple queries and avoid using a memory buffer.
  • Then you read the dump a first time to get the distinct buildings (using tAggregateRow on the building column).
  • Then you switch to an iterate flow, storing the current building value in a global variable, say "my_building".
  • Then you read your dump a second time and filter only the rows of the current building. In fact, you can use globalMap.get("my_building") in your filter condition.
  • Finally, you save these rows to the corresponding file, again using globalMap.get("my_building") to parameterize your file name (a plain-Java sketch of this two-pass idea follows below).
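Here is what that two-pass idea looks like in plain Java, just to make the data flow concrete; the cachedRows list stands in for the memory buffer (tCacheOut / tHashOutput), and the Row record with its field names is invented for illustration:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class TwoPassSplit {
        // Invented stand-in for one cached row.
        record Row(String building, String foo, String bar) {}

        public static void main(String[] args) throws IOException {
            // First subjob: the table dumped into the memory buffer.
            List<Row> cachedRows = List.of(
                    new Row("Building A", "foo1", "bar1"),
                    new Row("Building B", "foo2", "bar2"),
                    new Row("Building A", "foo3", "bar3"));

            // First read of the dump: the distinct buildings (tAggregateRow on the building column).
            Set<String> buildings = new LinkedHashSet<>();
            for (Row r : cachedRows) {
                buildings.add(r.building());
            }

            // Iterate over the buildings; "myBuilding" plays the role of the global variable.
            for (String myBuilding : buildings) {
                // Second read of the dump, filtered to the current building (tFilterRow),
                // written to a file parameterized by the building value.
                try (PrintWriter out = new PrintWriter(
                        Files.newBufferedWriter(Paths.get(myBuilding + ".csv")))) {
                    for (Row r : cachedRows) {
                        if (r.building().equals(myBuilding)) {
                            out.println(String.join(";", r.building(), r.foo(), r.bar()));
                        }
                    }
                }
            }
        }
    }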
+3

One way to do this is with a flow like this:

tMySqlInput → tFlowToIterate → tFixedFlowInput → tFileOutputDelimited

In tFlowToIterate you can add your own key, for example FileName, which takes its value from a column in the tMySqlInput schema.

In tFileOutputDelimited you can then use (String)globalMap.get("FileName") to build the file path, so each row from tMySqlInput ends up in the file that matches its value.
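For example, the File Name field of tFileOutputDelimited could then contain an expression along these lines (the output directory and the .csv extension are placeholders, not part of the original answer):

    "C:/output/" + ((String) globalMap.get("FileName")) + ".csv"

Since this expression is evaluated on every iteration, each row lands in the file matching its FileName value; remember to enable the append option so rows for the same building accumulate instead of overwriting each other.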

0

Source: https://habr.com/ru/post/1201891/

