Copy multiple tables into one table (from multiple databases)

I have several identical databases (distributed across several servers) and we need to collect their data in one place for data mining, etc.

The idea is to take Table1, Table2, ..., TableN from each database, combine them, and put the result in one large database.

To be able to write queries and to know which database each row came from, we will add one DatabaseID column to the target table describing where the row came from. Editing the source tables is not an option; they belong to proprietary software.

We have ~ 40 servers, ~ 170 databases and we need to copy ~ 40 tables.

Now, how should we implement this, given that it should be:

  • Easy to set up
  • Easy to maintain
  • Easy to reconfigure if the database schema changes
  • Reliable logging / alerting if something fails
  • Not too difficult to add more tables to copy

We looked at SSIS, but it seemed we would need to add each table as a source/transformation/destination. I assume it would also be tied to the database schema. Right?

Another option is SQL Server replication, but I don't see how to add a DatabaseID column to each table. It seems replication can only copy data, not modify it. Maybe we could replicate everything into separate databases on the target server and then run a local job there to merge the tables? That also looks like a lot of work whenever we add more tables to copy, since we would have to set up new publications for each database (manual work?).

The final option is to write a custom application for our needs. A big investment of time, but at least it will do exactly what we want.

To make this worse... we are using Microsoft SQL Server 2000. We will upgrade to SQL Server 2008 R2 within 6 months, but we would like this project in use before then.

Let me know what you guys think!

UPDATE 20110721

We finished an F# program that opens a connection to the SQL Server hosting the aggregated databases. From there it queries the 40 linked SQL Servers to fetch all rows (but not all columns) from certain tables, adding an extra DatabaseID column to each row to indicate which database the row came from. Which servers, tables and columns to extract is configured through a combination of a text file and hard-coded values (heh :D). It is not super fast (fetching is sequential so far), but it is completely under our control, and the data processing we do afterwards takes much longer anyway.
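The actual program is F# talking to linked SQL Servers, so here is only a minimal Python sketch of the loop's shape; the names (`tag_rows`, `collect`, the `fetch` callback) are hypothetical, not from the original code:

```python
def tag_rows(database_id, rows):
    """Prepend the source DatabaseID to every fetched row."""
    return [(database_id,) + tuple(r) for r in rows]

def collect(sources, fetch):
    """Sequentially pull one table from every configured source.

    sources: list of (database_id, server) pairs, e.g. read from the
             text-file configuration mentioned above.
    fetch:   callable (server) -> list of row tuples for the table.
    """
    merged = []
    for database_id, server in sources:
        merged.extend(tag_rows(database_id, fetch(server)))
    return merged
```

In the real program `fetch` would run a `SELECT` against the linked server; here it can be any callable, which also makes the loop easy to test.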

Future improvements may include:

  • improve error handling if it proves to be a problem (e.g. a server being unreachable, etc.)
  • fetch in parallel to reduce the total extraction time
  • figure out whether it is enough to extract only some rows, e.g. only what was added/updated
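The second improvement (parallel fetching) could look roughly like this; a sketch only, reusing the hypothetical `fetch` callback from before, since the per-source reads are I/O-bound and overlap well in threads:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_parallel(sources, fetch, max_workers=8):
    """Fetch all sources concurrently instead of sequentially.

    Keeps the (DatabaseID, *row) tagging; pool.map preserves the
    order of `sources`, so the merged output is deterministic.
    """
    def one(item):
        database_id, server = item
        return [(database_id,) + tuple(r) for r in fetch(server)]

    merged = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for rows in pool.map(one, sources):
            merged.extend(rows)
    return merged
```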

In general, it turned out to be quite simple, no dependencies on other products, and it works well in practice.

2 answers

Nothing fancy, but couldn't you do something like this?

 -- Note: linked-server references need four-part names (Server.Database.Schema.Table),
 -- and truncating (rather than dropping) keeps the target table for the INSERT.
 TRUNCATE TABLE dbo.Merged;

 INSERT INTO dbo.Merged
 SELECT [DatabaseID] = 'Database1', * FROM ServerA.Database1.dbo.Table1
 UNION ALL
 SELECT [DatabaseID] = 'Database2', * FROM ServerB.Database2.dbo.Table1
 ...
 UNION ALL
 SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.DatabaseX.dbo.Table1
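With ~170 databases, hand-writing that statement gets tedious, so one could generate it from the source list. A sketch (the function name `build_merge_sql` and the `Merged_<table>` naming are my own assumptions, not from the answer):

```python
def build_merge_sql(sources, table):
    """Generate the TRUNCATE + INSERT ... UNION ALL statement for one table.

    sources: list of (database_id, four_part_prefix) pairs, e.g.
             ("Database1", "ServerA.Database1.dbo").
    table:   the table name common to all source databases.
    """
    selects = [
        f"SELECT [DatabaseID] = '{db_id}', * FROM {prefix}.{table}"
        for db_id, prefix in sources
    ]
    return (
        f"TRUNCATE TABLE dbo.Merged_{table};\n"
        f"INSERT INTO dbo.Merged_{table}\n"
        + "\nUNION ALL\n".join(selects)
        + ";"
    )
```

Since the source list is trusted configuration (not user input), simple string formatting is acceptable here; adding a table to copy then means one loop over the table names.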

Benefits

  • Easy to set up
  • Easy to maintain
  • Easy to reconfigure if the schema changes
  • Easy to add tables

Disadvantages

  • Performance
  • No reliable logging / alerting

We had a similar requirement and used a different approach. First, we created a central database for the collected data. Then we created an inventory table holding the list of target servers/databases. Next, a small VB.NET-based CLR procedure takes a SQL query, the name of the target SQL instance, and the target table that will store the data (this eliminates linked-server configuration when adding new targets). It also adds two extra columns to the result set: the target server name and a timestamp of when the data was captured.

Then we set up a Service Broker queue/service and push the list of target servers onto it as messages.

The CLR procedure above is wrapped by another procedure that dequeues a message and executes the SQL against the specified target server. That wrapper procedure is configured as the activation procedure for the queue.

With this, we achieve a degree of parallelism in capturing the data.
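The essence of this queue-plus-activation pattern, independent of Service Broker itself, can be sketched in a few lines; the names (`run_capture`, the `capture` callback) are mine, and real Service Broker would additionally give transactional dequeue and poison-message handling:

```python
import queue
import threading

def run_capture(targets, capture, workers=4):
    """Toy analogue of the Service Broker setup: targets are enqueued as
    messages; each worker dequeues one and runs the capture independently,
    so one failing target does not stop the others."""
    q = queue.Queue()
    for t in targets:
        q.put(t)
    results, errors = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                r = capture(t)
                with lock:
                    results.append((t, r))
            except Exception as e:
                with lock:
                    errors.append((t, e))  # logged, not fatal

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results, errors
```

Disabling the queue in the real system corresponds to simply not starting the workers; the pending messages wait until it is re-enabled.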

Benefits:

  • Easy to set up, easy to manage (add/remove targets)
  • The same framework works for multiple queries
  • Logging to track failed captures
  • Each target is processed independently, so if one fails, the others continue
  • The workflow can be paused gracefully by disabling the queue (for maintenance on the central server) and resumed by re-enabling it

Disadvantages:

  • requires a good understanding of Service Broker
  • poison messages must be handled correctly

Please let me know if this helps.


Source: https://habr.com/ru/post/946479/

