We have several identical databases (distributed across several servers) and need to collect their data in one place for data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database, combine them, and put the result in one large database.
So that queries can tell which database each row came from, we will add a DatabaseID column to each target table. Editing the source tables is not an option; they belong to proprietary software.
We have ~40 servers, ~170 databases, and ~40 tables to copy.
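To make the intent concrete: per source database and per table, the extraction boils down to a single SELECT that tags every row with its origin. A minimal sketch (in F#, since that is what we eventually used — see the update below; the server/table names and the DatabaseID value are made up):

    // Build the per-source extraction query: a constant DatabaseID column
    // is prepended to whatever the source table holds. Names are illustrative.
    let extractionQuery (databaseId : int) (sourceTable : string) =
        sprintf "SELECT %d AS DatabaseID, t.* FROM %s AS t" databaseId sourceTable

    // extractionQuery 17 "[SRV03].[SomeDb].[dbo].[Table1]"
    //   => "SELECT 17 AS DatabaseID, t.* FROM [SRV03].[SomeDb].[dbo].[Table1] AS t"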
Now, how should we implement this, given that we want:
- Easy setup
- Easy maintenance
- Easy reconfiguration if the database schema changes
- Reliable logging / alerting if something doesn't work
- An easy way to add more tables to copy
We looked at SSIS, but it seemed we would need to add every table as its own source / transform / destination. I assume the packages would also be tied to the database schema. Right?
Another option is SQL Server replication, but I don't see how to add a DatabaseID column to each table; replication seems to only copy data, not modify it. Maybe we could replicate everything into separate databases on the target server and then run a local job there to merge the tables? That also looks like a lot of work whenever we add more tables to copy, since we would have to set up new publications for each database (manual work?).
The final option is to write a custom application for our needs. A big investment of time, but at least it will do exactly what we want.
To make things worse... we are on Microsoft SQL Server 2000. We will upgrade to SQL Server 2008 R2 within six months, but we would like the project in use before then.
Let me know what you guys think!
UPDATE 20110721
We finished the F# program. It opens a connection to the SQL Server that hosts the aggregated database, and from there queries the 40 linked SQL Servers to retrieve all rows (but not all columns) from the tables we care about, adding an extra DatabaseID column to each row to indicate which database it came from. The configuration of which servers, tables, and columns to extract is a mix of a text file and hard-coded values (heh :D). It is not super fast (sequential fetching so far), but it is entirely manageable, and the data processing we do afterwards takes much longer anyway.
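For reference, the heart of the program is not much more than the following (a heavily simplified sketch: the connection string, DatabaseID values, linked-server paths, target table, and column lists are invented stand-ins for what our config actually provides):

    open System.Data.SqlClient

    // One connection to the aggregation server; each copy is a single
    // INSERT ... SELECT pulling from a linked server and tagging the rows.
    let copyTable (conn : SqlConnection) (dbId : int, source : string, target : string, columns : string) =
        let sql =
            sprintf "INSERT INTO %s (DatabaseID, %s) SELECT %d, %s FROM %s"
                    target columns dbId columns source
        use cmd = new SqlCommand(sql, conn)
        cmd.CommandTimeout <- 0   // some tables are large; don't time out
        cmd.ExecuteNonQuery() |> ignore

    let run () =
        use conn = new SqlConnection(@"Data Source=AGGSRV;Initial Catalog=AggDb;Integrated Security=SSPI")
        conn.Open()
        // (dbId, linked-server source, target table, column list) per copy;
        // in the real program this comes from the text file / hard-coded config
        [ 1, "[SRV01].[ProdDb].[dbo].[Table1]", "dbo.Table1_All", "Col1, Col2"
          2, "[SRV02].[ProdDb].[dbo].[Table1]", "dbo.Table1_All", "Col1, Col2" ]
        |> List.iter (copyTable conn)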
Future improvements may include:
- improve error handling, if that turns out to be a problem (a server being offline, etc.)
- parallelize the fetching to cut the total extraction time (see the sketch after this list)
- figure out whether we can get away with extracting only some rows, e.g. only what was added/updated
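On the second bullet, F# async should make the fan-out cheap: give each fetch its own connection and run them with Async.Parallel. A rough sketch under the same invented names as above:

    open System.Data.SqlClient

    // Parallel variant of copyTable: one connection per source so the
    // linked-server queries can run concurrently.
    let copyTableAsync (connStr : string) (dbId : int, source : string, target : string, columns : string) =
        async {
            use conn = new SqlConnection(connStr)
            conn.Open()
            let sql =
                sprintf "INSERT INTO %s (DatabaseID, %s) SELECT %d, %s FROM %s"
                        target columns dbId columns source
            use cmd = new SqlCommand(sql, conn)
            cmd.CommandTimeout <- 0
            cmd.ExecuteNonQuery() |> ignore
            // The third bullet (incremental extraction) would add a
            // WHERE <modified-since> predicate to the SELECT, assuming the
            // source tables carry a usable timestamp/rowversion column.
        }

    let runParallel (connStr : string) sources =
        sources
        |> List.map (copyTableAsync connStr)
        |> Async.Parallel
        |> Async.RunSynchronously
        |> ignore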
All in all, it turned out quite simple, it has no dependencies on other products, and it works well in practice.