Will this method work to scale SQL queries?

I have a database containing one huge table. At the moment, a query can take anywhere from 10 to 20 minutes, and I need that to drop to about 10 seconds. I have spent several months trying different products, such as GridSQL. GridSQL works fine, but it uses its own parser, which does not support all the functions I need. I have also optimized my database in various ways without getting the speedup I need.

I have a theory about how to scale out queries, by which I mean using several nodes to run a single query in parallel. A precondition is that the data is partitioned (horizontally), with one partition placed on each node. The idea is to accept an incoming SQL query and simply run it, exactly as it is, on all the nodes. When the results are returned to the coordinator node, the same query is run over the combined result sets. I realize that an aggregate function such as an average would need to be rewritten into a count and a sum on the nodes, and that the coordinator then divides the sum of the sums by the sum of the counts to get the average.
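To make that concrete, here is a rough sketch of what I have in mind, using in-memory SQLite databases as stand-in nodes (the table, columns and data are made up purely for illustration):

```python
import sqlite3

# Three in-memory SQLite databases act as stand-in "nodes", each holding one
# horizontal partition of a hypothetical `sales` table.
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for i, conn in enumerate(shards):
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("east" if j % 2 else "west", 10.0 * i + j) for j in range(6)])

# The incoming query asks for AVG(amount) per region. AVG cannot be sent
# as-is, so the coordinator rewrites it into SUM and COUNT, runs the same
# rewritten query on every node, and combines the partial results.
node_query = "SELECT region, SUM(amount), COUNT(amount) FROM sales GROUP BY region"

partials = {}  # region -> [running sum, running count]
for conn in shards:
    for region, s, c in conn.execute(node_query):
        acc = partials.setdefault(region, [0.0, 0])
        acc[0] += s
        acc[1] += c

# Coordinator step: the sum of the sums divided by the sum of the counts
# gives the overall average.
for region, (total, count) in sorted(partials.items()):
    print(region, total / count)
```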

What problems are not easily solved with this model? I believe one of them will be the count function.

Edit: I am getting a lot of good suggestions, but none of them address the method itself.

+4
8 answers

David

Are you using all of GridSQL's features? You can also use partitioning with constraint exclusion, effectively splitting your large table into several smaller tables. Depending on your WHERE clause, the query may then look at far less data when it is processed and return results much faster.
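Roughly the idea, sketched generically (this is not GridSQL syntax; the tables are made up), with one small SQLite table per month standing in for the partitions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hand-rolled range partitioning: one table per month instead of one huge
# table. A real partitioning feature does this transparently; the point is
# only that a selective WHERE clause lets most of the data be skipped.
months = ["2023_01", "2023_02", "2023_03"]
for m in months:
    conn.execute(f"CREATE TABLE hits_{m} (day INTEGER, url TEXT)")
    conn.executemany(f"INSERT INTO hits_{m} VALUES (?, ?)",
                     [(d, f"/page/{d}") for d in range(1, 29)])

def count_hits(month):
    # "Partition elimination": the filter tells us which partition to read,
    # so the other tables are never touched at all.
    return conn.execute(f"SELECT COUNT(*) FROM hits_{month}").fetchone()[0]

print(count_hits("2023_02"))  # reads ~1/3 of the data instead of all of it
```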

Also, are you using multiple logical nodes per physical server? Configured that way, you can make use of otherwise idle cores.

If you monitor the servers while a query runs, is the bottleneck I/O or CPU?

You also mentioned that you could roll the rows in your fact table up into summary tables / cubes. I don't know enough about Tableau: will it automatically use the appropriate cube and drill down to the detail only when necessary? If so, it sounds like you would see big gains by doing something like that.

+1

This is a data volume issue, not necessarily an architecture issue.

Whether on 1 machine or 1,000 machines, if you end up summing 1,000,000 rows, you are going to have problems.

Instead of normalizing the data, you need to de-normalize it.

You note in a comment that your database is “perfect for your purpose”, when it obviously is not. It is too slow.

So, something has to give. Your ideal model does not work, because you need to process too much data in too little time. It sounds like you need one or more datasets at a higher level of aggregation than your raw data. Perhaps a warehousing solution. Who knows; there is not enough information to really say.

But there are many things you can do to satisfy a specific subset of queries with good response times, while keeping the ad hoc queries that respond "in 10-20 minutes".

Edit regarding the comment:

I am not familiar with GridSQL or what it does.

If you send the same SQL query to several separate "shard" databases, each of which holds a subset of the data, then a simple select query will scale out across the network (that is, you will eventually become network-bound at the controller), since the work is done in parallel and is stateless.

The problem becomes, as you mentioned, the secondary processing, particularly sorting and aggregates, since that can only be done on the final “raw” result set.

This means that your controller will inevitably become your bottleneck and, in the end, no matter how you "scale out", you still have to deal with the data volume problem. If you send a query to 1,000 nodes and unavoidably have to sum or sort a result set of 1,000 rows from each node, giving 1 million rows in total, you still have a long result time and a big data-processing job on a single machine.
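To illustrate, a small sketch (not tied to any particular product) of how the controller can at least stream-merge pre-sorted per-node results instead of re-sorting the whole combined set itself:

```python
import heapq

# Pretend each node already ran "... ORDER BY amount" and streamed back its
# own sorted result set.
node_results = [
    [(3, "west"), (8, "east"), (20, "east")],
    [(1, "east"), (5, "west")],
    [(2, "west"), (9, "east"), (11, "west")],
]

# The controller merges the pre-sorted streams lazily; it never has to hold
# or re-sort the full combined result set in one go.
for amount, region in heapq.merge(*node_results):
    print(amount, region)
```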

I don't know which database you are using, and I don't know the specifics of individual products, but you can see how, if you really do spread your data across several disk spindles and have a decent, modern, multi-core machine, the database implementation itself may handle much of this scaling for you, in terms of parallel requests to the disk spindles. Which implementations actually do this, I cannot say; I am only suggesting that it is possible (and that some of them can do it).

But my general sense is that if you are working mostly with aggregates, you are probably processing too much data if you go back to the source data every time. If you analyze your queries, you may find you can “pre-sum” your data at various levels of granularity and avoid the data-crunching problem.

For example, if you store individual web hits but are mostly interested in activity by hour of the day (rather than the sub-second timestamps you may be recording), summarizing down to just the hour of the day can reduce your data dramatically.
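A tiny sketch of that kind of pre-summarization (made-up table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (ts TEXT, url TEXT)")                # raw, per-hit rows
conn.execute("CREATE TABLE hits_by_hour (hour INTEGER, n INTEGER)")  # summary

conn.executemany("INSERT INTO hits VALUES (?, ?)", [
    ("2023-05-01 09:15:03.123", "/a"),
    ("2023-05-01 09:47:55.008", "/b"),
    ("2023-05-01 17:02:10.500", "/a"),
])

# Roll the detailed rows up to hour-of-day once, ahead of time...
conn.execute("""
    INSERT INTO hits_by_hour
    SELECT CAST(strftime('%H', ts) AS INTEGER), COUNT(*)
    FROM hits
    GROUP BY strftime('%H', ts)
""")

# ...so reporting queries read a handful of summary rows, not every hit.
for hour, n in conn.execute("SELECT hour, n FROM hits_by_hour ORDER BY hour"):
    print(hour, n)
```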

So, scaling out can certainly help, but it may not be the only answer to the problem; rather, it will be one component. Data warehousing is designed to address exactly these issues, but it does not suit arbitrary ad hoc queries; rather, you need a reasonable idea of which queries you want to support and design accordingly.

+6

One huge table - can this be normalized at all?

If you are mostly running select queries, have you considered either normalizing into a data warehouse that you then query, or using Analysis Services and a cube to do your preprocessing for you?

From your question, what you are doing sounds like exactly what a cube is optimized for, and it could be done without having to write all the plumbing yourself.

+5

With a custom solution (grid) you introduce a lot of complexity. Maybe it is your only option, but have you first tried table partitioning (a native solution)?

+4

I would seriously consider an OLAP solution. The trick with a cube is that, once it is built, it can be queried in all sorts of ways you may not have considered. And, as @HLGEM mentioned, have you looked at your indexing?

Even with millions of rows, a good seek should be logarithmic, not linear. If even one of your queries results in a scan, performance will be destroyed. Perhaps an example of your structure would tell us whether we can help further?
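For example (a generic sketch with SQLite standing in for whatever database you use, and a made-up table), the same query goes from a full table scan to an index seek once a suitable index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 1000, float(i)) for i in range(100_000)])

query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"

# Without an index on customer_id, the plan is a full table SCAN (linear).
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index, the plan becomes a SEARCH using the index (logarithmic seek).
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```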

I also fully agree with @Mason: have you profiled your query and examined the query plan to see where your bottlenecks are? That adding nodes increases speed makes me think your query may be CPU-bound.

+2

My guess (based purely on gut feeling) is that any benefit you see from parallelization will be eaten up by the re-aggregation and the subsequent queries over the results. I also think that writes could get complicated by PK/FK constraints. If this were my world, I would probably create a number of indexed views on top of my table (and other views), optimized for the specific queries I need to run (an approach I have used successfully on tables of 10 million rows).

+1

If you run the incoming query, unpartitioned, on every node, why would any node finish sooner than a single node running that same query on its own? Am I misreading your execution plan?

I think this depends partly on the nature of the queries being run and, in particular, on how many rows each node contributes to the final result set. But surely you need to split the query up among the nodes somehow.

+1

Your query scaling method works fine.

In fact, I implemented this method in: http://code.google.com/p/shard-query

It uses a parser, but supports most SQL constructs.

It does not yet support count(distinct expr), but that is doable, and I plan to add support in the future.
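For what it's worth, the general idea behind distributing count(distinct expr) looks something like this (a sketch of the concept only, not the shard-query implementation):

```python
# Each node runs: SELECT DISTINCT user_id FROM visits
# (pretend these sets are the per-node result sets)
node_distinct = [
    {1, 2, 3, 7},
    {2, 3, 8},
    {7, 9},
]

# The coordinator cannot just add the per-node counts (4 + 3 + 2 = 9), because
# the same value can appear on several shards; it has to union first, then count.
all_values = set().union(*node_distinct)
print(len(all_values))  # 6, the correct count(distinct user_id)
```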

I also have a tool called Flexviews (google for flexviews materialized views)

This tool lets you create materialized views (summary tables) that include various aggregate functions and joins.

Used together, these tools can provide significant scalability improvements for OLAP queries.

0

Source: https://habr.com/ru/post/1304156/

