Are SQL execution plans based on schema or data or both?

I hope this question is not too obvious... I have already found a lot of good information about interpreting execution plans, but there is one question I haven't found the answer to.

Is the plan (or rather the relative CPU cost) based only on the schema, or also on the actual data in the database?

I am trying to analyze where indexes are needed in my production database, but I am working with my own test system, which does not have anywhere near the amount of data that production will have. I am seeing some strange things, such as the estimated CPU cost actually increasing slightly after adding an index, and I wonder whether that is because my data set is so small.

I am using SQL Server 2005 and Management Studio to look at the execution plans.

+4
source share
6 answers

It will be based on both the schema and the data. The schema tells the optimizer which indexes are available; the data tells it which one is better.

The answer may vary a little depending on the DBMS used (you did not specify), but they all maintain statistics on indexes to determine whether an index helps. If an index splits 1,000 rows into 900 distinct values, it is a good index to use. If the index yields only 3 distinct values for 1,000 rows, it is not very selective, so it is not very useful.
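
As a rough way to see how selective a column is before indexing it, you can compare the number of distinct values to the total row count. A minimal sketch, assuming a hypothetical Orders table with a Status column (both names are illustrative, not from the question):

 -- Selectivity = distinct values / total rows; closer to 1.0 means more selective
 SELECT COUNT(DISTINCT Status)                    AS DistinctValues,
        COUNT(*)                                  AS TotalRows,
        COUNT(DISTINCT Status) * 1.0 / COUNT(*)   AS Selectivity
 FROM Orders;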

+4
source

SQL Server uses a 100% cost-based optimizer. Other RDBMS optimizers are usually a combination of cost-based and rule-based, but SQL Server is, for better or worse, entirely cost-based. A rule-based optimizer would be one where, for example, the order of the tables in the FROM clause determines the driving table of the join. There are no such rules in SQL Server. See Processing SQL Queries:

The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of computing resources used. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. Some complex SELECT statements have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses sophisticated algorithms to find an execution plan whose cost is reasonably close to the minimum possible cost.

The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns results the fastest. For example, processing a query in parallel typically uses more resources than processing it serially, but completes the query faster. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.

The query optimizer relies on distribution statistics when it estimates the resource costs of different ways of extracting information from a table or index. Distribution statistics are kept for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing cars, many cars have the same manufacturer, but each car has a unique vehicle identification number (VIN). An index on the VIN is more selective than an index on the manufacturer. If the index statistics are not current, the query optimizer may not make the best choice for the current state of the table. For more information about maintaining index statistics, see Using Statistics to Improve Query Performance.
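
In SQL Server you can look at the distribution statistics (header, density vector, and histogram) the optimizer uses for a given index. A minimal sketch, assuming a hypothetical dbo.Cars table with an index named IX_Cars_VIN:

 -- Show the statistics header, density vector and histogram for the index
 DBCC SHOW_STATISTICS ('dbo.Cars', IX_Cars_VIN);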

+2
source

Both schema and data.

Statistics are taken into account when building the query plan; they are used to approximate the number of rows returned by each step in the query (since this can affect the performance of different types of joins, etc.).

A good example of this is the fact that it will not bother using indexes on very small tables, since a full scan is faster in that situation.

+1
source

I cannot speak for all RDBMSs, but Postgres specifically uses estimated table sizes as part of its effort to build query plans. As an example, if a table has two rows, it may choose a sequential scan for the portion of the JOIN that uses that table, whereas if it has 10,000+ rows, it may choose to use an index or hash scan (if either is available). Incidentally, it used to be possible to get poor query plans in Postgres by joining against VIEWs instead of real tables, since there were no estimated sizes for the VIEWs.

How Postgres builds its query plans also depends in part on the settings in its configuration file. More information about how Postgres creates its query plans can be found on the Postgres website.
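
In Postgres you can see the planner's row estimates next to the actual row counts with EXPLAIN ANALYZE; a large gap between the two usually points to stale or missing statistics. A minimal sketch, assuming a hypothetical orders table (ANALYZE refreshes its statistics first):

 -- Refresh planner statistics for the table
 ANALYZE orders;

 -- Show the chosen plan together with actual execution counts
 EXPLAIN ANALYZE
 SELECT *
 FROM orders
 WHERE customer_id = 42;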

+1
source

For SQL Server, many factors contribute to the final execution plan. At a basic level, statistics play a very large role; they are based on the data, but not always on all of the data. Statistics are also not always up to date. When creating or rebuilding an index, statistics are based on a FULL / 100% sample of the data. However, the sample rate for automatic statistics updates is much lower than 100%, so it is possible to sample a range that is not actually representative of much of the data. Estimated row counts also play a role, and these can be based on the number of rows in the table or on the statistics for a filtered operation. So outdated (or incomplete) statistics can lead the optimizer to choose a less-than-optimal plan, just as having only a few rows in a table can cause it to ignore indexes entirely (a scan can be more efficient there).
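
To see how old the statistics on a table are, and to force a full-sample refresh, something along these lines can be used (a sketch, assuming a hypothetical dbo.Orders table):

 -- When were the statistics on the table last updated?
 SELECT name                               AS StatsName,
        STATS_DATE(object_id, stats_id)    AS LastUpdated
 FROM sys.stats
 WHERE object_id = OBJECT_ID('dbo.Orders');

 -- Refresh all statistics on the table from a 100% sample
 UPDATE STATISTICS dbo.Orders WITH FULLSCAN;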

As mentioned in another answer, the more unique (i.e. selective) the data, the more useful the index will be. But keep in mind that the only column guaranteed to have statistics is the leading (or "left-most" or "first") column of an index. SQL Server can also collect statistics on other columns, even ones that are not part of any index, but only if the AutoCreateStatistics database option is set (and it is by default).
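
You can check (and, if needed, enable) that behaviour per database; a sketch, assuming a hypothetical database named MyDb:

 -- 1 = auto-create statistics is on, 0 = off
 SELECT DATABASEPROPERTYEX('MyDb', 'IsAutoCreateStatistics') AS IsAutoCreateStatistics;

 -- Turn it on if it has been disabled
 ALTER DATABASE MyDb SET AUTO_CREATE_STATISTICS ON;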

In addition, having foreign keys defined can help the optimizer when those fields appear in a query.

But one area not addressed in the question is the query itself. A query that is changed slightly but still returns the same results can have a radically different execution plan. It is also possible to defeat the use of an index by using:

 LIKE '%' + field 

or wrapping a field in a function, for example:

 WHERE DATEADD(DAY, -1, field) < GETDATE() 
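
Such predicates can usually be rewritten so that the column is left bare on one side of the comparison, which keeps an index on that column usable. A sketch of the same filter in sargable form (the table name is illustrative):

 -- Original form: the function around the column prevents an index seek
 -- WHERE DATEADD(DAY, -1, field) < GETDATE()

 -- Equivalent sargable form: move the date arithmetic to the constant side
 SELECT *
 FROM dbo.MyTable
 WHERE field < DATEADD(DAY, 1, GETDATE());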

Now keep in mind that read operations are (ideally) faster with indexes, but DML operations (INSERT, UPDATE, and DELETE) are slower (costing more CPU and disk I/O), since the indexes have to be maintained.

Finally, do not rely too heavily on the "estimated" CPU and cost values. A better test is:

 SET STATISTICS IO ON
 -- run the query
 SET STATISTICS IO OFF

and focus on "logical reads." If you reduce the logical reads, you should be improving performance.

In the end, you will need a data set that comes close to what you will have in production in order to properly tune both the indexes and the queries themselves.

0
source

Oracle specifics:

The reported cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure related to the estimated time of a block read. It is important to understand that the estimated cost says little about the actual execution speed unless every estimate made by the optimizer was 100% perfect (which never happens).

The optimizer uses the schema for many things when deciding which transformations/heuristics can be applied to a query. Some examples of schema features that matter when evaluating execution plans:

  • Foreign key constraints (can be used to eliminate tables from a join)
  • Partitioning (excludes entire ranges of data)
  • Unique constraints (index unique scan vs. index range scan, for example)
  • Not-null constraints (anti-joins are not available with NOT IN () on nullable columns)
  • Data types (type conversions, specialized date arithmetic)
  • Materialized views (for rewriting a query as a whole)
  • Dimension hierarchies (for determining functional dependencies)
  • Check constraints (the constraint predicate is injected if it lowers the cost)
  • Index types (b-tree, bitmap, join, function-based)
  • Column order in the index (a = 1 on {a, b} = range scan, on {b, a} = skip scan or fast full scan; see the sketch after this list)
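
As a small illustration of the last point, whether a predicate can drive a range scan depends on whether it constrains the leading column of the index; a sketch in generic SQL, assuming a hypothetical table t with columns a and b:

 -- Index with a as the leading column: a predicate on a alone can range-scan it
 CREATE INDEX ix_t_a_b ON t (a, b);

 -- Index with b leading: the same predicate can only skip-scan or fast-full-scan it
 CREATE INDEX ix_t_b_a ON t (b, a);

 SELECT * FROM t WHERE a = 1;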

The core of the estimates comes from statistics gathered from the actual data (or set manually). Statistics are gathered for tables, columns, indexes, partitions, and probably other things as well.

The following information is collected:

  • Number of rows in the table/partition
  • Average row/column length (important for costing full scans, hash joins, sorts, temporary tables)
  • Number of nulls in a column (is_president = 'Y' is quite rare)
  • Number of distinct values in a column (last_name is not very selective)
  • Min/max values in a column (helps with unbounded range conditions such as date > x)

... all to help estimate the expected number of rows/bytes returned when filtering data. This information is used to determine which access paths and join mechanisms are available and suitable, given the actual values in the SQL query compared against the statistics.

Among other things, there is also the physical row order, which affects how "good" or attractive an index is compared to a full table scan. For indexes, this is called the "clustering factor" and is a measure of how closely the physical row order matches the order of the index entries.
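
In Oracle, the clustering factor of each index is exposed in the data dictionary, so you can compare it against the number of table blocks vs. the number of rows; a sketch, assuming a hypothetical CARS table owned by the current schema:

 -- Low clustering factor (close to the number of table blocks) = index-friendly row order;
 -- high (close to the number of rows) = scattered rows, so full scans become more attractive
 SELECT index_name, clustering_factor
 FROM   user_indexes
 WHERE  table_name = 'CARS';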

0
source

Source: https://habr.com/ru/post/1336861/

