Should join tables be created as index-organized tables (clustered indexes)?

Generally speaking, should join tables (i.e., associative tables) be created as index-organized tables (Oracle) or clustered indexes (SQL Server), or as plain old heap tables (with separate indexes on the 2 columns)?

The way I see it, the advantages are:

Speed improvement. You avoid the lookup into the heap table.

Space improvement. You eliminate the heap table altogether, so you probably save ~30% of the space.

Disadvantages:

An index skip scan (applicable only to Oracle) will be faster than a full table scan, but slower than a regular index scan. Thus, a search on the second column of the composite key will be slightly slower (Oracle) or much slower (MSSQL).

A full index scan will be slower than a full table scan. Therefore, if most of the time the cost-based optimizer relies on hash joins (which do not take advantage of indexes), you can expect worse performance. (This assumes that the RDBMS does not filter the tables first.)

This makes me question whether any indexes on join tables are really needed at all if you are primarily doing hash joins.

+6
3 answers

My personal rule of thumb is to create two-table associative entities as index-organized tables, with the primary key constraint following the "direction" of access that I expect to be used more often. I then usually add a unique index that covers the reverse order of the keys, so in all cases the optimizer should be able to use a unique scan or a range scan.

Associative entities over three (or more) tables usually require significantly more analysis.

In addition, the optimizer will use indexes for hash joins; usually as fast full scans, but indexes nonetheless.
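As a rough, runnable illustration of that rule of thumb, here is a minimal sketch that uses SQLite from Python as a stand-in. This is an assumption made for the sake of a self-contained example: the answer is about Oracle, and SQLite's `WITHOUT ROWID` table is only an analogue of an index-organized table. The table and column names (`student_course`, `student_id`, `course_id`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical join table. WITHOUT ROWID stores the rows in the
    -- primary-key B-tree itself, so there is no separate heap.
    CREATE TABLE student_course (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        PRIMARY KEY (student_id, course_id)
    ) WITHOUT ROWID;

    -- Unique index covering the reverse order of the keys.
    CREATE UNIQUE INDEX student_course_rev
        ON student_course (course_id, student_id);
""")
conn.executemany(
    "INSERT INTO student_course VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10), (3, 12)],
)


def query_plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


forward_sql = "SELECT course_id FROM student_course WHERE student_id = ?"
reverse_sql = "SELECT student_id FROM student_course WHERE course_id = ?"

# Access in the primary-key "direction".
forward_rows = sorted(r[0] for r in conn.execute(forward_sql, (1,)))
forward_plan = query_plan(forward_sql, (1,))

# Access in the reverse direction, served by the second index.
reverse_rows = sorted(r[0] for r in conn.execute(reverse_sql, (10,)))
reverse_plan = query_plan(reverse_sql, (10,))

print(forward_rows, forward_plan)
print(reverse_rows, reverse_plan)
```

In this sketch both directions are answered by an index search rather than a scan. The exact plan wording differs between SQLite versions, and other engines will report their plans differently.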

+3

I will list and discuss several possible solutions, which I hope will help you decide. A join table contains two or three columns: a foreign key to the left table, say a, and a foreign key to the right table, say b. An optional third column is a row identifier for the join table itself, for example id.

Solution 1: Columns a, b. No clustered index (heap), indexes on (a,b) and (b,a)
Both columns are stored in three places: the heap and each of the two indexes. It supports searches on both a and b, and a search on b does not require a bookmark lookup, since a is part of the index (b,a). A decent choice, but triple storage seems like a waste. The heap serves no purpose for queries, yet it must still be maintained on every insert and update.

Solution 2: Columns a, b. Clustered index on (a,b), index on (b,a)
All data is stored twice. It can search on both a and b without bookmark lookups. This would be the best approach; it trades disk storage for speed.

Solution 3: Columns a, b. Clustered index on (a,b)
All data is stored only once. It can serve a search on a, but not on b. To go through the table from right to left, you will need a table scan. It optimizes for disk space. (Your question mentions hash joins. A hash join always performs a full scan.)

Solution 4: Columns id, a, b. Clustered index on (id), indexes on (a) and (b)
A search on a or b requires a bookmark lookup. Both a and b are stored twice on disk, once in their own index and once in the clustered index. This is the worst solution I could think of.

This list is by no means exhaustive. Solution 2 would be a good default choice. I would go for it unless another solution turned out to be much better in tests.
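The difference between Solutions 2 and 3 for a search on b can be sketched with SQLite from Python. This is an assumption made for illustration: the answer is written with SQL Server in mind, and SQLite's `WITHOUT ROWID` table only approximates a table clustered on (a,b). The helper name `b_lookup_plan` and the index name `ab_rev` are hypothetical.

```python
import sqlite3


def b_lookup_plan(with_reverse_index):
    """Build the join table and return SQLite's plan for a search on b."""
    conn = sqlite3.connect(":memory:")
    # Rows live in the (a, b) primary-key B-tree: the clustered-index analogue.
    conn.execute(
        "CREATE TABLE ab ("
        " a INTEGER NOT NULL,"
        " b INTEGER NOT NULL,"
        " PRIMARY KEY (a, b)"
        ") WITHOUT ROWID"
    )
    if with_reverse_index:
        # Solution 2: add the secondary index on (b, a).
        conn.execute("CREATE INDEX ab_rev ON ab (b, a)")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT a FROM ab WHERE b = 5"
    ).fetchall()
    conn.close()
    return rows[0][-1]  # the last column holds the plan description


solution_3_plan = b_lookup_plan(with_reverse_index=False)
solution_2_plan = b_lookup_plan(with_reverse_index=True)

print("Solution 3:", solution_3_plan)  # a scan of the whole table
print("Solution 2:", solution_2_plan)  # a search using ab_rev
```

Under these assumptions, the extra index is what turns the right-to-left lookup from a scan into a search, at the cost of storing the data a second time.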

+3

I am not familiar with Oracle terminology, but for SQL Server the question is worded in a confusing way. To clarify:

  • A clustered index defines the physical order of a table.
  • A non-clustered index is basically a copy of the main table's key columns, sorted by the assigned keys.
  • You can specify additional ("included") columns in a non-clustered index, which lets the query optimizer satisfy queries from those columns rather than performing bookmark lookups.
  • A heap is a table without a clustered index. If it has no non-clustered indexes either, every query against it requires a scan.
  • A full non-clustered index scan is faster than a full table scan, provided that the index is narrower than the table and that no bookmark lookups are needed.

So, with that in mind, the keys used for joins should usually have either a clustered or a non-clustered index on them to avoid table scans. You can include additional columns in your non-clustered indexes as needed, and prefer clustered indexes for queries that span a contiguous range of key values and access many columns per row.
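The effect of adding columns to a non-clustered index can be sketched with SQLite from Python. This is an assumption made for illustration: SQLite has no `INCLUDE` clause, so the sketch widens the index key instead, which is only an approximation of SQL Server's included columns. The table `orders` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER,"
    " total REAL"
    ")"
)


def plan(sql):
    """Return the description of the first step of SQLite's query plan."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]


query = "SELECT total FROM orders WHERE customer_id = 42"

# Index on the search key only: each match still needs a lookup into the
# table to fetch `total` (the bookmark-lookup situation).
conn.execute("CREATE INDEX orders_customer ON orders (customer_id)")
narrow_plan = plan(query)

# Index that also carries `total`: the query is answered from the index alone.
conn.execute("DROP INDEX orders_customer")
conn.execute("CREATE INDEX orders_customer_total ON orders (customer_id, total)")
covering_plan = plan(query)

print(narrow_plan)
print(covering_plan)
```

SQLite labels the second plan as using a covering index, which corresponds to the case above where the optimizer can satisfy the query from the index without going back to the table.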

0

Source: https://habr.com/ru/post/904854/

