Relational v Hierarchical Data Models

When the relational model was proposed by FE Codd, created a time database used by a hierarchical model . I understand that the relational model felt like a significant improvement in the hierarchical approach.

My intuition is that it "makes sense" for several reasons.

  • The relational model seems to be “query agnostic” in that instead of a data form that reflects the way it is requested, it is structured so that any question can be asked relatively easily.
  • The relational model also makes volatility simple. You validate or remove facts by adding rows to the table (adding tuples to the set) or deleting them. In contrast, in a hierarchical parameter, you need to add or remove from some other object that introduces secondary questions, for example, should a parent be created if it does not exist, or deleted if it is empty.
  • The relational model easily models relationships that do not fit easily into the parent child’s approach, such as the relationship between three objects.
  • The relational model seems to be better suited for schema growth, as new kinds of facts can be added using new tables. Doing this with some caution should not violate existing tables (and facts) or the services that depend on them.

However, although it seems that the relational data model has its advantages, I would like to get some idea of ​​why, apparently, it was apparently much higher at that time and, apparently, there are still .

I would greatly appreciate arguments in some distilled form or, ideally, in one or more documents or other documents or a canonical reference that goes through the reasoning behind it.

For clarity, I do not ask about real implementations of any approach or their relative use of resources in terms of storage or computation, if this does not really matter for the answer.

Thanks.

+5
source share
2 answers

It is not entirely correct to say "created databases of the time when the hierarchical model is used."

Firstly (for nit-pick), these are database management systems that do / do not use any physical structure. "Databases" are database projects that can use all kinds of abstractions. Entity-relational modeling has been and remains popular as a design tool regardless of the possible physical platform.

Secondly, at a time when hierarchical models were common for large "big iron" databases, indexed-sequential was much more common in what was used (for example, DEC PDP-8 / -11, IBM System / 34, / 36, ICL 1900 / ME29, Honeywell DPS4 / DPS7).

We could say that a system with indexed sequences on disk grew out of systems with punch cards or magnetic tape, using a batch update by copy. Where the "sequential" comes from.

You say you don’t want to ask about the actual implementation; but the answer is about the actual implementation. Reading a disc is consistently more efficient than random access (which requires reading a puzzle). Therefore, unlike a disk, the memory is called "Random Access Memory". (This was long before the operating system became so cheap that we could store the whole database in memory.)

Similarly, a hierarchical model was organized to provide quick access to commonly used query routes. A hierarchy places closely related nodes on the same physical disk patch. Thus, it was easy to move from the customer to these customer orders to these order item positions.

The disadvantage was difficult to navigate the hierarchy - for example, to find all the order lines for item P5432, regardless of which customer / order. (In addition, if you want to get customers who order the P5432, you need to work “backward” in the hierarchy. If all this is on one patch of the disk, I hope you do not need to look too much / maybe it is in the same disk loaded in RAM.)

Similarly, an indexed sequential organization endorsed one particular index - the primary key. If you want to search by the name of the Client, and not by the number that requires a “secondary index” with all kinds of ugly organizations, in order to maintain index buckets somewhere near the data. And the notorious “bucket overflow”, which could stop the car dead when you corrected one tynya urgent spelling mistake in the name, so you moved it to a completely different letter position.

(By the way, NoSQL databases, which are repositories with key values ​​with only one key, seem to fall into all these traps for secondary indexing. They need a second key repository to provide an alternative index, all kinds of entertainment that supports their synchronization. Back to the future !)

The biggest problem that Codd had in implementing the Relational Model was to convince IBM that the model could efficiently support queries through several “access paths”. You will see that many of his early works speak of abstracting "navigation" from the author-requestor / programmer. In fact, there were many tradeoffs in the original System / R system because

a) IBM Engineers simply did not understand the mathematical abstractions Codd spoke about;

b) they were scared of recklessness, they would act like a dog, and they would lose their job.

[um! personal opinion: but the group got together afterwards, and there are some memories somewhere on the Internet.] These trade-offs continued to this day in SQL; which, frankly, is a bunch and should be killed as just an interesting proof of concept.

How did the Codd model go (or rather, the SQL model, not Codd)?

  • Drive technology improved - especially search time

  • someone calculated hash indexing and b-trees and saved all indexes for the table in separate memory to the actual data; instead of trying to hold it like a magtape serial store.

  • Larry Ellison sniffed, something happened, and stole members of the IBM engineering team to create the same in Oracle. In addition, Michael Stonebreyer formed the Ingres.

The race is over! There was no time to stop and fix it. Implement what you have (i.e., SQL proof of concept) and rush it to the market, ready or not. (Sounds like a familiar story?)

Your points about the superiority of the relational model are all well made. They mainly follow from normalization methods. I would say, however, that they were not well understood in the late 70s / 80s. Design patterns were similar to hierarchical or indexed sequential data models that simply carried over to “flat” tables. In particular, there was a tendency to create “wide” tables in order to combine everything we know about a Client on a single patch of the disk, rather than vertically. (Due to the fear of performance when merging partitions.) This meant a lot of inapplicable or "unknown" fields - this is an abomination of SQL null.

So, your "improvements" so far only partially achieved. One day, perhaps we will see a DBMS created for the relational model. For now, we have to put up with SQL.

+7
source

AntC gives a super response with a lot of historical background. I wanted to get an answer from the links made in the comments to make sure that they were not skipped.

Absolutely amazing reading - this is a recording of the 1981 Codd Turing Award Lecture . Thanks reaanb for offering it .

In the article, Codd sets out the case against modern databases in this way:

These systems are burdened by applications with numerous concepts that are not relevant to their data retrieval and data manipulation tasks, making them think and code at an unnecessarily low level of structural details (the CODASYL DBTG “set of member-owners” is an outstanding example.

No commands were provided for processing several records at a time - in other words, the DBMS did not support set processing, and as a result, programmers were forced to think and code in terms of repeating loops that are often not needed (here we use the word "set" in its traditional in the mathematical sense, not in the sense of the associated structure of CODASYL DBTG};

The end-user needs for direct interaction with databases, especially unforeseen interactions, were not sufficiently recognized. It was assumed that the query feature could be added to the DBMS after some time.

With the problem described, shortly after Codd outlined his replacement requirements - goals that he wants to fulfill.

The most important motivation for the research work, which resulted in the relational model, was to provide a clear and precise boundary between the logical and physical aspects of database management, including database design, data extraction and data manipulation. We call this the goal of data independence.

The second task was to make the model structurally simple so that all kinds of users and programmers could have a general idea of ​​the data and therefore could communicate with each other about the database. We call it objective sociability.

The third goal was to introduce high-level language concepts, but not specific syntax I, to allow users to express operations on large pieces of information at the same time. This entails providing the basis for a processing oriented on a given order, that is, the ability to express in one application the processing of several sets of records at a time. We call this job processing goal.

There were other goals, such as providing a solid theoretical foundation for organizing and managing databases, but these tasks are less relevant to our current topic of performance.

And then, finally, he presents the relational model and shows how it meets the desired goals and solves objections to modern systems.

In my opinion, this document remains extremely relevant.

This gives a strong idea of ​​why Codd proposed a relational approach, and he actually throws the gauntlet for current systems to fit his expectations and expectations. I think that they do not justify themselves, especially in how we tend to deploy in heterogeneous / polyglot client server systems. Can we argue that they have overcome the objections of Codd?

Paper is affordable and easy to read for anyone who is just familiar with the field.

The final rather extraordinary excerpt from the article stands as a warning of orthodoxy (emphasis mine):

Relational algebra, which includes these and other operators , is a power criterion . It is not intended to be the standard language to which all relational systems should relate. The task of processing a set of relational models is designed to satisfy through sub-languages ​​of data having at least the strength of relational algebra without using iterative or recursive operators.

So, Codd does not claim that the relational model is the best or only approach. He says that he has the necessary properties to achieve his goals, but does not have to be the last word in database design.

Codd Original Paper is also worth a look at relational databases. Thanks to philipxy for this , it is also readable paper and has a lot of excuses and background. This has more formal content, but you can’t get anything if you have some understanding of relational databases.

+2
source

Source: https://habr.com/ru/post/1271975/


All Articles