Are relational databases a possible backend for a process historian?

In manufacturing, large amounts of data are often read at high frequency from several different data sources, such as NIR instruments, as well as general instruments for measuring pH, temperature, and pressure. This data is often stored in a process historian, usually for a long time.

In this regard, a process historian has different requirements than a typical relational database workload. Most queries against a historian specify either a timestamp or a time range, together with a set of variables of interest.

The workload is many frequent INSERTs, many SELECTs, few or no UPDATEs, and almost no DELETEs.

Q1. Are relational databases a good backend for the process historian?


A very naive implementation of a process historian in SQL might be something like this.

 +--------------------------------------------------+
 |  Variable                                        |
 +--------------------------------------------------+
 |  Id: integer primary key                         |
 |  Name: nvarchar(32)                              |
 +--------------------------------------------------+

 +--------------------------------------------------+
 |  Data                                            |
 +--------------------------------------------------+
 |  Id: integer primary key                         |
 |  Time: datetime                                  |
 |  VariableId: integer foreign key (Variable.Id)   |
 |  Value: float                                    |
 +--------------------------------------------------+

This structure is very simple, but probably slow for the usual operations of a process historian, since it lacks suitable indexes.
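
For concreteness, here is a minimal sketch of the DDL this diagram implies, plus one possible index and the kind of time-range query it would serve; the index name, variable IDs, and dates are illustrative assumptions, not part of the original question:

 CREATE TABLE Variable (
     Id   INTEGER PRIMARY KEY,
     Name NVARCHAR(32) NOT NULL
 );

 CREATE TABLE Data (
     Id         INTEGER PRIMARY KEY,
     Time       DATETIME NOT NULL,
     VariableId INTEGER NOT NULL REFERENCES Variable (Id),
     Value      FLOAT NOT NULL
 );

 -- One index that would serve the typical "these variables over this time range" query
 CREATE INDEX IX_Data_VariableId_Time ON Data (VariableId, Time);

 -- Typical historian query shape (IDs and dates are placeholders)
 SELECT d.Time, v.Name, d.Value
 FROM Data d
 JOIN Variable v ON v.Id = d.VariableId
 WHERE d.VariableId IN (17, 42)
   AND d.Time >= '2010-01-01' AND d.Time < '2010-01-02'
 ORDER BY d.Time;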

For example, if the Variable table holds 1,000 rows (a rather optimistic number), and all 1,000 variables are sampled once per minute (also an optimistic number), the Data table grows by 1,440,000 rows per day. Continuing the example, if each row takes about 16 bytes, that gives about 23 megabytes per day, not counting the extra space for indexes and other overhead.
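
Spelling out that arithmetic: 1,000 variables × 1 sample/minute × 1,440 minutes/day = 1,440,000 rows/day, and 1,440,000 rows × 16 bytes ≈ 23 MB/day, before indexes and other overhead.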

23 megabytes may not sound like much, but keep in mind that the numbers of variables and samples in the example were optimistic, and that the system has to run 24/7/365.

Of course, archiving and compression come to mind.

Q2. Is there a better way to do this? Perhaps using some other table structure?

+4
source
12 answers

It looks like you're talking about telemetry data (timestamps, data points).

We do not use SQL databases for this (although we do use SQL databases to catalog it); instead, we collect the actual data in binary stream files. Several binary file formats are available for this, including HDF5 and CDF. The format we use here is a proprietary compressible format. But then, we deal with hundreds of megabytes of telemetry data at a time.

You may find this article interesting (links to a Microsoft Word document):
http://www.microsoft.com/caseStudies/ServeFileResource.aspx?4000003362

This is a case study from the McLaren Group describing how SQL Server 2008 is used to collect and process telemetry data from Formula 1 race cars. Note that they do not actually store the telemetry data in the database; instead, it is stored in the file system and accessed through the FILESTREAM feature of SQL Server 2008.
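
For reference, a minimal sketch of what a FILESTREAM-backed table looks like in SQL Server 2008, assuming FILESTREAM is enabled on the instance and the database has a FILESTREAM filegroup; the table and column names are made up for illustration, not taken from the case study:

 CREATE TABLE TelemetryRun (
     -- FILESTREAM tables require a unique ROWGUIDCOL column
     RunGuid   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
     CarId     INT      NOT NULL,
     StartTime DATETIME NOT NULL,
     -- The raw telemetry blob lives in the file system but is managed by SQL Server
     RawData   VARBINARY(MAX) FILESTREAM NULL
 );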

+2
source

I work with a SQL Server 2008 database with similar characteristics: heavy on inserts and selects, light on updates and deletes. About 100,000 "nodes", each sampled at least once per hour. There is a twist: all incoming data for each "node" has to be compared with its history and used for validation, forecasting, and so on. And another twist: the data has to be presented in four different ways, so there are essentially four different copies of it, none of which can be derived from any of the others with reasonable accuracy in a reasonable time. 23 megabytes would be a walk in the park; we are talking hundreds of gigabytes to terabytes here.

You will learn a lot about scale in the process, about which techniques work and which don't, but modern SQL databases are certainly up to the task. The system I just described? It runs on a five-year-old IBM xSeries with 2 GB of RAM and a RAID 5 array, and it performs just fine: no one waits more than a few seconds even for the most complex queries.

Of course, you will need to optimize. You will often need to denormalize and maintain pre-computed aggregates (or a data warehouse) if that is part of your reporting requirements. You may need to think a little outside the box: for example, we use several custom CLR types to store raw data, and CLR aggregates/functions for some of the more unusual transactional reports. SQL Server and other database engines may not offer everything you need out of the box, but you can work around their limitations.
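
As a hedged illustration of the pre-computed aggregate idea (the rollup table, its columns, and the hourly window are assumptions about what the reporting might need, not part of this answer):

 -- Hypothetical hourly rollup so reports do not have to scan the raw Data table
 CREATE TABLE DataHourly (
     VariableId  INTEGER  NOT NULL REFERENCES Variable (Id),
     HourStart   DATETIME NOT NULL,
     MinValue    FLOAT    NOT NULL,
     MaxValue    FLOAT    NOT NULL,
     AvgValue    FLOAT    NOT NULL,
     SampleCount INT      NOT NULL,
     PRIMARY KEY (VariableId, HourStart)
 );

 -- Refreshed periodically, e.g. by an hourly job (window values are placeholders)
 DECLARE @WindowStart DATETIME = '2010-01-01 00:00';
 DECLARE @WindowEnd   DATETIME = '2010-01-01 01:00';

 INSERT INTO DataHourly (VariableId, HourStart, MinValue, MaxValue, AvgValue, SampleCount)
 SELECT VariableId,
        DATEADD(hour, DATEDIFF(hour, 0, Time), 0),   -- truncate timestamp to the hour
        MIN(Value), MAX(Value), AVG(Value), COUNT(*)
 FROM Data
 WHERE Time >= @WindowStart AND Time < @WindowEnd
 GROUP BY VariableId, DATEADD(hour, DATEDIFF(hour, 0, Time), 0);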

You will also want to cache, heavily: hourly, daily, weekly roll-ups. Invest in a separate server with as much memory and cache as you can afford. This is on top of whatever storage solution you come up with, if applicable.

One thing you will probably want to get rid of is the "Id" key in your hypothetical Data table. I assume that Data is a leaf table, as it usually is in these scenarios, which makes this one of the few situations where I would recommend a natural key over a surrogate. The same variable probably cannot produce duplicate rows for the same timestamp, so all you really need as a primary key is the variable and the timestamp. As the table grows larger and larger, a separate index on variable and timestamp (which would, of course, need to be covering) will easily eat up huge amounts of space: 20, 50, 100 GB. And, of course, every INSERT would then have to update two or more indexes.
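
Applied to the naive schema from the question, that suggestion is a minimal change along these lines (assuming a variable never produces two readings with the same timestamp):

 -- The natural composite key replaces the surrogate Id and doubles as the main index
 CREATE TABLE Data (
     VariableId INTEGER  NOT NULL REFERENCES Variable (Id),
     Time       DATETIME NOT NULL,
     Value      FLOAT    NOT NULL,
     PRIMARY KEY (VariableId, Time)
 );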

I really believe that an RDBMS (or SQL database, if you prefer) is as capable of this task as anything else, provided you take sufficient care in planning your design. If you just start gluing tables together without thinking about performance or scale, then of course you will run into problems later, and once the database is several hundred GB it will be hard to dig yourself out of that hole.

But is it possible? Absolutely. Monitor performance constantly, and over time you will learn which optimizations you need to make.

+4
source

I believe you are heading down the right path. We have a similar situation where I work: the data comes from various transport/automation systems using different technologies, in manufacturing, automotive, etc. We mainly deal with the Big 3: Ford, Chrysler, GM. But we have also had a lot of data coming in from customers like CAT.

We ended up pulling the data into a database, and as long as you index your tables properly, keep updates to a minimum, and schedule maintenance (rebuilding indexes, purging old data, updating statistics), I see no reason why this would be a bad decision; in fact I think it is a good solution.

+1
source

Yes, a DBMS is suitable for this, although it will not be the fastest option, and you will need to invest in a good storage system to handle the load. I will devote the rest of my answer to that problem.

It depends on how much hardware you are willing to throw at the problem. There are two main constraints on how fast you can insert data into the database: bulk I/O speed and seek time. A well-designed relational database will perform at least two seeks per insertion: one to begin the transaction (in case the transaction has to be aborted) and one when the transaction is committed. Add to that the additional seeks needed to find entries in the index and update them.

If your rows are large, the limiting factor will be how fast you can write data. For a hard drive this is around 60-120 MB/s; for a solid-state drive you can expect upwards of 200 MB/s. You can, of course, add more disks in a RAID array. The relevant figure here is storage bandwidth, also known as sequential I/O speed.

When writing many small transactions, the limitation is how quickly your disk can seek to a spot and write a small piece of data, measured in I/O operations per second (IOPS). We can estimate that each transaction requires 4-8 seeks (a reasonable case with transactions enabled, an index or two, plus some integrity checks). For a hard disk, the seek time is a few milliseconds, depending on the disk RPM. This will limit you to a few hundred writes per second. For a solid-state drive, the seek time is under 1 ms, so you can write several THOUSAND transactions per second.

When updating indexes, you will need O(log n) seeks to find where to update, so the database slows down as the number of records grows. Remember also that a database may not store data in the most space-efficient format, so the data size may be larger than you expect.

So, overall: YES, you can do this with a DBMS, although you will want to invest in good storage so it can keep up with the insert rate. If you want to cut costs, you may want to roll data past a certain age (say, one year) out into a secondary, compressed archive format.
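
If the archive stays inside SQL Server rather than moving to an external format, one hedged sketch of the ageing-out step might look like this (the one-year cutoff is from the answer above; the table name and the use of page compression, which requires SQL Server 2008 Enterprise, are assumptions):

 -- Hypothetical compressed archive table for readings older than one year
 CREATE TABLE DataArchive (
     VariableId INTEGER  NOT NULL,
     Time       DATETIME NOT NULL,
     Value      FLOAT    NOT NULL,
     PRIMARY KEY (VariableId, Time)
 ) WITH (DATA_COMPRESSION = PAGE);

 DECLARE @Cutoff DATETIME = DATEADD(year, -1, GETDATE());

 -- Move old rows out of the hot table, e.g. from a monthly maintenance job
 INSERT INTO DataArchive (VariableId, Time, Value)
 SELECT VariableId, Time, Value FROM Data WHERE Time < @Cutoff;

 DELETE FROM Data WHERE Time < @Cutoff;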

EDIT: A DBMS is probably the easiest system to work with for fresh data, but you should strongly consider the HDF5/CDF formats someone else suggested for storing the old, archived data. These flexible and widely supported formats provide compression and VERY efficient storage of large time series and multidimensional arrays. I believe they also provide some means of indexing the data. You should be able to write a small amount of code to extract from these archive files when the data is too old to live in the database.

+1
source

Of course, a relational database is suitable for data mining after the fact.

Various nuclear and particle physics experiments I have been involved with have explored several points on this spectrum, from not putting the bulk data into a DBMS at all (keeping only run summaries, brief analysis summaries, and slowly changing environmental conditions in the database) all the way up to pushing every bit collected into the database (though it was staged to disk first).

When and where the data rates allow it, more and more groups are moving in the direction of putting as much data as possible into the database.

+1
source

IBM Informix Dynamic Server (IDS) has a TimeSeries DataBlade and a RealTime Loader that might provide the relevant functionality.

Your naive schema records each reading as 100% independent, which makes it hard to correlate across readings, both for the same variable at different times and for different variables at (roughly) the same time. That may be what you need, but it makes life hard when it comes to subsequent processing. How much of a problem that is depends on how often you will need to run correlations across all 1,000 variables (or even a significant fraction of them, where "significant" could be as little as 1% and almost certainly starts by 10%).

I would be inclined to group related variables into sets that can be written together. For example, if you have a monitor unit that records temperature, pressure, and acidity (pH) at one location, and there are perhaps a hundred such monitors in the factory being monitored, I would expect to group the three readings plus a location ID (or monitor ID) and a time into a single row:

 CREATE TABLE MonitorReading
 (
     MonitorID   INTEGER  NOT NULL REFERENCES MonitorUnit,
     Time        DATETIME NOT NULL,
     PhReading   FLOAT    NOT NULL,
     Pressure    FLOAT    NOT NULL,
     Temperature FLOAT    NOT NULL,
     PRIMARY KEY (MonitorID, Time)
 );

This avoids needing self-joins to see which three readings were taken at a given place at a given time, and it uses about 20 bytes instead of 3 x 16 = 48 bytes per row. If you insist that every record needs a unique integer ID, that grows to 24 or 28 bytes (depending on whether you use a 4-byte or 8-byte integer for the ID column).

+1
source

There is probably a data structure that would be more optimal for your case than a relational database.

Having said that, there are many reasons to go with a relational database anyway, including robust, well-supported code, backup and replication technology, and a large community of experts.

Your use case is similar to high-volume financial and telecommunications applications. Both insert data frequently and frequently run queries that are time-based and include other selection criteria.

I worked on a mid-sized billing project that handled cable TV bills for millions of subscribers. That meant, on average, about 5 rows per subscriber per month, for several million subscribers, going into the financial transaction table. This was easily handled by a mid-size Oracle server on (now) four-year-old hardware and software. Large billing platforms can have ten times as many records per unit of time.

Properly designed, and on the right hardware, this case can be handled well by modern relational databases.

0
source

A few years ago, a client of ours tried to load real-time data collected from plant-equipment monitoring straight into a DBMS. It does not work in such a simplistic way.

Are relational databases a good backend for the process historian?

Yes, but: it should store summary data, not the raw detail.

You will need a front end and flat files; periodic reports and digests can then be loaded into the DBMS for further analysis.

For this you will need data warehousing techniques. Much of what you want to do is split your data into two main parts:

  • Facts. The data that carries units: the actual measurements.

  • Dimensions. The various attributes of the facts: date, location, device, etc.

This will lead you to a more complex data model.

 Fact: Key, Measure 1, Measure 2, ..., Measure n,
       Date, Geography, Device, Product Line, Customer, etc.

 Dimension 1 (Date/Time): Year, Quarter, Month, Week, Day, Hour
 Dimension 2 (Geography): location hierarchy of some kind
 Dimension 3 (Device):    attributes of the device
 Dimension n:             attributes of each dimension of the fact
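
As a hedged SQL sketch of that star schema, storing summaries rather than raw detail; all table and column names here are hypothetical and chosen to fit the process-historian example, not taken from this answer:

 -- Hypothetical dimension tables
 CREATE TABLE DimTime (
     TimeKey INTEGER PRIMARY KEY,
     Year    INT NOT NULL, Quarter INT NOT NULL, Month INT NOT NULL,
     Week    INT NOT NULL, Day     INT NOT NULL, Hour  INT NOT NULL
 );

 CREATE TABLE DimDevice (
     DeviceKey INTEGER PRIMARY KEY,
     Name      NVARCHAR(64) NOT NULL,
     Location  NVARCHAR(64) NOT NULL
 );

 -- Fact table: one row per device per hour, holding summary measures
 CREATE TABLE FactReading (
     TimeKey   INTEGER NOT NULL REFERENCES DimTime (TimeKey),
     DeviceKey INTEGER NOT NULL REFERENCES DimDevice (DeviceKey),
     MinValue  FLOAT   NOT NULL,
     MaxValue  FLOAT   NOT NULL,
     AvgValue  FLOAT   NOT NULL,
     PRIMARY KEY (TimeKey, DeviceKey)
 );
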
0
source

You could look at KDB. It is specifically optimized for this kind of use: many inserts, few or no updates or deletes. However, it is not as easy to use as a traditional RDBMS.

0
source

Another aspect to consider is what kind of selects you do. Relational/SQL databases are great at complex joins that depend on multiple indexes, and so on; for that they really cannot be beaten. But if you are not doing that kind of thing, they are probably not such a great match.

If all you are doing is storing a record per timestamp, I would be tempted to roll your own file format ... maybe even just dump the stuff out as CSV (groans from the audience, I know, but it is hard to beat for widespread acceptance).
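
For instance, a hypothetical roll-your-own layout could be as simple as one CSV line per sample (the header and tag names below are made up for illustration):

 time,variable,value
 2010-01-01T12:00:00,TT-101,23.7
 2010-01-01T12:00:00,PT-102,1.35

One file per day or per variable would keep archiving and compression straightforward.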

It really depends on your indexing / search requirements and your willingness to write tools for this.

0
source

You might want to take a look at Stream Data Management Systems (SDMS).

While they do not address all of your needs (long-term persistence), sliding time windows over streams of frequently changing data are their point of strength.


AFAIK, the large database vendors all have some kind of SDMS prototype in the works, so I think it is a paradigm worth checking out.

0
source

I know you are asking about relational database systems, but those are unicorns. SQL DBMSs are probably a poor fit for your needs, because no SQL system I know of provides reasonable facilities for dealing with temporal data. Depending on your needs, you may or may not have another option in specialized tools and formats; see e.g. rrdtool.

-1
source

Source: https://habr.com/ru/post/1300227/

