Using hashing to group related records

I work at an order fulfillment company, and we need to pack and ship many orders from the warehouse to our customers. To increase efficiency, we would like to group identical orders and pack them in the most optimal way. By identical, I mean orders that have the same number of order lines, containing the same SKUs with the same quantities.

To achieve this, I thought about hashing each order. Then we can group by the hash to quickly see which orders are the same.
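To illustrate the idea, here is a minimal sketch (in Python for brevity, though our stack is .NET; the function and field names are just placeholders): hash a canonical representation of the order lines so that two identical orders always produce the same hash.

```python
import hashlib
import json

def order_hash(lines):
    """Hash an order from its (sku, quantity) lines.

    Sorting the lines first makes the hash independent of the
    order in which the lines happen to be stored.
    """
    canonical = json.dumps(sorted(lines), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two orders with the same lines, listed in a different order, hash identically.
a = order_hash([("SKU-1", 2), ("SKU-7", 1)])
b = order_hash([("SKU-7", 1), ("SKU-1", 2)])
```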

We are moving from an Access database to PostgreSQL, and we have .NET-based systems for loading data as well as a general order processing system, so we could either hash during data loading or hand this task over to the database.

First, should the hash be maintained by the database, possibly with triggers, or should it be computed on the fly using a view or something similar?

And second, would it be better to calculate a hash for each order line and then combine those to get a hash for the whole order for grouping, or should I instead use a trigger on all CRUD operations in the order-line table that calculates a single hash for the entire order and stores the value in the order table?
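For the first option (combining per-line hashes), this is roughly what I mean, again sketched in Python with made-up names: hash each line, sort the line hashes so the result does not depend on line order, and hash the concatenation.

```python
import hashlib

def line_hash(sku, quantity):
    # Hash of a single order line; the "sku|quantity" format is just
    # an illustrative canonical form, not a fixed scheme.
    return hashlib.sha256(f"{sku}|{quantity}".encode("utf-8")).hexdigest()

def order_hash_from_lines(line_hashes):
    # Sort the per-line hashes so the combined value is independent of
    # line order, then hash their concatenation into one order-level hash.
    return hashlib.sha256("".join(sorted(line_hashes)).encode("utf-8")).hexdigest()

h = order_hash_from_lines([line_hash("SKU-1", 2), line_hash("SKU-7", 1)])
```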

TIA

1 answer

Unless you have requirements that constrain this, you can put the hashing wherever you feel most comfortable. For example, it may be much easier to code in .NET than in SQL. That is a viable approach as long as orders in the database are never modified directly, only through a data access layer shared by all your applications. The data access layer can then maintain the hash.

Even with a hash, you still have to check that orders with the same hash really are the same. This is because it is very hard to construct a perfect hash function (a collision-free function where every distinct object gets a different hash value) for data that can vary widely in structure.

This suggests that you will need a query (or code) that, given a set of orders, determines which of them are actually equal, grouping them into equivalence sets, i.e. confirming that orders that map to the same hash really are equal. If you start there, the same query can also be used to find duplicate orders across the entire database. That may not be particularly fast, and in that case you can look at improving performance by hashing at order insert/update time.
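As a sketch of that two-step approach (Python, with hypothetical names and a sorted-JSON canonical form chosen only for illustration): bucket orders by hash first, then confirm genuine equality inside each bucket, since hashes alone can collide.

```python
import hashlib
import json
from collections import defaultdict

def canonical(lines):
    # Sort the lines so the representation ignores line order within an order.
    return json.dumps(sorted(lines), separators=(",", ":"))

def order_hash(lines):
    return hashlib.sha256(canonical(lines).encode("utf-8")).hexdigest()

def group_equal_orders(orders):
    """orders: mapping of order id -> list of (sku, quantity) lines.

    Returns equivalence sets of genuinely equal orders: the hash only
    narrows down the candidates; real equality is verified by comparing
    canonical representations within each hash bucket.
    """
    buckets = defaultdict(list)
    for oid, lines in orders.items():
        buckets[order_hash(lines)].append(oid)

    groups = []
    for ids in buckets.values():
        remaining = list(ids)
        while remaining:
            first = remaining[0]
            same = [o for o in remaining
                    if canonical(orders[o]) == canonical(orders[first])]
            groups.append(same)
            remaining = [o for o in remaining if o not in same]
    return groups
```

The same bucket-then-verify logic translates directly into SQL (group by a stored hash column, then compare line sets within each group) once the hash is persisted on the order table.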


Source: https://habr.com/ru/post/1310579/
