Joining very large lists

Let me put some numbers up front: the largest list has about 100 million entries (but it is expected to grow to 500 million). The other lists (5-6 of them) are in the millions, but in the foreseeable future they will stay under 100 million. They are always joined on a single identifier, never on other fields. What is the best algorithm for joining such lists?

I was thinking along the lines of distributed computing: have a good hash (something like consistent hashing, where you can add a node without much data movement), and split these lists into several smaller files. Since they are always joined on a common identifier (which I would hash), the problem boils down to joining small files. Perhaps the *nix join commands could be used for that.
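To make that idea concrete, here is a minimal Python sketch of the partition-then-join scheme (the file names, the tab-separated layout with the identifier in column 1, and the bucket count are all assumptions, not from the question): partition both lists into buckets by a stable hash of the identifier, then join each pair of matching buckets with a dictionary, so only one bucket of the smaller list has to fit in memory at a time.

    import zlib

    N_BUCKETS = 64  # assumption: chosen so one bucket of the smaller list fits in RAM

    def partition(path, out_prefix):
        # Split a tab-separated file into N_BUCKETS files by a stable hash of the
        # identifier in column 1 (crc32 is stable across runs, unlike Python's
        # built-in hash() for strings).
        outs = [open(f"{out_prefix}.{b}", "w") for b in range(N_BUCKETS)]
        with open(path) as f:
            for line in f:
                key = line.split("\t", 1)[0]
                outs[zlib.crc32(key.encode()) % N_BUCKETS].write(line)
        for o in outs:
            o.close()

    def join_buckets(a_prefix, b_prefix, out_path):
        # Join matching buckets: load one bucket of list A into a dict,
        # then stream the corresponding bucket of list B past it.
        with open(out_path, "w") as out:
            for b in range(N_BUCKETS):
                with open(f"{a_prefix}.{b}") as fa:
                    lookup = dict(line.rstrip("\n").split("\t", 1) for line in fa)
                with open(f"{b_prefix}.{b}") as fb:
                    for line in fb:
                        key, rest = line.rstrip("\n").split("\t", 1)
                        if key in lookup:
                            out.write(f"{key}\t{lookup[key]}\t{rest}\n")

    # Hypothetical usage:
    # partition("list_a.tsv", "a_bucket")
    # partition("list_b.tsv", "b_bucket")
    # join_buckets("a_bucket", "b_bucket", "joined.tsv")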

A database (at least MySQL) would do the join as a merge join (since it would be on the primary key). Would that be more efficient than my approach?
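For comparison, a merge join over inputs already sorted on the identifier is a single sequential pass over both files. A rough sketch, assuming tab-separated files with the key in column 1, at most one row per key (a one-to-many join needs a bit of extra buffering), and both files sorted the same way as text:

    def merge_join(path_a, path_b, out_path):
        # Walk two key-sorted files in lockstep and emit matching rows.
        # Keys are compared as strings, so both inputs must be sorted lexically.
        with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
            line_a, line_b = fa.readline(), fb.readline()
            while line_a and line_b:
                key_a, rest_a = line_a.rstrip("\n").split("\t", 1)
                key_b, rest_b = line_b.rstrip("\n").split("\t", 1)
                if key_a == key_b:
                    out.write(f"{key_a}\t{rest_a}\t{rest_b}\n")
                    line_a, line_b = fa.readline(), fb.readline()
                elif key_a < key_b:
                    line_a = fa.readline()
                else:
                    line_b = fb.readline()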

I know that it is best to test and see, but given the size of these files that would take quite a lot of time. I would like to do some theoretical calculations first, and then see how it works out in practice.

Any insights on these ideas would be helpful. I do not mind if it takes a little longer, but I would prefer the best use of the resources I have. I do not have a huge budget :)

1 answer

Use a database. Databases are designed to perform joins (with the right indexes, of course!)
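To illustrate the suggestion (using SQLite from the Python standard library as a stand-in for MySQL; the table and column names are made up), the key point is that with the identifier as the primary key the engine chooses an efficient join plan for you:

    import sqlite3

    con = sqlite3.connect("lists.db")  # hypothetical database file
    con.executescript("""
        CREATE TABLE IF NOT EXISTS big_list   (id INTEGER PRIMARY KEY, payload TEXT);
        CREATE TABLE IF NOT EXISTS other_list (id INTEGER PRIMARY KEY, payload TEXT);
    """)

    # Bulk-load the lists with executemany(), then let the engine join them:
    rows = con.execute("""
        SELECT a.id, a.payload, b.payload
        FROM big_list AS a
        JOIN other_list AS b ON a.id = b.id
    """)
    for row in rows:
        ...  # process each joined row
    con.close()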

