Best way to store pairs with weights for 500,000 users?

I am building a site on which I want to match people with common interests. I do this by calculating a weight between each pair of users and determining who is best suited: the pairs with the highest weight.

Example:

user 1 with user 2 = weight of 1
user 1 with user 3 = weight of 10
user 1 with user 4 = weight of 20

I want to store the weights in the database. The problem is that with 500,000 users there are 500,000 × 500,000 / 2 possible combinations, or about 125,000,000,000 records in the MySQL database. It is not feasible to insert that much data into one table, or even into many.
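Just to show where the 125 billion figure comes from, a quick back-of-the-envelope check:

```python
# Number of unordered user pairs among n users: n * (n - 1) / 2
n = 500_000
pairs = n * (n - 1) // 2
print(pairs)  # 124,999,750,000 -> roughly 125 billion weight records
```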

My question is: is there a way to handle so many pairs with weights using a different type of database? I have read about vectors and similar things, but I don't know enough to judge whether they apply here.

I checked the documentation about:

  • NoSQL databases: MongoDB
  • Object databases: db4o, Versant
  • Graph databases: Neo4j, sones ...
  • Wide-column stores: Hadoop, HBase
  • Document stores: CouchDB
  • Key-value stores: Redis, Voldemort
  • Data grids: GigaSpaces ...
  • XML databases

But I do not see a solution in any of them. Has anyone run into this problem and can give me a hint?

+4
7 answers

I am going to go out on a limb and say that there is no good solution to this question as asked. There seems to be no way to avoid storing 125B user/weight values, given the question as posed.

Finding another type of database will not help. You just cannot get around the fact that you have 125B values that need to be stored.

There are, however, a couple of ways around this:

  • Find a relationship between users and weights. For instance, if the weight were always equal to the sum of the two user IDs (assuming each user has an ID), then you would not need to store the weights at all.
  • Calculate the weights on the fly and do not store them (see the sketch after this list).
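A minimal sketch of the second idea, computing the weight on demand from the two users' profiles. The interest-set representation and the overlap formula are assumptions of mine, not something from the question:

```python
# Derive the weight from the two users' interest sets instead of storing it
# for every pair; nothing is persisted, the value is recomputed when needed.
def weight(user_a_interests: set, user_b_interests: set) -> float:
    """Jaccard-style overlap of interests, computed on demand."""
    if not user_a_interests or not user_b_interests:
        return 0.0
    shared = len(user_a_interests & user_b_interests)
    total = len(user_a_interests | user_b_interests)
    return shared / total

print(weight({"hiking", "jazz", "chess"}, {"jazz", "chess", "cooking"}))  # 0.5
```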
+1

From your explanation, I do not think these weights need to be stored at all. They are a kind of cache for a calculation you have already done. You do not need to save the result, because you can repeat the calculation whenever you need it. You can still store the weights, but treat that store as a cache whose entries may be deleted when the cache is full.

By the way, users usually have filters. These filters can automatically rule out 95% of your user base. You can take advantage of this (see the sketch below).
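A rough sketch of how the cache plus filters could be combined; compute_weight and passes_filters are hypothetical stand-ins for whatever logic the site actually uses:

```python
from functools import lru_cache

def compute_weight(user_a: int, user_b: int) -> float:
    # Placeholder formula; the real calculation lives here.
    return 1.0 / (1 + abs(user_a - user_b))

@lru_cache(maxsize=1_000_000)  # old pairs are evicted when the cache fills up
def cached_weight(user_a: int, user_b: int) -> float:
    a, b = min(user_a, user_b), max(user_a, user_b)  # pair order must not matter
    return compute_weight(a, b)

def candidates_for(user_id, all_users, passes_filters):
    """Filters (age, location, language, ...) discard most of the user base
    before any weight has to be computed at all."""
    pool = [u for u in all_users if u != user_id and passes_filters(user_id, u)]
    return sorted(pool, key=lambda u: cached_weight(user_id, u), reverse=True)
```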

+1

From the question, it seems that the structure is a grid where each user is connected to every other user (500K × (500K − 1)). That is a lot. With a few heuristic assumptions, some optimizations may be possible.

Assumed case 1: not every pair of users will actually have a weight, which gives a sparse matrix. So why not store only the non-zero weights?
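A sketch of case 1, with an in-memory dict standing in for a database table keyed on the user pair:

```python
# Keep only pairs whose weight is non-zero; zero weights are simply absent.
weights = {}

def set_weight(user_a: int, user_b: int, w: float) -> None:
    key = (min(user_a, user_b), max(user_a, user_b))  # store each pair once
    if w:
        weights[key] = w
    else:
        weights.pop(key, None)

def get_weight(user_a: int, user_b: int) -> float:
    return weights.get((min(user_a, user_b), max(user_a, user_b)), 0.0)
```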

Assumed case 2: I have a strong feeling that the range of weights is limited. I do not think there will be 500 thousand distinct weights; more likely something like 500. If so, create 500 groups under which user pairs are stored. Not a huge space saving, but it is a way to partition the data.

To save more space under case 2, avoid storing the users themselves in these groups; instead store, per group, a set of characteristics of interest (a lower bound and an upper bound). To find matches for a given user, follow these steps (a sketch follows the list):

  • Go through the 500 weight groups and pick the ones whose lower and upper bounds fit best. You will not know the exact users yet, but you now know what they look like.
  • Search the user table for users who fall within those bounds.
  • Perform a more detailed analysis on the actual set of users returned in step 2.
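A hedged sketch of the three steps above, assuming each group stores only a weight plus the bounds of a single numeric characteristic:

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    weight: int   # one of the ~500 distinct weights
    lower: float  # lower bound of the characteristic of interest
    upper: float  # upper bound of the characteristic of interest

def find_matches(user_score: float, buckets, users):
    # Step 1: pick the bucket with the best-fitting bounds (highest weight wins).
    best = max((b for b in buckets if b.lower <= user_score <= b.upper),
               key=lambda b: b.weight, default=None)
    if best is None:
        return []
    # Step 2: coarse search of the user table for users inside those bounds.
    coarse = [uid for uid, score in users.items()
              if best.lower <= score <= best.upper]
    # Step 3: the detailed analysis would run only on this much smaller set.
    return coarse
```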

My assumptions may be wrong. In that case, consider this just a rough guess at a hint.

0

As long as your design requires storing all weights for all combinations, you cannot avoid the storage problem. Meaningful space savings can only come from changing the design itself. The other answers here suggest some good approaches. The sparse-matrix approach may work at first, but it can become useless as more and more users get connected. It would be better, for example, to store fixed buckets (ranges) of weights instead of absolute weight values.

Alternatively, see if you can abandon the fully connected mesh topology and accept something like sparsely connected clusters or hierarchies. If so, each cluster can be assigned an identifier, and you can store one weight per user for membership in its own cluster (degree of membership) plus one weight per connection between clusters. The weight connecting user-1 in cluster-1 with user-2 in cluster-2 can then be derived as a function of the inter-cluster weight and the degrees to which the users belong to their own clusters.
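A rough sketch of how such derived weights might look; the multiplicative combination is only an illustrative assumption:

```python
# Store one membership weight per user and one weight per cluster pair,
# then derive pairwise user weights instead of storing 125B of them.
membership = {"user_1": ("cluster_A", 0.9), "user_2": ("cluster_B", 0.7)}
cluster_weight = {("cluster_A", "cluster_B"): 0.4}  # symmetric, stored once

def derived_weight(u1: str, u2: str) -> float:
    c1, m1 = membership[u1]
    c2, m2 = membership[u2]
    if c1 == c2:
        return m1 * m2  # same cluster: membership overlap only
    key = tuple(sorted((c1, c2)))
    return m1 * m2 * cluster_weight.get(key, 0.0)

print(derived_weight("user_1", "user_2"))  # 0.9 * 0.7 * 0.4 = 0.252
```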

0

I think this is a very simple but interesting question, especially if you cannot use any tricks to reduce the number of stored weights. Ultimately you have key-value pairs, where the keys are made up of user pairs. As long as you only ever need to look up individual weights for given pairs of users, you can use sharding.

If your data does not change often and you have several machines to work with, you should be able to implement your own sharding strategy, or use Gizzard to manage a simple cluster with a compatible key-value store on each machine. (Gizzard requires all operations to be commutative and idempotent.)
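A minimal sketch of a pair-keyed sharding function; the shard count and hashing scheme are assumptions:

```python
import hashlib

NUM_SHARDS = 16  # assumed cluster size

def shard_for(user_a: int, user_b: int) -> int:
    # Order the pair so (a, b) and (b, a) always land on the same shard.
    a, b = min(user_a, user_b), max(user_a, user_b)
    digest = hashlib.sha1(f"{a}:{b}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# A write like "store weight w for (a, b)" is sent to shard_for(a, b);
# setting a weight is commutative and idempotent, which is what
# Gizzard-style replication expects.
print(shard_for(42, 7))
```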

0

Are you ready to create a solution from scratch?
If you are, you could create 500,000 files, one per user, and store 500,000 weights in each file as fixed-length records sorted by user ID. You can then seek to a specific position in the appropriate file and read the value without needing delimiters or actually storing the user IDs. (If your user IDs are not the numbers 1-500,000, you will also need a mapping from the real user ID to a new ID in 1-500,000, and you will sort by that new ID.)

How much precision do you need for the weights?
You can round each weight to the nearest multiple of n / 2^k for whatever k suits your needs. For roughly three decimal places of precision, you can store each number in 10 bits, i.e. k = 10. Each file would then be only 500,000 × 10 bits = 625 KB, and the whole data set would be 312.5 GB. You could even compress the files and only decompress them when necessary, depending on the trade-off you are willing to make between speed and space. This solution also assumes that changes are rare and that you only read one value (or some range of values) at a time.
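A simplified sketch of the seek-based lookup, using 2 bytes per weight instead of the packed 10-bit records to keep the code short; the file naming is hypothetical:

```python
import struct

# One file per user; weights sorted by the other user's ID, no delimiters.
RECORD = struct.Struct(">H")  # unsigned 16-bit weight, big-endian (2 bytes)

def read_weight(user_a: int, user_b: int) -> int:
    with open(f"weights_{user_a}.bin", "rb") as f:  # hypothetical file name
        f.seek(user_b * RECORD.size)                # jump straight to the record
        return RECORD.unpack(f.read(RECORD.size))[0]

# With 2-byte records each file is 500,000 * 2 bytes = 1 MB; the 10-bit
# packing described above gets that down to 625 KB at the cost of bit fiddling.
```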

0

In my opinion, the problem does not really exist, because it is unrealistic that one person knows 500 thousand people. Maybe one person is known by 500,000 people, but that person probably knows only a small fraction of them personally (Lady Gaga, for example).

A realistic lifetime average for social networks is probably around 300 contacts per person. That leaves you with "only" 150-200 million relationships.

I would go with a graph database, since it is quite easy to model relationships with one.
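A small sketch of what that could look like with Neo4j's official Python driver; connection details are placeholders:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver (assumed installed)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_weight(a_id: int, b_id: int, w: int) -> None:
    # Only the ~150-200M real relationships are ever created,
    # never the full 125B grid; the weight lives on the relationship.
    with driver.session() as session:
        session.run(
            "MERGE (a:User {id: $a}) "
            "MERGE (b:User {id: $b}) "
            "MERGE (a)-[r:SIMILAR_TO]->(b) "
            "SET r.weight = $w",
            a=a_id, b=b_id, w=w,
        )

store_weight(1, 3, 10)
```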

-1

Source: https://habr.com/ru/post/1342585/

