Creating a unique invalid base 36 id

For an application similar to URL abbreviation services, I want to create invalid id that you all know, I suppose. Here is an example of such an id:

http://example.com/sd23t9

What is a good and effective technique for creating them with minimal (or absent) risk of collision when inserting them as a primary key into a database table?

EDIT:
Piskvor makes an excellent point of view. I should have mentioned that I had in mind the minimum risk of a collision before reaching the limit of 36 ^ 6.

EDIT 2
Ah, scrap, that, his point illustrates much more than that, of course. Hmmm. Perhaps a table with an identifier was ever born (for example, I already read elsewhere)? Will this be the most effective method when I get involved with 36 ^ 6 and non-sequential constraints?

+4
source share
6 answers
Set ID length. // eg 6 do { Generate a short random ID of the given length Collision? - No: DONE! - Yes: increase ID length } while true 

There is always a risk of collision for any finite identifier length: based on your example, that you will have [a-z0-9] {6} identifiers, as soon as you have 2,167,782,336 identifiers, the collision is 100% guaranteed (no more available keys). Because of your birthday, you will come across much faster. With such a small key space, there is no way to avoid collisions - instead, you will need some kind of recovery after collisions.

You can generate an identifier until it collides - but it will become slower as you fill up the key space: imagine the key space [az] with [an] and [pz] already taken - now every new random identifier is more likely everything will collide than not; and when you completely fill the keyspace, the loop will not stop at all.

The algorithm that I propose may be too cautious in this regard: if it detects a collision, it will try to gradually increase the identifiers (since it assumes that β€œcollision => is not possible to check for a shorter key space”). Although not efficient, it can find a conflict-free identifier over several iterations.

+5
source

A bit of a strange idea. Why not use permutations? for example, you have a set of values ​​[0-9a-z] when generating the first identifier. you take the first permutation in lexicographical order. then second, etc. to make it seem less guessed, you can change the rules of the lexicographical order. let's say "a" comes after "t" or something like that. you can use a tuple instead of full permutations. This will ensure no collisions.

Actually this idea represents a two-way hash function. basically, if you can somehow encode the number "1" to get something like "q8d3dw" and be able to decode "q8d3dw" to "1", you can be sure that this function will give you unique lines for all values ​​from 1 to 36 ^ 6.

The problem is choosing this feature. An easy way would be to associate β€œ1” with β€œ000000”, β€œ2” and β€œ000001”, β€œ12” - β€œ00000b”. Basically arrange all the available lines in lexicographical order and select the line at the position that is id. This, however, is very easy to guess. So what you can do is artificially change the rules of the lexicographical order. Say instead of the usual order (0,1,2,3 ... a, b, c ... x, y, z), you can shuffle it a bit and get something like (a, 5, t, 3. ..). This will lead to slightly more confusing results. However, he will still be completely guessed, because the first element will be "aaaaaa", the second "aaaaa5", then "aaaaat". Thus, you can even more change the rules of the lexicographic order, making them dependent on the position of the character. Say the order for the first char id (a, 5, t, 3 ...) for the second (y, 7.3, r ...), etc.

Now I will not publish pseudo-code, as it will be quite long. And I do not advise you to go this route if you are not interested in creating such kind of algorithms for entertainment :). However, if you go with this route, it can be a very efficient way to generate this identifier without the chance of a collision. And I advise you to read Volume 4, The Art of Computer Programming, by Dr. Donald Knuth. There are many suggestions for implementing such algorithms.

+3
source

@ivan: here is the implementation.

First you need 6 lines, here is the code to create them:

 $letters = range('a', 'z'); $numbers = range(0, 9); $char_list = array_merge_recursive($letters, $numbers); for ($i = 0; $i <= 5; $i++) { shuffle($char_list); echo '$mysterykey[' . $i . "] = '" . implode($char_list) . "';\n"; } 

Then you can use the output from this function:

 function int_to_x36($value) { $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97'; $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef'; $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk'; $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc'; $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w'; $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh'; $x36 = array(); for ($i = 5; $i >= 0; $i--) { $x36[$i] = 0; $y = pow(36, $i); if ($value >= $y) { $z = floor($value/$y); $value = $value - ($z * $y); $x36[$i] = $z; } } $encoded = ''; for ($i = 0; $i <= 5; $i++) { $encoded .= $mysterykey[$i][$x36[$i]]; } return $encoded; } function x36_to_int($value) { $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97'; $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef'; $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk'; $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc'; $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w'; $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh'; $value36 = str_split($value); $x36 = array(); for ($i = 0; $i <= 5; $i++) { $x36[$i] = strpos($mysterykey[$i], $value36[$i]); } $x = 0; for ($i = 5; $i >= 0; $i--) { $x += $x36[$i] * pow(36, $i); } return $x; } 

I'm sure I forgot something, but it seems to work.

+2
source

For example, a large random number and SHA-256 Hash? You can cut it later to suit your needs.

+1
source

If the site is large enough, and I mean big - as in the case "we need hundreds of new identifiers assigned per second" (which will have other problems, for example, exhaust the 36 ^ 6 key space under per year), you can pre-generate keys and assign them.

Below is an example of pseudo-code - on such a site with a lot of traffic, you probably need to optimize the algorithms used.

Pre-generation is trivial - start at 000000 and go all the way to zzzzzz , insert each row in the table and mark it unused:

  ID | used ============== 000000 0 000001 0 ... zzzzzz 0 

When you receive a request for a new identifier, select random and mark it (danger! Concurrency problems!):

 SELECT ID FROM ids WHERE RAND() LIMIT 1 UPDATE ids SET used = 1, url=what_you_want_shortened WHERE ID = whatever_you_got_from_previous_query 

You will run into performance issues (e.g. with WHERE RAND() , table locking, etc.), but this is doable.

+1
source

If you are not averse to bringing a .NET dll, I created a project to do just that. The source on GitHub is here , and the binaries are in the IdGenerator NuGet Package .

It gives ordered sequences and / or random values ​​of a user-defined length, all in base-36. Identifiers are universal, even with multiple instances of an identifier generator or in a distributed environment.

 var generator = new Base36IdGenerator( numTimestampCharacters: 11, numServerCharacters: 4, numRandomCharacters: 5, reservedValue: "", delimiter: "-", delimiterPositions: new[] {15, 10, 5}); // This instance would give you a 20-character Id, with an // 11-character timestamp, 4-character servername hash, // and 5 character random sequence. This gives you 1.6 million // hash combinations for the server component, and 60 million // possible combinations for the random sequence. Timestamp is // microseconds since epoch: Console.WriteLine(generator.NewId()); // 040VZC6SL01003BZ00R2 // Argument name specified for readability only: Console.WriteLine(generator.NewId(delimited: true)); // 040VZ-C6SL0-1003B-Z00R2 

Of course, if you are more interested in having the string not guessed, and not have an ordered sequence, you could simply use the GetRandomString method:

 // 6-characters give you a little over 2 billion possible // combinations (36 ^ 6). 7 characters is about 78 billion. Console.WriteLine(generator.GetRandomString(length: 6)); 

The basic principle:

  • Get microseconds from an era (64-bit number)
  • Get a random number (64-bit) between 0 and (36 ^ desired length) (13 max)
  • Get the hash of the server name (simple Sha1)
  • Convert each component to base-36 string
  • Folder on the left from 0 to the desired length

The Base36 encoder itself is located at http://www.stum.de/2008/10/20/base36-encoderdecoder-in-c/ .

0
source

Source: https://habr.com/ru/post/1334649/


All Articles