How to develop a string matching algorithm where input is an exact string and search values ​​are regular?

This is a design issue.

Reference Information. We get a web request in our system from many different sites (for the widget that we issue), from which we get the referrer line (if it exists). We use the referrer to decide on some things in the application. The problem is that I need to look at the list of "sites" (URLs, partial URLs, URLs containing wildcards) to determine what to do. This list can be on the order of many thousands of sites. I need to be able to ask something like "Site Service" (or something else) if the referrer matches any on the list of sites. I need to do this quickly, say 5-10 ms, give or take a few ms, and return a positive or negative result.

Here is an example:

Request - Referrer = http://www.stackoverflow.com/users/120262?tab=accounts

The list of sites may contain URLs, for example:

  • users.stackoverflow.com - (non-compliance)
  • www.stackoverflow.com/users - (match)
  • www.stackoverflow.com/users/120262 - (match)
  • www.stackoverflow.com/users/* - (match)
  • */users/* - (match)
  • www.stackoverflow.com/users/239289 - (no coincidence)
  • *.stackoverflow.com/questions/ask - (does not match)
  • */questions/* - (no coincidence)
  • www.stackoverflow.com - (match)
  • www.msdn.com - (no coincidence)
  • *.msdn.com - (no coincidence)
  • developer.*.com - (no coincidence)

You get the idea ...

The problem I'm dealing with is how to handle this with performance and scalability.

The contractor, that I need to quickly make a decision so that I can proceed to the real processing that should happen.

Scalability is that a list of thousands of "sites" is set up for each subsidiary site that we have, and each of them can have many site lists, which makes thousands of site lists containing thousands of sites.

, , () . , , , .

.

+3
3

, , , , - , , wilcards "*", .

, , , , .

, , , , , Aho-Corasick:

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

, , , , , . , " ", . "developer..com" "". ".com". , . , URL- . .com, -, , ( , , "a.com.developer.foo", , , "developer..com" ).

, Aho-Corasick , . , , . , Google " Deep Packet Inspection ", .

, Cisco Systems. , - Cisco, .

+2

URL-, 5-10 , ( ) , ( ). , : - , ? , .

, :

  • URL- - . , , "/" ( , ). "/" - , "/" - .
  • , ".".
  • , "/" .

, URL, :

"www.stackoverflow.com/users/239289"

:

"www", "stackoverflow", "com", "/" , "users", "239289"

, "/" , .

tokenize URL - PHP, ( ). , :

function tokenize_url($url) {
    $pos = strpos($url, '/');
    if ($pos === 0) {
        // It a path-only entry.
        $hostname = '*';
        $path = substr($url, 1);
    } else if ($pos !== false) {
        // It a normal URL with hostname and path.
        $hostname = substr($url, 0, $pos);
        $path = substr($url, $pos + 1);
        if ($path === false) {
            $path = '';
        }
    } else {
        // No slash found, assume it a hostname only.
        $hostname = $url;
        $path = '';
    }

    if ($hostname !== '') {
        $hostname_tokens = explode('.', $hostname);
    } else {
        $hostname_tokens = array();
    }

    if ($path !== '') {
        $path_tokens = explode('/', $path);
    } else {
        $path_tokens = array();
    }

    return array_merge($hostname_tokens, array('/'), $path_tokens);
}

, , URL-, URL- ( ). , ( , ), O (1) . , , "%!%!%" node.

- , :

function compile_site_list($site_list) {
    $root = array();

    foreach ($site_list as $url) {
        $tokens = tokenize_url($url);
        $node = &$root;

        for ($i=0; $i<count($tokens); $i++) {
            // The "%" forces PHP to evaluate this as a string, no matter what.
            // Sadly, casting to a string doesn't do it!
            $token = $tokens[$i] . '%';

            // If this is our first time seeing this string here, make a
            // blank node.
            if (!(isset($node[$token]))) {
                $node[$token] = array();
            }

            if ($i < (count($tokens) - 1)) {
                // If we're not at the end yet, keep traversing down.
                $node = &$node[$token];
            } else {
                // If we're at the end, mark it with our special marker.
                $node[$token]['%!%!%'] = 1;
            }
        }
    }

    return $root;
}

, URL- , compile_site_list() - .

URL. -, , , :

function scrub_url($url) {
    // Get rid of the protocol (if present).
    $pos = strpos($url, '://');
    if ($pos !== false) {
        $url = substr($url, $pos + 3);
    }

    // Get rid of any query string (if present).
    $pos = strpos($url, '?');
    if ($pos !== false) {
        $url = substr($url, 0, $pos);
    }

    return $url;
}

, , URL-, , . "%!%!%" , .

, , . , , ( "/" ), , .

- .

:

function search_compiled_list($url, $compiled_site_list) {
    $url = scrub_url($url);
    $tokens = tokenize_url($url);

    return do_search($tokens, $compiled_site_list);
}

function do_search($tokens, $compiled_site_list) {
    // Base cases for recursion:
    if (isset($compiled_site_list['%!%!%'])) {
        // If we're at a node that has our marker hanging off it - we found it!
        return true;
    } else if (count($tokens) === 0) {
        // Otherwise, if we got somewhere and have no tokens left, we didn't
        // find it.
        return false;
    }

    // The "%" on the end forces PHP to evaluate this as a string, no
    // matter what.
    $token = $tokens[0] . '%';

    if (isset($compiled_site_list[$token])) {
        // Found an exact match!
        $result = do_search(array_slice($tokens, 1),
            $compiled_site_list[$token]);
        if ($result === true) {
            return true;
        }
    }

    // Didn't find an exact match - let check for wildcards.
    if ((isset($compiled_site_list['*%'])) && ($tokens[0] !== '/')) {
        // If we're matching the wildcard, let it consume as many tokens
        // as it wants.  The only thing it can't consume is /.
        for ($i=1; $i<count($tokens); $i++) {
            $result = do_search(array_slice($tokens, $i),
                $compiled_site_list['*%']);
            if ($result === true) {
                return true;
            }
        }
    }

    return false;
}

, - $site_list, URL-, :

$url_to_check = "http://www.stackoverflow.com/users/120262?tab=accounts";
$compiled_site_list = compile_site_list($site_list);
$result = search_compiled_list($url_to_check, $compiled_site_list);
var_dump($result);

URL-, , , , , . , , , , /. ( , , .)

, URL-, , URL- ( ), . , - , 2 , .

+5

I'm not sure I understood your problem correctly, but it looks like a prefix tree (aka trie) . Download all the URLs of your site into one, and then quickly go down using the referrer URL. It should get you into a subset of “matches” very quickly. There are a number of trie vari worth paying attention to.

+1
source

Source: https://habr.com/ru/post/1771838/


All Articles