Benchmarking in BaseX: how to set up

I am currently an intern in a research team that makes large sets of texts (corpora) searchable. Not only can you search for literal strings, but, more importantly, you can also search for syntactic dependency structures similar to a given input, without having to know any programming language or corpus annotation. As may be clear, this tool is intended for linguists.

At the beginning of the project, before I was involved, the tool was limited to rather small corpora (up to 9 million words). The goal now is to make large text sets searchable: we are talking about more than 500 million words. An attempt has been made to improve speed, in theory, by reducing the search space (see this article), but this has not yet been verified. The result of this attempt is a new file structure. Let us call this structure B, as opposed to the raw structure A. We expect B to return query results faster in BaseX.

My question is: what is the best way to test and compare both approaches from a Perl script? Below you will find my current script, which queries BaseX locally. It takes two arguments. The first is the directory where the XPath files are stored. Each of these files contains an XPath; these are the XPaths I have chosen for the comparison. The second argument is a limit on the number of returned results. If set to 0, no limit is applied.

Since some parts of the data set are so huge, we have also split them into different files of the same size, called treebankparts. They are listed, each inside a <tb> tag, in treebankparts.lst.

#!/usr/bin/perl

use strict;
use warnings;

# BaseX Perl client (BaseXClient.pm), which provides the Session class
use BaseXClient;

$| = 1;    # flush every print

# Directory where XPaths are stored
my $directory = shift(@ARGV)
  or die "usage: $0 <xpath-directory> [limit]\n";

# Set limit. If set to zero (or omitted) all results will be returned
my $limit = shift(@ARGV) // 0;

# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);

# List all files in directory
my @xpathfiles = glob("$directory/*.txt");

# Read lines of treebank parts into variable
open( my $tfh, '<', "treebankparts.lst" )
  or die "cannot open file treebankparts.lst: $!";
chomp( my @tlines = <$tfh> );
close $tfh;

# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
    open( my $xfh, '<', $xpathfile ) or die "cannot open file $xpathfile: $!";
    chomp( my @xlines = <$xfh> );
    close $xfh;

    print STDOUT "File = $xpathfile\n";

    # Loop through lines from XPath file (= XPath query)
    foreach my $xline (@xlines) {
        # Loop through the lines of treebank file
        foreach my $tline (@tlines) {
            my ($treebank) = $tline =~ /<tb>(.+?)<\/tb>/;
            next unless defined $treebank;    # skip lines without a <tb> entry
            QuerySonar( $xline, $treebank );
        }
    }
}
$session->close();

sub QuerySonar {
    my ( $xpath, $db ) = @_;

    print STDOUT "Querying $db for $xpath\n";
    print STDOUT "Limit = $limit\n";
    my $x_limit;
    my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
      . $xpath . ';';
    my $x_open       = '<results>';
    my $x_totalcount = '<total>{count($results)}</total>';
    my $x_loopinit   = '{for $node at $limitresults in $results';

    # Spaces are important!
    if ( $limit > 0 ) {
        $x_limit = ' where $limitresults <= ' . $limit . ' ';
    }
    # Comment needed to prevent `Incomplete FLWOR expression`
    else { $x_limit = '(: No limit set :)'; }

    my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
        let $sentence := ($node/ancestor::alpino_ds/sentence)
        let $begin := ($node//@begin)
        let $idlist := ($node//@id)
        let $beginlist := (distinct-values($begin))';

    # Separate sentence info by tab
    my $x_loopexit = 'return <match>{data($sentid)}&#09;
        {string-join($idlist, "-")}&#09;
        {string-join($beginlist, "-")}&#09;
        {data($sentence)}</match>}';
    my $x_close = '</results>';

    # Concatenate all XQuery parts
    my $x_concatquery =
        $x_resultsofxp
      . $x_open
      . $x_totalcount
      . $x_loopinit
      . $x_limit
      . $x_sentenceinfo
      . $x_loopexit
      . $x_close;

    my $querysent = $session->query($x_concatquery);

    my $basexoutput = $querysent->execute();
    print $basexoutput. "\n\n";

    $querysent->close();
}

(Please note that this is a stripped-down version and that it may not work as is. This snippet does not use structure B!)
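As one possible starting point for the comparison, here is a minimal timing sketch. It is not part of the original script: the helper names time_runs and summarize are hypothetical, and the workload is a stand-in for a call such as QuerySonar( $xline, $treebank ). It only measures wall-clock time on the Perl side, so it includes client and network overhead as well as BaseX's own query time.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);    # sub-second wall-clock timestamps

# Run $code $runs times; return the duration of each run in seconds.
# (Hypothetical helper, not part of the original script.)
sub time_runs {
    my ( $code, $runs ) = @_;
    my @durations;
    for ( 1 .. $runs ) {
        my $start = time();
        $code->();
        push @durations, time() - $start;
    }
    return @durations;
}

# Return ( mean, median ) of a list of numbers, or an empty list if
# no numbers were given. (Hypothetical helper.)
sub summarize {
    my @sorted = sort { $a <=> $b } @_;
    my $n      = @sorted;
    return unless $n;
    my $sum = 0;
    $sum += $_ for @sorted;
    my $median =
      $n % 2
      ? $sorted[ ( $n - 1 ) / 2 ]
      : ( $sorted[ $n / 2 - 1 ] + $sorted[ $n / 2 ] ) / 2;
    return ( $sum / $n, $median );
}

# Stand-in workload. In the real script this would wrap the BaseX call,
# for example: sub { QuerySonar( $xline, $treebank ) }
my $workload = sub { my $x = 0; $x += $_ for 1 .. 10_000; return $x };

my @durations = time_runs( $workload, 5 );
my ( $mean, $median ) = summarize(@durations);
printf "runs=%d mean=%.6fs median=%.6fs\n", scalar @durations, $mean, $median;
```

Reporting the median next to the mean makes a single slow run (for example the first, cold one) easier to spot. Printing the query results inside the timed code would also be measured, so it may be worth leaving that out of the timed section.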

: XPath, XPath, , . Sub BaseX. XQuery BaseX (, Perl script). , , : script, .

, script. , . , , A B. ( ) , ? script, ?

, . , . script, . . . . , , . , . , XPath, ?

( -, , , , SO.)


: Perl - , . (, , .) , , XQuery, Perl.

1000 , , , 1000 , . : script bash dbms ; - 2000. - , 500 ; . ( , , . [ ] , , , script dbms.)

If you are comparing A and B, consider the order of the runs: do you interleave them, as in for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done, or do you batch them, as in for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done?
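The two run orders in the shell loops above can also be expressed in Perl, which keeps the driver in the same language as the scripts under test. This is only a sketch: interleaved_order, batched_order and run_in_order are hypothetical names, and A.pl and B.pl are the script names used in the loops above.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Interleaved order: A B A B ... (each round runs every script once).
sub interleaved_order {
    my ( $rounds, @scripts ) = @_;
    my @order;
    push @order, @scripts for 1 .. $rounds;
    return @order;
}

# Batched order: A A ... B B ... (all rounds of one script, then the next).
sub batched_order {
    my ( $rounds, @scripts ) = @_;
    my @order;
    for my $script (@scripts) {
        push @order, ($script) x $rounds;
    }
    return @order;
}

# Run each script in the given order with the current perl interpreter
# and print "script<TAB>seconds" per run. Missing scripts are skipped.
sub run_in_order {
    my @order = @_;
    for my $script (@order) {
        next unless -e $script;
        my $start = time();
        system( $^X, $script ) == 0
          or warn "$script exited with status $?\n";
        printf "%s\t%.3f\n", $script, time() - $start;
    }
    return;
}

my @order = interleaved_order( 5, 'A.pl', 'B.pl' );
print "Planned order: @order\n";
run_in_order(@order);
```

Swapping interleaved_order for batched_order in the last lines gives the second loop. Running both orders and comparing the timings is one way to see whether caching between consecutive runs affects the comparison.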

Perl script, XQuery.

: corpus , : ( , -) , , dbms ( BaseX), BaseX, BaseX, , -, . , , , , , , BaseX.

, , XQuery

2 + 3

42

BaseX ; - . ( : , ? , BaseX , - ?...)

: , , , , . , " X Y?" " X Y ?" , . ( , , , .)


. , BaseX perl script, perl script, , XQuery ( XPath, ). XQueries, , XQuery . BaseX, API Perl . perl , .

, , script , . XQueries A B perl script.

, , , , Java JIT- ( BaseX java, JIT , BaseX). Client/ .

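Note: when you start BaseX from the command line, some options may be useful (for example -V, which prints detailed query information, including timing). There is also -r, which runs a query multiple times.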

, script, , . , . , 500 .

: BaseX, BaseX , , , ML.


Source: https://habr.com/ru/post/1629775/

