Indexing a large database using Lucene / PHP

Afternoon

I'm trying to index a table of 1.7 million rows with the Zend port of Lucene. Small tests of a few thousand rows worked fine, but as soon as I go up to a few tens of thousands of rows, the script times out. Obviously I could increase the time PHP allows the script to run, but given that ~10,000 rows took 360 seconds, I don't want to think about how long 1.7 million would take.
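For scale, the measured rate can be extrapolated linearly. This is only a rough sketch: it assumes throughput stays constant as the index grows, which is probably optimistic, and the function name is made up for illustration.

```php
<?php
// Linear extrapolation of indexing time from a measured sample.
// Assumption: rows per second stays the same for the whole run.
function estimateIndexingSeconds($totalRows, $sampleRows, $sampleSeconds)
{
    return $totalRows / $sampleRows * $sampleSeconds;
}

// Figures from the question: ~10,000 rows in 360 seconds, 1.7 million rows total.
$seconds = estimateIndexingSeconds(1700000, 10000, 360.0);
printf("%.0f seconds (about %.1f hours)\n", $seconds, $seconds / 3600);
// prints "61200 seconds (about 17.0 hours)"
```

So even in the best case this is a job of many hours, not something to run inside a web request.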

I also tried a script that indexes a few thousand rows, updates the index, and then starts on the next few thousand, but the index gets wiped on every run.
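One possible cause, and this is a guess because the batch script isn't shown: if every run calls Zend_Search_Lucene::create(), the existing index is replaced with a new, empty one. A minimal sketch that opens the index when it already exists and creates it only on the first run. The function name and the path are hypothetical, and the Zend Framework 1 classes are assumed to be autoloadable.

```php
<?php
// Sketch: reuse an existing index between batch runs instead of recreating it.
// Assumption: Zend_Search_Lucene::open() throws Zend_Search_Lucene_Exception
// when no index exists at $path yet.
function openOrCreateIndex($path)
{
    try {
        // Keeps all documents added by earlier batches.
        return Zend_Search_Lucene::open($path);
    } catch (Zend_Search_Lucene_Exception $e) {
        // First run only: start a new, empty index.
        return Zend_Search_Lucene::create($path);
    }
}

// Hypothetical usage at the top of each batch run:
// $index = openOrCreateIndex('/path/to/index');
```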

Any ideas, guys?

Thanks:)

+4
3 answers

I'm sorry to say this, because the developer of Zend_Search_Lucene is a friend and he worked very hard on it, but unfortunately it is not suitable for building indexes on datasets of any non-trivial size.

Use Apache Solr to create the indexes. In my tests, Solr was more than 300 times faster than Zend at building indexes.

You can use Zend_Search_Lucene to issue queries on an index created using Apache Solr.

Of course, you can also use the PHP PECL Solr extension, which I would recommend.
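A minimal sketch of what indexing through the PECL Solr extension could look like. The host, port, core path, and the Solr schema (fields id, company, psearch, taken from the asker's table) are all assumptions, and the two function names are made up for illustration.

```php
<?php
// Sketch: send database rows to Solr with the PECL solr extension.
// Assumption: the Solr schema defines the fields id, company and psearch.
function buildSolrDocument(array $row)
{
    $doc = new SolrInputDocument();
    $doc->addField('id', $row['id']);
    $doc->addField('company', $row['company']);
    $doc->addField('psearch', $row['psearch']);
    return $doc;
}

function indexRows(SolrClient $client, array $rows)
{
    // Send the whole batch in one request, then make it searchable.
    $client->addDocuments(array_map('buildSolrDocument', $rows));
    $client->commit();
}

// Hypothetical usage, with connection details that must match your setup:
// $client = new SolrClient(array(
//     'hostname' => 'localhost',
//     'port'     => 8983,
//     'path'     => 'solr/businesses',
// ));
// indexRows($client, $batchOfRows);
```

Sending documents in batches of a few thousand and committing once per batch, or once at the end, is usually cheaper than committing after every document.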

+3

Try to speed it up by selecting only the fields you need from this table.

If this is something to run as a cron job or a worker, it should be launched from the CLI, and in that case I don't see why changing the timeout would be a problem. You only need to build the full index once. After that, new records or changes to existing ones are just small updates to your Lucene index.
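A small sketch of a guard for such a script. Under the CLI SAPI, max_execution_time already defaults to 0 (no limit), so the explicit set_time_limit(0) call is only a safeguard against a custom php.ini.

```php
<?php
// Sketch: top of a long-running indexing script meant for cron or a worker.
if (PHP_SAPI !== 'cli') {
    // Refuse to run inside a web request, where timeouts apply.
    exit("This script must be run from the command line.\n");
}

// 0 means no execution time limit.
set_time_limit(0);

// ... open the index and add documents here ...
```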

0

Some more information for all of you. I'm posting it as an answer so I can use code formatting.

$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL

$feeds = array();
$x = 0;
while ($record = $result->fetch_assoc()) {
    $feeds[$x]['id'] = $record['id'];
    $feeds[$x]['company'] = $record['company'];
    $feeds[$x]['psearch'] = $record['psearch'];
    $x++;
}

// grab each feed
foreach ($feeds as $feed) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $feed["id"]));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $feed["company"]));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $feed["psearch"]));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('link', 'http://www.google.com'));
    //echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";
    $index->addDocument($doc);
}

$index->commit();

(I used google.com as a temporary link)

The server it runs on is a local installation of Ubuntu 8.10 with 3 GB of RAM and a dual-processor 3.2 GHz Pentium.

0

Source: https://habr.com/ru/post/1306934/

