PHP word index, performance and reasonable results

I am currently working on an index for a search function. The indexer will work on the data from the "fields". Fields look like this:

  Field_id   Field_type   Field_name   Field_Data
- 101        text         Name         Intel i7
- 102        integer      Cores        4 physical, 4 virtual
- 103        select       Vendor       Intel
- 104        multitext    Description  The i7 is intel next gen range of cpus.

The indexer will generate the following results / index:

  Keyword    Occurrences
- intel      101, 103, 104
- i7         101, 104
- physical   102
- virtual    102
- next       104
- gen        104
- range      104
- cpus       104   (*)
- cpu        104   (*)
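
A minimal PHP sketch of how an index like the one above could be built. The function name `buildIndex`, the `STOP_WORDS` list and the rule of skipping purely numeric tokens are assumptions made for illustration, not the actual indexer; the "cpu"/"cpus" handling marked (*) is deliberately left out.

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
const STOP_WORDS = ['the', 'is', 'of', 'a', 'an', 'and'];

/**
 * Build an inverted index: keyword => list of field ids.
 *
 * @param array $fields field_id => field_data
 * @return array keyword => int[]
 */
function buildIndex(array $fields): array
{
    $index = [];
    foreach ($fields as $fieldId => $data) {
        // Lowercase, then split on anything that is not a letter or digit.
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($data), -1, PREG_SPLIT_NO_EMPTY);
        // array_unique keeps each field id at most once per keyword.
        foreach (array_unique($tokens) as $token) {
            // Assumption: skip pure numbers ("4") and stop words.
            if (preg_match('/^\d+$/', $token) || in_array($token, STOP_WORDS, true)) {
                continue;
            }
            $index[$token][] = $fieldId;
        }
    }
    return $index;
}

$fields = [
    101 => 'Intel i7',
    102 => '4 physical, 4 virtual',
    103 => 'Intel',
    104 => 'The i7 is intel next gen range of cpus.',
];
$index = buildIndex($fields);
```

With the sample fields, `$index['intel']` holds 101, 103 and 104, while "the", "is", "of" and "4" never reach the index.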

So it looks nice and fine, however there are some issues that I would like to sort out:

  • Filtering out common words (as you may have noticed, "the", "is" and "of" are missing from the list above).
  • As for "cpus" (plural versus singular): would it be better to index one specific form (singular or plural), both forms, or only the exact word (that is, treat "cpus" as different from "cpu")?
  • Following on from the previous point, how can the plural be determined (examples: test => tests, fish => fish, leaf => leaves)?
  • MySql, ; 500 , .
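  • Suppose I want to support a search term such as "vendor:intel", where "vendor" names the field (Field_name); would that have a large impact on the SQL server?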
  • ; , , , !
  • , , , , , ; -)
  • , , .

(, , , i7;-))

+3
7

/.

Sphinx ( , ), , , .

, , , . , . , , , , , , , , PHP/MySQL.

PHP, Sphinx. , .

:

filter out common words (as you may have noticed, "the", "is" and "of" are missing from the list)

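11.2.8. stopwords

Stopwords are the words that will not be indexed. Typically you put the most frequent words in the stopwords list, because they add little value to search results but consume a lot of resources to process.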

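As for "cpus" (plural versus singular), would it be better to index one specific form (singular or plural), both, or only the exact word (that is, "cpus" is different from "cpu")?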

11.2.9. wordforms

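Word forms are applied after tokenizing the incoming text by the charset_table rules. They essentially let you replace one word with another. Normally this is used to bring different word forms to a single normal form (for example, to normalize variants such as "walks", "walked" and "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.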
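
For comparison, the same idea can be sketched in plain PHP. This is not how Sphinx implements wordforms; `WORD_FORMS` and `normalizeTokens` are hypothetical names, and the map entries are illustrative only.

```php
<?php
// Hypothetical word-form map: every listed form points at its normal form.
const WORD_FORMS = [
    'walks'   => 'walk',
    'walked'  => 'walk',
    'walking' => 'walk',
    'cpus'    => 'cpu',
];

/**
 * Replace each token by its normal form when one is listed,
 * otherwise keep the lowercased token unchanged.
 */
function normalizeTokens(array $tokens): array
{
    $normalized = [];
    foreach ($tokens as $token) {
        $token = strtolower($token);
        $normalized[] = WORD_FORMS[$token] ?? $token;
    }
    return $normalized;
}

$normalized = normalizeTokens(['Walking', 'CPUs', 'intel']);
```

Applying the same normalization at index time and at query time would make "cpu" and "cpus" hit the same index entry.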

Following on from the previous point, how can the plural be determined (examples: test => tests, fish => fish, leaf => leaves)?

Sphinx

( "-" - . .

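Suppose I want to support a search term such as "vendor:intel", where "vendor" names the field (Field_name); would that have a large impact on the SQL server?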

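3.2. Attributes

The most common example would be a forum posts table. Assume that only the title and content fields need to be full-text searchable, but that sometimes it is also required to limit the search to a certain author or sub-forum (i.e. search only those rows that have specific values in the author_id or forum_id columns of the SQL table); or to sort matches by the post_date column; or to group matching posts by month of post_date and calculate per-group match counts.

This can be achieved by specifying all the mentioned columns (excluding title and content, which are full-text fields) as attributes, indexing them, and then using API calls to set up filtering, sorting, and grouping.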

5.3. Extended query syntax:

field search operator: @vendor intel
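
A small sketch of how a user-facing "vendor:intel" term could be turned into the `@vendor intel` form shown above. `parseFieldQuery` and `toSphinxQuery` are hypothetical helper names, and the sketch assumes a single optional `field:` prefix at the start of the query.

```php
<?php
/**
 * Split "field:term" into its parts. A query without a field prefix
 * is returned with 'field' => null.
 */
function parseFieldQuery(string $query): array
{
    if (preg_match('/^\s*(\w+)\s*:\s*(.+?)\s*$/', $query, $m)) {
        return ['field' => strtolower($m[1]), 'term' => $m[2]];
    }
    return ['field' => null, 'term' => trim($query)];
}

/**
 * Render the parsed query with the "@field term" field search operator.
 */
function toSphinxQuery(array $parsed): string
{
    return $parsed['field'] === null
        ? $parsed['term']
        : '@' . $parsed['field'] . ' ' . $parsed['term'];
}

$query = toSphinxQuery(parseFieldQuery('vendor: intel'));
```

A real implementation would probably also check the field name against the list of known fields, so that input such as a URL is not mistaken for a field prefix.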

/ /etc ?

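8.6.1. Query

On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics.

> The result set is a hash (PHP specific; other languages might use other structures instead of a hash) with the following keys and values:

"matches":
A hash which maps found document IDs to another small hash containing the document weight and attribute values (or an array of similar small hashes if SetArrayResult() was enabled).

"total":
Total number of matches retrieved on the server (i.e. into the server-side result set) by this query. You can retrieve up to this number of matches from the server for this query text with the current query settings.

"total_found":
Total number of matching documents in the index (that were found and processed on the server).

"words":
A hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").

"error":
Query error message reported by searchd (string, human-readable). Empty if there were no errors.

"warning":
Query warning message reported by searchd (string, human-readable). Empty if there were no warnings.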
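
To make the shape of that result concrete, here is a sketch that reads document IDs and the total count out of such a hash. The `$result` array is a hand-written sample, not real searchd output, and `summarizeResult` is a hypothetical helper; it assumes SetArrayResult() was not enabled, so "matches" is keyed by document ID.

```php
<?php
// Hand-written sample shaped like the result hash (illustrative values only).
$result = [
    'matches' => [
        101 => ['weight' => 2, 'attrs' => ['field_type' => 1]],
        104 => ['weight' => 1, 'attrs' => ['field_type' => 4]],
    ],
    'total'       => 2,
    'total_found' => 2,
    'words'       => ['intel' => ['docs' => 2, 'hits' => 2]],
    'error'       => '',
    'warning'     => '',
];

/**
 * Extract the matched document IDs and the overall match count.
 * Throws when the result carries a non-empty error message.
 */
function summarizeResult(array $result): array
{
    if (!empty($result['error'])) {
        throw new RuntimeException('Search failed: ' . $result['error']);
    }
    return [
        'ids'         => array_keys($result['matches'] ?? []),
        'total_found' => (int) ($result['total_found'] ?? 0),
    ];
}

$summary = summarizeResult($result);
```

The IDs can then be used to load the full rows from MySQL, which answers the count/paging part of the question without an extra SQL count query.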

. 11 13 from PHP.

+1

( ) , php . http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/

preg_replace , .

, , 's', 'ed' .. . . 200 .
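
This answer mentions a stop-word list, preg_replace and suffixes such as 's' and 'ed'. One way those pieces could fit together is sketched below; the helper names `removeStopWords` and `stripSuffix` are hypothetical, and the suffix rule is deliberately naive (it is not a real stemmer).

```php
<?php
/**
 * Remove whole-word stop words from a text using preg_replace.
 */
function removeStopWords(string $text, array $stopWords): string
{
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopWords);
    // \b keeps "is" from matching inside words such as "this".
    $text = preg_replace('/\b(?:' . implode('|', $quoted) . ')\b/i', ' ', $text);
    // Collapse the whitespace left behind by the removed words.
    return trim(preg_replace('/\s+/', ' ', $text));
}

/**
 * Strip a trailing "ed" or "s", but only when at least three
 * characters remain in front of the suffix.
 */
function stripSuffix(string $word): string
{
    return preg_replace('/(?<=\w{3})(?:ed|s)$/', '', $word);
}

$cleaned = removeStopWords('The i7 is intel next gen range of cpus.', ['the', 'is', 'of']);
$stem = stripSuffix('cpus');
```

Naive stripping mangles some words (for example "glass" becomes "glas"), which is one reason a dedicated stemmer or a search engine is usually preferred.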

, , ​​ Lucine (solr), . . .

+3

. Java , , , PHP.

+1

filter out common words (as you may have noticed, "the", "is" and "of" are missing from the list)

( ) .

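As for "cpus" (plural vs singular), would it be better to index one specific form (singular or plural), both, or only the exact word (i.e. "cpus" is different from "cpu")?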

. , , ; , LIKE, .

Following on from the previous point, how can the plural be determined (examples: test => tests, fish => fish, leaf => leaves)?

Use an Inflector class. For example, Inflect::plural('fish') returns 'fish'.
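
A minimal sketch of what such an inflector does internally, covering the three cases from the question. This is not the Inflect class referred to above; `simplePlural`, `UNCOUNTABLE` and `IRREGULAR` are hypothetical names, and the rule set is far from complete.

```php
<?php
// Hypothetical, very incomplete word lists.
const UNCOUNTABLE = ['fish', 'sheep', 'series'];
const IRREGULAR   = ['leaf' => 'leaves', 'man' => 'men'];

/**
 * Return a plural form: uncountable words first, then irregular
 * forms, then a few regular English suffix rules.
 */
function simplePlural(string $word): string
{
    $lower = strtolower($word);
    if (in_array($lower, UNCOUNTABLE, true)) {
        return $lower;
    }
    $irregular = IRREGULAR[$lower] ?? null;
    if ($irregular !== null) {
        return $irregular;
    }
    if (preg_match('/(s|x|z|ch|sh)$/', $lower)) {
        return $lower . 'es';
    }
    if (preg_match('/[^aeiou]y$/', $lower)) {
        return substr($lower, 0, -1) . 'ies';
    }
    return $lower . 's';
}

$plurals = array_map('simplePlural', ['test', 'fish', 'leaf']);
```

For indexing, the reverse direction (plural to singular) is usually what is needed, so that "cpus" and "cpu" share one entry; the same layered approach of exception lists followed by suffix rules applies.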

MySql, ; 500 ,

, .

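Suppose I want to support a search term such as "vendor:intel", where "vendor" names the field (Field_name); would that have a large impact on the SQL server?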

, . , / .

; , , , !

. , .

+1

( , ;-)), , ( ).

/ /etc ? , , , . , , ? , /?

, , , . , /, , - - , , - .

0

There is a PHP implementation of the Brill part-of-speech tagger at php/ir. It can serve as a basis for identifying which words to drop and which to index, and it also identifies plurals (and their singular roots). It is not ideal, but with a custom dictionary to handle technical terms it could help with your first three questions.

0

Source: https://habr.com/ru/post/1756126/

