How should I handle the weight of duplicate records in the MyISAM search index?

Question

I use the output of myisam_ftdump to build a search suggestion table. The process went smoothly, but many words appear in the index more than once. Obviously I could just SELECT DISTINCT term FROM suggestions ORDER BY weight, but wouldn't that penalize words for appearing more than once?

If so, is there a sensible formula for combining the duplicate rows?

If not, which rows should I keep (for example, the one with the highest weight or the one with the lowest weight)?

Data examples

+-----+------------+----------+
| id  | word       | weight   |
+-----+------------+----------+
| 670 | young      | 0.416022 |
| 669 | york       |  0.54944 |
| 668 | years      | 0.281683 |
| 667 | years      | 0.416022 |
| 666 | wrote      | 0.416022 |
| 665 | written    |  0.35841 |
| 664 | writing    |  0.29518 |
| 663 | wright     | 0.281683 |
| 662 | witness    | 0.281683 |
| 661 | wiesenthal | 0.452452 |
| 660 | white      |  0.35841 |
| 659 | white      | 0.281683 |
| 658 | wgbh       | 0.369332 |
| 657 | weighs     |  0.35841 |
+-----+------------+----------+

Note in particular the duplicate entries for "white" and "years".

Answer

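It looks like you are using myisam_ftdump -d. I think you want myisam_ftdump -c instead.

With -c, the stats are calculated per word, so each word comes with a count and a global weight rather than one entry per occurrence.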

From the documentation, -c vs. -d:

  -c, --count         Calculate per-word stats (counts and global weights).
  -d, --dump          Dump index (incl. data offsets and word weights).
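If re-running the dump with -c is not convenient, the rows already loaded from the -d output can be collapsed to one row per word in SQL. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL, the table name suggestions comes from the question, the column names follow the sample data (word, weight), and keeping the occurrence count plus the highest per-row weight is an arbitrary choice. That maximum is not the global weight that -c reports.

```python
import sqlite3

# A few rows copied from the question's sample data: (id, word, weight).
ROWS = [
    (670, "young", 0.416022),
    (669, "york", 0.54944),
    (668, "years", 0.281683),
    (667, "years", 0.416022),
    (660, "white", 0.35841),
    (659, "white", 0.281683),
    (658, "wgbh", 0.369332),
]


def collapse_duplicates(rows):
    """Return one (word, occurrences, max_weight) tuple per distinct word."""
    # Assumption: an in-memory SQLite database stands in for the MySQL table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE suggestions (id INTEGER, word TEXT, weight REAL)")
    conn.executemany("INSERT INTO suggestions VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT word, COUNT(*) AS occurrences, MAX(weight) AS max_weight
        FROM suggestions
        GROUP BY word
        ORDER BY occurrences DESC, max_weight DESC, word
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for word, occurrences, max_weight in collapse_duplicates(ROWS):
        print(f"{word:<8} {occurrences} {max_weight}")
```

The GROUP BY query itself is plain SQL and should carry over to MySQL unchanged. Ordering by the occurrence count first means repeated words are promoted rather than penalized, which addresses the concern raised about SELECT DISTINCT.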

Source: https://habr.com/ru/post/1727487/

