coreseek Chinese search: compound words/synonyms return no results

This description is a bit long, so please bear with me; I have tried to lay everything out as clearly as I can.

Everything below uses a search for the word “世界” ("world") as the example.

==================== divider ====================

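The word list is kept in a database (partly for de-duplication, partly for easier management), and the table has more than 400,000 rows. The dictionary file unigram.txt is generated by querying this table, and Traditional Chinese forms are added during generation. The format is shown in the figure:

[image]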

This file contains no duplicate words.
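The dictionary-generation step could be sketched roughly as follows. This is a minimal sketch, not the script actually used: the function name `write_unigram`, the sample words, and the fixed frequency of 1 are hypothetical, and it assumes the usual mmseg layout of a word, a tab, and a frequency, followed by an `x:` line. Fetching the words from the database and adding the Traditional Chinese forms are left out.

```python
import io

def write_unigram(words, out, freq=1):
    """Write de-duplicated words in the assumed mmseg unigram layout."""
    seen = set()
    count = 0
    for word in words:
        word = word.strip()
        if not word or word in seen:
            continue  # skip blank entries and duplicates
        seen.add(word)
        # Assumed layout: "<word>\t<freq>" followed by an "x:<freq>" line.
        out.write(f"{word}\t{freq}\nx:{freq}\n")
        count += 1
    return count

buf = io.StringIO()
written = write_unigram(["世界", "魔兽世界", "世界", "坦克世界"], buf)
print(written)
print(buf.getvalue(), end="")
```

Writing to a file object rather than a path keeps the sketch easy to check in memory; for real use, pass a file opened with `encoding="utf-8"`.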

Running the command

  mmseg -u unigram.txt

generates unigram.txt.uni, which I renamed to uni.lib.

Next, I fed unigram.txt to the build_thesaurus.py script shipped with the source code to generate the compound-word (synonym) file:

  python /mmseg-3.2.14/script/build_thesaurus.py unigram.txt > unigram_thesaurus.txt

The format of unigram_thesaurus.txt is shown in the figure:

[image]

Then I generated thesaurus.lib from unigram_thesaurus.txt:

  mmseg -t unigram_thesaurus.txt

==================== divider ====================

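These are the steps I used to rebuild the index:

1. Stop the search daemon: killall searchd

2. Delete all files under var/data/ in the coreseek installation directory

3. Edit the configuration file (test.conf) so that charset_dictpath points to the directory containing the newly generated thesaurus.lib and uni.lib

4. Rebuild the index: /coreseek/bin/indexer -c /coreseek/etc/test.conf --all

5. Start the search service: /coreseek/bin/searchd -c /coreseek/etc/test.conf &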
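The rebuild steps above could be collected into one helper so they always run in the same order. This is only a sketch: the function name `reindex_commands` is hypothetical, the `/coreseek` prefix is inferred from the paths in steps 4 and 5, and the `rm -f` line is one possible rendering of step 2. Step 3 (editing charset_dictpath) still has to be done by hand. The code only prints the commands; it does not run them.

```python
import shlex

def reindex_commands(prefix="/coreseek", conf="etc/test.conf"):
    """Return the commands for steps 1, 2, 4 and 5, in order."""
    conf_path = f"{prefix}/{conf}"
    return [
        ["killall", "searchd"],                                   # step 1
        ["sh", "-c", f"rm -f {shlex.quote(prefix)}/var/data/*"],  # step 2 (assumed data path)
        [f"{prefix}/bin/indexer", "-c", conf_path, "--all"],      # step 4
        [f"{prefix}/bin/searchd", "-c", conf_path],               # step 5
    ]

# Dry run: show what would be executed.
for cmd in reindex_commands():
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`; the trailing `&` from step 5 is dropped here because the commands are only printed.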

==================== divider ====================

The games table in the data source contains rows whose game names include “世界”, as shown in the figure:

[image]

==================== divider ====================
mmseg segments the text as follows:

  /mmseg -d /data/words/ /work/fyj/tmp/t.txt

  坦克世界/x
  魔兽世界/x 魔兽/s 世界/s
  仙侠世界/x 仙侠/s 世界/s

  tail /work/fyj/tmp/t.txt

  坦克世界
  魔兽世界
  仙侠世界

I don't know why “坦克世界” was not split.
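With 400,000 dictionary entries, checking the segmenter output by eye does not scale, so a small parser could list every name that came back without parts. This is a sketch under one assumption: it reads `/x` as the whole token and `/s` as a sub-token taken from the thesaurus, which is an interpretation of the output above rather than documented behaviour. The function name `parse_mmseg_line` is hypothetical.

```python
def parse_mmseg_line(line):
    """Split one line of `mmseg -d` output into (whole_tokens, sub_tokens)."""
    whole, parts = [], []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        if tag == "x":
            whole.append(token)   # assumed: main token
        elif tag == "s":
            parts.append(token)   # assumed: sub-token from the thesaurus
    return whole, parts

# Output copied from the mmseg run above.
output = """\
坦克世界/x
魔兽世界/x 魔兽/s 世界/s
仙侠世界/x 仙侠/s 世界/s
"""

unsplit = []
for line in output.splitlines():
    whole, parts = parse_mmseg_line(line)
    if whole and not parts:
        unsplit.extend(whole)

print(unsplit)  # ['坦克世界']
```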

Then I searched for the word “世界”:

  /coreseek/bin/search -c /coreseek/etc/test.conf '世界'

  Result:

  Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]
  Copyright (c) 2007-2011,
  Beijing Choice Software Technologies Inc (http://www.coreseek.com)

  using config file '/coreseek/etc/test.conf'...
  index 'test': query '世界 ': returned 0 matches of 0 total in 0.072 sec

  words:
  1. '世界': 0 documents, 0 hits

Yet the search returns nothing. Some additional details:
1. unigram.txt contains the word “世界”.
2. The compound-word entries for “魔兽世界”, “仙侠世界”, and “坦克世界” are as follows:

  魔兽世界
  -魔兽,世界,

  仙侠世界
  -仙侠,世界,

  坦克世界
  -坦克,世界,
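To confirm that every name really has a “世界” part in the thesaurus source, the entries above can be parsed into a mapping from compound word to parts. This is a minimal sketch based only on the layout shown here (a word line followed by a `-part,part,` line); the function name `parse_thesaurus` is hypothetical and other line types the real file might contain are ignored.

```python
def parse_thesaurus(text):
    """Map each compound word to the parts listed on its `-` line."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-"):
            if current is not None:
                # Drop the empty item left by the trailing comma.
                entries[current] = [p for p in line[1:].split(",") if p]
        else:
            current = line
    return entries

# Entries copied from the thesaurus excerpt above.
sample = """\
魔兽世界
-魔兽,世界,

仙侠世界
-仙侠,世界,

坦克世界
-坦克,世界,
"""

entries = parse_thesaurus(sample)
with_world = [word for word, parts in entries.items() if "世界" in parts]
print(with_world)  # ['魔兽世界', '仙侠世界', '坦克世界']
```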

Why does searching for “世界” return no results?

coreseek sphinx

Trium 11 years, 4 months ago

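  1. When coreseek looks up “世界”, it finds no “+世界” entry among the segmented tokens, so it returns 0 by default.

  2. “+魔兽世界” is already a single word, so naturally it is not split any further.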
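The reasoning in this answer can be illustrated with a toy inverted index. This is a sketch of the claim only and does not reproduce Coreseek's indexing: `build_index` and `with_parts` are hypothetical names, and `with_parts` simply cuts off the last two characters because every sample name ends in “世界”.

```python
from collections import defaultdict

def build_index(docs, tokenize):
    """Build a toy inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {1: "坦克世界", 2: "魔兽世界", 3: "仙侠世界"}

# Case A: each name is indexed only as one whole token.
whole_only = build_index(docs, lambda text: [text])

# Case B: the parts are indexed alongside the whole name.
def with_parts(text):
    return [text, text[:-2], text[-2:]]

split_too = build_index(docs, with_parts)

print(sorted(whole_only.get("世界", set())))  # []
print(sorted(split_too.get("世界", set())))   # [1, 2, 3]
```

In case A the query token “世界” never appears in the index, which matches the "0 documents, 0 hits" line in the question; in case B the same query matches all three names.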

Soul残舞 answered 11 years, 4 months ago