10-11-2011, 05:46 PM   #1
Adcdfwwxa
Improving the accuracy of Chinese search in the wiki: MySQL FULLTEXT Chinese supp...

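The Chinese problem with MySQL full-text indexes (the MediaWiki Chinese search problem)
Category: Technology, ssmax @ 15:24:59
Today I dug through the MediaWiki source code. Its Chinese search is not very accurate and I wanted to know why, so I looked at how the search is implemented.
The database is MySQL, and the search runs against a full-text index table: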
CREATE TABLE `searchindex` (
  `si_page` int(10) unsigned NOT NULL,
  `si_title` varchar(255) NOT NULL DEFAULT '',
  `si_text` mediumtext NOT NULL,
  UNIQUE KEY `si_page` (`si_page`),
  FULLTEXT KEY `si_title` (`si_title`),
  FULLTEXT KEY `si_text` (`si_text`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
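MySQL's FULLTEXT has never supported Chinese well. If you store UTF-8 strings directly, there are no delimiters between words, so the index has no effect. The wiki gets around this with a trick: it converts each UTF-8 character to a U8xxxx token and stores the tokens separated by ASCII spaces, which makes them searchable.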
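To make the delimiter problem concrete, here is a rough Python sketch. It is not MySQL's actual parser; `naive_words` is a hypothetical helper that just splits text into runs of word characters, which is enough to show why an unsegmented Chinese phrase ends up as one long "word".

```python
import re


def naive_words(text):
    """Split text into runs of word characters (an illustrative stand-in
    for a tokenizer that relies on spaces and punctuation)."""
    return re.findall(r"\w+", text)


english = "full text search works"
chinese = "全文索引的中文问题"

# English has spaces, so every word becomes its own token.
print(naive_words(english))
# The Chinese phrase has no spaces, so the whole phrase is a single token
# and a query for a shorter word inside it has nothing to match.
print(naive_words(chinese))
```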
The wiki's character conversion code is quite handy:
cat wiki/languages/classes/LanguageZh_cn.php
/**
 * @addtogroup Language
 */
class LanguageZh_cn extends Language {
    function stripForSearch( $string ) {
        # MySQL fulltext index doesn't grok utf-8, so we
        # need to fold cases and convert to hex
        # we also separate characters as "words"
        # NOTE: the /e modifier is deprecated since PHP 5.5 and removed in PHP 7.0
        if( function_exists( 'mb_strtolower' ) ) {
            return preg_replace(
                "/([\xc0-\xff][\x80-\xbf]*)/e",
                "' U8' . bin2hex( \"$1\" )",
                mb_strtolower( $string ) );
        } else {
            list( , $wikiLowerChars ) = Language::getCaseMaps();
            return preg_replace(
                "/([\xc0-\xff][\x80-\xbf]*)/e",
                "' U8' . bin2hex( strtr( \"\$1\", \$wikiLowerChars ) )",
                $string );
        }
    }
}
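The code above turns each Chinese character into a space-separated U8xxxx token, after which MySQL's full-text index can be used. In fact, MySQL versions after 5.0 can build full-text indexes on UTF-8 characters, but because of the word-segmentation problem you would still have to separate every Chinese character with a space and also set the minimum indexed word length, so the wiki's approach is still the more convenient one.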
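For anyone who wants to try the conversion outside MediaWiki, here is a minimal Python sketch of the same idea. It is not the wiki's code; `strip_for_search` is a hypothetical name, and Python's `str.lower()` is assumed to be a good enough substitute for `mb_strtolower`.

```python
import re

# One non-ASCII UTF-8 sequence: a lead byte followed by its continuation bytes.
_MULTIBYTE = re.compile(rb"[\xc0-\xff][\x80-\xbf]*")


def strip_for_search(text: str) -> str:
    """Lower-case the text and replace every non-ASCII character with
    ' U8' plus the hex of its UTF-8 bytes, leaving ASCII untouched."""
    data = text.lower().encode("utf-8")
    converted = _MULTIBYTE.sub(
        lambda m: b" U8" + m.group(0).hex().encode("ascii"),
        data,
    )
    # Only ASCII bytes remain once every multibyte sequence is replaced.
    return converted.decode("ascii")


print(strip_for_search("中文"))       # prints " U8e4b8ad U8e69687"
print(strip_for_search("Wiki 中文"))  # ASCII part is only lower-cased
```

Each token starts with a space, so consecutive Chinese characters come out as separate ASCII "words" that the full-text indexer can handle.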
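Because every Chinese character is treated as a separate word and the search does not take their order into account, the results do not match what a Chinese speaker expects from a query. A small change to the source, wrapping the converted phrase in double quotes, is enough to get much more precise results.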
vim wiki/includes/SearchMySQL4.php
Find the following code:
if( $this->strictMatching && ($terms[1] == '') ) {
    $terms[1] = '+';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] );
Change it to:
if( $this->strictMatching && ($terms[1] == '') ) {
    // $terms[1] = '+';
    $terms[1] = '+"';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] ) . '"';
With this change the search matches the exact phrase.
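To see what the patch changes in the search expression, here is a small Python sketch. The names `encode` and `build_search` are hypothetical. It only covers the case where the user typed no operator in front of the term and strict matching is on, which is the branch the patch touches.

```python
def encode(term: str) -> str:
    """Hypothetical stand-in for stripForSearch: one ' U8<hex>' token
    per non-ASCII character, ASCII characters left as they are."""
    return "".join(
        ch if ord(ch) < 0x80 else " U8" + ch.encode("utf-8").hex()
        for ch in term.lower()
    )


def build_search(term: str, phrase: bool) -> str:
    """Build the boolean-mode fragment for one search term."""
    encoded = encode(term)
    if phrase:
        # Patched behaviour: the tokens are wrapped in double quotes,
        # so they have to appear together and in this order.
        return '+"' + encoded + '"'
    # Original behaviour: the tokens are left as separate words.
    return "+" + encoded


print(build_search("中文", phrase=False))  # prints: + U8e4b8ad U8e69687
print(build_search("中文", phrase=True))   # prints: +" U8e4b8ad U8e69687"
```

One thing to watch for: as posted, the patch appends the closing quote even when the user supplied their own operator (so `$terms[1]` was not empty), which would leave an unmatched quote in that case.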

----------------------------------------------------------------------------------------------------------------------
--- /var/lib/mediawiki/includes/SearchMySQL.php.orig 2011-06-13 09:18:52.000000000 +0800
+++ /var/lib/mediawiki/includes/SearchMySQL.php 2011-06-13 09:11:32.000000000 +0800
@@ -51,9 +51,9 @@
foreach( $m as $terms ) {
if( $searchon !== '' ) $searchon .= ' ';
if( $this->strictMatching && ($terms[1] == '') ) {
- $terms[1] = '+';
+ $terms[1] = '+"';
}
- $searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] );
+ $searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] ).'"';
if( !empty( $terms[3] ) ) {
// Match individual terms in result highlighting...
$regexp = preg_quote( $terms[3], '/' );