Friday, October 23, 2009

Sphinx for a better internal search engine decoded !! woot!!

sphinx is intended to be a better replacement of sql query 'like', when you are running a large site and have a lot of content, the sql 'like' query is very lousy and create a serious query bottleneck and have no guaranties on successful query.

When a site use sphinx it can query the database like syntax in a very simple manner, just like google or yahoo did, but not the page rank :D...

ok here it goes (i take the example from https://www.ibm.com/developerworks/library/os-php-sphinxsearch/).

first create database body_part on mysql, and then, create a table model
CREATE TABLE Model (
id int(10) unsigned NOT NULL auto_increment,
label varchar(7) NOT NULL,
description varchar(256) NOT NULL,
begin_production int(4) NOT NULL,
end_production int(4) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;


and insert the database

INSERT INTO Model
(`id`, `label`, `description`, `begin_production`, `end_production`)
VALUES
(1,'X Sedan','Four-door performance sedan',1998,1999),
(3,'X Sedan','Four door performance sedan, 1st model year',1995,1997),
(4,'J Convertible','Two-door roadster, metal retracting roof',2002,2005),
(5,'J Convertible','Two-door roadster',2000,2001),
(7,'W Wagon','Four-door, all-wheel drive sport station wagon',2007,0);



after that create a table Assembly

CREATE TABLE Assembly (
id int(10) unsigned NOT NULL auto_increment,
label varchar(7) NOT NULL,
description varchar(128) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;

and create a sample data in assembly
INSERT INTO Assembly
(`id`, `label`, `description`)
VALUES
(1,'5-00','Seats'),
(2,'4-00','Electrical'),
(3,'3-00','Glasses'),
(4,'2-00','Frame'),
(5,'1-00','Engine'),
(7,'101-00','Accessories');


create table inventory and populate with the data
CREATE TABLE Inventory (
id int(10) unsigned NOT NULL auto_increment,
partno varchar(32) NOT NULL,
description varchar(256) NOT NULL,
price float unsigned NOT NULL default '0',
PRIMARY KEY (id),
UNIQUE KEY partno USING BTREE (partno)
) ENGINE=InnoDB;

INSERT INTO `Inventory`
(`id`, `partno`, `description`, `price`)
VALUES
(1,'WIN408','Portal window',423),
(2,'ACC711','Jack kit',110),
(3,'ACC43','Rear-view mirror',55),
(4,'ACC5409','Cigarette lighter',20),
(5,'WIN958','Windshield, front',500),
(6,'765432','Bolt',0.1),
(7,'ENG001','Entire engine',10000),
(8,'ENG088','Cylinder head',55),
(9,'ENG976','Large cylinder head',65);


after that, create a schematic table like this as a table relation with the other tables

CREATE TABLE Schematic (
id int(10) unsigned NOT NULL auto_increment,
partno_id int(10) unsigned NOT NULL,
assembly_id int(10) unsigned NOT NULL,
model_id int(10) unsigned NOT NULL,
PRIMARY KEY (id),
KEY partno_index USING BTREE (partno_id),
KEY assembly_index USING BTREE (assembly_id),
KEY model_index USING BTREE (model_id),
FOREIGN KEY (partno_id) REFERENCES Inventory(id),
FOREIGN KEY (assembly_id) REFERENCES Assembly(id),
FOREIGN KEY (model_id) REFERENCES Model(id)
) ENGINE=InnoDB;


and insert the relation data
INSERT INTO `Schematic`
(`id`, `partno_id`, `assembly_id`, `model_id`)
VALUES
(1,6,5,1),
(2,8,5,1),
(3,1,3,1),
(4,5,3,1),
(5,8,5,7),
(6,6,5,7),
(7,4,7,3),
(8,9,5,3);


this is the best part of mysql is creating a view table only available in mysql 5.x :) for a faster and more simple query solution :) (I wonder why drupal don't use this ??)

CREATE OR REPLACE VIEW Catalog AS
SELECT
Inventory.id,
Inventory.partno,
Inventory.description,
Assembly.id AS assembly,
Model.id AS model
FROM
Assembly, Inventory, Model, Schematic
WHERE
Schematic.partno_id=Inventory.id
AND Schematic.model_id=Model.id
AND Schematic.assembly_id=Assembly.id;


and now create a sphinx.conf file in your sphinx directory, usually in /usr/local/etc/sphinx.conf, with the data like this

source catalog
{
type = mysql

sql_host = localhost # your database host
sql_user = reaper # your database
sql_pass = s3cr3t #your password
sql_db = body_parts # the db name
sql_sock = /var/run/mysqld/mysqld.sock #your connetcion sock
sql_port = 3306 #the sql port

# indexer query
# document_id MUST be the very first field
# document_id MUST be positive (non-zero, non-negative)
# document_id MUST fit into 32 bits
# document_id MUST be unique
# a view is needed to provide a simple select method :)
sql_query = \
SELECT \
id, partno, description, \
assembly, model \
FROM \
Catalog;
#every filter should defined here ! more description on filter later on :)
sql_group_column = assembly
sql_group_column = model

# document info query
# ONLY used by search utility to display document information
# MUST be able to fetch document info by its id, therefore
# MUST contain '$id' macro
#
# the id is selected here
sql_query_info = SELECT * FROM Inventory WHERE id=$id
}

#the indexing catalog
index catalog
{
source = catalog # the catalog source
path = /var/data/sphinx/catalog #the catalog path
morphology = stem_en

min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
enable_star = 1 # enabling star allow us to use LIKE statement in the database
}

#search daemon, like services in linux
searchd
{
port = 3312 # the daemon port of API's
log = /var/log/searchd/searchd.log
query_log = /var/log/searchd/query.log
pid_file = /var/log/searchd/searchd.pid
}


install the sphinx refer to http://sphinxsearch.com/docs/manual-0.9.8.html#installation

1. Extract everything from the distribution tarball (haven't you already?) and go to the sphinx subdirectory:

$ tar xzvf sphinx-0.9.7.tar.gz
$ cd sphinx

2. Run the configuration program:

$ ./configure

There's a number of options to configure. The complete listing may be obtained by using --help switch. The most important ones are:
* --prefix, which specifies where to install Sphinx;
* --with-mysql, which specifies where to look for MySQL include and library files, if auto-detection fails;
* --with-pgsql, which specifies where to look for PostgreSQL include and library files.

3. Build the binaries:

$ make

4. Install the binaries in the directory of your choice:

$ make install

Create the indexes

$ /usr/local/bin/indexer --config /usr/local/etc/sphinx.conf --all


Test the indexes

$ /usr/local/bin/search --config /usr/local/etc/sphinx.conf ENG*
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff

index 'catalog': query 'ENG ': returned 2 matches of 2 total in 0.000 sec

displaying matches:
1. document=8, weight=1, assembly=5, model=7
id=8
partno=ENG088
description=Cylinder head
price=55
2. document=9, weight=1, assembly=5, model=3
id=9
partno=ENG976
description=Large cylinder head
price=65

words:
1. 'eng': 2 documents, 2 hits

test the filters
$ /usr/local/bin/search --config /usr/local/etc/sphinx.conf --filter model 3 ENG
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff

index 'catalog': query 'ENG ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=9, weight=1, assembly=5, model=3
id=9
partno=ENG976
description=Large cylinder head
price=65

words:
1. 'eng': 2 documents, 2 hits



for PHP integration create search.php

include('sphinx-0.9.7/api/sphinxapi.php'); //the php files

$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 ); //server and port
$cl->SetMatchMode( SPH_MATCH_ANY );
$cl->SetFilter( 'model', array( 3 ) );//the filters

$result = $cl->Query( 'cylinder', 'catalog' ); //the query

if ( $result === false ) {
echo "Query failed: " . $cl->GetLastError() . ".\n";
}
else {
if ( $cl->GetLastWarning() ) {
echo "WARNING: " . $cl->GetLastWarning() . "
";
}

if ( ! empty($result["matches"]) ) {
foreach ( $result["matches"] as $doc => $docinfo ) {
echo "$doc\n";
}

print_r( $result );
}
}

exit;


Test! and the result is.....
$ sudo mkdir -p /var/log/searchd \\daemon files of log that has been declared before
$ sudo /usr/local/bin/searchd --config /usr/local/etc/sphinx.conf \\ running the daemons
$ php search.php

\\result :)
9
Array
(
[fields] => Array
(
[0] => partno
[1] => description
)

[attrs] => Array
(
[assembly] => 1
[model] => 1
)

[matches] => Array
(
[9] => Array
(
[weight] => 1
[attrs] => Array
(
[assembly] => 5
[model] => 3
)

)

)

[total] => 1
[total_found] => 1
[time] => 0.000
[words] => Array
(
[cylind] => Array
(
[docs] => 2
[hits] => 2
)

)
)


Happy Searching! i'd like to give you more details on it :), now it's time to learn lucene :D