As with more and more developers, i've always faced a problem with MySQL's FULLTEXT search. As soon as you get over 100-200Mb table and try and do a BOOLEAN search on a keyword field, performance seems to fall out the window.
For Example:
SELECT id FROM listing WHERE MATCH ( keywords ) AGAINST ( 'plumbers' IN BOOLEAN MODE )
Rows: 2,913 total, Query took 0.3789 sec
This was run on a 500k row table, containing 180Mb data.
Sphinx Search makes FULLTEXT searching much simpler and quicker.
Once you download it and get it installed (I have been running it in linux environments: Cent OS and ubuntu), you modify the main config file to setup your desired indexes. You then start the search daemon and away you go.
How it works:
You setup your new index to retrieve data from either: MySQL, PostgreSQL, HTML files, plain text files, XML feeds, mailboxes, Etc. Although MySQL and PostgreSQL are the easiest so will use for an example. You enter your database details including what database to use, then enter a SQL statement that will retrieve the data you want indexed. Sphinx does NOT store all the information about the rows in the table. ONLY id numbers and relevant search data.
For Example:
SELECT ID,keywords,lat,long,town,county FROM __TABLE__
Here we are storing the ID (which will be returned on a match against a search query), keywords to be searched upon, and 4 attributes (all integers - text attributes are still too come) on which we can filter and group data during a search.
Once your happy with your index config, you run the "indexer" and your indices will be built.
You will be able to then query your index from either commandline (via "search" program - most useful for debugging), or from one of there more useful apis. Currently there are api's for: PHP, Python, Ruby, Java and C++.
Using these api's you can search and retrieve row ID numbers extremely quickly:
query 'plumbers ': returned 1000 matches of 2913 total in 0.001 sec
As you can see a basic search pulls back excatly the same amount of results as MySQL, but much quicker. Once you have your ID's you can then query your database (using IN () ) too keep your rows in correct order, too pull back any other data you need.
The major bonuses I like about Sphinx is the ability to apply filters, grouping and weight results by keywords or by which index they are returned on (yes, you can use more then one index at a time - they've thought off everything).
Another major feature I like is the ability to order results by distance using radial latitude and longitude figures (I will be posting more information about this in the future) but again it is much easier and quicker then running:
SELECT `id`,ROUND( SQRT( POW((69.1 * ( 50.734501 - `listing`.`lat`)), 2) + POW((53 * ( -1.98377 - `listing`.`long`)), 2)), 1) AS `distance` FROM `listing` WHERE `listing`.`lat`< (50.734501+0.12) AND `listing`.`lat` > ( 50.734501-0.12) AND `listing`.`long` < (-1.98377+0.12) AND `listing`.`long` > (-1.98377-0.12) AND `listing`.`keywords` LIKE '%plumbers%' ORDER BY `distance`
Which is slow at the best of times.
I think sphinx is great and with some new things in the pipe line, it can only get better.
posted by Mark Willis

