20090527

http://www.fastechws.com/tricks/sql/full-text-search-boolean.php

Full Text Searching with MySQL

When you want to let visitors search through a web-site's data by typing a search string, possibly containing multiple words, what's the best way to do that?

This is a question that web developers have been asking for a long time. There's many views on this subject, and many opinions. I generally want the best searching ability I can get without spending aeons of time (and cost) developing it.

Simple Searches

You may be thinking this is no big deal - your data is in a SQL database, so why not just do word-matching with "LIKE":

SELECT * FROM t1
WHERE description LIKE '%word%';

That is fine when you only have one word to search for, and one field to search in. But what if you have three fields (name, title, description) and the user types five keywords, such as "where is the Saharan Desert"? By eye, you can see that the really useful search keywords are "Saharan" and "Desert" - but how can your code possibly know that? If you require all 5 words to match, you won't return any matches at all.

Even if you knew that there are only 2 valid keywords, you can't just do a single match like this:

SELECT * FROM t1
WHERE description LIKE '%saharan%desert%';

That won't match everything that should match, because maybe some of the descriptions have the words in the other order. For two keywords, you really have to do two matches:

SELECT * FROM t1
WHERE description LIKE '%saharan%desert%'
OR description LIKE '%desert%saharan%';

You can imagine how this degrades - every combination of 3 keywords requires 6 different LIKE's; 4 keywords requires 24 combinations - that's factorial growth; it starts to get really big, and really slow!

Database Solution - Full Text Search

A really good solution to this and other problems related to search are solved by Full Text Search. Let the Database automatically maintain a special index of some sort that helps it find your matches quickly! You just provide the keywords to search for and which table fields should be searched. Simple.

The basic form is called "Natural Language Search", and has a number of cool features, including the ability to have a "stop list" - a list of words that are so popular, we ignore them because it would return too many records (too many mis-matches). Words like "is", "the", and "and" are way too popular. MySQL's built-in stop list is a text file that the administrator (root user) can change if needed.

In fact, with MySQL, when it builds the FT index for the specified fields in a table, it figures out other stop words - any word used in 50% or more of the records is automatically excluded from matching - which is good, you generally don't want half your table's records coming back as positive matches.

FTS queries look like this in MySQL:

SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('what is the saharan desert');

Setting up Full Text Search

For that query to work, your table must already have a FTS index created, which must have the exact fields that you're matching against (title and body, in the above example).

To create that index you would do the following, one time only:

ALTER TABLE articles
ADD FULLTEXT (title,body);

If you ever need to match against a different set of fields - say, just title, or just body, you'll need to create a whole other FULLTEXT index just for that. You can have many indexes on any table; it just takes more space (disk space, and sometimes memory).

Fancier Method - Boolean Matching

You know those advanced features of search engines like Google, where you can specify negative keywords with "-" sign, as in:

chili -bar -"new york"

because you want info on chili the food, not the bar and grill or the city in New York.

That, and a lot more, is built in to MySQL FTS - using Boolean Expressions. Boolean expressions include ways to vary the priority value based on certain keywords being there, or not being there. You can do partial-word matches (like "*" globbing in Unix, sort of). It honors double-quotes around "multiple words" when you want to find those words next to each other in the order shown. It can even let you group pieces of the boolean expression (with parentheses). Read the MySQL docs (see References below) to learn the syntax.

Notes and Limits with Full Text Search

Make sure your test data has at least 3 records. Think about it: 2 records won't work because ALL the words are found on 50% of the records, so everything's stoplisted!

If you want really advanced features such as configurable weight values per field, you will have to write more code yourself. You could put separate FT indexes on each field of the table, then build a query that has multiplier values (weights) for each field's results. This would probably result in more work for the database engine, but give you tighter results.

Another thing to remember is that (at least with MySQL), the Natural Language matching does not necessarily tell "how well" the match occurred; you only know that it matched or it didn't match. To get a nicer priority value back, use Boolean matching. You get a floating point value between 0 and 1, I believe.

References - MySQL Manual Online

MySQL Natural Language Matching

MySQL Boolean Expression Matching

Those links are for MySQL 5.0, but you can click on the other versions to see the related pages there.

Books

MySQL Cookbook from O'Reilly and Associates
Chapter 5 has a lot of info on Full Text Searches.

Conclusion

FTS has been around for a while - it should work with MySQL versions 4, 5, and the upcoming version 6. FTS exists in many other databases too, not just MySQL. However the syntax may vary with other database systems.

Overall, Full Text Search is very powerful and useful, and very easy to configure and use. It certainly saves a lot of time from having to do individual keyword matches yourself and try to determine which matches are "better" than which others.

No comments:

Post a Comment