Archive

Posts Tagged ‘mysql’

subqueries make life easy

September 13th, 2009 1 comment

often times while working with databases, you find the need to do some simple tasks (adding a column and populating it with a value from another table, etc). rather than write a script to do this, use mysql’s native subquery (and temporary table) functionality – it makes life much easier.

consider the case where you have two tables:

mysql> describe colors;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  |     | NULL    |       |
| color | varchar(10) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

and

mysql> describe color_mapping;
+------------+---------+------+-----+---------+-------+
| Field      | Type    | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+-------+
| picture_id | int(11) | YES  |     | NULL    |       |
| color_id   | int(11) | YES  |     | NULL    |       |
+------------+---------+------+-----+---------+-------+
2 rows in set (0.00 sec)

let’s say you want to update the colors table to add a frequency column such that you know how often a given color is used. it turns out this is really easy using mysql:

alter table colors add frequency int;
update colors set frequency = (select count(*) from color_mapping \
   where color_mapping.color_id = colors.id);

and that’s it. also useful are temporary tables:

create temporary table color_frequencies select color_id, count(*) as cnt \
   from color_mapping group by color_id;

if you then attempt to desc color_frequencies, you’ll see a table with two columns – a color_id and a cnt column.

hopefully this will save some people some efforts writing scripts next time some simple database updates are needed :)

Categories: code Tags:

faster and better text search

January 25th, 2009 4 comments

important update (august 5th, 2009)so i realized that some of the results here (specifically, the java lucene ones) were incorrect. the reason is because as is mentioned on the lucene wiki, the first search has to initialize the caches. as thus, the results aren’t accurate. this seems to be very true. for example, if i run a test query, discard the results, and then run the real query, results for the three classes of queries are now 17, 8, and 4 ms respectively, which is very comparable (if not sometimes better) than that of sphinx. i will probably need to re-run this benchmark to do a better job of giving the backend systems a level playing field to test on.

update (jan 26th, 2009)as mentioned in the comments, the mysql results aren’t very accurate either because i was probably not properly searching against the index.

i have a set of ~6000 quotes (verses, if you will), along with a multiple set of translations for each of those verses. before, i was searching across these verses using mysql. while this seemed to work, it was very limiting, and i began looking into alternatives.

so i did a little bit of research and tried out lucene and sphinx. for lucene, i specifically used the zend version (i’ll discuss standard lucene (java) towards the end of this post.)

i’ll show the results first, and then explain them after.

the graph above shows a quick overview of the tests run. a set of 3 different queries were run against 4 different backends. the numbers were generated using apache bench (ab) using 100 requests with a concurrency of 1.

backends:
lucene: this was the first implementation. in it, each verse was a “document.” each translation was a property of the document. the total number of documents was thus equivalent to the number of verses.

sphinx: this was the second sphinx implementation (see sphinx alt below for the first implementation). this implementation was just done to make the data model similar to that of lucene, which is exactly what it is. although this ended up being the fastest (by < 5ms in the tests run), i prefer the sphinx alt implementation because it’s closest to that of the database schema.

sphinx alt: although it is named “sphinx alt” in the graph above, this is really the initial sphinx implementation. in this model, a translation of one verse was a document. consequently, the total number of documents was (number of translations) * (number of verses). i sort of like this one most (even though it’s not the fastest) because it is the closest to the current database schema.

mysql: this is sort of the baseline, and, to be honest, it’s not fair either. the query used here is something in the nature of getting the row where the text is like ‘%word1%word2%’; the number of results returned by this are far fewer (and less valuable) than those returned by either lucene or sphinx. one would need to do “where text like ‘%word1%word2%’ or text like ‘%word2%word1%’” to get a more accurate estimate, but for baseline purposes, i simply ran the first query. note that the query cache size is 0 (ie query cache is on but effectively off for this set of tests). note that the text field has a fulltext index on it.

results:
sphinx wins hands down. however, although it seems that lucene comes in last, this is not really accurate because of the type of mysql query being used. from my limited tests (using a more complicated sql query), lucene and mysql have comparable performance, but lucene of course has the added benefit of more advanced query options, etc.

sphinx times were 8.030 ms, 7.542 ms, 8.324 ms, sphinx alt times were 8.304 ms, 7.898 ms, 11.131 ms, lucene times were 285.759 ms, 116.222 ms, 224.381 ms, and mysql times were 106.254 ms, 106.561 ms, 108.747 ms for queries 1, 2, and 3, respectively. query 1 contained three words (+term1 +term2 +term3), query 2 contained one word (+term4), and query 3 contained two words (+term5 +term6).

additional details:
plain vanilla java lucene is usually faster than zend’s lucene implementation. the largest difference can be noted in indexing times (a few seconds for java versus 15+ minutes in php). if i had to index frequently, i’d use java lucene or sphinx because they are insanely faster.

for example, the first query takes 179.84 ms on average in java (over 100 queries) versus about 272.61 ms on average for php. the second query takes 173.22 ms on average in java versus about 103.30 ms in php. the third query takes 178.98 ms on average in java versus about 214.78 ms in php.

php only won at the second query, which also happens to be the simplest query. two things to note – first, the times here don’t include the jvm or php interpreter start times. these are times reported by taking the time before and after the search call and displaying them. second, unlike the first test, this was all run from the command line and not directly via web (didn’t want to bother setting up tomcat or solr, etc).

just for fun, i implemented the “sphinx alt” data scheme in java lucene as well and re-ran the 3 tests 100 times each. the results were 178.54 ms, 160.20 ms, and 172.72 ms – very much comparable to the results with the alternate schema.

the summary of this very long post in 2 words: sphinx rocks.

Categories: code, technology Tags: , ,