BayesianAverage CakePHP plugin Performance Tests

The BayesianAverage CakePHP plugin was written with performance as a consideration and documented to help developers avoid performance problems. However, when the values of C and m are left for the behavior to calculate, two queries must be run that present a performance bottleneck.

I decided to test the performance of these queries for medium-large data sets, so I’d know what to expect using the plugin in a production environment.

Environment & Method

These tests were all done locally on K9, my MacBook Pro: 2.4 GHz Core 2 Duo, 2 GB RAM, running MySQL Server 5.1.37.

This is the basic structure of the items table:


CREATE TABLE `items` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `ratings_count` int(7) unsigned NOT NULL,
  `mean_rating` decimal(5,4) unsigned NOT NULL,
  `bayesian_rating` decimal(5,4) unsigned NOT NULL
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

Neither of the “problem” queries examine the actual ratings table, so I won’t go into its structure except to say that about 125 million ratings were spread across 50,000 items in my largest test data set.

I decided to run my tests twice: against the basic table above (average row size: 25 bytes), and against a wider (but otherwise identical) table filled with random strings (average row size: 1.4 kilobytes). You can download the test template for the wider table, complete with data, from my drop.io.

I ran eight tests on both tables (25 byte rows and 1.4 k rows) for both “problem” queries. I ran each test 10 times and recorded the average performance as well as the best and worst performance. For the rest of this post, I’ll only discuss the average performance, but you can see the the other results in this Google Docs Spreadsheet.

Calculating C & m

The first “problem” query runs every half hour when someone adds or modifies a rating, and recalculates the values to cache for C and m if they aren’t set explicitly in the model. If the recalculated values differ from the cache by more than 10%, the cache is replaced, otherwise the behavior keeps the cache and waits another half hour (and until someone adds or changes another vote) and tries again. This can be useful if you have a small, volatile data set (e.g., a review site that’s only just launched), but as your data set grows, these values will become less volatile and you should consider making C and m explicit constants within your model to avoid performance problems.

SELECT AVG(`Item`.`ratings_count`) C, AVG(`Item`.`mean_rating`) m FROM `items` AS `Item` WHERE `Item`.`ratings_count` > 0 LIMIT 1;

Here’s how the query performed with my test data:

No of Items Time w/25 byte rows Time w/1.4 k rows
50k 41.8 ms 177.8 ms
30k 29.92 ms 122.4 ms
20k 26.07 ms 96.72 ms
10k 18.97 ms 69.38 ms
5k 13.58 ms 55.71 ms
2.5k 13.77 ms 52.26 ms
1k 12.43 ms 45.14 ms
500 11.69 ms 43.1 ms

As an experiment, I tried creating an index on the relevant fields (KEY `ratings_count` (`ratings_count`,`mean_rating`)), which reduced query time by more than 100 ms for the wider table at 50,000 rows.

Recalculating Items

The second query that could present an even more serious performance bottleneck updates the bayesian_rating for all items when the cache values for C and m are updated or created. Again, this query never runs if C and m are set manually, and only runs after the cache is updated or created. As your data set grows, C and m will become less volatile, meaning recalculated values will differ from the cache by less than 10%, and this query is unlikely to run.

UPDATE `items` AS `Item` SET `Item`.`bayesian_rating` = (`Item`.`ratings_count` / (`Item`.`ratings_count` + 2518.8187)) * `Item`.`mean_rating` + (2518.8187 / (`Item`.`ratings_count` + 2518.8187)) * 2.99895026 WHERE `Item`.`ratings_count` > 0;

Here’s how the query performed with my test data:

No of Items Time w/25 byte rows Time w/1.4 k rows
50k 220.7 ms 966.6 ms
30k 138 ms 680.7 ms
20k 94.38 ms 539 ms
10k 52.64 ms 407.3 ms
5k 33.81 ms 334.7 ms
2.5k 22.92 ms 297.7 ms
1k 15.56 ms 276.8 ms
500 12.51 ms 269 ms

Analysis & Recommendations

In a case where all bayesian ratings have to be recalculated, both the select and update queries must be run in the same request. With the longest, widest data set I tested (50,000 items; 1.4 k per item), the two queries together took just over a second to complete in a MyISAM table. I think that’s unacceptable given the environment; in fact, all the results for the 1.4 k/row table made me cringe a little. That said, the second query would run extremely rarely — probably never — for a long items table. The longer the list of voted items, the less likely it is you’ll have to recalculate all Bayesian averages.

These are my recommendations:

  1. Once you’ve set C and m within your model, these performance concerns no longer apply. Set them when you can. If your results look wrong after setting them, you can remove the setting and the data will fix itself next time a vote is added. C is the average number of votes you expect per item, and m is the average rating an item is given (the median rating works in a pinch; e.g., “3” for a 5-star rating).
  2. Create a single index for the two fields ratings_count and mean_rating. I found this improved the calculation of C and m dramatically.
  3. If you have a wide “items” table and want to use BayesianAverageable in “magic” mode, you may want to do some normalization for the sake of performance.
  4. If you have a wide “items” table and normalization isn’t an option, you may want to separate the ratings_count, mean_rating, and bayesian_rating fields into an item_averages table and add 'itemModel' => 'ItemAverage' to BayesianAverageable’s settings from your model.
  5. Make sure caching is not disabled on your production environment. (The line Configure::write('Cache.disable', true); should be commented out in /app/config/core.php)
  6. Set up a dedicated cache configuration to make sure the cache for C and m isn’t deleted.

A how-to for that last recommendation: CakePHP’s default cache configuration is set to expire after an hour. To add another cache configuration, open /app/config/core.php and add the following:


Cache::config('annual', array(
  'engine' => 'File',
  'duration'=> '+1 year',
));

Tell BayesianAverageable to use this Cache configuration instead of the default through the ‘cache’ setting:


var $actsAs = array('BayesianAverage.BayesianAverageable' => array(
  'fields' => array('itemId' => 'item_id'),
  'cache' => array(       // <=
    'config' => 'annual', // <= lookit!
  ),                      // <=
));

This should stop the cache file containing C and m from expiring, further decreasing the chances you’ll have to recalculate them.

For my part, I’m going to modify the way the “items” table is updated so that the query is either given a LOW_PRIORITY or run in batches so that select operations can continue. I’m also going to make it easier to delegate the “problem” queries to a shell task so they don’t delay a user’s request. Finally, I’ll update the plugin’s documentation to reflect the above recommendations.