Databases: slow count() statement with where clause on large tables
The typical task where count with where could be used in applications is showing data in html table with paging. So for instance to calculate number of pages you’ll use count(). Where clause used usually in case you apply some filters to find particular data.
The problem is that count(*) is very slow on large tables in all databases that uses MVCC method of concurrency control ( including MySql when InnoDB is used as storage engine). And it is slow in all database types when “where” statement used and conditions changes often ( so MySQL cache for count is not helpful for this case).
Lets take for instance Postgresql ( as actually that was the db we were working with when the problem arises):
explain analyze select count(id) from transaction_log where created>’06/1/2009′;
Aggregate (cost=67513.76..67513.77 rows=1 width=4) (actual time=8534.997..8534.998 rows=1 loops=1)
-> Bitmap Heap Scan on transaction_log (cost=11308.36..66005.36 rows=603360 width=4) (actual time=992.066..7904.009 rows=603763 loops=1)
Recheck Cond: (created > ‘2009-06-01 00:00:00’::timestamp without time zone)
-> Bitmap Index Scan on transaction_log_created_index (cost=0.00..11157.52 rows=603360 width=0) (actual time=976.543..976.543 rows=603763 loops=1)
Index Cond: (created > ‘2009-06-01 00:00:00’::timestamp without time zone)
Total runtime: 8535.177 ms
So we can see that 8 seconds is not what we are expecting (In this case one raw from transaction_log is about 1 kB of data, so 600 000 raws of data definitely not fit to the memory available in my case for database. In case if all of the data fit into memory, further requests would be faster, as all the data would be in memory).
What can one do about this? In our project we decided that we still want to have paging ( of course we are), but if somebody made selection with result of more than 1000 raws, he is not interested much to know how many pages he has, but he rather interested that he has more than 50 pages and he need to narrow the selection by more precise filter).
Now- how to implement this?
For postgresql, queries like next works very fast:
select id from transaction_log where created>’11/11/2009′ and created<’11/12/2009′ offset 1000 limit 1;
Total runtime: 15.118 ms
(important that created is indexed of course). If you have something as result, you know it is more than 1000 raws.
Next, on small amount of data in query, count() is working very fast. So in case you know that result is less than 1000 raws, you can use count to calculate precise amount. In case it is more than particular amount of raws, usually you don’t interested how much exactly it is.