Counting in SQL

COUNT() is a function that returns the number of items found in a group. Syntax-wise, it is very easy, but it can be tricky to get the best performance out of it.

Basic implementation

The simplest way to get the count is something like this:

SELECT COUNT(*) FROM table_name WHERE condition

But there is always a question: which column should I use to count? Should it be the primary key, any column, COUNT(*), COUNT(1), or does it really matter?

To test this, let’s create the following table

This table has the primary key id, and for fun, I inserted 250k records. Let’s run the following three count statements to check if there is a difference in performance:

-- Query 2

select count(1) from todo
WHERE due_date between '2024-03-11 03:25:00' and '2024-03-11 03:26:00'

select count(id) from todo
WHERE due_date between '2024-03-11 03:25:00' and '2024-03-11 03:26:00'

select count(*) from todo
WHERE due_date between '2024-03-11 03:25:00' and '2024-03-11 03:26:00'

From the screenshot above, there actually is no difference between count(1), count(id), or count(*). However, there is a problem here: all of these queries’ execution plans are performing an index scan over the clustered index. This is bad because an index scan means the query touches every row, which is fine for a smaller table, but it will become a problem as the number of rows in a table increases.

Improvement

Now, let’s try creating an index, say an index over the column due_date, without including any other columns, and run the same queries again

CREATE NONCLUSTERED INDEX [IX_todo_duedate] ON [dbo].[todo] ([due_date] ASC)

After adding the new index, the result is much better, the query execution plan is now performing an index seek instead an index scan

Query hint

Now, let’s run another count query over a column that does not have an index including this column, something like:

-- Query 3

select count(1) [created_date] from todo 
where created_date BETWEEN '2024-03-10 02:51:00' and '2024-03-10 02:52:00'

select count(id) [created_date] from todo
where created_date BETWEEN '2024-03-10 02:51:00' and '2024-03-10 02:52:00'


select count(*) [created_date] from todo 
where created_date BETWEEN '2024-03-10 02:51:00' and '2024-03-10 02:52:00'

As expected, the query switched back to using the clustered index since there is no non-index covering that column. What’s even better, the engine recommended creating an index for that column to avoid an index scan.

Now, let’s force the use of a specific index by using the query hint WITH(INDEX(IX_todo_duedate)), so the query will be something like:

-- Query 4

select count(1) [created_date] from todo WITH(INDEX(IX_todo_duedate))
where created_date BETWEEN '2024-03-10 02:51:00' and '2024-03-10 02:52:00'

select count(id) [created_date] from todo WITH(INDEX(IX_todo_duedate))
where created_date BETWEEN '2024-03-10 02:51:00' and '2024-03-10 02:52:00'


select count(*) [created_date] from todo WITH(INDEX(IX_todo_duedate))
where created_date BETWEEN '2024-03-10 02:51:00' and '2024-03-10 02:52:00'

The execution plan does look peculiar, doesn’t it? What happened here? In order to aggregate the values of created_date, the query has to go through every single row, perform another key lookup for the value of that column, and then aggregate this value.

To avoid this behavior (either primary index scan or key lookup), we can create a covering index.

A covering index is simply a non-clustered index that has all the columns of the table. In our sample table, it should be something like:

CREATE NONCLUSTERED INDEX [IX_todo_coveringindex] ON [dbo].[todo]
(
	[id] ASC,
	[name] ASC,
	[due_date] ASC,
	[created_by] ASC,
	[created_date] ASC
)

Now let’s run query 3 again.

From the execution plan, the query did use the index IX_todo_coveringindex as expected, but why is the engine still recommending creating a new index for created_date?

Although covering indexes are useful, their downside is significant since they hold all row data. So ideally, we need an index that targets the column we need to query.

Conclusion

Firstly, there’s no difference between using count(1), count(id), or count(*), so don’t stress about it in your next code review.

Secondly, SQL Server’s engine is smart enough to use the best index for performance, so as long as you have a proper index that covers the columns you are using, you should be fine.

Finally, database design is just like any other software design; there is no silver bullet, and there is no right or wrong answer. It all depends on the use case of your application. Covering indexes can save you a lot of hassle, but they’re not always the best for performance. Additionally, the more indexes you add for a read and write database, the more resources you will need to rebuild the index’s B+ trees, and the slower insert statements will become, since all the indexes need to be updated.

References

2024-05-10

https://blog.waelouf.com/post/blog/sql-count/ Wael Mansour