Clustered Column Store Load Pitfalls – Magic number 64 when using WHERE IN clause

The Problem

During our workload we have recognized a difference in performance for queries using IN clause. We are building query dynamically, so you can have a different length of list in IN clause. We work on the top of the Clustered Column Store Index table on SQL 2016 version.

Explanation

If you have a clause with just a couple of values such as this one:

WHERE [CCIDatatypeTestINT].[Column1] in (1, 2, 3, 4)

It translated to the following predicate in Scan operator

[AdventureWorksDW2017].[dbo].[CCIDatatypeTestINT].[Column1]=(1) OR [AdventureWorksDW2017].[dbo].[CCIDatatypeTestINT].[Column1]=(2) OR [AdventureWorksDW2017].[dbo].[CCIDatatypeTestINT].[Column1]=(3) OR [AdventureWorksDW2017].[dbo].[CCIDatatypeTestINT].[Column1]=(4) OR

And this is how exec plan looks:

However, if you exceed 64 items in IN clause, exec plan changes as the following picture shows.

As you can see, we now have constant scan and hash match. As you can see, Actual Number of Rows of the scan operator did not change! In case of integer optimizer, it is still able to do a pushdown of the list from hash into the scan.

The situation is however different for nvarchar datatypes. Let’s see how exec plan looks if we have 64 or fewer values in IN clause:

The predicate pushdown works well in this case and we have 6.5 mio records as an output of scan.

Now let’s check how it looks for 65 values in IN clause.

This is different exec plan than for INT data type. We can see all 12.2 mio records as an output of Scan operation. In this case, predicate pushdown of hash match results was not possible due to the probe values for s strings when hashing. Therefore, new operator – Filter is introduced in our plan to apply string filter. Let’s see some details of the filter operator:

Let’s compare some statistics for integers – 64 values vs 65 values

Statistics for Int with 64 items in IN condition

Statistics for Int with 65 items in IN condition

You can see there is no difference in IO activity. We have slightly different duration, but nothing dramatic.

Now let’s check the string results:

Statistics for nvarchar with 64 items in IN condition

Statistics for nvarchar with 65 items in IN condition

As you can see here the impact was huge. 8 seconds vs 46 seconds means we do better when we run query twice with shorter list of values in IN clause.

Summary

64 is the magic number for WHERE IN clause when working with CCI.

For list of values less than 64 predicate pushdown works well.
For more than 64 values we will have constant scan and hash.
1. Due to the way how hashing works for INT, engine can push down values from hash operator thanks to the Bitmap In-Row Optimization
2. For NVARCHAR due to the probe mechanism push down to scan is not possible

Therefore, the impact of going over 64 will be especially visible when working with strings in CCI. It will be of course less visible if your query is complex and scan takes only a small part of the duration.

4 Comments. Leave new

Guy Glantser
23. 12. 2018 7:20

Thanks Roman!

This is really interesting. I knew about the 64 magic number, but I didn’t know that the optimizer can’t perform a predicate pushdown when it comes to a string data type. So I tried to reproduce this behavior, but I couldn’t. I’m using SQL Server 2017 build 14.0.2002.14 with a data type of NVARCHAR(100). With 65 values, I got the expected constant scan and hash match join, but I also got the probe predicate on the columnstore scan operator. And performance was the same as with 64 values.

I wonder what the difference is. What is the string data type you were using?

Regards,
Guy

Reply
- Roman Lánský
  28. 12. 2018 21:25
  
  Hi Guy,
  I have version 14.0.3029.16 and using nvarchar(30).. I have 12 200 000 rows in the table and 100 000 unique values in Column1. My query is:
  drop table if exists #tmptest
  SELECT Column1 into #tmptest
  FROM [dbo].[CCIDatatypeTestNVCH]
  WHERE Column1 in ( ‘8RVXSPOU5X1ORZ1HO43W2BBTTFFUC8′ ,’ZZGX7TGVDTHH4WO3UFQDVK106HFKM’ ,’EPHHAQEDI1OSRQEQE1J5SU7KTYE8HB’,’Q265UJ47CRV6HTD3V5VD2ETSB3AD0X’, …
  
  Regards
  Roman
  
  Reply
Ondra
28. 12. 2018 9:08

Interesting. Have you tried something like this as well?

https://en.wikipedia.org/wiki/De_Morgan%27s_laws

[Column1] in (1, 2, 3, 4)

… ( Column1 = 1 Or Column1 = 2 Or Column1 = 3 Or Column1 = 4)
…
which is
not ( Column1 1 and Column1 2 and Column1 3 and Column1 4)

Reply
- Roman Lánský
  28. 12. 2018 21:55
  
  Hi Ondra,
  Thanks for the question. Just tested this and optimizer actually translates
  NOT( a != ‘X’ AND a != ‘Y’)
  back into
  (a = ‘X’ OR a = ‘Y’)
  and this behaves the same as IN
  
  Regards
  Roman
  
  Reply