My last two blogs were on understanding columnstore and creating columnstore indexes. In this post, I will touch on one last topic, when to use columnstore indexes. As noted in my prior posts these indexes are designed for large data warehouse workloads, not normal OLTP workload tables. As with any index design it is important to know your data and what you will be using the data for. Be sure to look to what types of queries you will be running against it, as well as how the data is loaded and maintained. Let’s look at a few questions you should ask yourself before deciding to implement a columnstore index. Just because your table consist of millions of rows doesn’t mean columnstore is the right way to go.
First you need to know the data you are designing the index for.
Is your table large enough to benefit? As we reviewed in my first post usually this means in the millions of records range as these rows are divided into groups of rows, called a rowgroup. A rowgroup has a minimum of 102,400 rows with a max of approximately 1 million rows. Each rowgroup is changed into column segments. Therefore, having a columnstore index on a table with under a 1 million rows does not make sense in that if the table is too small you don’t get the benefits of compression that comes with the column segments. A general recommendation is to use columnstore indexes on the fact tables in your data warehouse, and on very large dimension tables, containing over 5 million rows.
Is your data volatile, meaning changing frequently? Rule of thumb says you want tables that rarely have data modifications, more specifically, where less then 10% of the rows are modified. Having large numbers of deletes can cause fragmentation, which adversely affect compression rates, thus reducing the efficiency of the index. Updates in particular are expensive, as they are processed as deletes followed by inserts, which will adversely affect the performance of your loading process.
What datatypes are in your table? There are several data types that are not supported within a columnstore index. Data types like varchar(max), nvarchar(max), or varbinary(max) were not supported until SQL Server 2017, and typically aren’t the best fit for this type of workload. Additionally, if you are using uniqueidentifiers (GUIDs) you won’t be able to create your index as they are still not supported.
Next what are you doing in your queries.
Are you doing aggregations or performing analytics on the data or are you looking for specific values? The standard B-tree rowstore indexes are best for singleton lookups (single values) and are sometimes used in tandem with a columnstore index. If you’re using an index to cover a where clause that does not look up a range of values and is just filling predicates, then columnstore does not benefit. This especially true if you need to “cover” that query with included columns since columnstore does not allow included columns. However, columnstore is designed to quickly process aggregations especially on a grouped range of values. So, if you are performing aggregations or analytics usually columnstore can give you large performance gains as it can do full table scans to perform aggregations very fast.
Now there are times where you want to seek a specific value and perform aggregations (think average sale price in the last quarter for a specific product). In these cases you may benefit from a combination of a rowstore and columnstore index. Creating the columnstore to handle the grouping and aggressions and covering the index seek requirement with the rowstore. Adding these b-tree indexes will help your query performance, but they can dramatically impact the loading process—if your data loads are large enough it may be more efficient to drop the b-tree index, and rebuild it after loading the data into the columnstore.
Is this a data warehouse fact or dimension table? As we know a dimension table is usually used to find specific values, mostly lookups to match with an aggregated value from a fact table. If it’s a dimension table, typically you are going to use a b-tree based models, with the exception of very large dimensions. We tend to use columnstore on fact tables in the data warehouse as these tables are normally the source of aggregations.
Summary
Knowing your data is pivotal in deciding whether or not a columnstore index is best suited for your query needs. Don’t just jump to columnstore indexes now that you have an understanding of them, and your tables are large. Make sure to take the time to choose the right indexing option for your usage patterns, the data and the overall environment.