Recently I reviewed filtered indexes, this time let’s look at columnstore indexes. These indexes are very useful for data warehouse workloads and large tables. They can improve query performance by a factor of 10 in some cases, so knowing and understanding how they work is important if you work in an environment with larger scaled data.
Now, I admit when these first were introduced in SQL Server 2012 I found them very intimidating (additionally, you couldn’t update them directly). For me, anytime you say columnstore, my mind tends to set off alarms saying wait stay away, this is too complicated. So, in this post I am going to try and simplify the feature for you.
To do that first you need to understand some terminology and the difference between a columnstore index and a row store index (the normal kind we all use). Let’s start with the terminology.
Columnstore is simply the way the data is stored in the index. Instead of our normal Rowstore or b-tree indexes where the data is logically and physically organized and stored as a table with rows and column, the data is in columnstore indexes are physically stored in columns and logically organized in rows and columns. It is this difference in architecture that gives the columnstore index a very high level of compression along with reducing your storage footprint and providing massive improvements in read performance.
Now the way the index works is by slicing the data into compressible segments. It takes a group of rows, a minimum of 102,400 rows with a max of 1 million rows, called a rowgroup and then takes that group of rows and changes them into Column segments. It’s these segments that are the basic unit of storage for a columnstore index as shown below. This to me is a little tricky to understand without a picture.
Imagine this is a table with 2.1 million rows and 6 columns. Which means we now have two rowgroups of 1 million rows each and a reminder of 100,00 rows, which is called a deltagroup. Since each rowgroup holds a minimum of 102,400 rows the delta rowgroup is used to store all index records remaining until it creates another rowgroup. You can have multiple delta rowgroups awaiting being moved to the columnstore. Multiple delta groups are stored in the delta store and it is actually a B-tree index used in addition to the columnstore. Ideally, your index will have rowgroups as close to 1 million rows as possible to reduce the overhead of scanning operations.
Now to complicate things just one step further there is a process that runs to move delta rowgroups from the delta store to the columnstore index called a tuple-mover process. This process checks for closed groups, meaning a group that has the maximum of 1 million records and is ready to be compressed and added to the index. As illustrated in the picture the columnstore index now has 2 rowgroups that it will then divide into column segments for every column in a table. This creates 6 pillars of 1 million rows per rowgroup for a total of 12 column segments. Make sense? It is these column segments that are compressed individually for storage on disk. The engine takes these pillars and uses them for very highly paralleled scans of the data. You can also force the tuple-mover process by doing a reorg on your columnstore index.
To facilitate faster data access, only the Min and Max values for the row group are stored on the page header. In addition, query processing, as it relates to column store uses Batch mode allowing the engine to process multiple rows at one time. This also makes the engine able to process rows extremely fast in some cases, giving two to four times the performance of a single query process. For example, if you are doing an aggregation these happen very quickly as only the row being aggregated is read into memory and using the row groups the engine can batch process the groups of 1 million rows. In SQL Server 2019, batch mode is going to be introduced to some row store indexes and execution plans.
Now hopefully you have a basic understanding of what a columnstore index is. In my next post we will look at how to create one, load data, and what limitations using columnstore indexes have. We will also get to see the index in action compared to a rowstore index. Stay tuned.
You can continue reading part 2&3 of this blog series below.
Pingback: Simplifying Columnstore – Curated SQL
Pingback: Understanding Columnstore Indexes in SQL Server Part 2 – A Shot of SQLEspresso
Pingback: Understanding Columnstore Indexes in SQL Server Part 3 – A Shot of SQLEspresso