There is a Problem
During the implementation of SSAS Tabular model, you usually hit some point you think about, managing some calculation as calculated column. If possible, you should preferably use calculated measure. However, there are situations the calculated column is the only option. In my case, I have implemented General Ledger type cube. As far as General Ledger has only one value column in it, you actually define real measures (calculations) trough data filtering. Such a filtering represents MxN relation as same amount can be calculated in different measures.
The content of my measures was defined by quite complicated filtering expressions such as:
- Using ranges of value of one column. (E.g. Values which contains value of account number between 0 and 023400)
- Many different filters in same column. (Values which contains value of account number between (0 and 023400) or (300000 and 350000) or …)
- Filtering in more columns
In following sections, I will describe how I managed to handle calculated columns in all layers of the reporting solution and what its impact was.
Calculated Measure (Member)
As a first try, I have implemented filters directly into DAX calculated member. It was rather long and not really clear DAX expression. Such an approach, having filter directly in DAX measure, resulted in:
- Poor performance
- Hard readability and maintainability of measure
Logically, another attempt was to move such a filter into calculated column, which would result in pre-calculated value “False or True” (True if row belongs to particular measure). In fact, I needed 14 calculated columns to handle all my measures. Even though calculated columns result is stored in the memory, data volume did not increase significantly as there were always only two values (True and False) in the column. Such a fact resulted in high compression factor for column store type engine. I then used the result of calculated column or more columns in DAX measure. Outcome of this solution was:
- Good performance of the queries
- Solid readability and maintainability
- Minimal increase of data volume in memory
I was fully satisfied with this solution until I implemented incremental processing into the cube. Data volume I was working with is 90mio records so full load of the cube took about one hour. Nevertheless, my target was to implement hourly updates into the cube. When I tried to implement it, we learned we are not able to get better processing time then 7 minutes, no matter how many rows are added, no matter whether via Partitions or via ProcessAdd method.
While analyzing workload on the server during the processing, I found out that most of the time SSAS engine was recalculating these calculated columns. Obviously, SSAS engine is not calculating values of those columns only in order to get new rows, but it is also recalculating all the existing rows. Together with ETL part, incremental processing would take more than 10 minutes, which was really not a result I would be happy with.
To sum this up, there is one downside of this solution which can be defined as following:
- Poor performance in incremental data loading phase
Therefore I decided to move calculated columns to lower layer. There are basically two options: SQL or ETL. Firstly, I tried SQL Server but our HW was not performing very well and our tests discovered that there would be a problem during initial load – when all 90mio records are processed in only one step.
Therefore, I decided to test to put definition of calculated column directly into the SSIS package which was extracting the data from data source to our DataMart. In the end, it was a right place.
Outcome of my implementation can be defined as followed:
- Minimal impact to ETL duration incremental and initial.
- Minimal impact to data volume on SQL layer as we stored fact table as clustered column store index. Same goes for SSAS layer.
- Great improvement of incremental SSAS processing. (Duration around 1 min 30 sec.)
As for the answer to my core question of the initial article I would say “it depends”. However, as for my concrete solution, ETL layer represents the best solution bringing most advantages. To be clear, there is one disadvantage of this approach – If calculation is redefined or new calculated column is added, you have to apply this to all, already loaded rows. So you can reload all data or update data on SQL layer in batches. Therefore, backward changes can be challenging.