Optimizing HR Head Count DAX Measure with Power BI

The head count for human resources is a rather common request in Power BI-related forums, and the usual suggestion is to implement the Events in progress DAX Pattern from SQLBI. I built a sample model to experiment with this pattern.

The problem

The Employee table contains the hire date and the leave date. The leave date is blank if the employee is still employed. The request is to count the number of employees (heads) during the selected period of time.

The direct solution

The direct approach to implementing the head count measure is to check whether the selected period of time intersects the interval between the hire and leave dates. This can easily be implemented with DAX, as shown in the first section of the pattern, but performance is not very good, because the intersection is computed entirely in DAX code and no physical relationship is used.

The snapshot table approach

This approach consists of creating a snapshot table by expanding the interval between the hire and leave dates into one row per day, and then creating a physical relationship with the Date table on the Date column. The measure is then simply a distinct count of the employee IDs.

To my surprise, the performance of this measure in the sample model is worse than the previous one. The reason is that in this scenario the DISTINCTCOUNT needs to iterate over a very large number of rows to count the distinct IDs.

To avoid the DISTINCTCOUNT it’s possible to compute the head count at the end of the selected period instead of over the whole time selection. Because each employee can appear just once per day, it’s possible to use COUNTROWS on the last day to compute the number of employees at the end of the selected period.

The performance of this measure is very good.

When the selected time period is at the day level, the two measures return the same result, therefore at the day level we can always use the end of period version of the measure.

When the selected time period is at a higher granularity level, the results of the two measures are slightly different, therefore a decision must be made: is it acceptable to use the end of period version of the measure, or is it mandatory to compute the head count using the whole selected time period?

The solution for the latter in the sample model is to check the granularity level and choose the measure to be used dynamically: the end of period one at the day level and the direct one at the month or year level.

Conclusion

As usual, each scenario is different and the possible solutions must be tested.

This HR scenario consists of events that last for a very long time. This means that the snapshot table approach of the Events in progress DAX Pattern needs to create a very large snapshot table, since each row in the Employee table is multiplied by the number of days of employment.

A performance gain can be obtained using the End Of Period version of the measure with the snapshot table.

By dynamically checking the granularity level it’s possible to choose the measure version to be used.

The sample model builds the snapshot table using DAX. A different approach, such as using Power Query or SQL, could work better in real-world scenarios.

Should we always use a snapshot table? Probably not, since the performance advantage might be outweighed by the cost of building a very large snapshot table, but it depends on the actual data of the real scenario and on the reports to be implemented.

The sample model

The Employee table contains about 300K employees over three years, from 2008 to 2010.

To simulate the Employee table I used the Users table from the Small Stack Overflow Database provided by Brent Ozar on his awesome blog.

The query uses the LastAccessDate as LeaveDate, setting it to NULL when it falls after 2010.

WITH CTE AS (
SELECT [AccountId]
      ,[DisplayName]
      ,CONVERT(DATE, [CreationDate]) AS HireDate
      ,CONVERT(DATE, [LastAccessDate]) AS LeaveDate
  FROM [dbo].[Users]
  WHERE Id <> -1 AND Id <> 11
)
SELECT 
		AccountId AS ID
	,	DisplayName AS Name
	,	HireDate
	,	CASE WHEN LeaveDate > '20101231' THEN NULL ELSE LeaveDate END AS LeaveDate
FROM CTE

The Date table is a very simple DAX calculated table

Date = 
ADDCOLUMNS(
    CALENDAR(DATE(2008, 1, 1), DATE(2010,12,31)),
    "Year", YEAR( [Date] ),
    "Year Month", EOMONTH([Date], 0), -- formatted as yyyy-mm
    "Month", MONTH( [Date] )
)


The generated Employee table contains about 100K rows with a non-blank LeaveDate and 200K rows with a blank LeaveDate.


The HireDate values are distributed rather smoothly from 2008 to 2010, as can be seen from the Hires Running Total.

The resulting head count grows up to 200K, as expected.

The measure [Head Count] uses the Employee table directly. It filters the Employee rows by checking whether the selected period overlaps the time interval between HireDate and LeaveDate. The REMOVEFILTERS( 'Date' ) removes the effect of any existing relationship between the Date and Employee tables.

Head Count = 
VAR FromDate = MIN('Date'[Date])
VAR ToDate = MAX('Date'[Date])
RETURN
CALCULATE (
    COUNTROWS( Employee ),
    Employee[HireDate] <= ToDate,
    Employee[LeaveDate] >= FromDate || ISBLANK(Employee[LeaveDate]),
    REMOVEFILTERS('Date')
)

The performance of this measure, used to build the graph over the whole three years at the day level of granularity, can be measured using DAX Studio.

This query runs in about 4.5 seconds.
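The query isn’t reproduced here, but a minimal sketch of this kind of benchmark query, returning the measure for each date, looks like the following (the actual query in the sample files may differ):

EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Date],
    "Head Count", [Head Count]
)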

The snapshot table approach

To try to improve performance, I implemented the snapshot table as a DAX calculated table

Employee Snapshot = 
VAR MaxDate = MAX( 'Date'[Date] )
RETURN
SELECTCOLUMNS( 
    GENERATE(
        Employee,
        VAR HireDate = 'Employee'[HireDate]
        VAR LeaveDate = IF ( ISBLANK( 'Employee'[LeaveDate] ), MaxDate, Employee[LeaveDate] )
        RETURN
        DATESBETWEEN(
            'Date'[Date],
            HireDate,
            LeaveDate
        )
    ),
    "ID", Employee[ID],
    "Date", 'Date'[Date]
)

This code expands each row of the Employee table, creating one row per day from the HireDate to the LeaveDate, or to the last date of the Date table when LeaveDate is blank.

The resulting Employee Snapshot table contains about 68M rows

Then the physical relationship on the Date column can be created.

The Head Count measure can be implemented by counting the distinct IDs in the current filter context

Head Count Snapshot ALL = DISTINCTCOUNT( 'Employee Snapshot'[ID] )

But the performance is worse than with the previous measure

The total execution time is now almost 24 seconds! This happens because of the DISTINCTCOUNT.

To get rid of the DISTINCTCOUNT it’s possible to use the faster COUNTROWS to count the rows on the last day of the selected period. This works because each employee ID in the snapshot table appears only once per day. At the day level of granularity the result is the same as [Head Count] and [Head Count Snapshot ALL], since all the measures operate on the same one-day time period.

Head Count Snapshot EOP = 
CALCULATE (
    COUNTROWS ( 'Employee Snapshot' ),
    LASTDATE ( 'Date'[Date] )
)

The performance of this measure is way better than that of the previous ones.

The execution time is now just 65 milliseconds!

Let’s see what happens at the month level of granularity.

As we can see, at the month level of granularity the [Head Count] measure performance is good, since it takes just about 160 milliseconds.

The difference between [Head Count] and [Head Count Snapshot EOP] when used at the month level of granularity can be seen in the following graph.
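The same comparison can also be reproduced with a query; this is a minimal sketch that evaluates both measures per Year Month (the actual query used for the graph may differ):

EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Year Month],
    "Head Count", [Head Count],
    "Head Count Snapshot EOP", [Head Count Snapshot EOP]
)
ORDER BY 'Date'[Year Month]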

To avoid this difference without losing the performance gain obtained at the day level, it’s possible to check the granularity level dynamically and select the measure to use.

Head Count Hierarchy = 
IF (
    ISINSCOPE ( 'Date'[Date] ),
    [Head Count Snapshot EOP],
    [Head Count]
)

This way it’s possible to create a graph visual with a hierarchy on the X axis: Year, Year Month and Date

The model and the DAX queries used to measure the performance can be downloaded from GitHub.

Ambiguous model and USERELATIONSHIP

When a data model diagram contains cycles it’s ambiguous, because more than one path exists between the same two tables. When I began writing DAX code I thought that, when more than one path connected the same two tables, the result was the intersection of the filters. I was wrong.

A few days ago I answered a question on stackoverflow.com about what happens when USERELATIONSHIP is used with an ambiguous model, and I realized that the proposed model was the simplest case of ambiguity: three tables and three relationships.

There are two relationships on the Date column and an inactive one on the Name column between Product and ProductHistory. The last relationship is inactive because otherwise it would create an ambiguity in the propagation of a filter from 'Date'[Date] to ProductHistory[Date]. What would be the right path to follow? From Date directly to ProductHistory, or from Date to Product and then to ProductHistory?

With the Product[Name] -> ProductHistory[Name] relationship inactive, it’s evident that a filter applied to 'Date'[Date] propagates directly to ProductHistory without crossing the Product table.

This can be seen by executing this query in DAX Studio, which, for each 'Date'[Date] value, shows the corresponding ProductHistory rows according to the active relationship:

EVALUATE
GENERATE(
    ALL( 'Date'[Date] ),
    CALCULATETABLE(
        ProductHistory
    )
)

The result shows matching dates, as expected, since the 'Date'[Date] -> ProductHistory[Date] relationship is the active one.

But what happens when activating the Product[Name] -> ProductHistory[Name] relationship with USERELATIONSHIP?

  • Both relationships are used and the result is the intersection
  • Only the relationship specified with USERELATIONSHIP is used and the other one is ignored

The answer is the latter: DAX uses only the path containing the relationship specified with USERELATIONSHIP and ignores any other existing path. The result is that the 'Date'[Date] -> ProductHistory[Date] relationship becomes inactive for the duration of the CALCULATETABLE where USERELATIONSHIP is applied.

This can be seen by executing this other query in DAX Studio

EVALUATE
GENERATE(
    ALL( 'Date'[Date] ),
    CALCULATETABLE(
        ProductHistory,
        USERELATIONSHIP ( 'Product'[Name], ProductHistory[Name] )
    )
)

The result shows that the dates don’t match between 'Date'[Date] and ProductHistory[Date].

'Date'[Date] matches Product[Date] instead, as can be seen in the code used to generate the three tables:

Product = 
DATATABLE( 
    "Name", STRING, "Date", DATETIME, "Quantity", INTEGER,
    {
        { "A", "2020-01-01", 1 },
        { "B", "2020-01-01", 2 }
    }
)

ProductHistory = 
DATATABLE(
    "Name", STRING, "Date", DATETIME, "Quantity", INTEGER,
    {
        { "A", "2020-01-02", 10 },
        { "A", "2020-01-03", 20 },
        { "B", "2020-01-01", 30 }
    }
)

Date = CALENDAR( "2020-01-01", "2020-01-31" )
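The same behavior applies when the relationship is activated inside a measure; here is a hypothetical example (the measure name and the aggregation are made up for illustration):

History Quantity = 
CALCULATE (
    SUM ( ProductHistory[Quantity] ),
    USERELATIONSHIP ( Product[Name], ProductHistory[Name] )
)

When this measure is sliced by 'Date'[Date], the filter reaches ProductHistory only through Product via the Name relationship, while the direct 'Date'[Date] -> ProductHistory[Date] relationship is ignored.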

The sample files for this article can be found on my GitHub.

Adding a Static Table with Types in Power Query With M Language

The SQLBI guys recently published the YouTube video Create Static Tables in DAX Using the DATATABLE Function, where they show how to create a static table in Power BI:
first using the Power Query Enter data interface and then using the DAX language.
They also show that the M code generated by Enter data contains the table definition as an encoded string that is unreadable by humans, and that the only way to change it is through its specific editor.

Actually, it’s possible to create a static table using readable M code, with the syntax explained in the post Creating Tables In Power BI/Power Query M Code Using #table() on Chris Webb’s BI Blog, where one of the proposed syntaxes also allows specifying the data types.

The M code to create the Segments table is the following

#table(
    type table
        [
            #"Price Range"=text, 
            #"Min Price"=number,
            #"Max Price"=number
        ], 
    {
        {"LOW", 0, 100},
        {"MEDIUM", 100, 1000},
        {"HIGH", 1000, 999999}
    }
)

This code uses the #table() function, specifying the type as the first parameter and the data as the second one.

To enter this code in Power Query, first create a Blank Query.

Then open the Advanced Editor.

Finally, write the code in the Advanced Editor window.

Selecting Done closes the Advanced Editor window and shows the generated table.

The #table() M function doesn’t have the same limitation as the corresponding DAX function, which only accepts literals: with #table(), variables are allowed.
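For comparison, this is a sketch of the same Segments table defined with the DAX DATATABLE function, where every value must be a literal and no expressions or variables are allowed:

Segments = 
DATATABLE (
    "Price Range", STRING,
    "Min Price", INTEGER,
    "Max Price", INTEGER,
    {
        { "LOW", 0, 100 },
        { "MEDIUM", 100, 1000 },
        { "HIGH", 1000, 999999 }
    }
)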

Of course, defining static tables using M language has the disadvantage that a change triggers a refresh of the dataset.

The sample file can be downloaded from my GitHub

Circular Dependency in DAX caused by CALENDARAUTO

Working on a small demo, I was surprised by an unexpected circular dependency.

I built a repro to understand what was going on, and I’m writing this post to record what I found out for future reference.

The model is very simple: an Employee table with the Name and the Hire and Leave dates.

I’ve added a Date calculated table, created by CALENDARAUTO

Then I generated an Employee Snapshot table, with all the dates between the Hire and Leave dates for each employee.

When I tried to create the relationship on the Name columns between the Employee and the Employee Snapshot tables, a Circular Dependency error was generated.

Trying the known techniques to solve the circular dependency on the Employee Snapshot DAX code didn’t work.

Eventually, the only solution I found was to replace the CALENDARAUTO with CALENDAR.

The Employee table in the repro is very simple; I created it directly with the “Enter Data” button.

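Since the screenshot isn’t reproduced here, this is a hypothetical DAX sketch of an equivalent table (in the repro the table was entered manually, and the names and dates are made up for illustration):

Employee = 
DATATABLE (
    "Name", STRING,
    "Hire Date", DATETIME,
    "Leave Date", DATETIME,
    {
        { "Alice", "2020-01-10", "2020-06-30" },
        { "Bob", "2020-02-01", "2020-09-15" },
        { "Carol", "2020-03-05", "2020-12-20" }
    }
)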

The code to create the Date table with CALENDARAUTO is straightforward

Date = CALENDARAUTO()

The code to create the Employee Snapshot table, reduced to avoid context transitions and Time Intelligence functions, is:

Employee Snapshot = 
SELECTCOLUMNS( 
    GENERATE(
        ALLNOBLANKROW( Employee ),
        FILTER (
            ALLNOBLANKROW( 'Date'[Date] ),
            'Date'[Date] >= 'Employee'[Hire Date] &&
            'Date'[Date] <= 'Employee'[Leave Date]
        )
    ),
    "Name", Employee[Name],
    "Date", 'Date'[Date]
)

The model now has the three tables, without any relationship between them.

Then, when I try to create a relationship using the Name column, the circular dependency error is generated.

The solution I found is to replace the code that generates the Date table:

Date = 
VAR MaxEmployeeDate = MAX( MAX( Employee[Hire Date] ), MAX( Employee[Leave Date] ) )
VAR MinEmployeeDate = MIN( MIN( Employee[Hire Date] ), MIN( Employee[Leave Date] ) )
VAR MaxDate = DATE( YEAR( MaxEmployeeDate ), 12, 31 )
VAR MinDate = DATE( YEAR( MinEmployeeDate ), 1, 1 ) 
RETURN
    CALENDAR( MinDate, MaxDate )

With the new Date table it’s now possible to create the relationships.
 
 
The explanation is that CALENDARAUTO uses all the dates in the model, including those in the calculated tables: the Date table therefore depends on the Employee Snapshot table, which in turn depends on the Date table.

 

The sample pbix can be downloaded from GitHub.

 

How To Set Up Power BI New Composite Models

After watching the new video from the SQLBI guys, Unboxing new Power BI composite models, I decided to try it.

Here is a quick guide on the steps to follow in order to set up a composite model.

Enable the Preview feature DirectQuery for Power BI datasets and Analysis Services.

In the Options window in Power BI Desktop we need to check the “DirectQuery for Power BI datasets and Analysis Services” Preview feature. After clicking OK, Power BI Desktop must be restarted.


Connect to the Power BI dataset to be used via Direct Query.

This creates a Live Connection with a model in the Power BI service. First click on the “Power BI datasets” button and then choose the Workspace. 


Add a local model and transform the Live Connection to Direct Query

Click on the “Transform Data” button and then “Add a local model”


And that’s it!

It’s now possible to add new tables using Power Query or DAX, and to create calculated columns from the tables’ context menu in the Fields panel.


 

 

Variables in DAX and Common Mistakes

The purpose of this post is to show two common mistakes that happen when using DAX variables.

Variables in DAX are a powerful tool that improves the readability of DAX code. But their name is misleading, since DAX variables behave like constants.

The syntax to define a DAX variable is straightforward:

VAR myVariable = ... a DAX expression ...
RETURN ... a DAX expression using myVariable ...

For instance, this is valid DAX code, where we define the variable TotalSales and later use it:

[% Sales] = 
VAR TotalSales = CALCULATE( [Sales Amount], REMOVEFILTERS( 'Date' ) )
RETURN DIVIDE( [Sales Amount], TotalSales )

Once a variable has been defined, it cannot be changed, unlike in a conventional programming language, where variables can be assigned multiple times.
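For instance, in this hypothetical snippet a second variable is introduced, because the first one cannot be reassigned:

Discounted Amount = 
VAR Amount = [Sales Amount]
VAR DiscountedAmount = Amount * 0.9 -- a new variable: Amount itself cannot be changed
RETURN
    DiscountedAmount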

To see the two common mistakes that newcomers make when using variables, we create two tables in Power BI Desktop using this DAX code:

Date = 
ADDCOLUMNS(
    CALENDAR( "2020-01-01", "2020-12-31" ),
    "Month", FORMAT( [Date], "mmm" ),
    "Month Number", MONTH( [Date] )
)

Sales = 
VAR MinDate =
    MIN( 'Date'[Date] )
RETURN
    SELECTCOLUMNS(
        GENERATESERIES( 1, 10000 ),
        "Date", CONVERT( MOD( [Value], 365 ) + MinDate, DATETIME ),
        "Amount", MOD( [Value], 100 )
    )

Then we create a one-to-many relationship between 'Date'[Date] and Sales[Date], and we set the Sort by column of 'Date'[Month] to 'Date'[Month Number].

This is our model:

We can create the measure [Sales Amount]

Sales Amount = SUM( Sales[Amount] )

And a matrix visual to show the [Sales Amount] per Month

Now we want to write a measure to compute the percentage of sales over the total, and to do so we write this DAX code:

% Sales Amount Wrong = 
VAR SalesAmount = [Sales Amount]
VAR TotalSalesAmount =
    CALCULATE( SalesAmount, REMOVEFILTERS( 'Date' ) )
RETURN
    DIVIDE( SalesAmount, TotalSalesAmount )

But when we add this measure to the matrix visual we get this result

We get 100% for every month, which is wrong. Our mistake was to assume that the SalesAmount variable would be evaluated inside the CALCULATE with the modified filter context.

The fact is that once SalesAmount is defined, it contains the value computed in the current filter context, and when it is referenced later it behaves like a constant. For instance, in January SalesAmount assumes the value 42690, therefore the following CALCULATE behaves as if it were written

VAR TotalSalesAmount =
    CALCULATE( 42690, REMOVEFILTERS( 'Date' ) )

The correct code for this measure uses the [Sales Amount] measure instead

% Sales Amount = 
VAR SalesAmount = [Sales Amount]
VAR TotalSalesAmount =
    CALCULATE( [Sales Amount], REMOVEFILTERS( 'Date' ) )
RETURN
    DIVIDE( SalesAmount, TotalSalesAmount )

And now the result is correct

The second mistake is less evident, since it happens when we define a variable containing a table.

We want to add a card visual to show the number of Sundays of January. We think it is a good idea (but it isn’t) to first prepare a table with just the Sundays and later to count its rows with a filter on January. So we write this code

# Sundays in January Wrong = 
VAR SundaysTable =
    FILTER( ALL( 'Date' ), WEEKDAY( 'Date'[Date] ) = 1 )
RETURN
    CALCULATE( COUNTROWS( SundaysTable ), 'Date'[Month Number] = 1 )

And we find out that January has more Sundays than days.

This is because the SundaysTable in CALCULATE is a constant, as if it were written with literals: { { "2020-01-05", 1, "Jan" }, { "2020-01-12", 1, "Jan" }, … { "2020-12-27", 12, "Dec" } }, therefore it is not affected by the filter context modified by CALCULATE.

But since the columns in table variables keep their data lineage, table variables can be used as filter arguments, and the previous measure can be fixed as follows

# Sundays in January = 
VAR SundaysTable =
    FILTER( ALL( 'Date' ), WEEKDAY( 'Date'[Date] ) = 1 )
RETURN
    CALCULATE( COUNTROWS( 'Date' ), 'Date'[Month Number] = 1, SundaysTable )

And now we see the correct number of Sundays in January 2020: four.

Variables are very useful, and once we learn to avoid these mistakes we’ll use them to better organize our code, as a support for debugging and, in some cases, as a tool to optimize our measures’ performance.

The DAX code in this post was written to be as simple as possible in order to show the common mistakes.

The pbix file can be downloaded from my GitHub profile.