Differencing with SQL LAG()

Sep 28

I’m working on this project where I have daily values for a bunch of subjects. I need to know the differences between the values for (day5, day1), (day5, day2), and (day5, day3).

I had planned to do this in Python in a loop, but then I remembered that many languages (including SQL) have tools for handling time series data. SQL Sever’s LAG() function to the rescue!

Some of the subjects are missing values for some days. This is easy to deal with for days 1-4. The LAG() function gives a “deafault” option for setting what the value should be in the target day is missing. The syntax is

LAG (scalar_expression [,offset] [,default])  
    OVER ( [ partition_by_clause ] order_by_clause )  

where offset is the number of days you want to go back (so yesterday is offset=1) and default is the value to return if your target day is missing. For my particular project, it’s actually safe to assume a missing day was actually 0. The OVER() piece lets SQL know if your day is grouped by something (for me each subject is a group), so I might use a subject_id for example. The order by clause is the time (or other ordering) column, so, to get 2 days ago for each subject, my LAG clause looks like this.

LAG (value_of_interest, 2,0)  
    OVER ( partition by subject_id order by date )  

But! I still have a problem. What about if day5 is missing? It can be treated as 0, but LAG() won’t help with that. I have an idea. I think I can create a table of just day5s filled with 0s and join that based on subject_id and date to the original table. Then, sum the value of interest grouping by day. If it works, that will be the topic of a future post!

Ryan Melvin

Differencing with SQL LAG()

SQL: Fill completely missing observations

SQL date part of date-time