Watch this Ask the Expert session to learn how the LAG function works, get an introduction to the hash object and why it can be helpful with autoregressive time series forecasting.
Watch the webinar
You’ll also learn how to take a value from a previous row, perform a calculation with it and then store it in the same column.
You will learn:
An in-depth understanding of the LAG function and a visual demonstration of how it works.
An introduction to the hash object.
How to use a hash object to perform a LAG function when you need to do a calculation on the value.
The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.
Q&A
If I am correct, the size of HASH Object depends on amount memory, correct? If so, how would you determine if you have enough memory?
Yes, hash objects must fit into available memory, and the MEMSIZE option in SAS will determine how much system memory is available to your session. Default is 2G. You can increase the value of the MEMSIZE option at startup. There are also techniques for using hash objects efficiently in memory by adjusting the HASHEXP method.
Does the memory go against the 2G default or it is from the available pool on the machine?
Yes, it goes against the 2G. You can increase the value of the MEMSIZE option at startup. There are also techniques for using hash objects efficiently in memory by adjusting the HASHEXP method.
What happens if you don't have continuous dates in your data? Like if you have 16-Oct and then 18-Oct?
If, as in my example, we are doing a lag of 1, when it reaches the observation for Oct 18, it will look in the hash object for a row of data with a key of Oct 17, and it will return a non 0 value indicating the FIND method failed. This will result in Oct 18 having ‘.’ for its prediction which will then ripple forward through the remainder of the rows of data.
You can add error catching to your code to always check the value of rc and handle exceptions where it is not 0.
rc=myLag.find(key: Date-1);
if not (rc = 0) then do;
/*error handling code*/
end;
You seem to be operating within Screen Control Language (SCL) and not within the SAS Display Manager data step. There is no SAS data step statement DECLARE.
This is just Base SAS. The DECLARE HASH and DECLARE HITER statements were added in version 9.0 as beta, and production in version 9.1.
Instead of hash objects, could you have used retain statements before the set statement(s) and lag statements right after the set statement(s) and before any logic statements?
We could RETAIN the value of PRED from the previous observation, and then conditionally set the LAG_QTYSOLD_1 variable to that value before calculating the new PRED. We still don’t want to use a LAG function as it does not allow us time to make a calculation between the pulling of the previous value from the memory queue and the pushing of the new value into that memory queue.
data scored;
set raw;
retain pred;
if lag_qtySold_1=. then lag_qtySold_1=pred;
if qtysold=. then pred=-0.19751+0.67111*lag_qtySold_1+0.14582*dow;
Does the data have to be sorted before reading it in?
Yes and no. Because I am doing time series predictions the raw dataset that I am using to create my predictions must be sorted in order of date, or else I will have issues the first time I try to predict the value on a given date where I do not already have a prediction from the date before. (Same as the first question, if we skip a date, or if the dates are out of order, we will have problems with our time series forecast, we need to make our predictions in order since they feed into the next day’s prediction).
However, the data loaded to the HASH object does not have to be sorted. That is the beauty of the HASH object! 😊
If I had done some other manipulation and had a second version of my RAW dataset called RAW2 that was not sorted by date, but RAW is still sorted by date, the following would work correctly:
data scored;
set raw; /*sorted by date*/
if _n_=1 then do;
declare hash myLag(dataset: "raw2(where=(QtySold=.))"); /*NOT sorted by date*/
If it weren’t for the time series nature of my example, the main dataset would not need to be sorted either. If you had a dataset of patient records and you were using a HASH object to lookup data about those patients, you would not need either dataset to be sorted by the patientID.
Is there any way to display the hash table?
Not exactly, it is stored in memory and disappears as soon as the data step is done. However, you can save it as a SAS dataset using the method OUTPUT.
rc=myLag.OUTPUT(dataset: “HashDS”);
The above line of code will create a dataset called HashDS that will contain all the fields that were listed when the DEFINEDATA method was used. It will NOT include the fields from the DEFINEKEY method so if you wanted to see the date you would need to include it in the DEFINEDATA statement. If you do not issue the DEFINEDATA statement then, by default, all variables are included, which would include your KEY variables as well.
When you declare your HASH object you can use the ORDERED option to make sure that your resulting dataset is sorted (and save yourself the overhead of a PROC SORT). Similarly to the DECLARE HASH statement, you can also use the same dataset options to KEEP, DROP, RENAME or subset the data with a WHERE statement here as well.
Regarding whether we are missing October 20 (for example), can we use _n_ instead of date to go back one record?
Not directly, but you can create a new variable and use that. In my data step creating my RAW dataset, I added “recNum = _N_;” and then anywhere that I used DATE I replaced it with RECNUM and got the exact same results.
If you’ve got two hash objects A and B that you’ve been cycling forward, partway through A and then set it aside to dig into B, does A stay where you left it when you come back to cycle through more records in A?
If you have two or more HASH objects within a single data step, they are two separate entities. With each reference to the FIND method you specify which of the HASH objects you are searching in and simply tell it what key value to be looking up and there is no problem caused by switching which HASH object you are using. You just need to keep it straight in your code when you want A.FIND and B.FIND.
You might be thinking about the HITER object that will help you iterate through a HASH object, but there too, each would be a separate entity associated with a specific HASH object and there would be no problem caused by switching which object you are iterating through. They each have their own pointer. You’ll just want to name them in such a way that you know which HITER is associated with which HASH object.
Can hash object be used to do table joins?
Absolutely! And the tables don’t need to be sorted first. The only caveat is that all but one of your tables must fit in memory. So, if you are merging a small table into a large one, load the small one into a HASH object, then use the FIND method to add the columns from the smaller one into the larger one.
Can you discuss multiple rows in either the dataset or in the hash table?
If the base dataset, SCORED in my example, has duplicate rows, there is no problem, the same row will be pulled in from the HASH object each time a given KEY is used in the FIND method.
In SAS 9.0 and 9.1, what was loaded into a HASH object had to have unique KEY values, but starting with 9.2, you could have duplicates. When defining your HASH object you need to tell SAS to expect this or else it will only keep one row per key value. You do this by adding the MULTIDATA=”Y” option when you declare your HASH object.
The challenge is that the FIND method will only return the first record it finds. It will not return all the rows with a given key. You will need to add the FIND_NEXT method that was also added in SAS 9.2. That will keep searching through the HASH object returning all matching rows till it runs out of matching rows.
There is a FIND_PREV method but it doesn’t seem to be useful as there is not currently a method to position the pointer at the last row for a particular key, but perhaps that will come someday. For now, FIND will go to the first one and FIND_NEXT will move on till it runs out.
Could this work using something as a key other than dates?
Yes, the key in a HASH object is not limited to dates or even numeric fields such as a record number created from _N_ as in some of the above answers, or a patientID. It will also work on character keys as well. Just keep in mind that it will be case sensitive unless you apply the UPCASE function to both fields first.
Do you have any tips on using hash object for validating large datasets - like missing data, incomplete segments, etc.?
It would depend greatly on the specific case, but an example could be finding missing transactions. If you had a variable that recorded transaction numbers and you expected that to be consecutive you could use PROC SQL and SELECT INTO to set macro variables with the min and max transaction numbers, then generate a dataset with all possible values between them. You would then load your original dataset into a HASH object with that transaction number as the key. You could then go through your generated dataset and check to see if every number you generated is successfully found.
rc=myTrans.FIND(key: transNum);
if not (rc=0) then missTrans=1;
Recommended Resources
Hash Objects: When LAGging Behind Just Doesn’t Work
LAG Function Documentation
Using the Hash Object to Store and Retrieve Data
Hash Object Tip Sheet
SAS Programmer Week (Free + Virtual)
SAS Conference Proceedings (Lex Jansen)
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
... View more