Hello all,
I am working with a program and am having difficulty understanding a specific piece of code, a RETAIN statement in a DATA step.
Specific questions:
a) What is the purpose/function of adding x's to the variable names?
b) How does the RETAIN statement differ from the KEEP statement?
Code:
DATA sci_2;
SET sci_1;
RETAIN xfiscalyear xschlcode xsaisid xsubject xgr_new;
IF xfiscalyear = fiscalyear AND xschlcode = schlcode AND xsubject = subject AND xsaisid = saisid AND xgr_new = gr_new THEN delete;
Xfiscalyear = fiscalyear;
Xsaisid = saisid;
Xsubject = subject;
Xgr_new = gr_new;
Xschlcode = schlcode;
Any additional information for my extended learning would be appreciated! Thanks in advance!
@tennytivvytutu wrote:
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?
ALL data step variables not coming from a dataset or explicitly RETAINed will be set to missing at the start of a data step iteration (I assume you already know what a data step iteration is). This is default behavior of data steps.
No. As already stated, variables read from a dataset are automatically retained; you cannot un-retain them, you must explicitly set them to missing if such is needed.
RETAIN lets you retain values in newly created variables from one iteration to the next, which is useful for aggregating or comparing values.
See this simple example:
data have;
input x,
datalines;
1
2
3
;
data want;
set have;
retain y;
y = sum(y,x);
run;
@tennytivvytutu wrote:
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?
There are many reasons to set a variable to missing. Consider data that is supposed to have some numeric values but the data collection system places a "code" like 9999 for any of "not valid" "not collected yet" "collection error" or similar meanings. When you do statistics on that variable you likely do not want to use the numeric value of 9999. SAS, since it was designed to do summary statistics, has the value of missing so that summaries do not count, average or include in other summary statistics when not wanted. So for processing in SAS you should set the 9999 (or similar) to missing.
Another is that you examine data and find that one or more values are suspicious that you do not want to include in your analysis. Consider a data set that has hourly wage information and vast majority of people in the data have values from $7 to maybe $500 (lawyers or such). But then you find a value that is $1,000,000. You may not want to consider a million dollars as a valid hourly wage. Or converse of similar data, the value is supposed to reflect annual income and you find values of $4.50 recorded. You might recode that to missing.
There are process in SAS, such as RETAINed variables where you may need to reset the retained missing value because the current record is for a new group of records and continuing the previous value would be inappropriate as you want the value to be within the group.
Search this forum for the word RETAIN. You will find hundreds of examples where retain is used. Typical it is for creating cumulative values, filling in missing that should not be or having a value to use for a later purpose.
If you have data that is sequential by date with some group identification you may want to compare the date of a current record with the first to calculate durations (using the date itself) or change from the first record for the group to get change in price/payment/salary/instrument recording like temperatures.
And you may just need the value for some calculation or comparison but do not actually write it to the output data.
Start with reading the documentation:
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/p0t2ac0tfzcgbjn112mu96hkgg9o.htm
This code looks like (not very smart) way to delete duplicated rows.
@a) "x" prefix is just to create temporary variable.
@b) KEEP statement is for keeping variables in datasets, RETAIN statement is for retaining variable's values between main loop's iterations.
Bart
Thanks! If I update this section of the code to delete duplicated rows more efficiently, would it perhaps look like this:
DATA sci_2;
SET sci_1;
IF fiscalyear = . AND schlcode = . AND subject = . AND saisID = . AND gr_new = . THEN delete;
RUN;
...or perhaps:
DATA sci_2;
SET sci_1;
WHERE fiscalyear schlcode subject saisID gr_new IS NOT MISSING;
RUN;
And why not use PROC SORT with NODUPKEY?
Hi @tennytivvytutu ,
my first idea for de-duplication would be Proc Sort with NODUPKEY option (for small datasets).
For big datasets with n variables as key I would use Paul Dorfman ( @hashman ) approach from "How To Dupe A Dedup" paper:
https://analytics.ncsu.edu/sesug/2018/SESUG2018_Paper-281_Final_PDF.pdf
Bart
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.