Re: Help understanding RETAIN statement, x before variable names?

tennytivvytutu · Posted 06-10-2022 02:11 PM

Hello all,

I am working with a program and am having difficulty understanding a specific piece of code, a RETAIN statement in a DATA step.

Specific questions:

a) What is the purpose/function of adding x's to the variable names?

b) How does the RETAIN statement differ from the KEEP statement?

Code:

DATA sci_2;
SET sci_1;
RETAIN xfiscalyear xschlcode xsaisid xsubject xgr_new;
IF xfiscalyear = fiscalyear AND xschlcode = schlcode AND xsubject = subject AND xsaisid = saisid AND xgr_new = gr_new THEN delete;

Xfiscalyear = fiscalyear;
Xsaisid = saisid;
Xsubject = subject;
Xgr_new = gr_new;
Xschlcode = schlcode;

Any additional information for my extended learning would be appreciated! Thanks in advance!

Kurt_Bremser · Posted 06-10-2022 02:20 PM

The x is just there to create a new variable where the relationship to the original variable is obvious.
KEEP determines which variables make it into the output dataset.
RETAIN directs the data step to not set the variables to missing at the start of a data step iteration, so they retain their values across data step iterations.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

tennytivvytutu · Posted 06-10-2022 02:38 PM

Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?

Kurt_Bremser · Posted 06-10-2022 03:07 PM

@tennytivvytutu wrote:
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?

ALL data step variables not coming from a dataset or explicitly RETAINed will be set to missing at the start of a data step iteration (I assume you already know what a data step iteration is). This is default behavior of data steps.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

tennytivvytutu · Posted 06-10-2022 03:17 PM

I'm still learning SAS (and programming in general) so the DATA step is a bit confusing. But I found helpful documentation here: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0l9z26c8qrhl8n1hqs6raontpbn.htm#p0ly68p...
and it does say that "variables read in a SET statement are not reset to missing [as the system automatically returns to the top of the DATA step]." I suppose this means that the RETAIN statement acts as an override to ensure the variables indeed are set to missing, even though they're read in a SET statement? I'm closer to understanding, but the benefit of the RETAIN statement is still lost on me...

Kurt_Bremser · Posted 06-10-2022 03:44 PM

No. As already stated, variables read from a dataset are automatically retained; you cannot un-retain them, you must explicitly set them to missing if such is needed.

RETAIN lets you retain values in newly created variables from one iteration to the next, which is useful for aggregating or comparing values.

See this simple example:

data have;
input x,
datalines;
1
2
3
;

data want;
set have;
retain y;
y = sum(y,x);
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 06-10-2022 04:31 PM

@tennytivvytutu wrote:
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?

There are many reasons to set a variable to missing. Consider data that is supposed to have some numeric values but the data collection system places a "code" like 9999 for any of "not valid" "not collected yet" "collection error" or similar meanings. When you do statistics on that variable you likely do not want to use the numeric value of 9999. SAS, since it was designed to do summary statistics, has the value of missing so that summaries do not count, average or include in other summary statistics when not wanted. So for processing in SAS you should set the 9999 (or similar) to missing.

Another is that you examine data and find that one or more values are suspicious that you do not want to include in your analysis. Consider a data set that has hourly wage information and vast majority of people in the data have values from $7 to maybe $500 (lawyers or such). But then you find a value that is $1,000,000. You may not want to consider a million dollars as a valid hourly wage. Or converse of similar data, the value is supposed to reflect annual income and you find values of $4.50 recorded. You might recode that to missing.

There are process in SAS, such as RETAINed variables where you may need to reset the retained missing value because the current record is for a new group of records and continuing the previous value would be inappropriate as you want the value to be within the group.

Search this forum for the word RETAIN. You will find hundreds of examples where retain is used. Typical it is for creating cumulative values, filling in missing that should not be or having a value to use for a later purpose.

If you have data that is sequential by date with some group identification you may want to compare the date of a current record with the first to calculate durations (using the date itself) or change from the first record for the group to get change in price/payment/salary/instrument recording like temperatures.

And you may just need the value for some calculation or comparison but do not actually write it to the output data.

yabwon · Posted 06-10-2022 02:23 PM

Start with reading the documentation:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/p0t2ac0tfzcgbjn112mu96hkgg9o.htm

This code looks like (not very smart) way to delete duplicated rows.

@a) "x" prefix is just to create temporary variable.

@b) KEEP statement is for keeping variables in datasets, RETAIN statement is for retaining variable's values between main loop's iterations.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

tennytivvytutu · Posted 06-10-2022 02:37 PM

Thanks! If I update this section of the code to delete duplicated rows more efficiently, would it perhaps look like this:

DATA sci_2;

SET sci_1;

IF fiscalyear = . AND schlcode = . AND subject = . AND saisID = . AND gr_new = . THEN delete;

RUN;

...or perhaps:

DATA sci_2;

SET sci_1;

WHERE fiscalyear schlcode subject saisID gr_new IS NOT MISSING;

RUN;

And why not use PROC SORT with NODUPKEY?

yabwon · Posted 06-10-2022 02:45 PM

Hi @tennytivvytutu ,

my first idea for de-duplication would be Proc Sort with NODUPKEY option (for small datasets).

For big datasets with n variables as key I would use Paul Dorfman ( @hashman ) approach from "How To Dupe A Dedup" paper:

https://analytics.ncsu.edu/sesug/2018/SESUG2018_Paper-281_Final_PDF.pdf

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

tennytivvytutu · Posted 06-10-2022 03:01 PM

This paper is very useful and informative, I will definitely read!