BookmarkSubscribeRSS Feed
tennytivvytutu
Fluorite | Level 6

Hello all,

 

I am working with a program and am having difficulty understanding a specific piece of code, a RETAIN statement in a DATA step. 

 

Specific questions: 

a) What is the purpose/function of adding x's to the variable names? 

b) How does the RETAIN statement differ from the KEEP statement?

 

Code:

 

DATA sci_2;
SET sci_1;
RETAIN xfiscalyear xschlcode xsaisid xsubject xgr_new;
IF xfiscalyear = fiscalyear AND xschlcode = schlcode AND xsubject = subject AND xsaisid = saisid AND xgr_new = gr_new THEN delete;

Xfiscalyear = fiscalyear;
Xsaisid = saisid;
Xsubject = subject;
Xgr_new = gr_new;
Xschlcode = schlcode;

 

Any additional information for my extended learning would be appreciated! Thanks in advance! 

10 REPLIES 10
Kurt_Bremser
Super User
The x is just there to create a new variable where the relationship to the original variable is obvious.
KEEP determines which variables make it into the output dataset.
RETAIN directs the data step to not set the variables to missing at the start of a data step iteration, so they retain their values across data step iterations.
tennytivvytutu
Fluorite | Level 6
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?
Kurt_Bremser
Super User

@tennytivvytutu wrote:
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?

ALL data step variables not coming from a dataset or explicitly RETAINed will be set to missing at the start of a data step iteration (I assume you already know what a data step iteration is). This is default behavior of data steps.

tennytivvytutu
Fluorite | Level 6
I'm still learning SAS (and programming in general) so the DATA step is a bit confusing. But I found helpful documentation here: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0l9z26c8qrhl8n1hqs6raontpbn.htm#p0ly68p...
and it does say that "variables read in a SET statement are not reset to missing [as the system automatically returns to the top of the DATA step]." I suppose this means that the RETAIN statement acts as an override to ensure the variables indeed are set to missing, even though they're read in a SET statement? I'm closer to understanding, but the benefit of the RETAIN statement is still lost on me...
Kurt_Bremser
Super User

No. As already stated, variables read from a dataset are automatically retained; you cannot un-retain them, you must explicitly set them to missing if such is needed.

RETAIN lets you retain values in newly created variables from one iteration to the next, which is useful for aggregating or comparing values.

See this simple example:

data have;
input x,
datalines;
1
2
3
;

data want;
set have;
retain y;
y = sum(y,x);
run;
ballardw
Super User

@tennytivvytutu wrote:
Thank you! I'm not sure why we would want to set variables to missing, could you think of a general example where I'd want to use RETAIN? When would a variable be set to missing if it is not truly a missing data point?

There are many  reasons to set a variable to missing. Consider data that is supposed to have some numeric values but the data collection system places a "code" like 9999 for any of "not valid" "not collected yet" "collection error" or similar meanings. When you do statistics on that variable you likely do not want to use the numeric value of 9999. SAS, since it was designed to do summary statistics, has the value of missing so that summaries do not count, average or include in other summary statistics when not wanted. So for processing in SAS you should set the 9999 (or similar) to missing.

 

Another is that you examine data and find that one or more values are suspicious that you do not want to include in your analysis. Consider a data set that has hourly wage information and vast majority of people in the data have values from $7 to maybe $500 (lawyers or such). But then you find a value that is $1,000,000. You may not want to consider a million dollars as a valid hourly wage. Or converse of similar data, the value is supposed to reflect annual income and you find values of $4.50 recorded. You might recode that to missing.

 

There are process in SAS, such as RETAINed variables where you may need to reset the retained missing value because the current record is for a new group  of records and continuing the previous value would be inappropriate as you want the value to be within the group.

 

Search this forum for the word RETAIN. You will find hundreds of examples where retain is used. Typical it is for creating cumulative values, filling in missing that should not be or having a value to use for a later purpose.

If you have data that is sequential by date with some group identification you may want to compare the date of a current record with the first to calculate durations (using the date itself) or change from the first record for the group to get change in price/payment/salary/instrument recording like temperatures.

And you may just need the value for some calculation or comparison but do not actually write it to the output data.

yabwon
Onyx | Level 15

Start with reading the documentation:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/p0t2ac0tfzcgbjn112mu96hkgg9o.htm

 

This code looks like (not very smart) way to delete duplicated rows.

 

@a)  "x" prefix is just to create temporary variable.  

@b) KEEP statement is for keeping variables in datasets, RETAIN statement is for retaining variable's values between main loop's iterations.

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



tennytivvytutu
Fluorite | Level 6

Thanks! If I update this section of the code to delete duplicated rows more efficiently, would it perhaps look like this:

 

DATA sci_2;

SET sci_1;

IF fiscalyear = . AND schlcode = . AND subject = . AND saisID = . AND gr_new = . THEN delete;

RUN;

 

...or perhaps:

 

DATA sci_2;

SET sci_1;

WHERE fiscalyear schlcode subject saisID gr_new IS NOT MISSING;

RUN;

 

And why not use PROC SORT with NODUPKEY?

yabwon
Onyx | Level 15

Hi @tennytivvytutu ,

 

my first idea for de-duplication would be Proc Sort with NODUPKEY option (for small datasets).

 

For big datasets with n variables as key I would use Paul Dorfman ( @hashman ) approach from "How To Dupe A Dedup" paper:

https://analytics.ncsu.edu/sesug/2018/SESUG2018_Paper-281_Final_PDF.pdf

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



tennytivvytutu
Fluorite | Level 6
This paper is very useful and informative, I will definitely read!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1035 views
  • 6 likes
  • 4 in conversation