BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rcleven2
Obsidian | Level 7

I have combined 3 very large data sets all having similar information in them. Each data set would have an address and certain dates that would be duplicates in the next data set like date_of_enrol, date_enrl, enrollment, etc. Some of these dates were different and I have written code and logic to make sure that the new variable Date_of_Enrollment holds the appropriate date. 

 

Here is where I found the problem. After I combined all three data sets and their many common variables into new ones that I created I wanted to organize the information with a retain statement. HOWEVER, when I used the retain statement I was surprised to find that the DATE variables that I had created that combined common date variables across data sets are now retaining their values from one observation to the next. I have fixed the problem by moving the retain statement out of that data step and into later one. 

 

It seems that RETAIN statements do not have the same effect on new variables as they do on existing ones when used with the intent of order. I would like to know why. 

 

Does any one know?

 

 

Ex.

date_1          date_2            best_date(desired)           best_date(actual)
01/01/1960  01/01/1959      01/01/1959                       01/01/1959
                .                   .                       .                        01/01/1959
                .                   .                       .                        01/01/1959
                .   01/01/1970      01/01/1970                       01/01/1970
                .                   .                       .                        01/01/1970

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

@rcleven2 wrote:

....

 

I was under the impression that the retain statement used under the data statement was meant to reorder the variables but in my case it did two things. It reordered my variables and also went a step further that i wanted it to go by retaining the value of ONLY my created numeric variables, which was very unexpected.

 

Why does the retain statement have that second effect on only created numeric variables  when using it to reorder variables in a data step?


The RETAIN statement is designed to RETAIN the values of variable from one iteration to the next. That is not a "side-effect".

 

The side effect is what you are trying to take advantage of.  SAS places the variables in the dataset in the order that they are referenced in the data step.  Some people of adopted using the RETAIN statement, whose purpose if to RETAIN variable, as an easy way to set the order of the variables by using it as the first place the variables are mentioned. 

 

It would be better to just use a LENGTH or ATTRIB statement and define the variables. 

 

People have gotten in the habit of using RETAIN because it does not set the type or length, just the position and the RETAIN flag.

If that is a concern then here is another method to mix old and new variables without knowing how the old variable should be defined.

 

data new ;
  if 0 then do ;
    set old1 (keep=id date1) ;
    length newvar 8 ;
    set old2 (keep=date2);
  end;
  merge old1 old2 ;
  by id;
  newvar = date2-date1;
run;

The IF 0 step will never execute, but the SAS compiler will see the variables referenced and so it will force the program data vector to begin with 

id date1 newvar date2

View solution in original post

8 REPLIES 8
Astounding
PROC Star

Yes, that's automatic.

 

Any variable that comes from a SAS data set is automatically retained.  Doesn't matter whether there's a SET statement or a MERGE statement.  All such variables are retained.

 

There are complications if a SET statement does not contain a subsequent BY statement.  When the SET statement moves to a new data set, the RETAIN is lifted for the first observation of the new data set.  This probably doesn't apply to the situation you are encountering.

 

Most of the time, you wouldn't even notice and it won't make a difference.  But when it does make a difference it can be difficult to diagnose.  You did well to get that far.

rcleven2
Obsidian | Level 7

Yes, I understand that variables are "retained" from one data step to the next. I am using the retain statement to order my variables. Let say that I have date_1 and date_2 and I have created best_date. Best_date will equal the minimum value of date_1 and date_2 if they both are not equal to . and will take either if one is missing.

 

Date_1 and date_2 will be listed in a drop statement and best_date will be ordered among many other variables within the created dataset. My problem is that with created numeric variable listed in a retain statement, they are keeping their values from one observation to the next. So if Obs1 has a best date of 01/01/1960  and obs2-0bs10 should be . I am seeing that obs2-obs10 have a best_date value of 01/01/1960. This value is duplicated, or "retained", until the next time that is filled in normally. Any blanks that SHOULD be in the data set are being populated by the value directly above it.

 

Why is that happening?

Tom
Super User Tom
Super User

@rcleven2 wrote:

Yes, I understand that variables are "retained" from one data step to the next. I am using the retain statement to order my variables. Let say that I have date_1 and date_2 and I have created best_date. Best_date will equal the minimum value of date_1 and date_2 if they both are not equal to . and will take either if one is missing.

 

Date_1 and date_2 will be listed in a drop statement and best_date will be ordered among many other variables within the created dataset. My problem is that with created numeric variable listed in a retain statement, they are keeping their values from one observation to the next. So if Obs1 has a best date of 01/01/1960  and obs2-0bs10 should be . I am seeing that obs2-obs10 have a best_date value of 01/01/1960. This value is duplicated, or "retained", until the next time that is filled in normally. Any blanks that SHOULD be in the data set are being populated by the value directly above it.

 

Why is that happening?


The purpose of the RETAIN statement is to mark NEW variables to not be reset to missing when a new iteration of the data step starts. If you don't want to keep the retained value then make sure to change it to the value you do want. 

 

Most likely you are only setting the new variable on some of the data step iterations.  Check that your IF/THEN statements always have an ELSE clause.

rcleven2
Obsidian | Level 7

All my conditional code and logic works fine. Leaving the new variables off of the retain statement gives me the output that I desire except for the variables being out of the desired order. All I did to fix the issue was put the retain statement in the next data step and it works fine.

 

I was under the impression that the retain statement used under the data statement was meant to reorder the variables but in my case it did two things. It reordered my variables and also went a step further that i wanted it to go by retaining the value of ONLY my created numeric variables, which was very unexpected.

 

Why does the retain statement have that second effect on only created numeric variables  when using it to reorder variables in a data step?

Tom
Super User Tom
Super User

@rcleven2 wrote:

....

 

I was under the impression that the retain statement used under the data statement was meant to reorder the variables but in my case it did two things. It reordered my variables and also went a step further that i wanted it to go by retaining the value of ONLY my created numeric variables, which was very unexpected.

 

Why does the retain statement have that second effect on only created numeric variables  when using it to reorder variables in a data step?


The RETAIN statement is designed to RETAIN the values of variable from one iteration to the next. That is not a "side-effect".

 

The side effect is what you are trying to take advantage of.  SAS places the variables in the dataset in the order that they are referenced in the data step.  Some people of adopted using the RETAIN statement, whose purpose if to RETAIN variable, as an easy way to set the order of the variables by using it as the first place the variables are mentioned. 

 

It would be better to just use a LENGTH or ATTRIB statement and define the variables. 

 

People have gotten in the habit of using RETAIN because it does not set the type or length, just the position and the RETAIN flag.

If that is a concern then here is another method to mix old and new variables without knowing how the old variable should be defined.

 

data new ;
  if 0 then do ;
    set old1 (keep=id date1) ;
    length newvar 8 ;
    set old2 (keep=date2);
  end;
  merge old1 old2 ;
  by id;
  newvar = date2-date1;
run;

The IF 0 step will never execute, but the SAS compiler will see the variables referenced and so it will force the program data vector to begin with 

id date1 newvar date2
Astounding
PROC Star

Just a small piece to add to what @Tom already said ...

 

Since you now understand that RETAIN holds on to the value from one observation to the next, you can program around that:

 

if first.id then best_date = min(date_1, date_2);

else best_date = .;

rcleven2
Obsidian | Level 7

In my programing I class I was taught the retain statement was a way to RETAIN the order of variables in a data step. Knowing that the retain statement is meant for a different purpose is what I was looking for. I still find its side effect or ordering variables more useful than its intended purpose though. Outside of using a retain statement, would there be any other way to reorder variables in a dataset in and easy and safe way?

 

Thanks!

Tom
Super User Tom
Super User

I agree with @Astounding that you need to realize that all variables that come from input datasets are retained.  That is what allows you to do a 1 to MANY merge.

 

But your description of your problem sounds different.

Can you make a small example?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2479 views
  • 0 likes
  • 3 in conversation