DATA Step, Macro, Functions and more

Implied retain in SET statement?

Reply
New Contributor
Posts: 4

Implied retain in SET statement?

I have some datasets that contain the same information for three separate years, however, the variable names from year to year are inconsistent. I needed these all in one dataset with uniform variable names.

Originally, I was hoping to just use the rename= option, however there are several variables which are defined as numeric in one dataset and character in another. So my solution was to just put all the datasets in a SET statement to combine them and then sort out the variable names later in the same data step.

Here is an example which illustrates what I am trying to do, as well as the confusing output I am receiving:

--------------------------------------------------------------
data data1;
infile datalines;
input @1 name $15. @17 dob yymmdd8. @26 age;
datalines;
John Smith 19851225 25
Jack Bauer 19600704 50
Charlie Day 19791021 31
;
run;

data data2;
infile datalines;
input @1 name $15. @17 person_dob @26 person_age $2.;
datalines;
Patrick Stewart 19500406 60
Steve Jobs 19520115 58
Bill Gates 19510803 59
;
run;

data strange;
set data1 data2;

if dob=. then dob=input(strip(person_dob),yymmdd8.);
if age=. then age=input(person_age,8.);
*drop person_age person_dob;
run;

OUTPUT:
proc print data=strange;
run;


Obs name dob age person_dob person_ age

1 John Smith 9490 25 .
2 Jack Bauer 185 50 .
3 Charlie Day 7233 31 .
4 Patrick Stewart -3557 60 19500406 60
5 Steve Jobs -3557 60 19520115 58
6 Bill Gates -3557 60 19510803 59



-------------------------------------------------------------------------------


As you can see from the output, the values for "dob" and "age" do not populate as (I) expected. The if conditions are only being tested on the first observation. The new values are set and then retained so the if condition fails on all further observations because the variables all have values.

I am confused as to why the values are being retained. I was under the impression that since the next observation being read in had missing values for these variables, they would not be retained. Am I thinking about this incorrectly? My guess is that it has to to with the variable already being present in the dataset, however, I would like to know what is actually happening?

Thanks,
Brian
Super Contributor
Super Contributor
Posts: 3,174

Re: Implied retain in SET statement?

No input observation on WORK.DATA2 contributed a value for DOB so the PDV is not updated, and so you will have the most recent assigned value. My suggestion would be to test the "source" CHARACTER variable and consider renaming any incoming NUMERIC/CHARACTER variable so that it is unique. Or set DOB back to MISSING after doing an explicit OUTPUT in your last DATA step.

Scott Barry
SBBWorks, Inc.
Respected Advisor
Posts: 3,777

Re: Implied retain in SET statement?

An easy way to get this working is to do your own INIT to MISSING using CALL MISSING.

[pre]
data strange;
set data1 data2;
if dob=. then dob=input(strip(person_dob),yymmdd8.);
if age=. then age=input(person_age,8.);
*drop person_age person_dob;
output;
call missing(of _all_);

format dob yymmddn8.;
run;
[/pre]
New Contributor
Posts: 4

Re: Implied retain in SET statement?

Thanks guys.

I see... so because there is no DOB in work.data2, and I had already assigned a value to DOB for the first observation is keeps that value throughout. To clarify, if I was to create both dob and age in work.data2, and initialize them to missing, these values would then update the PDV and overwrite the current values of DOB and AGE as each observation is read in?

I went ahead and updated my program as per your suggestions.

Thanks again!
Brian
SAS Super FREQ
Posts: 8,743

Re: Implied retain in SET statement?

Hi:
A good technique for diagnosing issues like the one you experienced is to use the PUT or PUTLOG statement to write selected variables from the Progarm Data Vector (PDV) as a debugging technique. For example, the log below shows the result of using PUTLOG statements to see what was happening in your program with the DOB and AGE variables:
[pre]
1671 ** diagnose initialization issues with 2 set statements;
1672 ** and see how using IN= might help;
1673 data diagnose;
1674 set data1(in=inone) data2(in=intwo);
1675 putlog '***';
1676 putlog '***** After SET';
1677 put _n_ inone= intwo= name= dob= age= person_dob= person_age= xdob= xage=;
1678
1679 if dob=. then dob=input(put(person_dob,8.),yymmdd8.);
1680 if age=. then age=input(person_age,8.);
1681
1682 *** make OTHER variables not in either dataset;
1683 *** XDOB and XAGE will ONLY be created when an obs comes;
1684 *** from DATA2 dataset.;
1685 if intwo then do;
1686 xdob=input(put(person_dob,8.),yymmdd8.);
1687 xage=input(person_age,8.);
1688 end;
1689
1690 putlog '***** After IF for DOB and AGE';
1691 put _n_ inone= intwo= name= dob= age= person_dob= person_age= xdob= xage=;
1692 run;

***
***** After SET
1 inone=1 intwo=0 name=John Smith dob=9490 age=25 person_dob=. person_age= xdob=. xage=.
***** After IF for DOB and AGE
1 inone=1 intwo=0 name=John Smith dob=9490 age=25 person_dob=. person_age= xdob=. xage=.
***
***** After SET
2 inone=1 intwo=0 name=Jack Bauer dob=185 age=50 person_dob=. person_age= xdob=. xage=.
***** After IF for DOB and AGE
2 inone=1 intwo=0 name=Jack Bauer dob=185 age=50 person_dob=. person_age= xdob=. xage=.
***
***** After SET
3 inone=1 intwo=0 name=Charlie Day dob=7233 age=31 person_dob=. person_age= xdob=. xage=.
***** After IF for DOB and AGE
3 inone=1 intwo=0 name=Charlie Day dob=7233 age=31 person_dob=. person_age= xdob=. xage=.
***
***** After SET
4 inone=0 intwo=1 name=Patrick Stewart dob=. age=. person_dob=19500406 person_age=60 xdob=. xage=.
***** After IF for DOB and AGE
4 inone=0 intwo=1 name=Patrick Stewart dob=-3557 age=60 person_dob=19500406 person_age=60 xdob=-3557 xage=60
***
***** After SET
5 inone=0 intwo=1 name=Steve Jobs dob=-3557 age=60 person_dob=19520115 person_age=58 xdob=. xage=.
***** After IF for DOB and AGE
5 inone=0 intwo=1 name=Steve Jobs dob=-3557 age=60 person_dob=19520115 person_age=58 xdob=-2908 xage=58
***
***** After SET
6 inone=0 intwo=1 name=Bill Gates dob=-3557 age=60 person_dob=19510803 person_age=59 xdob=. xage=.
***** After IF for DOB and AGE
6 inone=0 intwo=1 name=Bill Gates dob=-3557 age=60 person_dob=19510803 person_age=59 xdob=-3073 xage=59
NOTE: There were 3 observations read from the data set WORK.DATA1.
NOTE: There were 3 observations read from the data set WORK.DATA2.
NOTE: The data set WORK.DIAGNOSE has 6 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
[/pre]

In addition to creating XDOB and XAGE so you could see the difference in initialization for those two variables versus the way you did it (by reusing existing variables that came from DATA1, I also show the use of the IN= data set option so you might check it out as a different way to do your test. Finally, I have a simpler suggestion for how to do the creation of the variables where you need to do the conversion for DATA2 obs (shown in the program below). Also, I showed the use of the INPUT function with the PUT function to get rid of "automatic conversion" messages in the SAS log.

Alternate suggestion:
[pre]
data simpler;
set data1(in=inone) data2(in=intwo);
** only create dob and age for observations from DATA2;
** you avoid the issue of testing for missing dob and age because;
** you know that variables from DATA2 will ONLY fall into the IF condition.;
if intwo then do;
dob=input(put(person_dob,8.),yymmdd8.);
age=input(person_age,8.);
end;
*drop person_age person_dob;
run;

proc print data=simpler;
format dob yymmdd10.;
run;
[/pre]

cynthia
Regular Contributor
Posts: 241

Re: Implied retain in SET statement?

The basic problem is that two different operations (pre-processing of data2 variables and appending data2 to data1) are crammed together in a data step. IMHO, both the manual pdv refreshing and the conditional processing relying on the in= ds option variable seem contrived. (besides, the in= ds option for data1 is not used at all)

The pre-processing of the data2 should and can be separated without any performance penalty using a view. Things like this are what views are for.

[pre]
data data1;
infile datalines firstobs=2;
input @1 name $15. @17 dob yymmdd8. @26 age;
datalines;
----+----1----+----2----+-
John Smith 19851225 25
Jack Bauer 19600704 50
Charlie Day 19791021 31
;
run;

data data2;
infile datalines firstobs=2;
input @1 name $15. @17 person_dob @26 person_age $2.;
datalines;
----+----1----+----2----+-
Patrick Stewart 19500406 60
Steve Jobs 19520115 58
Bill Gates 19510803 59
;
run;

data view2/view=view2;
set data2;
dob = input(strip(person_dob), yymmdd8.);
format dob yymmdd8.;
age = input(person_age, best.);
keep name dob age;
run;

data both;
set data1 view2;
run;

/* check */
proc print data=both;
run;
/*
Obs name dob age
1 John Smith 85-12-25 25
2 Jack Bauer 60-07-04 50
3 Charlie Day 79-10-21 31
4 Patrick Stewart 50-04-06 60
5 Steve Jobs 52-01-15 58
6 Bill Gates 51-08-03 59
*/
[/pre]
Ask a Question
Discussion stats
  • 5 replies
  • 246 views
  • 0 likes
  • 5 in conversation