I have several datasets that contain the same information for three separate years, but the variable names are inconsistent from year to year. I need them all in one dataset with uniform variable names.
Originally, I was hoping to just use the rename= option, but several variables are defined as numeric in one dataset and character in another. So my solution was to list all the datasets in a single SET statement to combine them, and then sort out the variable names later in the same data step.
Here is an example which illustrates what I am trying to do, as well as the confusing output I am receiving:
--------------------------------------------------------------
data data1;
infile datalines;
input @1 name $15. @17 dob yymmdd8. @26 age;
datalines;
John Smith      19851225 25
Jack Bauer      19600704 50
Charlie Day     19791021 31
;
run;
data data2;
infile datalines;
input @1 name $15. @17 person_dob @26 person_age $2.;
datalines;
Patrick Stewart 19500406 60
Steve Jobs      19520115 58
Bill Gates      19510803 59
;
run;
data strange;
set data1 data2;
if dob=. then dob=input(strip(person_dob),yymmdd8.);
if age=. then age=input(person_age,8.);
*drop person_age person_dob;
run;
OUTPUT:
proc print data=strange;
run;
Obs    name                 dob    age    person_dob    person_age
  1    John Smith          9490     25             .
  2    Jack Bauer           185     50             .
  3    Charlie Day         7233     31             .
  4    Patrick Stewart    -3557     60      19500406    60
  5    Steve Jobs         -3557     60      19520115    58
  6    Bill Gates         -3557     60      19510803    59
-------------------------------------------------------------------------------
As you can see from the output, the values for "dob" and "age" are not populated as I expected for the observations that come from data2. It looks as though the IF conditions only succeed on the first of those observations (Obs 4): the new values are set and then retained, so the conditions fail on every later observation because the variables already have values.
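One way to check this reading is to print the variables to the log right after the SET statement, before the IF conditions run. The following is only a diagnostic sketch: it reuses data1 and data2 from the example above, writes no output dataset, and replaces strip() with put(person_dob, 8.) to make the numeric-to-character conversion explicit (that substitution is an assumption, not part of the original step).

```sas
/* Diagnostic sketch: show what each variable holds immediately after  */
/* SET reads an observation, before any IF condition is evaluated.     */
data _null_;
  set data1 data2;
  /* Written to the log once per observation */
  put _n_= dob= age= person_dob= person_age=;
  /* Same logic as the original step; put() makes the conversion explicit */
  if dob = . then dob = input(put(person_dob, 8.), yymmdd8.);
  if age = . then age = input(person_age, 8.);
run;
```

If dob and age show up as non-missing on observations 5 and 6, the conditions are being evaluated on every observation and are simply false because the values carried over from the previous iteration.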
I am confused as to why the values are being retained. I was under the impression that since the next observation being read in has missing values for these variables, they would not be retained. Am I thinking about this incorrectly? My guess is that it has to do with the variables already being present in one of the input datasets, but I would like to know what is actually happening.
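For context, SAS automatically retains variables that are read with a SET statement, and resets them to missing only when SET moves on to a different input dataset. Because dob and age exist in data1, they would be reset once at the first observation of data2 and then keep whatever is assigned to them. A possible workaround, sketched below under that assumption, is to base the conversion on which dataset the observation came from rather than on whether the target is missing. The dataset name combined and the flag from_data2 are hypothetical names chosen for illustration.

```sas
/* Sketch: convert unconditionally for every observation from data2,   */
/* so a value retained from the previous observation is overwritten.   */
data combined;
  /* in= creates a temporary 0/1 flag; from_data2 is a hypothetical name */
  set data1 data2(in=from_data2);
  if from_data2 then do;
    dob = input(put(person_dob, 8.), yymmdd8.);  /* numeric 19500406 -> SAS date */
    age = input(person_age, 8.);                 /* character '60'   -> numeric  */
  end;
  format dob date9.;   /* display dob as a date instead of a day count */
  drop person_dob person_age;
run;
```

Testing the in= flag avoids depending on the state of dob and age from the previous iteration, which is what appears to make the original IF conditions fail after Obs 4.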
Thanks,
Brian