BookmarkSubscribeRSS Feed
BrianDiamante
Calcite | Level 5
I have some datasets that contain the same information for three separate years, however, the variable names from year to year are inconsistent. I needed these all in one dataset with uniform variable names.

Originally, I was hoping to just use the rename= option, however there are several variables which are defined as numeric in one dataset and character in another. So my solution was to just put all the datasets in a SET statement to combine them and then sort out the variable names later in the same data step.

Here is an example which illustrates what I am trying to do, as well as the confusing output I am receiving:

--------------------------------------------------------------
data data1;
infile datalines;
input @1 name $15. @17 dob yymmdd8. @26 age;
datalines;
John Smith 19851225 25
Jack Bauer 19600704 50
Charlie Day 19791021 31
;
run;

data data2;
infile datalines;
input @1 name $15. @17 person_dob @26 person_age $2.;
datalines;
Patrick Stewart 19500406 60
Steve Jobs 19520115 58
Bill Gates 19510803 59
;
run;

data strange;
set data1 data2;

if dob=. then dob=input(strip(person_dob),yymmdd8.);
if age=. then age=input(person_age,8.);
*drop person_age person_dob;
run;

OUTPUT:
proc print data=strange;
run;


Obs name dob age person_dob person_ age

1 John Smith 9490 25 .
2 Jack Bauer 185 50 .
3 Charlie Day 7233 31 .
4 Patrick Stewart -3557 60 19500406 60
5 Steve Jobs -3557 60 19520115 58
6 Bill Gates -3557 60 19510803 59



-------------------------------------------------------------------------------


As you can see from the output, the values for "dob" and "age" do not populate as (I) expected. The if conditions are only being tested on the first observation. The new values are set and then retained so the if condition fails on all further observations because the variables all have values.

I am confused as to why the values are being retained. I was under the impression that since the next observation being read in had missing values for these variables, they would not be retained. Am I thinking about this incorrectly? My guess is that it has to to with the variable already being present in the dataset, however, I would like to know what is actually happening?

Thanks,
Brian
5 REPLIES 5
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
No input observation on WORK.DATA2 contributed a value for DOB so the PDV is not updated, and so you will have the most recent assigned value. My suggestion would be to test the "source" CHARACTER variable and consider renaming any incoming NUMERIC/CHARACTER variable so that it is unique. Or set DOB back to MISSING after doing an explicit OUTPUT in your last DATA step.

Scott Barry
SBBWorks, Inc.
data_null__
Jade | Level 19
An easy way to get this working is to do your own INIT to MISSING using CALL MISSING.

[pre]
data strange;
set data1 data2;
if dob=. then dob=input(strip(person_dob),yymmdd8.);
if age=. then age=input(person_age,8.);
*drop person_age person_dob;
output;
call missing(of _all_);

format dob yymmddn8.;
run;
[/pre]
BrianDiamante
Calcite | Level 5
Thanks guys.

I see... so because there is no DOB in work.data2, and I had already assigned a value to DOB for the first observation is keeps that value throughout. To clarify, if I was to create both dob and age in work.data2, and initialize them to missing, these values would then update the PDV and overwrite the current values of DOB and AGE as each observation is read in?

I went ahead and updated my program as per your suggestions.

Thanks again!
Brian
Cynthia_sas
SAS Super FREQ
Hi:
A good technique for diagnosing issues like the one you experienced is to use the PUT or PUTLOG statement to write selected variables from the Progarm Data Vector (PDV) as a debugging technique. For example, the log below shows the result of using PUTLOG statements to see what was happening in your program with the DOB and AGE variables:
[pre]
1671 ** diagnose initialization issues with 2 set statements;
1672 ** and see how using IN= might help;
1673 data diagnose;
1674 set data1(in=inone) data2(in=intwo);
1675 putlog '***';
1676 putlog '***** After SET';
1677 put _n_ inone= intwo= name= dob= age= person_dob= person_age= xdob= xage=;
1678
1679 if dob=. then dob=input(put(person_dob,8.),yymmdd8.);
1680 if age=. then age=input(person_age,8.);
1681
1682 *** make OTHER variables not in either dataset;
1683 *** XDOB and XAGE will ONLY be created when an obs comes;
1684 *** from DATA2 dataset.;
1685 if intwo then do;
1686 xdob=input(put(person_dob,8.),yymmdd8.);
1687 xage=input(person_age,8.);
1688 end;
1689
1690 putlog '***** After IF for DOB and AGE';
1691 put _n_ inone= intwo= name= dob= age= person_dob= person_age= xdob= xage=;
1692 run;

***
***** After SET
1 inone=1 intwo=0 name=John Smith dob=9490 age=25 person_dob=. person_age= xdob=. xage=.
***** After IF for DOB and AGE
1 inone=1 intwo=0 name=John Smith dob=9490 age=25 person_dob=. person_age= xdob=. xage=.
***
***** After SET
2 inone=1 intwo=0 name=Jack Bauer dob=185 age=50 person_dob=. person_age= xdob=. xage=.
***** After IF for DOB and AGE
2 inone=1 intwo=0 name=Jack Bauer dob=185 age=50 person_dob=. person_age= xdob=. xage=.
***
***** After SET
3 inone=1 intwo=0 name=Charlie Day dob=7233 age=31 person_dob=. person_age= xdob=. xage=.
***** After IF for DOB and AGE
3 inone=1 intwo=0 name=Charlie Day dob=7233 age=31 person_dob=. person_age= xdob=. xage=.
***
***** After SET
4 inone=0 intwo=1 name=Patrick Stewart dob=. age=. person_dob=19500406 person_age=60 xdob=. xage=.
***** After IF for DOB and AGE
4 inone=0 intwo=1 name=Patrick Stewart dob=-3557 age=60 person_dob=19500406 person_age=60 xdob=-3557 xage=60
***
***** After SET
5 inone=0 intwo=1 name=Steve Jobs dob=-3557 age=60 person_dob=19520115 person_age=58 xdob=. xage=.
***** After IF for DOB and AGE
5 inone=0 intwo=1 name=Steve Jobs dob=-3557 age=60 person_dob=19520115 person_age=58 xdob=-2908 xage=58
***
***** After SET
6 inone=0 intwo=1 name=Bill Gates dob=-3557 age=60 person_dob=19510803 person_age=59 xdob=. xage=.
***** After IF for DOB and AGE
6 inone=0 intwo=1 name=Bill Gates dob=-3557 age=60 person_dob=19510803 person_age=59 xdob=-3073 xage=59
NOTE: There were 3 observations read from the data set WORK.DATA1.
NOTE: There were 3 observations read from the data set WORK.DATA2.
NOTE: The data set WORK.DIAGNOSE has 6 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
[/pre]

In addition to creating XDOB and XAGE so you could see the difference in initialization for those two variables versus the way you did it (by reusing existing variables that came from DATA1, I also show the use of the IN= data set option so you might check it out as a different way to do your test. Finally, I have a simpler suggestion for how to do the creation of the variables where you need to do the conversion for DATA2 obs (shown in the program below). Also, I showed the use of the INPUT function with the PUT function to get rid of "automatic conversion" messages in the SAS log.

Alternate suggestion:
[pre]
data simpler;
set data1(in=inone) data2(in=intwo);
** only create dob and age for observations from DATA2;
** you avoid the issue of testing for missing dob and age because;
** you know that variables from DATA2 will ONLY fall into the IF condition.;
if intwo then do;
dob=input(put(person_dob,8.),yymmdd8.);
age=input(person_age,8.);
end;
*drop person_age person_dob;
run;

proc print data=simpler;
format dob yymmdd10.;
run;
[/pre]

cynthia
chang_y_chung_hotmail_com
Obsidian | Level 7
The basic problem is that two different operations (pre-processing of data2 variables and appending data2 to data1) are crammed together in a data step. IMHO, both the manual pdv refreshing and the conditional processing relying on the in= ds option variable seem contrived. (besides, the in= ds option for data1 is not used at all)

The pre-processing of the data2 should and can be separated without any performance penalty using a view. Things like this are what views are for.

[pre]
data data1;
infile datalines firstobs=2;
input @1 name $15. @17 dob yymmdd8. @26 age;
datalines;
----+----1----+----2----+-
John Smith 19851225 25
Jack Bauer 19600704 50
Charlie Day 19791021 31
;
run;

data data2;
infile datalines firstobs=2;
input @1 name $15. @17 person_dob @26 person_age $2.;
datalines;
----+----1----+----2----+-
Patrick Stewart 19500406 60
Steve Jobs 19520115 58
Bill Gates 19510803 59
;
run;

data view2/view=view2;
set data2;
dob = input(strip(person_dob), yymmdd8.);
format dob yymmdd8.;
age = input(person_age, best.);
keep name dob age;
run;

data both;
set data1 view2;
run;

/* check */
proc print data=both;
run;
/*
Obs name dob age
1 John Smith 85-12-25 25
2 Jack Bauer 60-07-04 50
3 Charlie Day 79-10-21 31
4 Patrick Stewart 50-04-06 60
5 Steve Jobs 52-01-15 58
6 Bill Gates 51-08-03 59
*/
[/pre]

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1710 views
  • 0 likes
  • 5 in conversation