Hi guys,
suppose to have the following datasets:
data DB;
input ID :$20. Admission :date09. Discharge :date09. Variable;
cards;
166 16FEB2019 26FEB2019 1
170 13JAN2017 25JAN2017 0
170 22FEB2017 07MAR2017 1
170 27APR2017 16MAY2017 1
170 30JAN2019 04MAR2019 0
313 03MAR2016 10MAR2016 1
;
run;
data DB1;
input ID :$20. Admission :date09. Discharge :date09. Variable_;
cards;
166 16FEB2019 26FEB2019 1
166 16FEB2019 26FEB2019 1
166 16FEB2019 26FEB2019 1
166 16FEB2019 26FEB2019 1
166 16FEB2019 26FEB2019 1
166 16FEB2019 26FEB2019 1
170 22FEB2017 07MAR2017 0
170 13JAN2017 25JAN2017 0
170 30JAN2019 04MAR2019 1
170 13JAN2017 25JAN2017 0
170 30JAN2019 04MAR2019 1
170 27APR2017 16MAY2017 1
170 27APR2017 16MAY2017 1
170 27APR2017 16MAY2017 1
170 27APR2017 16MAY2017 1
170 27APR2017 16MAY2017 1
313 03MAR2016 10MAR2016 0
313 03MAR2016 10MAR2016 0
313 03MAR2016 10MAR2016 0
;
run;
Is there a way to compare the Variable_ value in DB1 with Variable in DB at corresponding date and BY ID? As you can see in DB1 some dates are replicated and when it happens the value of Variable_ is also replicated so no chance to have 1 or 0 when dates are replicated but always 1 or always 0. the desired output should be:
data DB2;
input ID :$20. Admission :date09. Discharge :date09. Variable New_Var;
cards;
166 16FEB2019 26FEB2019 1 0
170 13JAN2017 25JAN2017 0 0
170 22FEB2017 07MAR2017 1 0
170 27APR2017 16MAY2017 1 0
170 30JAN2019 04MAR2019 0 1
313 03MAR2016 10MAR2016 1 0
;
run;
In other words a new variable should be created in DB, called "New_Var
" equal to 1 only if Variable_ = 1 in DB1 and Variable = 0 in DB.
Thank you in advance
Hi @NewUsrStat, it isn't immediately clear what your desired outcome should be. From your initial description, of wanting to compare values in different datasets, it seems like you're trying to do a merge by ID, admission and discharge date. So far so good, you can achieve that with a DATA step merge (of course, making sure data is pre-sorted):
data want;
merge db db1;
by id admission discharge;
format admission discharge date9.;
run;
However, you also mention that the output dataset should contain a new variable (which you don't provide an example for). In your last dataset DB2, you have examples that don't quite align with your previous observations in DB and DB1. For example, how did we get this observation in DB2?
166 16FEB2019 26FEB2019 1 0
This observation doesn't meet the requirement specified: only the case in which Variable_
=1 and Variable
=0.
Thank you very much for your feedback. The new variable Variable_ in DB should be added after the comparison variable (from DB) vs Variable_ (from DB2). Briefly I need to update the variable "Variable" in DB only when value is 1 in DB1. I though to add a new variable called "Variable_" in DB and after run an if statement. With the merge replicated rows appear.
I think it would be clearer if we used a new variable name, say New_Var. Is your requirement as follows:
Unfortunately your example output dataset (DB2) does not follow this requirement. Would you mind running through this requirement for IDs 166 and 170 to confirm. I'm getting different answers.
Example: pts 166: nothing changes (Variable_ and Variable both = 1) so in DB New_Var = 0, and so on until pts 170, 30JAN2019: Variable = 0 and Variable_ = 1 so New_Var = 1. There was a mistake in DB1. I'm really sorry.
Hahaha, ok that makes more sense. In that case, you just want to add an IF statement after your merge, specifying the condition:
if variable=0 and variable_=1 then new_var=1;
Which "corresponding date"? Both of your data sets have two date values. So there are potentially 4 pairs of values to be compared/matched on any given pair of observations.
Do yourself a favor and assign Formats to date values.
Now you get to explain why these values appear in the wanted set:
166 16FEB2019 26FEB2019 1 0 170 30JAN2019 04MAR2019 0 1
In the example set DB1 every value of variable+ for id 166 with dates of Admission =16FEB2019 and Discharge=26FEB2019 is 1.
For Id=170 Admission=30JAN2019 and Discharge=04MAR2019 are all 0.
IF your want data is incorrect and should have different values for variable_ then perhaps:
proc sort data=db; by id admission discharge; run; proc sort data=db1; by id admission discharge; run; data db2; update db (in=indb) db1 ; by id admission discharge; if indb; run;
UPDATE only works if the combination of the BY variables only occurs exactly one time in the first data set on the UPDATE statement. Data should be sorted by those variables.
What this does is take the value of Variable_ and add it to the master record in DB each time it appears in DB1.
IF you have multiple different values of Variable_ the last one in DB1 would be the result.
The IN=data set option creates a temporary variable that indicates the current observation comes from DB. So if there are ID and/or dates in DB1 that do not match one in DB they are not added to the desired output (The IF Indb; statement subsets data to only the ones that come from DB).
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.