Solved: Re: Subset rows based on another datasets and selected variables

NewUsrStat · Posted 06-28-2024 06:42 AM

Hi guys,

suppose to have the following two datasets:

data DB;
  input ID :$20. Admission :date09. Discharge :date09. Code;
  format Admission date9. Discharge date9.;
cards;
0001  17MAR2016 23MAR2016  486
0001  26MAR2016 04APR2016  4660
0001  03AUG2017 11AUG2017  49121
0001  03AUG2017 11AUG2017  485
0001  08FEB2018 08FEB2018  49122 
0001  08FEB2018 10FEB2018  49122
; 
run;


data DB1;
  input ID :$20. Admission :date09. Discharge :date09. Code;
  format Admission date9. Discharge date9.;
cards;
0001  17MAR2016 23MAR2016  486
0001  26MAR2016 04APR2016  4660
0001  03AUG2017 11AUG2017  49121
0001  03AUG2017 11AUG2017  485
0001  08FEB2018 10FEB2018  49122
; 
run;

I'm trying to make DB identical to DB1. For example the 5th row in DB should be removed because not present in DB1. I'm trying the following but without success:


proc sql;
   create table DBx as
   select * from DB
   where ID  in (select ID  from DB1) and
   		 Admission in (select Admission  from DB1) and 
		 Discharge in (select Discharge  from DB1) and 
         Code in (select Code from DB1) 
;
quit;

Can anyone help me please?

The original datasets have many more rows and columns. Only some rows are shown for simplicity and about other columns/variables they don't make a rule for the subset except: ID, Admission, Discharge and Code that I showed.

Thank you in advance!

PaigeMiller · Posted 06-28-2024 07:14 AM

proc sort data=db; 
    by id admission discharge code;
run;
proc sort data=db1; 
    by id admission discharge code; 
run;
data dbx;
    merge db(in=in1) db1(in=in2);
    by id admission discharge code;
    if in1 and in2;
run;

--
Paige Miller

View solution in original post

PaigeMiller · Posted 06-28-2024 06:50 AM

I don't think the explanation is complete. You say "I'm trying to make DB identical to DB1" but you have to tell us the logic to use to remove rows (and make any other changes that might be necessary) to make DB identical to DB1. We can't generalize from the very brief example you have shown us.

--
Paige Miller

NewUsrStat · Posted 06-28-2024 07:04 AM

I edited a little bit. So, basically, what not present in DB1 must be removed from DB. The 5th row in DB is not present in DB1 so must be removed.

PaigeMiller · Posted 06-28-2024 07:14 AM

proc sort data=db; 
    by id admission discharge code;
run;
proc sort data=db1; 
    by id admission discharge code; 
run;
data dbx;
    merge db(in=in1) db1(in=in2);
    by id admission discharge code;
    if in1 and in2;
run;

--
Paige Miller

PaigeMiller · Posted 06-28-2024 07:20 AM

I have to say that I am still not 100% sure I grasp what you are asking.

Here's what I think you want.

If a row is in DB and in DB1, keep it.

If a row is in DB and not in DB1, remove it.

If a row is not in DB and is in DB1 ... I don't know what you want.

--
Paige Miller

NewUsrStat · Posted 06-28-2024 07:24 AM

So there are no rows "not in DB but in DB1". DB1 is a subset of DB. Thank you very much for your help!!

Tom · Posted 06-28-2024 08:09 AM

You appear to be missing a dataset in your description. If you only have the two datasets why not just take the second one directly and ignore the first?

Is the issue that there is some third dataset that want to use to convert the first into the second?

If you want to compare the whole observation then you might just want to use PROC SQL's INTERSECT operation.

proc sql;
create table want as
select * from table1
intersect
select * from table2
;
quit;

If the issue is just to compare on only the key variables and not worry about other variables then perhaps you want to do data step merge. So if you want to keep all of the observations from DB1 whose key variable values do not appear in DB2 then the merge step might look like this:

data want;
  merge db1(in=in1) db2(in=in2 keep=id admission discharge code);
  by id admission discharge code;
  if in1 and in2;
run;

Catch up on SAS Innovate 2026