DATA Step, Macro, Functions and more

Excluding people based on rows in a second dataset

Reply
Frequent Contributor
Posts: 138

Excluding people based on rows in a second dataset

Hi,

 

I have two datasets. Dataset1 is at the person-level and has ID as well as some categorical variables. Dataset2 is at the person-event level and has a matching ID variable and an event date variable that is month/year.

 

Dataset 1

ID  categ_var1...categ_var5

1     abc                ghi

2     def                 jkl

 

Dataset 2

ID     event_date

1       Apr 2012

1       May 2012

1       Jun 2012

1       Aug 2012

 

Some people from Dataset 1 will not be in Dataset 2 at all. What I want is to keep everyone from Dataset 1 that has events in April, May, and June 2012 in Dataset 2 (so ID=1 would stay, ID=2 would not as they were not found in Dataset 2).

 

Any help is much appreciated.

 

 

Frequent Contributor
Posts: 130

Re: Excluding people based on rows in a second dataset

Here's a way to do it using a subquery in a SQL procedure:

 

 

data dataset1;
input ID$ categ_var1$ categ_var2$ categ_var3$ categ_var4$ categ_var5$;
datalines;
1 abc ghi mno st wx
2 def jkl pqr uv yz
;
run;

data dataset2;
input ID$ event_date;
informat ID $3. event_date mmddyy10.;
format ID $3. event_date mmddyy10.;
datalines;
1 04/01/2012
1 05/01/2012
1 06/01/2012
1 08/01/2012
;
run;

proc sql;
create table want as
select A.*
from dataset1 as A
Where A.ID in (select distinct ID 
	       from dataset2 as B
	       Where year(B.event_date)=2012 AND month(B.event_date) in (4,5,6));
quit;

 

Super User
Posts: 5,083

Re: Excluding people based on rows in a second dataset

[ Edited ]

The exact syntax will depend on whether EVENT_DATE is text vs. a SAS date.  You could use the SQL approach already suggested (but with modifications for the EVENT_DATE selection if EVENT_DATE happens to be text).  It's also straightforward to sort and merge.  Assuming both data sets are sorted by ID:

 

data want;

merge dataset1 (in=in1) dataset2 (in=in2);

by id;

if in1 and in2;

/* For a text EVENT_DATE subset as follows */

if event_date in ('Apr 2012', 'May 2012', 'Jun 2012');

/* For a SAS date version of EVENT_DATE subset as follows */

if '01apr2012'd <= event_date <= '30jun2012'd;

run;

 

Things to consider:  Do you want to drop EVENT_DATE from the final data set?  Could there be variations in spelling, such as Apr, apr, and APR?

Super Contributor
Posts: 297

Re: Excluding people based on rows in a second dataset

You could also use a hash table if the lookup dataset can fit into memory.

 

DATA WANT;
	IF _N_ = 1 THEN DO;
		DECLARE HASH H(DATASET:"WORK.DS2 (WHERE = (EVENT_DATE IN ('APRIL 2012', 'MAY 2012', 'JUNE 2012')))");
		H.DEFINEKEY("ID");
		H.DEFINEDONE();
	END;

	SET WORK.DS1;

	RC = H.CHECK();

	IF RC = 0 THEN OUTPUT;
RUN;

 

In the above example we are subsetting the lookup data by the event_dates.  This is only applicable to SAS 9.2 and above.

Ask a Question
Discussion stats
  • 3 replies
  • 212 views
  • 0 likes
  • 4 in conversation