Solved: Re: Filter out dataset using month and year

nickspencer · Posted 01-18-2020 06:12 PM

Hi all,

I have two datasets with transaction data. I want to select the transactions that are present in the first dataset but not in the second one, matching by account and by month and year.

Dataset1:

acct_id date
1234 12dec2019
2345 12dec2019
3456 12dec2019
4467 12dec2019

dataset2:

Acct_id date
1234 01dec2019
2345 01dec2019
3456 21nov2019
4467 21nov2019

In the above datasets I want to remove acct_ids 1234 and 2345 from dataset1 (and create a new dataset), since they are already present in dataset2 for the same month and year. But I want to keep 3456 and 4467 from dataset1, since their dataset2 records are for November. There are a number of other variables in both datasets, but I want to compare only the account and the month/year, and create a new dataset from dataset1 based on dataset2.

What is the best way to achieve that? Any suggestion is highly appreciated.

Thanks!!

novinosrin · Posted 01-18-2020 06:20 PM

Hi @nickspencer It's fun in Proc SQL

data one;
input acct_id date :date9.;
format date date9.;
cards;
1234 12dec2019
2345 12dec2019
3456 12dec2019
4467 12dec2019
;


data two;
input acct_id date :date9.;
format date date9.;
cards;
1234 01dec2019
2345 01dec2019
3456 21nov2019
4467 21nov2019
;
proc sql;
create table want as
select a.*
from one a left join two b
on a.acct_id=b.acct_id and put(a.date,monyy7. -l)=put(b.date,monyy7. -l)
where put(a.date,monyy7. -l) ne put(b.date,monyy7. -l);
quit;

Actually better with INNER JOIN. Oops So sorry

proc sql;
create table want as
select a.*
from one a inner join two b
on a.acct_id=b.acct_id and put(a.date,monyy7. -l) ne put(b.date,monyy7. -l);
quit;


novinosrin · Posted 01-18-2020 06:37 PM


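data want ;
 if _n_=1 then do;
   /* Load every (acct_id, month-year) pair from TWO into a hash table */
   dcl hash H () ;
   h.definekey  ("acct_id","d") ;
   h.definedone () ;
   do until(z);
    set two end=z;
    d=put(date,monyy7. -l);
    h.ref();
   end;
 end;
 set one;
 /* Keep the row from ONE only when its pair is absent from TWO */
 if h.check(key:acct_id,key:put(date,monyy7. -l)) ne 0;
 drop d;
run;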

nickspencer · Posted 01-18-2020 08:19 PM

@novinosrin This is perfect. But I also want to include the accounts from dataset1 that are not present in dataset2 for the month. Will the inner join still work for an account that is present in dataset1 but not in dataset2, if I want that account included in the table want?

novinosrin · Posted 01-18-2020 08:37 PM

Thank you @nickspencer for clarifying. Please ignore the INNER JOIN and stick to the LEFT JOIN, the 1st one. I'm glad my initial thought was right. Have a good one!
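
For readers who find the LEFT JOIN with a WHERE on the formatted dates hard to read, the same "in ONE but not in TWO for that account and month" rule can probably also be written with NOT EXISTS. This is only a sketch, not something posted in the thread: it assumes the ONE and TWO datasets created above, and it compares months with INTNX (which returns the first day of the month when the increment is 0) instead of comparing MONYY7. strings.

```sas
proc sql;
  create table want as
  select a.*
  from one a
  /* keep a row from ONE only if TWO has no row for the
     same account in the same calendar month */
  where not exists
    (select *
     from two b
     where b.acct_id = a.acct_id
       and intnx('month', b.date, 0) = intnx('month', a.date, 0));
quit;
```

With the sample data this should keep 3456 and 4467, and it also keeps any account in ONE that never appears in TWO, which is the case asked about above.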

mkeintz · Posted 01-18-2020 08:36 PM

Assuming ONE and TWO are sorted by ID/DATE:

data one;
input acct_id date :date9.;
format date date9.;
cards;
1234 12dec2019
2345 12dec2019
3456 12dec2019
4467 12dec2019
;


data two;
input acct_id date :date9.;
format date date9.;
cards;
1234 01dec2019
2345 01dec2019
3456 21nov2019
4467 21nov2019
;


data want;
  set two (in=in2) one ;
  by acct_id;

  array _cal {2015:2020,12} _temporary_;
  if first.acct_id then call missing(of _cal{*});
  if in2 then _cal{year(date),month(date)}=1;
  else if _cal{year(date),month(date)}^=1 then output;
run;

Just make sure the _CAL matrix has upper and lower bounds to cover the time span in your data set.
The program reads all the cases for a given ID in data set TWO, and sets the matrix accordingly. Then it reads all the cases for the same ID in data set ONE, and examines the matrix to determine whether to output.
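
If you would rather not hard-code the bounds, one possible approach is to read the minimum and maximum year from the data first and pass them in as macro variables. This is a sketch under some assumptions: the macro variable names minyr and maxyr are made up for illustration, the TRIMMED option needs SAS 9.3 or later, and DATE is assumed to be non-missing in both datasets (a missing date would give an invalid array subscript, just as in the program above).

```sas
/* Find the earliest and latest calendar year across ONE and TWO */
proc sql noprint;
  select min(year(date)), max(year(date))
    into :minyr trimmed, :maxyr trimmed
  from (select date from one
        union all
        select date from two);
quit;

data want;
  set two (in=in2) one;
  by acct_id;

  /* Bounds now come from the data instead of the hard-coded 2015:2020 */
  array _cal {&minyr:&maxyr,12} _temporary_;
  if first.acct_id then call missing(of _cal{*});
  if in2 then _cal{year(date),month(date)}=1;
  else if _cal{year(date),month(date)}^=1 then output;
run;
```

The extra pass over both datasets costs some time, so for large inputs with a known date range the fixed bounds may still be the simpler choice.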

