Solved: Re: joining on same variable date n key but different values

MV04 · Posted 07-26-2019 04:48 AM

Hi All ,

I have got two datasets as below

dataset 1:

id	amount	date
aa	50	06/06/2019
aa	10	07/06/2019
aa	20	08/06/2019
aa	40	09/06/2019
bb	10	03/06/2019
bb	20	04/06/2019
bb	30	08/06/2019

Dataset 2

id	amount	date	curr
aa	50	07/06/2019	gbp
aa	10	08/06/2019	gbp
aa	20	10/06/2019	gbp
aa	40	09/06/2019	gbp
aa	60	12/06/2019	gbp
bb	10	02/06/2019	gbp
bb	20	04/06/2019	gbp
bb	30	08/06/2019	gbp

I need all records from Datset1 basically a left join but when I am trying to join on Id and amount I am getting loads of dups

amount matches to the date but both datasets have got different dates.Therefore when I am trying to match them left join is making loads of dups.

I need an urgent help for this query .

Any any guidance will be much appreciated.

MV04 · Posted 07-26-2019 06:25 AM

Patrick I have sent a better data basicaly duplicate of amount on diff dates

View solution in original post

Patrick · Posted 07-26-2019 05:06 AM

If using your sample data there wouldn't be duplicates when joining over ID and Amount so first thing you should be doing is post representative sample data which showcase the issue.

"amount matches to the date but both datasets have got different dates"

Is there any logic to how the dates for "matching" rows differ? Can you describe in word the rules how you would need to join so that you don't hit many to many conditions.

MV04 · Posted 07-26-2019 05:15 AM

Thanks for the response Patrik,

Both datasets are from different sources one may be running on weekdays and getting dates as the process day and if it is a weekend or a bank holiday then dates are skipping those whereas other may be running on from tues- saturday something like that hence they are different so id and amount is same but for some records when ID and amount is same date is same as well whereas for others id and amont is same but date is different

MV04 · Posted 07-26-2019 05:16 AM

also I was joining on id ,amount as a left join but getting loads of dups

KachiM · Posted 07-26-2019 05:40 AM

I am not sure the following answers you. If not, show the expected output.


data want;
   if _n_ = 1 then do;
      if 0 then set d2;
      declare hash h(dataset:'d2', ordered:'Y');
      h.definekey('id','amount');
      h.definedata('id','amount','date','curr');
      h.definedone();
   end;
   set d1;
   if h.find() = 0 then output;
run;

MV04 · Posted 07-26-2019 06:07 AM

Hi Parick,

Below are the dataset

data d1;
infile cards dsd truncover ;
input id $ amount date:ddmmyy10.;
cards;
aa,50,06/06/2019
aa,50,07/06/2019
aa,20,08/06/2019
aa,40,09/06/2019
bb,10,03/06/2019
bb,20,04/06/2019
bb,30,08/06/2019

;run;
data d2;
infile cards dsd dlm= "," missover;
input id $ amount date:ddmmyy10. curr $;
cards;
aa,50,07/06/2019,gbp
aa,50,08/06/2019,gbp
aa,20,10/06/2019,gbp
aa,40,09/06/2019,gbp
aa,60,12/06/2019,gbp
bb,10,02/06/2019,gbp
bb,20,04/06/2019,gbp
bb,30,08/06/2019,gbp
;run;

and the output soule be

Patrick · Posted 07-26-2019 06:28 AM

@MV04

Looks like the picture for the expected result didn't make it. Can you please post it again.

Also: I suggest you mock-up a different value for curr for every single row (like: gpb1, gbp2,...) so it becomes obvious in your expected result which row from d2 you'd be joining to d1.

KachiM · Posted 07-26-2019 06:35 AM

In the hash solution replace:

h.definekey('id','amount');

BY

h.definekey('id','date');

See whether you get the required output.

Patrick · Posted 07-26-2019 07:31 AM

@KachiM

I believe you've missed that the dates can differ. If going for a hash solution you probably would need to load the hash with keys {id,amount} and multidata:'y' and then use the do_over() method to loop over a tuple and find the item where the dates between hash and base table differ the least.

KachiM · Posted 07-26-2019 09:33 AM

@Patrick

Thanks for pointing to my misunderstanding of the OPs specification. Hope your solution works for the OP.

Patrick · Posted 07-26-2019 07:24 AM

Below code works for your sample data. You will have to verify if that's also an appropriate solution for your real data.

data d1;
  infile cards dsd truncover;
  input id $ amount date:ddmmyy10.;
  format date ddmmyy10.;
  cards;
aa,50,06/06/2019
aa,50,07/06/2019
aa,20,08/06/2019
aa,40,09/06/2019
bb,10,03/06/2019
bb,20,04/06/2019
bb,30,08/06/2019
;
run;

data d2;
  infile cards dsd dlm= "," missover;
  input id $ amount date:ddmmyy10. curr $;
  cards;
aa,50,07/06/2019,gbp1
aa,50,08/06/2019,gbp2
aa,20,10/06/2019,gbp3
aa,40,09/06/2019,gbp4
aa,60,12/06/2019,gbp5
bb,10,02/06/2019,gbp6
bb,20,04/06/2019,gbp7
bb,30,08/06/2019,gbp8
;
run;

proc sql feedback;
/*  create table want as*/
    select h1.*, h2.curr
    from d1 h1 left join d2 h2
    on h1.id=h2.id and h1.amount=h2.amount and h1.date-h2.date between -1 and 1
    group by h1.id, h1.amount, h1.date
    having abs(h1.date-h2.date)=min(abs(h1.date-h2.date))
    order by h1.id, h1.date, h1.amount
  ;
quit;

If you also want a match for row 3 then increase the range of dates included in the join by modifying below condition.

h1.date-h2.date between -1 and 1

I'd keep the date range as narrow as possible to avoid false positive matches.

MV04 · Posted 07-26-2019 09:41 AM

how do increase the range

Patrick · Posted 07-26-2019 12:20 PM

@MV04 wrote:
how do increase the range

Below bit in the ON clause determines which dates get selected for joining. So here the dates from the two tables may only differ by 1 day.

h1.date-h2.date between -1 and 1

If you want to allow for 2 days then the condition would need to be:

h1.date-h2.date between -2 and 2

The HAVING clause then filters the result set after the join and selects the row where the difference between the two dates is minimal.

having abs(h1.date-h2.date)=min(abs(h1.date-h2.date))

Now.... That works with your sample data BUT if there is a case where in table D2 there isn't a date which matches exactly to D1 but there is a match both a day earlier and a day later then you still end up with duplicates. Question is: Which record in D2 should get selected logically? The earlier or the later one?

Patrick · Posted 07-26-2019 05:40 AM

@MV04 wrote:
also I was joining on id ,amount as a left join but getting loads of dups

Not with the sample data you've posted - that's why I'm asking for better sample data.

data have1;
  infile datalines truncover dsd;
  input id $ amount date :ddmmyy10.;
  datalines;
aa,50,6/06/2019,
aa,10,7/06/2019,
aa,20,8/06/2019,
aa,40,9/06/2019,
bb,10,3/06/2019,
bb,20,4/06/2019,
bb,30,8/06/2019,
;

data have2;
  infile datalines truncover dsd;
  input id $ amount date :ddmmyy10. curr $;
  datalines;
aa,50,7/06/2019,gbp
aa,10,8/06/2019,gbp
aa,20,10/06/2019,gbp
aa,40,9/06/2019,gbp
aa,60,12/06/2019,gbp
bb,10,2/06/2019,gbp
bb,20,4/06/2019,gbp
bb,30,8/06/2019,gbp
;

proc sql;
/*  create table want as*/
    select h1.*, h2.curr
    from have1 h1 left join have2 h2
    on h1.id=h2.id and h1.amount=h2.amount
  ;
quit;

MV04 · Posted 07-26-2019 06:25 AM

Patrick I have sent a better data basicaly duplicate of amount on diff dates

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!