BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MV04
Obsidian | Level 7

Hi All ,

I have got two datasets as below 

dataset 1:

id

amountdate
aa
50
06/06/2019
aa
10
07/06/2019
aa
20
08/06/2019
aa
40
09/06/2019
bb
10
03/06/2019
bb
20
04/06/2019
bb
30
08/06/2019

 

Dataset 2

idamountdatecurr
aa
50
07/06/2019
gbp
aa
10
08/06/2019
gbp
aa
20
10/06/2019
gbp
aa
40
09/06/2019
gbp
aa
60
12/06/2019
gbp
bb
10
02/06/2019
gbp
bb
20
04/06/2019
gbp
bb
30
08/06/2019
gbp

 

 

I need all records from Datset1 basically a left join but when I am trying to join on Id and amount I am getting loads of dups 

amount matches to the date but both datasets have got different dates.Therefore when I am trying to match  them left join is making loads of dups.

I need an urgent help for this query .

Any any guidance will be much appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
MV04
Obsidian | Level 7
Patrick I have sent a better data basicaly duplicate of amount on diff dates

View solution in original post

16 REPLIES 16
Patrick
Opal | Level 21

If using your sample data there wouldn't be duplicates when joining over ID and Amount so first thing you should be doing is post representative sample data which showcase the issue. 

 

"amount matches to the date but both datasets have got different dates"

Is there any logic to how the dates for "matching" rows differ? Can you describe in word the rules how you would need to join so that you don't hit many to many conditions.

MV04
Obsidian | Level 7

Thanks for the response Patrik,

 

Both datasets are from different sources one may be running on weekdays and getting dates as the process day and if it is a weekend or a bank holiday then dates are skipping those  whereas other may be running on from tues- saturday something like that hence they are different so id and amount is same but for some records when ID and amount is same date is same as well whereas for others id and amont is same but date is different 

MV04
Obsidian | Level 7
also I was joining on id ,amount as a left join but getting loads of dups
KachiM
Rhodochrosite | Level 12

I am not sure the following answers you. If not, show the expected output.

 


data want;
   if _n_ = 1 then do;
      if 0 then set d2;
      declare hash h(dataset:'d2', ordered:'Y');
      h.definekey('id','amount');
      h.definedata('id','amount','date','curr');
      h.definedone();
   end;
   set d1;
   if h.find() = 0 then output;
run;
MV04
Obsidian | Level 7

Hi Parick,

 

Below are the dataset 

 

data d1;
infile cards dsd truncover ;
input id $ amount date:ddmmyy10.;
cards;
aa,50,06/06/2019
aa,50,07/06/2019
aa,20,08/06/2019
aa,40,09/06/2019
bb,10,03/06/2019
bb,20,04/06/2019
bb,30,08/06/2019


;run;
data d2;
infile cards dsd dlm= "," missover;
input id $ amount date:ddmmyy10. curr $;
cards;
aa,50,07/06/2019,gbp
aa,50,08/06/2019,gbp
aa,20,10/06/2019,gbp
aa,40,09/06/2019,gbp
aa,60,12/06/2019,gbp
bb,10,02/06/2019,gbp
bb,20,04/06/2019,gbp
bb,30,08/06/2019,gbp
;run;

 

and the output soule be 

Patrick
Opal | Level 21

@MV04 

Looks like the picture for the expected result didn't make it. Can you please post it again. 

Also: I suggest you mock-up a different value for curr for every single row (like: gpb1, gbp2,...) so it becomes obvious in your expected result which row from d2 you'd be joining to d1.

KachiM
Rhodochrosite | Level 12

In the hash solution replace:

 

h.definekey('id','amount');

BY

 

h.definekey('id','date');

See whether you get the required output.

Patrick
Opal | Level 21

@KachiM 

I believe you've missed that the dates can differ. If going for a hash solution you probably would need to load the hash with keys {id,amount} and multidata:'y' and then use the do_over() method to loop over a tuple and find the item where the dates between hash and base table differ the least.

KachiM
Rhodochrosite | Level 12

@Patrick 

 

Thanks for pointing to my misunderstanding of the OPs specification. Hope your solution works for the OP.

Patrick
Opal | Level 21

Below code works for your sample data. You will have to verify if that's also an appropriate solution for your real data.

 

data d1;
  infile cards dsd truncover;
  input id $ amount date:ddmmyy10.;
  format date ddmmyy10.;
  cards;
aa,50,06/06/2019
aa,50,07/06/2019
aa,20,08/06/2019
aa,40,09/06/2019
bb,10,03/06/2019
bb,20,04/06/2019
bb,30,08/06/2019
;
run;

data d2;
  infile cards dsd dlm= "," missover;
  input id $ amount date:ddmmyy10. curr $;
  cards;
aa,50,07/06/2019,gbp1
aa,50,08/06/2019,gbp2
aa,20,10/06/2019,gbp3
aa,40,09/06/2019,gbp4
aa,60,12/06/2019,gbp5
bb,10,02/06/2019,gbp6
bb,20,04/06/2019,gbp7
bb,30,08/06/2019,gbp8
;
run;

proc sql feedback;
/*  create table want as*/
    select h1.*, h2.curr
    from d1 h1 left join d2 h2
    on h1.id=h2.id and h1.amount=h2.amount and h1.date-h2.date between -1 and 1
    group by h1.id, h1.amount, h1.date
    having abs(h1.date-h2.date)=min(abs(h1.date-h2.date))
    order by h1.id, h1.date, h1.amount
  ;
quit;

Capture.JPG

 

If you also want a match for row 3 then increase the range of dates included in the join by modifying below condition.

h1.date-h2.date between -1 and 1

I'd keep the date range as narrow as possible to avoid false positive matches.

 

MV04
Obsidian | Level 7
how do increase the range
Patrick
Opal | Level 21

@MV04 wrote:
how do increase the range

Below bit in the ON clause determines which dates get selected for joining. So here the dates from the two tables may only differ by 1 day.

h1.date-h2.date between -1 and 1

 If you want to allow for 2 days then the condition would need to be:

h1.date-h2.date between -2 and 2

The HAVING clause then filters the result set after the join and selects the row where the difference between the two dates is minimal. 

having abs(h1.date-h2.date)=min(abs(h1.date-h2.date))

Now.... That works with your sample data BUT if there is a case where in table D2 there isn't a date which matches exactly to D1 but there is a match both a day earlier and a day later then you still end up with duplicates. Question is: Which record in D2 should get selected logically? The earlier or the later one?

Patrick
Opal | Level 21

@MV04 wrote:
also I was joining on id ,amount as a left join but getting loads of dups

Not with the sample data you've posted - that's why I'm asking for better sample data.

data have1;
  infile datalines truncover dsd;
  input id $ amount date :ddmmyy10.;
  datalines;
aa,50,6/06/2019,
aa,10,7/06/2019,
aa,20,8/06/2019,
aa,40,9/06/2019,
bb,10,3/06/2019,
bb,20,4/06/2019,
bb,30,8/06/2019,
;

data have2;
  infile datalines truncover dsd;
  input id $ amount date :ddmmyy10. curr $;
  datalines;
aa,50,7/06/2019,gbp
aa,10,8/06/2019,gbp
aa,20,10/06/2019,gbp
aa,40,9/06/2019,gbp
aa,60,12/06/2019,gbp
bb,10,2/06/2019,gbp
bb,20,4/06/2019,gbp
bb,30,8/06/2019,gbp
;

proc sql;
/*  create table want as*/
    select h1.*, h2.curr
    from have1 h1 left join have2 h2
    on h1.id=h2.id and h1.amount=h2.amount
  ;
quit;

Capture.JPG

MV04
Obsidian | Level 7
Patrick I have sent a better data basicaly duplicate of amount on diff dates

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1612 views
  • 0 likes
  • 3 in conversation