Hi All ,
I have got two datasets as below
dataset 1:
id | amount | date |
aa | 50 | 06/06/2019 |
aa | 10 | 07/06/2019 |
aa | 20 | 08/06/2019 |
aa | 40 | 09/06/2019 |
bb | 10 | 03/06/2019 |
bb | 20 | 04/06/2019 |
bb | 30 | 08/06/2019 |
Dataset 2
id | amount | date | curr |
aa | 50 | 07/06/2019 | gbp |
aa | 10 | 08/06/2019 | gbp |
aa | 20 | 10/06/2019 | gbp |
aa | 40 | 09/06/2019 | gbp |
aa | 60 | 12/06/2019 | gbp |
bb | 10 | 02/06/2019 | gbp |
bb | 20 | 04/06/2019 | gbp |
bb | 30 | 08/06/2019 | gbp |
I need all records from Datset1 basically a left join but when I am trying to join on Id and amount I am getting loads of dups
amount matches to the date but both datasets have got different dates.Therefore when I am trying to match them left join is making loads of dups.
I need an urgent help for this query .
Any any guidance will be much appreciated.
If using your sample data there wouldn't be duplicates when joining over ID and Amount so first thing you should be doing is post representative sample data which showcase the issue.
"amount matches to the date but both datasets have got different dates"
Is there any logic to how the dates for "matching" rows differ? Can you describe in word the rules how you would need to join so that you don't hit many to many conditions.
Thanks for the response Patrik,
Both datasets are from different sources one may be running on weekdays and getting dates as the process day and if it is a weekend or a bank holiday then dates are skipping those whereas other may be running on from tues- saturday something like that hence they are different so id and amount is same but for some records when ID and amount is same date is same as well whereas for others id and amont is same but date is different
I am not sure the following answers you. If not, show the expected output.
data want;
if _n_ = 1 then do;
if 0 then set d2;
declare hash h(dataset:'d2', ordered:'Y');
h.definekey('id','amount');
h.definedata('id','amount','date','curr');
h.definedone();
end;
set d1;
if h.find() = 0 then output;
run;
Hi Parick,
Below are the dataset
data d1;
infile cards dsd truncover ;
input id $ amount date:ddmmyy10.;
cards;
aa,50,06/06/2019
aa,50,07/06/2019
aa,20,08/06/2019
aa,40,09/06/2019
bb,10,03/06/2019
bb,20,04/06/2019
bb,30,08/06/2019
;run;
data d2;
infile cards dsd dlm= "," missover;
input id $ amount date:ddmmyy10. curr $;
cards;
aa,50,07/06/2019,gbp
aa,50,08/06/2019,gbp
aa,20,10/06/2019,gbp
aa,40,09/06/2019,gbp
aa,60,12/06/2019,gbp
bb,10,02/06/2019,gbp
bb,20,04/06/2019,gbp
bb,30,08/06/2019,gbp
;run;
and the output soule be
Looks like the picture for the expected result didn't make it. Can you please post it again.
Also: I suggest you mock-up a different value for curr for every single row (like: gpb1, gbp2,...) so it becomes obvious in your expected result which row from d2 you'd be joining to d1.
In the hash solution replace:
h.definekey('id','amount');
BY
h.definekey('id','date');
See whether you get the required output.
I believe you've missed that the dates can differ. If going for a hash solution you probably would need to load the hash with keys {id,amount} and multidata:'y' and then use the do_over() method to loop over a tuple and find the item where the dates between hash and base table differ the least.
Thanks for pointing to my misunderstanding of the OPs specification. Hope your solution works for the OP.
Below code works for your sample data. You will have to verify if that's also an appropriate solution for your real data.
data d1;
infile cards dsd truncover;
input id $ amount date:ddmmyy10.;
format date ddmmyy10.;
cards;
aa,50,06/06/2019
aa,50,07/06/2019
aa,20,08/06/2019
aa,40,09/06/2019
bb,10,03/06/2019
bb,20,04/06/2019
bb,30,08/06/2019
;
run;
data d2;
infile cards dsd dlm= "," missover;
input id $ amount date:ddmmyy10. curr $;
cards;
aa,50,07/06/2019,gbp1
aa,50,08/06/2019,gbp2
aa,20,10/06/2019,gbp3
aa,40,09/06/2019,gbp4
aa,60,12/06/2019,gbp5
bb,10,02/06/2019,gbp6
bb,20,04/06/2019,gbp7
bb,30,08/06/2019,gbp8
;
run;
proc sql feedback;
/* create table want as*/
select h1.*, h2.curr
from d1 h1 left join d2 h2
on h1.id=h2.id and h1.amount=h2.amount and h1.date-h2.date between -1 and 1
group by h1.id, h1.amount, h1.date
having abs(h1.date-h2.date)=min(abs(h1.date-h2.date))
order by h1.id, h1.date, h1.amount
;
quit;
If you also want a match for row 3 then increase the range of dates included in the join by modifying below condition.
h1.date-h2.date between -1 and 1
I'd keep the date range as narrow as possible to avoid false positive matches.
@MV04 wrote:
how do increase the range
Below bit in the ON clause determines which dates get selected for joining. So here the dates from the two tables may only differ by 1 day.
h1.date-h2.date between -1 and 1
If you want to allow for 2 days then the condition would need to be:
h1.date-h2.date between -2 and 2
The HAVING clause then filters the result set after the join and selects the row where the difference between the two dates is minimal.
having abs(h1.date-h2.date)=min(abs(h1.date-h2.date))
Now.... That works with your sample data BUT if there is a case where in table D2 there isn't a date which matches exactly to D1 but there is a match both a day earlier and a day later then you still end up with duplicates. Question is: Which record in D2 should get selected logically? The earlier or the later one?
@MV04 wrote:
also I was joining on id ,amount as a left join but getting loads of dups
Not with the sample data you've posted - that's why I'm asking for better sample data.
data have1;
infile datalines truncover dsd;
input id $ amount date :ddmmyy10.;
datalines;
aa,50,6/06/2019,
aa,10,7/06/2019,
aa,20,8/06/2019,
aa,40,9/06/2019,
bb,10,3/06/2019,
bb,20,4/06/2019,
bb,30,8/06/2019,
;
data have2;
infile datalines truncover dsd;
input id $ amount date :ddmmyy10. curr $;
datalines;
aa,50,7/06/2019,gbp
aa,10,8/06/2019,gbp
aa,20,10/06/2019,gbp
aa,40,9/06/2019,gbp
aa,60,12/06/2019,gbp
bb,10,2/06/2019,gbp
bb,20,4/06/2019,gbp
bb,30,8/06/2019,gbp
;
proc sql;
/* create table want as*/
select h1.*, h2.curr
from have1 h1 left join have2 h2
on h1.id=h2.id and h1.amount=h2.amount
;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.