Hi,
I have dataset in following format
data have;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 101; source_id = 2;dest_id = 3;output;
store_id = 102; source_id = 1;dest_id = 4;output;
store_id = 102; source_id = 2;dest_id = 4;output;
run;
and my requirement is to take out all duplicates where source or dest id is already taken. so final dataset should be like this
data want;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 102; source_id = 2;dest_id = 4;output;
run;
Logic:-
1) First record doesn't have any previous record to compare so will be there in final dataset.
2) second record should be removed as dest_id = 3 is taken by first observation.
3) third record should also be removed as source id = 1 is already taken by first observation.
4) should keep fourth observation as source and dest id are not taken previously.
I have millions of observation and different combination (source and dest id is not 1 and 2 but can have as many as digit ).
Thanks in advance for help.
Thanks,
Hash comes in handy:
data want;
if _n_=1 then do;
dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;
set have;
if h.check(key:source_id) and h.check(key:dest_id) then output;
rc=h.ref(key:source_id,data:source_id);
rc=h.ref(key:dest_id,data:dest_id);
drop id rc;
run;
More Examples:-
data have;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 101; source_id = 2;dest_id = 3;output;
store_id = 102; source_id = 1;dest_id = 4;output;
store_id = 102; source_id = 2;dest_id = 4;output;
store_id = 103; source_id = 5;dest_id = 6;output;
store_id = 103; source_id = 5;dest_id = 7;output;
store_id = 103; source_id = 8;dest_id = 10;output;
store_id = 103; source_id = 9;dest_id = 10;output;
run;
data want;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 102; source_id = 2;dest_id = 4;output;
store_id = 103; source_id = 5;dest_id = 6;output;
store_id = 103; source_id = 8;dest_id = 10;output;
run;
Hash comes in handy:
data want;
if _n_=1 then do;
dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;
set have;
if h.check(key:source_id) and h.check(key:dest_id) then output;
rc=h.ref(key:source_id,data:source_id);
rc=h.ref(key:dest_id,data:dest_id);
drop id rc;
run;
@Haikuo I think the forum garbled your code a bit?
Thanks Haikuo. I just modified little bit and it's working. Thanks for solution. will implement to real data and let you know reult.
data want;
if _n_=1 then do;
dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;
set have;
if h.check(key:source_id) and h.check(key:dest_id) then do ;output;
rc=h.ref(key:source_id,data:source_id);
rc=h.ref(key:dest_id,data:dest_id);
drop id rc;
end;
run;
Don't think your version is working. Noticing the 'And' in the following code, so if you have a 'Or', you will be skipping those 'id's that need to be uploaded into Hash.
if h.check(key:source_id) and h.check(key:dest_id)
Edit:
Maybe I have mistaken your intention, so do please do ignore my comments above if it is the case.
Yes, @Reeza. It has been doing this to all of the hash code from day one. I have no idea when and if it can be fixed :(.
Yes, I need to skip id's to load into hash if match is not found so added do loop condition to your code so hash object only updated when record is written to output. It's what I wanted.
Even I implemented it to real data and it's working fine. Didn't get any unexpected error or result.
Thanks Haikuo for your support.
Thanks,
This one seems to be working too!
data have_not (drop=one_source_id one_dest_id);
retain one_source_id one_dest_id;
set have;
if _N_=1 then
do;
one_source_id=source_id;
one_dest_id=dest_id;
end;
else
do;
if source_id = one_source_id then delete;
if dest_id=one_dest_id then delete;
if source_id ne lag(source_id) then one_source_id=source_id;
if dest_id ne lag(dest_id) then one_dest_id=dest_id;
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.