DATA Step, Macro, Functions and more

Difficult to Match and Remove observation based on combination of two variables

Accepted Solution Solved
Reply
Contributor
Posts: 30
Accepted Solution

Difficult to Match and Remove observation based on combination of two variables

[ Edited ]

Hi,

 

I have dataset in following format 

 

 

data have;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 101; source_id = 2;dest_id = 3;output;
store_id = 102; source_id = 1;dest_id = 4;output;
store_id = 102; source_id = 2;dest_id = 4;output;
run;

 

 

and my requirement is to take out all duplicates where source or dest id is already taken. so final dataset should be like this 

 

 

data want;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 102; source_id = 2;dest_id = 4;output;
run;

 

 

Logic:- 

 

1) First record doesn't have any previous record to compare so will be there in final dataset.

2) second record should be removed as dest_id = 3 is taken by first observation.

3) third record should also be removed as source id = 1 is already taken by first observation.

4) should keep fourth observation as source and dest id are not taken previously.

 

I have millions of observation and different combination (source and dest id is not 1 and 2 but can have as many as digit ).

 

Thanks in advance for help.

 

Thanks,

 

 


Accepted Solutions
Solution
‎04-21-2016 08:24 PM
Respected Advisor
Posts: 3,156

Re: Match on two variable

[ Edited ]

Hash comes in handy:

data want;
if _n_=1 then do;

dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;

set have;
if h.check(key:source_id) and h.check(key:dest_id) then output;
rc=h.ref(key:source_id,data:source_id);
rc=h.ref(key:dest_id,data:dest_id);
drop id rc;
run;
 

View solution in original post


All Replies
Contributor
Posts: 30

Re: Match on two variable

More Examples:-

 

data have;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 101; source_id = 2;dest_id = 3;output;
store_id = 102; source_id = 1;dest_id = 4;output;
store_id = 102; source_id = 2;dest_id = 4;output;
store_id = 103; source_id = 5;dest_id = 6;output;
store_id = 103; source_id = 5;dest_id = 7;output;
store_id = 103; source_id = 8;dest_id = 10;output;
store_id = 103; source_id = 9;dest_id = 10;output;

run;

data want;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 102; source_id = 2;dest_id = 4;output;
store_id = 103; source_id = 5;dest_id = 6;output;
store_id = 103; source_id = 8;dest_id = 10;output;
run;
Solution
‎04-21-2016 08:24 PM
Respected Advisor
Posts: 3,156

Re: Match on two variable

[ Edited ]

Hash comes in handy:

data want;
if _n_=1 then do;

dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;

set have;
if h.check(key:source_id) and h.check(key:dest_id) then output;
rc=h.ref(key:source_id,data:source_id);
rc=h.ref(key:dest_id,data:dest_id);
drop id rc;
run;
 
Super User
Posts: 19,782

Re: Match on two variable

@Haikuo I think the forum garbled your code a bit?  

Contributor
Posts: 30

Re: Match on two variable

Thanks Haikuo. I just modified little bit and it's working. Thanks for solution. will implement to real data and let you know reult. 

data want;
if _n_=1 then do;

dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;

set have;
if h.check(key:source_id) and h.check(key:dest_id) then do ;output;
rc=h.ref(key:source_id,data:source_id);
rc=h.ref(key:dest_id,data:dest_id);
drop id rc;
end;
run;
Respected Advisor
Posts: 3,156

Re: Match on two variable

[ Edited ]

Don't think your version is working. Noticing the 'And' in the following code, so if you have a 'Or', you will be skipping those 'id's that need to be uploaded into Hash.

if h.check(key:source_id) and h.check(key:dest_id) 

 

Edit:

Maybe I have mistaken your intention, so do please do ignore my comments above if it is the case. 

Respected Advisor
Posts: 3,156

Re: Match on two variable

Yes, @Reeza. It has been doing this to all of the hash code from day one. I have no idea when and if it can be fixed Smiley Sad

Contributor
Posts: 30

Re: Match on two variable

Yes, I need to skip  id's to load into hash if match is not found so added do loop condition to your code so hash object only updated when record is written to output. It's what I wanted. 

 

Even I implemented it to real data and it's working fine. Didn't get any unexpected error or result.

 

Thanks Haikuo for your support. 

 

Thanks,

Senior User
Posts: 1

Re: Difficult to Match and Remove observation based on combination of two variables

This one seems to be working too!

 

data have_not (drop=one_source_id one_dest_id);

retain one_source_id one_dest_id;

set have;

 

if _N_=1 then

do;

one_source_id=source_id;

one_dest_id=dest_id;

end;

else

do;

if source_id = one_source_id then delete;

if dest_id=one_dest_id then delete;

if source_id ne lag(source_id) then one_source_id=source_id;

if dest_id ne lag(dest_id) then one_dest_id=dest_id;

end;

run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 422 views
  • 0 likes
  • 4 in conversation