Solved: Difficult to Match and Remove observation based on combination of two...

nkm123 · Posted 04-21-2016 03:38 PM

Hi,

I have dataset in following format

data have;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 101; source_id = 2;dest_id = 3;output;
store_id = 102; source_id = 1;dest_id = 4;output;
store_id = 102; source_id = 2;dest_id = 4;output;
run;

and my requirement is to take out all duplicates where source or dest id is already taken. so final dataset should be like this

data want;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 102; source_id = 2;dest_id = 4;output;
run;

Logic:-

1) First record doesn't have any previous record to compare so will be there in final dataset.

2) second record should be removed as dest_id = 3 is taken by first observation.

3) third record should also be removed as source id = 1 is already taken by first observation.

4) should keep fourth observation as source and dest id are not taken previously.

I have millions of observation and different combination (source and dest id is not 1 and 2 but can have as many as digit ).

Thanks in advance for help.

Thanks,

Haikuo · Posted 04-21-2016 05:03 PM

Hash comes in handy:

data want;
if _n_=1 then do;

dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;

set have;
if h.check(key:source_id) and h.check(key:dest_id) then output;
rc=h.ref(key:source_id,data&colon;source_id);
rc=h.ref(key:dest_id,data&colon;dest_id);
drop id rc;
run;

View solution in original post

nkm123 · Posted 04-21-2016 03:46 PM

More Examples:-

data have;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 101; source_id = 2;dest_id = 3;output;
store_id = 102; source_id = 1;dest_id = 4;output;
store_id = 102; source_id = 2;dest_id = 4;output;
store_id = 103; source_id = 5;dest_id = 6;output;
store_id = 103; source_id = 5;dest_id = 7;output;
store_id = 103; source_id = 8;dest_id = 10;output;
store_id = 103; source_id = 9;dest_id = 10;output;

run;

data want;
store_id = 101; source_id = 1;dest_id = 3;output;
store_id = 102; source_id = 2;dest_id = 4;output;
store_id = 103; source_id = 5;dest_id = 6;output;
store_id = 103; source_id = 8;dest_id = 10;output;
run;

Haikuo · Posted 04-21-2016 05:03 PM

Hash comes in handy:

data want;
if _n_=1 then do;

dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;

set have;
if h.check(key:source_id) and h.check(key:dest_id) then output;
rc=h.ref(key:source_id,data&colon;source_id);
rc=h.ref(key:dest_id,data&colon;dest_id);
drop id rc;
run;

Reeza · Posted 04-21-2016 05:58 PM

@Haikuo I think the forum garbled your code a bit?

nkm123 · Posted 04-21-2016 07:06 PM

Thanks Haikuo. I just modified little bit and it's working. Thanks for solution. will implement to real data and let you know reult.

data want;
if _n_=1 then do;

dcl hash h();
h.definekey('id');
h.definedata('id');
h.definedone();
call missing (id);
end;

set have;
if h.check(key:source_id) and h.check(key:dest_id) then do ;output;
rc=h.ref(key:source_id,data&colon;source_id);
rc=h.ref(key:dest_id,data&colon;dest_id);
drop id rc;
end;
run;

Haikuo · Posted 04-21-2016 07:19 PM

Don't think your version is working. Noticing the 'And' in the following code, so if you have a 'Or', you will be skipping those 'id's that need to be uploaded into Hash.

if h.check(key:source_id) and h.check(key:dest_id)

Edit:

Maybe I have mistaken your intention, so do please do ignore my comments above if it is the case.

Haikuo · Posted 04-21-2016 07:15 PM

Yes, @Reeza. It has been doing this to all of the hash code from day one. I have no idea when and if it can be fixed :(.

nkm123 · Posted 04-21-2016 08:21 PM

Yes, I need to skip id's to load into hash if match is not found so added do loop condition to your code so hash object only updated when record is written to output. It's what I wanted.

Even I implemented it to real data and it's working fine. Didn't get any unexpected error or result.

Thanks Haikuo for your support.

Thanks,

vkmanchi · Posted 04-22-2016 03:26 PM

This one seems to be working too!

data have_not (drop=one_source_id one_dest_id);

retain one_source_id one_dest_id;

set have;

if _N_=1 then

do;

one_source_id=source_id;

one_dest_id=dest_id;

end;

else

do;

if source_id = one_source_id then delete;

if dest_id=one_dest_id then delete;

if source_id ne lag(source_id) then one_source_id=source_id;

if dest_id ne lag(dest_id) then one_dest_id=dest_id;

end;

run;

Difficult to Match and Remove observation based on combination of two variables

Re: Match on two variable

Re: Match on two variable

Re: Match on two variable

Re: Match on two variable

Re: Match on two variable

Re: Match on two variable

Re: Match on two variable

Re: Match on two variable

Re: Difficult to Match and Remove observation based on combination of two variables

SAS Innovate 2025: Call for Content

Classroom Training Available!