Solved: deduplicate data constrained to have different values across subgroups

ciro · Posted 05-30-2024 07:30 AM

Hi,

I would like to deduplicate the dataset 'have' with the constraint that for each 'group' there will be an observation for each 'id' and that across ids within 'group' all the 'x' values are different.

My dataset want should be like this:

group	id	x
a	2	1
a	3	2
a	4	3
b	2	1
b	3	3
b	4	4
b	5	2

I would appreciate very much a range of techniques.

thank you very much in advance

ballardw · Posted 05-30-2024 12:02 PM

For the case where the number of X = the number of Id the following works.

proc sql;
   create table one as
   select distinct group,id 
   from have
   ;
   create table two as
   select distinct group,x 
   from have
   ;
quit;

data want;
   merge one two;
   by group;
run;

I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...

The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.

View solution in original post

svh · Posted 05-30-2024 09:27 AM

It seems there needs to be some logic for which value of x (per value of id within group) is selected. E.g., why shouldn't the first three rows of your want table look like this?

group	id	x
a	2	3
a	3	2
a	4	1

ciro · Posted 05-30-2024 10:33 AM

What it is needed is just that within group x is unique, even if it is choosen ar random within id

Quentin · Posted 05-30-2024 09:46 AM

Can you please post your HAVE dataset as a CARDS data step? Many people won't download files. Also, please post the code you have tried, and describe the problem with the code you tried (errors in your log? incorrect results?)

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

ballardw · Posted 05-30-2024 10:00 AM

Perhaps if you gave us some description of what this process is supposed to be used for we can provide better help.

Often people will ask about a single step of a possible solution and not realize that there is a procedure in SAS that will accomplish the task (i.e. the end goal) without having to do the intermediate steps manually.

For example, your description smells to me a bit like an assignment of subjects (X variable) to treatments (ID variable)

Also, have you verified in your "real" data that there are enough X values to satisfy your "across ids within 'group' all the 'x' values are different." condition? I am assuming your example Have data is not your actual data. This question is important because if you have 4 id values within a Group value and only 3 X values (your group=a only shows 3 x values for example) then the condition cannot be met or one of the Id would have to be dropped from the output.

ciro · Posted 05-30-2024 11:18 AM

Dear Ballardw,

thank you very much for your kind and insigthful replay.

Maybe I should have posted the problem at an earlier stage as you suggested.

let me try again.

the dataset have is the result of this operation:

data have1;
input group $ match x;
cards;
a 1000 1
a 1000 2
a 1000 3
b 2000 1
b 2000 2
b 2000 3
b 2000 4
;
run;

data have2;
input group $ match id;
cards;
a 1000 2
a 1000 3
a 1000 4
b 2000 2
b 2000 3
b 2000 4
b 2000 5
;
run;

proc sql;
create table have as select a.*,b.id
from have1 as a
left join
have2 as b
on a.group=b.group and a.match=b.match
order by group, id, x;
quit;

data have;
retain group id x;
set have;
run;

In other terms I have to match the subject x of have1 within group with the id of have2.

the two dataset have only in common the variable group and the variable match. the problem is that

within a group different subjects can have the same match variable but in the want dataset I need that x in each group be unique.

You are right the real problem is more complex as there can be exceeding rows in both datasete and/or the matching variable do not match. In reality what I would like is to have those that have not joined in residuals dataset.

The problem presented is a simplification to start to solve the problem of having multiple subject with the same matching variable and an equal number of subjects.

Thank you again in advance.

ballardw · Posted 05-30-2024 12:02 PM

For the case where the number of X = the number of Id the following works.

proc sql;
   create table one as
   select distinct group,id 
   from have
   ;
   create table two as
   select distinct group,x 
   from have
   ;
quit;

data want;
   merge one two;
   by group;
run;

I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...

The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.

ciro · Posted 05-30-2024 10:35 AM

data have;
input group $ id x;
cards;
a 2 1
a 2 2
a 2 3
a 3 1
a 3 2
a 3 3
a 4 1
a 4 2
a 4 3
b 2 1
b 2 2
b 2 3
b 2 4
b 3 1
b 3 2
b 3 3
b 3 4
b 4 1
b 4 2
b 4 3
b 4 4
b 5 1
b 5 2
b 5 3
b 5 4
;
run;

deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

Re: deduplicate data constrained to have different values across subgroups

SAS Innovate 2025: Call for Content

Classroom Training Available!