Quartz | Level 8

## deduplicate data constrained to have different values across subgroups

Hi,

I would like to deduplicate the dataset 'have' with the constraint that for each 'group' there will be an observation for each 'id' and that across ids within 'group' all the 'x' values are different.

My dataset want should be like this:

 group id x a 2 1 a 3 2 a 4 3 b 2 1 b 3 3 b 4 4 b 5 2

I would appreciate very much a range of techniques.

thank you very much in advance

1 ACCEPTED SOLUTION

Accepted Solutions
Super User

## Re: deduplicate data constrained to have different values across subgroups

For the case where the number of X = the number of Id the following works.

```proc sql;
create table one as
select distinct group,id
from have
;
create table two as
select distinct group,x
from have
;
quit;

data want;
merge one two;   by group;
run;```

I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...

The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.

7 REPLIES 7
Lapis Lazuli | Level 10

## Re: deduplicate data constrained to have different values across subgroups

It seems there needs to be some logic for which value of x (per value of id within group) is selected. E.g., why shouldn't the first three rows of your want table look like this?

 group id x a 2 3 a 3 2 a 4 1
Quartz | Level 8

## Re: deduplicate data constrained to have different values across subgroups

What it is needed is just that within group x is unique, even if it is choosen ar random within id
Super User

## Re: deduplicate data constrained to have different values across subgroups

Can you please post your HAVE dataset as a CARDS data step?  Many people won't download files.   Also, please post the code you have tried, and describe the problem with the code you tried (errors in your log? incorrect results?)

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Super User

## Re: deduplicate data constrained to have different values across subgroups

Perhaps if you gave us some description of what this process is supposed to be used for we can provide better help.

Often people will ask about a single step of a possible solution and not realize that there is a procedure in SAS that will accomplish the task (i.e. the end goal) without having to do the intermediate steps manually.

For example, your description smells to me a bit like an assignment of subjects (X variable) to treatments (ID variable)

Also, have you verified in your "real" data that there are enough X values to satisfy your "across ids within 'group' all the 'x' values are different." condition? I am assuming your example Have data is not your actual data. This question is important because if you have 4 id values within a Group value and only 3 X values (your group=a only shows 3 x values for example) then the condition cannot be met or one of the Id would have to be dropped from the output.

Quartz | Level 8

## Re: deduplicate data constrained to have different values across subgroups

Dear Ballardw,

thank you very much for your kind and insigthful replay.

Maybe I should have posted the problem at an earlier stage as you suggested.

let me try again.

the dataset have is the result of this operation:

data have1;
input group \$ match x;
cards;
a 1000 1
a 1000 2
a 1000 3
b 2000 1
b 2000 2
b 2000 3
b 2000 4
;
run;

data have2;
input group \$ match id;
cards;
a 1000 2
a 1000 3
a 1000 4
b 2000 2
b 2000 3
b 2000 4
b 2000 5
;
run;

proc sql;
create table have as select a.*,b.id
from have1 as a
left join
have2 as b
on a.group=b.group and a.match=b.match
order by group, id, x;
quit;

data have;
retain group id x;
set have;
run;

In other terms I have to match the subject x of have1 within group with the id of have2.

the two dataset have only in common the variable group and the variable match. the problem is that

within a group different subjects can have the same match variable but in the want dataset I need that x in each group be unique.

You are right the real problem is more complex as there can be exceeding rows in both datasete and/or the matching variable do not match. In reality what I would like is to have those that have not joined in residuals dataset.

The problem presented is a simplification to start to solve the problem of having multiple subject with the same matching variable and an equal number of subjects.

Super User

## Re: deduplicate data constrained to have different values across subgroups

For the case where the number of X = the number of Id the following works.

```proc sql;
create table one as
select distinct group,id
from have
;
create table two as
select distinct group,x
from have
;
quit;

data want;
merge one two;   by group;
run;```

I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...

The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.

Quartz | Level 8

## Re: deduplicate data constrained to have different values across subgroups

data have;
input group \$ id x;
cards;
a 2 1
a 2 2
a 2 3
a 3 1
a 3 2
a 3 3
a 4 1
a 4 2
a 4 3
b 2 1
b 2 2
b 2 3
b 2 4
b 3 1
b 3 2
b 3 3
b 3 4
b 4 1
b 4 2
b 4 3
b 4 4
b 5 1
b 5 2
b 5 3
b 5 4
;
run;

Discussion stats
• 7 replies
• 478 views
• 0 likes
• 4 in conversation