Hi,
I would like to deduplicate the dataset 'have' with the constraint that for each 'group' there will be an observation for each 'id' and that across ids within 'group' all the 'x' values are different.
My dataset want should be like this:
group | id | x |
a | 2 | 1 |
a | 3 | 2 |
a | 4 | 3 |
b | 2 | 1 |
b | 3 | 3 |
b | 4 | 4 |
b | 5 | 2 |
I would appreciate very much a range of techniques.
thank you very much in advance
For the case where the number of X = the number of Id the following works.
proc sql; create table one as select distinct group,id from have ; create table two as select distinct group,x from have ; quit; data want; merge one two;
by group; run;
I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...
The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.
It seems there needs to be some logic for which value of x (per value of id within group) is selected. E.g., why shouldn't the first three rows of your want table look like this?
group | id | x |
a | 2 | 3 |
a | 3 | 2 |
a | 4 | 1 |
Can you please post your HAVE dataset as a CARDS data step? Many people won't download files. Also, please post the code you have tried, and describe the problem with the code you tried (errors in your log? incorrect results?)
Perhaps if you gave us some description of what this process is supposed to be used for we can provide better help.
Often people will ask about a single step of a possible solution and not realize that there is a procedure in SAS that will accomplish the task (i.e. the end goal) without having to do the intermediate steps manually.
For example, your description smells to me a bit like an assignment of subjects (X variable) to treatments (ID variable)
Also, have you verified in your "real" data that there are enough X values to satisfy your "across ids within 'group' all the 'x' values are different." condition? I am assuming your example Have data is not your actual data. This question is important because if you have 4 id values within a Group value and only 3 X values (your group=a only shows 3 x values for example) then the condition cannot be met or one of the Id would have to be dropped from the output.
Dear Ballardw,
thank you very much for your kind and insigthful replay.
Maybe I should have posted the problem at an earlier stage as you suggested.
let me try again.
the dataset have is the result of this operation:
data have1;
input group $ match x;
cards;
a 1000 1
a 1000 2
a 1000 3
b 2000 1
b 2000 2
b 2000 3
b 2000 4
;
run;
data have2;
input group $ match id;
cards;
a 1000 2
a 1000 3
a 1000 4
b 2000 2
b 2000 3
b 2000 4
b 2000 5
;
run;
proc sql;
create table have as select a.*,b.id
from have1 as a
left join
have2 as b
on a.group=b.group and a.match=b.match
order by group, id, x;
quit;
data have;
retain group id x;
set have;
run;
In other terms I have to match the subject x of have1 within group with the id of have2.
the two dataset have only in common the variable group and the variable match. the problem is that
within a group different subjects can have the same match variable but in the want dataset I need that x in each group be unique.
You are right the real problem is more complex as there can be exceeding rows in both datasete and/or the matching variable do not match. In reality what I would like is to have those that have not joined in residuals dataset.
The problem presented is a simplification to start to solve the problem of having multiple subject with the same matching variable and an equal number of subjects.
Thank you again in advance.
For the case where the number of X = the number of Id the following works.
proc sql; create table one as select distinct group,id from have ; create table two as select distinct group,x from have ; quit; data want; merge one two;
by group; run;
I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...
The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.
data have;
input group $ id x;
cards;
a 2 1
a 2 2
a 2 3
a 3 1
a 3 2
a 3 3
a 4 1
a 4 2
a 4 3
b 2 1
b 2 2
b 2 3
b 2 4
b 3 1
b 3 2
b 3 3
b 3 4
b 4 1
b 4 2
b 4 3
b 4 4
b 5 1
b 5 2
b 5 3
b 5 4
;
run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.