BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ciro
Quartz | Level 8

Hi,

I would like to deduplicate the dataset 'have' with the constraint that for each 'group' there will be an observation for each 'id' and that across ids within 'group' all the 'x' values are different.

My dataset want should be like this:

group  id x
a 2 1
a 3 2
a 4 3
b 2 1
b 3 3
b 4 4
b 5 2

 

I would appreciate very much a range of techniques.

 

thank you very much in advance

 



1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

For the case where the number of X = the number of Id the following works.

 

proc sql;
   create table one as
   select distinct group,id 
   from have
   ;
   create table two as
   select distinct group,x 
   from have
   ;
quit;

data want;
   merge one two;
by group; run;

I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...

 

The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.

View solution in original post

7 REPLIES 7
svh
Lapis Lazuli | Level 10 svh
Lapis Lazuli | Level 10

It seems there needs to be some logic for which value of x (per value of id within group) is selected. E.g., why shouldn't the first three rows of your want table look like this?

 

group  id x
a 2 3
a 3 2
a 4 1
ciro
Quartz | Level 8
What it is needed is just that within group x is unique, even if it is choosen ar random within id
Quentin
Super User

Can you please post your HAVE dataset as a CARDS data step?  Many people won't download files.   Also, please post the code you have tried, and describe the problem with the code you tried (errors in your log? incorrect results?)

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
ballardw
Super User

Perhaps if you gave us some description of what this process is supposed to be used for we can provide better help.

 

Often people will ask about a single step of a possible solution and not realize that there is a procedure in SAS that will accomplish the task (i.e. the end goal) without having to do the intermediate steps manually.

 

For example, your description smells to me a bit like an assignment of subjects (X variable) to treatments (ID variable)

 

Also, have you verified in your "real" data that there are enough X values to satisfy your "across ids within 'group' all the 'x' values are different." condition? I am assuming your example Have data is not your actual data. This question is important because if you have 4 id values within a Group value and only 3 X values (your group=a only shows 3 x values for example) then the condition cannot be met or one of the Id would have to be dropped from the output.

 

 

ciro
Quartz | Level 8

Dear Ballardw,

thank you very much for your kind and insigthful replay.

Maybe I should have posted the problem at an earlier stage as you suggested.

let me try again.

the dataset have is the result of this operation:

data have1;
input group $ match x;
cards;
a 1000 1
a 1000 2
a 1000 3
b 2000 1
b 2000 2
b 2000 3
b 2000 4
;
run;

data have2;
input group $ match id;
cards;
a 1000 2
a 1000 3
a 1000 4
b 2000 2
b 2000 3
b 2000 4
b 2000 5
;
run;

proc sql;
create table have as select a.*,b.id
from have1 as a
left join
have2 as b
on a.group=b.group and a.match=b.match
order by group, id, x;
quit;

data have;
retain group id x;
set have;
run;

 

In other terms I have to match the subject x of have1 within group with the id of have2.

the two dataset have only in common the variable group and the variable match. the problem is that 

within a group different subjects can have the same match variable but in the want dataset I need that x in each group be unique.

 

You are right the real problem is more complex as there can be exceeding rows in both datasete and/or the matching variable do not match. In reality what I would like is to have those that have not joined in residuals dataset.

The problem presented is a simplification to start to solve the problem of having multiple subject with the same matching variable and an equal number of subjects. 

Thank you again in advance.

 

ballardw
Super User

For the case where the number of X = the number of Id the following works.

 

proc sql;
   create table one as
   select distinct group,id 
   from have
   ;
   create table two as
   select distinct group,x 
   from have
   ;
quit;

data want;
   merge one two;
by group; run;

I doubt that an extension of this works for any of the unbalanced cases though. There will be a note about the merge having more than one data set with repeats of by values. Since the behavior of merge in that case has unexpected results when the number of duplicates is different than there isn't a problem with the case where the number is the same. Actually the BY may not be needed at all but without knowing what else is going on in your data...

 

The Select distinct in the Proc Sql will tend to order the variables so the lowest sort order Id will tend to get the lowest X value. If a random behavior is actually desired then randomize one of the sets within values of group.

ciro
Quartz | Level 8

data have;
input group $ id x;
cards;
a 2 1
a 2 2
a 2 3
a 3 1
a 3 2
a 3 3
a 4 1
a 4 2
a 4 3
b 2 1
b 2 2
b 2 3
b 2 4
b 3 1
b 3 2
b 3 3
b 3 4
b 4 1
b 4 2
b 4 3
b 4 4
b 5 1
b 5 2
b 5 3
b 5 4
;
run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 789 views
  • 0 likes
  • 4 in conversation