BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6

Hello,

 

Imagine this is what I have:

id  x   y    z     target
1   a   b    c          1
1   a   b    c          1
1   a   b    c          1
2   a   b    c          0
2   a   b    c          0
3   a   b    c          0
4   a   b    c          0
4   a   b    c          0
5   a   b    c          1
5   a   b    c          1 
6   a   b    c          0
6   a   b    c          0
6   a   b    c          0
7   a   b    c          0
7   a   b    c          0

What I need is to keep the same amount of IDs (and its rows) based on the target variable, randomly, so I have a balanced dataset to create a predictive model.

 

I have been searching but can't seem to find anything similar. Thank you for the help

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So let's me some sample data that has different number of distinct ID values per TARGET value.

So this has 2 IDS with TARGET=1 and 4 IDS with TARGET=0.

data have;
  input id x $ y $ z $ target;
cards;
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
4 a b c 0
5 a b c 1
5 a b c 1 
6 a b c 0
6 a b c 0
;

Now let's get the distinct list of IDS and how many ids are in the smaller target group.

proc sql noprint;
  create table ids as 
    select distinct id,target 
    from have
    order by target,id
  ;
  select min(n) into :size trimmed
  from (select target,count(*) as n from ids group by target)
  ;
quit;

Then let's sample the IDS from the two groups.

proc surveyselect data=ids  n=&size /*seed=47279*/ out=sample;
  strata target;
run;

And finally use the sampled ID values to subset the original data.

proc sql noprint;
  create table want as 
    select * from have
    where id in (select id from sample)
  ;
quit;

Results:

Obs    id    x    y    z    target

 1      1    a    b    c       1
 2      1    a    b    c       1
 3      1    a    b    c       1
 4      2    a    b    c       0
 5      2    a    b    c       0
 6      3    a    b    c       0
 7      5    a    b    c       1
 8      5    a    b    c       1

View solution in original post

13 REPLIES 13
ballardw
Super User

Is this supposed to be a random selection?

Exactly how is "based on the target variable" to be used? Not obvious as you do not show a result, desired or possible.

 

By "amount of people" do you mean the same number of unique ids? How many people do you want in the final result?

One of each is a "same amount". So must be a bit more going on here.

Do you know how many people are in each target?

 

 

 

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Keep the same amount of IDs, yes.

In this example we have 2 IDs with target 1 and 5 IDs with target 0, so it is a not balanced dataset based on the target variable. My original dataset is composed by 1142 IDs with target 1 and 8395 IDs with target 0.

I want to keep the dataset as big as possible, so, to keep the same amount of IDs for each value of the target variable, the output would be, for example, 2 IDs with target 1 (which are in disadvantage) and 2 IDs with target 0.
And I said randomly because there are no further rules to filter who with target 1 is being kept.

Tom
Super User Tom
Super User

You should be able to just use PROC SURVEYSELECT with SIZE= option.

Calculate the size of the smallest group and use that as the SIZE= option.

Here is example using SASHELP.CLASS as dataset and SEX as the stratifying variable.

proc sort data=sashelp.class out=have;
  by sex;
run;

proc sql noprint;
select min(count) into :size 
  from (select sex,count(*) as count from have group by sex)
;
quit;
%put &=size;


proc surveyselect data=have n=&size seed=47279 out=want;
  strata sex;
run;
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Sorry for the delay, thank you, but it didn't work
PaigeMiller
Diamond | Level 26

@fcf wrote:
Sorry for the delay, thank you, but it didn't work

IMPORTANT CONCEPT: if you tell us it didn't work, and provide no other information, we can't help you. You need to explain and provide information about what you did and what happened.

 

Show us exactly the code you used. If there is an ERROR in the log, show us the ENTIRE log (that's 100% of the log, every single character, do not chop anything out). If the results are wrong, show us the wrong output and explain why its wrong and what you want to see instead.

--
Paige Miller
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
I can't seem to 100% understand what happened but I think the output is returning 50% rows with target 0 and 50% rows with target 1, but that is not what I need.

I need to have a output with 50% IDS that have target 0 and 50% IDS that have target 1, mantaining all rows of those ids.
Tom
Super User Tom
Super User

Do you have repeated observations for the same ID in your original dataset?

 

It sounds like you want to sample from just the unique set of ID values and then pull all observations for those ids.

So first make the unique list of ids (and grouping variable).  Then sample from that.  Then use that list of sampled ids to get all observations for those ids from the original dataset.

 

Let us know if you need help coding that.

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6

According to the input I posted, this is an example of the output I need:

 

id  x   y    z     target
1   a   b    c          1
1   a   b    c          1
1   a   b    c          1
2   a   b    c          0
2   a   b    c          0
3   a   b    c          0
5   a   b    c          1
5   a   b    c          1 

 

Tom
Super User Tom
Super User

@fcf wrote:

According to the input I posted, this is an example of the output I need:

 

id  x   y    z     target
1   a   b    c          1
1   a   b    c          1
1   a   b    c          1
2   a   b    c          0
2   a   b    c          0
3   a   b    c          0
5   a   b    c          1
5   a   b    c          1 

 


So you DO have repeats.  For ID=1 there are 3 observations.

Here is one way to create a dataset that has only one observation per ID.

proc sort data=have(keep=id target) out=unique nodupkey;
  by id target;
run;
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Yes, but the problem it's not there. Like I said, I wrote "a", "b" and "c" just as examples. In my dataset, there are no duplicate rows. The thing is two focus on the id and the target associated. I can even only have the ID and the TARGET variables. What I need is a way to keep 50% of the ids with target 0 and 50% of the ids with target 1. Then I can perfom a join or something to gather all the rows associated with the IDS.

I posted that way because it would be faster to get the ouput I want.
Tom
Super User Tom
Super User

So let's me some sample data that has different number of distinct ID values per TARGET value.

So this has 2 IDS with TARGET=1 and 4 IDS with TARGET=0.

data have;
  input id x $ y $ z $ target;
cards;
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
4 a b c 0
5 a b c 1
5 a b c 1 
6 a b c 0
6 a b c 0
;

Now let's get the distinct list of IDS and how many ids are in the smaller target group.

proc sql noprint;
  create table ids as 
    select distinct id,target 
    from have
    order by target,id
  ;
  select min(n) into :size trimmed
  from (select target,count(*) as n from ids group by target)
  ;
quit;

Then let's sample the IDS from the two groups.

proc surveyselect data=ids  n=&size /*seed=47279*/ out=sample;
  strata target;
run;

And finally use the sampled ID values to subset the original data.

proc sql noprint;
  create table want as 
    select * from have
    where id in (select id from sample)
  ;
quit;

Results:

Obs    id    x    y    z    target

 1      1    a    b    c       1
 2      1    a    b    c       1
 3      1    a    b    c       1
 4      2    a    b    c       0
 5      2    a    b    c       0
 6      3    a    b    c       0
 7      5    a    b    c       1
 8      5    a    b    c       1
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Thank you so much, that's exactly what I needed!
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
No, I don't have repeated observations, I just posted "a", "b", "c" because regardless of the information there, I'll focus on the variables "id" and "target" to determine the ids that stay in the output. I want 50% ids with target 1 (and all the rows associated with those ids) and 50% ids with target 0 and also all rows associated with those ids.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 1516 views
  • 1 like
  • 4 in conversation