Solved: Re: How to keep the same amount of people based on a column?

fcf · Posted 07-21-2021 11:16 AM

Hello,

Imagine this is what I have:

id  x   y    z     target
1   a   b    c          1
1   a   b    c          1
1   a   b    c          1
2   a   b    c          0
2   a   b    c          0
3   a   b    c          0
4   a   b    c          0
4   a   b    c          0
5   a   b    c          1
5   a   b    c          1 
6   a   b    c          0
6   a   b    c          0
6   a   b    c          0
7   a   b    c          0
7   a   b    c          0

What I need is to keep the same number of IDs (and all of their rows) for each value of the target variable, chosen at random, so that I have a balanced dataset for building a predictive model.

I have been searching but can't seem to find anything similar. Thank you for the help.

ballardw · Posted 07-21-2021 11:36 AM

Is this supposed to be a random selection?

Exactly how is "based on the target variable" to be used? Not obvious as you do not show a result, desired or possible.

By "amount of people" do you mean the same number of unique ids? How many people do you want in the final result?

One of each is a "same amount", so there must be a bit more going on here.

Do you know how many people are in each target?

fcf · Posted 07-21-2021 11:47 AM

Keep the same amount of IDs, yes.

In this example we have 2 IDs with target 1 and 5 IDs with target 0, so the dataset is unbalanced with respect to the target variable. My original dataset is composed of 1142 IDs with target 1 and 8395 IDs with target 0.

I want to keep the dataset as big as possible. So, to keep the same number of IDs for each value of the target variable, the output would be, for example, the 2 IDs with target 1 (the minority group) and 2 IDs with target 0.
And I said randomly because there are no further rules for deciding which IDs from the larger group are kept.

Tom · Posted 07-21-2021 12:29 PM

You should be able to just use PROC SURVEYSELECT with the N= (SAMPSIZE=) option.

Calculate the size of the smallest group and use that as the N= value.

Here is an example using SASHELP.CLASS as the dataset and SEX as the stratifying variable.

proc sort data=sashelp.class out=have;
  by sex;
run;

proc sql noprint;
select min(count) into :size 
  from (select sex,count(*) as count from have group by sex)
;
quit;
%put &=size;


proc surveyselect data=have n=&size seed=47279 out=want;
  strata sex;
run;
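
A quick way to check the result is to count the selected rows per stratum. This is only a minimal sketch; it assumes the WANT dataset created by the step above and uses nothing beyond a standard PROC FREQ call.

```sas
/* Count the sampled rows in each SEX stratum.           */
/* With N=&size applied per stratum, both counts should  */
/* equal the size of the smaller group.                  */
proc freq data=want;
  tables sex / nocum nopercent;
run;
```

Note that this balances rows, which is the same as balancing people here only because SASHELP.CLASS has one row per student.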

fcf · Posted 08-10-2021 07:45 AM

Sorry for the delay, thank you, but it didn't work

PaigeMiller · Posted 08-10-2021 08:35 AM

@fcf wrote:
Sorry for the delay, thank you, but it didn't work

IMPORTANT CONCEPT: if you tell us it didn't work, and provide no other information, we can't help you. You need to explain and provide information about what you did and what happened.

Show us exactly the code you used. If there is an ERROR in the log, show us the ENTIRE log (that's 100% of the log, every single character, do not chop anything out). If the results are wrong, show us the wrong output and explain why it's wrong and what you want to see instead.

--
Paige Miller

fcf · Posted 08-10-2021 09:10 AM

I don't fully understand what happened, but I think the output has 50% rows with target 0 and 50% rows with target 1, and that is not what I need.

I need an output in which 50% of the IDs have target 0 and 50% of the IDs have target 1, maintaining all rows of those IDs.

Tom · Posted 08-10-2021 09:19 AM

Do you have repeated observations for the same ID in your original dataset?

It sounds like you want to sample from just the unique set of ID values and then pull all observations for those ids.

So first make the unique list of ids (and grouping variable). Then sample from that. Then use that list of sampled ids to get all observations for those ids from the original dataset.

Let us know if you need help coding that.

fcf · Posted 08-10-2021 09:19 AM

According to the input I posted, this is an example of the output I need:

id  x   y    z     target
1   a   b    c          1
1   a   b    c          1
1   a   b    c          1
2   a   b    c          0
2   a   b    c          0
3   a   b    c          0
5   a   b    c          1
5   a   b    c          1

Tom · Posted 08-10-2021 09:32 AM

@fcf wrote:

According to the input I posted, this is an example of the output I need:

id  x   y    z     target
1   a   b    c          1
1   a   b    c          1
1   a   b    c          1
2   a   b    c          0
2   a   b    c          0
3   a   b    c          0
5   a   b    c          1
5   a   b    c          1

So you DO have repeats. For ID=1 there are 3 observations.

Here is one way to create a dataset that has only one observation per ID.

proc sort data=have(keep=id target) out=unique nodupkey;
  by id target;
run;
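
Because UNIQUE holds one row per ID/TARGET combination, it can also be used to check that no ID carries more than one TARGET value. The query below is a sketch under that assumption (one target per ID), which the thread implies but never states outright; the column name n_targets is made up for illustration.

```sas
/* Assumption: every ID should map to exactly one TARGET value. */
/* UNIQUE has one row per ID/TARGET pair, so any ID that        */
/* appears more than once has conflicting TARGET values.        */
proc sql;
  select id, count(*) as n_targets
  from unique
  group by id
  having count(*) > 1
  ;
quit;
```

If the query returns no rows, each ID belongs to a single target group and sampling by ID is unambiguous.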

fcf · Posted 08-10-2021 09:38 AM

Yes, but the problem is not there. Like I said, I wrote "a", "b" and "c" just as examples. In my dataset there are no duplicate rows. The point is to focus on the ID and the target associated with it. I could even have only the ID and TARGET variables. What I need is a way to keep 50% IDs with target 0 and 50% IDs with target 1. Then I can perform a join or something similar to gather all the rows associated with those IDs.

I posted it that way because it would be faster to get the output I want.

Tom · Posted 08-10-2021 09:54 AM

So let's make some sample data that has a different number of distinct ID values per TARGET value.

So this has 2 IDS with TARGET=1 and 4 IDS with TARGET=0.

data have;
  input id x $ y $ z $ target;
cards;
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
4 a b c 0
5 a b c 1
5 a b c 1 
6 a b c 0
6 a b c 0
;

Now let's get the distinct list of IDS and how many ids are in the smaller target group.

proc sql noprint;
  create table ids as 
    select distinct id,target 
    from have
    order by target,id
  ;
  select min(n) into :size trimmed
  from (select target,count(*) as n from ids group by target)
  ;
quit;

Then let's sample the IDS from the two groups.

proc surveyselect data=ids  n=&size /*seed=47279*/ out=sample;
  strata target;
run;
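
With the SEED= option commented out, each run can select a different set of IDs. If a repeatable sample is wanted, the same step can be run with the seed enabled. This is a sketch only; 47279 is simply the value already shown in the comment, and any positive integer would serve.

```sas
/* Same sampling step with a fixed seed so that reruns */
/* select the same IDs. IDS must already be sorted by  */
/* TARGET, which the ORDER BY in the previous step     */
/* takes care of.                                      */
proc surveyselect data=ids n=&size seed=47279 out=sample;
  strata target;
run;
```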

And finally use the sampled ID values to subset the original data.

proc sql noprint;
  create table want as 
    select * from have
    where id in (select id from sample)
  ;
quit;

Results:

Obs    id    x    y    z    target

 1      1    a    b    c       1
 2      1    a    b    c       1
 3      1    a    b    c       1
 4      2    a    b    c       0
 5      2    a    b    c       0
 6      3    a    b    c       0
 7      5    a    b    c       1
 8      5    a    b    c       1
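
To confirm that WANT is balanced by ID rather than by row, the distinct IDs and the rows can be counted per TARGET value. This is a minimal sketch assuming the WANT dataset from the step above; the column names n_ids and n_rows are illustrative.

```sas
/* Distinct IDs and total rows per TARGET value in WANT. */
/* n_ids should match across groups; n_rows may differ   */
/* because IDs have different numbers of rows.           */
proc sql;
  select target,
         count(distinct id) as n_ids,
         count(*) as n_rows
  from want
  group by target
  ;
quit;
```

For the results listed above, this gives 2 IDs and 3 rows for TARGET=0, and 2 IDs and 5 rows for TARGET=1.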

fcf · Posted 08-10-2021 10:06 AM

Thank you so much, that's exactly what I needed!

fcf · Posted 08-10-2021 09:22 AM

No, I don't have repeated observations, I just posted "a", "b", "c" because regardless of the information there, I'll focus on the variables "id" and "target" to determine the ids that stay in the output. I want 50% ids with target 1 (and all the rows associated with those ids) and 50% ids with target 0 and also all rows associated with those ids.
