Hello,
Imagine this is what I have:
id x y z target
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
4 a b c 0
4 a b c 0
5 a b c 1
5 a b c 1
6 a b c 0
6 a b c 0
6 a b c 0
7 a b c 0
7 a b c 0
What I need is to keep the same amount of IDs (and its rows) based on the target variable, randomly, so I have a balanced dataset to create a predictive model.
I have been searching but can't seem to find anything similar. Thank you for the help
So let's me some sample data that has different number of distinct ID values per TARGET value.
So this has 2 IDS with TARGET=1 and 4 IDS with TARGET=0.
data have;
input id x $ y $ z $ target;
cards;
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
4 a b c 0
5 a b c 1
5 a b c 1
6 a b c 0
6 a b c 0
;
Now let's get the distinct list of IDS and how many ids are in the smaller target group.
proc sql noprint;
create table ids as
select distinct id,target
from have
order by target,id
;
select min(n) into :size trimmed
from (select target,count(*) as n from ids group by target)
;
quit;
Then let's sample the IDS from the two groups.
proc surveyselect data=ids n=&size /*seed=47279*/ out=sample;
strata target;
run;
And finally use the sampled ID values to subset the original data.
proc sql noprint;
create table want as
select * from have
where id in (select id from sample)
;
quit;
Results:
Obs id x y z target 1 1 a b c 1 2 1 a b c 1 3 1 a b c 1 4 2 a b c 0 5 2 a b c 0 6 3 a b c 0 7 5 a b c 1 8 5 a b c 1
Is this supposed to be a random selection?
Exactly how is "based on the target variable" to be used? Not obvious as you do not show a result, desired or possible.
By "amount of people" do you mean the same number of unique ids? How many people do you want in the final result?
One of each is a "same amount". So must be a bit more going on here.
Do you know how many people are in each target?
You should be able to just use PROC SURVEYSELECT with SIZE= option.
Calculate the size of the smallest group and use that as the SIZE= option.
Here is example using SASHELP.CLASS as dataset and SEX as the stratifying variable.
proc sort data=sashelp.class out=have;
by sex;
run;
proc sql noprint;
select min(count) into :size
from (select sex,count(*) as count from have group by sex)
;
quit;
%put &=size;
proc surveyselect data=have n=&size seed=47279 out=want;
strata sex;
run;
@fcf wrote:
Sorry for the delay, thank you, but it didn't work
IMPORTANT CONCEPT: if you tell us it didn't work, and provide no other information, we can't help you. You need to explain and provide information about what you did and what happened.
Show us exactly the code you used. If there is an ERROR in the log, show us the ENTIRE log (that's 100% of the log, every single character, do not chop anything out). If the results are wrong, show us the wrong output and explain why its wrong and what you want to see instead.
Do you have repeated observations for the same ID in your original dataset?
It sounds like you want to sample from just the unique set of ID values and then pull all observations for those ids.
So first make the unique list of ids (and grouping variable). Then sample from that. Then use that list of sampled ids to get all observations for those ids from the original dataset.
Let us know if you need help coding that.
According to the input I posted, this is an example of the output I need:
id x y z target
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
5 a b c 1
5 a b c 1
@fcf wrote:
According to the input I posted, this is an example of the output I need:
id x y z target 1 a b c 1 1 a b c 1 1 a b c 1 2 a b c 0 2 a b c 0 3 a b c 0 5 a b c 1 5 a b c 1
So you DO have repeats. For ID=1 there are 3 observations.
Here is one way to create a dataset that has only one observation per ID.
proc sort data=have(keep=id target) out=unique nodupkey;
by id target;
run;
So let's me some sample data that has different number of distinct ID values per TARGET value.
So this has 2 IDS with TARGET=1 and 4 IDS with TARGET=0.
data have;
input id x $ y $ z $ target;
cards;
1 a b c 1
1 a b c 1
1 a b c 1
2 a b c 0
2 a b c 0
3 a b c 0
4 a b c 0
5 a b c 1
5 a b c 1
6 a b c 0
6 a b c 0
;
Now let's get the distinct list of IDS and how many ids are in the smaller target group.
proc sql noprint;
create table ids as
select distinct id,target
from have
order by target,id
;
select min(n) into :size trimmed
from (select target,count(*) as n from ids group by target)
;
quit;
Then let's sample the IDS from the two groups.
proc surveyselect data=ids n=&size /*seed=47279*/ out=sample;
strata target;
run;
And finally use the sampled ID values to subset the original data.
proc sql noprint;
create table want as
select * from have
where id in (select id from sample)
;
quit;
Results:
Obs id x y z target 1 1 a b c 1 2 1 a b c 1 3 1 a b c 1 4 2 a b c 0 5 2 a b c 0 6 3 a b c 0 7 5 a b c 1 8 5 a b c 1
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.