EC27556
Quartz | Level 8

I have a dataset with 5 million observations.

95,000 of them have the target variable = 1; the rest have target = 0. I want to take a biased sample that includes all 95,000 of the target = 1 cases plus a selection of the target = 0 cases, so that the split is 10% true and 90% false. Could anyone share some code to do this, please?

 

Thanks

2 REPLIES
PeterClemmensen
Tourmaline | Level 20

So you want all 95,000 of the 1's, and you want those 95,000 to make up 10% of your resulting dataset, meaning that you need 950,000 - 95,000 = 855,000 of the remaining observations from your data?
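To make the arithmetic above easy to reuse for other hit counts, here is a minimal sketch in Python (not SAS; the thread itself contains no code). The function name `zeros_needed` and its `hit_pct` parameter are hypothetical, chosen only for illustration.

```python
def zeros_needed(n_hits: int, hit_pct: int = 10) -> int:
    """Number of target=0 rows to add so that n_hits rows form hit_pct% of the sample."""
    # Sample size would be n_hits * 100 / hit_pct; the non-hits are the remainder.
    # Integer arithmetic avoids floating-point rounding on large counts.
    return n_hits * (100 - hit_pct) // hit_pct


print(zeros_needed(95_000))  # 855000, i.e. 950,000 - 95,000
```

With 5 million observations there are far more than 855,000 zeros available, so all 95,000 hits can be kept in this case.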

EC27556
Quartz | Level 8

Yes.

Ultimately, I have 100 datasets and would like every sample to have a 10% hit rate while keeping all target incidences where possible. So I want something I can run in a loop over all the datasets.

Unfortunately, sometimes I won't be able to use all of my target 'hit' observations, because they already represent more than 10% of the aggregate dataset. In that case I would undersample the hits to ensure they make up 10% of the sample.

For the most part, though, fewer than 10% of the observations in the datasets at an aggregate level have a hit for the target variable. I would like some code to oversample the hits so that I can create a sample in which 10% of the observations have a hit.
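The two cases described above (keep every hit and draw a subset of non-hits, or undersample the hits when they already exceed 10%) can be sketched as follows. This is a Python illustration of the selection logic, not SAS code and not a tested production routine. The function name `biased_sample_indices`, the `hit_pct` and `seed` parameters, and the toy data are all assumptions made for the example. In SAS, the random draw itself would more likely be done with a sampling procedure such as PROC SURVEYSELECT, with the per-class counts computed beforehand.

```python
import random


def biased_sample_indices(targets, hit_pct=10, seed=0):
    """Return sorted row indices for a sample in which hits make up about hit_pct%."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    hits = [i for i, t in enumerate(targets) if t == 1]
    misses = [i for i, t in enumerate(targets) if t == 0]

    # Non-hits required if every hit is kept.
    want_misses = len(hits) * (100 - hit_pct) // hit_pct

    if want_misses <= len(misses):
        # Usual case: keep all hits, draw a random subset of the non-hits.
        keep_hits = hits
        keep_misses = rng.sample(misses, want_misses)
    else:
        # Hits exceed hit_pct of the data: keep all non-hits, undersample the hits.
        want_hits = len(misses) * hit_pct // (100 - hit_pct)
        keep_hits = rng.sample(hits, want_hits)
        keep_misses = misses

    return sorted(keep_hits + keep_misses)


# Toy stand-ins for the 100 datasets: one with 1% hits, one with about 36% hits.
datasets = {
    "rare_hits": [1] * 10 + [0] * 990,
    "many_hits": [1] * 50 + [0] * 90,
}
for name, targets in datasets.items():
    idx = biased_sample_indices(targets)
    n_hits = sum(targets[i] for i in idx)
    print(name, len(idx), n_hits)
```

Because the counts are rounded down to whole rows, the hit share in the undersampling branch can land slightly below 10% when the number of non-hits is not a multiple of 9.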
