How to split data into representative subsets

Luhan — Sun, 11 Mar 2018 13:04:11 GMT

Hello,

my goal is to create training and testing subsets that are representative of the original data set.

Situation

For example, the original data have is made up by 80% of data from the US (region="US") and by 20% of data from Asia (region="Asia").

data have(drop=i);
	do i=1 to 10;
		if i>8 then region = "Asia";
		       else region = "US";
		output;
	end;
run;

I now want to randomly split have in 2 subsets each also being made up by 80% of data from the US and by 20% of data from Asia.

Solution so far

Apparently, there is a comprehensive SAS procedure called SURVEYSELECT to handle this task. This is the best solution I was able to come up with to get the job done (splitting have into want1 and want2; to keep things simple, a splitting ratio of 50% was applied):

/* BEFORE surveyselect... */

/* (1) we need number of obs */
%let dsid  = %sysfunc(open(have));
%let nobs  = %sysfunc(attrn(&dsid, nobs));
%let close = %sysfunc(close(&dsid.));

/* (2) we need to sort the data */
proc sql noprint;
	create view haveV as select * from have
	order by region;
quit;


proc surveyselect noprint
	data     = haveV
	out      = have2
	outall
	sampsize = %sysevalf(&nobs.*.5)
	seed     = 100
;
	strata region / alloc = prop;
run;


/* AFTER surveyselect... */

/* ...we need to split the data set ourselves */
data want1 want2;
	set have2;
	if Selected then output want1;
	            else output want2;
run;

Question

Is there a better approach (or a better way to use SURVEYSELECT) to get the splitting job done?

Considering that this splitting job is a frequently occuring task that can be accomplished with a few lines of code in other languages, there should be a better solution in SAS that doesn't suffer from the following short comings of the solution I found:

I couldn't see how to pass a splitting ratio to SURVEYSELECT (samprate doesn't work here). The procedure rather needs the absolute number of obs for a sample. So, in order to apply a splitting ratio of x% (here: 50%), I first have to determine the number of obs in the original data set and compute the number of obs for the subset (here: &nobs.*.5).
To ensure that SURVEYSELECT works with any data set, one would actually need to temporarily rename the variables in the original data set (haven't done this here). For example, one of my data sets already has a variable called "SELECTED", so in this case SURVEYSELECT would issue a WARNING and couldn't compute the column "SELECTED". This is a frequent problem when I'm working with SAS and I've already written a macro that temporarily renames all variables of a data set to "___var1", "___var2" ... "___varN", but still..
After using SURVEYSELECT, (a) a copy of the original data set (which can be quite large) is produced (instead of just directly creating the 2 subsets) and (b) another whole pass through the newly created data is required to create the subsets.
The input data has to be sorted. That means, if it's (technically) not possible to use a view (like I did here), one has to either sort the original data set or (temporarily) create a copy of it.

Thank you very much!

Re: How to split data into representative subsets

TheresaStemler — Wed, 14 Mar 2018 21:21:45 GMT

Hi Luhan,

I forwarded your post to my colleagues. Several replies stated that you can use the SAMPRATE in PROC SURVEYSELECT. Another suggested if you have access to High Performance procedures to try PROC HPSAMPLE. And the final suggestion was this blog:

https://blogs.sas.com/content/sastraining/2017/01/23/stratified-random-sample-whats-efficient/

I hope this helps,

theresa

topic Re: How to split data into representative subsets in SAS Academy for Data Science

How to split data into representative subsets

Re: How to split data into representative subsets