Don't just think like a data scientist. Be one.

How to split data into representative subsets

Reply
New Contributor
Posts: 2

How to split data into representative subsets

[ Edited ]

Hello,

 

my goal is to create training and testing subsets that are representative of the original data set.

 

Situation

 

For example, the original data have is made up by 80% of data from the US (region="US") and by 20% of data from Asia (region="Asia").

data have(drop=i);
	do i=1 to 10;
		if i>8 then region = "Asia";
		       else region = "US";
		output;
	end;
run;

 I now want to randomly split have in 2 subsets each also being made up by 80% of data from the US and by 20% of data from Asia.

 

Solution so far

 

Apparently, there is a comprehensive SAS procedure called SURVEYSELECT to handle this task. This is the best solution I was able to come up with to get the job done (splitting have into want1 and want2; to keep things simple, a splitting ratio of 50% was applied):

 

/* BEFORE surveyselect... */

/* (1) we need number of obs */
%let dsid  = %sysfunc(open(have));
%let nobs  = %sysfunc(attrn(&dsid, nobs));
%let close = %sysfunc(close(&dsid.));

/* (2) we need to sort the data */
proc sql noprint;
	create view haveV as select * from have
	order by region;
quit;


proc surveyselect noprint
	data     = haveV
	out      = have2
	outall
	sampsize = %sysevalf(&nobs.*.5)
	seed     = 100
;
	strata region / alloc = prop;
run;


/* AFTER surveyselect... */

/* ...we need to split the data set ourselves */
data want1 want2;
	set have2;
	if Selected then output want1;
	            else output want2;
run;

 

Question

 

Is there a better approach (or a better way to use SURVEYSELECT) to get the splitting job done?

 

Considering that this splitting job is a frequently occuring task that can be accomplished with a few lines of code in other languages, there should be a better solution in SAS that doesn't suffer from the following short comings of the solution I found:

 

  1. I couldn't see how to pass a splitting ratio to SURVEYSELECT (samprate doesn't work here). The procedure rather needs the absolute number of obs for a sample. So, in order to apply a splitting ratio of x% (here: 50%), I first have to determine the number of obs in the original data set and compute the number of obs for the subset (here: &nobs.*.5).
  2. To ensure that SURVEYSELECT works with any data set, one would actually need to temporarily rename the variables in the original data set (haven't done this here). For example, one of my data sets already has a variable called "SELECTED", so in this case SURVEYSELECT would issue a WARNING and couldn't compute the column "SELECTED".  This is a frequent problem when I'm working with SAS and I've already written a macro that temporarily renames all variables of a data set to "___var1", "___var2" ... "___varN", but still..
  3. After using SURVEYSELECT, (a) a copy of the original data set (which can be quite large) is produced (instead of just directly creating the 2 subsets) and (b) another whole pass through the newly created data is required to create the subsets.
  4. The input data has to be sorted. That means, if it's (technically) not possible to use a view (like I did here), one has to either sort the original data set or (temporarily) create a copy of it.

Thank you very much!

 

BR

Lu

SAS Moderator
Posts: 30

Re: How to split data into representative subsets

Hi Luhan,

I forwarded your post to my colleagues. Several replies stated that you can use the SAMPRATE in PROC SURVEYSELECT. Another suggested if you have access to High Performance procedures to try PROC HPSAMPLE. And the final suggestion was this blog:

https://blogs.sas.com/content/sastraining/2017/01/23/stratified-random-sample-whats-efficient/

I hope this helps,

theresa

 

 

Ask a Question
Discussion stats
  • 1 reply
  • 136 views
  • 0 likes
  • 2 in conversation