Re: Bootstrapping Data with Several Rows Per Observation

davidsmarch1 · Posted 12-14-2020 08:45 PM

Hello all and thanks ahead of time for any help.

I have a dataset with 117 subjects containing a variable (yt) that changes over time. Each trial can last anywhere from 20-90 samples (rows) depending on how long the subject takes to respond. The data looks like this:

subject	race	target	distractor	time	X	Y	yt
1	asian	calm	dang	0.02	0.00098	0.05097	-0.00112
1	asian	calm	dang	0.04	0.00192	0.05237	-0.00221
1	asian	calm	dang	0.06	0.00255	0.05625	-0.00292
1	asian	calm	dang	0.08	0.00379	0.06979	-0.00433
1	asian	calm	dang	0.1	0.00525	0.08874	-0.00601
1	asian	calm	dang	0.12	0.00598	0.10097	-0.00688
1	asian	calm	dang	0.14	0.00604	0.11558	-0.00697
1	asian	calm	dang	0.16	0.00573	0.13953	-0.00673
1	asian	calm	dang	0.18	0.00567	0.16903	-0.00701
1	asian	calm	dang	0.2	0.00662	0.19264	-0.00854
1	asian	calm	dang	0.22	0.00772	0.23229	-0.01042
1	asian	calm	dang	0.24	0.00849	0.27722	-0.01157
1	asian	calm	dang	0.26	0.009	0.30805	-0.0122

You can see that yt changes over time. There are also 3 races, 2 targets, and 2 distractors, and I'll end up doing the analysis below for each combination.

I need to know at which times yt differs from 0, so I'm running a glm regressing yt against 0 at each time point and using ods output to save the parameter estimates.

/* save the parameter estimates from each BY group to data set acd */
ods output ParameterEstimates = acd (keep = time Estimate
StdErr tValue Probt Cond rename=(Dependent=Cond));
proc glm;
model yt = / solution ss3;
by time;
run;

The parameter estimates dataset now contains 90 rows of data that look like the below (I cut it at 6):

time	Cond	Estimate	StdErr	tValue	Probt
0.02	yt	-0.000164850	0.00003056	-5.39	<.0001
0.04	yt	-0.000226894	0.00010429	-2.18	0.0316
0.06	yt	-0.000319427	0.00019644	-1.63	0.1066
0.08	yt	-0.000488531	0.00035148	-1.39	0.1672
0.10	yt	-0.000739901	0.00059421	-1.25	0.2156
0.12	yt	-0.001051132	0.00081990	-1.28	0.2024

What I want to do is bootstrap subsamples of participants from the original dataset: draw random subsets of a given size (say, 25%, 50%, or 75% of participants) 10000 times, run the same regression on each, and output another ParameterEstimates table each time. Eventually I want to set all the output tables together so that I can look at the distribution of Probt at each time across the bootstrapped samples. I'm not sure, but I may need a variable in each dataset to identify which run each block of 90 rows belongs to. I hope that makes sense. I've attached a sample of the dataset here with 20 participants. Please advise!

ballardw · Posted 12-15-2020 01:35 AM

I can't quite tell whether any of your variables are involved in grouping for your selection criteria or not. I am going to assume so, as that is the slightly more complex version for selecting a sample.

Your "random sample sizes of the original set (say, 25%, 50%, 75% of participants) 10000 times" can be accomplished with Proc Surveyselect, using the Reps= (replicates) option to generate multiple samples with otherwise identical options.

Here is an example using the SASHELP.Class data set. It is small enough that you can compare the result to the basic set relatively easily.

/* sort data so we can use Strata
   based on sex later
*/
proc sort data=sashelp.class
   out=work.class;
   by sex;
run;

proc surveyselect data=work.class
    out=work.sample
    samprate=.25  reps=10;
    strata sex;
run;

The Samprate in the example selects a 25% sample from each level of the Strata variable Sex. Reps=10 says to create 10 samples. A variable named Replicate is added to the output to indicate which replicate each record comes from.

If you wanted to run a separate analysis for each level of Sex and 10 separate regressions you would use a

By Sex Replicate;

in the regression.
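
A minimal sketch of what that could look like with the work.sample set from the example above. The response variable (weight), the intercept-only model, and the output name work.est are only assumptions for illustration; substitute your own model.

```sas
/* BY-group processing needs the data sorted by the BY variables */
proc sort data=work.sample;
   by sex replicate;
run;

/* capture the parameter estimates for every Sex x Replicate group */
ods output ParameterEstimates=work.est;
proc glm data=work.sample;
   by sex replicate;
   model weight = / solution;   /* intercept-only model, illustrative */
run;
quit;
```

The BY variables are carried into work.est, so each row of estimates is labeled with the Sex level and Replicate it came from.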

If you don't need strata, drop the Strata statement. (Strata can be defined by several variables, but the input then needs to be sorted by all of them.)

I suspect that Race, Target and Distractor would be CLASS variables and used in your model to get comparisons between the levels (or combinations).

If you intend to combine any of these sets, I would suggest making sure there is a variable indicating the source. The indsname option at the time you combine sets can make this easy. Example:

data junk;
   set sashelp.class
       sashelp.cars (obs=20)
       indsname=ds
   ;
   source=ds;
run;

Not a particularly useful set, but the object is to show that a variable named Source is added to the data and holds the name of the data set each record came from.

davidsmarch1 · Posted 12-15-2020 10:28 AM

The subject variable is the grouping criterion. There are 117 subjects, and each of their 6 trials (cond) is many rows long. I had tried this method before posting, but it doesn't seem to work as you describe. When I use the code:

				proc surveyselect data=gomp_bar_reorder_manyrows out=bootsample
				samprate = .25 rep = 10;
				strata subject;
				run;

What I end up with is 25% of each subject's rows of data randomly pulled into a replicate. That is, the original condition, which went time .02, .04, .06, .08..., now goes time .04, .12, .14, .20. It didn't keep the subject together; it simply pulled out 25% of their data. I need to pull entire subjects, not samples within each subject. I want to take, say, 25% of participants, not 25% of a participant's data.

ballardw · Posted 12-15-2020 12:59 PM

You did not describe very clearly what you wanted in the first place. Do you want subjects selected? That would mean making a data set with just the subjects, one record per subject, running the selection on that, and then joining the selections back to the observation data.

An example of joining selected data back to your original:

Proc sql;
   create table want as
   select b.* , a.replicate 
   from selected as a
        left join
        full_data as b
        on a.id =b.id
   ;
quit;

This assumes the survey select output data set was created from a set with one record per Id.
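
Putting those steps together, a rough sketch might look like the following. It assumes the data set and key names from the earlier posts (gomp_bar_reorder_manyrows, subject); the work.* data set names and the seed value are made up for illustration, and the 25% rate and 10 replicates are only placeholders.

```sas
/* 1. one record per subject */
proc sort data=gomp_bar_reorder_manyrows(keep=subject)
          out=work.subjects nodupkey;
   by subject;
run;

/* 2. draw 10 replicates, each a 25% simple random sample of subjects */
proc surveyselect data=work.subjects
    out=work.selected
    method=srs samprate=.25 reps=10
    seed=12345;                      /* arbitrary seed, for repeatability */
run;

/* 3. bring back every row of each selected subject,
      tagged with the replicate it was drawn into */
proc sql;
   create table work.bootsample as
   select a.replicate, b.*
   from work.selected as a
        inner join
        gomp_bar_reorder_manyrows as b
        on a.subject = b.subject
   order by a.replicate, b.time;
quit;

/* 4. same intercept-only model, once per replicate and time point */
ods output ParameterEstimates=work.boot_est;
proc glm data=work.bootsample;
   by replicate time;
   model yt = / solution;
run;
quit;
```

Replicate and time come through as columns in work.boot_est, so the estimates from every run are already stacked in one table and identified by run.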

davidsmarch1 · Posted 12-15-2020 01:05 PM

I thought I described it pretty clearly. Each subject has 6 trials. Each trial has up to 90 rows of data, one row for each time point (see the original post and sample data).

I want to sample subjects, not rows within subjects. I need to keep the entirety of a trial together in the sampling. I can't have it taking 25% of a subject's trials; I need 25% of the subjects, and each subject's entire length of trials needs to be retained. It's not sufficient to simply reduce each subject to one row and sample from that, which I think is what you are proposing. I need to keep the entire trial coherent and sample participants, not rows within a participant.
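
For reference, the Proc Surveyselect documentation describes a SAMPLINGUNIT statement (alias CLUSTER) that treats groups of rows as the unit being sampled. A minimal sketch, assuming that statement is available in the installed SAS/STAT release and that the output keeps all rows of each selected subject; the output name, seed, rate, and replicate count are placeholders:

```sas
/* sample whole subjects; every row of a selected subject is kept */
proc surveyselect data=gomp_bar_reorder_manyrows
    out=work.bootsample
    method=srs samprate=.25 reps=10
    seed=12345;                      /* arbitrary seed, for repeatability */
   samplingunit subject;
run;

/* sort before any BY-group regression on replicate and time */
proc sort data=work.bootsample;
   by replicate time;
run;
```

If this works as assumed, the time sequence within each trial stays intact, because selection happens at the subject level rather than the row level.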
