Re: Question about proc survey select

deleted_user · Posted 08-31-2009 02:01 PM

I have a data set with 15 observation hours for each of my subjects. I'm trying to use survey select to generate new data sets with decreasing numbers of hours, that I'm then comparing to the total hours. (i.e. what is the correlation between 14 and 15 hours? 13 and 15? 12 and 15? etc.) I have 2 strata, basically age and subject. My problem is that I'd like to keep the same set of hours for each subject, and I can't figure out how to do that. For example, when I generate a set containing 3 hours out of the 15, if the hours for subject 1 are 8, 10, and 12 then I want the hours for subject 2 (and all others) to also be 8, 10 and 12. Is there any way to do this?

data_null__ · Posted 09-01-2009 10:41 AM

> I have a data set with 15 observation hours for each
> of my subjects.

Does the data set produced by the following code model your data?
[pre]
proc plan seed=618029071;
factors
subjid = 10 ordered
agegroup = 1 of 3 random
time = 15 ordered
y1 = 1 of 200
/ noprint;
treatments y2=1 of 50 y3=1 of 30;
output out=plan;
run;
quit;
proc print;
run;
[/pre]

> I'm trying to use survey select to
> generate new data sets with decreasing numbers of
> hours, that I'm then comparing to the total hours.
> (i.e. what is the correlation between 14 and 15
> hours? 13 and 15? 12 and 15? etc.) I have 2 strata,
> basically age and subject. My problem is that I'd
> like to keep the same set of hours for each subject,
> and I can't figure out how to do that. For example,
> when I generate a set containing 3 hours out of the
> 15, if the hours for subject 1 are 8, 10, and 12 then
> I want the hours for subject 2 (and all others) to
> also be 8, 10 and 12. Is there any way to do this?

I don't think SURVEYSELECT is going to work well for this. SURVEYSELECT selects observations from data sets. It sounds like you want to select levels of a variable (TIME). In your example, that would be:

[pre]where time in(8,10,12) [/pre]
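As a minimal sketch, assuming the PLAN data set created above, the subset with hours 8, 10, and 12 for every subject could be pulled like this (the output name SUB3 is only for illustration):

[pre]
/* Keep the same three hours for every subject and age group */
data sub3;
   set plan;
   where time in (8,10,12);
run;
[/pre]

Because the condition is on TIME alone, every subject ends up with the same set of hours.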

There are a number of ways to select (k of n) values at random. One is CALL RANPERK:

[pre]
CALL RANPERK Routine
Randomly permutes the values of the arguments, and returns a permutation of k out of n values [/pre]
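
For instance, a rough sketch that picks 3 of the 15 hours; the seed value and the names H1-H15 are made up for illustration. After the call, the first K elements of the array hold the selected values:

[pre]
data _null_;
   array h[15] h1-h15 (1:15);   /* the 15 candidate hours */
   seed = 20090901;             /* hypothetical seed */
   k = 3;
   call ranperk(seed, k, of h[*]);
   put 'Selected hours: ' h1-h3;
run;
[/pre]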

Another is PROC PLAN, using the FACTORS statement. I used this above to create the sample data.

[pre]
name=m < OF n > < selection-type >

where
name
is a valid SAS name. This gives the name of a factor in the design.

m
is a positive integer that gives the number of values to be selected. If n is specified, the value of m must be less than or equal to n.

n
is a positive integer that gives the number of values to be selected from.
[/pre]
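
A hedged sketch of that approach, assuming the PLAN data set from the top of this post; the seed and the names KEEPHOURS and SUB3_PLAN are hypothetical. PROC PLAN draws 3 of the 15 levels of TIME, and a subquery then keeps only those hours for every subject:

[pre]
/* Draw 3 of 15 hours at random; one observation per selected hour */
proc plan seed=12345;
   factors time = 3 of 15 random / noprint;
   output out=keephours;
run;
quit;

/* Keep only the selected hours, the same ones for all subjects */
proc sql;
   create table sub3_plan as
   select *
   from plan
   where time in (select time from keephours);
quit;
[/pre]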

There are others; these are the ones I'm most familiar with.

If I am correct, the details of which method(s) might be most appropriate depend on the output you desire. You mentioned correlation. If you describe (with sample data) how the data should look to produce the analysis, that will help refine the solution.

Also, do you want to do this for all (n of m) subsets, and do you want replication, that is, several replications of subsets of size n?

deleted_user · Posted 09-02-2009 04:53 PM

Thanks for your help.
This is what my data looks like (very similar to what you posted):
[pre]
proc plan;
factors
age = 3 ordered
subjid = 5 ordered
hour = 15 ordered
y1 = 1 of 200
/ noprint;
output out=plan;
run;
quit;
proc print;
run;
[/pre]

I'm having some trouble getting sample data output to look like what I want as the end result, but this is close:
[pre]
proc plan;
factors
age = 3 ordered
subjid = 5 ordered
hour = 14 ordered
y1 = 1 of 200
/ noprint;
output out=sample1;
run;
quit;
proc print;
run;
[/pre]

What I actually want is a random selection of 14 out of the original 15 hours, instead of hours 1 to 14. My snag is that I want the same hours for each age and subject, so that the output would look like the above if hour 15 was the hour that was randomly chosen to be thrown out.

Another way I've thought to do this is to copy the original data 10 times and add a replication column, so that my input data would look something like:
[pre]
proc plan;
factors
age = 3 ordered
subjid = 5 ordered
rep = 10 ordered
hour = 15 ordered
/ noprint;
output out=sample2;
run;
quit;
proc print;
run;
[/pre]

I would then use a random number generator to select the hours that I want to keep and have SAS delete all the other hours with:
[pre]
data sample2;
set sample2;
if rep = 1 and hour = 10 then delete;
if rep = 2 and hour = 5 then delete;
if rep = 3 and hour = 14 then delete;
run;
[/pre]

(Obviously, with the real data I would continue so it included all 10 replicates.)
The problem with this is that it gets very time consuming, since I want to do this not just for 14 hours but also for 13 hours, 12 hours, 11 hours, and so on, all the way down to 1 hour.

I hope that made things clearer and not more confusing. If you have advice for either of these methods I'd really appreciate it.

data_null__ · Posted 09-02-2009 05:58 PM

I think the following may give you what you want. I used RANPERK to get a list of hours to KEEP. I generated IF statements with a data step because it seemed easy enough to do it that way. Writing wallpaper with code.

This produces a sample for each K from 14 down to 1. The variable K will act as a BY variable in your analysis.

Let me know if this does what you want.

[pre]
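dm 'clear log; clear output;';
/* Sample data: 3 ages x 5 subjects x 15 hours, one random Y1 per row */
proc plan seed=767318578;
   factors
      age = 3 ordered
      subjid = 5 ordered
      hour = 15 ordered
      y1 = 1 of 200
      / noprint;
   output out=plan;
   run;
   quit;
*proc print;
   run;
/* Temporary file that will hold the generated statements */
filename FT85F001 temp;
data hours;
   file FT85F001;
   seed=1046482356;
   array _h[15] (1:15);
   /* For each K, write: k=K; if hour in(<K random hours>) then output; */
   do k = dim(_h)-1 to 1 by -1;
      put +3 k= ';';
      put +3 'if hour in(' @;
      /* Permute so the first K elements are a random K-of-15 selection */
      call ranperk(seed,k,of _h[*]);
      do _n_ = 1 to k;
         put _h[_n_] 3. @;
         end;
      put ') then output;';
      end;
   run;
/* Apply the generated statements to every row of PLAN */
data hoursV / view=hoursV;
   set plan;
   %inc FT85F001 / source2;
   run;
/* HOURS is replaced here by the stacked samples, largest K first */
proc sort data=hoursV out=hours;
   by descending k;
   run;
proc print data=_last_(obs=100);
   run;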
[/pre]
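
An alternative sketch that avoids generating code: the selected hours for each K are stored in a small lookup data set and joined back to PLAN. The names HOURKEYS and HOURS2 are hypothetical, and this is offered only as one possible variation under the same assumptions as the program above, not as a replacement for it.

[pre]
/* One row per (k, hour) pair that should be kept */
data hourkeys(keep=k hour);
   array _h[15] (1:15);
   seed = 1046482356;
   do k = dim(_h)-1 to 1 by -1;
      call ranperk(seed, k, of _h[*]);
      do i = 1 to k;
         hour = _h[i];
         output;
      end;
   end;
run;

/* Every age and subject gets the same hours within each K */
proc sql;
   create table hours2 as
   select b.k, a.*
   from plan as a inner join hourkeys as b
     on a.hour = b.hour
   order by b.k desc, a.age, a.subjid, a.hour;
quit;
[/pre]

As with HOURS, the variable K in HOURS2 can serve as the BY variable in later steps.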

deleted_user · Posted 09-03-2009 01:33 PM

Perfect! Thank you so much for your help with this.
