10-09-2017 04:44 AM
I have the following scenario and wanted to ask what the best way of sampling is:
I have a huge dataset of teachers and their reporting (skills).
Every teacher has to conduct reports about the progress of his / her students. But some teachers will create much larger quantities of reports than others - simply because they have more pupils or are more efficient in writing reports.
Teachers shall be assessed regarding the reporting skills, whereby the way they assess their pupils is taken into account. However, I am not worried about this yet.
I simply want to find a method of sampling teachers in the best possible way. Say we have a dataset of 20 mil teacher-reports. Some teachers have written wast amount of reports during their lifetime, others only a few. So if I was to sample them randomly, those who have written more reports during their lifetime would be somewhat oversampled.
I guess I could dedup all teachers, and then take a random sample.
But I do not urgently want to do that due to the sheer size of the dataset.
Is there any better way? I want to streamline the process of sampling evenly. How can I achieve this?
Many thanks for any thoughts and help.
10-09-2017 06:34 AM
First of all, your initial dataset should be sorted in a sensible way (teacher-ID, then date,...), so you can easily merge from it.
At least have an index on teacher-ID defined.
proc sort data=reports (keep=teacher_ID) out=teachers nodupkey ; by teacher_ID; run;
and you will have a deduplicated dataset, from which it is easy to create samples.
Suppose you have a UUID-based key, which takes just 16 binary bytes to store, you'll end up with ~16 MB per 1 million teachers.
Since all files are sorted by the same key, a merge to get the individual reports is just a sequential scan through the datasets.
10-09-2017 03:45 PM
Let's say dataset HAVE is sorted by teacherid, where there can be any number of reports by a given teacher. And you want a sample of, say 3 from each teacher. This is straightforward by using double DO loops with SETs inside:
data want; do nrpt=1 by 1 until (last.teacherid); set have; by teacherid; end; nwant=3; do nrpt=nrpt to 1 by -1; set have; if rand('uniform') <= nwant/nrpt then do; nwant=nwant-1; output; end; end; run;
10-10-2017 04:31 PM
What do you mean by: " I want to streamline the process of sampling evenly."
Do you mean select the same number of records for each teacher or select the same proportion such as 10% per teacher?
If you select a number of records what do you want done with the teachers that have fewer than that number?
In either case I think that proc surveyselect may work. Sort you data by teacher and use teacher as a stratum variable.
Here are examples using the SASHELP.CLASS data set. The first Surveyselect block selects 5 records from each sex, the second selects 25% from each sex.
proc sort data=sashelp.class out=work.classsort; by sex; run; proc surveyselect data=work.classsort out=work.numberperstrat sampsize=5 ; stratum sex; run; proc surveyselect data=work.classsort out=work.rateperstrat samprate=.25 ; stratum sex; run;
There are additional options you use to control minimum or maximum samples per strata. Also the output data set contains a selection probability and the sample design weight for using that record and by default all of the other variables in the data set are included in the output.