BookmarkSubscribeRSS Feed
Tinker_
Fluorite | Level 6

Hi

 

I have the following scenario and wanted to ask what the best way of sampling is:

 

I have a huge dataset of teachers and their reporting (skills). 

Every teacher has to conduct reports about the progress of his / her students. But some teachers will create much larger quantities of reports than others - simply because they have more pupils or are more efficient in writing reports. 

 

Teachers shall be assessed regarding the reporting skills, whereby the way they assess their pupils is taken into account. However, I am not worried about this yet. 

I simply want to find a method of sampling teachers in the best possible way. Say we have a dataset of 20 mil teacher-reports. Some teachers have written wast amount of reports during their lifetime, others only a few. So if I was to sample them randomly, those who have written more reports during their lifetime would be somewhat oversampled. 

I guess I could dedup all teachers, and then take a random sample. 

 

But I do not urgently want to do that due to the sheer size of the dataset. 

 

Is there any better way? I want to streamline the process of sampling evenly.  How can I achieve this?

Many thanks for any thoughts and help. 

3 REPLIES 3
Kurt_Bremser
Super User

First of all, your initial dataset should be sorted in a sensible way (teacher-ID, then date,...), so you can easily merge from it.

At least have an index on teacher-ID defined.

 

Then do

proc sort
  data=reports (keep=teacher_ID)
  out=teachers
  nodupkey
;
by teacher_ID;
run;

and you will have a deduplicated dataset, from which it is easy to create samples.

Suppose you have a UUID-based key, which takes just 16 binary bytes to store, you'll end up with ~16 MB per 1 million teachers.

 

Since all files are sorted by the same key, a merge to get the individual reports is just a sequential scan through the datasets.

mkeintz
PROC Star

Let's say dataset HAVE is sorted by teacherid, where there can be any number of reports by a given teacher.   And you want a sample of, say 3 from each teacher.  This is straightforward by using double  DO loops with SETs inside: 

 

data want;
  do nrpt=1 by 1 until (last.teacherid);
    set have;
    by teacherid;
  end;

  nwant=3;

  do nrpt=nrpt to 1 by -1;
    set have;
    if rand('uniform') <= nwant/nrpt then do;
      nwant=nwant-1;
      output;
    end;
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

What do you mean by: " I want to streamline the process of sampling evenly."

Do you mean select the same number of records for each teacher or select the same proportion such as 10% per teacher?

 

If you select a number of records what do you want done with the teachers that have fewer than that number?

 

In either case I think that proc surveyselect may work. Sort you data by teacher and use teacher as a stratum variable.

Here are examples using the SASHELP.CLASS data set. The first Surveyselect block selects 5 records from each sex, the second selects 25% from each sex.

proc sort data=sashelp.class out=work.classsort;
   by sex;
run;

proc surveyselect data=work.classsort
     out=work.numberperstrat
     sampsize=5
     ;
   stratum sex;
run;


proc surveyselect data=work.classsort
     out=work.rateperstrat
     samprate=.25
     ;
   stratum sex;
run;

There are additional options you use to control minimum or maximum samples per strata. Also the output data set contains a selection probability and the sample design weight for using that record and by default all of the other variables in the data set are included in the output.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 470 views
  • 2 likes
  • 4 in conversation