Random sample patient record with predefined distribution rate from each year

Sujithpeta · Posted 03-12-2020 03:09 PM

Hey,

Have:

Patient membership records spanning 2015 to 2018; not every patient would have all years of membership enrollment.

Want:

I want to get a random sample of patients coming from each year (2015 to 2018) at 38%, 22%, 19%, and 21% respectively, without repeating the same patient ID.

Is it possible to do all of these in one proc?

Thanks

ChrisNZ · Posted 03-12-2020 05:59 PM

>Is it possible to do all of these in one proc?

I don't think so.

>not every patient would have all years of membership enrollment.

Do you have no preference regarding the percentages of the various patient tenures?


mkeintz · Posted 03-12-2020 10:56 PM

You have not provided a sample data set, so my suggestion is totally untested. I presume you have a data set with ID and YEAR variables (or a date variable from which YEAR can be extracted). Each ID may have any number of records (including zero records) in each year.

You want a random sample (at different sampling rates) for each of 4 years. And if an ID is drawn for one year, it is not eligible to be drawn from another year.

It is conceivable that this is not possible. Consider exactly 100 patients, each with one record in each year. Then your samples of 38%, 22%, 19% and 21% mean you would draw exactly one record from each of the patients. Now imagine that the 38% year (call it year X) is missing for the "last" ID (i.e. that ID is present only in the other 3 years). The random sample size of 38% of 99 is still presumably 38 obs. If your randomization scheme, over the course of the first 99 draws, selects a complete complement for the other years and 37 for year X, then the 100th observation cannot be sampled: it is not available for year X and it is not needed for the other years.

I.e. your data may be pathological enough to make it impossible to get the sample you want, even if all the sampling rates were an identical 25%. This is because the same ID may be present in multiple years, yet is not allowed in more than one stratum (i.e. one year).

This task will probably require some data step coding.
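As a rough illustration of what such data step coding might look like, here is an untested sketch of a greedy approach: shuffle the patient-year rows, then accept a row only if its year still has quota left and its ID has not already been taken. It assumes a data set HAVE with a numeric YEAR and one row per ID and year, reads the percentages as shares of the final sample, and uses a hypothetical overall sample size N_TOTAL; none of these are confirmed in the thread. The names SHUFFLED, WANT, _U and PICKED are made up for the example.

```sas
/* Assumed inputs (not from the thread): data set HAVE with numeric YEAR and */
/* one row per ID and year; N_TOTAL is a hypothetical overall sample size.   */
%let n_total = 100;

/* Step 1: attach a random sort key to every patient-year row. */
data shuffled;
  set have;
  if _n_ = 1 then call streaminit(12345);  /* arbitrary seed */
  _u = rand('uniform');
run;

proc sort data=shuffled;
  by _u;
run;

/* Step 2: walk the shuffled rows; keep a row only if its year still has */
/* quota left and its ID has not been taken for any year yet.            */
data want(keep=id year);
  set shuffled(where=(2015 <= year <= 2018));
  array quota[2015:2018] _temporary_;
  if _n_ = 1 then do;
    declare hash picked();          /* IDs already selected */
    picked.defineKey('id');
    picked.defineDone();
    quota[2015] = round(0.38 * &n_total);
    quota[2016] = round(0.22 * &n_total);
    quota[2017] = round(0.19 * &n_total);
    quota[2018] = round(0.21 * &n_total);
  end;
  if quota[year] > 0 and picked.check() ne 0 then do;
    _rc = picked.add();
    quota[year] = quota[year] - 1;
    output;
  end;
run;

/* Step 3: compare the achieved counts per year with the quotas. */
proc freq data=want;
  tables year;
run;
```

Because the pass is greedy, it can end with some quotas unfilled for exactly the reason described above, so the PROC FREQ output should be checked against the intended counts. A greedy pass is also only one way to randomize; it does not guarantee equal selection probabilities within a year.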


Patrick · Posted 03-13-2020 01:00 AM

"I want to get a random sample of patients coming from each year (2015 to 2018) at 38%, 22%, 19%, and 21% respectively, without repeating the same patient ID."

Does that mean you only consider a specific patient ID for your sample in the first year it appears in your source data? Or does it mean that as long as you haven't selected a specific patient ID in another year, it's still up for grabs for your sample, and you just don't want repeated IDs in your sample?

And depending on your answer:

What does 21% for your last year mean? 21% based on the total rows in your source, or 21% of the source rows for this specific year (and "excluded" IDs counted or not?), or 21% of rows in the sample to be from the last year?

How did you come up with these percentages per year in the first place? Are they based on your current source data, and you just want to end up with the same number of patients per year in your sample?

Here is an attempt to create sample HAVE data for your case. Can you please verify whether this data is suitable?

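/* create sample HAVE data: distinct (year, id) pairs */
data _null_;
  length year id 8;
  /* hash keyed on year+id; multidata:'n' keeps one entry per key */
  dcl hash h1(multidata:'n');
  h1.defineKey('year','id');
  h1.defineData('year','id');
  h1.defineDone();
  call streaminit(2);  /* fixed seed for reproducibility */
  do year=2015 to 2018;  /* years covered by the question */
    _stop=rand('integer',1000,3000);  /* number of draws for this year */
    do _j=1 to _stop;
      id=rand('integer',1,10000);
      _rc=h1.ref();  /* adds the pair only if it is not already present */
    end;
  end;
  h1.output(dataset:'have');
  stop;
run;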

Sujithpeta · Posted 03-13-2020 02:40 PM

The code you shared threw an error.

Here is how the data is structured:

ID  Year
A   2015
A   2016
B   2015
B   2017
B   2018
C   2016
D   2018

In the sample, a patient ID should not repeat within the same year or across years.
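For anyone who wants to experiment with the listing above, it could be loaded with a short DATA step along these lines. This is only a sketch: it assumes ID is a character variable and Year is numeric, which the listing suggests but does not state, and the data set name HAVE is borrowed from the earlier reply.

```sas
/* Load the seven example rows; ID is read as character, YEAR as numeric. */
data have;
  input id $ year;
  datalines;
A 2015
A 2016
B 2015
B 2017
B 2018
C 2016
D 2018
;
run;
```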

The percentages come from a case group whose disease index years are distributed at the rates mentioned.

Does this help? @Patrick

Patrick · Posted 03-13-2020 10:11 PM

The code I've shared works for me as posted. It looks like your SAS version is too old for something in the code.
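The thread does not show the actual error, so the cause is a guess. One candidate is the 'INTEGER' distribution of the RAND function, which to my knowledge is only available from SAS 9.4M5 onward. An untested sketch of the same sample-data step that avoids it, building the integers from RAND('UNIFORM') and FLOOR instead, might look like this. The random stream differs from the posted code, so the generated rows will not be identical.

```sas
/* Variant of the sample-data step without RAND('INTEGER').               */
/* Assumption: the error came from RAND('INTEGER'); this is not confirmed. */
data _null_;
  length year id 8;
  dcl hash h1(multidata:'n');
  h1.defineKey('year','id');
  h1.defineData('year','id');
  h1.defineDone();
  call streaminit(2);
  do year=2015 to 2018;  /* years named in the question */
    /* integer in 1000..3000, both ends included */
    _stop = 1000 + floor(2001*rand('uniform'));
    do _j=1 to _stop;
      /* integer in 1..10000 */
      id = 1 + floor(10000*rand('uniform'));
      _rc = h1.ref();  /* adds the (year, id) pair only if it is new */
    end;
  end;
  h1.output(dataset:'have');
  stop;
run;
```

If this version also fails, the log message would show which statement the installed release does not accept.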

I still don't understand where the percentages would need to be applied and you haven't explained this further/answered my questions.
