Re: Simulate (or re-create) a dataset

Dmcav6 · Posted 09-06-2017 01:24 PM

I have a dataset with about 5688 records of subjects that have needed sedations for a procedure. The records have information such as the procedure type, the type of physician/professional administering the sedation, etc...So some doctors have more records because they do more procedures.

I want to simulate (or re-create) this dataset, so that there are 146164 records or as close to this number of records as possible. I want to maintain the 'weighting' of records entered by the sedation doctor.

Here's an example data. Let me describe what I desire by using this small sample. So say I had these 10 records, I want to have an output dataset of 146 records, resampled from this dataset of 10. I want to keep the 'weighting' or proportion of the physician type where the Anesthesiologist accounts for a majority of the records, in this example 7 out of 10 (70%).

Procedure Sedative route Physician type

Laceration/suture IV Nurse Anesthesiologist

Dental procedure IV Dental surgeon

Dental surgery IV Dental surgeion

MRI IV Anesthesiologist

Gastro endoscopy - U IV Anesthesiologist

Gastro endoscopy – L IV Anesthesiologist

Other IV Anesthesiologist

Lumbar puncture IV Anesthesiologist

MRI IV Anesthesiologist

CT scan IV Anesthesiologist

PaigeMiller · Posted 09-06-2017 01:30 PM

Dmcav6 wrote:

Here's an example data. Let me describe what I desire by using this small sample. So say I had these 10 records, I want to have an output dataset of 146 records, resampled from this dataset of 10. I want to keep the 'weighting' or proportion of the physician type where the Anesthesiologist accounts for a majority of the records, in this example 7 out of 10 (70%).

Procedure Sedative route Physician type

Laceration/suture            IV Nurse Anesthesiologist

Dental procedure            IV                            Dental surgeon

Dental surgery IV                            Dental surgeion

MRI IV                            Anesthesiologist

Gastro endoscopy - U IV                            Anesthesiologist

Gastro endoscopy – L IV                            Anesthesiologist

Other IV                            Anesthesiologist

Lumbar puncture IV                            Anesthesiologist

MRI IV                            Anesthesiologist

CT scan IV                            Anesthesiologist

Resampling can't give you exactly 70% anesthesiologist, it will give you some number that is a random deviation from 70%.

You could randomly pick integers 1 to 10 until you have 146 observations.

It's not clear to me why you would do this.

data randomselect;
    do I=1 to 146;
          randominteger=floor(uniform(0)*10);
          output;
    end;
run;

--
Paige Miller

Dmcav6 · Posted 09-06-2017 01:43 PM

The reason why I am doing this is because the physicians are entering their data for the sedations that they perform. I do not know if they are entering selected records (ie. positive outcomes. Are they entering only sedation records that the subject does not experience an adverse event, theres more to the dataset that I wasn't explaining).

But, I do know the total number of sedation each physician performs each year.

Ultimately, I want to have the total number of records be equal to the number of sedations performed in the year.

So in my 10 sample dataset, I know the procedure record that the physician enters. I also know that I want to have a total of 146 records because that is the sum of the total number of sedations those physicians perform. I want to resample those 10 records to obtain 146 records. My denominator will be 146 records, and I want to summarize/analyze this.

I hope this helps you follow.
Thank you.

PaigeMiller · Posted 09-06-2017 01:49 PM

@Dmcav6 wrote:

I do not know if they are entering selected records (ie. positive outcomes. Are they entering only sedation records that the subject does not experience an adverse event, theres more to the dataset that I wasn't explaining).

If any of this is true, then random sampling isn't going to give you good data quality that represents the data that actually should have been recorded.

--
Paige Miller

Reeza · Posted 09-06-2017 02:25 PM

I'm not going to go into the validity of this approach - I suspect there's many, but I'm also certain that PROC SURVEYSELECT can do this. You can specify a dataset within the samplesize that shows the number required for each physician.

Your better bet is to access the administrative data base that should have billing codes and adverse advent billing codes. It's much harder but it's a likely better data source.

Simulate (or re-create) a dataset