I have a dataset with about 5688 records of subjects that have needed sedations for a procedure. The records have information such as the procedure type, the type of physician/professional administering the sedation, etc...So some doctors have more records because they do more procedures.
I want to simulate (or re-create) this dataset, so that there are 146164 records or as close to this number of records as possible. I want to maintain the 'weighting' of records entered by the sedation doctor.
Here's an example data. Let me describe what I desire by using this small sample. So say I had these 10 records, I want to have an output dataset of 146 records, resampled from this dataset of 10. I want to keep the 'weighting' or proportion of the physician type where the Anesthesiologist accounts for a majority of the records, in this example 7 out of 10 (70%).
Procedure Sedative route Physician type
Laceration/suture IV Nurse Anesthesiologist
Dental procedure IV Dental surgeon
Dental surgery IV Dental surgeion
MRI IV Anesthesiologist
Gastro endoscopy - U IV Anesthesiologist
Gastro endoscopy – L IV Anesthesiologist
Other IV Anesthesiologist
Lumbar puncture IV Anesthesiologist
MRI IV Anesthesiologist
CT scan IV Anesthesiologist
Dmcav6 wrote:
Here's an example data. Let me describe what I desire by using this small sample. So say I had these 10 records, I want to have an output dataset of 146 records, resampled from this dataset of 10. I want to keep the 'weighting' or proportion of the physician type where the Anesthesiologist accounts for a majority of the records, in this example 7 out of 10 (70%).
Procedure Sedative route Physician type
Laceration/suture IV Nurse Anesthesiologist
Dental procedure IV Dental surgeon
Dental surgery IV Dental surgeion
MRI IV Anesthesiologist
Gastro endoscopy - U IV Anesthesiologist
Gastro endoscopy – L IV Anesthesiologist
Other IV Anesthesiologist
Lumbar puncture IV Anesthesiologist
MRI IV Anesthesiologist
CT scan IV Anesthesiologist
Resampling can't give you exactly 70% anesthesiologist, it will give you some number that is a random deviation from 70%.
You could randomly pick integers 1 to 10 until you have 146 observations.
It's not clear to me why you would do this.
data randomselect; do I=1 to 146; randominteger=floor(uniform(0)*10); output; end; run;
The reason why I am doing this is because the physicians are entering their data for the sedations that they perform. I do not know if they are entering selected records (ie. positive outcomes. Are they entering only sedation records that the subject does not experience an adverse event, theres more to the dataset that I wasn't explaining).
But, I do know the total number of sedation each physician performs each year.
Ultimately, I want to have the total number of records be equal to the number of sedations performed in the year.
So in my 10 sample dataset, I know the procedure record that the physician enters. I also know that I want to have a total of 146 records because that is the sum of the total number of sedations those physicians perform. I want to resample those 10 records to obtain 146 records. My denominator will be 146 records, and I want to summarize/analyze this.
I hope this helps you follow.
Thank you.
@Dmcav6 wrote:
I do not know if they are entering selected records (ie. positive outcomes. Are they entering only sedation records that the subject does not experience an adverse event, theres more to the dataset that I wasn't explaining).
If any of this is true, then random sampling isn't going to give you good data quality that represents the data that actually should have been recorded.
I'm not going to go into the validity of this approach - I suspect there's many, but I'm also certain that PROC SURVEYSELECT can do this. You can specify a dataset within the samplesize that shows the number required for each physician.
Your better bet is to access the administrative data base that should have billing codes and adverse advent billing codes. It's much harder but it's a likely better data source.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.