Here is the report I wrote based on the various approaches discussed in this thread. The intended audience is the other programmers I work with.
Benchmarking Comparison of SAS Procedures
Various SAS procedures were tested to determine the best method to match the diagnosis and procedure codes in patient data files to lists of codes to be flagged for various medical conditions of interest.
There were 3 main approaches: Proc SQL merge, data step arrays, and data step hash. There were also some minor variants of these approaches that were tested as well. The input patient data file was about 78.5GB. The output file varies by approach.
Proc SQL merge (see Appendix A for syntax)
The lists of codes to be flagged were imported from Excel tab-by-tab, and a flag variable was added corresponding to each code. The imported lists were then combined into a single SAS file. This file was matched on all 27 proc variables and 26 DX variables in the data file using JOIN. Variants were tested with various Where statements (for example, requiring that the proc/DX variables in the data file must not be blank, in an effort to avoid the time needed to match blanks to the flag file). The Where statements had little to no effect. The resulting output file was approximately 133GB.
IMPORTANT NOTE: During testing, some slight discrepancies in flag frequencies (< 0.5%) were discovered between the Proc SQL and Array versions. In each occurrence, the Proc SQL version failed to flag a small number of cases that were flagged in the array version. Manual review confirmed these cases should have been flagged. These discrepancies were apparently due to SAS running out of memory during the merge, even though less than half of the computer’s RAM was in use. SAS did not provide any error messages or even warnings about this in the log. By default, SAS only uses 2GB of RAM. Increasing the RAM corrected the frequency issue, but did not have a notable impact on processing speed. See Appendix D for discussion of this issue and how to correct it.
Here are typical results:
real time 28:08:32.61
cpu time 7:24:39.17
Data step Array (see Appendix B for syntax)
The lists of codes to be flagged are imported into list variables. Each tab has 2 list variables, one for proc and one for DX codes. An array is then used to compare all proc variables in the data file to the proc list variables, and a similar array is used for DX variables. The resulting output file is approximately 159GB.
real time 13:07:03.42
cpu time 12:53:12.39
A variant was tested that included a command to delete records for which all flags = 0. This variant saved over 2 hours of processing time. The resulting output file was approximately 142GB.
real time 10:47:45.35
cpu time 10:43:20.82
Data step Hash (see Appendix C for syntax)
The lists of codes to be flagged are imported from Excel tab-by-tab, and a flag variable is added corresponding to each code. The imported lists are then combined into a single SAS file. A hash is created for proc codes and another is created for DX codes. An array then uses the proc hash to compare all proc variables in the data file to the proc list variables, and a similar array is used for DX variables. The resulting output file was approximately 320GB.
real time 4:06:06.53
cpu time 39:04.64
NOTE: There were some discrepancies in the frequency output (hash vs array), similar to (though less severe than) the discrepancies in the original Proc SQL output. Again, this seems to be related to SAS running out of memory. However, in this case the memory issues probably occurred during the step to find the max of each flag, rather than during the merge itself.
DISCUSSION
The clear winner in terms of real processing time was the hash approach. However, there are some drawbacks to this approach as well. The output file size was more than twice as large as the other two approaches. This not only requires more hard drive space, but it also results in longer processing times for subsequent steps. For example, the next step in the testing SAS program was a Proc SQL designed to find the max of each flag variable grouped by patient ID. This step took much longer than it did with the other approaches – so much so that it not only negated the increased processing speed for the flag merge, it actually took longer overall than the Proc SQL approach! On the other hand, this may not be an issue in other programs with less-complicated downstream steps.
While the array approach was nearly three times faster than the Proc SQL approach in real time, consideration must be given to CPU time as well. In this metric, Proc SQL was more than 30% faster. The Proc SQL approach requires extensive reading and writing of data, and the very long overall processing time is due to the relatively slow I/O capabilities of our virtual machines. At my previous position, I used a computer where the I/O speeds were in the range of 8-10 times higher. On a machine with similar transfer speeds, the Proc SQL version would have been faster by more than three hours.
In summary, there is no “one size fits all” answer. Each approach has advantages and disadvantages. Familiarity with the data, an understanding of the capabilities of the computer being used, and knowledge of SAS procedures and their inner workings are the keys to choosing the best approach for large processing tasks.
APPENDIX A – Syntax for Proc SQL approach
/*Section 3.2 -- Merge proc codes and DX codes from hyst_prepost_ICD_CPT with flags from
HystProlapseCode_comb_mx, among index cases.*/
PROC SQL;
Create table temp.hyst_prepost_ICD_CPT_flags_CPU
as Select distinct a.member_id, a.episode_id, a.claim_id, a.dt_svc_from,
a.dt_svc_end, a.svc_from_dt, a.svc_end_dt, a.svc_from_dt_ICD, a.icd_dx1_prof,
a.comp_icd_dx1, a.comp_icd_dx2, a.comp_icd_dx3, a.comp_icd_dx4, a.comp_icd_dx5,
a.comp_icd_dx6, a.comp_icd_dx7, a.comp_icd_dx8, a.comp_icd_dx9, a.comp_icd_dx10,
a.comp_icd_dx11, a.comp_icd_dx12, a.comp_icd_dx13, a.comp_icd_dx14, a.comp_icd_dx15, a.comp_icd_dx16, a.comp_icd_dx17, a.comp_icd_dx18, a.comp_icd_dx19, a.comp_icd_dx20,
a.comp_icd_dx21, a.comp_icd_dx22, a.comp_icd_dx23, a.comp_icd_dx24, a.comp_icd_dx25, a.cpt, a.cpt_prof, a.icd_pr1, a.icd_pr2, a.icd_pr3, a.icd_pr4, a.icd_pr5, a.icd_pr6, a.icd_pr7, a.icd_pr8, a.icd_pr9, a.icd_pr10, a.icd_pr11, a.icd_pr12, a.icd_pr13, a.icd_pr14, a.icd_pr15, a.icd_pr16, a.icd_pr17, a.icd_pr18, a.icd_pr19, a.icd_pr20, a.icd_pr21, a.icd_pr22, a.icd_pr23, a.icd_pr24, a.icd_pr25, b.HEDIS_proc_code, b.HEDIS_DX_code, b.Code_Description,
b.mFLAG_cpt_hyst_ab,
b.mFLAG_cpt_hyst_all,
b.mFLAG_Dxs_Prolapse,
[FLAG VARIABLE LIST TRUNCATED]
/*These variables are needed later*/
a.year, a.flg_age_0_17, a.member_birth_dt, a.val_age, a.zip_cd,
a.flg_cond_hysterectomy, a.hyst_ep_dt_beg, a.healthcare_vis_dt, a.healthcare_vis_days,
a.drg, a.clm_type, a.e_clm_type, a.payer, a.provider_npi, a.provider_splty,
a.provider_type_cd, a.facility_npi
From temp.hyst_prepost_ICD_CPT a JOIN temp.HystProlapseCode_comb_mx b
On a.cpt = b.HEDIS_proc_code OR a.cpt_prof = b.HEDIS_proc_code OR
a.icd_pr1 = b.HEDIS_proc_code OR
a.icd_pr2 = b.HEDIS_proc_code OR
a.icd_pr3 = b.HEDIS_proc_code OR
a.icd_pr4 = b.HEDIS_proc_code OR
a.icd_pr5 = b.HEDIS_proc_code OR
a.icd_pr6 = b.HEDIS_proc_code OR
a.icd_pr7 = b.HEDIS_proc_code OR
a.icd_pr8 = b.HEDIS_proc_code OR
a.icd_pr9 = b.HEDIS_proc_code OR
a.icd_pr10 = b.HEDIS_proc_code OR
a.icd_pr11 = b.HEDIS_proc_code OR
a.icd_pr12 = b.HEDIS_proc_code OR
a.icd_pr13 = b.HEDIS_proc_code OR
a.icd_pr14 = b.HEDIS_proc_code OR
a.icd_pr15 = b.HEDIS_proc_code OR
a.icd_pr16 = b.HEDIS_proc_code OR
a.icd_pr17 = b.HEDIS_proc_code OR
a.icd_pr18 = b.HEDIS_proc_code OR
a.icd_pr19 = b.HEDIS_proc_code OR
a.icd_pr20 = b.HEDIS_proc_code OR
a.icd_pr21 = b.HEDIS_proc_code OR
a.icd_pr22 = b.HEDIS_proc_code OR
a.icd_pr23 = b.HEDIS_proc_code OR
a.icd_pr24 = b.HEDIS_proc_code OR
a.icd_pr25 = b.HEDIS_proc_code OR
a.icd_dx1_prof = b.HEDIS_DX_code OR
a.comp_icd_DX1 = b.HEDIS_DX_code OR
a.comp_icd_DX2 = b.HEDIS_DX_code OR
a.comp_icd_DX3 = b.HEDIS_DX_code OR
a.comp_icd_DX4 = b.HEDIS_DX_code OR
a.comp_icd_DX5 = b.HEDIS_DX_code OR
a.comp_icd_DX6 = b.HEDIS_DX_code OR
a.comp_icd_DX7 = b.HEDIS_DX_code OR
a.comp_icd_DX8 = b.HEDIS_DX_code OR
a.comp_icd_DX9 = b.HEDIS_DX_code OR
a.comp_icd_DX10 = b.HEDIS_DX_code OR
a.comp_icd_DX11 = b.HEDIS_DX_code OR
a.comp_icd_DX12 = b.HEDIS_DX_code OR
a.comp_icd_DX13 = b.HEDIS_DX_code OR
a.comp_icd_DX14 = b.HEDIS_DX_code OR
a.comp_icd_DX15 = b.HEDIS_DX_code OR
a.comp_icd_DX16 = b.HEDIS_DX_code OR
a.comp_icd_DX17 = b.HEDIS_DX_code OR
a.comp_icd_DX18 = b.HEDIS_DX_code OR
a.comp_icd_DX19 = b.HEDIS_DX_code OR
a.comp_icd_DX20 = b.HEDIS_DX_code OR
a.comp_icd_DX21 = b.HEDIS_DX_code OR
a.comp_icd_DX22 = b.HEDIS_DX_code OR
a.comp_icd_DX23 = b.HEDIS_DX_code OR
a.comp_icd_DX24 = b.HEDIS_DX_code OR
a.comp_icd_DX25 = b.HEDIS_DX_code
;
QUIT;
APPENDIX B – Syntax for Array approach
/*Section 3.2 -- Create the flag file using array method*/
DATA temp.hyst_prepost_ICD_CPT_flags_TEST; SET temp.hyst_prepost_ICD_CPT;
/*Set up an array to recode missing values to 0*/
Array _arr(*) FLAG_cpt_hyst_ab FLAG_cpt_hyst_all
[FLAG VARIABLE LIST TRUNCATED]
FLAG_Dxs_Prolapse;
Do i=1 to dim(_arr);
If _arr(i)=. then _arr(i)=0;
End;
Drop i;
Array dx_codes $ ICD_DX1 ICD_DX1_PROF COMP_ICD_DX1-COMP_ICD_DX25;
Array proc_codes $ CPT CPT_PROF ICD_PR1-ICD_PR25;
do index=1 to dim(dx_codes );
if dx_codes[index] in: (&cpt_hyst_ab_DX.) THEN FLAG_cpt_hyst_ab=1;
if dx_codes[index] in: (&cpt_hyst_all_DX.) THEN FLAG_cpt_hyst_all=1;
[FLAG VARIABLE LIST TRUNCATED]
if dx_codes[index] in: (&Dxs_Prolapse_DX.) THEN FLAG_Dxs_Prolapse=1;
END;
do index=1 to dim(proc_codes );
if proc_codes[index] in: (&cpt_hyst_ab_pr.) THEN FLAG_cpt_hyst_ab=1;
if proc_codes[index] in: (&cpt_hyst_all_pr.) THEN FLAG_cpt_hyst_all=1;
[FLAG VARIABLE LIST TRUNCATED]
if proc_codes[index] in: (&Dxs_Prolapse_pr.) THEN FLAG_Dxs_Prolapse=1;
END;
/*Delete records that don't have any flags*/
IF (FLAG_cpt_hyst_ab ^=1 AND
FLAG_cpt_hyst_all ^=1 AND
FLAG_cpt_hyst_lap_all ^=1 AND
[FLAG VARIABLE LIST TRUNCATED]
FLAG_Dxs_Prolapse ^=1) THEN delete;
APPENDIX C – Syntax for Hash approach
/*Section 3.2 -- Create the flag file using hash method*/
data temp.hyst_prepost_ICD_CPT_flags_HASH(drop=_:);
LENGTH HEDIS_proc_code HEDIS_DX_code $ 8; /*Testing to see if this fixes error, seems
to work*/
if _n_=1 then
do;
if 0 then set temp.HystProlapseCode_comb_mx;
dcl hash h_proc (dataset:"temp.HystProlapseCode_comb_mx");
h_proc.defineKey('HEDIS_proc_code');
h_proc.defineData(all:'y');
h_proc.defineDone();
dcl hash h_dx(dataset:"temp.HystProlapseCode_comb_mx");
h_dx.defineKey('HEDIS_DX_code');
h_dx.defineData(all:'y');
h_dx.defineDone();
end;
call missing(of _all_);
set temp.hyst_prepost_ICD_CPT;
Array dx_codes $ ICD_DX1 ICD_DX1_PROF COMP_ICD_DX1-COMP_ICD_DX25;
Array proc_codes $ CPT CPT_PROF ICD_PR1-ICD_PR25;
array _a_proc $ CPT CPT_PROF ICD_PR1-ICD_PR25;
do _i=1 to dim(_a_proc);
if h_proc.find(key:_a_proc[_i])=0 then
do;
output;
/*return;--The return statement writes a row to output as soon as it finds a matching
flag and then stops further checks. But since there can be multiple matches per
record, it should be commented out*/
end;
end;
array _a_dx $ ICD_DX1 ICD_DX1_PROF COMP_ICD_DX1-COMP_ICD_DX25;
do _i=1 to dim(_a_dx);
if h_dx.find(key:_a_dx[_i])=0 then
do;
output;
/*return;*/
end;
end;
/*Set up an array to recode missing values to 0*/
Array _arr(*) mFLAG_cpt_hyst_ab mFLAG_cpt_hyst_all mFLAG_cpt_hyst_lap_all
[FLAG VARIABLE LIST TRUNCATED]
mFLAG_Dxs_Prolapse;
Do i=1 to dim(_arr);
If _arr(i)=. then _arr(i)=0;
End;
Drop i;
RUN;
APPENDIX D – Setting memory usage for SAS
By default, SAS only uses 2GB of RAM. Most computers produced in the last five years have at least 8GB of RAM, and most high-end machines have 32-64GB. Our virtual machines have 16GB. Taking advantage of this additional RAM may improve SAS’ performance.
In most situations, increasing RAM usage is accomplished by editing the sasv9.cfg file. However, we do not editing permission for this file on the VM. As work-around, use the following steps:
Click the magnifying glass icon in the bottom left of the screen (next to the Start menu icon). Type “run” in the search box, and then right-click on the “Run” app. Select Open file location. In the resulting folder, right-click on the Run icon and copy it to your desktop.
Double-click on the Run icon you just created on your desktop. Type “sas.exe -memsize 16g” in the dialog box and click OK. SAS will open, and will have access to additional RAM. Use this desktop shortcut each time you open SAS.
However, some RAM is reserved for Windows and other programs, so SAS will not be able to access all 16GB. Use the following syntax in SAS to see how much RAM is available:
/*This shows available RAM*/
data _null_;
mem = input(getoption('xmrlmem'),20.2)/10e6;
format mem 20.2;
put "You have " mem "GB memory available";
run;
... View more