@Wolverine wrote:
I suspect that SAS ran out of RAM during the big SQL merge and dropped some matches. Of course it would have been nice if there had been an error or warning msg about that...
I forgot to mention that while SAS did not give me an error msg, Firefox crashed and there was a Windows dialog saying that it closed Firefox due to lack of RAM. On the other hand, Task Manager never showed RAM usage going above 50%, at least not while I was watching it.
proc setinit;run;
proc options option=cpucount;
proc options option=memsize;
run;
Out of curiosity, what is the output from the code above? The output will be in the log, I believe.
@Reeza wrote:
proc setinit;run;
proc options option=cpucount;
proc options option=memsize;
run;
Out of curiosity, what is the output from the code above? The output will be in the log, I believe.
Here is the log:
NOTE: SAS (r) Proprietary Software 9.4 (TS1M6)
Licensed to [redacted] - T&R - SFA, Site 70080722.
NOTE: This session is executing on the X64_SRV16 platform.
NOTE: Analytical products:
SAS/STAT 15.1
SAS/ETS 15.1
SAS/OR 15.1
SAS/IML 15.1
SAS/QC 15.1
NOTE: Additional host information:
X64_SRV16 WIN 10.0.14393 Server
NOTE: SAS initialization used:
real time 5.53 seconds
cpu time 3.54 seconds
1 proc setinit;run;
NOTE: PROCEDURE SETINIT used (Total process time):
real time 0.06 seconds
cpu time 0.07 seconds
Original site validation data
Current version: 9.04.01M6P111518
Site name: '[redacted] - T&R - SFA'.
Site number: 70080722.
CPU A: Model name='' model number='' serial=''.
Expiration: 30JUN2023.
Grace Period: 45 days (ending 14AUG2023).
Warning Period: 45 days (ending 28SEP2023).
System birthday: 06JUN2022.
Operating System: WX64_SV .
Product expiration dates:
---Base SAS Software 30JUN2023 (CPU A)
---SAS/STAT 30JUN2023 (CPU A)
---SAS/GRAPH 30JUN2023 (CPU A)
---SAS/ETS 30JUN2023 (CPU A)
---SAS/FSP 30JUN2023 (CPU A)
---SAS/OR 30JUN2023 (CPU A)
---SAS/AF 30JUN2023 (CPU A)
---SAS/IML 30JUN2023 (CPU A)
---SAS/QC 30JUN2023 (CPU A)
---SAS/SHARE 30JUN2023 (CPU A)
---SAS/ASSIST 30JUN2023 (CPU A)
---SAS/CONNECT 30JUN2023 (CPU A)
---SAS/EIS 30JUN2023 (CPU A)
---SAS/SHARE*NET 30JUN2023 (CPU A)
---MDDB Server common products 30JUN2023 (CPU A)
---SAS Integration Technologies 30JUN2023 (CPU A)
---SAS/Secure 168-bit 30JUN2023 (CPU A)
---SAS/Secure Windows 30JUN2023 (CPU A)
---SAS Enterprise Guide 30JUN2023 (CPU A)
---SAS Bridge for ESRI 30JUN2023 (CPU A)
---OR OPT 30JUN2023 (CPU A)
---OR PRS 30JUN2023 (CPU A)
---OR IVS 30JUN2023 (CPU A)
---OR LSO 30JUN2023 (CPU A)
---SAS/ACCESS Interface to DB2 30JUN2023 (CPU A)
---SAS/ACCESS Interface to Oracle 30JUN2023 (CPU A)
---SAS/ACCESS Interface to SAP ASE 30JUN2023 (CPU A)
---SAS/ACCESS Interface to PC Files 30JUN2023 (CPU A)
---SAS/ACCESS Interface to ODBC 30JUN2023 (CPU A)
---SAS/ACCESS Interface to OLE DB 30JUN2023 (CPU A)
---SAS/ACCESS Interface to R/3 30JUN2023 (CPU A)
---SAS/ACCESS Interface to Teradata 30JUN2023 (CPU A)
---SAS/ACCESS Interface to Microsoft SQL Server 30JUN2023 (CPU A)
---SAS/ACCESS Interface to MySQL 30JUN2023 (CPU A)
---SAS/IML Studio 30JUN2023 (CPU A)
---SAS Workspace Server for Local Access 30JUN2023 (CPU A)
---SAS Workspace Server for Enterprise Access 30JUN2023 (CPU A)
---SAS/ACCESS Interface to Netezza 30JUN2023 (CPU A)
---SAS/ACCESS Interface to Aster nCluster 30JUN2023 (CPU A)
---SAS/ACCESS Interface to Greenplum 30JUN2023 (CPU A)
---SAS/ACCESS Interface to SAP IQ 30JUN2023 (CPU A)
---SAS/ACCESS to Hadoop 30JUN2023 (CPU A)
---SAS/ACCESS to Vertica 30JUN2023 (CPU A)
---SAS/ACCESS to Postgres 30JUN2023 (CPU A)
---SAS/ACCESS to Impala 30JUN2023 (CPU A)
---SAS/ACCESS to Salesforce 30JUN2023 (CPU A)
---SAS/ACCESS to HAWQ 30JUN2023 (CPU A)
---SAS/ACCESS to Amazon Redshift 30JUN2023 (CPU A)
---High Performance Suite 30JUN2023 (CPU A)
---SAS/ACCESS to SAP HANA 30JUN2023 (CPU A)
---SAS/ACCESS Interface to the PI System 30JUN2023 (CPU A)
---SAS/ACCESS to JDBC 30JUN2023 (CPU A)
---prod 1312 30JUN2023 (CPU A)
2 proc options option=cpucount;
SAS (r) Proprietary Software Release 9.4 TS1M6
CPUCOUNT=4 Specifies the number of processors that thread-enabled applications should assume are available for concurrent
processing.
NOTE: PROCEDURE OPTIONS used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
3 proc options option=memsize;
4 run;
SAS (r) Proprietary Software Release 9.4 TS1M6
MEMSIZE=2147483648
Specifies the limit on the amount of virtual memory that can be used during a SAS session.
NOTE: PROCEDURE OPTIONS used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
@Reeza wrote:
You only have 2GB of RAM assigned to SAS, which is possibly an issue in your speed/response.
The VM has 16GB of RAM, and SAS is really the only software I run on it. To increase the RAM available to SAS, I tried updating the sasv9.cfg file, but I can't save my changes because I don't have admin rights on the VM. I was able to go to the Run dialog and enter "SAS.exe -memsize 12g". Then I ran
/*This shows available RAM*/
data _null_;
mem = input(getoption('xmrlmem'),20.2)/10e6;
format mem 20.2;
put "You have " mem "GB memory available";
run;
This shows I have ~10GB (not sure why there's a discrepancy between the 12g I entered and the ~10GB available, but I suspect it has to do with using 1,000 rather than 1,024 as the conversion factor).
Is there a way I can make the change via syntax? It would be easier to have a line or 2 at the beginning of every program than it would be to open SAS via the Run dialog every time.
I'm rerunning the Proc SQL merge just to test how much faster it is with the increased RAM.
MEMSIZE is a SAS startup option, so it won't work within a program. Talk to your IT folks about getting temporary admin access to make the change.
If you're launching SAS via an icon you can add those options to the icon under properties in Windows.
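For reference, the same options could also live in a config file if IT grants write access to sasv9.cfg (or to a user-level config file pointed to with -config). A minimal fragment, with illustrative values:

```sas
/* sasv9.cfg fragment (illustrative values) */
-memsize 12g
-cpucount 4
```

Options set this way apply at session startup, which is why a line or two at the top of a program can't do the same job.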
@Wolverine wrote:
... stuff deleted ...
Next, I will try the hash approach suggested earlier in this thread, and I will report back on my findings.
Also, if you know that certain codes are notably more likely than others, then list them at the start of their respective arrays. And put the more commonly matched array first. You could get fancier, but I won't go into it.
I'm starting to look at the hash approach, and I have a couple of questions for @mkeintz and @Patrick :
Now, in a context more familiar to me: If the temp.preport_FLAGS dataset has no duplicate HEDIS_proc_code values or duplicate HEDIS_dx_code values, then this data step could be a lot faster.
Not sure what this means. Under my original Proc SQL approach, the HEDIS variables are in the lookup table (HystProlapseCode_comb_mx) not in hyst_prepost_ICD_CPT. Under the array approach, the HEDIS variables are turned into macro variables. It is possible for a record in hyst_prepost_ICD_CPT to have the same code in multiple DX/proc variables, because that file includes both hospital and doctors' billing records.
Make sure there are no identical variable names in the two datasets, because this code would allow the preport_flags variable values to overwrite same-named variables found in prepost_ICD_CPT.
In which 2 datasets? hyst_prepost_ICD_CPT versus HystProlapseCode_comb_mx? I don't think they have any of the same variable names, but I will confirm that before running.
The code above writes a row to output as soon as it finds a matching flag and then stops further checks (return statement).
If for a single row there can be multiple flags that match and you want to write out the row multiple times then you would have to remove the RETURN; statements.
As I mentioned above, it is possible for records in hyst_prepost_ICD_CPT to match on multiple flags. Furthermore, there are some duplicate codes in the HEDIS lists. For example, FLAG_proc_bladder_all contains the combination of all codes included in FLAG_proc_bladder_open and FLAG_proc_bladder_perc. I know I could write a simple IF-THEN to flag _all whenever either _open or _perc is flagged, but there are quite a few of these types of combinations and I'd prefer not to have to track them all down.
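One way to avoid hand-writing an IF-THEN for every combination would be to derive each combined flag from its components with MAX after the main matching loop. The variable names below come from the bladder example above, and the pattern would have to be repeated (or generated with a macro) for each combination:

```sas
/* Derive the combined flag from its component flags.               */
/* max() yields 1 if either component flag is 1 (assuming missing   */
/* values have already been recoded to 0, as in the array step).    */
FLAG_proc_bladder_all = max(FLAG_proc_bladder_open, FLAG_proc_bladder_perc);
```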
Here is the report I wrote based on the various approaches discussed in this thread. The intended audience is the other programmers I work with.
Benchmarking Comparison of SAS Procedures
Various SAS procedures were tested to determine the best method to match the diagnosis and procedure codes in patient data files to lists of codes to be flagged for various medical conditions of interest.
There were 3 main approaches: Proc SQL merge, data step arrays, and data step hash. Some minor variants of these approaches were tested as well. The input patient data file was about 78.5GB. The output file size varies by approach.
Proc SQL merge (see Appendix A for syntax)
The lists of codes to be flagged were imported from Excel tab-by-tab, and a flag variable was added corresponding to each code. The imported lists were then combined into a single SAS file. This file was matched on all 27 proc variables and 26 DX variables in the data file using JOIN. Variants were tested with various Where statements (for example, requiring that the proc/DX variables in the data file must not be blank, in an effort to avoid the time needed to match blanks to the flag file). The Where statements had little to no effect. The resulting output file was approximately 133GB.
IMPORTANT NOTE: During testing, some slight discrepancies in flag frequencies (< 0.5%) were discovered between the Proc SQL and Array versions. In each occurrence, the Proc SQL version failed to flag a small number of cases that were flagged in the array version. Manual review confirmed these cases should have been flagged. These discrepancies were apparently due to SAS running out of memory during the merge, even though less than half of the computer’s RAM was in use. SAS did not provide any error messages or even warnings about this in the log. By default, SAS only uses 2GB of RAM. Increasing the RAM corrected the frequency issue, but did not have a notable impact on processing speed. See Appendix D for discussion of this issue and how to correct it.
Here are typical results:
real time 28:08:32.61
cpu time 7:24:39.17
Data step Array (see Appendix B for syntax)
The lists of codes to be flagged are imported into macro variable lists; each tab yields 2 macro variables, one for proc codes and one for DX codes. An array is then used to compare all proc variables in the data file to the proc code lists, and a similar array is used for DX variables. The resulting output file is approximately 159GB.
real time 13:07:03.42
cpu time 12:53:12.39
A variant was tested that included a command to delete records for which all flags = 0. This variant saved over 2 hours of processing time. The resulting output file was approximately 142GB.
real time 10:47:45.35
cpu time 10:43:20.82
Data step Hash (see Appendix C for syntax)
The lists of codes to be flagged are imported from Excel tab-by-tab, and a flag variable is added corresponding to each code. The imported lists are then combined into a single SAS file. A hash object is created for proc codes and another for DX codes. An array then loops over all proc variables in the data file, looking each one up in the proc hash; a similar array and hash lookup are used for DX variables. The resulting output file was approximately 320GB.
real time 4:06:06.53
cpu time 39:04.64
NOTE: There were some discrepancies in the frequency output (hash vs array), similar to (though less severe than) the discrepancies in the original Proc SQL output. Again, this seems to be related to SAS running out of memory. However, in this case the memory issues probably occurred during the step to find the max of each flag, rather than during the merge itself.
DISCUSSION
The clear winner in terms of real processing time was the hash approach. However, there are some drawbacks to this approach as well. The output file size was more than twice as large as the other two approaches. This not only requires more hard drive space, but it also results in longer processing times for subsequent steps. For example, the next step in the testing SAS program was a Proc SQL designed to find the max of each flag variable grouped by patient ID. This step took much longer than it did with the other approaches – so much so that it not only negated the increased processing speed for the flag merge, it actually took longer overall than the Proc SQL approach! On the other hand, this may not be an issue in other programs with less-complicated downstream steps.
While the array approach was nearly three times faster than the Proc SQL approach in real time, consideration must be given to CPU time as well. In this metric, Proc SQL was more than 30% faster. The Proc SQL approach requires extensive reading and writing of data, and the very long overall processing time is due to the relatively slow I/O capabilities of our virtual machines. At my previous position, I used a computer where the I/O speeds were in the range of 8-10 times higher. On a machine with similar transfer speeds, the Proc SQL version would have been faster by more than three hours.
In summary, there is no “one size fits all” answer. Each approach has advantages and disadvantages. Familiarity with the data, an understanding of the capabilities of the computer being used, and knowledge of SAS procedures and their inner workings are the keys to choosing the best approach for large processing tasks.
APPENDIX A – Syntax for Proc SQL approach
/*Section 3.2 -- Merge proc codes and DX codes from hyst_prepost_ICD_CPT with flags from
HystProlapseCode_comb_mx, among index cases.*/
PROC SQL;
Create table temp.hyst_prepost_ICD_CPT_flags_CPU
as Select distinct a.member_id, a.episode_id, a.claim_id, a.dt_svc_from,
a.dt_svc_end, a.svc_from_dt, a.svc_end_dt, a.svc_from_dt_ICD, a.icd_dx1_prof,
a.comp_icd_dx1, a.comp_icd_dx2, a.comp_icd_dx3, a.comp_icd_dx4, a.comp_icd_dx5,
a.comp_icd_dx6, a.comp_icd_dx7, a.comp_icd_dx8, a.comp_icd_dx9, a.comp_icd_dx10,
a.comp_icd_dx11, a.comp_icd_dx12, a.comp_icd_dx13, a.comp_icd_dx14, a.comp_icd_dx15, a.comp_icd_dx16, a.comp_icd_dx17, a.comp_icd_dx18, a.comp_icd_dx19, a.comp_icd_dx20,
a.comp_icd_dx21, a.comp_icd_dx22, a.comp_icd_dx23, a.comp_icd_dx24, a.comp_icd_dx25, a.cpt, a.cpt_prof, a.icd_pr1, a.icd_pr2, a.icd_pr3, a.icd_pr4, a.icd_pr5, a.icd_pr6, a.icd_pr7, a.icd_pr8, a.icd_pr9, a.icd_pr10, a.icd_pr11, a.icd_pr12, a.icd_pr13, a.icd_pr14, a.icd_pr15, a.icd_pr16, a.icd_pr17, a.icd_pr18, a.icd_pr19, a.icd_pr20, a.icd_pr21, a.icd_pr22, a.icd_pr23, a.icd_pr24, a.icd_pr25, b.HEDIS_proc_code, b.HEDIS_DX_code, b.Code_Description,
b.mFLAG_cpt_hyst_ab,
b.mFLAG_cpt_hyst_all,
b.mFLAG_Dxs_Prolapse,
[FLAG VARIABLE LIST TRUNCATED]
/*These variables are needed later*/
a.year, a.flg_age_0_17, a.member_birth_dt, a.val_age, a.zip_cd,
a.flg_cond_hysterectomy, a.hyst_ep_dt_beg, a.healthcare_vis_dt, a.healthcare_vis_days,
a.drg, a.clm_type, a.e_clm_type, a.payer, a.provider_npi, a.provider_splty,
a.provider_type_cd, a.facility_npi
From temp.hyst_prepost_ICD_CPT a JOIN temp.HystProlapseCode_comb_mx b
On a.cpt = b.HEDIS_proc_code OR a.cpt_prof = b.HEDIS_proc_code OR
a.icd_pr1 = b.HEDIS_proc_code OR
a.icd_pr2 = b.HEDIS_proc_code OR
a.icd_pr3 = b.HEDIS_proc_code OR
a.icd_pr4 = b.HEDIS_proc_code OR
a.icd_pr5 = b.HEDIS_proc_code OR
a.icd_pr6 = b.HEDIS_proc_code OR
a.icd_pr7 = b.HEDIS_proc_code OR
a.icd_pr8 = b.HEDIS_proc_code OR
a.icd_pr9 = b.HEDIS_proc_code OR
a.icd_pr10 = b.HEDIS_proc_code OR
a.icd_pr11 = b.HEDIS_proc_code OR
a.icd_pr12 = b.HEDIS_proc_code OR
a.icd_pr13 = b.HEDIS_proc_code OR
a.icd_pr14 = b.HEDIS_proc_code OR
a.icd_pr15 = b.HEDIS_proc_code OR
a.icd_pr16 = b.HEDIS_proc_code OR
a.icd_pr17 = b.HEDIS_proc_code OR
a.icd_pr18 = b.HEDIS_proc_code OR
a.icd_pr19 = b.HEDIS_proc_code OR
a.icd_pr20 = b.HEDIS_proc_code OR
a.icd_pr21 = b.HEDIS_proc_code OR
a.icd_pr22 = b.HEDIS_proc_code OR
a.icd_pr23 = b.HEDIS_proc_code OR
a.icd_pr24 = b.HEDIS_proc_code OR
a.icd_pr25 = b.HEDIS_proc_code OR
a.icd_dx1_prof = b.HEDIS_DX_code OR
a.comp_icd_DX1 = b.HEDIS_DX_code OR
a.comp_icd_DX2 = b.HEDIS_DX_code OR
a.comp_icd_DX3 = b.HEDIS_DX_code OR
a.comp_icd_DX4 = b.HEDIS_DX_code OR
a.comp_icd_DX5 = b.HEDIS_DX_code OR
a.comp_icd_DX6 = b.HEDIS_DX_code OR
a.comp_icd_DX7 = b.HEDIS_DX_code OR
a.comp_icd_DX8 = b.HEDIS_DX_code OR
a.comp_icd_DX9 = b.HEDIS_DX_code OR
a.comp_icd_DX10 = b.HEDIS_DX_code OR
a.comp_icd_DX11 = b.HEDIS_DX_code OR
a.comp_icd_DX12 = b.HEDIS_DX_code OR
a.comp_icd_DX13 = b.HEDIS_DX_code OR
a.comp_icd_DX14 = b.HEDIS_DX_code OR
a.comp_icd_DX15 = b.HEDIS_DX_code OR
a.comp_icd_DX16 = b.HEDIS_DX_code OR
a.comp_icd_DX17 = b.HEDIS_DX_code OR
a.comp_icd_DX18 = b.HEDIS_DX_code OR
a.comp_icd_DX19 = b.HEDIS_DX_code OR
a.comp_icd_DX20 = b.HEDIS_DX_code OR
a.comp_icd_DX21 = b.HEDIS_DX_code OR
a.comp_icd_DX22 = b.HEDIS_DX_code OR
a.comp_icd_DX23 = b.HEDIS_DX_code OR
a.comp_icd_DX24 = b.HEDIS_DX_code OR
a.comp_icd_DX25 = b.HEDIS_DX_code
;
QUIT;
APPENDIX B – Syntax for Array approach
/*Section 3.2 -- Create the flag file using array method*/
DATA temp.hyst_prepost_ICD_CPT_flags_TEST; SET temp.hyst_prepost_ICD_CPT;
/*Set up an array to recode missing values to 0*/
Array _arr(*) FLAG_cpt_hyst_ab FLAG_cpt_hyst_all
[FLAG VARIABLE LIST TRUNCATED]
FLAG_Dxs_Prolapse;
Do i=1 to dim(_arr);
If _arr(i)=. then _arr(i)=0;
End;
Drop i;
Array dx_codes $ ICD_DX1 ICD_DX1_PROF COMP_ICD_DX1-COMP_ICD_DX25;
Array proc_codes $ CPT CPT_PROF ICD_PR1-ICD_PR25;
do index=1 to dim(dx_codes );
if dx_codes[index] in: (&cpt_hyst_ab_DX.) THEN FLAG_cpt_hyst_ab=1;
if dx_codes[index] in: (&cpt_hyst_all_DX.) THEN FLAG_cpt_hyst_all=1;
[FLAG VARIABLE LIST TRUNCATED]
if dx_codes[index] in: (&Dxs_Prolapse_DX.) THEN FLAG_Dxs_Prolapse=1;
END;
do index=1 to dim(proc_codes );
if proc_codes[index] in: (&cpt_hyst_ab_pr.) THEN FLAG_cpt_hyst_ab=1;
if proc_codes[index] in: (&cpt_hyst_all_pr.) THEN FLAG_cpt_hyst_all=1;
[FLAG VARIABLE LIST TRUNCATED]
if proc_codes[index] in: (&Dxs_Prolapse_pr.) THEN FLAG_Dxs_Prolapse=1;
END;
/*Delete records that don't have any flags*/
IF (FLAG_cpt_hyst_ab ^=1 AND
FLAG_cpt_hyst_all ^=1 AND
FLAG_cpt_hyst_lap_all ^=1 AND
[FLAG VARIABLE LIST TRUNCATED]
FLAG_Dxs_Prolapse ^=1) THEN delete;
APPENDIX C – Syntax for Hash approach
/*Section 3.2 -- Create the flag file using hash method*/
data temp.hyst_prepost_ICD_CPT_flags_HASH(drop=_:);
LENGTH HEDIS_proc_code HEDIS_DX_code $ 8; /*Testing to see if this fixes error, seems
to work*/
if _n_=1 then
do;
if 0 then set temp.HystProlapseCode_comb_mx;
dcl hash h_proc (dataset:"temp.HystProlapseCode_comb_mx");
h_proc.defineKey('HEDIS_proc_code');
h_proc.defineData(all:'y');
h_proc.defineDone();
dcl hash h_dx(dataset:"temp.HystProlapseCode_comb_mx");
h_dx.defineKey('HEDIS_DX_code');
h_dx.defineData(all:'y');
h_dx.defineDone();
end;
call missing(of _all_);
set temp.hyst_prepost_ICD_CPT;
array _a_proc $ CPT CPT_PROF ICD_PR1-ICD_PR25;
do _i=1 to dim(_a_proc);
if h_proc.find(key:_a_proc[_i])=0 then
do;
output;
/*return;--The return statement writes a row to output as soon as it finds a matching
flag and then stops further checks. But since there can be multiple matches per
record, it should be commented out*/
end;
end;
array _a_dx $ ICD_DX1 ICD_DX1_PROF COMP_ICD_DX1-COMP_ICD_DX25;
do _i=1 to dim(_a_dx);
if h_dx.find(key:_a_dx[_i])=0 then
do;
output;
/*return;*/
end;
end;
/*Set up an array to recode missing values to 0*/
Array _arr(*) mFLAG_cpt_hyst_ab mFLAG_cpt_hyst_all mFLAG_cpt_hyst_lap_all
[FLAG VARIABLE LIST TRUNCATED]
mFLAG_Dxs_Prolapse;
Do i=1 to dim(_arr);
If _arr(i)=. then _arr(i)=0;
End;
Drop i;
RUN;
APPENDIX D – Setting memory usage for SAS
By default, SAS only uses 2GB of RAM. Most computers produced in the last five years have at least 8GB of RAM, and most high-end machines have 32-64GB. Our virtual machines have 16GB. Taking advantage of this additional RAM may improve SAS’ performance.
In most situations, increasing RAM usage is accomplished by editing the sasv9.cfg file. However, we do not have permission to edit this file on the VM. As a workaround, use the following steps:
Click the magnifying glass icon in the bottom left of the screen (next to the Start menu icon). Type “run” in the search box, and then right-click on the “Run” app. Select Open file location. In the resulting folder, right-click on the Run icon and copy it to your desktop.
Double-click on the Run icon you just created on your desktop. Type “sas.exe -memsize 16g” in the dialog box and click OK. SAS will open, and will have access to additional RAM. Use this desktop shortcut each time you open SAS.
However, some RAM is reserved for Windows and other programs, so SAS will not be able to access all 16GB. Use the following syntax in SAS to see how much RAM is available:
/*This shows available RAM*/
data _null_;
mem = input(getoption('xmrlmem'),20.2)/10e6;
format mem 20.2;
put "You have " mem "GB memory available";
run;
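One quirk of this snippet: the 20.2 informat implicitly divides a value that contains no decimal point by 100, and 10e6 is actually 1e7, so the combined divisor works out to 1e9. Assuming XMRLMEM (an undocumented diagnostic option) reports available memory in bytes, the snippet is therefore printing decimal gigabytes of remaining memory, which would also explain why it shows less than the full MEMSIZE value. A sketch using an explicit binary-gigabyte divisor instead:

```sas
/* Assumes XMRLMEM reports available (not total) memory in bytes.  */
/* 1024**3 converts bytes to binary gigabytes (GiB).               */
data _null_;
   mem = input(getoption('xmrlmem'), 20.) / 1024**3;
   put "You have " mem 8.2 "GB memory available";
run;
```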
Thanks for the very comprehensive report. I find it very informative.
One question: why is the hash output so much larger than the others? Does it have more records (i.e. superfluous records)? If so, probably the code can be changed to eliminate them.
Hi,
I like the idea of coding up the same algorithm using multiple approaches to compare efficiency.
That said, I think you should work to get all the approaches to match in their output. You say:
IMPORTANT NOTE: During testing, some slight discrepancies in flag frequencies (< 0.5%) were discovered between the Proc SQL and Array versions. In each occurrence, the Proc SQL version failed to flag a small number of cases that were flagged in the array version. Manual review confirmed these cases should have been flagged. These discrepancies were apparently due to SAS running out of memory during the merge, even though less than half of the computer’s RAM was in use. SAS did not provide any error messages or even warnings about this in the log. By default, SAS only uses 2GB of RAM. Increasing the RAM corrected the frequency issue, but did not have a notable impact on processing speed.
This should not happen. If SAS runs out of memory, you should get an error in the log. You definitely should not get the wrong result (with no error). If you really have a case where SQL is giving you the wrong result, I would send it in to tech support. Same for your statement that the hash approach had some discrepancies. I think it's likely that there are some edge cases in your data that are falling through some cracks in your code. But if you have a repeatable example of a discrepancy (especially one where the results of the code vary with the amount of memory available to the SAS session), please send it in to tech support.
Also confused by your statement that the output dataset from the hash approach was double the size of other approaches. If the output datasets from each approach are identical (e.g. judged via PROC COMPARE to compare the metadata and data), this shouldn't happen. Unless maybe you changed compression options.
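A comparison along those lines could be sketched like this; the dataset names come from the appendices, and the ID variables are illustrative (use whatever uniquely identifies a row in these files):

```sas
/* Sort both outputs on the same keys, then compare data and metadata. */
proc sort data=temp.hyst_prepost_ICD_CPT_flags_CPU  out=_sql;  by member_id claim_id; run;
proc sort data=temp.hyst_prepost_ICD_CPT_flags_HASH out=_hash; by member_id claim_id; run;

proc compare base=_sql compare=_hash;
   id member_id claim_id;
run;
```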
@Quentin wrote:
I like the idea of coding up the same algorithm using multiple approaches to compare efficiency.
That said, I think you should work to get all the approaches to match in their output.
This should not happen. If SAS runs out of memory, you should get an error in the log. You definitely should not get the wrong result (with no error). If you really have a case where SQL is giving you the wrong result, I would send it in to tech support. Same for your statement that the hash approach had some discrepancies. I think it's likely that there are some edge cases in your data that are falling through some cracks in your code. But if you have a repeatable example of a discrepancy (especially one where the results of the code vary with the amount of memory available to the SAS session), please send it in to tech support.
Also confused by your statement that the output dataset from the hash approach was double the size of other approaches. If the output datasets from each approach are identical (e.g. judged via PROC COMPARE to compare the metadata and data), this shouldn't happen. Unless maybe you changed compression options.
Before this project, I had virtually no experience w/ arrays or hash objects. I don't know why the flag counts were slightly different. The only clue I had that RAM could be an issue was that Firefox crashed and presented a dialog box indicating that the crash was due to insufficient memory. When I manually reviewed the discrepant cases, the codes that matched for those cases had been successfully matched on many other cases. And after increasing the memory, the discrepancies disappeared.
It wasn't just the output filesize that was different among the various approaches, it was also the number of records they contained. So data compression wouldn't explain the differences. Perhaps Proc SQL does not continue to search for matches on a given flag after it has already matched that flag, whereas the hash version DOES continue to search? In that case, there could be duplicate records for a given case that match on the same flag, and that could explain the differences in filesize. I could run Proc SQL w/ select distinct on the hash output file to see if it finds and eliminates any duplicate records.
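That check could be sketched as follows; if the distinct count comes back lower than the raw count, the hash output does contain exact-duplicate rows:

```sas
proc sql;
   /* Raw row count in the hash output */
   select count(*) as n_raw
      from temp.hyst_prepost_ICD_CPT_flags_HASH;
   /* Row count after removing exact-duplicate rows */
   create table _dedup as
      select distinct *
      from temp.hyst_prepost_ICD_CPT_flags_HASH;
   select count(*) as n_distinct from _dedup;
quit;
```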
I will review all of this in an attempt to make sure I haven't made any errors. If I can't find any, I'll submit it to tech support. However, I'm getting busy with other projects, so I don't have as much time to dedicate to this right now. So it may take a while😐