SAS generating a large file, but barely using any computer resources - Page 2

mkeintz · Posted 09-11-2022 11:21 AM

The code you present includes this:

  array _a_proc {*} icd_pr1-icd_pr25 icd_dx1_prof;
  do _i=1 to dim(_a_proc);
    if h_proc.find(key:_a_proc[_i])=0 then 
      do;
        output;
        return;
      end;
  end;

  array _a_dx {*} comp_icd_DX1-comp_icd_DX25;
  do _i=1 to dim(_a_dx);
    if h_dx.find(key:_a_dx[_i])=0 then 
      do;
        output;
        return;
      end;
  end;

which is presented as an alternative to my code:

  array icd_p {*} cpt cpt_prof icd_pr1-icd_pr25 icd_dx1_prof ;
  /*Check the 28 vars above until first HEDIS_proc_code match (i.e. rc=0)*/
  do i=1 to dim(icd_p) until (rc=0);
    rc=hash_pc.find(key:icd_p{i});
  end;

  array icd_d {*} comp_icd_DX1-comp_icd_DX25 ;
  /*If still unmatched, check the 25 vars above until first HEDIS_dx_code match */
  if rc^=0 then do i=1 to dim(icd_d) until (rc=0);
    rc=hash_dx.find(key:icd_d{i});
  end;

  if rc=0 then output; /*Only output successful matches */

These are logically equivalent in that each stops looping once a match is found. They both do the same number of loop iterations. Your code may make the loop-stopping more self-evident, but I prefer the economy of the "until ..." expression to produce the same effect.

And the "call missing(of _all_)" statement is not harmful (and is good use in some circumstances). But it is superfluous in this case, since only successful hash "find's" are output, meaning there will be no inadvertent output of previously-"found" values from the prepost_flags dataset.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Wolverine · Posted 09-12-2022 11:20 AM

I appreciate all the suggestions, and I should have some time this week to do some comparison testing among these approaches. I'm going to be doing plenty of similar work in the future, so I really want to find the best solution.

But what I'm really trying to understand in this post is what's going on "under the hood". For example, if my original approach is very IO intensive, I would expect to see sustained periods of near-max transfer speeds. Or if it requires a lot of processing, I would expect high CPU usage. But there were 8+ hour stretches where CPU <10%, RAM usage was <40%, and transfer speeds were slower than an old dial-up modem! What is SAS doing when it doesn't appear to be doing much at all?

SASKiwi · Posted 09-12-2022 04:01 PM

Without knowing more about your VM setup, its hard to know what's going on. IO is definitely the problem, not CPU or memory. Do you have Task Manager available on your VM? I would expect to see a lot of activity in the SAS WORK utility files but these days Windows Explorer won't necessarily show real time updates.

Wolverine · Posted 09-16-2022 10:03 AM

@SASKiwi wrote:

Without knowing more about your VM setup, its hard to know what's going on. IO is definitely the problem, not CPU or memory. Do you have Task Manager available on your VM? I would expect to see a lot of activity in the SAS WORK utility files but these days Windows Explorer won't necessarily show real time updates.

I have access to Task Manager (which is where I pulled the performance numbers from), but not Resource Manager (which provides more in-depth performance numbers). I do see SAS using a lot of space in the D:\Data drive, and I assume that is where the WORK utility files are stored. Still, with the amount of space being used and the amount of time it takes to do it, it works out to about 4.7MB/s transfer speed. That's less than half of my home internet speed. That still seems pretty slow for a "local" drive. I should also note that when the match is first submitted, SAS will transfer data at upwards of 50MB/s to the temp directory for nearly an hour.

Anyway, I've been benchmarking various approaches. I've tried a few variants of the PROC SQL merge that involve altering the Where clause (including removing it altogether). It doesn't seem to make much difference, as every version has taken 27-30hrs to run.

I'm now attempting to use the approach that I mentioned earlier from @Reeza. I've added a macro to it to import the code lists, and added a %global statement to initialize the code list variables. However, it seems to initialize them as numeric variables when they should be character, and that causes an error when I try to match the data to the code lists in the arrays. Here is the syntax with a couple of the lists to be created:

%macro import(sheet);

/*global statement initializes one or more macro variables.*/
%global &sheet._DX &sheet._pr;

PROC SQL noprint;
	Select quote(HEDIS_DX_code) into :&sheet._DX separated by ", "
	From temp.cpt_hyst_&sheet.; 
QUIT;

PROC SQL noprint;
	Select quote(HEDIS_proc_code) into :&sheet._pr separated by ", "
	From temp.cpt_hyst_&sheet.; 
QUIT;

%mend;
%import (cpt_hyst_all)
%import (cpt_hyst_lap_all)

DATA temp.hyst_prepost_ICD_CPT_flags_TEST; SET temp.hyst_prepost_ICD_CPT;

/*Set up an array to recode missing values to 0*/
Array _arr(*) FLAG_cpt_hyst_ab FLAG_cpt_hyst_all
Do i=1 to dim(_arr);
If _arr(i)=. then _arr(i)=0;
End;
Drop i;

Array dx_codes PRINCIPAL_DIAGNOSIS_CODE DIAGNOSIS_CODE_1-DIAGNOSIS_CODE_36;
Array proc_codes PROCEDURE_CODE ICD_PROCEDURE_CODE_1-ICD_PROCEDURE_CODE_25;

do index=1 to dim(dx_codes );
if dx_codes[index] in: (&cpt_hyst_ab_DX.) THEN FLAG_cpt_hyst_ab=1;
if dx_codes[index] in: (&cpt_hyst_all_DX.) THEN FLAG_cpt_hyst_all=1;

END;

do index=1 to dim(proc_codes );
if proc_codes[index] in: (&cpt_hyst_ab_pr.) THEN FLAG_cpt_hyst_ab=1;
if proc_codes[index] in: (&cpt_hyst_all_pr.) THEN FLAG_cpt_hyst_all=1;
END;

RUN;

And a portion of the error msg:

NOTE: Invalid numeric data, 'N9961' , at line 380 column 6.
NOTE: Invalid numeric data, 'S372' , at line 381 column 6.
NOTE: Invalid numeric data, 'S350' , at line 382 column 6.
NOTE: Invalid numeric data, 'N99510' , at line 383 column 6.
NOTE: Invalid numeric data, 'T811' , at line 384 column 6.

Reeza · Posted 09-16-2022 11:13 AM

You need to show the code from the log to tie the errors back to the code, what is line 380-384?

Wolverine · Posted 09-16-2022 11:21 AM

Spoiler

@Reeza wrote:
You need to show the code from the log to tie the errors back to the code, what is line 380-384?

wrote:You need to show the code from the log to tie the errors back to the code, what is line 380-384?

The errors begin at the first IF statement in the dx_codes array. I only showed 2 of them here, but the full array includes over 100 DX codes lists. There is a similar error for every line of the dx_codes array until SAS begins suppressing them after reaching the 100th error.

Reeza · Posted 09-16-2022 11:29 AM

Showing the log without any details would make this a much faster/easier process.

You can try adding $ after the array declaration to specify character arrays.

Array dx_codes $ PRINCIPAL_DIAGNOSIS_CODE DIAGNOSIS_CODE_1-DIAGNOSIS_CODE_36;
Array proc_codes $ PROCEDURE_CODE ICD_PROCEDURE_CODE_1-ICD_PROCEDURE_CODE_25;

@Wolverine wrote:

Spoiler

@Reeza wrote:
You need to show the code from the log to tie the errors back to the code, what is line 380-384?

The errors begin at the first IF statement in the dx_codes array. I only showed 2 of them here, but the full array includes over 100 DX codes lists. There is a similar error for every line of the dx_codes array until SAS begins suppressing them after reaching the 100th error.

Quentin · Posted 09-16-2022 11:46 AM

Your array statements are wrong. You're accidentally using an old (deprecated) implicit array because you don't specify the dimension. That requires the first variable listed to be a numeric field.

Try:

Array dx_codes {*} $ PRINCIPAL_DIAGNOSIS_CODE DIAGNOSIS_CODE_1-DIAGNOSIS_CODE_36;
Array proc_codes {*} $ PROCEDURE_CODE ICD_PROCEDURE_CODE_1-ICD_PROCEDURE_CODE_25;

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Tom · Posted 09-16-2022 06:31 PM

@Quentin wrote:

Your array statements are wrong. You're accidentally using an old (deprecated) implicit array because you don't specify the dimension. That requires the first variable listed to be a numeric field.

Try:
Array dx_codes {*} $ PRINCIPAL_DIAGNOSIS_CODE DIAGNOSIS_CODE_1-DIAGNOSIS_CODE_36;
Array proc_codes {*} $ PROCEDURE_CODE ICD_PROCEDURE_CODE_1-ICD_PROCEDURE_CODE_25;

It does not matter whether or not you include that extra {*}, or [*] or (*) , when defining the array. Either way you get the exact same array with the same number of elements. And you can reference the elements by explicitly including an index value in the array reference: dx_codes[i] for example.

There is only one implication of defining the array in a way that would allow you to reference the elements implicitly and that is that the index variable, which will be _i_ since none was specified in the ARRAY statement, will be dropped from the data vector. So if you also wanted to make a variable named _i_ in the output dataset you would have to call it something else during the data step and rename it using the RENAME statement or RENAME= dataset option.

Quentin · Posted 09-16-2022 07:57 PM

Thanks for the correction, @Tom . Not having messed with implicit arrays, I misread an old paper and thought that the first variable listed after the array name was was an index. If you want to specify an index for an implicit array, it is placed after the array name, but in parentheses.

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

FreelanceReinh · Posted 09-16-2022 04:50 PM

Assuming the code was submitted as posted (including the removed IF statements that have been mentioned):

The "Invalid numeric data" notes indeed suggest that array dx_codes was defined as a numeric array. But the variables constituting the array (PRINCIPAL_DIAGNOSIS_CODE, etc.) must come from the input dataset temp.hyst_prepost_ICD_CPT (otherwise they wouldn't have non-missing values).

This implies:

These variables must all be numeric because a mixture of character and numeric variables would cause an error message from the array definition.
Declaring the array as character won't help (although interestingly SAS doesn't complain) because the variable types cannot be changed after the SET statement.
Non-numeric codes like 'N9961' wouldn't need to be checked if the numeric variable type was correct. But, of course, the type is not correct.

Conclusion: The correction should rather start where dataset temp.hyst_prepost_ICD_CPT was created (if not earlier).

Reeza · Posted 09-16-2022 06:10 PM

Supposedly this query was working, see link below.

Seeing the actual log and full code would probably help a lot here.

https://communities.sas.com/t5/SAS-Programming/Medical-data-matching-diagnosis-codes-with-multiple-v...

Wolverine · Posted 09-22-2022 01:01 PM

There were 2 issues with my array statements. First, I needed the $. Second, I copied the array statements that @Reeza provided in my other post, but forgot to update the list of variable names for this data🤕 Here are the array statements that I used, which ran without error:

Array dx_codes $ ICD_DX1 ICD_DX1_PROF COMP_ICD_DX1-COMP_ICD_DX25;
Array proc_codes $ CPT CPT_PROF ICD_PR1-ICD_PR25;

This ran in 13h7m, less than half the ~30hrs it took with the Proc SQL match! Then I also added statements to delete any records where all flags=0. This improved processing time even further to 10h47m.

I then compared frequencies from the original Proc SQL version against the frequencies from this array version. I expected them to be an exact match, but there were a few minor discrepancies. In each of these discrepancies, there were a few more cases flagged in the array version than in the SQL version (in the range of 1-30 extra cases out of n=~7000). I don't think it's enough to make a statistically significant difference, but it is strange. Can anyone offer an explanation?

Next, I will try the hash approach suggested earlier in this thread, and I will report back on my findings.

SASKiwi · Posted 09-22-2022 04:42 PM

How about comparing the inputs for the cases where there are discrepancies and deciding which method is correct? If the new method is correct and the old one isn't then no problem. If the new method is incorrect for some cases then post these examples here if you can't fix it yourself.

Wolverine · Posted 09-23-2022 11:06 AM

@SASKiwi wrote:

How about comparing the inputs for the cases where there are discrepancies and deciding which method is correct? If the new method is correct and the old one isn't then no problem. If the new method is incorrect for some cases then post these examples here if you can't fix it yourself.

That's more complicated than it sounds due to the number of cases flagged, the number of records for discrepant cases, and the number of variables that could contain potential matches. But I did find an example and it does indeed have the correct proc code for that particular flag. It is present in the input data file (hyst_prepost_ICD_CPT) and in the flag table (HystProlapseCode_comb_mx), and it is correctly entered (ie, no blanks or extra characters) in each file. And the "On" statement in the Proc SQL includes the correct variables. I couldn't see any reason why they weren't matched, so I began to wonder about some kind of computer issue. I ran the Proc SQL with a version of the data file with only this case, and it was flagged! That suggests to me that there is nothing wrong with the syntax or even the data, but rather I suspect that SAS ran out of RAM during the big SQL merge and dropped some matches. Of course it would have been nice if there had been an error or warning msg about that...

Also, if you know that certain codes are notably more likely than others, then list them at the start of their respective arrays. And put the more commonly matched array first. You could get fancier, but I won't go into it.

That's good to know. But luckily the order I used in the array statements is also the order in which they are most likely to be matched.

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

Re: SAS generating a large file, but barely using any computer resources

SAS Innovate 2025: Call for Content

Classroom Training Available!