I'm running SAS on a Windows-based virtual machine. I used PROC SQL to join 2 files, and it took 30 hours. The output file is about 140GB. While this file was running, I used the Task Manager to examine CPU, Memory, and Ethernet usage, to see which of them was the bottleneck. Because it's a VM, there are no hard drives directly connected, and Ethernet is used to transfer all data in and out.
During the first couple of hours, it varied. Sometimes the CPU would be running high, sometimes it would be Memory, and sometimes it would be Ethernet. However, for many hours during the middle of the process, it didn't seem like anything was running high. CPU was in the 3-8% range (occasionally spiking to about 30% for a second and then dropping down again), Memory held steady at 39%, and Ethernet was about 88Kbs upload and 0Kbs download.
So what is SAS doing for all those hours? It's barely processing data, it's not holding much data in active memory, and it's not transferring much data. It's frustrating to wait 30 hours for a file when it seems like SAS isn't doing anything!
If this SQL accesses databases, a lot of the time would be while the databases are doing the work, so SAS isn't doing anything until the DB returns the information to SAS.
The files are .sas7bdat. As I understand it, these are static files, so SAS is not querying a database.
Without knowing what was requested it is kind of hard to guess what might be going on.
You might want to provide the example code and the number of records in each data set.
It's a complicated merge because there are about 50 variables that it can match on. The file prepost_ICD_CPT has about 54 million records and is about 82GB. The file prepost_FLAGS has 930 records and is about 1MB.
PROC SQL;
Create table temp.prepost_ICD_CPT_flags
as Select distinct a.*,b.*
From temp.prepost_ICD_CPT a JOIN temp.prepost_FLAGS b
On a.cpt = b.HEDIS_proc_code OR a.cpt_prof = b.HEDIS_proc_code OR
a.icd_pr1 = b.HEDIS_proc_code OR
a.icd_pr2 = b.HEDIS_proc_code OR
a.icd_pr3 = b.HEDIS_proc_code OR
a.icd_pr4 = b.HEDIS_proc_code OR
a.icd_pr5 = b.HEDIS_proc_code OR
a.icd_pr6 = b.HEDIS_proc_code OR
a.icd_pr7 = b.HEDIS_proc_code OR
a.icd_pr8 = b.HEDIS_proc_code OR
a.icd_pr9 = b.HEDIS_proc_code OR
a.icd_pr10 = b.HEDIS_proc_code OR
a.icd_pr11 = b.HEDIS_proc_code OR
a.icd_pr12 = b.HEDIS_proc_code OR
a.icd_pr13 = b.HEDIS_proc_code OR
a.icd_pr14 = b.HEDIS_proc_code OR
a.icd_pr15 = b.HEDIS_proc_code OR
a.icd_pr16 = b.HEDIS_proc_code OR
a.icd_pr17 = b.HEDIS_proc_code OR
a.icd_pr18 = b.HEDIS_proc_code OR
a.icd_pr19 = b.HEDIS_proc_code OR
a.icd_pr20 = b.HEDIS_proc_code OR
a.icd_pr21 = b.HEDIS_proc_code OR
a.icd_pr22 = b.HEDIS_proc_code OR
a.icd_pr23 = b.HEDIS_proc_code OR
a.icd_pr24 = b.HEDIS_proc_code OR
a.icd_pr25 = b.HEDIS_proc_code OR
a.icd_dx1_prof = b.HEDIS_proc_code OR
a.comp_icd_DX1 = b.HEDIS_DX_code OR
a.comp_icd_DX2 = b.HEDIS_DX_code OR
a.comp_icd_DX3 = b.HEDIS_DX_code OR
a.comp_icd_DX4 = b.HEDIS_DX_code OR
a.comp_icd_DX5 = b.HEDIS_DX_code OR
a.comp_icd_DX6 = b.HEDIS_DX_code OR
a.comp_icd_DX7 = b.HEDIS_DX_code OR
a.comp_icd_DX8 = b.HEDIS_DX_code OR
a.comp_icd_DX9 = b.HEDIS_DX_code OR
a.comp_icd_DX10 = b.HEDIS_DX_code OR
a.comp_icd_DX11 = b.HEDIS_DX_code OR
a.comp_icd_DX12 = b.HEDIS_DX_code OR
a.comp_icd_DX13 = b.HEDIS_DX_code OR
a.comp_icd_DX14 = b.HEDIS_DX_code OR
a.comp_icd_DX15 = b.HEDIS_DX_code OR
a.comp_icd_DX16 = b.HEDIS_DX_code OR
a.comp_icd_DX17 = b.HEDIS_DX_code OR
a.comp_icd_DX18 = b.HEDIS_DX_code OR
a.comp_icd_DX19 = b.HEDIS_DX_code OR
a.comp_icd_DX20 = b.HEDIS_DX_code OR
a.comp_icd_DX21 = b.HEDIS_DX_code OR
a.comp_icd_DX22 = b.HEDIS_DX_code OR
a.comp_icd_DX23 = b.HEDIS_DX_code OR
a.comp_icd_DX24 = b.HEDIS_DX_code OR
a.comp_icd_DX25 = b.HEDIS_DX_code
Where /*Delete records that don't match any of the flag proc or DX codes*/
(a.cpt in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.cpt_prof in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr1 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr2 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr3 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr4 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr5 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr6 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr7 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr8 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr9 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr10 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr11 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr12 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr13 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr14 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr15 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr16 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr17 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr18 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr19 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr20 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr21 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr22 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr23 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr24 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_pr25 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
a.icd_DX1_prof in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX1 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX2 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX3 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX4 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX5 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX6 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX7 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX8 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX9 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX10 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX11 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX12 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX13 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX14 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX15 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX16 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX17 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX18 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX19 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX20 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX21 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX22 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX23 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX24 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
a.comp_icd_DX25 in (select HEDIS_DX_code from temp.prepost_FLAGS));
QUIT;
Where is the TEMP SAS library stored? On your local hard drive?
Your query is obviously IO bound and is seriously inefficient because of all of the sub-queries you have in your WHERE clause. The fact that you have to check over 50 variables to do your join points to a very inefficient table design. Rewriting this in a DATA step will speed this up drastically. We can't help you with that until you post some sample data for the two input tables.
@SASKiwi wrote:
Where is the TEMP SAS library stored? On your local hard drive?
Your query is obviously IO bound and is seriously inefficient because of all of the sub-queries you have in your WHERE clause. The fact that you have to check over 50 variables to do your join points to a very inefficient table design. Rewriting this in a DATA step will speed this up drastically. We can't help you with that until you post some sample data for the two input tables.
I recognize that the syntax is very inefficient, but let me provide some context. This is an older bit of code that I wrote a few months ago. Originally, only proc codes were required. So while it was slow, it still completed by the following morning. In the months since, I worked on some similar projects and developed some better code. I even posted about it here and @Reeza helped me further improve my approach.
This past Friday afternoon, I got a request to update this analysis file to include DX codes. Rather than trying to rewrite the syntax with this better approach, it was a lot faster and easier to simply expand what I had already written. I knew it would be slow, but it had the long Labor Day weekend to run. I remotely logged in several times over the weekend to keep an eye on it and make sure it didn't crash. That's when I noticed the lack of computer resources being used, and that piqued my curiosity.
And that's what this post is really about. What is SAS doing when it doesn't appear to be doing much at all? You say that it's very IO intensive, but doesn't that imply sustained periods of near-max transfer speeds?
It's a virtual machine, so my understanding is that there is no local hard drive. All data is transferred via Ethernet. There is a C: drive (where Windows is stored) and an E: drive (titled "Data"), but even if they are connected locally, neither of these is sufficiently large to work as the temp directory.
Are you using SAS installed on the VM or is it on a server?
It's the sub-queries that are adding a lot of extra IO. Every row read in the main table, requires the sub-queries to be run so its no wonder it runs so slowly. Personally I don't think it is worth persevering with this approach unless you are OK with having to run it overnight all the time.
Since you have one quite small lookup table, a DATA step using two hash objects (for the separate keys) will be much faster.
And you must seriously consider a long layout for prepost_ICD_CPT, see Maxim 19.
I don't think you need the WHERE clause, which looks like it's filtering rows with a successful match. It's just reiterating your intention to produce an inner join, which I think is already the case for an unmodified "JOIN" operation. Others more expert than I can give a more authoritative opinion, but I believe the particular WHERE clauses you specify are totally superfluous ... but very resource costly. So maybe just dropping the WHERE conditions would speed things up sufficiently, without changing results.
Now, in a context more familiar to me: If the temp.preport_FLAGS dataset has no duplicate HEDIS_proc_code values or duplicate HEDIS_dx_code values, then this data step could be a lot faster.
data temp.prepost_ICD_CPT_flags (drop=rc i);
set temp.prepost_ICD_CPT ;
if _n_=1 then do;
if 0 then set temp.preport_FLAGS;
declare hash hash_pc (dataset:'temp.preport_FLAGS');
hash_pc.definekey('HEDIS_proc_code');
hash_pc.definedata(all:'Y');
hash_pc.definedone();
declare hash hash_dx (dataset:'temp.preport_FLAGS');
hash_dx.definekey('HEDIS_dx_code');
hash_dx.definedata(all:'Y');
hash_dx.definedone();
end;
array icd_p {*} cpt cpt_prof icd_pr1-icd_pr25 icd_dx1_prof ;
/*Check the 28 vars above until first HEDIS_proc_code match (i.e. rc=0)*/
do i=1 to dim(icd_p) until (rc=0);
rc=hash_pc.find(key:icd_p{i});
end;
array icd_d {*} comp_icd_DX1-comp_icd_DX25 ;
/*If still unmatched, check the 25 vars above until first HEDIS_dx_code match */
if rc^=0 then do i=1 to dim(icd_d) until (rc=0);
rc=hash_dx.find(key:icd_d{i});
end;
if rc=0 then output; /*Only output successful matches */
run;
BTW this code assumes:
Of course, this is untested code in the absence of sample data.
Hello @Wolverine
Further to what has been said in the posts above I would like to share my observation.
First, having looked at the code and the output file size, the time of 30 hours is understandable. It's running on a VM would not matter (More often then not servers run on VM infrastructure). Similarly the data being in external data bases would impact all programs and not just this one. It is the norm that external database and SAS work on different machines.
In this code a large dataset (54 million records) is being joined to a small dataset (930 records).
Consider alternative approaches such as the one by @mkeintz OR from this paper ( https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p102-26.pdf ).
Here a coding variant to the hash lookup approach as mentioned by @Kurt_Bremser and for which @mkeintz already posted some code.
Fully untested of course.
data want(drop=_:);
if _n_=1 then
do;
if 0 then set temp.prepost_FLAGS;
dcl hash h_proc (dataset:"temp.prepost_FLAGS");
h_proc.defineKey('HEDIS_proc_code');
h_proc.defineData(all:'y');
h_proc.defineDone();
dcl hash h_dx(dataset:"temp.prepost_FLAGS");
h_dx.defineKey('HEDIS_DX_code');
h_dx.defineData(all:'y');
h_dx.defineDone();
end;
call missing(of _all_);
set temp.prepost_ICD_CPT;
array _a_proc {*} icd_pr1-icd_pr25 icd_dx1_prof;
do _i=1 to dim(_a_proc);
if h_proc.find(key:_a_proc[_i])=0 then
do;
output;
return;
end;
end;
array _a_dx {*} comp_icd_DX1-comp_icd_DX25;
do _i=1 to dim(_a_dx);
if h_dx.find(key:_a_dx[_i])=0 then
do;
output;
return;
end;
end;
run;
/* in case there can be duplicates in temp.prepost_ICD_CPT here mimicking the DISTINCT from the SQL */
proc sort data=want;
by _all_;
run;
The code above writes a row to output as soon as it finds a matching flag and then stops further checks (return statement).
If for a single row there can be multiple flags that match and you want to write out the row multiple times then you would have to remove the RETURN; statements.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.