BookmarkSubscribeRSS Feed
Wolverine
Pyrite | Level 9

I'm running SAS on a Windows-based virtual machine.  I used PROC SQL to join 2 files, and it took 30 hours.  The output file is about 140GB.  While this file was running, I used the Task Manager to examine CPU, Memory, and Ethernet usage, to see which of them was the bottleneck.  Because it's a VM, there are no hard drives directly connected, and Ethernet is used to transfer all data in and out.

 

During the first couple of hours, it varied.  Sometimes the CPU would be running high, sometimes it would be Memory, and sometimes it would be Ethernet.  However, for many hours during the middle of the process, it didn't seem like anything was running high.  CPU was in the 3-8% range (occasionally spiking to about 30% for a second and then dropping down again), Memory held steady at 39%, and Ethernet was about 88Kbs upload and 0Kbs download.

 

So what is SAS doing for all those hours?  It's barely processing data, it's not holding much data in active memory, and it's not transferring much data.  It's frustrating to wait 30 hours for a file when it seems like SAS isn't doing anything!

42 REPLIES 42
PaigeMiller
Diamond | Level 26

If this SQL accesses databases, a lot of the time would be while the databases are doing the work, so SAS isn't doing anything until the DB returns the information to SAS.

--
Paige Miller
Wolverine
Pyrite | Level 9

The files are .sas7bdat.  As I understand it, these are static files, so SAS is not querying a database.

ballardw
Super User

Without knowing what was requested it is kind of hard to guess what might be going on.

You might want to provide the example code and the number of records in each data set.

 

 

Wolverine
Pyrite | Level 9

It's a complicated merge because there are about 50 variables that it can match on.  The file prepost_ICD_CPT has about 54 million records and is about 82GB.  The file prepost_FLAGS has 930 records and is about 1MB.

 

 

PROC SQL;
	Create table temp.prepost_ICD_CPT_flags
	as Select distinct a.*,b.*
	From temp.prepost_ICD_CPT a JOIN temp.prepost_FLAGS b
	On a.cpt = b.HEDIS_proc_code OR a.cpt_prof = b.HEDIS_proc_code OR
		a.icd_pr1 = b.HEDIS_proc_code OR 
		a.icd_pr2 = b.HEDIS_proc_code OR 
		a.icd_pr3 = b.HEDIS_proc_code OR 
		a.icd_pr4 = b.HEDIS_proc_code OR 
		a.icd_pr5 = b.HEDIS_proc_code OR 
		a.icd_pr6 = b.HEDIS_proc_code OR 
		a.icd_pr7 = b.HEDIS_proc_code OR 
		a.icd_pr8 = b.HEDIS_proc_code OR 
		a.icd_pr9 = b.HEDIS_proc_code OR 
		a.icd_pr10 = b.HEDIS_proc_code OR 
		a.icd_pr11 = b.HEDIS_proc_code OR 
		a.icd_pr12 = b.HEDIS_proc_code OR 
		a.icd_pr13 = b.HEDIS_proc_code OR 
		a.icd_pr14 = b.HEDIS_proc_code OR 
		a.icd_pr15 = b.HEDIS_proc_code OR 
		a.icd_pr16 = b.HEDIS_proc_code OR 
		a.icd_pr17 = b.HEDIS_proc_code OR 
		a.icd_pr18 = b.HEDIS_proc_code OR 
		a.icd_pr19 = b.HEDIS_proc_code OR 
		a.icd_pr20 = b.HEDIS_proc_code OR 
		a.icd_pr21 = b.HEDIS_proc_code OR 
		a.icd_pr22 = b.HEDIS_proc_code OR 
		a.icd_pr23 = b.HEDIS_proc_code OR 
		a.icd_pr24 = b.HEDIS_proc_code OR 
		a.icd_pr25 = b.HEDIS_proc_code OR
		a.icd_dx1_prof = b.HEDIS_proc_code OR
		a.comp_icd_DX1 = b.HEDIS_DX_code OR 
		a.comp_icd_DX2 = b.HEDIS_DX_code OR 
		a.comp_icd_DX3 = b.HEDIS_DX_code OR 
		a.comp_icd_DX4 = b.HEDIS_DX_code OR 
		a.comp_icd_DX5 = b.HEDIS_DX_code OR 
		a.comp_icd_DX6 = b.HEDIS_DX_code OR 
		a.comp_icd_DX7 = b.HEDIS_DX_code OR 
		a.comp_icd_DX8 = b.HEDIS_DX_code OR 
		a.comp_icd_DX9 = b.HEDIS_DX_code OR 
		a.comp_icd_DX10 = b.HEDIS_DX_code OR 
		a.comp_icd_DX11 = b.HEDIS_DX_code OR 
		a.comp_icd_DX12 = b.HEDIS_DX_code OR 
		a.comp_icd_DX13 = b.HEDIS_DX_code OR 
		a.comp_icd_DX14 = b.HEDIS_DX_code OR 
		a.comp_icd_DX15 = b.HEDIS_DX_code OR 
		a.comp_icd_DX16 = b.HEDIS_DX_code OR 
		a.comp_icd_DX17 = b.HEDIS_DX_code OR 
		a.comp_icd_DX18 = b.HEDIS_DX_code OR 
		a.comp_icd_DX19 = b.HEDIS_DX_code OR 
		a.comp_icd_DX20 = b.HEDIS_DX_code OR 
		a.comp_icd_DX21 = b.HEDIS_DX_code OR 
		a.comp_icd_DX22 = b.HEDIS_DX_code OR 
		a.comp_icd_DX23 = b.HEDIS_DX_code OR 
		a.comp_icd_DX24 = b.HEDIS_DX_code OR 
		a.comp_icd_DX25 = b.HEDIS_DX_code
	Where /*Delete records that don't match any of the flag proc or DX codes*/
		(a.cpt in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.cpt_prof in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr1 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr2 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr3 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr4 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr5 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr6 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr7 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr8 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr9 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr10 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr11 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr12 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr13 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr14 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr15 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr16 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr17 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr18 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr19 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr20 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr21 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr22 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr23 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr24 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_pr25 in (select HEDIS_proc_code from temp.prepost_FLAGS) OR
			a.icd_DX1_prof in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX1 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX2 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX3 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX4 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX5 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX6 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX7 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX8 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX9 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX10 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX11 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX12 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX13 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX14 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX15 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX16 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX17 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX18 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX19 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX20 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX21 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX22 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX23 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX24 in (select HEDIS_DX_code from temp.prepost_FLAGS) OR
			a.comp_icd_DX25 in (select HEDIS_DX_code from temp.prepost_FLAGS));
QUIT;

 

 

SASKiwi
PROC Star

Where is the TEMP SAS library stored? On your local hard drive? 

 

Your query is obviously IO bound and is seriously inefficient because of all of the sub-queries you have in your WHERE clause. The fact that you have to check over 50 variables to do your join points to a very inefficient table design. Rewriting this in a DATA step will speed this up drastically. We can't help you with that until you post some sample data for the two input tables. 

Wolverine
Pyrite | Level 9

@SASKiwi wrote:

Where is the TEMP SAS library stored? On your local hard drive? 

 

Your query is obviously IO bound and is seriously inefficient because of all of the sub-queries you have in your WHERE clause. The fact that you have to check over 50 variables to do your join points to a very inefficient table design. Rewriting this in a DATA step will speed this up drastically. We can't help you with that until you post some sample data for the two input tables. 


I recognize that the syntax is very inefficient, but let me provide some context.  This is an older bit of code that I wrote a few months ago.  Originally, only proc codes were required.  So while it was slow, it still completed by the following morning.  In the months since, I worked on some similar projects and developed some better code.  I even posted about it here  and @Reeza helped me further improve my approach. 

 

This past Friday afternoon, I got a request to update this analysis file to include DX codes.  Rather than trying to rewrite the syntax with this better approach, it was a lot faster and easier to simply expand what I had already written.  I knew it would be slow, but it had the long Labor Day weekend to run.  I remotely logged in several times over the weekend to keep an eye on it and make sure it didn't crash.  That's when I noticed the lack of computer resources being used, and that piqued my curiosity.

 

And that's what this post is really about.  What is SAS doing when it doesn't appear to be doing much at all?  You say that it's very IO intensive, but doesn't that imply sustained periods of near-max transfer speeds? 

 

It's a virtual machine, so my understanding is that there is no local hard drive.  All data is transferred via Ethernet.  There is a C: drive (where Windows is stored) and an E: drive (titled "Data"), but even if they are connected locally, neither of these is sufficiently large to work as the temp directory.

Reeza
Super User

Are you using SAS installed on the VM or is it on a server?

SASKiwi
PROC Star

It's the sub-queries that are adding a lot of extra IO. Every row read in the main table, requires the sub-queries to be run so its no wonder it runs so slowly. Personally I don't think it is worth persevering with this approach unless you are OK with having to run it overnight all the time.

Reeza
Super User
Agreed with @SASKiwi, this is an inefficient query that can be optimized with a format and possibly a data step or simpler SQL. Post some sample data and show the expected logic and we can probably save you hours with your query.
Kurt_Bremser
Super User

Since you have one quite small lookup table, a DATA step using two hash objects (for the separate keys) will be much faster.

And you must seriously consider a long layout for prepost_ICD_CPT, see Maxim 19.

mkeintz
PROC Star

I don't think you need the WHERE clause, which looks like it's filtering rows with a successful match.  It's just reiterating your intention to produce an inner join, which I think is already the case for an unmodified "JOIN" operation.  Others more expert than I can give a more authoritative opinion, but I believe the particular WHERE clauses you specify are totally superfluous ... but very resource costly.  So maybe just dropping the WHERE conditions would speed things up sufficiently, without changing results.

 

Now, in a context more familiar to me: If the temp.preport_FLAGS dataset has no duplicate HEDIS_proc_code values or duplicate HEDIS_dx_code values, then this data step could be a lot faster.

 

 

data temp.prepost_ICD_CPT_flags (drop=rc i);
  set temp.prepost_ICD_CPT ;
  if _n_=1 then do;
    if 0 then set temp.preport_FLAGS;
    declare hash hash_pc (dataset:'temp.preport_FLAGS');
      hash_pc.definekey('HEDIS_proc_code');
      hash_pc.definedata(all:'Y');
      hash_pc.definedone();
    declare hash hash_dx (dataset:'temp.preport_FLAGS');
      hash_dx.definekey('HEDIS_dx_code');
      hash_dx.definedata(all:'Y');
      hash_dx.definedone();
  end;

  array icd_p {*} cpt cpt_prof icd_pr1-icd_pr25 icd_dx1_prof ;
  /*Check the 28 vars above until first HEDIS_proc_code match (i.e. rc=0)*/
  do i=1 to dim(icd_p) until (rc=0);
    rc=hash_pc.find(key:icd_p{i});
  end;

  array icd_d {*} comp_icd_DX1-comp_icd_DX25 ;
  /*If still unmatched, check the 25 vars above until first HEDIS_dx_code match */
  if rc^=0 then do i=1 to dim(icd_d) until (rc=0);
    rc=hash_dx.find(key:icd_d{i});
  end;

  if rc=0 then output; /*Only output successful matches */
run;

 

 

BTW this code assumes:

  1. all the matchable codes are numeric.  If not, the ARRAY statements need to be changed to specify character arrays.

  2. There are no identical variable names in the two datasets.  That's because this code would allow the preport_flags variable values to overwrite the same name variables found in the prepost_ICD_CPT, while the proc sql code would do the opposite.

  3. [editted addition].  If there are certain variables listed in the arrays that you expect to be significantly more likely to be matched than others, list them at the start of their corresponding arrays.  That rearrangement will find those (frequent) matches in earlier iterations of the DO loops.   You could even make a couple of smaller, high-likelihood, arrays and do loops (one for the HEDIS_proc_code candidates, one for the HEDIS_dx_code) that you could start with, followed by arrays and loops for the remaining match candidates. 

 

 

 

 

Of course, this is untested code in the absence of sample data.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Sajid01
Meteorite | Level 14

Hello @Wolverine 
Further to what has been said in the posts above I  would like to share my observation.
First, having looked at the code and the output file size, the time of 30 hours is understandable. It's running on a VM would not matter (More often then not servers run on VM infrastructure).  Similarly the data being in external data bases would impact all programs and not just this one. It is the norm that external database and SAS work on different machines.
In this code a large dataset (54 million records) is being joined to a small dataset (930 records). 
Consider alternative approaches such as the one by @mkeintz  OR from this paper  ( https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p102-26.pdf ).

 

 

Patrick
Opal | Level 21

Here a coding variant to the hash lookup approach as mentioned by @Kurt_Bremser and for which @mkeintz already posted some code. 
Fully untested of course.

data want(drop=_:);

  if _n_=1 then
    do;
      if 0 then set temp.prepost_FLAGS;
      dcl hash h_proc (dataset:"temp.prepost_FLAGS");
      h_proc.defineKey('HEDIS_proc_code');
      h_proc.defineData(all:'y');
      h_proc.defineDone();

      dcl hash h_dx(dataset:"temp.prepost_FLAGS");
      h_dx.defineKey('HEDIS_DX_code');
      h_dx.defineData(all:'y');
      h_dx.defineDone();
    end;
  call missing(of _all_);

  set temp.prepost_ICD_CPT;

  array _a_proc {*} icd_pr1-icd_pr25 icd_dx1_prof;
  do _i=1 to dim(_a_proc);
    if h_proc.find(key:_a_proc[_i])=0 then 
      do;
        output;
        return;
      end;
  end;

  array _a_dx {*} comp_icd_DX1-comp_icd_DX25;
  do _i=1 to dim(_a_dx);
    if h_dx.find(key:_a_dx[_i])=0 then 
      do;
        output;
        return;
      end;
  end;

run;

/* in case there can be duplicates in temp.prepost_ICD_CPT here mimicking the DISTINCT from the SQL */
proc sort data=want;
  by _all_;
run;

The code above writes a row to output as soon as it finds a matching flag and then stops further checks (return statement). 
If for a single row there can be multiple flags that match and you want to write out the row multiple times then you would have to remove the RETURN; statements.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 42 replies
  • 2415 views
  • 26 likes
  • 12 in conversation