BookmarkSubscribeRSS Feed
eiger
Calcite | Level 5

When I subset a dataset in a DATA step, SAS will continue to run for much longer than expected, so long in fact that I have not seen it finish running. However, when I break the run and cancel the submitted statements, the log indicates that 1,271 observations were read, which is number of observations that I expect to have in the subset. Why is it that SAS keeps running when all of the observations that match the WHERE condition have been read?

 

In the DATA step I use a WHERE statement to subset for observations where the character variable SUB = '123'. The dataset is large (1.3M+ obs.), but as I mentioned, the resulting data set "filtered_items" should only have 1,271 observations. 

libname corpxin "\\filepath\folder";

data filtered_items;
	set corpxin.items_202001;
	where SUB = '123';
run;
9 REPLIES 9
Tom
Super User Tom
Super User

Where is the libref pointing?

Is that an actual SAS dataset? Or is the libref pointing to some external database? 

Is the member an actual dataset (or database table) or is it a view?

Is your where clause really that simple equality test? Or is it something more complex?  

 

Note that if the libref is pointing to a remote database and the where clause is something that cannot be passed to the remote database then the delay is probably from SAS having to copy ALL of the observations to the SAS server in order to apply the filter.

eiger
Calcite | Level 5

The library is a folder on a remote drive. The file corpxin.items_202001 is a .sas7bdat dataset. 

Yes, the where clause is just a simple equality test. 

 

mkeintz
PROC Star

Local SAS probably doesn't know that the library is on a remote drive.  That is masked by the operating system, or the remote-access file system.   So the data engine (and the "where" filter) is probably running on the local system, meaning, as @Tom suggested, all the data is being transported to the client (where SAS is running) system prior to applying the filter.

 

Since the data file is a .sas7bat file, you would need a way to have SAS to run a data engine on the remote system to apply the filter prior to transport to the local system.  I have done this with sas/connect, but that requires a SAS license on both the client and server systems.

 

In the absence of the above, you can at least measure your progress with something like this, which reports elapsed time for the 1st, 101st, 201st, etc. obs that passes the where filter.

 

data _null_;
  t0=put(time(),time8.0);
  call symput("T0",t0);
 run;
%put &=t0;

data filtered_items;
	set corpxin.items_202001;
	where SUB = '123';

  if mod(_n_,100)=1 then do;  
    t1=time();
    elapsed=t1-"&t0"t;
    put _n_=comma6.0  t1=time8.0  elapsed=time8.0;
  end;
  drop t1 elapsed;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

@eiger wrote:

The library is a folder on a remote drive. The file corpxin.items_202001 is a .sas7bdat dataset. 

Yes, the where clause is just a simple equality test. 

 


SAS will have to read the WHOLE file to know whether or not there are any more observations that meet the where condition.

 

You could probably improve the performance of that query in a couple of different ways.

 

You could add an index on that variable.  Then SAS could read the index file (which should be much smaller) and find out which parts of the dataset need to be read.

 

If the dataset was sorted by that variable (and the values you want are near the start) then you could use an IF statement instead of a where statement and add a test to stop when you have passed the values of interest.

libname corpxin "\\filepath\folder";

data filtered_items;
  set corpxin.items_202001;
  by sub;
  if SUB = '123' then output;
  if sub > '123' then stop;
run;

 

ballardw
Super User

@eiger wrote:

The library is a folder on a remote drive. The file corpxin.items_202001 is a .sas7bdat dataset. 

Yes, the where clause is just a simple equality test. 

 


I am going to say a combination of the data step having to read every observation coupled with your network bandwidth or traffic.

 

Several years ago I copied network data sets to my local drive because the amount of time to access a data set with 5,000 observations was excessive. As in running proc freq on a couple of variables could take 15 minutes because of the amount of network traffic using that drive. Running with the local copy was a few seconds. I can image that if your data set with a million plus observations was on that drive with the network in effect that it could take hours to complete.

PaigeMiller
Diamond | Level 26

 Why is it that SAS keeps running when all of the observations that match the WHERE condition have been read?

 

I will guess that you did not submit the run; command.

 

Or guess #2 — SAS is still crunching through the 1.3 million records. SAS can find the 1271 of interest and still have 1.2 million records to search through; it doesn't know there are no more matches.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Or guess #2 — SAS is still crunching through the 1.3 million records. SAS can find the 1271 of interest and still have 1.2 million records to search through; it doesn't know there are no more matches.

--
Paige Miller
mkeintz
PROC Star

Also, if the OP is certain that only 1,271 observations satisfy the filter, the program can be made to stop after outputting those 1,271, even if there remains a large part of the data file yet to be submitted to the filter.

 

data filtered_items;
	set corpxin.items_202001;
	where SUB = '123';

  output;
  if _n_=1271 then stop;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 393 views
  • 0 likes
  • 5 in conversation