Solved: Re: Prevent SAS from storing data in Work Folder

SGcabashe1 · Posted 07-23-2020 06:52 AM

Good Day.

I am trying to merge 2 SAS datasets, one of which is ~100MB in size and the other ranges between 400 - 700GB per dataset. The code I'm using to do this looks similar to the below:

libname dsfirst "\\path\to\first_dataset";
libname destfp "\\destination_path";

proc sql;
 create table destfp.merge_20200131 as
 select a.person_id, b.*
 from dsfirst.firstds_202001 a left join destfp.secondds_20200131 b on 
 a.person_id = b.id_no;
quit;

The second dataset is currently saved in the destination file path, destfp.

Due to the size of the second dataset, I was hoping to circumvent writing data to the Work folder/drive on the SAS Workspace server that the code is running on as this is not big enough to house the resultant dataset, however SAS is still writing to this Work folder in spite of the above code. Any ideas why this is the case, and is there a way to avoid this? Thanks

SGcabashe1 · Posted 08-03-2020 10:39 AM

Good Day everyone.

Apologies for the late reply, but we decided to go with an alternative approach. Instead of merging the 2 data sets, we are now filtering the second data set based on the national_id in the first data set. In this manner and for our purposes, we avoid the need for merging in the first place, thus being able to run a PROC SQL statement such as the below without issue:

proc sql;
  create table destfp.finalds_20200131 as
  select * from destfp.secondds_20200131
  where id_no in (select distinct person_id from dsfirst.firstds_202001);
quit;

Yes, it is still a PROC SQL statement, however since we're not doing a join and the resultant data set is roughly a tenth of the size of the second data set, we are happy with the outcome.

Unfortunately, due to the sensitivity of the data we are working with, I am not able to post sample data, however the code is a near-identical representation of the actual code we are running.

Thanks everyone for the assistance. Special mention to @yabwon and @Kurt_Bremser for their ideas. Much appreciated.

View solution in original post

PaigeMiller · Posted 07-23-2020 06:59 AM

Why do you think that SAS is still using the WORK folder?

--
Paige Miller

yabwon · Posted 07-23-2020 07:01 AM

I would suspect that maybe utiliti files are stored there?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

SGcabashe1 · Posted 07-23-2020 07:02 AM

@PaigeMiller I honestly have no idea. I suspect it could be a configuration we are missing, but my SAS Server Administration knowledge is currently limited.

PaigeMiller · Posted 07-23-2020 07:06 AM

@SGcabashe1 wrote:
@PaigeMiller I honestly have no idea. I suspect it could be a configuration we are missing, but my SAS Server Administration knowledge is currently limited.

This does not answer the question. You stated definitively that data sets were being written to the WORK folder.

however SAS is still writing to this Work folder

What do you see that makes you say this?

--
Paige Miller

SGcabashe1 · Posted 07-23-2020 07:11 AM

@PaigeMiller my apologies. When the session is active, it creates files in my user folder in the Work folder, which are typically greater than or equal to the size of the second dataset. Left unchecked, this then results in the "Disk Full/OUT OF RESOURCES" error message. This occurs irrespective of the free space in destfp, the destination file path.

yabwon · Posted 07-23-2020 07:00 AM

What size of the result you are expecting? Will it be close to "100MB" of the first or "400GB" of the second?
Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

SGcabashe1 · Posted 07-23-2020 07:04 AM

@yabwon I'm expecting it to be at least 400GB, depending on the size of the second dataset.

yabwon · Posted 07-23-2020 07:07 AM

What is data structure of both sets, i.e. number of observations and variables? what is relation between `person_id` and `id_no`? is it 1-to-1? or 1-to-n?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

SGcabashe1 · Posted 07-23-2020 07:37 AM

@yabwon Dataset 1 typically has >17000 observations and 170 variables. Dataset 2 typically has between 40000000 and 50000000 observations and 7820 variables. As for the relation, it's 1-to-n - person_id from Dataset 1 can appear 1 or more times in Dataset 2.

yabwon · Posted 07-23-2020 07:58 AM

since "person_id from Dataset 1 can appear 1 or more times in Dataset 2" wouldn't it be more efficient to use 'inner join' instead "left join"?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

SGcabashe1 · Posted 07-24-2020 02:21 AM

You may have a point there. I'll look into it. Thanks

Kurt_Bremser · Posted 07-23-2020 07:57 AM

Using SQL for joins will always cause the creation of utility files in WORK, which can grow to considerable size; see Maxim 10.

To determine which method is most feasible to solve your issue, we need to get to Know Your Data (Maxim 3). Please supply information about number of observations involved in your join, and about the key relationship (1:1, 1:many, many:1, many:many).

My first idea would be this:

data destfp.merge_20200131_a;
set destfp.secondds_20200131;
if _n_ = 1
then do;
  person_id = id_no; /* just to set the attributes */
  declare hash first (dataset:"dsfirst.firstds_202001 (keep=person_id)");
  first.definekey('person_id');
  first.definedata('person_id');
  first.definedone();
end;
if first.check(key:id_no) = 0;
run;

If you can fit the hash into memory, then there is a very good chance you can now sort the tables, and use a MERGE to get the left join:

proc sort data=destfp.merge_20200131_a;
by person_id;
run;

proc sort data=dsfirst.firstds_202001
by person_id;
run;

data destfp.merge_20200131;
merge
  dsfirst.firstds_202001 (keep=person_id in=a)
  destfp.merge_20200131_a
;
by person_id;
if a;
run;

This code is not tested, for lack of example data to test against.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SGcabashe1 · Posted 08-03-2020 10:39 AM

Good Day everyone.

Apologies for the late reply, but we decided to go with an alternative approach. Instead of merging the 2 data sets, we are now filtering the second data set based on the national_id in the first data set. In this manner and for our purposes, we avoid the need for merging in the first place, thus being able to run a PROC SQL statement such as the below without issue:

proc sql;
  create table destfp.finalds_20200131 as
  select * from destfp.secondds_20200131
  where id_no in (select distinct person_id from dsfirst.firstds_202001);
quit;

Yes, it is still a PROC SQL statement, however since we're not doing a join and the resultant data set is roughly a tenth of the size of the second data set, we are happy with the outcome.

Unfortunately, due to the sensitivity of the data we are working with, I am not able to post sample data, however the code is a near-identical representation of the actual code we are running.

Thanks everyone for the assistance. Special mention to @yabwon and @Kurt_Bremser for their ideas. Much appreciated.

Registration is open

SAS Training: Just a Click Away