BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SGcabashe1
Fluorite | Level 6

Good Day. 

I am trying to merge 2 SAS datasets, one of which is ~100MB in size and the other ranges between 400 - 700GB per dataset. The code I'm using to do this looks similar to the below:

libname dsfirst "\\path\to\first_dataset";
libname destfp "\\destination_path";

proc sql;
 create table destfp.merge_20200131 as
 select a.person_id, b.*
 from dsfirst.firstds_202001 a left join destfp.secondds_20200131 b on 
 a.person_id = b.id_no;
quit;

The second dataset is currently saved in the destination file path, destfp.

 

Due to the size of the second dataset, I was hoping to circumvent writing data to the Work folder/drive on the SAS Workspace server that the code is running on as this is not big enough to house the resultant dataset, however SAS is still writing to this Work folder in spite of the above code. Any ideas why this is the case, and is there a way to avoid this? Thanks


1 ACCEPTED SOLUTION

Accepted Solutions
SGcabashe1
Fluorite | Level 6

Good Day everyone.

 

Apologies for the late reply, but we decided to go with an alternative approach. Instead of merging the 2 data sets, we are now filtering the second data set based on the national_id in the first data set. In this manner and for our purposes, we avoid the need for merging in the first place, thus being able to run a PROC SQL statement such as the below without issue:

proc sql;
  create table destfp.finalds_20200131 as
  select * from destfp.secondds_20200131
  where id_no in (select distinct person_id from dsfirst.firstds_202001);
quit;

Yes, it is still a PROC SQL statement, however since we're not doing a join and the resultant data set is roughly a tenth of the size of the second data set, we are happy with the outcome.

 

Unfortunately, due to the sensitivity of the data we are working with, I am not able to post sample data, however the code is a near-identical representation of the actual code we are running.

 

Thanks everyone for the assistance. Special mention to @yabwon  and @Kurt_Bremser for their ideas. Much appreciated.

View solution in original post

13 REPLIES 13
PaigeMiller
Diamond | Level 26

Why do you think that SAS is still using the WORK folder?

--
Paige Miller
yabwon
Onyx | Level 15
I would suspect that maybe utiliti files are stored there?
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



SGcabashe1
Fluorite | Level 6
@PaigeMiller I honestly have no idea. I suspect it could be a configuration we are missing, but my SAS Server Administration knowledge is currently limited.
PaigeMiller
Diamond | Level 26

@SGcabashe1 wrote:
@PaigeMiller I honestly have no idea. I suspect it could be a configuration we are missing, but my SAS Server Administration knowledge is currently limited.

This does not answer the question. You stated definitively that data sets were being written to the WORK folder.

 

however SAS is still writing to this Work folder 

 

What do you see that makes you say this?

--
Paige Miller
SGcabashe1
Fluorite | Level 6
@PaigeMiller my apologies. When the session is active, it creates files in my user folder in the Work folder, which are typically greater than or equal to the size of the second dataset. Left unchecked, this then results in the "Disk Full/OUT OF RESOURCES" error message. This occurs irrespective of the free space in destfp, the destination file path.
yabwon
Onyx | Level 15
What size of the result you are expecting? Will it be close to "100MB" of the first or "400GB" of the second?
Bart
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



SGcabashe1
Fluorite | Level 6
@yabwon I'm expecting it to be at least 400GB, depending on the size of the second dataset.
yabwon
Onyx | Level 15

What is data structure of both sets, i.e. number of observations and variables? what is relation between `person_id` and `id_no`? is it 1-to-1? or 1-to-n?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



SGcabashe1
Fluorite | Level 6
@yabwon Dataset 1 typically has >17000 observations and 170 variables. Dataset 2 typically has between 40000000 and 50000000 observations and 7820 variables. As for the relation, it's 1-to-n - person_id from Dataset 1 can appear 1 or more times in Dataset 2.
yabwon
Onyx | Level 15
since "person_id from Dataset 1 can appear 1 or more times in Dataset 2" wouldn't it be more efficient to use 'inner join' instead "left join"?
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



SGcabashe1
Fluorite | Level 6
You may have a point there. I'll look into it. Thanks
Kurt_Bremser
Super User

Using SQL for joins will always cause the creation of utility files in WORK, which can grow to considerable size; see Maxim 10.

 

To determine which method is most feasible to solve your issue, we need to get to Know Your Data (Maxim 3). Please supply information about number of observations involved in your join, and about the key relationship (1:1, 1:many, many:1, many:many).

 

My first idea would be this:

data destfp.merge_20200131_a;
set destfp.secondds_20200131;
if _n_ = 1
then do;
  person_id = id_no; /* just to set the attributes */
  declare hash first (dataset:"dsfirst.firstds_202001 (keep=person_id)");
  first.definekey('person_id');
  first.definedata('person_id');
  first.definedone();
end;
if first.check(key:id_no) = 0;
run;

If you can fit the hash into memory, then there is a very good chance you can now sort the tables, and use a MERGE to get the left join:

proc sort data=destfp.merge_20200131_a;
by person_id;
run;

proc sort data=dsfirst.firstds_202001
by person_id;
run;

data destfp.merge_20200131;
merge
  dsfirst.firstds_202001 (keep=person_id in=a)
  destfp.merge_20200131_a
;
by person_id;
if a;
run;

This code is not tested, for lack of example data to test against.

SGcabashe1
Fluorite | Level 6

Good Day everyone.

 

Apologies for the late reply, but we decided to go with an alternative approach. Instead of merging the 2 data sets, we are now filtering the second data set based on the national_id in the first data set. In this manner and for our purposes, we avoid the need for merging in the first place, thus being able to run a PROC SQL statement such as the below without issue:

proc sql;
  create table destfp.finalds_20200131 as
  select * from destfp.secondds_20200131
  where id_no in (select distinct person_id from dsfirst.firstds_202001);
quit;

Yes, it is still a PROC SQL statement, however since we're not doing a join and the resultant data set is roughly a tenth of the size of the second data set, we are happy with the outcome.

 

Unfortunately, due to the sensitivity of the data we are working with, I am not able to post sample data, however the code is a near-identical representation of the actual code we are running.

 

Thanks everyone for the assistance. Special mention to @yabwon  and @Kurt_Bremser for their ideas. Much appreciated.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2407 views
  • 2 likes
  • 4 in conversation