Good Day.
I am trying to merge 2 SAS datasets, one of which is ~100MB in size and the other ranges between 400 - 700GB per dataset. The code I'm using to do this looks similar to the below:
libname dsfirst "\\path\to\first_dataset";
libname destfp "\\destination_path";
proc sql;
create table destfp.merge_20200131 as
select a.person_id, b.*
from dsfirst.firstds_202001 a left join destfp.secondds_20200131 b on
a.person_id = b.id_no;
quit;
The second dataset is currently saved in the destination file path, destfp.
Due to the size of the second dataset, I was hoping to circumvent writing data to the Work folder/drive on the SAS Workspace server that the code is running on as this is not big enough to house the resultant dataset, however SAS is still writing to this Work folder in spite of the above code. Any ideas why this is the case, and is there a way to avoid this? Thanks
Good Day everyone.
Apologies for the late reply, but we decided to go with an alternative approach. Instead of merging the 2 data sets, we are now filtering the second data set based on the national_id in the first data set. In this manner and for our purposes, we avoid the need for merging in the first place, thus being able to run a PROC SQL statement such as the below without issue:
proc sql;
create table destfp.finalds_20200131 as
select * from destfp.secondds_20200131
where id_no in (select distinct person_id from dsfirst.firstds_202001);
quit;
Yes, it is still a PROC SQL statement, however since we're not doing a join and the resultant data set is roughly a tenth of the size of the second data set, we are happy with the outcome.
Unfortunately, due to the sensitivity of the data we are working with, I am not able to post sample data, however the code is a near-identical representation of the actual code we are running.
Thanks everyone for the assistance. Special mention to @yabwon and @Kurt_Bremser for their ideas. Much appreciated.
Why do you think that SAS is still using the WORK folder?
@SGcabashe1 wrote:
@PaigeMiller I honestly have no idea. I suspect it could be a configuration we are missing, but my SAS Server Administration knowledge is currently limited.
This does not answer the question. You stated definitively that data sets were being written to the WORK folder.
however SAS is still writing to this Work folder
What do you see that makes you say this?
What is data structure of both sets, i.e. number of observations and variables? what is relation between `person_id` and `id_no`? is it 1-to-1? or 1-to-n?
Using SQL for joins will always cause the creation of utility files in WORK, which can grow to considerable size; see Maxim 10.
To determine which method is most feasible to solve your issue, we need to get to Know Your Data (Maxim 3). Please supply information about number of observations involved in your join, and about the key relationship (1:1, 1:many, many:1, many:many).
My first idea would be this:
data destfp.merge_20200131_a;
set destfp.secondds_20200131;
if _n_ = 1
then do;
person_id = id_no; /* just to set the attributes */
declare hash first (dataset:"dsfirst.firstds_202001 (keep=person_id)");
first.definekey('person_id');
first.definedata('person_id');
first.definedone();
end;
if first.check(key:id_no) = 0;
run;
If you can fit the hash into memory, then there is a very good chance you can now sort the tables, and use a MERGE to get the left join:
proc sort data=destfp.merge_20200131_a; by person_id; run; proc sort data=dsfirst.firstds_202001 by person_id; run; data destfp.merge_20200131; merge dsfirst.firstds_202001 (keep=person_id in=a) destfp.merge_20200131_a ; by person_id; if a; run;
This code is not tested, for lack of example data to test against.
Good Day everyone.
Apologies for the late reply, but we decided to go with an alternative approach. Instead of merging the 2 data sets, we are now filtering the second data set based on the national_id in the first data set. In this manner and for our purposes, we avoid the need for merging in the first place, thus being able to run a PROC SQL statement such as the below without issue:
proc sql;
create table destfp.finalds_20200131 as
select * from destfp.secondds_20200131
where id_no in (select distinct person_id from dsfirst.firstds_202001);
quit;
Yes, it is still a PROC SQL statement, however since we're not doing a join and the resultant data set is roughly a tenth of the size of the second data set, we are happy with the outcome.
Unfortunately, due to the sensitivity of the data we are working with, I am not able to post sample data, however the code is a near-identical representation of the actual code we are running.
Thanks everyone for the assistance. Special mention to @yabwon and @Kurt_Bremser for their ideas. Much appreciated.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.