Re: SAS DI Job : Join on large dataset

JJP1 · Posted 05-02-2019 09:51 AM

Hi Team,

We are trying to create SAS DI job where we are using SQL join transformation to join the datasets.

Almost all the datasets(having size as 2GB,3GB) we are trying to join fine and works fine with out any issues.but in the same join we are trying to join with the dataset that has 100+ GB size where it is always getting failed as "Insufficient work Space error " after running long time.

So from the error it is clear that there is no enough space.

Would you please help on effective way to handle this error please .Actually we should handle this case with out using option called "increasing work space size issue please" where this is the solution i got from google.

please suggest the best option .

ChrisNZ · Posted 05-02-2019 10:52 PM

Sorting the large data set on the join key(s) may avoid having to load the data in memory.

Each join is different, and this is a generic answer to generic question.

Provide more details for a more detailed answer.

High-Performance SAS Coding - Third Edition

JJP1 · Posted 05-03-2019 03:20 AM

Hi Chris,

Thanks for this.please find details below .

While happening below join the first step in SAS DI (sample code) itself where we are getting error as insufficent space error after running approximately 3 hours.please suggest the efficient way to handle this case please.let me know if you need any clarification please.

proc sql;
   create table work.xxxx
   select
< ALL Required Columns >
   from
      AAAAAAA left join                         /* AAAAAAA size is 2GB * /

      BBBBBBB                                  /* BBBBBBB size is  2.4GB * /                               
         on
         (
            AAAAAAA.ID = BBBBBBB.ID
         ) left join 

      CCCCCCC                               /* CCCCCCC size is 1.2GB * /
         on
         (
            BBBBBBB.ID1 = CCCCCCC.ID1
         ) left join 
      DDDDDDD                              /* DDDDDDD size is 1.4GB * /
         on
         (
            CCCCCCC.ID2 = DDDDDDD.ID2
         ) left join 
      EEEEEEE                           /* EEEEEEE size is 3GB */
          on
         (
            DDDDDDD.ID3 = EEEEEEE.ID3
         ) left join 
      FFFFFFF                               /* FFFFFFF size is 103 GB */
         on 
         (
            EEEEEEE.ID4 = FFFFFFF.ID4
         )
   ;
quit;

ScottBass · Posted 05-03-2019 12:03 AM

Provide example code.

Make sure you're not creating an unintended Cartesian product.

Know the cardinality of your join columns within your source tables (i.e. are the keys unique?)

Can you write your work output datasets to a different path that has more space?

You can try writing to a SPDE library with compression turned on - the compression is better on the physical file segments when using the SPDE engine.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

JJP1 · Posted 05-03-2019 03:21 AM

Hi Scott,

Thanks for this.please find details below .

While happening below join the first step in SAS DI (sample code) itself where we are getting error as insufficent space error after running approximately 3 hours.please suggest the efficient way to handle this case please.let me know if you need any clarification please.

proc sql;
   create table work.xxxx
   select
< ALL Required Columns >
   from
      AAAAAAA left join                         /* AAAAAAA size is 2GB * /

      BBBBBBB                                  /* BBBBBBB size is  2.4GB * /                               
         on
         (
            AAAAAAA.ID = BBBBBBB.ID
         ) left join 

      CCCCCCC                               /* CCCCCCC size is 1.2GB * /
         on
         (
            BBBBBBB.ID1 = CCCCCCC.ID1
         ) left join 
      DDDDDDD                              /* DDDDDDD size is 1.4GB * /
         on
         (
            CCCCCCC.ID2 = DDDDDDD.ID2
         ) left join 
      EEEEEEE                           /* EEEEEEE size is 3GB */
          on
         (
            DDDDDDD.ID3 = EEEEEEE.ID3
         ) left join 
      FFFFFFF                               /* FFFFFFF size is 103 GB */
         on 
         (
            EEEEEEE.ID4 = FFFFFFF.ID4
         )
   ;
quit;

Patrick · Posted 05-03-2019 01:18 AM

@JJP1

A SAS SQL join will need to physically sort the data unless there are indexes on the tables which can get used.

This intermediary sorting happens in the WORK (actually UTILLOC) area on disk and the required disk space is a multiple of the source data.

The error you're getting tells you that there isn't sufficient disk space in your Work location (either because you hit a quota or even worse you really take up all the disk space and other SAS session requiring Work space will fail as well).

You could try to create indexes on the tables as required by the joins. That would likely circumvent the out of work condition.

To give you any further advice and eventually also provide code you would need to post representative sample data (please as tested data steps creating this data), explain the join logic you want to implement and then also show us how the desired result using the sample data should look like.

And just to add another idea:

It appears that if the SAS SQL compiler "knows" that a SAS table is already pre-sorted in the right way then it doesn't do another sort anymore. Sooo.... You could also pre-sort your big table using a sort transformation or if it's already sorted the right way just "tell" the SQL join that it's already sorted. Below sample illustrates what I'm talking about.

data leftTbl;
  left_var='left';
  do id=1 to 10000;
    output;
  end;
  stop;
run;

data rightBigTbl;
  right_var='right';
  do id=1 to 10000;
    output;
    output;
  end;
  stop;
run;

/*proc sort data=rightBigTbl;*/
/*  by id;*/
/*run;*/

proc sql _method;
  create table want as
    select L.id, left_var, right_var
    from leftTbl L left join rightBigTbl(sortedby=id) R
    on L.id=R.id
    ;
quit;

From the SAS log:

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjm
              sqxsrc( WORK.RIGHTBIGTBL(alias = R) )
              sqxsort
                  sqxsrc( WORK.LEFTTBL(alias = L) )

Because dataset option "sortedby" has been added to the RIGHTBIGTBL the SQL compiler doesn't attempt to sort this table again (no sqxsort).

IF the table has been sorted using Proc Sort then sortedby will already be a table attribute and you don't need to add it explicitly to the SQL. IF you add the sortedby option but the data isn't actually physically sorted that way then you SQL will fail.

JJP1 · Posted 05-03-2019 03:19 AM

Hi Patrik,

Thanks for this.please find details below .

While happening below join the first step in SAS DI (sample code) itself where we are getting error as insufficent space error after running approximately 3 hours.please suggest the efficient way to handle this case please.let me know if you need any clarification please.

proc sql;
   create table work.xxxx
   select
< ALL Required Columns >
   from
      AAAAAAA left join                         /* AAAAAAA size is 2GB * /

      BBBBBBB                                  /* BBBBBBB size is  2.4GB * /                               
         on
         (
            AAAAAAA.ID = BBBBBBB.ID
         ) left join 

      CCCCCCC                               /* CCCCCCC size is 1.2GB * /
         on
         (
            BBBBBBB.ID1 = CCCCCCC.ID1
         ) left join 
      DDDDDDD                              /* DDDDDDD size is 1.4GB * /
         on
         (
            CCCCCCC.ID2 = DDDDDDD.ID2
         ) left join 
      EEEEEEE                           /* EEEEEEE size is 3GB */
          on
         (
            DDDDDDD.ID3 = EEEEEEE.ID3
         ) left join 
      FFFFFFF                               /* FFFFFFF size is 103 GB */
         on 
         (
            EEEEEEE.ID4 = FFFFFFF.ID4
         )
   ;
quit;

ScottBass · Posted 05-03-2019 03:50 AM

You don't need to post THE SAME EXACT CODE three times addressed to three different people. A simple "see code above", or better yet editing your original post, would suffice.

(Also, how are we part of your "team" when we don't even know you lol 😉 )

Since you're doing an outer join (i.e. you don't need a Cartesian product), and you've got large data volumes, I would approach this using an index key lookup.

See https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=p00hxg3x8lwivcn1f0e9axziw57y.htm&doc..., examples 7 & 8.

Something like this - untested and possible syntax errors. The doc and Google are your friends:

* are the keys unique? life will be better if they are! ;
proc datasets lib=work;
   modify AAAAAAA;
   index create ID / unique;  * technically this is redundant ;
   run;

   modify BBBBBBB;
   index create ID / unique;
   index create ID1 / unique;  * technically this is redundant ;
   run;
   
   modify CCCCCCC;
   index create ID1 / unique;
   index create ID2 / unique; * technically this is redundant ;
   run;
   
   modify DDDDDDD;
   index create ID2 / unique;
   index create ID3 / unique; * technically this is redundant ;
   run;
   
   modify EEEEEEE;
   index create ID3 / unique;
   index create ID4 / unique; * technically this is redundant ;
   run;
   
   modify FFFFFFF;
   index create ID4 / unique;
   run;
quit;   

data work.xxxx;
   * define PDV ;
   if 0 then set AAAAAAA BBBBBBB CCCCCCC DDDDDDD EEEEEEE FFFFFFF;
   call missing(of _all_);
   
   set AAAAAAA;                   * this will set ID ;
   set BBBBBBB key=ID / unique;   * this will use ID  in the PDV to find the matching ID  in BBBBBBB, which will set ID1 ;
   set CCCCCCC key=ID1 / unique;  * this will use ID1 in the PDV to find the matching ID1 in CCCCCCC, which will set ID2 ;
   set DDDDDDD key=ID2 / unique;  * this will use ID2 in the PDV to find the matching ID2 in DDDDDDD, which will set ID3 ;
   set EEEEEEE key=ID3 / unique;  * this will use ID3 in the PDV to find the matching ID3 in EEEEEEE, which will set ID4 ;
   set FFFFFFF key=ID4 / unique;  * this will use ID4 in the PDV to find the matching ID4 in FFFFFFF ;
   
   * SAS treats any non-match as an error, so reset the _error_ variable ;
   _error_=0;
run;

You'll likely need drop and/or keep statements. I recommend you work with small subsets until you wrap your head around how this works.

Hopefully this gets you started...

Edit: See also https://communities.sas.com/t5/SAS-Programming/Efficient-one-to-many-join-to-a-huge-dataset/m-p/5558... for a working example.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

JJP1 · Posted 05-03-2019 05:32 AM

Hi ,

Thanks for this,Actually we are using the mentioned code in sql join transformation in SAS DI application.

So we need to create a job in SAS DI to resolve the insufficent space issue.

I don't think we can create data steps in existing SAS DI job as we are currently using sql join transformation.

kindly suggest and please help in SAS DI application approach please.please correct me if i am wrong and let me know the efficient way to do this please.

ErikLund_Jensen · Posted 05-03-2019 06:56 AM

Hi @JJP1

First: You can write data steps in DI Studio. SAS supplies a transformation named "User Written Code" (Look in Transformations -> Data). If you use that, you just get an empty transformation in your job with input and output connections, and you can fill it out with your own code.

Next: I don't think it is a good idea to join 6 tables in one SQL join, because it is difficult to debug, and it is not easy to figure out how SQL handles the joining, so you might end op with many passes over the same table. Try to split it up.

In your case I would start with 4 joins covering the "small" tables: A-B, result-C, result_D and result-E. Remember to deselect "create view", so the code will run in each transfromation. The you can run the transfromations one by one and make sure they give the expected result, and you don't get cartesian products.

If you don't select all columns from all tables, you might get unwanted duplicate rows in the result, so remember to select 'distinct' in each join.

When that works, you could change to views in the 3 first transformations to save disk space, but keep a physical table as output in the last. Remember to save/close the job and open it again at this point to get rid of the work tables, and it might be a good idea to write the 4. output to a permanent table (at least in the development phase) so you don't have to rerun everything when you open the job, but can proceed directly to next step:

Add another SQL join to join the last result with the big table. You don't give any information about the content of your data, so we don't know if there will be key duplicates in the the result, and if there are key duplicates in the big table too. But if there are key duplicates in both tables, you will get a cartesian product, and the result will probably be useless even if you get it to run. In that case rethink your process, because there are flaws in your data model.

JJP1 · Posted 05-03-2019 08:56 AM

Hi @Erik,

Thanks for the First point,but we are planning to create a job where we need to do joining(all 6 tables result),and output of this joined data is agin getting subsett based on different conditions and finally created 3 SAS datasets from joined output.

Yes Erik on Next point i tried by joining 2 tables first with left join and did the resultant output left join with other table,and then result with 3rd table like wise,it was taking 10 minutes time to join two tables at a time.
it will work fine,but when it comes to join with the large dataset(103GB) it will give error after running very long time saying "Insufficient Space error " or sometimes "sort execution error"

Yes Erik I will try to do the 4 joins covering the "small" tables: A-B, result-C, result_D and result-E.but i am guessing that again it may failed when we join with large dataset (103GB) because we already ran by joining datasets between two(A & B) tables which are of maximum 3 GB size runs fine and gives output,and gets failed when it comes to this large data set after runnng 1 hr.

Thanks Erik i was getting duplicate records as we are not taking all columns.i will use select 'distinct' in each join which i was not knowing

Thanks Erik,i will try to create job as suggested by you as below

When that works, you could change to views in the 3 first transformations to save disk space, but keep a physical table as output in the last. Remember to save/close the job and open it again at this point to get rid of the work tables, and it might be a good idea to write the 4. output to a permanent table (at least in the development phase) so you don't have to rerun everything when you open the job, but can proceed directly to next step:

Also there are no duplicates in large dataset(103GB).

But i am just guessing even after creating the permanent table and doing left join with large table may get failed as i have tried running two tables each and comes to the step where we join with this large table always getting failed.

ErikLund_Jensen · Posted 05-03-2019 11:19 AM

Hi @JJP1

Re your comment to my first point - I fail to see how the structure of your planned job could prevent the use of a transformation with user-written code. To DI Studio there is no difference between a standard transformation like an Extract and a transformation with User-Written code. it is just a "box" holding some sas code and connected via tables to other "boxes", so it can go between two joins, if you need to do something that is beyond the scope of the prepacked transformations.

An then to your problem. I can't see from the information given if the problem is insuffecient space in the work library or a problem with other system ressources, but today memory is normally not a problem, if the SAS installation is configured to use it, so it might be a disk space problem. I suppose you have already run the cleanwork utility to free unused space in the work library.

I suggest a couple of experiments to see if the problem can be circumvented by reducing the amount of space needed in the work library. All transformations in a DI Studio job are running in the same process, so all work tables are retained until the process finishes. Try to split in two jobs, the first with the small table joins ending in a permanent table, and let the 4. join end with sorting on the "next" join key. Explanation follows.

Then start the next job with a sort just for testing purposes. Is it possible for you to sort the big dataset on the join key without running out of ressources - I mean just connect the table to a sort transformation, specify the key and let it write to the default work table? -

If it is not working, try to sort to a permanent table in a library defined with compress=yes, or compress=binary if you have long character variables. Be sure to register the output table from the sort transformation instead of using a loader, as the whole idea is to avoid work tables.

If that works, then you have no work tables left, but two permanent tables, both sorted on the merge key. You could try a join, but with both tables sorted, you have an extra option, because you can merge the two tables in a data step (requiring the dreaded User Written Code), and merge is very light on ressources, because it don't need any sort tables in the background,

JJP1 · Posted 05-06-2019 06:47 AM

Hi Erik,

I have created two permanent datasets where one dataset holds the join output of all small tables

and the other dataset only holds the sorted large dataset output (103 GB) size,(sorted successfully and created sorted output in permanent dataset)

and finally used merge statement to run both permanent tables based on the merge key , after merge statement started approximately after 3 minutes. dataset was updated up to till 7 GB data and was given below errror in log and merge statement was failed.

ERROR: Insufficient space in file AAA.FINALJOIN.DATA.
ERROR: File AAA.FINALJOIN.DATA is damaged. I/O processing did not complete.

WARNING: The data set AAA.FINALJOIN may be incomplete. When this step was stopped there were 28058873 observations and 41
variables.

Would you please help is there any other way to resolve this please,as yet i need to check on creating index option.

Thanks

ChrisNZ · Posted 05-06-2019 06:50 PM

>ERROR: Insufficient space in file AAA.FINALJOIN.DATA.

Would you please help is there any other way to resolve this please,as yet i need to check on creating index option.

This indicates insufficient disk space.

Your disk does not have enough space for the old table and the new one being created.

Which is strange considering you said you sorted it it, which requires the same amount of space for the permanent table.

Look at making space on that disk. Maybe it's a quota issue?

Also look at using SPDE as already suggested by @ScottBass to increase compression.

libname SPEEDY spde "%sysfunc(pathname(AAA))" compress=binary partsize=500g;

Use that library instead of AAA.

High-Performance SAS Coding - Third Edition

JJP1 · Posted 05-07-2019 02:24 AM

Hi @ChrisNZ ,

I have run the SPDE libname statement successfully .and i tried creating index for all tables,but iam getting below error for large table when trying to create index.please suggest.

ERROR: Insufficient space in file AAA.TABLE.INDEX.
ERROR: Requested operation denied. Index file AAA.TABLE.INDEX is damaged.
ERROR: Insufficient space in file AAA.TABLE.INDEX.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away