Solved: Re: Automate SAS code to wait for sorted datasets - Page 2

JJP1 · Posted 05-20-2019 02:42 AM

The reason for splitting up for source tables is

1.the source table has 800+ million of records and this table has to join with some other tables and finally create target dataset.

2.initially we tried like doing joining with proc sql and getting the final output.but this takes a long time and it will not be completely processing this ,i will get errors like "insuffcient space".so we decided to split the tables and join or merge in order to complete this whole process with very less time please.

3.main aim to split is to do this whole process with very less time and avoid to get space issues.

Iam really trying to understand what @KurtBremser mentioned in this thread.as i am unable to grasp fully please and i am not sure how can i proceed on this please.sorry to trouble even after so much explanation i am asking like this please.

Would it be possible to help how hash code mentioned below works please and whether it will split please

Patrick · Posted 05-20-2019 02:59 AM

@JJP1

Thank you for explaining the reason for splitting. That helps. And no reason for apologies. I'd say you get that much "resistance" from "us" because we don't want you to go down the wrong rabbit hole.

Nothing will be "very less time" with that big tables but there are potentially ways to combine data from different sources without the need to sort everything physically on disk. What's possible and "best" depends on what you have and what you need.

I believe it would be worth to look into the bigger problem to not solve something that eventually doesn't require solving (like coding for a "split-sort"). For us to give some guidance for the bigger challenge you would need to provide more info like:

1. Name of source tables (assumed all stored as SAS tables, else please specify if any database involved)

2. Number of rows per source table and eventually also the size per table

3. The code of the actual join which you've got working already for lower volumes (so we can understand the join logic)

4. The length of the variables used for joining (so we can estimate memory consumption for any in-memory approach)

5. Is there any need for the resulting target table to be sorted in a specific way for further downstream processing.

JJP1 · Posted 05-20-2019 04:07 AM

Thanks @Kurt_Bremser for kind reponse .

Hi @Patrick ,

please find the details below

1. Name of source tables (assumed all stored as SAS tables, else please specify if any database involved)

All stored as SAS tables

2. Number of rows per source table and eventually also the size per table

Size
Library of Size of
Name Member Name File File
-------------------------------------------------------------

XXX AAA 2GB 1931520K
XXX BBB 102GB 1.0673E8K
XXX CCC 1GB 1445376K
XXX DDD 3GB 2657024K
XXX EEE 1GB 1227008K
XXX FFF 2GB 2466048K

Table_Name Record count
AAA - 22758772
BBB - 8.1021E8
CCC - 28754734
DDD - 27547919
EEE - 24549490
FFF - 24499890

3. The code of the actual join which you've got working already for lower volumes (so we can understand the join logic)

tried joining small tables and placed in one permanent dataset and sorted large dataset and placed in one permanent dataset. and tried merging.but it takes more time and insufficient space issue

code for actual join is

PROC SQL;
   CREATE TABLE joined1 AS 
   SELECT t1.AAAAAAAAAAA, 
          t1.PPPPPPPPPPPP, 
          t1.CCCCCCCCCCCC, 
          t1.DDDDDD, 
          t1.TTTTTTTTTTT, 
          t1.SSSSSSSSS, 
          t1.GGGGGGGGGGGGG, 
          t2.KKKKKKKKKKKK, 
          t2.FFFFFFFFFFFF, 
          t2.YYYYYYYY, 
          t2.RRRRRRR, 
          t2.EEEEEEEEEEEE, 
          t2.UUUUUUUUUUUUU, 
          t2.ZZZZZZZZZZZZZZZZ, 
          t2.LLLLLLLLLLLL, 
          t3.JJJJJJJJJ, 
          t3.NNNNNNNNNNNNNN, 
          t4.OOOOOOOOOOOO, 
          t4.VVVVVVVVVVVV, 
        
      FROM AAA t1
           left JOIN FFF t2 ON (t1.AAAAAAAAAAA = t2.AAAAAAAAAAA)
           left JOIN EEE t3 ON (t2.KKKKKKKKKKKK = t3.KKKKKKKKKKKK)
           left JOIN CCC t4 ON (t3.JJJJJJJJJ = 
          t4.JJJJJJJJJ);
QUIT;

PROC SQL;
   CREATE TABLE joined2 AS 
   SELECT t1.*, 
          t2.YYYYYYYYYYYYYYY, 
          t2.SAMPLE, 
          t2.SAMPLE1, 
          t2.SAMPLE2, 
          t2.SAMPLE3, 
          t2.SAMPLE4, 
          t2.SAMPLE5, 
          t2.SAMPLE6, 
          t2.SAMPLE7, 
          t2.SAMPLE8, 
          t2.SAMPLE9, 
          t2.SAMPLE10
      FROM joined1 t1
           left JOIN DDD t2 ON (t1.OOOOOOOOOOOO = t2.OOOOOOOOOOOO);
QUIT;


proc sort data=xx.BBB out=BBB;(large dataset)
  by YYYYYYYYYYYYYYY QQQQQQQQQQ;
run;
proc sort data=joined2;
  by YYYYYYYYYYYYYYY;
run;
data joined3;
  merge joined2(in=ina) BBB(in=inb);
  by YYYYYYYYYYYYYYY;
  if ina;
run;

4. The length of the variables used for joining (so we can estimate memory consumption for any in-memory approach)

the length of the columns i m using to join the columns are : 16,23,32,28,23

please excuse me for mentioning in code as table names and columns names as sudo values please

5. Is there any need for the resulting target table to be sorted in a specific way for further downstream processing

Yes The resultant target table further needs to be subset-ted to 3 different dataset based on different conditions please

Kurt_Bremser · Posted 05-20-2019 04:16 AM

@JJP1 wrote:

PROC SQL;
   CREATE TABLE joined1 AS 
   SELECT t1.AAAAAAAAAAA, 
          t1.PPPPPPPPPPPP, 
          t1.CCCCCCCCCCCC, 
          t1.DDDDDD, 
          t1.TTTTTTTTTTT, 
          t1.SSSSSSSSS, 
          t1.GGGGGGGGGGGGG, 
          t2.KKKKKKKKKKKK, 
          t2.FFFFFFFFFFFF, 
          t2.YYYYYYYY, 
          t2.RRRRRRR, 
          t2.EEEEEEEEEEEE, 
          t2.UUUUUUUUUUUUU, 
          t2.ZZZZZZZZZZZZZZZZ, 
          t2.LLLLLLLLLLLL, 
          t3.JJJJJJJJJ, 
          t3.NNNNNNNNNNNNNN, 
          t4.OOOOOOOOOOOO, 
          t4.VVVVVVVVVVVV, 
        
      FROM AAA t1
           left JOIN FFF t2 ON (t1.AAAAAAAAAAA = t2.AAAAAAAAAAA)
           left JOIN EEE t3 ON (t2.KKKKKKKKKKKK = t3.KKKKKKKKKKKK)
           left JOIN CCC t4 ON (t3.JJJJJJJJJ = 
          t4.JJJJJJJJJ);
QUIT;

PROC SQL;
   CREATE TABLE joined2 AS 
   SELECT t1.*, 
          t2.YYYYYYYYYYYYYYY, 
          t2.SAMPLE, 
          t2.SAMPLE1, 
          t2.SAMPLE2, 
          t2.SAMPLE3, 
          t2.SAMPLE4, 
          t2.SAMPLE5, 
          t2.SAMPLE6, 
          t2.SAMPLE7, 
          t2.SAMPLE8, 
          t2.SAMPLE9, 
          t2.SAMPLE10
      FROM joined1 t1
           left JOIN DDD t2 ON (t1.OOOOOOOOOOOO = t2.OOOOOOOOOOOO);
QUIT;


proc sort data=xx.BBB out=BBB;(large dataset)
  by YYYYYYYYYYYYYYY QQQQQQQQQQ;
run;
proc sort data=joined2;
  by YYYYYYYYYYYYYYY;
run;
data joined3;
  merge joined2(in=ina) BBB(in=inb);
  by YYYYYYYYYYYYYYY;
  if ina;
run;

And where in this code does the "insufficient disk space" happen? While sorting the large table, or during the merge?

Please post the whole log of that failing step.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

JJP1 · Posted 05-20-2019 04:58 AM

during the merge it is failing.actually i need to rerun the job again to get the log.

But it takes more time.please suggest.also we need to make the whole process in an efficient way.

so we tried loop option,it works but takes nearly 20 hrs please.

so after this we decided split and sort ,merge ways.

please let me know above all is there any other best approach please as i have a little knowledge please

Kurt_Bremser · Posted 05-20-2019 05:12 AM

Your problem:

In your WORK there is (before the merge) a sorted copy of dataset BBB, and your table joined2.

To do the merge, you need to have the space for BBB again, but also additional space to accommodate the additional columns you get from joined2. Depending on the relationship between those tables with regard to the key, you might even get multiple observations for some keys, expanding the dataset even more.

Since, if you split your data, you will have all the splits and the final dataset in WORK before you are finished, that will also fail.

Either increase the available space in WORK (make sure that no leftovers from previous sessions "eat" space, run cleanwork!), or create intermediate results or the final table in a library where you have the necessary space free.

Also look if using the compress=yes option reduces physical dataset size sufficiently to get you through. If you have lots of character columns with mostly empty contents, this might help significantly.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 05-20-2019 05:11 AM

@JJP1

You're dealing with some serious volumes here so I guess there won't be sufficient memory for most of what you need to do.

Are the sizes you've shared for compressed or uncompressed tables? And do you need all the columns in the tables or is the size of what you need eventually much lower?

Also: To decide if something could fit into memory I'd need the lengths of the columns you need (both in the join and select statement).

From what you write performance is important to you but first priority I guess is code which you can run at all without getting into an out of space error condition. Assuming you really need target table JOINED3 sorted, first measure you could take is using a TAGSORT for table BBB. This will only create utility files for the sort variables and though should require much less work space.

proc sort data=xx.BBB out=BBB(compress=yes) TAGSORT;
  by YYYYYYYYYYYYYYY QQQQQQQQQQ;
run;

IF JOINED3 doesn't need to be sorted then the Hash lookup approach I've posted earlier could eventually work if column YYYYY... from table JOINED2 can fit into memory. This approach wouldn't require sorting of table BBB at all.

And last but not least: Make sure all your tables are compressed. Either by setting OPTIONS compress=yes; or then having compress defined as data set option when creating the tables.

You could also free-up work space by deleting JOINED1 after your 2nd SQL.

proc datasets lib=work nolist nowarn;
  delete joined1;
  run;
quit;

JJP1 · Posted 05-20-2019 05:36 AM

HI @Patrick ,

The code i pasted is working but it takes more time and gives space issue please,so we come up with new approach like doing splitting the dataset to 8 subsets please ,sorting and merging.

Splliting,sorting, and merging part method we did not try and check develop any code please.

can we do this for "Splliting,sorting, and merging part" instead of proc sql pasted code please.

Yes @Patrick ,deleted JOINED1 after 2nd SQL please.

sorry iam confusing you please suggest how can i start this please

Patrick · Posted 05-20-2019 05:51 AM

@JJP1

So you would want to split table BBB by variable YYYYYY...? Not that this can't be done but I simply can't see how this could improve performance in any way. Rather the other way round: Whatever we could come up with would require additional passes through the data and though decrease performance.

I guess you believe the performance improvement would then come from some sort of parallel processing like for sorting the the table "partitions". If you've got SAS/Connect licensed then it is possible to do such things but.... additional data movements likely not improving performance especially when it's then multiple processes competing for the same I/O.

The likely bottleneck is disk I/O. For this reason a TAGSORT on BBB as a single table might be the most efficient sort algorithm.

An alternative would be to use the SPDE engine (instead of Base) and define multiple disks both for storage of the table and utility files.

....what's possible and could help depends on your environment and for table BBB would also require a change to the upstream process which creates the table in first place (as that's where you would need to use the SPDE engine first).

Adding in @ChrisNZ as this discussion feels very much to be in his corner. May be he's got some additional ideas.

JJP1 · Posted 05-20-2019 06:01 AM

Tahnks @Patrick ..you mean that if use below code and do the joining it will be more efficient rather than the splitting below tables and joining parallely please and it takes less time please ?

options compress=yes;
proc sort data=xx.BBB out=test tagsort;
by id num;
quit;

Patrick · Posted 05-20-2019 06:03 AM

@JJP1

Yes, that's what I believe to be true.

...and if you don't need all the columns from BBB then also add a keep statement like:

proc sort data=xx.BBB(keep=var1 var2....)

JJP1 · Posted 05-20-2019 06:26 AM

Hi @Patrick ,

I am using below code where i do not want the column(being used in by sttamenet) to be taken in final sorted dataset bigtable. iam getting error ID not found .please help

options compress=yes;
proc sort data=XX.BB
(keep=col col2 col3 )
out=bigtable tagsort;
by ID fff ;
quit;

ERROR: ID not found.

Kurt_Bremser · Posted 05-20-2019 06:33 AM

@JJP1 wrote:

Hi @Patrick ,

I am using below code where i do not want the column(being used in by sttamenet) to be taken in final sorted dataset bigtable. iam getting error ID not found .please help
options compress=yes;
proc sort data=XX.BB
(keep=col col2 col3 )
out=bigtable tagsort;
by ID fff ;
quit;

ERROR: ID not found.

This means that ID is not a column in dataset XX.BB.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 05-20-2019 06:45 AM

Try this for better compression:

libname SP spde "%sysfunc(pathname(WORK))" compress=binary partsize=500g;
proc sort data=XX.BB(keep=C1 C2 C3 ID FFF) out=SP.SORTED(keep=C1 C2 C3) tagsort;
  by ID FFF ;
quit;

It might even work without the tagsort option.

High-Performance SAS Coding - Third Edition

JJP1 · Posted 05-20-2019 07:03 AM

Hi @ChrisNZ ,

I am using below code.but i can not see the SP library under SASEG libraries.to see the sorted dataset. may i kindly request where can i see this dataset SP.SORTED" further to join with other tables please

libname SP spde "%sysfunc(pathname(WORK))" compress=binary partsize=500g;
proc sort data=XX.BB(keep=C1 C2 C3 ID FFF) out=SP.SORTED(keep=C1 C2 C3) tagsort;
  by ID FFF ;
quit;

May i kindly request why compress=binary and partsize=500g using options please ?

SAS Innovate 2025: Call for Content

Classroom Training Available!