SAS_in_Florida
Fluorite | Level 6

I've never used forums before, and this may be pretty easy, but I am struggling with it at the moment. Any suggestions are appreciated.

 

Currently using SAS Studio, but we might be moving to SAS Viya in the future.

 

I have an inventory of ~100k records that I am joining/merging with another table. The time it takes to run the query exceeds the allowed amount of time to run against the data tables, and the program is aborted as a result. My goal is to split my inventory into smaller inventory files of ~10K records, run the query multiple times, and set the results back together upon completion. If the record count were the same each time, I could just hard code it. However, the inventory record count can vary daily.

 

Inventory - 111,000

INV_1 = 10K

INV_2 = 10K

...

INV_11 = 1K

 

If 1st 10K records,

Join FILE_A with INV_1

 

If 2nd 10K records,

Join FILE_A with INV_2

 

...

If within the last 10K records (1,000 in this generic example),

Join FILE_A with INV_11

 

data final;
  set INV_1 INV_2 ... INV_11;
run;

  

Like I said, I believe this is probably pretty easy, but I cannot think it through. I'll probably dream up a way to code it, but I'm struggling at the moment and not very good with macros.

 

Thanks for any assistance you can provide.

 

 

 

22 REPLIES
Mazi
Pyrite | Level 9
Hi, can I ask if you’ve tried adding an index to improve performance?

Another option could be the use of a hash object.
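
For illustration, here is a minimal sketch of a hash-object lookup. It assumes the small inventory table is WORK.INV, the large table is WORK.FILE_A, and both carry a join variable named KEY; all of these names are placeholders, not the poster's actual data.

data matched;
   /* load the small table into memory once, on the first iteration */
   if _n_ = 1 then do;
      declare hash h(dataset: "work.inv");
      h.defineKey("key");
      h.defineDone();
   end;
   set work.file_a;        /* stream the large table */
   if h.find() = 0;        /* keep only rows whose KEY exists in INV */
run;

If non-key variables from INV were needed as well, they would be added with defineData, but as a pure existence filter the find() call is enough.
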
SAS_in_Florida
Fluorite | Level 6

The indexing of the file would be ideal. Unfortunately, the file that I am joining my inventory with is indexed on a secondary key that we cannot use. There have been requests to index the file (in Teradata), but this has not been prioritized, as I believe we may be switching to Snowflake.

 

I have not used hash objects yet, but I will review. Thank you for the suggestion.

ballardw
Super User

@SAS_in_Florida wrote:

The indexing of the file would be ideal. Unfortunately, the file that I am joining my inventory with is indexed on a secondary key that we cannot use. There have been requests to index the file (in Teradata), but this has not been prioritized, as I believe we may be switching to Snowflake.

 

I have not used hash objects yet, but I will review. Thank you for the suggestion.


Which also suggests turning on the diagnostic bits for the connection to Teradata, if that is what you are doing for this exercise. Without seeing what you are actually doing, perhaps you want pass-through code to execute on the Teradata server (which almost certainly won't make the DB admin happy with multiple temporary tables).

SAS_in_Florida
Fluorite | Level 6

Yes, that is kind of what is being done: taking the inventory and sending it to Teradata for the join and, upon completion, deleting the temp table. I am trying to limit the variables sent to Teradata and doing an inner join on the matching key(s), but it is unbearably slow. When matching just 1 record where A.key = B.key, it took about 20 minutes.

 

Sorry for painting an unfinished picture.

 

In my mind, I am trying to do the following.

1. Create a counter of the # of records in my inventory - RECORD_COUNT

2. Loops = CEIL(RECORD_COUNT/10000)

3. If the count of Inventory <= 10K, send it to the Temp_TERADATA location, run the QUERY, send output to RESULTS_TMP1, and delete Temp_TERADATA (repeat for loops 2-N and set Results 1-N into the FINAL_QUERY results)

4. Once I have the matching records, I can then merge/join the results back with my original Inventory to bring back all needed variables for the end reporting.

 

Mazi
Pyrite | Level 9 (Accepted Solution)

In response to your question, I have come up with the following, though I do think indexing would be the best way to solve this issue.

 

/* Build example data: a large table and a small lookup table */
data original;
   do j = 1 to 10**6;
      i = rand('integer', 1, 100000);
      output;
   end;
   drop j;
run;

data test;
   do i = 1 to 10**5;
      j = i * i;
      output;
   end;
run;

%macro split(data=);
%local split_parts i elif;

%let elif=;

/* How many 10,000-row chunks does the input need? */
proc sql noprint;
   select ceil(divide(count(*), 10000)) into :split_parts trimmed
   from &data;
quit;

/* Split the input into inv_1 ... inv_N, 10,000 rows apiece */
data
%do i = 1 %to &split_parts;
   inv_&i
%end;
;
   set &data;
   part = ceil(divide(_n_, 10000));
%do i = 1 %to &split_parts;
   &elif if part = &i then output inv_&i;
   %let elif = else;
%end;
run;

/* Join each chunk to the large table separately */
%do i = 1 %to &split_parts;
proc sql;
   create table merge&i as
   select a.i, b.j
   from original as a
        inner join inv_&i as b
        on a.i = b.i;
quit;
%end;

/* Concatenate the per-chunk results */
data final;
   set
%do i = 1 %to &split_parts;
   merge&i
%end;
   ;
run;
%mend split;

%split(data=test)

SAS_in_Florida
Fluorite | Level 6

I do agree indexing would be beneficial for this project and ALL others, but I do not see them putting in the effort for the changes needed. I will try working with this. Thank you.

Mazi
Pyrite | Level 9
Please let me know if this is what you had in mind.
ballardw
Super User

Please show the actual code of your join/merge operation(s).

Unless you have a very slow network or someone has set extremely short limits on your options, most things involving 100K or so records should be unlikely to time out unless the code is very inefficient. So showing the code may show us some places to help.

Also, best would be to show a LOG so we can see how long successful steps are taking.

 

One common thing we see that extends run time is SQL where a full join is done and then a WHERE clause is used to filter the results. These are often amenable to changing to another type of join with ON conditions, and they run faster.
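
As a hedged illustration of that pattern (table names t1/t2 and columns key, var1, var2, and flag are invented here), the two queries below return the same rows, but the second expresses the restriction in the join itself, which tends to matter most when the SQL is pushed down to a database:

proc sql;
   /* join written as a Cartesian product that WHERE then filters */
   create table slow as
   select a.var1, b.var2
   from t1 as a, t2 as b
   where a.key = b.key and b.flag = 'Y';

   /* the restriction is part of the join condition */
   create table fast as
   select a.var1, b.var2
   from t1 as a
        inner join t2 as b
        on a.key = b.key and b.flag = 'Y';
quit;
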

SAS_in_Florida
Fluorite | Level 6

We are using Teradata, and the network is very slow M-F. The SQL code I am using is maxed out on the WHERE clause criteria. I want to say I can share code, but a large organization may frown upon this. I know that is not an ideal response, but I would rather not poke the bear. I will try to mock up a generic duplication of the code, but I believe the main issue is with the data source limitations and not having the file's join key indexed properly. I have used similar code at a prior company, and the code ran quickly.

 

Thank you for your response.

ballardw
Super User

@SAS_in_Florida wrote:

We are using Teradata, and the network is very slow M-F. The SQL code I am using is maxed out on the WHERE clause criteria. I want to say I can share code, but a large organization may frown upon this. I know that is not an ideal response, but I would rather not poke the bear. I will try to mock up a generic duplication of the code, but I believe the main issue is with the data source limitations and not having the file's join key indexed properly. I have used similar code at a prior company, and the code ran quickly.

 

Thank you for your response.


So paraphrase the code. Use different table names, like just t1, t2, t3, and variables like var1, var2, var3. It doesn't have to list every single one, just enough to show the general structure of the code. If you think the WHERE clause is the issue, then a very close approximation of it is needed. If you are looking for literal values, then use text like "Value1" "Value2" instead of your actual ones if there is a concern over sensitive data.

 

 

Are you currently using pass-through to Teradata, or repeatedly hitting the server with requests from SAS?

There are diagnostic options for seeing what is passing over the connection to Teradata. You really do need to show what your connection to Teradata looks like. Replace server and schema names and login/password with generic text, but the OPTIONS used may be critical.

 

The online help for SAS/ACCESS Interface to Teradata has a section "Maximizing Teradata Load and Read Performance".

Have you seen that part of the documentation? There are about 20 topics right there to peruse.

SASKiwi
PROC Star

With all due respect, I don't think you have fully explained your use case well enough for us to offer the most effective advice. It appears you are joining a SAS dataset with a Teradata table. Since you appear to be using just 'normal' SQL, this will execute on your SAS server, which means ALL the Teradata data will have to be read onto your SAS server before the join can happen. Please correct my assumptions here if they are not correct.

 

A much better and way faster strategy would be to upload your SAS dataset into a temporary Teradata table, do ALL of your joining in Teradata, then read the new table back to SAS. This is going to require use of SQL pass-through - have you used this before?
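
As a sketch of that upload-then-pass-through pattern (the server, credentials, and the names td, inv_keys, PRODDB.BIG_TABLE, and other_col are all placeholders, not the poster's objects):

libname td teradata server="tdserv" user="&tduser" password="&tdpw"
        connection=global dbmstemp=yes;  /* dbmstemp=yes creates volatile temp tables */

data td.inv_keys;                        /* upload just the join keys */
   set work.inv(keep=key);
run;

proc sql;
   connect using td;                     /* reuse the libname's global connection */
   create table work.results as
   select * from connection to td
      ( select a.key, a.other_col
        from PRODDB.BIG_TABLE a
             inner join inv_keys b       /* the volatile table uploaded above */
             on a.key = b.key );
   disconnect from td;
quit;

Because only the matching rows come back over the network, the transfer cost drops along with the join cost.
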

Reeza
Super User

Are you limiting your data set to 10,000 for joining because of macro variable length or the code submitted to Teradata becomes too long?

Patrick
Opal | Level 21

@SAS_in_Florida wrote:

I've never used forums before, and this may be pretty easy, but I am struggling with it at the moment. Any suggestions are appreciated.

 

Currently using SAS Studio, but we might be moving to SAS Viya in the future.

 

I have an inventory of ~100k records that I am joining/merging with another table. The time it takes to run the query exceeds the allowed amount of time to run against the data tables, and the program is aborted as a result. My goal is to split my inventory into smaller inventory files of ~10K records, run the query multiple times, and set the results back together upon completion. If the record count were the same each time, I could just hard code it. However, the inventory record count can vary daily.

@SAS_in_Florida SAS Studio won't go away. It is the main client for SAS Viya.

 

Soo...

  1. You've got some big table in Teradata, and you've got a relatively small table (100k rows) that you need to join.
     a. Is the small table in SAS or in Teradata?
     b. How big is the Teradata table (number of rows)?
  2. You need to join the big table with the small table (inner join) and then get the result set back into SAS for further processing.
     a. What's the relationship between the tables? Would the result set be 100k rows, or could it be much more?
     b. Ensure that the join key (single or composite) in your small table is unique.
  3. How do you know that it's the query in Teradata that takes too long, and not just the data transfer between Teradata and SAS? (See the sketch after this list.)
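
One hedged way to answer that last question (connection options and the names PRODDB.BIG_TABLE and TEMPDB.INV_KEYS are placeholders): run the join once through explicit pass-through returning only a COUNT(*), so almost nothing is transferred, and compare its real time in the log against the full extraction.

proc sql;
   connect to teradata as TERA (<connection options>);
   select cnt into :n from connection to TERA
      ( select count(*) as cnt
        from PRODDB.BIG_TABLE a
             inner join TEMPDB.INV_KEYS b
             on a.KEY = b.KEY );
   disconnect from TERA;
quit;

If the COUNT(*) version is already slow, the bottleneck is the Teradata-side query; if it is fast, the time is going into the transfer back to SAS.
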

 

The usual approach for a performant join between a "small" SAS table and a big DB table is to first upload the join keys of the small table into a Teradata table (temp table or permanent table), execute the join in Teradata using explicit pass-through SQL, and then transfer the result set back to SAS. Are you allowed to create a table in Teradata (permanent or temporary)?

 

What you also need to "optimize" is in what chunks SAS exchanges data with Teradata, because the defaults are often too low. Verify that for your libname/connection to Teradata the following options have "reasonable" values: readbuff=<some value>, insertbuff=<some value>, dbcommit=0.
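
A hedged example of those options on a Teradata libname (the server, credentials, and buffer values are placeholders; suitable sizes depend on row width and network):

libname td teradata server="tdserv" user="&tduser" password="&tdpw"
        readbuff=5000 insertbuff=5000 dbcommit=0;
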

 

If you don't use explicit pass-through SQL, then also use the following options to verify that processing gets fully pushed to the DB:
options sastrace=',,,d' sastraceloc=saslog nostsuffix msglevel=i;

SAS_in_Florida
Fluorite | Level 6

Thank you for all the feedback and suggestions. Hopefully this clarifies my original question some. I was taking population A records and sending them to a temp table in Teradata so that I can join with another table within Teradata. That Teradata table is not indexed on the primary merge key, and changing that is not in scope. The query runs very slowly against millions of transactional records, and as a result the process times out due to limitations within our SAS environment.

I was trying to find a way to split the temp table into smaller populations so that I can run the query N number of times based upon the volume. However, I did not want to hard code the number of loops or temp tables because the volume could change. Some of the suggestions that were made helped clear my mental block, and I came up with the following. I ran this on a much smaller population of 101 records so that I could do some quick testing, and it appears to be working. If there are any major inefficiencies, I would not be offended if you smack my hand and advise better ways, as I am always open to learning.

 

options obs=101;   /* small test population for quick checks */

data INV(keep=KEY vol_j);
   set INVENTORY.LOCATION;
   vol_j = 1;
run;

proc sort data=INV;
   by vol_j KEY;
run;

options obs=max;

/* Number the records within the single vol_j group */
data INV;
   set INV;
   by vol_j KEY;
   if first.vol_j then order_j = 1;
   else order_j + 1;
run;

/* Derive the number of loops from the record count and records per run */
data loop_number;
   set INV(keep=order_j);
   vol_run = 10;
   loopnum = ceil(order_j / vol_run);
   keep vol_run loopnum;
run;

proc sort data=loop_number;
   by vol_run descending loopnum;
run;

proc sort data=loop_number nodupkey;
   by vol_run;
run;

data _null_;
   set loop_number;
   call symputx('loop_end_num', loopnum);
   call symputx('vol_per_run', vol_run);
run;

/**********************************************************************************/
%macro JM_QUERY;
   proc sql;
      /* connection options are site-specific placeholders */
      connect to teradata as TERA (<connection options>);
      create table NEW_INV as
      select * from connection to TERA
         ( select a.KEY
                , b.KEY as KEY_B   /* alias avoids a duplicate column name */
           from INVENTORY.TERA   a
                inner join
                TEMPFILE.TERA_&J b
                on a.KEY = b.KEY );
      disconnect from TERA;
   quit;
%mend JM_QUERY;

%macro J_LOOP;
%do j = 1 %to &loop_end_num;

   /* Select this loop's slice of the inventory */
   data Inventory_&J;
      set INV;
      if order_j GT (&vol_per_run * (&J - 1)) and
         order_j LE (&vol_per_run * &J);
   run;

   /* Refresh the Teradata-side temp table for this slice */
   %if %sysfunc(exist(TEMPFILE.TERA_&J)) %then %do;
      proc delete data=TEMPFILE.TERA_&J; run;
   %end;
   data TEMPFILE.TERA_&J;
      set Inventory_&J;
   run;

   %JM_QUERY

   /* Accumulate this run's matches */
   %if &J = 1 %then %do;
      data NEW_INV_ALL;
         set NEW_INV;
      run;
   %end;
   %else %do;
      data NEW_INV_ALL;
         set NEW_INV_ALL NEW_INV;
      run;
   %end;

%end;
%mend J_LOOP;
%J_LOOP

Thank you again for all of the responses and suggestions. I was unsure if anyone would respond to the forum, and I have been very pleased with what appears to be a very supportive network of professionals!

 

 

 
