Hello friends,
I have a very big data set with many columns, but only three of them matter here: id, date, and LoanBalance.
The task is to create a new data set that contains exactly one row per customer ID: the row with the maximum date for that ID.
The code below is not efficient; it ran for a very long time and never finished.
My question: what is the most efficient way to do this?
/*select one row per group based on max value*/
data have;
input id date : date9. LoanBalance;
format date date9.;
cards;
1 01JUN2023 1000
1 02JUN2023 980
1 03JUN2023 950
1 04JUN2023 920
2 01JUN2023 3000
2 02JUN2023 2800
2 03JUN2023 0
;
run;
proc sql;
create table want as
select id, date, LoanBalance
from have
group by id
having date=max(date)
;
quit;
proc sort data=have;
   by id date;
run;

data want;
   set have;
   by id;
   if last.id;
run;
The summary functions in a HAVING clause do carry a certain amount of overhead.
The code above may still take time to sort. If the source table has many more variables, subset with KEEP= first to reduce disk space and read/write times.
To speed up the process, reduce the data horizontally in the first step:
proc sort
data=have (keep=id date loanbalance)
out=intermediate
;
by id date;
run;
data want;
set intermediate;
by id;
if last.id;
run;
Since the data are already sorted by ID, you can pass through each ID group twice: the first pass finds the maximum value, and the second pass outputs the observation(s) that match that maximum.
data have;
input id date : date9. LoanBalance;
format date date9.;
cards;
1 01JUN2023 1000
1 02JUN2023 980
1 03JUN2023 950
1 04JUN2023 920
2 01JUN2023 3000
2 02JUN2023 2800
2 03JUN2023 0
;
run;
data want (drop=_:);
   set have (in=firstpass) have (in=secondpass);  /* read each ID group twice */
   by id;
   retain _max;
   if first.id then _max=loanbalance;
   else _max=max(_max,loanbalance);
   /* on the second pass, keep only rows that match the maximum */
   if secondpass and loanbalance=_max;
run;
This assumes there is only one instance of a maximum value. If not, and you still want only one instance per id, then a rule is needed to choose among the tied observations.
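For example, here is a minimal sketch of such a tie-breaking rule (my own variation, untested against your data) that keeps only the first tied row per ID by adding a flag:

data want (drop=_:);
   set have (in=firstpass) have (in=secondpass);
   by id;
   retain _max _done;
   if first.id then do;
      _max=loanbalance;
      _done=0;
   end;
   else _max=max(_max,loanbalance);
   /* output only the first row in the group that matches the maximum */
   if secondpass and loanbalance=_max and not _done then do;
      output;
      _done=1;
   end;
run;

Any other rule (last tied row, earliest date among ties, etc.) follows the same pattern with a different condition.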
I ran your code and got an error.
data want (drop=_:);
set TeraData.VBM410_ACTIVE_LOAN_BAL (in=firstpass keep=Agreement_Account_Id Agreement_Summary_Date Fund_Actual_Payment_Count)
TeraData.VBM410_ACTIVE_LOAN_BAL (in=secondpass keep=Agreement_Account_Id Agreement_Summary_Date Fund_Actual_Payment_Count);
by Agreement_Account_Id;
retain _max;
if first.Agreement_Account_Id then _max=Fund_Actual_Payment_Count;
else _max=max(_max,Fund_Actual_Payment_Count);
if secondpass and Fund_Actual_Payment_Count=_max;
run;
Here is the error log:
1 The SAS System 08:15 Monday, August 21, 2023
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 data want (drop=_:);
27 set TeraData.VBM410_ACTIVE_LOAN_BAL (in=firstpass keep=Agreement_Account_Id Agreement_Summary_Date
27 ! Fund_Actual_Payment_Count)
28 TeraData.VBM410_ACTIVE_LOAN_BAL (in=secondpass keep=Agreement_Account_Id Agreement_Summary_Date
28 ! Fund_Actual_Payment_Count);
29 by Agreement_Account_Id;
30 retain _max;
31 if first.Agreement_Account_Id then _max=Fund_Actual_Payment_Count;
32 else _max=max(_max,Fund_Actual_Payment_Count);
33 if secondpass and Fund_Actual_Payment_Count=_max;
34 run;
NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set.
ERROR: Teradata row not delivered (trget): CLI2: REQEXHAUST(307): Request data exhausted. .
NOTE: The DATA step has been abnormally terminated.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 258289 observations read from the data set TERADATA.VBM410_ACTIVE_LOAN_BAL.
NOTE: There were 258285 observations read from the data set TERADATA.VBM410_ACTIVE_LOAN_BAL.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 123804 observations and 3 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:31.11
user cpu time 0.41 seconds
system cpu time 0.01 seconds
memory 2027.40k
OS Memory 26544.00k
Timestamp 08/21/2023 08:18:32 AM
Step Count 4 Switch Count 47
Page Faults 0
Page Reclaims 560
Page Swaps 0
Voluntary Context Switches 502
Involuntary Context Switches 5
Block Input Operations 0
Block Output Operations 0
35
36 GOPTIONS NOACCESSIBLE;
37 %LET _CLIENTTASKLABEL=;
38 %LET _CLIENTPROCESSFLOWNAME=;
39 %LET _CLIENTPROJECTPATH=;
40 %LET _CLIENTPROJECTPATHHOST=;
41 %LET _CLIENTPROJECTNAME=;
42 %LET _SASPROGRAMFILE=;
43 %LET _SASPROGRAMFILEHOST=;
44
45 ;*';*";*/;quit;run;
46 ODS _ALL_ CLOSE;
47
48
49 QUIT; RUN;
50
You seem to have run into a limit in Teradata.
ERROR: Teradata row not delivered (trget): CLI2: REQEXHAUST(307): Request data exhausted.
Check with your Teradata admins why this happens.
Access to data in external databases is a factor when planning a program. That your source table resides in a Teradata database is significant, and something you should have told us from the start.
A data step solution will likely pull all the data over to the SAS side before sub-setting it, and that's likely not efficient. I'm also not sure that Teradata will return the rows sorted just because you've got a BY <variable> in your data step.
Googling the error you got returned this Teradata knowledge base article.
I would have expected your initial PROC SQL to be the right code for this. Is that the exact code that causes the bad performance?
If needed, add the following options before the SQL and share the SAS log with us so we can see what actually gets executed on the database side:
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
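For example, a correlated-subquery version of your original query is a form that SAS can often pass down to the database in full (a sketch using the library and column names from your log; check the SASTRACE output to confirm it actually runs in-database):

proc sql;
   create table want as
   select Agreement_Account_Id, Agreement_Summary_Date, Fund_Actual_Payment_Count
   from TeraData.VBM410_ACTIVE_LOAN_BAL as a
   where Agreement_Summary_Date =
         (select max(b.Agreement_Summary_Date)
          from TeraData.VBM410_ACTIVE_LOAN_BAL as b
          where b.Agreement_Account_Id = a.Agreement_Account_Id);
quit;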
@Ronein I feel my comments were clear. But Tom has now provided actual SQL pass-through code that you could try.
Ask your Teradata system admins for an efficient query.
I doubt that SAS can pass your original query to Teradata, as I don't think Teradata allows aggregate functions in a HAVING clause the way that PROC SQL does.
You might be able to use the advanced SQL "windowing" functions to write the Teradata SQL.
If you already have a libref pointing to your Teradata database, you can use it in PROC SQL to send pass-through SQL to Teradata.
So something like this (untested; note that Teradata filters on window functions with the QUALIFY clause, since ROW_NUMBER() cannot appear in a WHERE clause):
proc sql;
connect using TeraData;  /* reuse the connection behind the existing libref */
create table want as
select
  Agreement_Account_Id
  , Agreement_Summary_Date
  , Fund_Actual_Payment_Count
from connection to TeraData
  (select
     Agreement_Account_Id
     , Agreement_Summary_Date
     , Fund_Actual_Payment_Count
   from SCHEMA_NAME.VBM410_ACTIVE_LOAN_BAL
   qualify row_number()
      over (partition by Agreement_Account_Id
            order by Fund_Actual_Payment_Count desc) = 1
  );
quit;