I have very big data set with many columns.
Actually only 3 columns are important for me -id,date,LoanBalance
The task is to create a new data set that contain for each customer ID only one row.
The chosen row for each ID should be the row with max date.
This code is not efficient and running very long time and even didn't finish.
My question- What is most efficient way to do it?
/*select one row per group based on max value*/ /*select one row per group based on max value*/ /*select one row per group based on max value*/ /*select one row per group based on max value*/ Data have; input id date : date9. LoanBalance; format date date9.; cards; 1 01JUN2023 1000 1 02JUn2023 980 1 03JUN2023 950 1 04JUN2023 920 2 01JUN2023 3000 2 02JUn2023 2800 2 03JUN2023 0 ; Run; Proc SQL; Create table want as select id,date,LoanBalance from have group by id having date=max(date) ; run;
Proc sort data=have; by id date; run; data want; set have; by id; if last.id; run;
The summary functions in a HAVING clause do have a certain amount of overhead.
The above code may take time to sort. May subset to reduce the number of variables to reduce disk space and read/write times if there very many more.
To speed up the process, reduce data horizontally in the first step:
proc sort data=have (keep=id date loanbalance) out=intermediate ; by id date; run; data want; set intermediate; by id; if last.id; run;
Since the data are already sorted by ID, you can pass through each ID twice: first pass to find the maximum value, the second pass to output the instance that match that maximum value.
data have; input id date : date9. LoanBalance; format date date9.; cards; 1 01JUN2023 1000 1 02JUn2023 980 1 03JUN2023 950 1 04JUN2023 920 2 01JUN2023 3000 2 02JUn2023 2800 2 03JUN2023 0 run; data want (drop=_:); set have (in=firstpass) have (in=secondpass); by id; retain _max; if first.id then _max=loanbalance; else _max=max(_max,loanbalance); if secondpass and loanbalance=_max; run;
This assumes there is only one instance of a maximum value. If not, and you still want only one instance per id, then a rule is needed to choose among the tied observations.
II run your code and got error
data want (drop=_:); set TeraData.VBM410_ACTIVE_LOAN_BAL (in=firstpass keep=Agreement_Account_Id Agreement_Summary_Date Fund_Actual_Payment_Count) TeraData.VBM410_ACTIVE_LOAN_BAL (in=secondpass keep=Agreement_Account_Id Agreement_Summary_Date Fund_Actual_Payment_Count); by Agreement_Account_Id; retain _max; if first.Agreement_Account_Id then _max=Fund_Actual_Payment_Count; else _max=max(_max,Fund_Actual_Payment_Count); if secondpass and Fund_Actual_Payment_Count=_max; run;
Here error log
1 The SAS System 08:15 Monday, August 21, 2023 1 ;*';*";*/;quit;run; 2 OPTIONS PAGENO=MIN; 3 %LET _CLIENTTASKLABEL='Program'; 4 %LET _CLIENTPROCESSFLOWNAME='Process Flow'; 5 %LET _CLIENTPROJECTPATH=''; 6 %LET _CLIENTPROJECTPATHHOST=''; 7 %LET _CLIENTPROJECTNAME=''; 8 %LET _SASPROGRAMFILE=''; 9 %LET _SASPROGRAMFILEHOST=''; 10 11 ODS _ALL_ CLOSE; 12 OPTIONS DEV=PNG; 13 GOPTIONS XPIXELS=0 YPIXELS=0; 14 FILENAME EGSR TEMP; 15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR 16 STYLE=HTMLBlue 17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css") 18 NOGTITLE 19 NOGFOOTNOTE 20 GPATH=&sasworklocation 21 ENCODING=UTF8 22 options(rolap="on") 23 ; NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR 24 25 GOPTIONS ACCESSIBLE; 26 data want (drop=_:); 27 set TeraData.VBM410_ACTIVE_LOAN_BAL (in=firstpass keep=Agreement_Account_Id Agreement_Summary_Date 27 ! Fund_Actual_Payment_Count) 28 TeraData.VBM410_ACTIVE_LOAN_BAL (in=secondpass keep=Agreement_Account_Id Agreement_Summary_Date 28 ! Fund_Actual_Payment_Count); 29 by Agreement_Account_Id; 30 retain _max; 31 if first.Agreement_Account_Id then _max=Fund_Actual_Payment_Count; 32 else _max=max(_max,Fund_Actual_Payment_Count); 33 if secondpass and Fund_Actual_Payment_Count=_max; 34 run; NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set. ERROR: Teradata row not delivered (trget): CLI2: REQEXHAUST(307): Request data exhausted. . NOTE: The DATA step has been abnormally terminated. NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 258289 observations read from the data set TERADATA.VBM410_ACTIVE_LOAN_BAL. NOTE: There were 258285 observations read from the data set TERADATA.VBM410_ACTIVE_LOAN_BAL. WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 123804 observations and 3 variables. WARNING: Data set WORK.WANT was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 1:31.11 user cpu time 0.41 seconds system cpu time 0.01 seconds memory 2027.40k OS Memory 26544.00k Timestamp 08/21/2023 08:18:32 AM Step Count 4 Switch Count 47 Page Faults 0 Page Reclaims 560 Page Swaps 0 Voluntary Context Switches 502 2 The SAS System 08:15 Monday, August 21, 2023 Involuntary Context Switches 5 Block Input Operations 0 Block Output Operations 0 35 36 GOPTIONS NOACCESSIBLE; 37 %LET _CLIENTTASKLABEL=; 38 %LET _CLIENTPROCESSFLOWNAME=; 39 %LET _CLIENTPROJECTPATH=; 40 %LET _CLIENTPROJECTPATHHOST=; 41 %LET _CLIENTPROJECTNAME=; 42 %LET _SASPROGRAMFILE=; 43 %LET _SASPROGRAMFILEHOST=; 44 45 ;*';*";*/;quit;run; 46 ODS _ALL_ CLOSE; 47 48 49 QUIT; RUN; 50
You seem to have run into a limit in Teradata.
ERROR: Teradata row not delivered (trget): CLI2: REQEXHAUST(307): Request data exhausted.
Check with your Teradata admins why this happens.
That your source table resides in a database is significant and something you should have told us from start.
Using a data step solution will likely pull all the data first to the SAS side before sub-setting it. And that's likely not efficient. I'm also not sure if Teradata will return the rows sorted only because you've got a BY <variable> in your data step.
Googling the error you've got returned this Teradata knowledge base article.
I would have expected that your initial Proc SQL would be right code to do things. Is that the exact code that causes the bad performance?
Eventually add the following options prior to the SQL and share the SAS log with us that shows what actually gets executed on the database side.
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
Ask your Teradata system admins for an efficient query .
I doubt that SAS can pass your original query into Teradata as I don't think it will allow the use of aggregate functions in a HAVING clause the way that PROC SQL does.
You might be able to use the advanced SQL syntax of "windowing" functions to create the Teradata SQL.
If you already have a libref pointing to your TERADATA database you can use that in PROC SQL to send pass thru SQL into Teradata.
So something like:
proc sql; connect to teradata; create table want as select Agreement_Account_Id , Agreement_Summary_Date , Fund_Actual_Payment_Count from connection to teradata (select Agreement_Account_Id , Agreement_Summary_Date , Fund_Actual_Payment_Count from SCHEMA_NAME.VBM410_ACTIVE_LOAN_BAL where 1=row_number() partition by Agreement_Account_Id order by Fund_Actual_Payment_Count desc ); quit;
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.