Hello friends,
I have a very big data set with many columns, but only three of them matter here: id, date, and LoanBalance.
The task is to create a new data set that contains exactly one row per customer ID: the row with the maximum date for that ID.
The code below is not efficient; it ran for a very long time and never finished.
My question: what is the most efficient way to do this?
/*select one row per group based on max value*/
data have;
input id date : date9. LoanBalance;
format date date9.;
cards;
1 01JUN2023 1000
1 02JUN2023 980
1 03JUN2023 950
1 04JUN2023 920
2 01JUN2023 3000
2 02JUN2023 2800
2 03JUN2023 0
;
run;
proc sql;
create table want as
select id, date, LoanBalance
from have
group by id
having date=max(date)
;
quit;
proc sort data=have;
   by id date;
run;

data want;
   set have;
   by id;
   if last.id;
run;
The summary functions in a HAVING clause do carry a certain amount of overhead.
The code above may still take time to sort. If the source table has many more variables, subset with KEEP= first to reduce disk space and read/write times.
To speed up the process, reduce the data horizontally in the first step:
proc sort
data=have (keep=id date loanbalance)
out=intermediate
;
by id date;
run;
data want;
set intermediate;
by id;
if last.id;
run;
Since the data are already sorted by ID, you can pass through each ID group twice: the first pass finds the maximum value, and the second pass outputs the observation(s) that match that maximum.
data have;
input id date : date9. LoanBalance;
format date date9.;
cards;
1 01JUN2023 1000
1 02JUN2023 980
1 03JUN2023 950
1 04JUN2023 920
2 01JUN2023 3000
2 02JUN2023 2800
2 03JUN2023 0
;
run;
data want (drop=_:);
   set have (in=firstpass) have (in=secondpass);  /* read each ID group twice */
   by id;
   retain _max;
   if first.id then _max=loanbalance;
   else _max=max(_max,loanbalance);
   /* on the second pass, keep only rows that match the maximum */
   if secondpass and loanbalance=_max;
run;
This assumes there is only one instance of a maximum value. If not, and you still want only one instance per id, then a rule is needed to choose among the tied observations.
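For example, here is a minimal sketch of such a tie-breaking rule (my own variation, untested against your data) that keeps only the first tied row per ID by adding a flag:

data want (drop=_:);
   set have (in=firstpass) have (in=secondpass);
   by id;
   retain _max _done;
   if first.id then do;
      _max=loanbalance;
      _done=0;
   end;
   else _max=max(_max,loanbalance);
   /* output only the first row in the group that matches the maximum */
   if secondpass and loanbalance=_max and not _done then do;
      output;
      _done=1;
   end;
run;

Any other rule (last tied row, earliest date among ties, etc.) follows the same pattern with a different condition.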
I ran your code and got an error.
data want (drop=_:);
set TeraData.VBM410_ACTIVE_LOAN_BAL (in=firstpass keep=Agreement_Account_Id Agreement_Summary_Date Fund_Actual_Payment_Count)
TeraData.VBM410_ACTIVE_LOAN_BAL (in=secondpass keep=Agreement_Account_Id Agreement_Summary_Date Fund_Actual_Payment_Count);
by Agreement_Account_Id;
retain _max;
if first.Agreement_Account_Id then _max=Fund_Actual_Payment_Count;
else _max=max(_max,Fund_Actual_Payment_Count);
if secondpass and Fund_Actual_Payment_Count=_max;
run;
Here is the error log:
1 The SAS System 08:15 Monday, August 21, 2023
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 data want (drop=_:);
27 set TeraData.VBM410_ACTIVE_LOAN_BAL (in=firstpass keep=Agreement_Account_Id Agreement_Summary_Date
27 ! Fund_Actual_Payment_Count)
28 TeraData.VBM410_ACTIVE_LOAN_BAL (in=secondpass keep=Agreement_Account_Id Agreement_Summary_Date
28 ! Fund_Actual_Payment_Count);
29 by Agreement_Account_Id;
30 retain _max;
31 if first.Agreement_Account_Id then _max=Fund_Actual_Payment_Count;
32 else _max=max(_max,Fund_Actual_Payment_Count);
33 if secondpass and Fund_Actual_Payment_Count=_max;
34 run;
NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set.
ERROR: Teradata row not delivered (trget): CLI2: REQEXHAUST(307): Request data exhausted. .
NOTE: The DATA step has been abnormally terminated.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 258289 observations read from the data set TERADATA.VBM410_ACTIVE_LOAN_BAL.
NOTE: There were 258285 observations read from the data set TERADATA.VBM410_ACTIVE_LOAN_BAL.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 123804 observations and 3 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:31.11
user cpu time 0.41 seconds
system cpu time 0.01 seconds
memory 2027.40k
OS Memory 26544.00k
Timestamp 08/21/2023 08:18:32 AM
Step Count 4 Switch Count 47
Page Faults 0
Page Reclaims 560
Page Swaps 0
Voluntary Context Switches 502
Involuntary Context Switches 5
Block Input Operations 0
Block Output Operations 0
35
36 GOPTIONS NOACCESSIBLE;
37 %LET _CLIENTTASKLABEL=;
38 %LET _CLIENTPROCESSFLOWNAME=;
39 %LET _CLIENTPROJECTPATH=;
40 %LET _CLIENTPROJECTPATHHOST=;
41 %LET _CLIENTPROJECTNAME=;
42 %LET _SASPROGRAMFILE=;
43 %LET _SASPROGRAMFILEHOST=;
44
45 ;*';*";*/;quit;run;
46 ODS _ALL_ CLOSE;
47
48
49 QUIT; RUN;
50
You seem to have run into a limit in Teradata.
ERROR: Teradata row not delivered (trget): CLI2: REQEXHAUST(307): Request data exhausted.
Check with your Teradata admins why this happens.
Access to data in external databases is a factor when planning a program. That your source table resides in a Teradata database is significant, and something you should have told us from the start.
A data step solution will likely pull all the data over to the SAS side before sub-setting it, and that's likely not efficient. I'm also not sure that Teradata will return the rows sorted just because you've got a BY <variable> in your data step.
Googling the error you got returned this Teradata knowledge base article.
I would have expected your initial PROC SQL to be the right code for this. Is that the exact code that causes the bad performance?
If needed, add the following options before the SQL and share the SAS log with us so we can see what actually gets executed on the database side:
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
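For example, a correlated-subquery version of your original query is a form that SAS can often pass down to the database in full (a sketch using the library and column names from your log; check the SASTRACE output to confirm it actually runs in-database):

proc sql;
   create table want as
   select Agreement_Account_Id, Agreement_Summary_Date, Fund_Actual_Payment_Count
   from TeraData.VBM410_ACTIVE_LOAN_BAL as a
   where Agreement_Summary_Date =
         (select max(b.Agreement_Summary_Date)
          from TeraData.VBM410_ACTIVE_LOAN_BAL as b
          where b.Agreement_Account_Id = a.Agreement_Account_Id);
quit;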
@Ronein I feel my comments were clear. But Tom has now provided actual SQL pass-through code that you could try.
Ask your Teradata system admins for an efficient query.
I doubt that SAS can pass your original query to Teradata, as I don't think Teradata allows aggregate functions in a HAVING clause the way that PROC SQL does.
You might be able to use the advanced SQL "windowing" functions to write the Teradata SQL.
If you already have a libref pointing to your Teradata database, you can use it in PROC SQL to send pass-through SQL to Teradata.
So something like this (untested; note that Teradata filters on window functions with the QUALIFY clause, since ROW_NUMBER() cannot appear in a WHERE clause):
proc sql;
connect using TeraData;  /* reuse the connection behind the existing libref */
create table want as
select
  Agreement_Account_Id
  , Agreement_Summary_Date
  , Fund_Actual_Payment_Count
from connection to TeraData
  (select
     Agreement_Account_Id
     , Agreement_Summary_Date
     , Fund_Actual_Payment_Count
   from SCHEMA_NAME.VBM410_ACTIVE_LOAN_BAL
   qualify row_number()
      over (partition by Agreement_Account_Id
            order by Fund_Actual_Payment_Count desc) = 1
  );
quit;