Hello
I want to sort a data set located in the WORK library using PROC SORT.
This data set has 29 million rows and 76 columns: 9 of them character with length 2014, and 67 numeric (length 8).
It takes almost 1 hour to sort it.
proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12;
by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
Run;
Any idea what to do to reduce the run time?
Here is the log:
1 The SAS System 10:05 Monday, February 17, 2025
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='V13_Run_Via_Macro_Added_Birth_Date_TO_YANIV.sas';
4 %LET _CLIENTPROCESSFLOWNAME='Standalone Not In Project';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='M:\aתחום מודלים\מסגרות
8 ! סיכון\בקרות\בקרות_אצווה_פרטי_שימושי_עבור_רחל\V13_Run_Via_Macro_Added_Birth_Date_TO_YANIV.sas';
9 %LET _SASPROGRAMFILEHOST='HKSU102561A79';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=SVG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 %macro HTML5AccessibleGraphSupported;
15 %if %_SAS_VERCOMP_FV(9,4,4, 0,0,0) >= 0 %then ACCESSIBLE_GRAPH;
16 %mend;
17 FILENAME EGHTML TEMP;
18 ODS HTML5(ID=EGHTML) FILE=EGHTML
19 OPTIONS(BITMAP_MODE='INLINE')
20 %HTML5AccessibleGraphSupported
NOTE: The ACCESSIBLE_GRAPH option is pre-production for this release.
21 ENCODING='utf-8'
22 STYLE=HTMLBlue
23 NOGTITLE
24 NOGFOOTNOTE
25 GPATH=&sasworklocation
26 ;
NOTE: Writing HTML5(EGHTML) Body file: EGHTML
27
28 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12;
29 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
30 Run;
NOTE: There were 29334832 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 29334832 observations and 75 variables.
NOTE: Compressing data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 decreased size by 96.39 percent.
Compressed is 151230 pages; un-compressed would require 4190691 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 38:44.79
user cpu time 8:39.11
system cpu time 4:22.12
memory 6325149.70k
OS Memory 6348620.00k
Timestamp 02/17/2025 11:28:44 AM
Step Count 34 Switch Count 3298
Page Faults 0
Page Reclaims 4844240
Page Swaps 0
Voluntary Context Switches 9834164
Involuntary Context Switches 19362
Block Input Operations 0
Block Output Operations 0
31
32
2 The SAS System 10:05 Monday, February 17, 2025
33 %LET _CLIENTTASKLABEL=;
34 %LET _CLIENTPROCESSFLOWNAME=;
35 %LET _CLIENTPROJECTPATH=;
36 %LET _CLIENTPROJECTPATHHOST=;
37 %LET _CLIENTPROJECTNAME=;
38 %LET _SASPROGRAMFILE=;
39 %LET _SASPROGRAMFILEHOST=;
40
41 ;*';*";*/;quit;run;
42 ODS _ALL_ CLOSE;
43
44
45 QUIT; RUN;
46
Accepted Solutions
proc sort data=have sortsize=60G
And if you don't care about the relative order of records/obs within BY groups, you could also make it faster by using the NOEQUALS option:
proc sort data=have noequals
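Combined on the original table, that would look like the sketch below (SORTSIZE=60G is an assumption about how much memory your session can actually get; adjust it to your environment):

/* sketch: both suggestions combined on the original table */
proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 sortsize=60G noequals;
   by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
run;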
@Ronein Thank you for sharing the log with fullstimer and some characteristics about your table. That helps!
From the looks of it, the PROC SORT as such runs efficiently. There is enough memory (zero page swaps), and the CPU times look reasonable.
The real time is much higher than the CPU times, which means time is spent waiting for resources (if your environment is busy) and on input/output operations. Given the number of rows in your table, and especially those 2KB character variables, I'd assume it's I/O operations that consume most of the time.
I don't believe you can achieve a much better real time unless you reduce your data volume, get faster disks, and/or possibly use the SPDE engine and not sort at all but create indexes instead (see the sketch at the end of this post).
Questions
1. Why do you end up with such a massive table in WORK?
2. Why do you need the table sorted? What do you intend to do with it? Sorting data in SAS: can you skip it?
It's often beneficial to save really big tables using the SPDE engine, but it depends on how you get to this big table and on the intended use.
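A minimal sketch of the SPDE-plus-index route, assuming the table can live in a permanent location (the library path and index name below are hypothetical):

/* sketch: store the table in an SPDE library and index it instead of sorting */
libname spdelib spde 'D:/spde_data';      /* hypothetical storage path */

data spdelib.account_monthly;             /* copy the WORK table once */
   set work.ACCOUNT_MONTHLY_DATA_RF_CS_12;
run;

/* composite index on the former BY variables */
proc datasets lib=spdelib nolist;
   modify account_monthly;
   index create grp_key=(UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE);
quit;

A subsequent step could then use a BY statement on those variables and let the index supply the order, without a physical sort.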
I use it in order to select the last row per group:
proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12;
by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
Run;
Data ACCOUNT_MONTHLY_DATA_RF_CS_1(RENAME=(ACCOUNT_OPEN_DATE_=ACCOUNT_OPEN_DATE ACCOUNT_ID_OPEN_DATE_=ACCOUNT_ID_OPEN_DATE));
SET ACCOUNT_MONTHLY_DATA_RF_CS_12;
by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE ;
If last.FK_MONTHLY_DATA_ACCOUNT;
IF PRIMARY_ACCOUNT_INDICATOR=2 then Ind_Mishni=1; else Ind_Mishni=0;
IF ACCOUNT_CHARACTER IN(30,38,57,60,12,88,96) then Ind_Tmura=1; else Ind_Tmura=0;
ACCOUNT_OPEN_DATE_=DATEPART(ACCOUNT_OPEN_DATE);
ACCOUNT_ID_OPEN_DATE_= DATEPART(ACCOUNT_ID_OPEN_DATE);
format ACCOUNT_OPEN_DATE_ ACCOUNT_ID_OPEN_DATE_ date9.;
ACCOUNT_AGE_MONTHS=ROUND(ACCOUNT_AGE*12);
KEEP
UTILIZATION_AMOUNT
RISK_RATING
Ind_Mishni
Ind_Tmura
FK_APPLICATION
UPDATE_DATE
REFERENCE_DATE
ID
FK_MONTHLY_DATA_ACCOUNT
ACCOUNT_AGE
ACCOUNT_AGE_MONTHS
ACCOUNT_OPEN_DATE_
ACCOUNT_ID_OPEN_DATE_
ACCOUNT_BALANCE
ACCOUNT_CHARACTER
ACCOUNT_ID_BANK
ACCOUNT_ID_DEPARTMENT
ACCOUNT_SCORING_STATUS
ACCOUNT_TYPE_NUMBER
ATTRIBUTION_CHARACTER
BCG
BRANCH_NUMBER_MASTER
CASH_DEPOSITS_AMOUNT
CHECK_DEPOSITS_AMOUNT
CHILDREN_ALLOWANCE
CLIENT_ID_MASTER
CREDIT_CARD_UTIL
CREDIT_CARD_UTIL_CURRENT
CUSTOMER_DOMAIN_CODE
CUSTOMER_NUMBER_MASTER
DATA_SOURCE
DEBT_INDICATOR
DIVISION_CODE
ECONOMIC_SECTOR
EXTERNAL_LOANS_AMOUNT
FINANCIAL_WEALTH
FINANCIAL_WEALTH_DAILY
GUARANTEES_AMOUNT
INTEGRATION_SALARY
LINKED_ACCOUNT_IND
LINKED_ACOUNT_ELIGIBLE_IND
LOANS_BALANCE
MONTHLY_CYCLE
MONTHLY_LOAN_PAYMENT
MONTHS_WITH_DATA_MASK
NUM_ATM_CARD_BLOCKINGS
NUM_CREDIT_CARD_BLOCKINGS
NUM_MONTHS_WITH_DATA
NUM_MORTGAGES
NUM_OVERLIMIT_PAYMENTS
NUM_REJECTED_PAYMENTS
OBLIGO
OBLIGO_BUSINESS
OTHER_ALLOWANCE
OVERDUE_AMOUNT
OVERDUE_DAYS
OVERDUE_TYPE
OVERLIMIT_AMOUNT
OVERLIMIT_DAYS
OVERLIMIT_PAYMENTS_AMOUNT
OWN_ACCOUNT_TRANSFERS_AMOUNT
OWNERSHIP_CODE
PENSION_AMOUNT
POPULATION_CODE
PRIMARY_ACCOUNT_INDICATOR
REJECTED_PAYMENTS_AMOUNT
RELATIONSHIP_TYPE
REQ_INVESTIGATION_INDICATOR
RESTRICTION_SOURSE
RISK_FRAME_ORIGINAL
SCORING_SUB_POPULATION
SECURITIES_AMOUNT
SECURITIES_TRUST_FUND_AMOUNT
TEAM_CODE
UTIL_DAILY
UTIL_OBLIGO
UTIL_OBLIGO_ADJ
UTIL_OBLIGO_BUSINESS
;
Run;
And how do you create this huge source table in WORK? Is it another data step, or something else?
@Ronein You could try the approach below to avoid a) sorting and b) writing the whole sorted table to disk.
I did ask how you created this big source table in WORK because, if it's created by an earlier data step, you could incorporate the creation of the hash lookup table into that data step instead of the data _null_ step below; that would save one full pass through the data. The hash stores, for each (UPDATE_DATE, FK_APPLICATION, FK_MONTHLY_DATA_ACCOUNT) group, the row number holding the maximum REFERENCE_DATE, which, given your BY order, is exactly the row your last.FK_MONTHLY_DATA_ACCOUNT condition keeps.
/* create sample data */
data ACCOUNT_MONTHLY_DATA_RF_CS_12;
do i = 1 to 30;
UPDATE_DATE = '01JAN2025'd + i;
FK_APPLICATION = round(1000 + ranuni(0) * 100);
FK_MONTHLY_DATA_ACCOUNT = round(2000 + ranuni(0) * 100);
REFERENCE_DATE = '01JAN2025'd + intnx('month', 0, i);
output;
output;
end;
drop i;
run;
/* create table with row_number of source table that contains the max reference_date */
data _null_;
if _n_=1 then
do;
length row_num max_REFERENCE_DATE 8;
dcl hash h1();
h1.defineKey('UPDATE_DATE', 'FK_APPLICATION', 'FK_MONTHLY_DATA_ACCOUNT');
h1.defineData('row_num','max_REFERENCE_DATE');
h1.defineDone();
call missing(row_num, max_REFERENCE_DATE );
end;
set ACCOUNT_MONTHLY_DATA_RF_CS_12(keep=UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE) end=last;
if h1.find() = 0 then
do;
if REFERENCE_DATE>max_REFERENCE_DATE then
do;
rc=h1.replace(key:UPDATE_DATE, key:FK_APPLICATION , key:FK_MONTHLY_DATA_ACCOUNT , data:_n_ , data:REFERENCE_DATE);
end;
end;
else
do;
rc=h1.add(key:UPDATE_DATE, key:FK_APPLICATION , key:FK_MONTHLY_DATA_ACCOUNT , data:_n_ , data:REFERENCE_DATE);
end;
if last then h1.output(dataset:"work.row_num_with_max_REFERENCE_DATE(keep=row_num rename=(row_num=_n_))");
run;
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'work.row_num_with_max_REFERENCE_DATE');
h1.defineKey('_n_');
h1.defineDone();
end;
set ACCOUNT_MONTHLY_DATA_RF_CS_12;
if h1.check()=0 then output;
run;
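For scale: the hash keeps one entry per (UPDATE_DATE, FK_APPLICATION, FK_MONTHLY_DATA_ACCOUNT) combination (three keys plus two data values, all numeric), so its memory footprint grows with the number of groups rather than with the 29 million rows, and both steps read the source sequentially instead of sorting it.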
Sure. Testing on my poor laptop PC with SAS 9.4m7:
1 options fullstimer;
2 data ACCOUNT_MONTHLY_DATA_RF_CS_12;
3 call streaminit(123);
4 do i = 1 to 1e7;
5 UPDATE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
6 FK_APPLICATION = round(1000 + ranuni(0) * 100);
7 FK_MONTHLY_DATA_ACCOUNT = round(2000 + ranuni(0) * 100);
8 REFERENCE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
9 output;
10 output;
11 end;
12 drop i;
13 run;
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.75 seconds
user cpu time 1.18 seconds
system cpu time 0.21 seconds
memory 1227.87k
OS Memory 10476.00k
Timestamp 02/18/2025 03:16:58 PM
Step Count 1 Switch Count 0
14
15 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 ;
16 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
17 Run;
NOTE: There were 20000000 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 10.76 seconds
user cpu time 6.67 seconds
system cpu time 0.54 seconds
memory 1058604.60k
OS Memory 1069256.00k
Timestamp 02/18/2025 03:17:09 PM
Step Count 2 Switch Count 0

1 options fullstimer;
2 data ACCOUNT_MONTHLY_DATA_RF_CS_12;
3 call streaminit(123);
4 do i = 1 to 1e7;
5 UPDATE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
6 FK_APPLICATION = round(1000 + ranuni(0) * 100);
7 FK_MONTHLY_DATA_ACCOUNT = round(2000 + ranuni(0) * 100);
8 REFERENCE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
9 output;
10 output;
11 end;
12 drop i;
13 run;
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.75 seconds
user cpu time 1.07 seconds
system cpu time 0.21 seconds
memory 1227.87k
OS Memory 10476.00k
Timestamp 02/18/2025 03:19:03 PM
Step Count 1 Switch Count 0
14
15 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 sortsize=60G noequals ;
16 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
17 Run;
NOTE: There were 20000000 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6.82 seconds
user cpu time 9.01 seconds
system cpu time 0.79 seconds
memory 1572251.62k
OS Memory 1583016.00k
Timestamp 02/18/2025 03:19:10 PM
Step Count 2 Switch Count 0
@Ksharp Thanks. But what would be really interesting is to get that from @Ronein's likely more powerful environment.
I certainly have to read up on the SORTSIZE option, because I don't expect you to have 60GB of memory. I would have expected that with such a high value a lot of memory swapping will be happening, with a potentially negative effect on real time.
" a lot of memory swapping will be happening with a potentially negative effect on real time."
I believe it would not.
PROC SORT would use up to 60GB of memory; if the OS cannot provide that much, PROC SORT will only use the maximum memory the OS can actually make available.
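For reference, you could check and adjust the session setting like this (the 1G value is just an example):

proc options option=sortsize;   /* show the current SORTSIZE value in the log */
run;
options sortsize=1G;            /* change it session-wide */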
@Ksharp The fact that you won't have 60GB of memory on your laptop is a strong indication that SAS limits itself to what's available; otherwise the process would fail with an out-of-memory condition when trying to load a table that exceeds the actually available memory.
Reading through the documentation here, I can't really understand why increasing SORTSIZE to 60GB should make any difference for a table that can't fit into memory. The documentation actually even recommends setting SORTSIZE to a value of 1GB or less if the whole table doesn't fit into memory.
I think you are right.
Sometimes, if you cannot load the whole data set into memory, you should set SORTSIZE= smaller.
Here is a test.
14
15 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 sortsize=400M noequals ;
16 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
17 Run;
NOTE: There were 20000000 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6.17 seconds
cpu time 8.45 seconds