Hello
I want to sort a data set located in the WORK library using PROC SORT.
This data set has 29 million rows and 76 columns: 9 of them character with length 2014, and 67 numeric (length 8).
It takes almost 1 hour to sort it.
proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12;
by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
Run;
Any idea what to do to reduce the run time?
Here is the log:
1 The SAS System 10:05 Monday, February 17, 2025
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='V13_Run_Via_Macro_Added_Birth_Date_TO_YANIV.sas';
4 %LET _CLIENTPROCESSFLOWNAME='Standalone Not In Project';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='M:\aתחום מודלים\מסגרות
8 ! סיכון\בקרות\בקרות_אצווה_פרטי_שימושי_עבור_רחל\V13_Run_Via_Macro_Added_Birth_Date_TO_YANIV.sas';
9 %LET _SASPROGRAMFILEHOST='HKSU102561A79';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=SVG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 %macro HTML5AccessibleGraphSupported;
15 %if %_SAS_VERCOMP_FV(9,4,4, 0,0,0) >= 0 %then ACCESSIBLE_GRAPH;
16 %mend;
17 FILENAME EGHTML TEMP;
18 ODS HTML5(ID=EGHTML) FILE=EGHTML
19 OPTIONS(BITMAP_MODE='INLINE')
20 %HTML5AccessibleGraphSupported
NOTE: The ACCESSIBLE_GRAPH option is pre-production for this release.
21 ENCODING='utf-8'
22 STYLE=HTMLBlue
23 NOGTITLE
24 NOGFOOTNOTE
25 GPATH=&sasworklocation
26 ;
NOTE: Writing HTML5(EGHTML) Body file: EGHTML
27
28 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12;
29 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
30 Run;
NOTE: There were 29334832 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 29334832 observations and 75 variables.
NOTE: Compressing data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 decreased size by 96.39 percent.
Compressed is 151230 pages; un-compressed would require 4190691 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 38:44.79
user cpu time 8:39.11
system cpu time 4:22.12
memory 6325149.70k
OS Memory 6348620.00k
Timestamp 02/17/2025 11:28:44 AM
Step Count 34 Switch Count 3298
Page Faults 0
Page Reclaims 4844240
Page Swaps 0
Voluntary Context Switches 9834164
Involuntary Context Switches 19362
Block Input Operations 0
Block Output Operations 0
31
32
2 The SAS System 10:05 Monday, February 17, 2025
33 %LET _CLIENTTASKLABEL=;
34 %LET _CLIENTPROCESSFLOWNAME=;
35 %LET _CLIENTPROJECTPATH=;
36 %LET _CLIENTPROJECTPATHHOST=;
37 %LET _CLIENTPROJECTNAME=;
38 %LET _SASPROGRAMFILE=;
39 %LET _SASPROGRAMFILEHOST=;
40
41 ;*';*";*/;quit;run;
42 ODS _ALL_ CLOSE;
43
44
45 QUIT; RUN;
46
Accepted Solutions
proc sort data=have sortsize=60G
And if you don't care about the relative order of records/obs within BY groups, you could also make it faster by using the NOEQUALS option:
proc sort data=have noequals
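Combined on the original table, that would look like the sketch below (SORTSIZE=60G is an assumption about how much memory your session can actually get; adjust it to your environment):

/* sketch: both suggestions combined on the original table */
proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 sortsize=60G noequals;
   by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
run;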
@Ronein Thank you for sharing the log with fullstimer and some characteristics about your table. That helps!
From the looks of it, the PROC SORT as such runs efficiently. There is enough memory (zero page swaps), and the CPU times look reasonable.
The real time is much higher than the CPU times, which means time is spent waiting for resources (if your environment is busy) and on input/output operations. Given the number of rows in your table, and especially those 2KB character variables, I'd assume it's I/O operations that consume most of the time.
I don't believe you can achieve a much better real time unless you reduce your data volume, get faster disks, and/or possibly use the SPDE engine and not sort at all but create indexes instead (see the sketch at the end of this post).
Questions
1. Why do you end up with such a massive table in WORK?
2. Why do you need the table sorted? What do you intend to do with it? Sorting data in SAS: can you skip it?
It's often beneficial to save really big tables using the SPDE engine, but it depends on how you get to this big table and on the intended use.
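A minimal sketch of the SPDE-plus-index route, assuming the table can live in a permanent location (the library path and index name below are hypothetical):

/* sketch: store the table in an SPDE library and index it instead of sorting */
libname spdelib spde 'D:/spde_data';      /* hypothetical storage path */

data spdelib.account_monthly;             /* copy the WORK table once */
   set work.ACCOUNT_MONTHLY_DATA_RF_CS_12;
run;

/* composite index on the former BY variables */
proc datasets lib=spdelib nolist;
   modify account_monthly;
   index create grp_key=(UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE);
quit;

A subsequent step could then use a BY statement on those variables and let the index supply the order, without a physical sort.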
I use it in order to select the last row per group:
proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12;
by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
Run;
Data ACCOUNT_MONTHLY_DATA_RF_CS_1(RENAME=(ACCOUNT_OPEN_DATE_=ACCOUNT_OPEN_DATE ACCOUNT_ID_OPEN_DATE_=ACCOUNT_ID_OPEN_DATE));
SET ACCOUNT_MONTHLY_DATA_RF_CS_12;
by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE ;
If last.FK_MONTHLY_DATA_ACCOUNT;
IF PRIMARY_ACCOUNT_INDICATOR=2 then Ind_Mishni=1; else Ind_Mishni=0;
IF ACCOUNT_CHARACTER IN(30,38,57,60,12,88,96) then Ind_Tmura=1; else Ind_Tmura=0;
ACCOUNT_OPEN_DATE_=DATEPART(ACCOUNT_OPEN_DATE);
ACCOUNT_ID_OPEN_DATE_= DATEPART(ACCOUNT_ID_OPEN_DATE);
format ACCOUNT_OPEN_DATE_ ACCOUNT_ID_OPEN_DATE_ date9.;
ACCOUNT_AGE_MONTHS=ROUND(ACCOUNT_AGE*12);
KEEP
UTILIZATION_AMOUNT
RISK_RATING
Ind_Mishni
Ind_Tmura
FK_APPLICATION
UPDATE_DATE
REFERENCE_DATE
ID
FK_MONTHLY_DATA_ACCOUNT
ACCOUNT_AGE
ACCOUNT_AGE_MONTHS
ACCOUNT_OPEN_DATE_
ACCOUNT_ID_OPEN_DATE_
ACCOUNT_BALANCE
ACCOUNT_CHARACTER
ACCOUNT_ID_BANK
ACCOUNT_ID_DEPARTMENT
ACCOUNT_SCORING_STATUS
ACCOUNT_TYPE_NUMBER
ATTRIBUTION_CHARACTER
BCG
BRANCH_NUMBER_MASTER
CASH_DEPOSITS_AMOUNT
CHECK_DEPOSITS_AMOUNT
CHILDREN_ALLOWANCE
CLIENT_ID_MASTER
CREDIT_CARD_UTIL
CREDIT_CARD_UTIL_CURRENT
CUSTOMER_DOMAIN_CODE
CUSTOMER_NUMBER_MASTER
DATA_SOURCE
DEBT_INDICATOR
DIVISION_CODE
ECONOMIC_SECTOR
EXTERNAL_LOANS_AMOUNT
FINANCIAL_WEALTH
FINANCIAL_WEALTH_DAILY
GUARANTEES_AMOUNT
INTEGRATION_SALARY
LINKED_ACCOUNT_IND
LINKED_ACOUNT_ELIGIBLE_IND
LOANS_BALANCE
MONTHLY_CYCLE
MONTHLY_LOAN_PAYMENT
MONTHS_WITH_DATA_MASK
NUM_ATM_CARD_BLOCKINGS
NUM_CREDIT_CARD_BLOCKINGS
NUM_MONTHS_WITH_DATA
NUM_MORTGAGES
NUM_OVERLIMIT_PAYMENTS
NUM_REJECTED_PAYMENTS
OBLIGO
OBLIGO_BUSINESS
OTHER_ALLOWANCE
OVERDUE_AMOUNT
OVERDUE_DAYS
OVERDUE_TYPE
OVERLIMIT_AMOUNT
OVERLIMIT_DAYS
OVERLIMIT_PAYMENTS_AMOUNT
OWN_ACCOUNT_TRANSFERS_AMOUNT
OWNERSHIP_CODE
PENSION_AMOUNT
POPULATION_CODE
PRIMARY_ACCOUNT_INDICATOR
REJECTED_PAYMENTS_AMOUNT
RELATIONSHIP_TYPE
REQ_INVESTIGATION_INDICATOR
RESTRICTION_SOURSE
RISK_FRAME_ORIGINAL
SCORING_SUB_POPULATION
SECURITIES_AMOUNT
SECURITIES_TRUST_FUND_AMOUNT
TEAM_CODE
UTIL_DAILY
UTIL_OBLIGO
UTIL_OBLIGO_ADJ
UTIL_OBLIGO_BUSINESS
;
Run;
And how do you create this huge source table in WORK? Is it another data step, or something else?
@Ronein You could try the approach below to avoid a) sorting and b) writing the whole sorted table to disk.
I did ask how you created this big source table in WORK because, if it's created by an earlier data step, you could incorporate the creation of the hash lookup table into that data step instead of the data _null_ step below; that would save one full pass through the data. The hash stores, for each (UPDATE_DATE, FK_APPLICATION, FK_MONTHLY_DATA_ACCOUNT) group, the row number holding the maximum REFERENCE_DATE, which, given your BY order, is exactly the row your last.FK_MONTHLY_DATA_ACCOUNT condition keeps.
/* create sample data */
data ACCOUNT_MONTHLY_DATA_RF_CS_12;
do i = 1 to 30;
UPDATE_DATE = '01JAN2025'd + i;
FK_APPLICATION = round(1000 + ranuni(0) * 100);
FK_MONTHLY_DATA_ACCOUNT = round(2000 + ranuni(0) * 100);
REFERENCE_DATE = '01JAN2025'd + intnx('month', 0, i);
output;
output;
end;
drop i;
run;
/* create table with row_number of source table that contains the max reference_date */
data _null_;
if _n_=1 then
do;
length row_num max_REFERENCE_DATE 8;
dcl hash h1();
h1.defineKey('UPDATE_DATE', 'FK_APPLICATION', 'FK_MONTHLY_DATA_ACCOUNT');
h1.defineData('row_num','max_REFERENCE_DATE');
h1.defineDone();
call missing(row_num, max_REFERENCE_DATE );
end;
set ACCOUNT_MONTHLY_DATA_RF_CS_12(keep=UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE) end=last;
if h1.find() = 0 then
do;
if REFERENCE_DATE>max_REFERENCE_DATE then
do;
rc=h1.replace(key:UPDATE_DATE, key:FK_APPLICATION , key:FK_MONTHLY_DATA_ACCOUNT , data:_n_ , data:REFERENCE_DATE);
end;
end;
else
do;
rc=h1.add(key:UPDATE_DATE, key:FK_APPLICATION , key:FK_MONTHLY_DATA_ACCOUNT , data:_n_ , data:REFERENCE_DATE);
end;
if last then h1.output(dataset:"work.row_num_with_max_REFERENCE_DATE(keep=row_num rename=(row_num=_n_))");
run;
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'work.row_num_with_max_REFERENCE_DATE');
h1.defineKey('_n_');
h1.defineDone();
end;
set ACCOUNT_MONTHLY_DATA_RF_CS_12;
if h1.check()=0 then output;
run;
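For scale: the hash keeps one entry per (UPDATE_DATE, FK_APPLICATION, FK_MONTHLY_DATA_ACCOUNT) combination (three keys plus two data values, all numeric), so its memory footprint grows with the number of groups rather than with the 29 million rows, and both steps read the source sequentially instead of sorting it.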
Sure. Testing on my poor laptop PC with SAS 9.4m7:
1 options fullstimer;
2 data ACCOUNT_MONTHLY_DATA_RF_CS_12;
3 call streaminit(123);
4 do i = 1 to 1e7;
5 UPDATE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
6 FK_APPLICATION = round(1000 + ranuni(0) * 100);
7 FK_MONTHLY_DATA_ACCOUNT = round(2000 + ranuni(0) * 100);
8 REFERENCE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
9 output;
10 output;
11 end;
12 drop i;
13 run;
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.75 seconds
user cpu time 1.18 seconds
system cpu time 0.21 seconds
memory 1227.87k
OS Memory 10476.00k
Timestamp 02/18/2025 03:16:58 PM
Step Count 1 Switch Count 0
14
15 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 ;
16 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
17 Run;
NOTE: There were 20000000 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 10.76 seconds
user cpu time 6.67 seconds
system cpu time 0.54 seconds
memory 1058604.60k
OS Memory 1069256.00k
Timestamp 02/18/2025 03:17:09 PM
Step Count 2 Switch Count 0

1 options fullstimer;
2 data ACCOUNT_MONTHLY_DATA_RF_CS_12;
3 call streaminit(123);
4 do i = 1 to 1e7;
5 UPDATE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
6 FK_APPLICATION = round(1000 + ranuni(0) * 100);
7 FK_MONTHLY_DATA_ACCOUNT = round(2000 + ranuni(0) * 100);
8 REFERENCE_DATE = '01JAN2025'd + rand('integer',-1000,1000);
9 output;
10 output;
11 end;
12 drop i;
13 run;
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.75 seconds
user cpu time 1.07 seconds
system cpu time 0.21 seconds
memory 1227.87k
OS Memory 10476.00k
Timestamp 02/18/2025 03:19:03 PM
Step Count 1 Switch Count 0
14
15 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 sortsize=60G noequals ;
16 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
17 Run;
NOTE: There were 20000000 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6.82 seconds
user cpu time 9.01 seconds
system cpu time 0.79 seconds
memory 1572251.62k
OS Memory 1583016.00k
Timestamp 02/18/2025 03:19:10 PM
Step Count 2 Switch Count 0
@Ksharp Thanks. But what would be really interesting is to get that from @Ronein's likely more powerful environment.
I certainly have to read up on the SORTSIZE option, because I don't expect you to have 60GB of memory. I would have expected that with such a high value a lot of memory swapping will be happening, with a potentially negative effect on real time.
" a lot of memory swapping will be happening with a potentially negative effect on real time."
I believe it would not.
PROC SORT would use up to 60GB of memory; if the OS cannot provide that much, PROC SORT will only use the maximum memory the OS can actually make available.
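For reference, you could check and adjust the session setting like this (the 1G value is just an example):

proc options option=sortsize;   /* show the current SORTSIZE value in the log */
run;
options sortsize=1G;            /* change it session-wide */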
@Ksharp The fact that you won't have 60GB of memory on your laptop is a strong indication that SAS limits itself to what's available; otherwise the process would fail with an out-of-memory condition when trying to load a table that exceeds the actually available memory.
Reading through the documentation here, I can't really understand why increasing SORTSIZE to 60GB should make any difference for a table that can't fit into memory. The documentation actually even recommends setting SORTSIZE to a value of 1GB or less if the whole table doesn't fit into memory.
I think you are right.
Sometimes, if you cannot load the whole data set into memory, you should set SORTSIZE= smaller.
Here is a test.
14
15 proc sort data=ACCOUNT_MONTHLY_DATA_RF_CS_12 sortsize=400M noequals ;
16 by UPDATE_DATE FK_APPLICATION FK_MONTHLY_DATA_ACCOUNT REFERENCE_DATE;
17 Run;
NOTE: There were 20000000 observations read from the data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12.
NOTE: The data set WORK.ACCOUNT_MONTHLY_DATA_RF_CS_12 has 20000000 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6.17 seconds
cpu time 8.45 seconds