Solved: Re: Reduce run time

Ronein · Posted 10-31-2022 08:23 AM

Hello

I want to create a new data set from existing data set.

The existing data set is very big and contains 100 million rows.

It is taking long long time to run proc sort+using where conditions ,

What do you think is better way to run ? Way1 or way2 ? or maybe there is another way that can help to reduce run time?

Sorry ta=hat I cannot send the data

/****Way1*****/
%let YYMM=2209;

proc sort data=ttt(Where=(UPDATE_DATE  ne '16MAY2022'd  AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.))  out=want;
by   UPDATE_DATE    FK_A       FK_B    REFERENCE_DATE  ;
Run;


/****Way2*****/
%let YYMM=2209;

Data temp;
SET ttt(Where=(UPDATE_DATE  ne '16MAY2022'd  AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.));
Run;

proc sort data=temp out=want;
by   UPDATE_DATE    FK_A       FK_B    REFERENCE_DATE  ;
Run;

Ksharp · Posted 10-31-2022 08:34 AM

Do NOT use PUT() and INPUT() function in your code , Try other function Like :

UPDATE_DATE ne '16MAY2022'd AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.
---->
year(UPDATE_DATE)=2022 AND month(UPDATE_DATE)=9

View solution in original post

Ksharp · Posted 10-31-2022 08:34 AM

Do NOT use PUT() and INPUT() function in your code , Try other function Like :

UPDATE_DATE ne '16MAY2022'd AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.
---->
year(UPDATE_DATE)=2022 AND month(UPDATE_DATE)=9

Ronein · Posted 10-31-2022 09:04 AM

Your solution is perfect,

It still takes a few minutes to run it but before it run very long time

Quentin · Posted 10-31-2022 08:37 AM

Sorting is slow. First I would consider whether you really need to sort. Perhaps a hash-table or some other method might be an alternative.

I would think (hope) that the PROC SORT with the WHERE would be faster than using two steps. But in both approaches, assuming your data is not indexed, you need to read all 100m rows just to select the rows you want.

The use of INPUT(PUT()) might be expensive. It looks like you are selecting all the September 2022 updates? I would try:


proc sort data=ttt(Where=('01Sep2022'd<=UPDATE_DATE<='30Sep2022'd)) out=want;
by   UPDATE_DATE    FK_A       FK_B    REFERENCE_DATE  ;
Run;

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

LinusH · Posted 10-31-2022 08:37 AM

Way 1 should be the most efficient, only one pass of the data.

Question, what are the constraints in your system?

No of cores, system memory?

Your settings of MEMESIZE and SORTSIZE?

If you could sent the log with

options stimer msglevel=i;

Generally speaking, you want to avoid too many function calls in your where clauses. Maybe you can use BETWEEN... AND.. logic instead?

What is the ratio of your subset? If less than 10% you could consider indexing on update_date.

Data never sleeps

Ronein · Posted 10-31-2022 09:05 AM

1 The SAS System 14:41 Monday, October 31, 2022

1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 options stimer msglevel=i;
27
28 GOPTIONS NOACCESSIBLE;
29 %LET _CLIENTTASKLABEL=;
30 %LET _CLIENTPROCESSFLOWNAME=;
31 %LET _CLIENTPROJECTPATH=;
32 %LET _CLIENTPROJECTPATHHOST=;
33 %LET _CLIENTPROJECTNAME=;
34 %LET _SASPROGRAMFILE=;
35 %LET _SASPROGRAMFILEHOST=;
36
37 ;*';*";*/;quit;run;
38 ODS _ALL_ CLOSE;
39
40
41 QUIT; RUN;
42

japelin · Posted 10-31-2022 08:40 AM

Maybe Way1 is faster, but why don't you try it anyway?

Also...try

-Increase MEMSIZE and REALMEMSIZE allocations.

-Use TAG sorting.

-Get better hardware.

PaigeMiller · Posted 10-31-2022 08:42 AM

Way 1 requires one data set to be written, Way 2 requires 2 data sets to be written. I would guess that Way 1 wins in speed just for that reason.

Why the manipulation of dates with INPUT(PUT())? That's got to be slower than just doing tests on the variable UPDATE_DATE. Compute &start_date and &end_date from %let YYMM=2209;

%let start_date='01SEP2022'd;
%let end_date='30SEP2022'd;

proc sort data=ttt(Where=(&start_date<=update_date<=&end_date)) out=want;

Don't do this part:

UPDATE_DATE  ne '16MAY2022'd

unless &YYMM = 2205

--
Paige Miller

Ronein · Posted 10-31-2022 09:14 AM

In September there were observations of different update_date and I should ignore observations from 16MAY2022.

In other months there is no problem and just need to select all rows with YYMM as user put

PaigeMiller · Posted 10-31-2022 10:01 AM

@Ronein wrote:

In September there were observations of different update_date and I should ignore observations from 16MAY2022.

In other months there is no problem and just need to select all rows with YYMM as user put

Thanks for the explanation. You can still create macro code that only does the test for 16MAY2022 for the month where it is a problem and not do the test for months when it is not a problem. This should help the speed issue. Perhaps you have done that already in your real code.

--
Paige Miller

Ronein · Posted 10-31-2022 10:12 AM

May you please show code?

PaigeMiller · Posted 10-31-2022 10:28 AM

Something like this:

%let YYMM=2209;
data _null_;
    call symputx('start_date',input("&yymm",yymmn4.));
    call symputx('end_date',intnx('month',input("&yymm",yymmn4.),0,'e'));
run;

%if &yymm=2209 %then %do;
proc sort data=ttt(Where=(UPDATE_DATE ne '16MAY2022'd AND &start_date<=update_date<=&end_date)) out=want;
%end;
%else %do;
proc sort data=ttt(Where=(&start_date<=update_date<=&end_date)) out=want;
%end;

or even simpler

proc sort data=ttt(Where=(%if &yymm=2209 %then %do; UPDATE_DATE ne '16MAY2022'd AND %end; 
    &start_date<=update_date<=&end_date)) out=want;

--
Paige Miller

Quentin · Posted 10-31-2022 10:10 AM

@Ronein wrote:

In September there were observations of different update_date and I should ignore observations from 16MAY2022.

In other months there is no problem and just need to select all rows with YYMM as user put

I think you mean "In May there were observations of different update_date and I should ignore observations from 16MAY2022" ?

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

Kurt_Bremser · Posted 10-31-2022 08:45 AM

What about variables? Reducing the number of variables through KEEP= or DROP= will reduce the runtime, so see if you need all the variables from the input dataset in your result.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ronein · Posted 10-31-2022 09:12 AM

Sure but in this case the main problem was not due to number of variables....

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away