Hello
I want to create a new data set from existing data set.
The existing data set is very big and contains 100 million rows.
It is taking long long time to run proc sort+using where conditions ,
What do you think is better way to run ? Way1 or way2 ? or maybe there is another way that can help to reduce run time?
Sorry ta=hat I cannot send the data
/****Way1*****/
%let YYMM=2209;
proc sort data=ttt(Where=(UPDATE_DATE ne '16MAY2022'd AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.)) out=want;
by UPDATE_DATE FK_A FK_B REFERENCE_DATE ;
Run;
/****Way2*****/
%let YYMM=2209;
Data temp;
SET ttt(Where=(UPDATE_DATE ne '16MAY2022'd AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.));
Run;
proc sort data=temp out=want;
by UPDATE_DATE FK_A FK_B REFERENCE_DATE ;
Run;
Your solution is perfect,
It still takes a few minutes to run it but before it run very long time
Sorting is slow. First I would consider whether you really need to sort. Perhaps a hash-table or some other method might be an alternative.
I would think (hope) that the PROC SORT with the WHERE would be faster than using two steps. But in both approaches, assuming your data is not indexed, you need to read all 100m rows just to select the rows you want.
The use of INPUT(PUT()) might be expensive. It looks like you are selecting all the September 2022 updates? I would try:
proc sort data=ttt(Where=('01Sep2022'd<=UPDATE_DATE<='30Sep2022'd)) out=want;
by UPDATE_DATE FK_A FK_B REFERENCE_DATE ;
Run;
Way 1 should be the most efficient, only one pass of the data.
Question, what are the constraints in your system?
No of cores, system memory?
Your settings of MEMESIZE and SORTSIZE?
If you could sent the log with
options stimer msglevel=i;
Generally speaking, you want to avoid too many function calls in your where clauses. Maybe you can use BETWEEN... AND.. logic instead?
What is the ratio of your subset? If less than 10% you could consider indexing on update_date.
Maybe Way1 is faster, but why don't you try it anyway?
Also...try
-Increase MEMSIZE and REALMEMSIZE allocations.
-Use TAG sorting.
-Get better hardware.
Way 1 requires one data set to be written, Way 2 requires 2 data sets to be written. I would guess that Way 1 wins in speed just for that reason.
Why the manipulation of dates with INPUT(PUT())? That's got to be slower than just doing tests on the variable UPDATE_DATE. Compute &start_date and &end_date from %let YYMM=2209;
%let start_date='01SEP2022'd;
%let end_date='30SEP2022'd;
proc sort data=ttt(Where=(&start_date<=update_date<=&end_date)) out=want;
Don't do this part:
UPDATE_DATE ne '16MAY2022'd
unless &YYMM = 2205
In September there were observations of different update_date and I should ignore observations from 16MAY2022.
In other months there is no problem and just need to select all rows with YYMM as user put
@Ronein wrote:
In September there were observations of different update_date and I should ignore observations from 16MAY2022.
In other months there is no problem and just need to select all rows with YYMM as user put
Thanks for the explanation. You can still create macro code that only does the test for 16MAY2022 for the month where it is a problem and not do the test for months when it is not a problem. This should help the speed issue. Perhaps you have done that already in your real code.
Something like this:
%let YYMM=2209;
data _null_;
call symputx('start_date',input("&yymm",yymmn4.));
call symputx('end_date',intnx('month',input("&yymm",yymmn4.),0,'e'));
run;
%if &yymm=2209 %then %do;
proc sort data=ttt(Where=(UPDATE_DATE ne '16MAY2022'd AND &start_date<=update_date<=&end_date)) out=want;
%end;
%else %do;
proc sort data=ttt(Where=(&start_date<=update_date<=&end_date)) out=want;
%end;
or even simpler
proc sort data=ttt(Where=(%if &yymm=2209 %then %do; UPDATE_DATE ne '16MAY2022'd AND %end;
&start_date<=update_date<=&end_date)) out=want;
@Ronein wrote:
In September there were observations of different update_date and I should ignore observations from 16MAY2022.
In other months there is no problem and just need to select all rows with YYMM as user put
I think you mean "In May there were observations of different update_date and I should ignore observations from 16MAY2022" ?
What about variables? Reducing the number of variables through KEEP= or DROP= will reduce the runtime, so see if you need all the variables from the input dataset in your result.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.