BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Ronein
Meteorite | Level 14

Hello

I want to create a new data set  from existing data set.

The existing data set is very big and contains 100 million rows.

It is taking long long time to run proc sort+using where conditions ,

What do you think is better way to run ? Way1 or way2 ? or maybe there is another way that can help to reduce run time?

Sorry ta=hat I cannot send the data

 

/****Way1*****/
%let YYMM=2209;

proc sort data=ttt(Where=(UPDATE_DATE  ne '16MAY2022'd  AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.))  out=want;
by   UPDATE_DATE    FK_A       FK_B    REFERENCE_DATE  ;
Run;


/****Way2*****/
%let YYMM=2209;

Data temp;
SET ttt(Where=(UPDATE_DATE  ne '16MAY2022'd  AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.));
Run;

proc sort data=temp out=want;
by   UPDATE_DATE    FK_A       FK_B    REFERENCE_DATE  ;
Run;
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
Do NOT use PUT() and INPUT() function in your code , Try other function Like :

UPDATE_DATE ne '16MAY2022'd AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.
---->
year(UPDATE_DATE)=2022 AND month(UPDATE_DATE)=9

View solution in original post

14 REPLIES 14
Ksharp
Super User
Do NOT use PUT() and INPUT() function in your code , Try other function Like :

UPDATE_DATE ne '16MAY2022'd AND input(put(UPDATE_DATE,yymmn4.),4.)=&YYMM.
---->
year(UPDATE_DATE)=2022 AND month(UPDATE_DATE)=9
Ronein
Meteorite | Level 14

Your solution is perfect,

It still takes a few minutes to run it but before it run very long time

Quentin
Super User

Sorting is slow. First I would consider whether you really need to sort.  Perhaps a hash-table or some other method might be an alternative.

 

I would think (hope) that the PROC SORT with the WHERE would be faster than using two steps.  But in both approaches, assuming your data is not indexed, you need to read all 100m rows just to select the rows you want.

 

The use of INPUT(PUT()) might be expensive.  It looks like you are selecting all the September 2022 updates?  I would try:


proc sort data=ttt(Where=('01Sep2022'd<=UPDATE_DATE<='30Sep2022'd)) out=want;
by   UPDATE_DATE    FK_A       FK_B    REFERENCE_DATE  ;
Run;

 

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
LinusH
Tourmaline | Level 20

Way 1 should be the most efficient, only one pass of the data.

Question, what are the constraints in your system?

No of cores, system memory?

Your settings of MEMESIZE and SORTSIZE?

If you could sent the log with 

options stimer msglevel=i;

Generally speaking, you want to avoid too many function calls in your where clauses. Maybe you can use BETWEEN... AND.. logic instead?

What is the ratio of your subset? If less than 10% you could consider indexing on update_date.

Data never sleeps
Ronein
Meteorite | Level 14
1 The SAS System 14:41 Monday, October 31, 2022

1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 options stimer msglevel=i;
27
28 GOPTIONS NOACCESSIBLE;
29 %LET _CLIENTTASKLABEL=;
30 %LET _CLIENTPROCESSFLOWNAME=;
31 %LET _CLIENTPROJECTPATH=;
32 %LET _CLIENTPROJECTPATHHOST=;
33 %LET _CLIENTPROJECTNAME=;
34 %LET _SASPROGRAMFILE=;
35 %LET _SASPROGRAMFILEHOST=;
36
37 ;*';*";*/;quit;run;
38 ODS _ALL_ CLOSE;
39
40
41 QUIT; RUN;
42
japelin
Rhodochrosite | Level 12

Maybe Way1 is faster, but why don't you try it anyway?

 

Also...try 

-Increase MEMSIZE and REALMEMSIZE allocations.

-Use TAG sorting.

-Get better hardware.

PaigeMiller
Diamond | Level 26

Way 1 requires one data set to be written, Way 2 requires 2 data sets to be written. I would guess that Way 1 wins in speed just for that reason.

 

Why the manipulation of dates with INPUT(PUT())? That's got to be slower than just doing tests on the variable UPDATE_DATE. Compute &start_date and &end_date from %let YYMM=2209;

 

%let start_date='01SEP2022'd;
%let end_date='30SEP2022'd;

proc sort data=ttt(Where=(&start_date<=update_date<=&end_date)) out=want;

  

 

Don't do this part:

UPDATE_DATE  ne '16MAY2022'd

unless &YYMM = 2205

--
Paige Miller
Ronein
Meteorite | Level 14

In September there were observations of  different update_date and I should ignore observations from 16MAY2022.

In other months there is no problem and just need to select all rows with YYMM  as user put

PaigeMiller
Diamond | Level 26

@Ronein wrote:

In September there were observations of  different update_date and I should ignore observations from 16MAY2022.

In other months there is no problem and just need to select all rows with YYMM  as user put


Thanks for the explanation. You can still create macro code that only does the test for 16MAY2022 for the month where it is a problem and not do the test for months when it is not a problem. This should help the speed issue. Perhaps you have done that already in your real code.

--
Paige Miller
Ronein
Meteorite | Level 14
May you please show code?
PaigeMiller
Diamond | Level 26

Something like this:

 

%let YYMM=2209;
data _null_;
    call symputx('start_date',input("&yymm",yymmn4.));
    call symputx('end_date',intnx('month',input("&yymm",yymmn4.),0,'e'));
run;

%if &yymm=2209 %then %do;
proc sort data=ttt(Where=(UPDATE_DATE ne '16MAY2022'd AND &start_date<=update_date<=&end_date)) out=want;
%end;
%else %do;
proc sort data=ttt(Where=(&start_date<=update_date<=&end_date)) out=want;
%end;

 

or even simpler

 

proc sort data=ttt(Where=(%if &yymm=2209 %then %do; UPDATE_DATE ne '16MAY2022'd AND %end; 
    &start_date<=update_date<=&end_date)) out=want;
--
Paige Miller
Quentin
Super User

@Ronein wrote:

In September there were observations of  different update_date and I should ignore observations from 16MAY2022.

In other months there is no problem and just need to select all rows with YYMM  as user put


I think you mean "In May there were observations of  different update_date and I should ignore observations from 16MAY2022" ?

 

 

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
Kurt_Bremser
Super User

What about variables? Reducing the number of variables through KEEP= or DROP= will reduce the runtime, so see if you need all the variables from the input dataset in your result.

Ronein
Meteorite | Level 14
Sure but in this case the main problem was not due to number of variables....

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1799 views
  • 4 likes
  • 7 in conversation