BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ronein
Meteorite | Level 14

Hello,

What is more efficient between the 3 options:

1- Using Data step statement

2-Using Data step option on input data set

2-Using Data step option on output data set

 

Which option you prefer to apply on your program?

Is there any article about it?

 

Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;


DATA want1; 
SET have(keep=idnum salary);
bonus=salary*1.1; 
RUN;

DATA want2(keep=idnum salary bonus); 
SET have;
bonus=salary*1.1; 
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
Keep idnum salary bonus;
RUN;

thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

If you add some `PUT _ALL_;` to your code you will see the difference at once.

Code:

Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;


DATA want1; 
SET have(keep=idnum salary);
bonus=salary*1.1; 
put _all_;
RUN;

DATA want2(keep=idnum salary bonus); 
SET have;
bonus=salary*1.1; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
keep idnum salary bonus;
put _all_;
RUN;

Log:

1    Data have;
2    Input idnum salary age status $;
3    cards;

NOTE: The data set WORK.HAVE has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds


7    ;
8    Run;
9
10
11   DATA want1;
12   SET have(keep=idnum salary);
13   bonus=salary*1.1;
14   put _all_;
15   RUN;

idnum=1 salary=1000 bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT1 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


16
17   DATA want2(keep=idnum salary bonus);
18   SET have;
19   bonus=salary*1.1;
20   put _all_;
21   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


22
23   DATA want3;
24   SET have;
25   bonus=salary*1.1;
26   keep idnum salary bonus;
27   put _all_;
28   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT3 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

The `KEEP=` data set option on the INPUT reduces data amount processed at the very beginning of the process (code: want1). The other two (want2, want3) reduces data on the end (in the output process). Imagine you have several thousands variables in your data set, with last two approach your code will read them all into the PDV processing.

 

The difference between last two can be seen with the following code:

DATA want2(keep=idnum salary BON); 
SET have;
bonus=salary*1.1;
rename bonus = BON; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
rename bonus = BON; 
keep idnum salary bonus;
put _all_;
RUN;

In the want2 you can refer to the renamed variable, in the want3 you have to use the original name of renamed variable.

 

All the best

Bart

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

6 REPLIES 6
japelin
Rhodochrosite | Level 12

By using dataset option in set statement(want1), the variables age and status are not accessible in the data step.

2 and 3 are almost the same.

 

yabwon
Onyx | Level 15

If you add some `PUT _ALL_;` to your code you will see the difference at once.

Code:

Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;


DATA want1; 
SET have(keep=idnum salary);
bonus=salary*1.1; 
put _all_;
RUN;

DATA want2(keep=idnum salary bonus); 
SET have;
bonus=salary*1.1; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
keep idnum salary bonus;
put _all_;
RUN;

Log:

1    Data have;
2    Input idnum salary age status $;
3    cards;

NOTE: The data set WORK.HAVE has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds


7    ;
8    Run;
9
10
11   DATA want1;
12   SET have(keep=idnum salary);
13   bonus=salary*1.1;
14   put _all_;
15   RUN;

idnum=1 salary=1000 bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT1 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


16
17   DATA want2(keep=idnum salary bonus);
18   SET have;
19   bonus=salary*1.1;
20   put _all_;
21   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


22
23   DATA want3;
24   SET have;
25   bonus=salary*1.1;
26   keep idnum salary bonus;
27   put _all_;
28   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT3 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

The `KEEP=` data set option on the INPUT reduces data amount processed at the very beginning of the process (code: want1). The other two (want2, want3) reduces data on the end (in the output process). Imagine you have several thousands variables in your data set, with last two approach your code will read them all into the PDV processing.

 

The difference between last two can be seen with the following code:

DATA want2(keep=idnum salary BON); 
SET have;
bonus=salary*1.1;
rename bonus = BON; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
rename bonus = BON; 
keep idnum salary bonus;
put _all_;
RUN;

In the want2 you can refer to the renamed variable, in the want3 you have to use the original name of renamed variable.

 

All the best

Bart

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Patrick
Opal | Level 21

Option 1 only adds two variable to the PDV when reading the data while options 2 & 3 define what variables get kept when writing the data.

Option 1 will be more efficient, options 2 and 3 are more or less the same. 

ballardw
Super User

How about making a larger data set, say 100,000 or more records(trivial with a do loop with an output statemen added to your data step), turn on FULLSTIMER system option to get more time details on runs and then run the option code yourself? Then you can see the actual affect on your system with your settings.

Ronein
Meteorite | Level 14

May you please show the full code and explain the results ,thanks

ballardw
Super User

@Ronein wrote:

May you please show the full code and explain the results ,thanks


To make a larger data set with more records: (Like I said, trivial)

Data have;
Input idnum salary age status $;
do i=1 to 50000;
   output;
end;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;

to turn on more timer information before running the data step to test the options:

options fullstimer;

The log will show more information about time spent.

To turn of use NOFULLSTIMER/

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 602 views
  • 5 likes
  • 5 in conversation