Hello,
What is more efficient between the 3 options:
1- Using Data step statement
2-Using Data step option on input data set
2-Using Data step option on output data set
Which option you prefer to apply on your program?
Is there any article about it?
Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;
DATA want1;
SET have(keep=idnum salary);
bonus=salary*1.1;
RUN;
DATA want2(keep=idnum salary bonus);
SET have;
bonus=salary*1.1;
RUN;
DATA want3;
SET have;
bonus=salary*1.1;
Keep idnum salary bonus;
RUN;
thanks
If you add some `PUT _ALL_;` to your code you will see the difference at once.
Code:
Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;
DATA want1;
SET have(keep=idnum salary);
bonus=salary*1.1;
put _all_;
RUN;
DATA want2(keep=idnum salary bonus);
SET have;
bonus=salary*1.1;
put _all_;
RUN;
DATA want3;
SET have;
bonus=salary*1.1;
keep idnum salary bonus;
put _all_;
RUN;
Log:
1 Data have; 2 Input idnum salary age status $; 3 cards; NOTE: The data set WORK.HAVE has 3 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds cpu time 0.00 seconds 7 ; 8 Run; 9 10 11 DATA want1; 12 SET have(keep=idnum salary); 13 bonus=salary*1.1; 14 put _all_; 15 RUN; idnum=1 salary=1000 bonus=1100 _ERROR_=0 _N_=1 idnum=2 salary=2000 bonus=2200 _ERROR_=0 _N_=2 idnum=3 salary=3000 bonus=3300 _ERROR_=0 _N_=3 NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT1 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 16 17 DATA want2(keep=idnum salary bonus); 18 SET have; 19 bonus=salary*1.1; 20 put _all_; 21 RUN; idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1 idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2 idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3 NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT2 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 22 23 DATA want3; 24 SET have; 25 bonus=salary*1.1; 26 keep idnum salary bonus; 27 put _all_; 28 RUN; idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1 idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2 idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3 NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT3 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
The `KEEP=` data set option on the INPUT reduces data amount processed at the very beginning of the process (code: want1). The other two (want2, want3) reduces data on the end (in the output process). Imagine you have several thousands variables in your data set, with last two approach your code will read them all into the PDV processing.
The difference between last two can be seen with the following code:
DATA want2(keep=idnum salary BON);
SET have;
bonus=salary*1.1;
rename bonus = BON;
put _all_;
RUN;
DATA want3;
SET have;
bonus=salary*1.1;
rename bonus = BON;
keep idnum salary bonus;
put _all_;
RUN;
In the want2 you can refer to the renamed variable, in the want3 you have to use the original name of renamed variable.
All the best
Bart
By using dataset option in set statement(want1), the variables age and status are not accessible in the data step.
2 and 3 are almost the same.
If you add some `PUT _ALL_;` to your code you will see the difference at once.
Code:
Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;
DATA want1;
SET have(keep=idnum salary);
bonus=salary*1.1;
put _all_;
RUN;
DATA want2(keep=idnum salary bonus);
SET have;
bonus=salary*1.1;
put _all_;
RUN;
DATA want3;
SET have;
bonus=salary*1.1;
keep idnum salary bonus;
put _all_;
RUN;
Log:
1 Data have; 2 Input idnum salary age status $; 3 cards; NOTE: The data set WORK.HAVE has 3 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds cpu time 0.00 seconds 7 ; 8 Run; 9 10 11 DATA want1; 12 SET have(keep=idnum salary); 13 bonus=salary*1.1; 14 put _all_; 15 RUN; idnum=1 salary=1000 bonus=1100 _ERROR_=0 _N_=1 idnum=2 salary=2000 bonus=2200 _ERROR_=0 _N_=2 idnum=3 salary=3000 bonus=3300 _ERROR_=0 _N_=3 NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT1 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 16 17 DATA want2(keep=idnum salary bonus); 18 SET have; 19 bonus=salary*1.1; 20 put _all_; 21 RUN; idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1 idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2 idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3 NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT2 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 22 23 DATA want3; 24 SET have; 25 bonus=salary*1.1; 26 keep idnum salary bonus; 27 put _all_; 28 RUN; idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1 idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2 idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3 NOTE: There were 3 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT3 has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
The `KEEP=` data set option on the INPUT reduces data amount processed at the very beginning of the process (code: want1). The other two (want2, want3) reduces data on the end (in the output process). Imagine you have several thousands variables in your data set, with last two approach your code will read them all into the PDV processing.
The difference between last two can be seen with the following code:
DATA want2(keep=idnum salary BON);
SET have;
bonus=salary*1.1;
rename bonus = BON;
put _all_;
RUN;
DATA want3;
SET have;
bonus=salary*1.1;
rename bonus = BON;
keep idnum salary bonus;
put _all_;
RUN;
In the want2 you can refer to the renamed variable, in the want3 you have to use the original name of renamed variable.
All the best
Bart
Option 1 only adds two variable to the PDV when reading the data while options 2 & 3 define what variables get kept when writing the data.
Option 1 will be more efficient, options 2 and 3 are more or less the same.
How about making a larger data set, say 100,000 or more records(trivial with a do loop with an output statemen added to your data step), turn on FULLSTIMER system option to get more time details on runs and then run the option code yourself? Then you can see the actual affect on your system with your settings.
May you please show the full code and explain the results ,thanks
@Ronein wrote:
May you please show the full code and explain the results ,thanks
To make a larger data set with more records: (Like I said, trivial)
Data have;
Input idnum salary age status $;
do i=1 to 50000;
output;
end;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;
to turn on more timer information before running the data step to test the options:
options fullstimer;
The log will show more information about time spent.
To turn of use NOFULLSTIMER/
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.