Solved: Data step statement or Data step option

Ronein · Posted 09-06-2021 12:55 AM

Hello,

What is more efficient between the 3 options:

1- Using Data step statement

2-Using Data step option on input data set

2-Using Data step option on output data set

Which option you prefer to apply on your program?

Is there any article about it?

Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;


DATA want1; 
SET have(keep=idnum salary);
bonus=salary*1.1; 
RUN;

DATA want2(keep=idnum salary bonus); 
SET have;
bonus=salary*1.1; 
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
Keep idnum salary bonus;
RUN;

thanks

yabwon · Posted 09-06-2021 01:56 AM

If you add some `PUT _ALL_;` to your code you will see the difference at once.

Code:

Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;


DATA want1; 
SET have(keep=idnum salary);
bonus=salary*1.1; 
put _all_;
RUN;

DATA want2(keep=idnum salary bonus); 
SET have;
bonus=salary*1.1; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
keep idnum salary bonus;
put _all_;
RUN;

Log:

1    Data have;
2    Input idnum salary age status $;
3    cards;

NOTE: The data set WORK.HAVE has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds


7    ;
8    Run;
9
10
11   DATA want1;
12   SET have(keep=idnum salary);
13   bonus=salary*1.1;
14   put _all_;
15   RUN;

idnum=1 salary=1000 bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT1 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


16
17   DATA want2(keep=idnum salary bonus);
18   SET have;
19   bonus=salary*1.1;
20   put _all_;
21   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


22
23   DATA want3;
24   SET have;
25   bonus=salary*1.1;
26   keep idnum salary bonus;
27   put _all_;
28   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT3 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

The `KEEP=` data set option on the INPUT reduces data amount processed at the very beginning of the process (code: want1). The other two (want2, want3) reduces data on the end (in the output process). Imagine you have several thousands variables in your data set, with last two approach your code will read them all into the PDV processing.

The difference between last two can be seen with the following code:

DATA want2(keep=idnum salary BON); 
SET have;
bonus=salary*1.1;
rename bonus = BON; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
rename bonus = BON; 
keep idnum salary bonus;
put _all_;
RUN;

In the want2 you can refer to the renamed variable, in the want3 you have to use the original name of renamed variable.

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

japelin · Posted 09-06-2021 01:21 AM

By using dataset option in set statement(want1), the variables age and status are not accessible in the data step.

2 and 3 are almost the same.

yabwon · Posted 09-06-2021 01:56 AM

If you add some `PUT _ALL_;` to your code you will see the difference at once.

Code:

Data have;
Input idnum salary age status $;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;


DATA want1; 
SET have(keep=idnum salary);
bonus=salary*1.1; 
put _all_;
RUN;

DATA want2(keep=idnum salary bonus); 
SET have;
bonus=salary*1.1; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
keep idnum salary bonus;
put _all_;
RUN;

Log:

1    Data have;
2    Input idnum salary age status $;
3    cards;

NOTE: The data set WORK.HAVE has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds


7    ;
8    Run;
9
10
11   DATA want1;
12   SET have(keep=idnum salary);
13   bonus=salary*1.1;
14   put _all_;
15   RUN;

idnum=1 salary=1000 bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT1 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


16
17   DATA want2(keep=idnum salary bonus);
18   SET have;
19   bonus=salary*1.1;
20   put _all_;
21   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


22
23   DATA want3;
24   SET have;
25   bonus=salary*1.1;
26   keep idnum salary bonus;
27   put _all_;
28   RUN;

idnum=1 salary=1000 age=20 status=S bonus=1100 _ERROR_=0 _N_=1
idnum=2 salary=2000 age=40 status=M bonus=2200 _ERROR_=0 _N_=2
idnum=3 salary=3000 age=30 status=D bonus=3300 _ERROR_=0 _N_=3
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT3 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

The `KEEP=` data set option on the INPUT reduces data amount processed at the very beginning of the process (code: want1). The other two (want2, want3) reduces data on the end (in the output process). Imagine you have several thousands variables in your data set, with last two approach your code will read them all into the PDV processing.

The difference between last two can be seen with the following code:

DATA want2(keep=idnum salary BON); 
SET have;
bonus=salary*1.1;
rename bonus = BON; 
put _all_;
RUN;

DATA want3; 
SET have;
bonus=salary*1.1; 
rename bonus = BON; 
keep idnum salary bonus;
put _all_;
RUN;

In the want2 you can refer to the renamed variable, in the want3 you have to use the original name of renamed variable.

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Patrick · Posted 09-06-2021 03:21 AM

Option 1 only adds two variable to the PDV when reading the data while options 2 & 3 define what variables get kept when writing the data.

Option 1 will be more efficient, options 2 and 3 are more or less the same.

ballardw · Posted 09-06-2021 12:00 PM

How about making a larger data set, say 100,000 or more records(trivial with a do loop with an output statemen added to your data step), turn on FULLSTIMER system option to get more time details on runs and then run the option code yourself? Then you can see the actual affect on your system with your settings.

Ronein · Posted 09-06-2021 03:10 PM

May you please show the full code and explain the results ,thanks

ballardw · Posted 09-06-2021 04:57 PM

@Ronein wrote:

May you please show the full code and explain the results ,thanks

To make a larger data set with more records: (Like I said, trivial)

Data have;
Input idnum salary age status $;
do i=1 to 50000;
   output;
end;
cards;
1 1000 20 S
2 2000 40 M
3 3000 30 D
;
Run;

to turn on more timer information before running the data step to test the options:

options fullstimer;

The log will show more information about time spent.

To turn of use NOFULLSTIMER/

Data step statement or Data step option

Re: Data step statement or Data step option

Re: Data step statement or Data step option

Re: Data step statement or Data step option

Re: Data step statement or Data step option

Re: Data step statement or Data step option

Re: Data step statement or Data step option

Re: Data step statement or Data step option

Classroom Training Available!