Hi everyone,
I got a question about using 2 SETs to combine 2 datasets. My question came from the Little SAS Book (5th ed) on page 12/27: Combining a Grand Total with the Original Data.
To make it easier to understand, I copy the whole programming:
A distributor of athletic shoes is considering doing a special promotion for the top-selling styles. The vice-president of marketing asks you to produce a report showing the percentage of total sales for each style. For each style of shoe the raw data file contains the style name, type of exercise, and sales for the last quarter:
Max Flight running 1930
Zip Fit Leather walking 2250
Zoom Airborne running 4150
Light Step walking 1130
Max Step Woven walking 2230
Zip Sneak c-train 1190
Here is the program:
DATA shoes;
INFILE 'c:\MyRawData\Shoesales.dat';
INPUT Style $ 1-15 ExerciseType $ Sales;
RUN;
* Output grand total of sales to a data set and print;
PROC MEANS NOPRINT DATA = shoes;
VAR Sales;
OUTPUT OUT = summarydata SUM(Sales) = GrandTotal;
RUN;
PROC PRINT DATA = summarydata;
TITLE 'Summary Data Set';
RUN;
* Combine the grand total with the original data;
DATA shoesummary;
IF _N_ = 1 THEN SET summarydata;
SET shoes;
Percent = Sales / GrandTotal * 100;
RUN;
PROC PRINT DATA = shoesummary;
VAR Style ExerciseType Sales GrandTotal Percent;
TITLE 'Overall Sales Share';
RUN;
----------------------
My question is in the last part (Combine the grand total with the original data):
DATA shoesummary;
IF _N_ = 1 THEN SET summarydata;
SET shoes;
Percent = Sales / GrandTotal * 100;
I do'nt quite understand why "IF _N_ = 1 THEN SET summarydata" is used here (although it works beautifully), why don't just simply use my simple version like this:
DATA shoesummary;
SET summarydata;
SET shoes;
Percent = Sales / GrandTotal * 100;
Of course, my simple version does not work.
The dataset summarydata has only 1 line of observation, "IF _N_ = 1 THEN SET summarydata" will copy the whole dataset, my simple version will also copy the whole dataset. They are supposed to be the same, in my opinion, but actually NOT. Then what's the difference between them? is there any special reason to use "IF _N_ = 1 THEN SET summarydata" ?
(if this question sounds to be too silly, please forgive me)
Thanks.
mich_ard,
Consider a simple data step like this:
data new;
set shoes;
amount=100;
run;
The key questions related to your example are how many times does the SET statement execute? And how does this DATA step end?
The SET statement does not read all the data. Rather, it reads a single observation. For that single observation, the DATA step computes AMOUNT and outputs the results. Then the SET statement executes again and gets the second observation. For the second observation the DATA step computes AMOUNT and outputs the result. The SET statement executes many times, each time reading in a single observation.
Eventually, the SET statement fails. It looks for another observation, and determines that there are no more. That's a normal ending to a DATA step to have the SET statement fail.
Now consider your program:
data shoesummary;
set summarydata;
set shoes;
percent = ...;
run;
As this DATA step begins, the first set statement reads in the one and only observation from SUMMARYDATA. Then the second observation reads in the first observation from SHOES. The DATA step computes PERCENT and outputs the results. Then the DATA step continues. The first SET statement searches for a second observation in SUMMARYDATA. It fails to find another observation, at which point the DATA step is over. A SET statement has failed. It doesn't matter that SHOES has not yet been fully read in ... it only matters that a SET statement has failed.
using IF _N_ = 1 wiil read grandtotal into memory once and proceed with next code lines;
using set twice will read the summary dataet again and
wiil get to end-of-file and grandtotal will be set to missing
Your question involved several fundamentals of data step. For details, the best instruction I found is from Ian Whitlock.
In short, 'if _n_=1' demands to read only 1 obs from 'summarydata' without triggering off the 'end of the file' event, so data step can keep moving on reading obs from 'shoes' until it is done. You may have already tested, without 'if _n_=1', data step will stop after outputting 1 obs.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.