BookmarkSubscribeRSS Feed
Nietzsche
Lapis Lazuli | Level 10

So in the base exam prep guide, the first data set combine method is the One-to-One Reading (from here on OTOR), where the second data set overrides the first data set in the same named column. Basic logic is shown below.

 

Nietzsche_1-1670272103214.png

 

In the example, they used two data sets cert.patients and cert.measure.  And they used the IF condition age < 60.

Nietzsche_3-1670272284368.png

Anyways, the patients data set is not sorted, so you'll have to sort it by ID first to get the same result as the book.

So the code should be

proc sort data=spg.patients;by id; run;
proc print data=spg.patients;run;
proc print data=spg.measure;run;

data one2one;
set spg.patients;
if age<60;
set spg.measure;
run;

proc print data=one2one;run;

And result will be..

Nietzsche_5-1670273094729.png

 

Anyways regardless, where is the point of OTOR? The only reason why the combined result is correct (correct as in that all 9 obs in the combined set have correct combined sex age height and weight values) is because of the condition IF age < 60.

 

But the author NEVER explained how he got the age < 60, it just pop up.

And if you put any other conditions or no conditions at all, you would get a messed up result with obs having incorrect values.

 

Eg, no IF condition

proc sort data=spg.patients;by id; run;
data one2one;
set spg.patients; set spg.measure; run; proc print data=one2one;run;

you get 

Nietzsche_7-1670273434805.png

you can see the obs from ID 8045 are all messed up.

 

 

Or if you change from age < 60 to age < 50, you'll get..

Nietzsche_8-1670273562976.png

still messed up.

 

So I don't get the point or use of OTOR in real life. When would you want to have the second data set to over write variables in the first data set when combined them and get all messed up observations.

 

I have attached two SAS files for the data sets if you wish to replicate the results.

 

 

 

SAS Base Programming (2022 Dec), Preparing for SAS Advanced Programming (Cancelled).
5 REPLIES 5
Astounding
PROC Star

In real life, these examples are only mildly realistic.  The most common use for multiple SET statements is to add a single observation in one data set to each observation in another data set.  For example:

data want;
   set many;
   if _n_ = 1 then set one;
run;

Here, the data set ONE contains one observation.  The data set MANY contains many observations.  The net result:  The observation from ONE gets read in by the second SET statement, and its values are automatically retained as the DATA step reads observations one by one from MANY.

Tom
Super User Tom
Super User

I suspect the main point of that is to remind you that SAS data steps normally end when it reads past the end of the input.  At least that is the most important thing I see in that article.

 

The only reason to actually do something like that it real life is if somehow you got given two datasets that you KNOW match row to row, but there is no common ID variable (or set of variables) you could use to MERGE the datasets.

 

Unfortunately this type of strangeness might actually be more common now that users are used to systems like R and Python that let you create independent variables instead of dataset (or "data frames").  So to combine them and see the relationship between AGE and HEIGHT for example you would need to combine the variable values one by one in the way the article you shared presents.

gema
Calcite | Level 5

there is no point.  you can do a one to one match where there are no identical identifiers.  junk data

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Nietzsche 

 

There might - as explained by others - be a use of this simple form of one-to-one reading in some special cases. I have written SAS programs since 1980, but never needed it. But it is important to know what's happening in a Data Step, and this guide exemplifies the basic behaviour of a data step including some tricky details, that might confuse a newbie, but it doesn't really explain what is going on.

 

Per default, a Data Step containing a SET statement performs a loop over input. When a SET statement is executed, the next observation is read into the Program Data Vector (PDV), and this continues until the last observation is read. At the end of each iteration, all variables in the PDV (with some exceptions) are written to output, and then the PDV is cleared. If the current observation is the last, the Step stops.

 

Now, what happens, if there are more than one SET statement? - The first SET statement reads all variables from the current input observation into the PDV. Then the next SET Statement adds all variables from the current input observation to the PDV, overwriting all values that might exist in the PDV with the same variable name. When the last observation in any of the input datasets is processed, the step stops.

 

So in youe example, you get Sex and Age from the Patients dataset, and ID, Height and Weight from the Measure dataset. The ID variable in the PDV is taken from Measure, because the ID already read into the PDV from Patients is overwritten. Since the ID's in the 2 input datasets are not aligned 1:1, the output is garbled, because Sex and Age comes from another ID. 

 

The ID 9125 has correct values, because it happens to be observation 9 in both data sets, but 8125 is number 7 in the Measures dataset, but get Sex and Age from ID 4759. So the example code warns about things that could go really wrong. One might never notice the problem if there had been 11 observatiuons in both datasets. Note also, that there are only 9 observations in output, because one of the input data sets has only 9 observations, which causes the step to stop. 

 

Common coding practice would be to use a MERGE statement instead of the two SET Statements. MERGE requires the input datasets to be sorted on the ID variable, which is used as a BY variable with the MERGE Statement in a Data Step. But there are cases where more than one SET statement is very useful, like using 2 SET Statements referring the SAME dataset, but with different starting point, to perform a look-ahead by getting values from the next observation to use in conjunction with the current observation. But then it is not a "blind" one-to-one reading as exemplified here, but controlled by different data set options, automatic end-variable and conditional execution of the second set statement.

 

 

 

 

 

Quentin
Super User

I can't think of a time when I've used the method shown here to combine data.

 

But, agreeing with others, it's helpful as an example of "interesting things you can do with the DATA step."  There are many uses to having  multiple SET statements in a step.  I would expect there be to several examples like this in a data step class.

 

As Astounding wrote, the use of if _N_=1 then set  to read a one-observation dataset is common.

 

If you want to create a cartesian production in the DATA step, you can't do it with merge, but you can with two SET statements.

 

The double-DOW-loop also relies on multiple SET statements.

 

When I started using SAS, I wasn't a programmer, I didn't think of it as programming.  I thought of the DATA step as being for data cleaning and calculating new variables.  Then when I learned about the PDV, and saw steps with multiple SET statements, and DOW loops, and statements before the SET statement, I slowly learned DATA step programming. So even if an example like this isn't a useful coding pattern, understanding how this example works will help you in understanding other more complex coding patterns that are useful.  Understanding the limitations, as you've pointed out, is also useful.

 

As an aside, I wouldn't think it would be a bad interview question: "How would you do a 1:1 merge using a DATA step that does not use the MERGE statement"?

 

Or, When would the results of the following steps differ?

 

A) 
data want1;
  set a;
  set b;
run;

B) 
options mergenoby=nowarn;
data want2;
   merge a b;
run;

I would expect an experienced SAS programmer to understand how the DATA step works, and be comfortable with this sort of stuff.

 

 

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 565 views
  • 7 likes
  • 6 in conversation