BookmarkSubscribeRSS Feed
Shmuel
Garnet | Level 18

@Kurt_Bremser, I liked your trick which is new to me.

Please explain what do you mean by "automatic retain of variables" 

and when does it occur? Thanks ahead.

Kurt_Bremser
Super User

@Shmuel wrote:

@Kurt_Bremser, I liked your trick which is new to me.

Please explain what do you mean by "automatic retain of variables" 

and when does it occur? Thanks ahead.


This is a feature of the data step that does not sound logical at first, but is perfectly logical once one looks into the depth of the data step.

 

We know that the data step sets all variables to missing at the start of an iteration, with the exception of those that are mentioned in a retain statement. But now think of the typical merge situation:

data a;
input id dat1;
cards;
1 1
;
run;

data b;
input id dat2;
cards;
1 3
1 4
;
run;

data want;
merge
  a
  b
;
by id;
iter = _n_;
run;

proc print data=want noobs;
run;

In the result

id    dat1    dat2    iter

 1      1       3       1 
 1      1       4       2 

we can clearly see that dat1 was not missing in the second iteration of the data step. That is not because the data step rereads dataset a, but because the variables from the dataset are not set to missing (and therefore retained); if another observation for id 1 had been present, its values would have overwritten the PDV, but so they simply stay.

 

Since the set or merge statement also defines the point in time when a dataset is read, manipulating variables before is effective.

lixuan
Obsidian | Level 7

I guess that without 

roe = .;

 , variable A would have retain 0.5, then the second row in want would be

 A  0.5     0.8    1995

 I don't know if my understanding is right. Also, I wonder whether  format function is necessary? I delete format and the result is same.

SAS Output

roe roa company date
0.50.2A1995
.0.8A1995
0.60.7B1998
.0.5B1998

 

 
Kurt_Bremser
Super User

@lixuan wrote:

I guess that without 

roe = .;

 , variable A would have retain 0.5, then the second row in want would be

 A  0.5     0.8    1995

 I don't know if my understanding is right. Also, I wonder whether  format function is necessary? I delete format and the result is same.

SAS Output

roe roa company date
0.5 0.2 A 1995
. 0.8 A 1995
0.6 0.7 B 1998
. 0.5 B 1998

 

 

You perfectly understood what's happening.

The format is just there so that the horizontal order of variables(columns) is forced to be the one you showed in your example "want" data. Nothing more, it does not influence the actual formats of the variables, as those are taken from the dataset metadata when the merge statement is compiled.

 

Note that variables are added to the PDV (program data vector) in the sequence they are encountered by the data step compiler. Attributes that can be changed (like labels or formats, but not the type!) are set from the last statement encountered that explicitly (like a format statement) or implicitly (like a set or merge statement) sets them.

lixuan
Obsidian | Level 7

@Kurt_Bremser Thanks, I get your idea and have more comprehension of funtion merge. But I do think the result of  merge or proc sql(I prefer this and think it is preciser than data step) is very hard to be controled in the situation of duplicate data.

Shmuel
Garnet | Level 18

@Kurt_Bremser, thanks again for your explanation.

I have expanded your example to next code:

data a;
  input id x y;
cards;
1 1 1
2 2 2
2 3 3
;run;
data b;
  input id a b;
cards;
1 5 5
1 6 6
2 7 7
; run;
data tst1;
 format ID x y a b;
 merge a b;
 by id;
run;
data tst2;
 format ID x y a b;
 x=.; a=.;
 merge a b;
 by id;
run;

It seems that:

  -  SAS checks ID in buffer before reading/copying data to PDV.
  - when IDs in PDV of both datasets are equal, but the ID is going to change in one of them,

    SAS will read/copy from that dataset where ID doesn't change.

    In this case the other PDV retain unchanged.

  - Statement given before SET or MERGE are executed before reading/copying the observation to a PDV.

 

That makes sense.

Kurt_Bremser
Super User

@lixuan wrote:

@Kurt_Bremser Thanks, I get your idea and have more comprehension of funtion merge. But I do think the result of  merge or proc sql(I prefer this and think it is preciser than data step) is very hard to be controled in the situation of duplicate data.


SQL and the data step use a basically different method for joins. The data step puts the datasets side-by-side and reads sequentially, while SQL always combines all records from a with all records from b, just honoring where or join conditions.

 

For one-to-many joins, I prefer the data step (better performance with large datasets), while many-to-many usually needs SQL.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 1831 views
  • 2 likes
  • 5 in conversation