@Kurt_Bremser, I liked your trick which is new to me.
Please explain what do you mean by "automatic retain of variables"
and when does it occur? Thanks ahead.
@Shmuel wrote:
@Kurt_Bremser, I liked your trick which is new to me.
Please explain what do you mean by "automatic retain of variables"
and when does it occur? Thanks ahead.
This is a feature of the data step that does not sound logical at first, but is perfectly logical once one looks into the depth of the data step.
We know that the data step sets all variables to missing at the start of an iteration, with the exception of those that are mentioned in a retain statement. But now think of the typical merge situation:
data a;
input id dat1;
cards;
1 1
;
run;
data b;
input id dat2;
cards;
1 3
1 4
;
run;
data want;
merge
a
b
;
by id;
iter = _n_;
run;
proc print data=want noobs;
run;
In the result
id dat1 dat2 iter 1 1 3 1 1 1 4 2
we can clearly see that dat1 was not missing in the second iteration of the data step. That is not because the data step rereads dataset a, but because the variables from the dataset are not set to missing (and therefore retained); if another observation for id 1 had been present, its values would have overwritten the PDV, but so they simply stay.
Since the set or merge statement also defines the point in time when a dataset is read, manipulating variables before is effective.
I guess that without
roe = .;
, variable A would have retain 0.5, then the second row in want would be
A 0.5 0.8 1995
I don't know if my understanding is right. Also, I wonder whether format function is necessary? I delete format and the result is same.
SAS Output
0.5 | 0.2 | A | 1995 |
. | 0.8 | A | 1995 |
0.6 | 0.7 | B | 1998 |
. | 0.5 | B | 1998 |
@lixuan wrote:
I guess that without
roe = .;
, variable A would have retain 0.5, then the second row in want would be
A 0.5 0.8 1995I don't know if my understanding is right. Also, I wonder whether format function is necessary? I delete format and the result is same.
SAS Output
roe roa company date
0.5 0.2 A 1995 . 0.8 A 1995 0.6 0.7 B 1998 . 0.5 B 1998
You perfectly understood what's happening.
The format is just there so that the horizontal order of variables(columns) is forced to be the one you showed in your example "want" data. Nothing more, it does not influence the actual formats of the variables, as those are taken from the dataset metadata when the merge statement is compiled.
Note that variables are added to the PDV (program data vector) in the sequence they are encountered by the data step compiler. Attributes that can be changed (like labels or formats, but not the type!) are set from the last statement encountered that explicitly (like a format statement) or implicitly (like a set or merge statement) sets them.
@Kurt_Bremser Thanks, I get your idea and have more comprehension of funtion merge. But I do think the result of merge or proc sql(I prefer this and think it is preciser than data step) is very hard to be controled in the situation of duplicate data.
@Kurt_Bremser, thanks again for your explanation.
I have expanded your example to next code:
data a;
input id x y;
cards;
1 1 1
2 2 2
2 3 3
;run;
data b;
input id a b;
cards;
1 5 5
1 6 6
2 7 7
; run;
data tst1;
format ID x y a b;
merge a b;
by id;
run;
data tst2;
format ID x y a b;
x=.; a=.;
merge a b;
by id;
run;
It seems that:
- SAS checks ID in buffer before reading/copying data to PDV.
- when IDs in PDV of both datasets are equal, but the ID is going to change in one of them,
SAS will read/copy from that dataset where ID doesn't change.
In this case the other PDV retain unchanged.
- Statement given before SET or MERGE are executed before reading/copying the observation to a PDV.
That makes sense.
@lixuan wrote:
@Kurt_Bremser Thanks, I get your idea and have more comprehension of funtion merge. But I do think the result of merge or proc sql(I prefer this and think it is preciser than data step) is very hard to be controled in the situation of duplicate data.
SQL and the data step use a basically different method for joins. The data step puts the datasets side-by-side and reads sequentially, while SQL always combines all records from a with all records from b, just honoring where or join conditions.
For one-to-many joins, I prefer the data step (better performance with large datasets), while many-to-many usually needs SQL.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.