Re: The subject of date format - Page 2

Shmuel · Posted 08-03-2017 10:06 AM

@Kurt_Bremser, I liked your trick which is new to me.

Please explain what do you mean by "automatic retain of variables"

and when does it occur? Thanks ahead.

Kurt_Bremser · Posted 08-04-2017 01:55 AM

@Shmuel wrote:

@Kurt_Bremser, I liked your trick which is new to me.

Please explain what do you mean by "automatic retain of variables"

and when does it occur? Thanks ahead.

This is a feature of the data step that does not sound logical at first, but is perfectly logical once one looks into the depth of the data step.

We know that the data step sets all variables to missing at the start of an iteration, with the exception of those that are mentioned in a retain statement. But now think of the typical merge situation:

data a;
input id dat1;
cards;
1 1
;
run;

data b;
input id dat2;
cards;
1 3
1 4
;
run;

data want;
merge
  a
  b
;
by id;
iter = _n_;
run;

proc print data=want noobs;
run;

In the result

id    dat1    dat2    iter

 1      1       3       1 
 1      1       4       2

we can clearly see that dat1 was not missing in the second iteration of the data step. That is not because the data step rereads dataset a, but because the variables from the dataset are not set to missing (and therefore retained); if another observation for id 1 had been present, its values would have overwritten the PDV, but so they simply stay.

Since the set or merge statement also defines the point in time when a dataset is read, manipulating variables before is effective.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

lixuan · Posted 08-03-2017 10:39 AM

I guess that without

roe = .;

, variable A would have retain 0.5, then the second row in want would be

 A  0.5     0.8    1995

I don't know if my understanding is right. Also, I wonder whether format function is necessary? I delete format and the result is same.

SAS Output

roe roa company date

0.5	0.2	A	1995
.	0.8	A	1995
0.6	0.7	B	1998
.	0.5	B	1998

Kurt_Bremser · Posted 08-04-2017 02:02 AM

@lixuan wrote:

I guess that without
roe = .;
, variable A would have retain 0.5, then the second row in want would be
 A  0.5     0.8    1995
I don't know if my understanding is right. Also, I wonder whether format function is necessary? I delete format and the result is same.

SAS Output

roe roa company date

0.5 0.2 A 1995

. 0.8 A 1995

0.6 0.7 B 1998

. 0.5 B 1998

You perfectly understood what's happening.

The format is just there so that the horizontal order of variables(columns) is forced to be the one you showed in your example "want" data. Nothing more, it does not influence the actual formats of the variables, as those are taken from the dataset metadata when the merge statement is compiled.

Note that variables are added to the PDV (program data vector) in the sequence they are encountered by the data step compiler. Attributes that can be changed (like labels or formats, but not the type!) are set from the last statement encountered that explicitly (like a format statement) or implicitly (like a set or merge statement) sets them.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

lixuan · Posted 08-04-2017 02:31 AM

@Kurt_Bremser Thanks, I get your idea and have more comprehension of funtion merge. But I do think the result of merge or proc sql(I prefer this and think it is preciser than data step) is very hard to be controled in the situation of duplicate data.

Shmuel · Posted 08-04-2017 02:50 AM

@Kurt_Bremser, thanks again for your explanation.

I have expanded your example to next code:

data a;
  input id x y;
cards;
1 1 1
2 2 2
2 3 3
;run;
data b;
  input id a b;
cards;
1 5 5
1 6 6
2 7 7
; run;
data tst1;
 format ID x y a b;
 merge a b;
 by id;
run;
data tst2;
 format ID x y a b;
 x=.; a=.;
 merge a b;
 by id;
run;

It seems that:

- SAS checks ID in buffer before reading/copying data to PDV.
- when IDs in PDV of both datasets are equal, but the ID is going to change in one of them,

SAS will read/copy from that dataset where ID doesn't change.

In this case the other PDV retain unchanged.

- Statement given before SET or MERGE are executed before reading/copying the observation to a PDV.

That makes sense.

Kurt_Bremser · Posted 08-04-2017 02:52 AM

@lixuan wrote:

@Kurt_Bremser Thanks, I get your idea and have more comprehension of funtion merge. But I do think the result of merge or proc sql(I prefer this and think it is preciser than data step) is very hard to be controled in the situation of duplicate data.

SQL and the data step use a basically different method for joins. The data step puts the datasets side-by-side and reads sequentially, while SQL always combines all records from a with all records from b, just honoring where or join conditions.

For one-to-many joins, I prefer the data step (better performance with large datasets), while many-to-many usually needs SQL.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Classroom Training Available!