Re: Collapsing data - RETAIN statement doesn't seem to retain

enginemane44 · Posted 10-21-2010 10:14 AM

All,
I have data set and program as below :
data one;
input id$ qn a1 b1 c1 a2 b2 c2 a3 b3 c3;
datalines;
001 1 1 2 3 . . . . . .
001 2 . . . 2 7 2 . . .
001 3 . . . . . . 2 4 6
002 1 1 1 1 . . . . . .
002 2 . . . 2 3 4 . . .
002 3 . . . . . . 2 3 2
;
data stack;
set one;
retain a1--c3; * Accumulate valid data from line to line in old data set ;
array cp[9] a1--c3;
obsno = _N_;
if qn = 3 then do; * If last ob for a particular section - output and reset ;
output;
do x = 1 to 9;
cp =.;
end;
end;
run;
proc print data = stack;
var a1--c3;
title 'Stacked data set';
run;
I'm trying to get is this :
001 1 2 3 2 7 2 2 4 6
002 1 1 1 2 3 4 2 3 2
What I get is this:
001 . . . . . . 2 4 6
002 . . . . . . 2 3 2
Apparently, RETAIN is allowing valid values to be overwritten by missing values, which is what I'm trying to avoid. I think it has to do with how the PDV reads in and handles the data.
Does anybody have an alternate way to do this ? I wish there were a SUPERRETAIN in SAS like the %SUPERQ macro function...
Any help would be much appreciated. I searched the archives for tips, but couldn't find anything I could use for this problem.
Barry Walton
Barry.Walton@millersville.edu

sbb · Posted 10-21-2010 10:36 AM

Consider that the RETAIN statement is used to manage previously undeclared SAS variables (not occurring in an incoming "SET" file) -- no question, your SET statement will replace same-named SAS variables, when executed in your DATA step.

Given the desired result you conveyed, I suspect you will need to define parallel-named SAS variables or consider using PROC SUMMARY with OUTPUT statement and MAX(...).

Also, rather than hardcoding array logic, I would recommend using the DIM() function with any applicable ARRAY element-range or max-element SAS programming logic, as with a DO/END code paragraph.

Scott Barry
SBBWorks, Inc. Message was edited by: sbb

chang_y_chung_hotmail_com · Posted 10-21-2010 11:37 AM

The retain *does* retain the values. The set, however, loads a new observation every implicit data step iteration. As it does, it overwrites the retained values.

You need a separate array that is retained. As it turns out, temporary arrays are automatically retained. Using one, you can do this easily like so:
[pre]
data one;
input id $ qn a1 b1 c1 a2 b2 c2 a3 b3 c3;
datalines;
001 1 1 2 3 . . . . . .
001 2 . . . 2 7 2 . . .
001 3 . . . . . . 2 4 6
002 1 1 1 1 . . . . . .
002 2 . . . 2 3 4 . . .
002 3 . . . . . . 2 3 2
;
run;

data stack;
set one;
by id qn;
array old[1:3,1:3] a1--c3;
array new[1:3,1:3] _temporary_; /* automatically retained */

if first.id then do;
do i = 1 to 3; do j = 1 to 3; new[i,j] = .; end; end;
end;

do j = 1 to 3;
new[qn,j] = old[qn,j];
end;

if last.id then do;
do i = 1 to 3; do j = 1 to 3; old[i,j] = new[i,j]; end; end;
output;
keep id a1--c3;
end;
run;

/* check */
title 'Stacked data set';
proc print data = stack noobs;
var id a1--c3;
run;
title;
/* on lst
Stacked data set
id a1 b1 c1 a2 b2 c2 a3 b3 c3
001 1 2 3 2 7 2 2 4 6
002 1 1 1 2 3 4 2 3 2
*/
[/pre]

If you get more experienced, you will have to eventually learn how to use the ever-popular DoW loop (see
http://www.devenezia.com/papers/other-authors/sesug-2002/TheMagnificentDO.pdf) and do something like below. Notice how nicely the code matches the programming logic: initialization comes first, then processing of the observations within a given by-group, followed by the processing to be done when the by-group has ended.

DoW also frees us from retaining and explicitly outputting. Can you see why?
[pre]
data stack2;
if 0 then set one; /* to prep pdv */
array old[1:3,1:3] a1--c3;
array new[1:3,1:3] t1-t9;

/* initialize new */
call missing(of t:);

/* DoW */
do until (last.id);
set one;
by id qn;
do j = 1 to 3;
new[qn,j] = old[qn,j];
end;
end;

/* output */
do i = 1 to 3; do j = 1 to 3; old[i,j] = new[i,j]; end; end;
keep id a1--c3;
run;

/* check */
title stack2;
proc print data=stack2 noobs;
run;
title;
/* on lst
stack2
id a1 b1 c1 a2 b2 c2 a3 b3 c3
001 1 2 3 2 7 2 2 4 6
002 1 1 1 2 3 4 2 3 2
*/
[/pre]

Or, you can simply do update as below. But in this case you are assuming that your data are clean in that every id has one and only one observation with qn=1, and given qn=x, there are no non-missing values other than in ax, bx, cx.
[pre]
data stack3;
update one(where=(qn=1)) one(drop=qn);
by id;
if last.id then output;
drop qn;
run;
/* check */
title stack3;
proc print data=stack3 noobs;
run;
title;
/* on lst
stack3
id a1 b1 c1 a2 b2 c2 a3 b3 c3
001 1 2 3 2 7 2 2 4 6
002 1 1 1 2 3 4 2 3 2
*/
[/pre]

Under the same assumptions, you can calculate maximum of each variable, by id, using some procs like so:
[pre]
proc means data=one;
var a1--c3;
by id;
output out=stack4(drop=_:) max=;
run;
/* check */
title stack4;
proc print data=stack4 noobs;
run;
title;
/*
stack4
id a1 b1 c1 a2 b2 c2 a3 b3 c3
001 1 2 3 2 7 2 2 4 6
002 1 1 1 2 3 4 2 3 2
*/
[/pre]

enginemane44 · Posted 10-22-2010 02:34 PM

Hello all,
Chang_y_chung's program with the two-dimensional arrays worked. A big thank you to him and the others who responded. I'm not familiar with PROC SQL (hope to remedy that soon...), so I didn't try that approach - yet. I'll have to review 2-D arrays - I use 1-D arrays extensively, but not the 2-D versions.
Again, thanks to all who responded.
Barry Walton

Patrick · Posted 10-22-2010 05:22 AM

A SQL approach:

proc sql;
/* create table want as*/
select
id
,sum(a1) as a1
,sum(b1) as b1
,sum(c1) as c1
,sum(a2) as a2
,sum(b2) as b2
,sum(c2) as c2
,sum(a3) as a3
,sum(b3) as b3
,sum(c3) as c3
from one
group by id
order by id
;
quit;

Collapsing data - RETAIN statement doesn't seem to retain