Hi everyone, I have a dataset below.
for every subjid i have randno for only the baseline visit . i would like to carry forward that randno value both to screening and subsequent visits like 12weeks ,24weeks,32weeks. it has to be done like that by each subject id both upper row and below rows.
Data main;
Input subjid visit randno;
Cards
10001 screening .
10001 baseline 90001
10001 12_weeks .
10001 24_weeks .
10002 screening .
10002 baseline 90010
10002 12_weeks .
10002 48_weeks .
10003 screening .
10003 baseline 90011
10003 12_weeks .
10003 24_weeks .
10003 36_weeks .
;
Run;
can you suggest what method would be appropriate to proceed?
Ok. See if this helps you
Data main;
Input subjid $ 1-5 visit $ 7-15 randno;
Cards;
10001 screening .
10001 baseline 90001
10001 12_weeks .
10001 24_weeks .
10002 screening .
10002 baseline 90010
10002 12_weeks .
10002 48_weeks .
10003 screening .
10003 baseline 90011
10003 12_weeks .
10003 24_weeks .
10003 36_weeks .
;
data want (drop=_:);
do _N_=1 by 1 until (last.subjid);
set main;
by subjid;
_randno = max(_randno, randno);
end;
do _N_=1 to _N_;
set main;
randno=_randno;
output;
end;
run;
Can a Subjid containg more than one non missing value? And what if it does?
Ok. See if this helps you
Data main;
Input subjid $ 1-5 visit $ 7-15 randno;
Cards;
10001 screening .
10001 baseline 90001
10001 12_weeks .
10001 24_weeks .
10002 screening .
10002 baseline 90010
10002 12_weeks .
10002 48_weeks .
10003 screening .
10003 baseline 90011
10003 12_weeks .
10003 24_weeks .
10003 36_weeks .
;
data want (drop=_:);
do _N_=1 by 1 until (last.subjid);
set main;
by subjid;
_randno = max(_randno, randno);
end;
do _N_=1 to _N_;
set main;
randno=_randno;
output;
end;
run;
_N_ here is an index variable for a loop and is just a normal variable that is not written to the output dataset or in other words automatically dropped. When you do use this _N_ variable for anything other than its original stated purpose(i.e data step iteration counter) you are basically controlling the value of the variable.
On the 1st loop, you use SET statement to read observations in a BY GROUP one by one. The compiler sets the value of FIRST. and LAST. variables at compile time determining the first and last observations of the BY GROUP.
Once you have read the observations of a particular by group, you do not need to reset the FIRST and LAST variable values as you can read directly from cache in memory. This saves the compiler from doing extra work and the index variable _N_ value becomes the number of observations to read from the dataset during the 2nd loop. This kind of mimics a self join per se though not quite the same. This technique is also known as DOW loop.
The first loop counts how many and the second loop uses the count to know how many times to execute.
Your problem does not need the count for anything so it is probably easier to eliminate the counter and just code both loops the same.
data want ;
do until (last.subjid);
set main;
by subjid;
_randno = max(_randno, randno);
end;
do until (last.subjid);
set main;
by subjid;
output;
end;
drop _randno;
run;
Very good point by @PeterClemmensen . Until then for what it's worth as I am bored
Data main;
Input subjid visit :$32. randno;
Cards;
10001 screening .
10001 baseline 90001
10001 12_weeks .
10001 24_weeks .
10002 screening .
10002 baseline 90010
10002 12_weeks .
10002 48_weeks .
10003 screening .
10003 baseline 90011
10003 12_weeks .
10003 24_weeks .
10003 36_weeks .
;
Run;
proc sql;
create table want as
select subjid ,visit ,max((visit='baseline')*randno) as randno
from (select *,monotonic() as rn from main)
group by subjid
order by subjid,rn;
quit;
data want;
merge main(drop=randno) main(keep=visit subjid randno rename=(visit=v) where=(v='baseline'));
by subjid;
drop v;
run;
This is also a great case for somebody who is a beginner of HASH to get the understanding
data want ;
if _n_=1 then do;
dcl hash H (dataset:'main(where=(visit="baseline")') ;
h.definekey ("subjid") ;
h.definedata ("randno") ;
h.definedone () ;
end;
set main;
_n_=h.find();
run;
If you reverse the operands of the merge statement, you can simplify even a bit more:
data want;
merge main (where=(visit='baseline')) main (drop=randno);
by subjid;
run;
max((visit='baseline')*randno) is merely a fancy boolean expression. So to learn how the expression works is by breaking down to smaller fragments
1. When visit='baseline' is true, the expression evaluates to 1 and when it is false, the expression evaluates to 0. So you would have values of 1's ad 0's
2. You multiply the evaluated 1's and 0's with the randno. This would result in an expression like 1*90001 for subjid 110001 and the same applies for other subjid's
3. Then you take the MAX value of the result of 2, which is max of (90001,0 , 0,0,.... and so forth of 0's). Of course, You are bound to get 90001 in this case. and likewise the same logic would hold true for other subjids
4. Proc SQL is awesome as it can be used as readymeals solutions for many cases as along as the dataset is narrow with its capacity to remerge automatically. I really like this functionality that makes many datastep solutions innocent.
Well, there you go. I hope that helps?
@sahoositaram555 wrote:
Hi @novinosrin, Thank you very much for showing multiple ways of doing this.
My i know what exactly below statement does to code? its a bit confusing when i'm trying to correlate with the group by statement.
max((visit='baseline')*randno)
please can you drop a line.
I think that statement was designed to insure permanently job stability maintaining the code 🙂
Let's break it down and see what it does and then try guess at why it might help with this problem. The boolean expression VISIT='baseline' will evaluate to 0 (false) or 1(true). The multiplication operator will multiply that value times the value of RANDNO. Then the MAX() aggregate function from SQL will take the largest result over all of the records for that subject. SAS will then re-merge that subject level onto all of the observations it returns for that subject.
Note that this will work in SQL only if RANDNO was defined as numeric. Unlike a DATA step SQL does not automatically convert character strings to numbers. Note if RANDNO is a numeric variable then that is another coding error in the design of your dataset. You do not ever need to do arithmetic with the values of the RANDNO variable.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.