BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hi everyone, I have a dataset below.

for every subjid i have randno for only the baseline visit . i would like to carry forward that randno value both to screening and subsequent visits like 12weeks ,24weeks,32weeks. it has to be done like that by each subject id both upper row and below rows. 

 

Data main;

Input subjid visit randno;

Cards

10001    screening            .

10001    baseline               90001

10001    12_weeks           .

10001    24_weeks           .

10002    screening            .

10002    baseline               90010

10002    12_weeks           .

10002    48_weeks           .

10003    screening            .

10003    baseline               90011

10003    12_weeks           .

10003    24_weeks           .

10003    36_weeks           .

;

Run;

 

 

can you suggest what method would be appropriate to proceed?

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Ok. See if this helps you

 

Data main;
Input subjid $ 1-5 visit $ 7-15 randno;
Cards;
10001 screening .    
10001 baseline  90001
10001 12_weeks  .    
10001 24_weeks  .    
10002 screening .    
10002 baseline  90010
10002 12_weeks  .    
10002 48_weeks  .    
10003 screening .    
10003 baseline  90011
10003 12_weeks  .    
10003 24_weeks  .    
10003 36_weeks  .    
;

data want (drop=_:);
   do _N_=1 by 1 until (last.subjid);
      set main;
      by subjid;
      _randno = max(_randno, randno);
   end;
   do _N_=1 to _N_;
      set main;
      randno=_randno;
      output;
   end;
run;

View solution in original post

14 REPLIES 14
PeterClemmensen
Tourmaline | Level 20

Can a Subjid containg more than one non missing value? And what if it does?

sahoositaram555
Pyrite | Level 9
Hi @PeterClemmensen. i had received the data like that. it means the randomisation number was assigned to them at the baseline visit , not before . Also after baseline they ideally must carry the randno along but due to some app error the value coulnot get loaded, so i want to do it through programming.
PeterClemmensen
Tourmaline | Level 20

Ok. See if this helps you

 

Data main;
Input subjid $ 1-5 visit $ 7-15 randno;
Cards;
10001 screening .    
10001 baseline  90001
10001 12_weeks  .    
10001 24_weeks  .    
10002 screening .    
10002 baseline  90010
10002 12_weeks  .    
10002 48_weeks  .    
10003 screening .    
10003 baseline  90011
10003 12_weeks  .    
10003 24_weeks  .    
10003 36_weeks  .    
;

data want (drop=_:);
   do _N_=1 by 1 until (last.subjid);
      set main;
      by subjid;
      _randno = max(_randno, randno);
   end;
   do _N_=1 to _N_;
      set main;
      randno=_randno;
      output;
   end;
run;
sahoositaram555
Pyrite | Level 9
Hi @novinosrin,if you can,please help with one quick response.
That will be helpful to get me this one completely as well. In the second do loop when you mentioned that do _N_=1 to _N_;
does the second _N_ means till end of obs . it's a bit unclear to me as we are not using a by statement in the second DO loop to get all the visit rows per subject . how is it all happening ?

novinosrin
Tourmaline | Level 20

_N_ here is an index variable for a loop and is just a normal variable that is not written to the output dataset or in other words automatically dropped.  When you do use this _N_ variable for anything other than its original stated purpose(i.e data step iteration counter) you are basically controlling the value of the variable. 

 

On the 1st loop, you use SET statement to read observations in a BY GROUP one by one. The compiler sets the value of FIRST. and LAST. variables at compile time determining the first and last observations of the BY GROUP.  

 

Once you have read the observations of a particular by group, you do not need to reset the FIRST and LAST variable values as you can read directly from cache in memory. This saves the compiler from doing extra work and the index variable _N_ value becomes the number of observations to read from the dataset during the 2nd loop. This kind of mimics a self join per se though not quite the same. This technique is also known as DOW loop. 

 

Tom
Super User Tom
Super User

The first loop counts how many and the second loop uses the count to know how many times to execute.

Your problem does not need the count for anything so it is probably easier to eliminate the counter and just code both loops the same.

data want ;
   do until (last.subjid);
      set main;
      by subjid;
      _randno = max(_randno, randno);
   end;
   do until (last.subjid);
      set main;
      by subjid;
      output;
   end;
  drop _randno;
run;

 

novinosrin
Tourmaline | Level 20

Very good point by @PeterClemmensen . Until then for what it's worth as I am bored

 


Data main;

Input subjid visit :$32.  randno;

Cards;

10001    screening            .

10001    baseline               90001

10001    12_weeks           .

10001    24_weeks           .

10002    screening            .

10002    baseline               90010

10002    12_weeks           .

10002    48_weeks           .

10003    screening            .

10003    baseline               90011

10003    12_weeks           .

10003    24_weeks           .

10003    36_weeks           .

;

Run;

proc sql;
create table want as
select subjid ,visit ,max((visit='baseline')*randno) as randno
from (select *,monotonic() as rn from main)
group by subjid
order by subjid,rn;
quit;
novinosrin
Tourmaline | Level 20
data want;
 merge main(drop=randno) main(keep=visit subjid randno rename=(visit=v) where=(v='baseline'));
 by subjid;
 drop v;
run;
novinosrin
Tourmaline | Level 20

This is also a great case for somebody who is a beginner of HASH to get the understanding

 

data want ;
 if _n_=1 then do;
   dcl hash H (dataset:'main(where=(visit="baseline")') ;
   h.definekey  ("subjid") ;
   h.definedata ("randno") ;
   h.definedone () ;
 end;
 set main;
 _n_=h.find();
run;

 

mkeintz
PROC Star

If you reverse the operands of the merge statement, you can simplify even a bit more:

 

data want;
  merge main (where=(visit='baseline')) main (drop=randno);
  by subjid;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sahoositaram555
Pyrite | Level 9
Hi @novinosrin, Thank you very much for showing multiple ways of doing this.
My i know what exactly below statement does to code? its a bit confusing when i'm trying to correlate with the group by statement.
max((visit='baseline')*randno)

please can you drop a line.
novinosrin
Tourmaline | Level 20

max((visit='baseline')*randno) is merely a fancy boolean expression. So to learn  how the expression works is by breaking down to smaller fragments

 

1. When visit='baseline' is true, the expression evaluates to 1 and when it is false, the expression evaluates to 0. So you would have values of 1's ad 0's

2. You multiply the evaluated 1's and 0's with the randno. This would result in an expression like 1*90001 for subjid 110001 and the same applies for other subjid's

3. Then you take the MAX value of the result of 2, which is max of (90001,0 , 0,0,.... and so forth of 0's). Of course, You are bound to get 90001 in this case. and likewise the same logic would hold true for other subjids

4. Proc SQL is awesome as it can be used as readymeals solutions for many cases as along as the dataset is narrow with its capacity to remerge automatically. I really like this functionality that makes many datastep solutions innocent. 

 

Well, there you go. I hope that helps?

 

 

sahoositaram555
Pyrite | Level 9
Yes, thank you very much for your time .
Tom
Super User Tom
Super User

@sahoositaram555 wrote:
Hi @novinosrin, Thank you very much for showing multiple ways of doing this.
My i know what exactly below statement does to code? its a bit confusing when i'm trying to correlate with the group by statement.
max((visit='baseline')*randno)

please can you drop a line.

I think that statement was designed to insure permanently job stability maintaining the code 🙂 

 

Let's break it down and see what it does and then try guess at why it might help with this problem.  The boolean expression VISIT='baseline' will evaluate to 0 (false) or 1(true).  The multiplication operator will multiply that value times the value of RANDNO.  Then the MAX() aggregate function from SQL will take the largest result over all of the records for that subject.  SAS will then re-merge that subject level onto all of the observations it returns for that subject.

 

Note that this will work in SQL only if RANDNO was defined as numeric.  Unlike a DATA step SQL does not automatically convert character strings to numbers.  Note if RANDNO is a numeric variable then that is another coding error in the design of your dataset.  You do not ever need to do arithmetic with the values of the RANDNO variable. 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 3098 views
  • 9 likes
  • 5 in conversation