Solved: Re: carry forward the value both up and down rows

sahoositaram555 · Posted 03-21-2020 03:05 PM

Hi everyone, I have a dataset below.

for every subjid i have randno for only the baseline visit . i would like to carry forward that randno value both to screening and subsequent visits like 12weeks ,24weeks,32weeks. it has to be done like that by each subject id both upper row and below rows.

Data main;

Input subjid visit randno;

Cards

10001 screening .

10001 baseline 90001

10001 12_weeks .

10001 24_weeks .

10002 screening .

10002 baseline 90010

10002 12_weeks .

10002 48_weeks .

10003 screening .

10003 baseline 90011

10003 12_weeks .

10003 24_weeks .

10003 36_weeks .

;

Run;

can you suggest what method would be appropriate to proceed?

PeterClemmensen · Posted 03-21-2020 03:21 PM

Ok. See if this helps you

Data main;
Input subjid $ 1-5 visit $ 7-15 randno;
Cards;
10001 screening .    
10001 baseline  90001
10001 12_weeks  .    
10001 24_weeks  .    
10002 screening .    
10002 baseline  90010
10002 12_weeks  .    
10002 48_weeks  .    
10003 screening .    
10003 baseline  90011
10003 12_weeks  .    
10003 24_weeks  .    
10003 36_weeks  .    
;

data want (drop=_:);
   do _N_=1 by 1 until (last.subjid);
      set main;
      by subjid;
      _randno = max(_randno, randno);
   end;
   do _N_=1 to _N_;
      set main;
      randno=_randno;
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 03-21-2020 03:10 PM

Can a Subjid containg more than one non missing value? And what if it does?

The DATA to DATA Step Macro
Blog: SASnrd

sahoositaram555 · Posted 03-21-2020 03:15 PM

Hi @PeterClemmensen. i had received the data like that. it means the randomisation number was assigned to them at the baseline visit , not before . Also after baseline they ideally must carry the randno along but due to some app error the value coulnot get loaded, so i want to do it through programming.

PeterClemmensen · Posted 03-21-2020 03:21 PM

Ok. See if this helps you

Data main;
Input subjid $ 1-5 visit $ 7-15 randno;
Cards;
10001 screening .    
10001 baseline  90001
10001 12_weeks  .    
10001 24_weeks  .    
10002 screening .    
10002 baseline  90010
10002 12_weeks  .    
10002 48_weeks  .    
10003 screening .    
10003 baseline  90011
10003 12_weeks  .    
10003 24_weeks  .    
10003 36_weeks  .    
;

data want (drop=_:);
   do _N_=1 by 1 until (last.subjid);
      set main;
      by subjid;
      _randno = max(_randno, randno);
   end;
   do _N_=1 to _N_;
      set main;
      randno=_randno;
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

sahoositaram555 · Posted 03-21-2020 04:39 PM

Hi @novinosrin,if you can,please help with one quick response.
That will be helpful to get me this one completely as well. In the second do loop when you mentioned that do _N_=1 to _N_;
does the second _N_ means till end of obs . it's a bit unclear to me as we are not using a by statement in the second DO loop to get all the visit rows per subject . how is it all happening ?

novinosrin · Posted 03-21-2020 04:52 PM

_N_ here is an index variable for a loop and is just a normal variable that is not written to the output dataset or in other words automatically dropped. When you do use this _N_ variable for anything other than its original stated purpose(i.e data step iteration counter) you are basically controlling the value of the variable.

On the 1st loop, you use SET statement to read observations in a BY GROUP one by one. The compiler sets the value of FIRST. and LAST. variables at compile time determining the first and last observations of the BY GROUP.

Once you have read the observations of a particular by group, you do not need to reset the FIRST and LAST variable values as you can read directly from cache in memory. This saves the compiler from doing extra work and the index variable _N_ value becomes the number of observations to read from the dataset during the 2nd loop. This kind of mimics a self join per se though not quite the same. This technique is also known as DOW loop.

Tom · Posted 03-21-2020 06:19 PM

The first loop counts how many and the second loop uses the count to know how many times to execute.

Your problem does not need the count for anything so it is probably easier to eliminate the counter and just code both loops the same.

data want ;
   do until (last.subjid);
      set main;
      by subjid;
      _randno = max(_randno, randno);
   end;
   do until (last.subjid);
      set main;
      by subjid;
      output;
   end;
  drop _randno;
run;

novinosrin · Posted 03-21-2020 03:14 PM

Very good point by @PeterClemmensen . Until then for what it's worth as I am bored


Data main;

Input subjid visit :$32.  randno;

Cards;

10001    screening            .

10001    baseline               90001

10001    12_weeks           .

10001    24_weeks           .

10002    screening            .

10002    baseline               90010

10002    12_weeks           .

10002    48_weeks           .

10003    screening            .

10003    baseline               90011

10003    12_weeks           .

10003    24_weeks           .

10003    36_weeks           .

;

Run;

proc sql;
create table want as
select subjid ,visit ,max((visit='baseline')*randno) as randno
from (select *,monotonic() as rn from main)
group by subjid
order by subjid,rn;
quit;

novinosrin · Posted 03-21-2020 03:20 PM

data want;
 merge main(drop=randno) main(keep=visit subjid randno rename=(visit=v) where=(v='baseline'));
 by subjid;
 drop v;
run;

novinosrin · Posted 03-21-2020 03:26 PM

This is also a great case for somebody who is a beginner of HASH to get the understanding

data want ;
 if _n_=1 then do;
   dcl hash H (dataset:'main(where=(visit="baseline")') ;
   h.definekey  ("subjid") ;
   h.definedata ("randno") ;
   h.definedone () ;
 end;
 set main;
 _n_=h.find();
run;

mkeintz · Posted 03-21-2020 04:54 PM

If you reverse the operands of the merge statement, you can simplify even a bit more:

data want;
  merge main (where=(visit='baseline')) main (drop=randno);
  by subjid;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sahoositaram555 · Posted 03-21-2020 04:05 PM

Hi @novinosrin, Thank you very much for showing multiple ways of doing this.
My i know what exactly below statement does to code? its a bit confusing when i'm trying to correlate with the group by statement.
max((visit='baseline')*randno)

please can you drop a line.

novinosrin · Posted 03-21-2020 04:18 PM

max((visit='baseline')*randno) is merely a fancy boolean expression. So to learn how the expression works is by breaking down to smaller fragments

1. When visit='baseline' is true, the expression evaluates to 1 and when it is false, the expression evaluates to 0. So you would have values of 1's ad 0's

2. You multiply the evaluated 1's and 0's with the randno. This would result in an expression like 1*90001 for subjid 110001 and the same applies for other subjid's

3. Then you take the MAX value of the result of 2, which is max of (90001,0 , 0,0,.... and so forth of 0's). Of course, You are bound to get 90001 in this case. and likewise the same logic would hold true for other subjids

4. Proc SQL is awesome as it can be used as readymeals solutions for many cases as along as the dataset is narrow with its capacity to remerge automatically. I really like this functionality that makes many datastep solutions innocent.

Well, there you go. I hope that helps?

sahoositaram555 · Posted 03-21-2020 04:26 PM

Yes, thank you very much for your time .

Tom · Posted 03-21-2020 04:26 PM

@sahoositaram555 wrote:
Hi @novinosrin, Thank you very much for showing multiple ways of doing this.
My i know what exactly below statement does to code? its a bit confusing when i'm trying to correlate with the group by statement.
max((visit='baseline')*randno)

please can you drop a line.

I think that statement was designed to insure permanently job stability maintaining the code 🙂

Let's break it down and see what it does and then try guess at why it might help with this problem. The boolean expression VISIT='baseline' will evaluate to 0 (false) or 1(true). The multiplication operator will multiply that value times the value of RANDNO. Then the MAX() aggregate function from SQL will take the largest result over all of the records for that subject. SAS will then re-merge that subject level onto all of the observations it returns for that subject.

Note that this will work in SQL only if RANDNO was defined as numeric. Unlike a DATA step SQL does not automatically convert character strings to numbers. Note if RANDNO is a numeric variable then that is another coding error in the design of your dataset. You do not ever need to do arithmetic with the values of the RANDNO variable.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away