DATA Step, Macro, Functions and more

How do I calculate a variable based on another variable value arranged in multiple rows

Reply
Contributor
Posts: 40

How do I calculate a variable based on another variable value arranged in multiple rows

I need urgent help with this situation. I have a dataset that looks like this:

 

ID   VISIT       T0             RESULT
1     V1      07SEP2005    negative
1     v2      06oct2005       negative
1     v3      01apr2006      positive
1     v4      03sep2006      negative
1     v5      08sep2006      negative
2     v1      07sep2005      negative
2     v2      02oct2005       positive
2     v3      05nov2005      positive
2     v4     10nov2006       negative
3     v1      07SEP2005     negative
3     v2      03OCT2005    positive
3     v3      09nov2005
3     v4     10dec2005       negative
3     v5      03mar2006     positive

 

Number of visits may vary for each id and sometimes 'results' might be missing. 

For each unique ID:

1.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1

2.) if there are no positive result at all for an id-, no diff in dates is calculated and 'changed_result' =0

 

I hope I can get some answers soon. I use BASE SAS V9

Thank you in advance for your help.

Super User
Posts: 19,875

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Please post your expected output, both values and file. 

Contributor
Posts: 40

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Here is my dataset and the output I need:

 

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive

 

expected output:

ID VISIT T0 RESULT diff changed_result
1 V1 07SEP2005 negative . 0
1 v2 06oct2005 negative . 0
1 v3 01apr2006 positive . 0
1 v4 03sep2006 negative 155 1
1 v5 08sep2006 negative . 0
2 v1 07sep2005 negative . 0
2 v2 02oct2005 positive . 0
2 v3 05nov2005 positive . 0
2 v4 10nov2006 negative 404 1
3 v1 07SEP2005 negative . 0
3 v2 03OCT2005 positive . 0
3 v3 09nov2005 . 0
3 v4 10dec2005 negative 68 1
3 v5 03mar2006 positive . 0

 

 

The date value for first positive needs to be stored and compared to the one for first negative after the first positive.

For ID1 : 01apr2006 - 03sep2006;

for id2 : 02oct2005 - 10nov2006;

for id3: 03OCT2005 - 10dec2005

 

Thank you 

Trusted Advisor
Posts: 1,137

Re: How do I calculate a variable based on another variable value arranged in multiple rows

[ Edited ]

Hope you are expecting an output as below. if yes then follow this code

 

data want;
do until(first.id) ;
set have(where=(result='positive'));
by id visit;
posdate=t0;
end;
do until(last.id) ;
set have;
by id visit;
if lag(result) in ('positive',' ') and result='negative' and t0>posdate then changed_result=1; 
else changed_result=0; 
if result='negative' and t0>posdate then diff=t0-posdate; output; end; run;

 

 


Capture.PNG
Thanks,
Jag
Contributor
Posts: 40

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Posted in reply to Jagadishkatam

Thank you Jag, I am a couple of issues with this one- I would need all the observations in the final dataset, for some reason I am geting fewer obs. Also somewhere the logic is not working.

Nevertheless, I appreciate your help!

Trusted Advisor
Posts: 1,022

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Posted in reply to Jagadishkatam

You don't really need the double-dow here (i.e. two "do until (last.id)" groups), because there is no need to go back and re-read records.  All one should be doing is to maintain the current status, i.e. whether a first positive record has been encountered.  If so, then a secondary status is checked, namely whether a subsequent negative has been encountered:

 

Note the retained variables FIRST_POS in the program below can have three values:

  • a missing value, set at the first.id records, which is continued until ...
  • a T0 value corresponding to the first "positive" record, (if result="positive" and first_pos=.) which is continued until ...
  • the next "negative" record, (if result='negative' and not(missing(first_pos) ) when DIFF is calculated, and first_pos is set to .Z.

    It's set to .Z instead of ., so that a subsequent "positive" won't inadvertantly restart the process.

 

data want (drop=first_pos);
  set have;
  by id;

  retain first_pos;  /* T0 date of first positive rec */

  if result='positive' and first_pos=. then first_pos=t0;

  if result='negative' and not(missing(first_pos)) then do;
    changed_result=1;
    diff=t0-first_pos;
    first_pos=.Z;
  end;
  else changed_result=0;

  if last.id then first_pos=.;
run ;

 

Trusted Advisor
Posts: 1,586

Re: How do I calculate a variable based on another variable value arranged in multiple rows

[ Edited ]

Try next code - changed according to your answer:

 

data want;

 set have;

  by id;

       retain  phase pos_d neg_d ;

      changed_result =0;  /* initial value */

       if first.id then phase = 0;

       if phase = 0 and result = 'positive'

           then do;  pos_d = t0; phase=1; end;

       if phase = 1 and result = 'negative'

           then do; neg_d = t0; phase =2; end;

       if last.id the do;

          if phase = 2 then do;

             dif_days = neg_d - pos_d;

             changed_result = 1;

          end;

          else do;

              dif_days = .;

              changed_result = 0;

          end;

         ***output;

         keep ID dif_days changed_result;

  run;

          

  

       

Contributor
Posts: 40

Re: How do I calculate a variable based on another variable value arranged in multiple rows

This code works great, Thanks a ton SHMUEL Smiley Happy

Contributor
Posts: 40

UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

There is an update in the algorithm, please help:

CALCULATING THE TIME BETWEEN THE FIRST POSITIVE TO HOW EVER LONG THEY REMAIN POSITIVE (BEFORE HAVING A NEGATIVE OR UNKNOWN/MISSING RESULT)

 

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive
4 v1 07SEP2005 negative
4 v2 10SEP2005 negative
4 v3 07OCT2005 negative
4 v4 17nOV2005 negative
4 v5 27JAN2006 negative
5 v1 03mar2007 positive
5 v2 30mar2007 positive
5 v3 06jun2007 positive
5 v4 13mar2008 positive
6 v1 03feb2007 positive
6 v2 30jun2007 UNKNOWN
6 v3 06nov2007
6 v4 03mar2008 positive
7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive
8 v1 07aug2005 negative
8 v2 07oct2005 negative
8 v3 15nov2005 positive
8 v4 01jan2006
9 v1 07jan2005 negative
9 v2 10mar2005 positive
9 v3 27OCT2005 UNKNOWN
9 v4 07dec2005 negative
9 v5 20JAN2006 negative

 

diff=t1-t2
BASICALLY WE FOLLOW FIRST POSITIVE RESULT UNTIL 1.)THERE IS A NEGATIVE RESULT OR 2.) TILL THE LAST IF THEY NEVER BECOME NEGATIVES OR 3.) THE NEXT RESULT IS MISSING OR UNKNOWN

1.) if an id has all negative results (and no positives) it has to be flagged (flag=1)(example: Id 4),
2.) if an id has all positive results (and no negatives) it has to be flagged (flag=2)(example: Id 5),
and diff= date of first positive minus date of last positive and changed_result=0
3.) if the results are like this-- 1- POSITIVE
2- POSITIVE
3- POSITIVE
4- (MISSING VALUE)
5- POSITIVE
then diff= dates of first positive and 3 rd positive need to be subtracted and changed_result=0
(similar situation when results are
1- POSITIVE
2- POSITIVE
3- (MISSING VALUE)
4- POSITIVE
then diff= dates of first positive and 2 rd positive need to be subtracted and changed_result=0
In other words, whenever resulTS are in (positive, missing) then diff=dates of first positive and the positive result right before the missing value
need to be subtracted and changed_result=0
4.) if the result are like this-- 1-negative
2-negative
3-positive
4-missing
then diff= missing value (cuz we have nothing to compare it to) and changed_result=0
In other words, if the second element while calculating diff is a missing value, then diff=missing and changed_result=0
5.)'unknown' values in result var should be treated same as missing values


Number of visits may vary for each id and sometimes 'results' might be missing/unknown (treat both the same way).
For each unique ID:
6.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1

 


In the output only one row should be there for each id:


expected output:

ID diff changed_result t1 t2 FLAG
1 155 1 01apr2006 03sep2006
2 404 1 02oct2005 10nov2006
3 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
4 . 0 . . 1 (ALL NEGATIVE RESULTS)
5 376 0 03mar2007 13mar2008 2 (ALL POSITIVE RESULTS)
6 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)
7 180 0 01jan2007 30jun2007 (this is how long were definitely positive)
8 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
9 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)


I hope I can get some answers soon. I use BASE SAS V9
Thank you so much in advance for your help.

 

Trusted Advisor
Posts: 1,586

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Attached code contains 3 steps:

1) reading input into dataset TEST

2) the old algoritm - you may ignore it, if not need

3) the new algorithm - producing the expected results from TEST

 

 

Attachment
Contributor
Posts: 40

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Thank you Shmuel. The code works but for id 7,

7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive

 

the t1,t2 values are incorrect. The diff value is calculated correctly.

output for your code for id7: 

id diff changed_result      t1                    t2                   flag

7  180  0                       01feb2008       30jun2007         .

 

when it should be :

id diff changed_result      t1                    t2                   flag

7  180  0                       01jan2007      30jun2007         .

Trusted Advisor
Posts: 1,586

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

I have fixed the error with ID=7. Please check again.

 

data alg2;
format id diff changed_result t1 t2 flag;
keep id diff changed_result t1 t2 flag;
set test;
by id;
retain type ; /* 1=start negative, 2=start positive, 0=changed */
retain pos1d; /* 1st positive date */
retain flag; /* 1=all negative, 2=all positive, .=changed */
retain t2; /* date used to calc diff */
retain diff changed_result 0;
result = lowcase(result);

if first.id then do;
diff=.;
if result = 'positive' then do; type=2; pos1d = t0; flag=2; end; else
if result = 'negative' then do; type=1; flag=1; end;
else do; type = 0; flag=.; end;

end; else

if result = 'positive' then do;
if type ne 2 /* 1st positive which is not 1st in ID */

and diff=.                            /* <<<<<<<<<<<< line added */
then do; type=2; pos1d = t0; flag=.; end;
else if flag=2 then do;
diff=t0-pos1d; t2=t0;
end;
/* if type=2 no change need */
end; else
if result = 'negative' then do;
if type =2 and diff=. /* 1st negative following positive(s) */
then do; diff = t0 - pos1d; t2=t0; flag=.; changed_result=1; end;
/* if type is 1 or 0 no change need */
end;
else do;
if type =2 /* missing following positive(s) */
then do; ***pos1d=.; t2=.; changed_result=0; flag=.; type=0; end;    /* <<<<< line changed */
end;

if last.id then do;
if flag=1 then do; diff=.; t1=.; t2=.; output; end; else
if flag=2 then do; diff=t0 - pos1d; t1=pos1d; t2=t0; output; end;
else do;
if diff > 0
then do; t1=pos1d; /* changed_result diff and t2 retained */ output; end;
else do; diff=.; t1=.; t2=.; output; end;
end;
end;
format t1 t2 date9.;
run;

 

Question:

what would you like to do in case of two periods with break in ID ?

assume ID=7 has one more line:

   7  v5  20DEC2009 positive   <<< or negative >>> those are two different cases to consider >>>>

Would you prefer the first period? the last one? or the longest ?

Contributor
Posts: 40

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Answer:

Step1: Need to see when they first become positive and note the date as t1

Step2: Follow the id until we encounter a negative or missing value or unknown , we do not care what ever happens after that. Even if there are multiple positive results or negative results after 1-positive 2-missing, we don't care for them as we encountered a missing value. We are calculating how long a person stays positive before encountering a missing/unknown/negative value.

Step3: if the person had a missing/unknown after first positive, we censor them there and diff=1st positive date minus date(missing/unknown) but changed_result=0;

Step4: if the person had a negative after first positive, we censor them there and diff=1st positive date minus date(first negative after 1st positive) but changed_result=1;

 

So, in the example you gave for ID 7, we don't care about v5 as after first positive we encountered a misisng at v3, so diff=01jan2007-08nov2007=311

 

Your latest code working but I still need the t2 value to be printed instead of missing value there.

Thanks a lot!

Contributor
Posts: 40

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

So to put it succinctly- the FIRST PERIOD
Contributor
Posts: 40

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

I am sorry- the diff=01jan2007-30jun2007 =180
Ask a Question
Discussion stats
  • 27 replies
  • 410 views
  • 4 likes
  • 5 in conversation