Re: How do I calculate a variable based on another variable value arra...

ahhh · Posted 12-14-2016 07:41 PM

I need urgent help with this situation. I have a dataset that looks like this:

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive

Number of visits may vary for each id and sometimes 'results' might be missing.

For each unique ID:

1.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1

2.) if there are no positive result at all for an id-, no diff in dates is calculated and 'changed_result' =0

I hope I can get some answers soon. I use BASE SAS V9

Thank you in advance for your help.

Reeza · Posted 12-14-2016 08:23 PM

Please post your expected output, both values and file.

ahhh · Posted 12-14-2016 08:41 PM

Here is my dataset and the output I need:

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive

expected output:

ID VISIT T0 RESULT diff changed_result
1 V1 07SEP2005 negative . 0
1 v2 06oct2005 negative . 0
1 v3 01apr2006 positive . 0
1 v4 03sep2006 negative 155 1
1 v5 08sep2006 negative . 0
2 v1 07sep2005 negative . 0
2 v2 02oct2005 positive . 0
2 v3 05nov2005 positive . 0
2 v4 10nov2006 negative 404 1
3 v1 07SEP2005 negative . 0
3 v2 03OCT2005 positive . 0
3 v3 09nov2005 . 0
3 v4 10dec2005 negative 68 1
3 v5 03mar2006 positive . 0

The date value for first positive needs to be stored and compared to the one for first negative after the first positive.

For ID1 : 01apr2006 - 03sep2006;

for id2 : 02oct2005 - 10nov2006;

for id3: 03OCT2005 - 10dec2005

Thank you

Jagadishkatam · Posted 12-14-2016 08:42 PM

Hope you are expecting an output as below. if yes then follow this code

data want;
do until(first.id) ;
set have(where=(result='positive'));
by id visit;
posdate=t0;
end;
do until(last.id) ;
set have;
by id visit;
if lag(result) in ('positive',' ') and result='negative' and t0>posdate then changed_result=1; 
else changed_result=0; 
if result='negative' and t0>posdate then diff=t0-posdate;
output;
end;
run;

Thanks,
Jag

ahhh · Posted 12-14-2016 09:43 PM

Thank you Jag, I am a couple of issues with this one- I would need all the observations in the final dataset, for some reason I am geting fewer obs. Also somewhere the logic is not working.

Nevertheless, I appreciate your help!

mkeintz · Posted 12-14-2016 11:57 PM

You don't really need the double-dow here (i.e. two "do until (last.id)" groups), because there is no need to go back and re-read records. All one should be doing is to maintain the current status, i.e. whether a first positive record has been encountered. If so, then a secondary status is checked, namely whether a subsequent negative has been encountered:

Note the retained variables FIRST_POS in the program below can have three values:

a missing value, set at the first.id records, which is continued until ...
a T0 value corresponding to the first "positive" record, (if result="positive" and first_pos=.) which is continued until ...
the next "negative" record, (if result='negative' and not(missing(first_pos) ) when DIFF is calculated, and first_pos is set to .Z.

It's set to .Z instead of ., so that a subsequent "positive" won't inadvertantly restart the process.

data want (drop=first_pos);
  set have;
  by id;

  retain first_pos;  /* T0 date of first positive rec */

  if result='positive' and first_pos=. then first_pos=t0;

  if result='negative' and not(missing(first_pos)) then do;
    changed_result=1;
    diff=t0-first_pos;
    first_pos=.Z;
  end;
  else changed_result=0;

  if last.id then first_pos=.;
run ;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Shmuel · Posted 12-14-2016 08:43 PM

Try next code - changed according to your answer:

data want;

set have;

by id;

retain phase pos_d neg_d ;

changed_result =0; /* initial value */

if first.id then phase = 0;

if phase = 0 and result = 'positive'

then do; pos_d = t0; phase=1; end;

if phase = 1 and result = 'negative'

then do; neg_d = t0; phase =2; end;

if last.id the do;

if phase = 2 then do;

dif_days = neg_d - pos_d;

changed_result = 1;

end;

else do;

dif_days = .;

changed_result = 0;

end;

***output;

keep ID dif_days changed_result;

run;

ahhh · Posted 12-14-2016 09:41 PM

This code works great, Thanks a ton SHMUEL 🙂

ahhh · Posted 12-15-2016 08:11 PM

There is an update in the algorithm, please help:

CALCULATING THE TIME BETWEEN THE FIRST POSITIVE TO HOW EVER LONG THEY REMAIN POSITIVE (BEFORE HAVING A NEGATIVE OR UNKNOWN/MISSING RESULT)

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive
4 v1 07SEP2005 negative
4 v2 10SEP2005 negative
4 v3 07OCT2005 negative
4 v4 17nOV2005 negative
4 v5 27JAN2006 negative
5 v1 03mar2007 positive
5 v2 30mar2007 positive
5 v3 06jun2007 positive
5 v4 13mar2008 positive
6 v1 03feb2007 positive
6 v2 30jun2007 UNKNOWN
6 v3 06nov2007
6 v4 03mar2008 positive
7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive
8 v1 07aug2005 negative
8 v2 07oct2005 negative
8 v3 15nov2005 positive
8 v4 01jan2006
9 v1 07jan2005 negative
9 v2 10mar2005 positive
9 v3 27OCT2005 UNKNOWN
9 v4 07dec2005 negative
9 v5 20JAN2006 negative

diff=t1-t2
BASICALLY WE FOLLOW FIRST POSITIVE RESULT UNTIL 1.)THERE IS A NEGATIVE RESULT OR 2.) TILL THE LAST IF THEY NEVER BECOME NEGATIVES OR 3.) THE NEXT RESULT IS MISSING OR UNKNOWN

1.) if an id has all negative results (and no positives) it has to be flagged (flag=1)(example: Id 4),
2.) if an id has all positive results (and no negatives) it has to be flagged (flag=2)(example: Id 5),
and diff= date of first positive minus date of last positive and changed_result=0
3.) if the results are like this-- 1- POSITIVE
2- POSITIVE
3- POSITIVE
4- (MISSING VALUE)
5- POSITIVE
then diff= dates of first positive and 3 rd positive need to be subtracted and changed_result=0
(similar situation when results are
1- POSITIVE
2- POSITIVE
3- (MISSING VALUE)
4- POSITIVE
then diff= dates of first positive and 2 rd positive need to be subtracted and changed_result=0
In other words, whenever resulTS are in (positive, missing) then diff=dates of first positive and the positive result right before the missing value
need to be subtracted and changed_result=0
4.) if the result are like this-- 1-negative
2-negative
3-positive
4-missing
then diff= missing value (cuz we have nothing to compare it to) and changed_result=0
In other words, if the second element while calculating diff is a missing value, then diff=missing and changed_result=0
5.)'unknown' values in result var should be treated same as missing values

Number of visits may vary for each id and sometimes 'results' might be missing/unknown (treat both the same way).
For each unique ID:
6.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1

In the output only one row should be there for each id:

expected output:

ID diff changed_result t1 t2 FLAG
1 155 1 01apr2006 03sep2006
2 404 1 02oct2005 10nov2006
3 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
4 . 0 . . 1 (ALL NEGATIVE RESULTS)
5 376 0 03mar2007 13mar2008 2 (ALL POSITIVE RESULTS)
6 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)
7 180 0 01jan2007 30jun2007 (this is how long were definitely positive)
8 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
9 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)

I hope I can get some answers soon. I use BASE SAS V9
Thank you so much in advance for your help.

Shmuel · Posted 12-16-2016 04:15 AM

Attached code contains 3 steps:

1) reading input into dataset TEST

2) the old algoritm - you may ignore it, if not need

3) the new algorithm - producing the expected results from TEST

ahhh · Posted 12-16-2016 10:41 AM

Thank you Shmuel. The code works but for id 7,

7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive

the t1,t2 values are incorrect. The diff value is calculated correctly.

output for your code for id7:

id diff changed_result t1 t2 flag

7 180 0 01feb2008 30jun2007 .

when it should be :

id diff changed_result t1 t2 flag

7 180 0 01jan2007 30jun2007 .

Shmuel · Posted 12-16-2016 03:37 PM

I have fixed the error with ID=7. Please check again.

data alg2;
format id diff changed_result t1 t2 flag;
keep id diff changed_result t1 t2 flag;
set test;
by id;
retain type ; /* 1=start negative, 2=start positive, 0=changed */
retain pos1d; /* 1st positive date */
retain flag; /* 1=all negative, 2=all positive, .=changed */
retain t2; /* date used to calc diff */
retain diff changed_result 0;
result = lowcase(result);

if first.id then do;
diff=.;
if result = 'positive' then do; type=2; pos1d = t0; flag=2; end; else
if result = 'negative' then do; type=1; flag=1; end;
else do; type = 0; flag=.; end;

end; else

if result = 'positive' then do;
if type ne 2 /* 1st positive which is not 1st in ID */

and diff=. /* <<<<<<<<<<<< line added */
then do; type=2; pos1d = t0; flag=.; end;
else if flag=2 then do;
diff=t0-pos1d; t2=t0;
end;
/* if type=2 no change need */
end; else
if result = 'negative' then do;
if type =2 and diff=. /* 1st negative following positive(s) */
then do; diff = t0 - pos1d; t2=t0; flag=.; changed_result=1; end;
/* if type is 1 or 0 no change need */
end;
else do;
if type =2 /* missing following positive(s) */
then do; ***pos1d=.; t2=.; changed_result=0; flag=.; type=0; end; /* <<<<< line changed */
end;

if last.id then do;
if flag=1 then do; diff=.; t1=.; t2=.; output; end; else
if flag=2 then do; diff=t0 - pos1d; t1=pos1d; t2=t0; output; end;
else do;
if diff > 0
then do; t1=pos1d; /* changed_result diff and t2 retained */ output; end;
else do; diff=.; t1=.; t2=.; output; end;
end;
end;
format t1 t2 date9.;
run;

Question:

what would you like to do in case of two periods with break in ID ?

assume ID=7 has one more line:

7 v5 20DEC2009 positive <<< or negative >>> those are two different cases to consider >>>>

Would you prefer the first period? the last one? or the longest ?

ahhh · Posted 12-16-2016 04:02 PM

Answer:

Step1: Need to see when they first become positive and note the date as t1

Step2: Follow the id until we encounter a negative or missing value or unknown , we do not care what ever happens after that. Even if there are multiple positive results or negative results after 1-positive 2-missing, we don't care for them as we encountered a missing value. We are calculating how long a person stays positive before encountering a missing/unknown/negative value.

Step3: if the person had a missing/unknown after first positive, we censor them there and diff=1st positive date minus date(missing/unknown) but changed_result=0;

Step4: if the person had a negative after first positive, we censor them there and diff=1st positive date minus date(first negative after 1st positive) but changed_result=1;

So, in the example you gave for ID 7, we don't care about v5 as after first positive we encountered a misisng at v3, so diff=01jan2007-08nov2007=311

Your latest code working but I still need the t2 value to be printed instead of missing value there.

Thanks a lot!

ahhh · Posted 12-16-2016 04:03 PM

So to put it succinctly- the FIRST PERIOD

ahhh · Posted 12-16-2016 04:09 PM

I am sorry- the diff=01jan2007-30jun2007 =180

How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

Re: How do I calculate a variable based on another variable value arranged in multiple rows

UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Re: UPDATE: How do I calculate a variable based on another variable value arranged in multiple rows

Registration is open

SAS Training: Just a Click Away