I need urgent help with this situation. I have a dataset that looks like this:
ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive
Number of visits may vary for each id and sometimes 'results' might be missing.
For each unique ID:
1.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1
2.) if there are no positive result at all for an id-, no diff in dates is calculated and 'changed_result' =0
I hope I can get some answers soon. I use BASE SAS V9
Thank you in advance for your help.
Please post your expected output, both values and file.
Here is my dataset and the output I need:
ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive
expected output:
ID VISIT T0 RESULT diff changed_result
1 V1 07SEP2005 negative . 0
1 v2 06oct2005 negative . 0
1 v3 01apr2006 positive . 0
1 v4 03sep2006 negative 155 1
1 v5 08sep2006 negative . 0
2 v1 07sep2005 negative . 0
2 v2 02oct2005 positive . 0
2 v3 05nov2005 positive . 0
2 v4 10nov2006 negative 404 1
3 v1 07SEP2005 negative . 0
3 v2 03OCT2005 positive . 0
3 v3 09nov2005 . 0
3 v4 10dec2005 negative 68 1
3 v5 03mar2006 positive . 0
The date value for first positive needs to be stored and compared to the one for first negative after the first positive.
For ID1 : 01apr2006 - 03sep2006;
for id2 : 02oct2005 - 10nov2006;
for id3: 03OCT2005 - 10dec2005
Thank you
Hope you are expecting an output as below. if yes then follow this code
data want;
do until(first.id) ;
set have(where=(result='positive'));
by id visit;
posdate=t0;
end;
do until(last.id) ;
set have;
by id visit;
if lag(result) in ('positive',' ') and result='negative' and t0>posdate then changed_result=1;
else changed_result=0;
if result='negative' and t0>posdate then diff=t0-posdate;
output;
end;
run;
Thank you Jag, I am a couple of issues with this one- I would need all the observations in the final dataset, for some reason I am geting fewer obs. Also somewhere the logic is not working.
Nevertheless, I appreciate your help!
You don't really need the double-dow here (i.e. two "do until (last.id)" groups), because there is no need to go back and re-read records. All one should be doing is to maintain the current status, i.e. whether a first positive record has been encountered. If so, then a secondary status is checked, namely whether a subsequent negative has been encountered:
Note the retained variables FIRST_POS in the program below can have three values:
data want (drop=first_pos);
set have;
by id;
retain first_pos; /* T0 date of first positive rec */
if result='positive' and first_pos=. then first_pos=t0;
if result='negative' and not(missing(first_pos)) then do;
changed_result=1;
diff=t0-first_pos;
first_pos=.Z;
end;
else changed_result=0;
if last.id then first_pos=.;
run ;
Try next code - changed according to your answer:
data want;
set have;
by id;
retain phase pos_d neg_d ;
changed_result =0; /* initial value */
if first.id then phase = 0;
if phase = 0 and result = 'positive'
then do; pos_d = t0; phase=1; end;
if phase = 1 and result = 'negative'
then do; neg_d = t0; phase =2; end;
if last.id the do;
if phase = 2 then do;
dif_days = neg_d - pos_d;
changed_result = 1;
end;
else do;
dif_days = .;
changed_result = 0;
end;
***output;
keep ID dif_days changed_result;
run;
This code works great, Thanks a ton SHMUEL 🙂
There is an update in the algorithm, please help:
CALCULATING THE TIME BETWEEN THE FIRST POSITIVE TO HOW EVER LONG THEY REMAIN POSITIVE (BEFORE HAVING A NEGATIVE OR UNKNOWN/MISSING RESULT)
ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive
4 v1 07SEP2005 negative
4 v2 10SEP2005 negative
4 v3 07OCT2005 negative
4 v4 17nOV2005 negative
4 v5 27JAN2006 negative
5 v1 03mar2007 positive
5 v2 30mar2007 positive
5 v3 06jun2007 positive
5 v4 13mar2008 positive
6 v1 03feb2007 positive
6 v2 30jun2007 UNKNOWN
6 v3 06nov2007
6 v4 03mar2008 positive
7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive
8 v1 07aug2005 negative
8 v2 07oct2005 negative
8 v3 15nov2005 positive
8 v4 01jan2006
9 v1 07jan2005 negative
9 v2 10mar2005 positive
9 v3 27OCT2005 UNKNOWN
9 v4 07dec2005 negative
9 v5 20JAN2006 negative
diff=t1-t2
BASICALLY WE FOLLOW FIRST POSITIVE RESULT UNTIL 1.)THERE IS A NEGATIVE RESULT OR 2.) TILL THE LAST IF THEY NEVER BECOME NEGATIVES OR 3.) THE NEXT RESULT IS MISSING OR UNKNOWN
1.) if an id has all negative results (and no positives) it has to be flagged (flag=1)(example: Id 4),
2.) if an id has all positive results (and no negatives) it has to be flagged (flag=2)(example: Id 5),
and diff= date of first positive minus date of last positive and changed_result=0
3.) if the results are like this-- 1- POSITIVE
2- POSITIVE
3- POSITIVE
4- (MISSING VALUE)
5- POSITIVE
then diff= dates of first positive and 3 rd positive need to be subtracted and changed_result=0
(similar situation when results are
1- POSITIVE
2- POSITIVE
3- (MISSING VALUE)
4- POSITIVE
then diff= dates of first positive and 2 rd positive need to be subtracted and changed_result=0
In other words, whenever resulTS are in (positive, missing) then diff=dates of first positive and the positive result right before the missing value
need to be subtracted and changed_result=0
4.) if the result are like this-- 1-negative
2-negative
3-positive
4-missing
then diff= missing value (cuz we have nothing to compare it to) and changed_result=0
In other words, if the second element while calculating diff is a missing value, then diff=missing and changed_result=0
5.)'unknown' values in result var should be treated same as missing values
Number of visits may vary for each id and sometimes 'results' might be missing/unknown (treat both the same way).
For each unique ID:
6.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1
In the output only one row should be there for each id:
expected output:
ID diff changed_result t1 t2 FLAG
1 155 1 01apr2006 03sep2006
2 404 1 02oct2005 10nov2006
3 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
4 . 0 . . 1 (ALL NEGATIVE RESULTS)
5 376 0 03mar2007 13mar2008 2 (ALL POSITIVE RESULTS)
6 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)
7 180 0 01jan2007 30jun2007 (this is how long were definitely positive)
8 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
9 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)
I hope I can get some answers soon. I use BASE SAS V9
Thank you so much in advance for your help.
Attached code contains 3 steps:
1) reading input into dataset TEST
2) the old algoritm - you may ignore it, if not need
3) the new algorithm - producing the expected results from TEST
Thank you Shmuel. The code works but for id 7,
7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive
the t1,t2 values are incorrect. The diff value is calculated correctly.
output for your code for id7:
id diff changed_result t1 t2 flag
7 180 0 01feb2008 30jun2007 .
when it should be :
id diff changed_result t1 t2 flag
7 180 0 01jan2007 30jun2007 .
I have fixed the error with ID=7. Please check again.
data alg2;
format id diff changed_result t1 t2 flag;
keep id diff changed_result t1 t2 flag;
set test;
by id;
retain type ; /* 1=start negative, 2=start positive, 0=changed */
retain pos1d; /* 1st positive date */
retain flag; /* 1=all negative, 2=all positive, .=changed */
retain t2; /* date used to calc diff */
retain diff changed_result 0;
result = lowcase(result);
if first.id then do;
diff=.;
if result = 'positive' then do; type=2; pos1d = t0; flag=2; end; else
if result = 'negative' then do; type=1; flag=1; end;
else do; type = 0; flag=.; end;
end; else
if result = 'positive' then do;
if type ne 2 /* 1st positive which is not 1st in ID */
and diff=. /* <<<<<<<<<<<< line added */
then do; type=2; pos1d = t0; flag=.; end;
else if flag=2 then do;
diff=t0-pos1d; t2=t0;
end;
/* if type=2 no change need */
end; else
if result = 'negative' then do;
if type =2 and diff=. /* 1st negative following positive(s) */
then do; diff = t0 - pos1d; t2=t0; flag=.; changed_result=1; end;
/* if type is 1 or 0 no change need */
end;
else do;
if type =2 /* missing following positive(s) */
then do; ***pos1d=.; t2=.; changed_result=0; flag=.; type=0; end; /* <<<<< line changed */
end;
if last.id then do;
if flag=1 then do; diff=.; t1=.; t2=.; output; end; else
if flag=2 then do; diff=t0 - pos1d; t1=pos1d; t2=t0; output; end;
else do;
if diff > 0
then do; t1=pos1d; /* changed_result diff and t2 retained */ output; end;
else do; diff=.; t1=.; t2=.; output; end;
end;
end;
format t1 t2 date9.;
run;
Question:
what would you like to do in case of two periods with break in ID ?
assume ID=7 has one more line:
7 v5 20DEC2009 positive <<< or negative >>> those are two different cases to consider >>>>
Would you prefer the first period? the last one? or the longest ?
Answer:
Step1: Need to see when they first become positive and note the date as t1
Step2: Follow the id until we encounter a negative or missing value or unknown , we do not care what ever happens after that. Even if there are multiple positive results or negative results after 1-positive 2-missing, we don't care for them as we encountered a missing value. We are calculating how long a person stays positive before encountering a missing/unknown/negative value.
Step3: if the person had a missing/unknown after first positive, we censor them there and diff=1st positive date minus date(missing/unknown) but changed_result=0;
Step4: if the person had a negative after first positive, we censor them there and diff=1st positive date minus date(first negative after 1st positive) but changed_result=1;
So, in the example you gave for ID 7, we don't care about v5 as after first positive we encountered a misisng at v3, so diff=01jan2007-08nov2007=311
Your latest code working but I still need the t2 value to be printed instead of missing value there.
Thanks a lot!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.