BookmarkSubscribeRSS Feed
ahhh
Obsidian | Level 7

I need urgent help with this situation. I have a dataset that looks like this:

 

ID   VISIT       T0             RESULT
1     V1      07SEP2005    negative
1     v2      06oct2005       negative
1     v3      01apr2006      positive
1     v4      03sep2006      negative
1     v5      08sep2006      negative
2     v1      07sep2005      negative
2     v2      02oct2005       positive
2     v3      05nov2005      positive
2     v4     10nov2006       negative
3     v1      07SEP2005     negative
3     v2      03OCT2005    positive
3     v3      09nov2005
3     v4     10dec2005       negative
3     v5      03mar2006     positive

 

Number of visits may vary for each id and sometimes 'results' might be missing. 

For each unique ID:

1.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1

2.) if there are no positive result at all for an id-, no diff in dates is calculated and 'changed_result' =0

 

I hope I can get some answers soon. I use BASE SAS V9

Thank you in advance for your help.

27 REPLIES 27
Reeza
Super User

Please post your expected output, both values and file. 

ahhh
Obsidian | Level 7

Here is my dataset and the output I need:

 

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive

 

expected output:

ID VISIT T0 RESULT diff changed_result
1 V1 07SEP2005 negative . 0
1 v2 06oct2005 negative . 0
1 v3 01apr2006 positive . 0
1 v4 03sep2006 negative 155 1
1 v5 08sep2006 negative . 0
2 v1 07sep2005 negative . 0
2 v2 02oct2005 positive . 0
2 v3 05nov2005 positive . 0
2 v4 10nov2006 negative 404 1
3 v1 07SEP2005 negative . 0
3 v2 03OCT2005 positive . 0
3 v3 09nov2005 . 0
3 v4 10dec2005 negative 68 1
3 v5 03mar2006 positive . 0

 

 

The date value for first positive needs to be stored and compared to the one for first negative after the first positive.

For ID1 : 01apr2006 - 03sep2006;

for id2 : 02oct2005 - 10nov2006;

for id3: 03OCT2005 - 10dec2005

 

Thank you 

Jagadishkatam
Amethyst | Level 16

Hope you are expecting an output as below. if yes then follow this code

 

data want;
do until(first.id) ;
set have(where=(result='positive'));
by id visit;
posdate=t0;
end;
do until(last.id) ;
set have;
by id visit;
if lag(result) in ('positive',' ') and result='negative' and t0>posdate then changed_result=1; 
else changed_result=0; 
if result='negative' and t0>posdate then diff=t0-posdate; output; end; run;

 

 


Capture.PNG
Thanks,
Jag
ahhh
Obsidian | Level 7

Thank you Jag, I am a couple of issues with this one- I would need all the observations in the final dataset, for some reason I am geting fewer obs. Also somewhere the logic is not working.

Nevertheless, I appreciate your help!

mkeintz
PROC Star

You don't really need the double-dow here (i.e. two "do until (last.id)" groups), because there is no need to go back and re-read records.  All one should be doing is to maintain the current status, i.e. whether a first positive record has been encountered.  If so, then a secondary status is checked, namely whether a subsequent negative has been encountered:

 

Note the retained variables FIRST_POS in the program below can have three values:

  • a missing value, set at the first.id records, which is continued until ...
  • a T0 value corresponding to the first "positive" record, (if result="positive" and first_pos=.) which is continued until ...
  • the next "negative" record, (if result='negative' and not(missing(first_pos) ) when DIFF is calculated, and first_pos is set to .Z.

    It's set to .Z instead of ., so that a subsequent "positive" won't inadvertantly restart the process.

 

data want (drop=first_pos);
  set have;
  by id;

  retain first_pos;  /* T0 date of first positive rec */

  if result='positive' and first_pos=. then first_pos=t0;

  if result='negative' and not(missing(first_pos)) then do;
    changed_result=1;
    diff=t0-first_pos;
    first_pos=.Z;
  end;
  else changed_result=0;

  if last.id then first_pos=.;
run ;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Shmuel
Garnet | Level 18

Try next code - changed according to your answer:

 

data want;

 set have;

  by id;

       retain  phase pos_d neg_d ;

      changed_result =0;  /* initial value */

       if first.id then phase = 0;

       if phase = 0 and result = 'positive'

           then do;  pos_d = t0; phase=1; end;

       if phase = 1 and result = 'negative'

           then do; neg_d = t0; phase =2; end;

       if last.id the do;

          if phase = 2 then do;

             dif_days = neg_d - pos_d;

             changed_result = 1;

          end;

          else do;

              dif_days = .;

              changed_result = 0;

          end;

         ***output;

         keep ID dif_days changed_result;

  run;

          

  

       

ahhh
Obsidian | Level 7

This code works great, Thanks a ton SHMUEL 🙂

ahhh
Obsidian | Level 7

There is an update in the algorithm, please help:

CALCULATING THE TIME BETWEEN THE FIRST POSITIVE TO HOW EVER LONG THEY REMAIN POSITIVE (BEFORE HAVING A NEGATIVE OR UNKNOWN/MISSING RESULT)

 

ID VISIT T0 RESULT
1 V1 07SEP2005 negative
1 v2 06oct2005 negative
1 v3 01apr2006 positive
1 v4 03sep2006 negative
1 v5 08sep2006 negative
2 v1 07sep2005 negative
2 v2 02oct2005 positive
2 v3 05nov2005 positive
2 v4 10nov2006 negative
3 v1 07SEP2005 negative
3 v2 03OCT2005 positive
3 v3 09nov2005
3 v4 10dec2005 negative
3 v5 03mar2006 positive
4 v1 07SEP2005 negative
4 v2 10SEP2005 negative
4 v3 07OCT2005 negative
4 v4 17nOV2005 negative
4 v5 27JAN2006 negative
5 v1 03mar2007 positive
5 v2 30mar2007 positive
5 v3 06jun2007 positive
5 v4 13mar2008 positive
6 v1 03feb2007 positive
6 v2 30jun2007 UNKNOWN
6 v3 06nov2007
6 v4 03mar2008 positive
7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive
8 v1 07aug2005 negative
8 v2 07oct2005 negative
8 v3 15nov2005 positive
8 v4 01jan2006
9 v1 07jan2005 negative
9 v2 10mar2005 positive
9 v3 27OCT2005 UNKNOWN
9 v4 07dec2005 negative
9 v5 20JAN2006 negative

 

diff=t1-t2
BASICALLY WE FOLLOW FIRST POSITIVE RESULT UNTIL 1.)THERE IS A NEGATIVE RESULT OR 2.) TILL THE LAST IF THEY NEVER BECOME NEGATIVES OR 3.) THE NEXT RESULT IS MISSING OR UNKNOWN

1.) if an id has all negative results (and no positives) it has to be flagged (flag=1)(example: Id 4),
2.) if an id has all positive results (and no negatives) it has to be flagged (flag=2)(example: Id 5),
and diff= date of first positive minus date of last positive and changed_result=0
3.) if the results are like this-- 1- POSITIVE
2- POSITIVE
3- POSITIVE
4- (MISSING VALUE)
5- POSITIVE
then diff= dates of first positive and 3 rd positive need to be subtracted and changed_result=0
(similar situation when results are
1- POSITIVE
2- POSITIVE
3- (MISSING VALUE)
4- POSITIVE
then diff= dates of first positive and 2 rd positive need to be subtracted and changed_result=0
In other words, whenever resulTS are in (positive, missing) then diff=dates of first positive and the positive result right before the missing value
need to be subtracted and changed_result=0
4.) if the result are like this-- 1-negative
2-negative
3-positive
4-missing
then diff= missing value (cuz we have nothing to compare it to) and changed_result=0
In other words, if the second element while calculating diff is a missing value, then diff=missing and changed_result=0
5.)'unknown' values in result var should be treated same as missing values


Number of visits may vary for each id and sometimes 'results' might be missing/unknown (treat both the same way).
For each unique ID:
6.) When there is 'positive' test result, we need to look for next consecutive 'negative' result and- we need to calculate the difference between the dates of the first positive result and the date of the first negative result after the first positive result and a new variable 'changed_result' is set to 1

 


In the output only one row should be there for each id:


expected output:

ID diff changed_result t1 t2 FLAG
1 155 1 01apr2006 03sep2006
2 404 1 02oct2005 10nov2006
3 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
4 . 0 . . 1 (ALL NEGATIVE RESULTS)
5 376 0 03mar2007 13mar2008 2 (ALL POSITIVE RESULTS)
6 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)
7 180 0 01jan2007 30jun2007 (this is how long were definitely positive)
8 . 0 . . (CANT' BE CALCULATED AS THERE IS A MISSING RESULT AFTER FIRST POSITIVE)
9 . 0 . . (CANT' BE CALCULATED AS THERE IS A unknown RESULT AFTER FIRST POSITIVE)


I hope I can get some answers soon. I use BASE SAS V9
Thank you so much in advance for your help.

 

Shmuel
Garnet | Level 18

Attached code contains 3 steps:

1) reading input into dataset TEST

2) the old algoritm - you may ignore it, if not need

3) the new algorithm - producing the expected results from TEST

 

 

ahhh
Obsidian | Level 7

Thank you Shmuel. The code works but for id 7,

7 v1 01jan2007 positive
7 v2 30jun2007 positive
7 v3 08nov2007
7 v4 01feb2008 positive

 

the t1,t2 values are incorrect. The diff value is calculated correctly.

output for your code for id7: 

id diff changed_result      t1                    t2                   flag

7  180  0                       01feb2008       30jun2007         .

 

when it should be :

id diff changed_result      t1                    t2                   flag

7  180  0                       01jan2007      30jun2007         .

Shmuel
Garnet | Level 18

I have fixed the error with ID=7. Please check again.

 

data alg2;
format id diff changed_result t1 t2 flag;
keep id diff changed_result t1 t2 flag;
set test;
by id;
retain type ; /* 1=start negative, 2=start positive, 0=changed */
retain pos1d; /* 1st positive date */
retain flag; /* 1=all negative, 2=all positive, .=changed */
retain t2; /* date used to calc diff */
retain diff changed_result 0;
result = lowcase(result);

if first.id then do;
diff=.;
if result = 'positive' then do; type=2; pos1d = t0; flag=2; end; else
if result = 'negative' then do; type=1; flag=1; end;
else do; type = 0; flag=.; end;

end; else

if result = 'positive' then do;
if type ne 2 /* 1st positive which is not 1st in ID */

and diff=.                            /* <<<<<<<<<<<< line added */
then do; type=2; pos1d = t0; flag=.; end;
else if flag=2 then do;
diff=t0-pos1d; t2=t0;
end;
/* if type=2 no change need */
end; else
if result = 'negative' then do;
if type =2 and diff=. /* 1st negative following positive(s) */
then do; diff = t0 - pos1d; t2=t0; flag=.; changed_result=1; end;
/* if type is 1 or 0 no change need */
end;
else do;
if type =2 /* missing following positive(s) */
then do; ***pos1d=.; t2=.; changed_result=0; flag=.; type=0; end;    /* <<<<< line changed */
end;

if last.id then do;
if flag=1 then do; diff=.; t1=.; t2=.; output; end; else
if flag=2 then do; diff=t0 - pos1d; t1=pos1d; t2=t0; output; end;
else do;
if diff > 0
then do; t1=pos1d; /* changed_result diff and t2 retained */ output; end;
else do; diff=.; t1=.; t2=.; output; end;
end;
end;
format t1 t2 date9.;
run;

 

Question:

what would you like to do in case of two periods with break in ID ?

assume ID=7 has one more line:

   7  v5  20DEC2009 positive   <<< or negative >>> those are two different cases to consider >>>>

Would you prefer the first period? the last one? or the longest ?

ahhh
Obsidian | Level 7

Answer:

Step1: Need to see when they first become positive and note the date as t1

Step2: Follow the id until we encounter a negative or missing value or unknown , we do not care what ever happens after that. Even if there are multiple positive results or negative results after 1-positive 2-missing, we don't care for them as we encountered a missing value. We are calculating how long a person stays positive before encountering a missing/unknown/negative value.

Step3: if the person had a missing/unknown after first positive, we censor them there and diff=1st positive date minus date(missing/unknown) but changed_result=0;

Step4: if the person had a negative after first positive, we censor them there and diff=1st positive date minus date(first negative after 1st positive) but changed_result=1;

 

So, in the example you gave for ID 7, we don't care about v5 as after first positive we encountered a misisng at v3, so diff=01jan2007-08nov2007=311

 

Your latest code working but I still need the t2 value to be printed instead of missing value there.

Thanks a lot!

ahhh
Obsidian | Level 7
So to put it succinctly- the FIRST PERIOD
ahhh
Obsidian | Level 7
I am sorry- the diff=01jan2007-30jun2007 =180

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 27 replies
  • 3309 views
  • 4 likes
  • 5 in conversation