## DATA Step, Macro, Functions and more

Solved
Occasional Contributor
Posts: 8

Hi, I'm working with longitudinal data on patients that may have two types of infections (X & Y). I want to create a new variable for patients that have any "Infection Y" at any time after an "Infection X" event. The interval variable is in months. I have tried a lot of different things, but can't seem to get any to work. Please, if you have any advice, please let me know as I am very frustrated :-) Thank you!

Here is what my dataset looks like:

ID    INTERVAL   INFECT_X   INFECT_Y

1     1.1               1                  0

1     1.4               0                  0

1     1.5               0                  1

2     2.4               0                  1

2     2.6               1                  0

3     1.6               0                  1

3     1.9               1                  0

3     2.0               0                  1

4     2.2               0                  0

4     2.4               0                  0

Here is what I want it to look like:

ID    INTERVAL   INFECT_X   INFECT_Y   NEWVAR

1     1.1               1                  0                 1

1     1.4               0                  0                 1

1     1.5               0                  1                 1

2     2.4               0                  1                 0

2     2.6               1                  0                 0

3     1.6               0                  1                 1

3     1.9               1                  0                 1

3     2.0               0                  1                 1

4     2.2               0                  0                 0

4     2.4               0                  0                 0

Also, note that an Infect_Y can occur before the Infect_X (ID #3), as long as another infect_Y occurs again after infect_X, it counts. Also, the infections are not always consecutive, there can be a long time between them. I appreciate any advice, thanks!

Accepted Solutions
Solution
‎08-14-2017 09:58 PM
Super User
Posts: 8,220

@KurtBremser's code will run better if you change the line:

;

end;

Alternatively, you can produce the same result using a DOW loop:

data want (drop=inx);
do until (last.ID);
set have;
by id;
if infect_x then inx=1;
else if inx*infect_y then newvar=1;
end;
do until (last.ID);
set have;
by id;
if newvar ne 1 then newvar=0;
output;
end;
run;

Art, CEO, AnalystFinder.com

All Replies
Super User
Posts: 10,623

One method with an intermediate table:

data int (keep=id newvar);
set have;
by id;
retain newvar inf_x;
if first.id
then do;
newvar = 0;
inf_x = 0;
;
if infect_x then inf_x = 1;
if inf_x and infect_y then newvar = 1;
if last.id then output;
run;

data want;
merge
have
int
;
by id;
run;

Untested. For tested code, provide your example data in a data step with datalines for easy recreation.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Occasional Contributor
Posts: 8

Thanks Kurt! I ended up having to use Art's iteration, somehow yours gave me zero observations for newvar (I'm 100% sure my own user error, as I'm a newbie), but thank you for your kind suggestions! :-)
Solution
‎08-14-2017 09:58 PM
Super User
Posts: 8,220

@KurtBremser's code will run better if you change the line:

;

end;

Alternatively, you can produce the same result using a DOW loop:

data want (drop=inx);
do until (last.ID);
set have;
by id;
if infect_x then inx=1;
else if inx*infect_y then newvar=1;
end;
do until (last.ID);
set have;
by id;
if newvar ne 1 then newvar=0;
output;
end;
run;

Art, CEO, AnalystFinder.com

Super User
Posts: 10,623

[ Edited ]

As I said, untested

just a quick hack from the hip.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Occasional Contributor
Posts: 8

Thank you SOO much, Art! I tried all the suggested answers, and yours was the one that worked with how my dataset was written. (The others probably would've worked too, I probably messed them up somehow ).

Super User
Posts: 13,950

Just for curiosity sake what does 0.1 actually mean in the interval? Since "month" does not have a standard length in common usage I wonder what decimal values actually represent.

Occasional Contributor
Posts: 8

Hi Ballard, I think its a result of timestamping done by hospital electronic record systems. Fortunately, most of my patients are followed for a few years, so I dont think the little bits of time difference have a huge impact on the results. It was a little misleading the way I wrote the sample dataset, the infections are typically much further apart, I was typing fast. Sorry I don't have a more definitive answer for you!

Contributor
Posts: 54

A possible solution:

data disease;

input ID INTERVAL INFECT_X INFECT_Y;

datalines;

1 1.1 1 0

1 1.4 0 0

1 1.5 0 1

2 2.4 0 1

2 2.6 1 0

3 1.6 0 1

3 1.9 1 0

3 2.0 0 1

4 2.2 0 0

4 2.4 0 0

;

data disease1 (drop=INT);

set disease;

by ID;

retain INT 8;

length NEWVAR 8;

if first.ID then do;

INT = INTERVAL;

NEWVAR = 0;

end;

if INFECT_X = 1 then INT=INTERVAL;

if INFECT_Y = 1 and INTERVAL > INT then NEWVAR = 1; else NEWVAR = 0;

run;

proc sql;

create table NEWVARS as

select distinct(ID), NEWVAR from disease1 where NEWVAR = 1;

create table disease2 as

select a.ID, a.INTERVAL, a.INFECT_X, a.INFECT_Y, b.NEWVAR

from disease1 a left join NEWVARS b on a.ID = b.ID

order by ID, INTERVAL;

update disease2

set NEWVAR = 0 where NEWVAR = .;

quit;

☑ This topic is solved.