Greetings Community,
I would like help with the following issue: is there a way to create arrays when working with datasets in long format?
I usually work with a wide dataset with the following configuration:
data wide;
input id time1 time2 time3 group1 group2 group3 outcome1 outcome2 outcome3;
cards;
1 20 30 40 0 1 0 0 1 0
2 10 15 20 0 0 1 0 0 0
;
On this dataset, I perform time-dependent survival analysis and I am able to censor cases which had developed in previous time periods (i.e., in time period 3, subject 1 would be excluded since he developed the outcome in time period 2). I manage to do this with arrays.
Now, I am working with an Access-derived dataset, which is formatted in long fashion like this:
data long;
input id visit time group outcome;
cards;
1 1 20 0 0
1 2 30 1 1
1 3 40 0 0
2 1 10 0 0
2 2 15 0 0
2 3 20 1 0
;
How would you group the set of outcomes from the same patient, in example to exclude the third observation (and any other after an outcome has occurres, for instance)?
Many thanks,
Manuel
Manuel,
SAS has some standard tools, that I highly recommend you learn about. They begin with understanding the BY statement within a DATA step. Here's an example (untested code):
proc sort data=long;
by id visit;
run;
data want;
set long;
by id visit;
if first.id then delete_the_rest='N';
retain delete_the_rest;
if delete_the_rest='Y' then delete;
if outcome=1 then delete_the_rest='Y';
run;
Just a few notes ... The PROC SORT is not necessary if the data are in order. The BY statement in the data step requires sorted data, but it does require you to run PROC SORT if the data are already in order. And the order of the last two statements is important. If you switch them, then you will also delete the observation having outcome=1.
But to reiterate the most important point ... in the long run, it is vital to understand the BY statement in the DATA step!
Good luck.
do you want to try the code below?
data long;
input id visit time group outcome;
cards;
1 1 20 0 0
1 2 30 1 1
1 3 40 0 0
2 1 10 0 0
2 2 15 0 0
2 3 20 1 0
;
data want;
set long;
by id;
if outcome<=lag(outcome) and lag(outcome)>0 then delete;
run;
proc print;run;
obs id visit time group outcome
1 1 1 20 0 0
2 1 2 30 1 1
3 2 1 10 0 0
4 2 2 15 0 0
5 2 3 20 1 0
It looks like you are transposing data from long to wide.
data long; input id visit time group outcome; cards; 1 1 20 0 0 1 2 30 1 1 1 3 40 0 0 2 1 10 0 0 2 2 15 0 0 2 3 20 1 0 ; run; proc sql noprint; select max(count) into : count from (select count(*) as count from long group by id); quit; data want(keep=id _:); array _visit{&count} ; array _time{&count} ; array _group{&count} ; array _outcome{&count} ; count=0; do until(last.id); set long; by id; count+1; _visit{count}=visit;_time{count}=time; _group{count}=group;_outcome{count}=outcome; end; run;
Ksharp
Manuel,
SAS has some standard tools, that I highly recommend you learn about. They begin with understanding the BY statement within a DATA step. Here's an example (untested code):
proc sort data=long;
by id visit;
run;
data want;
set long;
by id visit;
if first.id then delete_the_rest='N';
retain delete_the_rest;
if delete_the_rest='Y' then delete;
if outcome=1 then delete_the_rest='Y';
run;
Just a few notes ... The PROC SORT is not necessary if the data are in order. The BY statement in the data step requires sorted data, but it does require you to run PROC SORT if the data are already in order. And the order of the last two statements is important. If you switch them, then you will also delete the observation having outcome=1.
But to reiterate the most important point ... in the long run, it is vital to understand the BY statement in the DATA step!
Good luck.
Hi Astounding,
I tested your code for you. It works great! Thank you for sharing! - Linlin
data long;
input id visit time group outcome;
cards;
1 1 20 0 0
1 2 30 1 1
1 3 40 0 0
2 1 10 0 0
2 2 15 0 0
2 3 20 1 0
;
proc sort data=long;
by id visit;
run;
data want;
set long;
by id visit;
if first.id then delete_the_rest='N';
retain delete_the_rest;
if delete_the_rest='Y' then delete;
if outcome=1 then delete_the_rest='Y';
run;
proc print;run;
Obs id visit time group outcome delete_the_rest
1 1 1 20 0 0 N
2 1 2 30 1 1 Y
3 2 1 10 0 0 N
4 2 2 15 0 0 N
5 2 3 20 1 0 N
First off, thank you all.
@Linlin: I didn't know of the LAG function, I guess this is one of the functions I need to properly work with groups of observations referring to the same subject in long format, together with the BY statement.
@Astounding: your code works for me, thanks. I just don't get why SAS deletes the "Y"s since the DELETE statement comes before the "Y" definition: I mean, when SAS reads the DELETE statement no "Y" exists yet, right? I am just trying to understand the mechanics of your code, since it works great.
Iron_grief,
There are two keys to understanding when "Y" exists. First is that the DATA step processes each observation sequentially. The statements in the DATA step execute in order for every observation individually. (The exception in this program is when the DELETE statement executes. In that case, the final statement will not execute for that observation.) The second key is the RETAIN statement. That's what tells SAS, "As you move on to the next observation, don't change DELETE_THE_REST. Just let the current value sit there as the next observation gets processed." All good stuff to know in the long run.
Astounding, I agree with you that it is good to know, that's why I asked for insights.
Thank you again.
Manuel
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.