BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Hello, I am working with a large data set involving patient IDs (n=~14,000) and diagnosis codes (~100 per patient). The data is in "long" format sorted by patient ID- please see example below. My goal is to create a variable to indicate whether each patient has a certain set of diagnosis codes. 

 

patient_id     diagnosis_code

       1                     aaa

       1                     bbb

       1                     xxx

       1                     ddd

       2                     aaa

       2                     ddd

       2                     eee

       2                     zzz

       ...                     ...

 

I am very new to SAS and my initial thought was to transpose the data and then create the variable from the transposed data set. However, this approach did not work. Please see my code below. Any guidance on how to approach this problem would be much appreciated.

 

proc sort data=have
out=sorted;
by patient_id;
run;

 

proc transpose data=sorted
out=transposed;
by patient_id;
var diagnosis_code;
run;

 

data transposed2;
set transposed;
if diagnosis_code in ('aaa','bbb') then vara=1;
else vara=0;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

A few suggestions, mostly for readability, to a small extent for speed:

 

data want;
set have;
by patient_id;
if first.patient_id then do;
   vara=0;
   varb=0;
   varc=0;
end;
retain vara varb varc;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
else if diagnosis_code in ('ccc', 'ddd') then varb=1;
else if diagnosis_code in ('zzz', 'eee') then varc=1;
if last.patient_id;
run;

The small speed improvements:  adding ELSE to the list of IF/THEN possibilities to avoid re-checking once a match has been found, and checking for first.patient_id just once per observation instead of for each variable.

 

Depending on the size of your data, you may or may not be able to measure the speed differences.

View solution in original post

5 REPLIES 5
Astounding
PROC Star

You're right.  Transposing the data is unnecessary and just makes things difficult.  Sketch out what you would like the result to look like, and you will receive multiple suggestions about a viable way to approach this using the original data.

 

Here's one approach, but it depends on what you really want as the result.  After sorting:

 

data want;

set have;

by patient_id;

if first.patient_id then vara=0;

retain vara;

if diagnosis_code in ('aaa', 'bbb') then vara=1;

if last.patient_id;

run;

wj2
Quartz | Level 8 wj2
Quartz | Level 8

Thank you, this produced exactly what was needed. I am actually interested in creating multiple (non-mutually exclusive) variables using the same logic. I would assume that I would use the code below to build on what was provided in the previous post. Would this be the most effective way of doing this? 

 

data want;

set have;

by patient_id;

if first.patient_id then vara=0;

retain vara;

if diagnosis_code in ('aaa', 'bbb') then vara=1;

if first.patient_id then varb=0;

retain varb;

if diagnosis_code in ('ccc', 'ddd') then varb=1;

.... etc.

if last.patient_id;

run;

PGStats
Opal | Level 21

One possible approach:

 

data have;
input patient_id  diagnosis_code$;
datalines;
       1                     aaa
       1                     bbb
       1                     xxx
       1                     ddd
       2                     aaa
       2                     ddd
       2                     eee
       2                     zzz
       ;

proc sort data=have
out=sorted;
by patient_id;
run;

proc transpose data=sorted
out=transposed(drop=_name_);
by patient_id;
var diagnosis_code;
id diagnosis_code;
run;

data transposed2;
set transposed;
/* Var_ab : patient has both aaa and bbb diagnoses */
var_ab = cmiss(aaa, bbb) = 0;
run;

proc print; run;
    patient_
        Obs       id       aaa    bbb    xxx    ddd    eee    zzz    var_ab

         1         1       aaa    bbb    xxx    ddd                     1
         2         2       aaa                  ddd    eee    zzz       0
PG
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Thank you for the suggestion; however, given the size of my data set, transposing the data seems to take a large amount of processing time. As a result, it appears that using a first.variable and last.variable approach may be a little more efficient in my case. 

 

In terms of creating multiple variables in the same data step using the first.variable and last.variable approach that @Astounding suggested, what is an effective/efficient way of going about this? Would it be the same code mentioned above but with the "if last.patient_id" statement inserted at the end of the last variable? For example, see below. Is there a better way of doing this?  

 

data want;
set have;
by patient_id;
if first.patient_id then vara=0;
retain vara;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
if first.patient_id then varb=0;
retain varb;
if diagnosis_code in ('ccc', 'ddd') then varb=1;
if first.patient_id then varc=0; retain varc; if diagnosis_code in ('zzz', 'eee') then varc=1; .... etc. if last.patient_id; run;
Astounding
PROC Star

A few suggestions, mostly for readability, to a small extent for speed:

 

data want;
set have;
by patient_id;
if first.patient_id then do;
   vara=0;
   varb=0;
   varc=0;
end;
retain vara varb varc;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
else if diagnosis_code in ('ccc', 'ddd') then varb=1;
else if diagnosis_code in ('zzz', 'eee') then varc=1;
if last.patient_id;
run;

The small speed improvements:  adding ELSE to the list of IF/THEN possibilities to avoid re-checking once a match has been found, and checking for first.patient_id just once per observation instead of for each variable.

 

Depending on the size of your data, you may or may not be able to measure the speed differences.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 991 views
  • 0 likes
  • 3 in conversation