Solved: Re: Creating a new variable from transposed data

wj2 · Posted 02-20-2019 08:55 PM

Hello, I am working with a large data set involving patient IDs (n=~14,000) and diagnosis codes (~100 per patient). The data is in "long" format sorted by patient ID- please see example below. My goal is to create a variable to indicate whether each patient has a certain set of diagnosis codes.

patient_id diagnosis_code

1 aaa

1 bbb

1 xxx

1 ddd

2 aaa

2 ddd

2 eee

2 zzz

... ...

I am very new to SAS and my initial thought was to transpose the data and then create the variable from the transposed data set. However, this approach did not work. Please see my code below. Any guidance on how to approach this problem would be much appreciated.

proc sort data=have
out=sorted;
by patient_id;
run;

proc transpose data=sorted
out=transposed;
by patient_id;
var diagnosis_code;
run;

data transposed2;
set transposed;
if diagnosis_code in ('aaa','bbb') then vara=1;
else vara=0;
run;

Astounding · Posted 02-21-2019 10:17 AM

A few suggestions, mostly for readability, to a small extent for speed:

data want;
set have;
by patient_id;
if first.patient_id then do;
   vara=0;
   varb=0;
   varc=0;
end;
retain vara varb varc;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
else if diagnosis_code in ('ccc', 'ddd') then varb=1;
else if diagnosis_code in ('zzz', 'eee') then varc=1;
if last.patient_id;
run;

The small speed improvements: adding ELSE to the list of IF/THEN possibilities to avoid re-checking once a match has been found, and checking for first.patient_id just once per observation instead of for each variable.

Depending on the size of your data, you may or may not be able to measure the speed differences.

View solution in original post

Astounding · Posted 02-20-2019 09:09 PM

You're right. Transposing the data is unnecessary and just makes things difficult. Sketch out what you would like the result to look like, and you will receive multiple suggestions about a viable way to approach this using the original data.

Here's one approach, but it depends on what you really want as the result. After sorting:

data want;

set have;

by patient_id;

if first.patient_id then vara=0;

retain vara;

if diagnosis_code in ('aaa', 'bbb') then vara=1;

if last.patient_id;

run;

wj2 · Posted 02-20-2019 11:31 PM

Thank you, this produced exactly what was needed. I am actually interested in creating multiple (non-mutually exclusive) variables using the same logic. I would assume that I would use the code below to build on what was provided in the previous post. Would this be the most effective way of doing this?

data want;

set have;

by patient_id;

if first.patient_id then vara=0;

retain vara;

if diagnosis_code in ('aaa', 'bbb') then vara=1;

if first.patient_id then varb=0;

retain varb;

if diagnosis_code in ('ccc', 'ddd') then varb=1;

.... etc.

if last.patient_id;

run;

PGStats · Posted 02-20-2019 11:43 PM

One possible approach:

data have;
input patient_id  diagnosis_code$;
datalines;
       1                     aaa
       1                     bbb
       1                     xxx
       1                     ddd
       2                     aaa
       2                     ddd
       2                     eee
       2                     zzz
       ;

proc sort data=have
out=sorted;
by patient_id;
run;

proc transpose data=sorted
out=transposed(drop=_name_);
by patient_id;
var diagnosis_code;
id diagnosis_code;
run;

data transposed2;
set transposed;
/* Var_ab : patient has both aaa and bbb diagnoses */
var_ab = cmiss(aaa, bbb) = 0;
run;

proc print; run;

    patient_
        Obs       id       aaa    bbb    xxx    ddd    eee    zzz    var_ab

         1         1       aaa    bbb    xxx    ddd                     1
         2         2       aaa                  ddd    eee    zzz       0

PG

wj2 · Posted 02-21-2019 09:52 AM

Thank you for the suggestion; however, given the size of my data set, transposing the data seems to take a large amount of processing time. As a result, it appears that using a first.variable and last.variable approach may be a little more efficient in my case.

In terms of creating multiple variables in the same data step using the first.variable and last.variable approach that @Astounding suggested, what is an effective/efficient way of going about this? Would it be the same code mentioned above but with the "if last.patient_id" statement inserted at the end of the last variable? For example, see below. Is there a better way of doing this?

data want;
set have;
by patient_id;
if first.patient_id then vara=0;
retain vara;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
if first.patient_id then varb=0;
retain varb;
if diagnosis_code in ('ccc', 'ddd') then varb=1;
if first.patient_id then varc=0;
retain varc;
if diagnosis_code in ('zzz', 'eee') then varc=1;
.... etc.
if last.patient_id;
run;

Astounding · Posted 02-21-2019 10:17 AM

A few suggestions, mostly for readability, to a small extent for speed:

data want;
set have;
by patient_id;
if first.patient_id then do;
   vara=0;
   varb=0;
   varc=0;
end;
retain vara varb varc;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
else if diagnosis_code in ('ccc', 'ddd') then varb=1;
else if diagnosis_code in ('zzz', 'eee') then varc=1;
if last.patient_id;
run;

The small speed improvements: adding ELSE to the list of IF/THEN possibilities to avoid re-checking once a match has been found, and checking for first.patient_id just once per observation instead of for each variable.

Depending on the size of your data, you may or may not be able to measure the speed differences.

Creating a new variable from transposed data

Re: Creating a new variable from transposed data

Re: Creating a new variable from transposed data

Re: Creating a new variable from transposed data

Re: Creating a new variable from transposed data

Re: Creating a new variable from transposed data

Re: Creating a new variable from transposed data

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away