Hi,
I have a large dataset (as in table A) with around 100 variables that represent different diagnosis/procedure codes. Each row represents one patient.
Table A
var1 var2 var3 … var100
diag1 diag2 diag4
diag3 diag5 diag4 ... diag800
diag4 diag16
I also have a separate table with values for each of the diagnosis/procedure codes (more than 10,000 different codes altogether).
The values range from 1-40, so many diagnosis codes have the same value.
Table B
diagnosis_code value
diag1 12
diag2 4
diag3 21
diag4 39
I would like to merge these values into my original table, one value for each variable, to get something like this as a result:
Table C
var1 var2 var3 … var100 value_var1 value_var2 ...
diag1 diag2 diag4 12 4
diag3 diag5 diag4 ... diag800 21 13
diag4 diag16 39 19
Of course I could merge Table B into my data once for each of the 100 variables, but this will take too long.
Therefore I am thinking of constructing an array and checking each variable against the 40 possible values:
array diagnosis{100} var1-var100;
do i = 1 to 100;
   if diagnosis{i} in (list of all codes with value 1)
      then value_diagnosis{i} = 1;
   ...
   if diagnosis{i} in (list of all codes with value 40)
      then value_diagnosis{i} = 40;
end;
But this involves a lot of manual work - I am currently copy-pasting the list of codes with a particular value.
Is there any more efficient way?
How could I create the list from the Table B in a better way?
I am using SAS Enterprise Guide, version 7.1
Many thanks indeed!
Hi @SmallKatie,
There are many situations where data processing is facilitated by datasets in a "long" format (many observations, but few variables).
The situation you describe is a case in point, but your Table A is in "wide" format (many variables, relatively few observations).
So, one approach would be to transpose your data and then match the values to the diagnosis codes (see SAS code below). For the latter, I use PROC SQL, but after prior sorting (or indexing) you can use a data step (with a MERGE statement) as well.
/* Create test data */
data have_A;
input patid (var1 var2 var3) ($);
cards;
1 diag1 diag2 diag4
2 diag3 diag5 diag4
3 diag4 diag16 .
;
data have_B;
input diagnosis_code $ value;
cards;
diag1 12
diag2 4
diag3 21
diag4 39
diag5 13
diag16 19
;
/* Reshape dataset HAVE_A from wide to long format */
proc transpose data=have_A out=trans_A(rename=(col1=diagnosis_code)) name=diag_var;
by patid;
var var:;
run;
/* Match values to diagnosis codes */
proc sql;
create table want_long as
select a.*, b.value
from trans_A a natural left join have_B b
order by patid, diag_var;
quit;
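For completeness, here is a sketch of the data step equivalent mentioned above: sort both inputs by the join key, then MERGE with an IN= flag to emulate the left join. Dataset and variable names (TRANS_A, HAVE_B, DIAGNOSIS_CODE, DIAG_VAR) follow the example code.

```sas
/* Data step alternative to the PROC SQL join */
proc sort data=trans_A;
  by diagnosis_code;
run;
proc sort data=have_B;
  by diagnosis_code;
run;

data want_long2;
  merge trans_A(in=a) have_B;
  by diagnosis_code;
  if a; /* keep only observations from TRANS_A (left join) */
run;

/* Restore the sort order used by the SQL version */
proc sort data=want_long2;
  by patid diag_var;
run;
```

Note that observations with a missing diagnosis code (e.g., patient 3's VAR3) simply get a missing value, as in the SQL version.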
/* If really necessary, return to wide format */
proc transpose data=want_long out=want_trans(drop=_:) prefix=value_;
by patid;
var value;
id diag_var;
run;
data want_wide;
merge have_A
want_trans;
by patid;
run;
A different approach that could be applied to both "long" and "wide" format data is to use a user-defined informat to assign the values:
/* Create informat from dataset HAVE_B */
data infmt;
retain fmtname '@diagv';
set have_B(rename=(diagnosis_code=start value=label));
run;
proc format cntlin=infmt;
run;
/* Assign values using the informat */
data want_wide2;
set have_A;
array var[3]; /* Please replace 3 by 100 */
array value_var[3]; /* for your real data. */
do i=1 to dim(var);
value_var[i]=input(var[i], diagv.);
end;
drop i;
run;
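One caveat with the informat approach: codes that occur in HAVE_A but not in HAVE_B would trigger "invalid data" notes in the log. A sketch of one way to avoid this is to add an OTHER range (HLO='O') to the CNTLIN dataset so that unknown codes are quietly assigned a missing value (dataset name INFMT2 is just an illustration):

```sas
/* Create informat with an OTHER range for unmatched codes */
data infmt2;
  retain fmtname '@diagv';
  set have_B(rename=(diagnosis_code=start value=label)) end=last;
  output;
  if last then do;
    hlo   = 'O'; /* "Other": catches all codes not listed in HAVE_B */
    label = .;   /* assign missing value to unknown codes           */
    output;
  end;
run;

proc format cntlin=infmt2;
run;
```

Alternatively, the ?? modifier in the INPUT function, i.e. input(var[i], ?? diagv.), suppresses the invalid-data messages without changing the informat.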
The result is the same as dataset WANT_WIDE.
Thank you @FreelanceReinh!
Your solution worked perfectly! I ended up using the second solution, as my data is quite big (more than 50 million observations) and transposing made it huge.
Thanks again!
Hi @SmallKatie,
Glad to hear that my solution worked for you. I wouldn't have thought that you have >50 million observations. My background regarding patient data is clinical research, where large studies would have several thousand patients. I guess your data is more related to health economics or insurance.
Indeed, for datasets of this size the assignment of values from (not too large) look-up tables using formats or informats is particularly convenient as it doesn't require sorting, indexing or transposing. Your specific case of a character-to-numeric assignment was ideal for a numeric informat.
I would also agree with @FreelanceReinh: your problem is caused by the choice of data structure. Whatever code you write against that type of dataset is going to be suboptimal. Normalise your data, then process it. If you need transposed data for some report at the end, transpose it then. Remember, the data you program with should be structured for your benefit, to make your code efficient and simple. There is a reason why CDISC, for instance, chose to use normalised structures.