SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Merging on several variables

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

Merging on several variables

Hi,

 

I have a large dataset (as in table A) with around 100 variables that represent different diagnosis/procedure codes. Each row represents one patient.

 

Table A

var1    var2    var3     …    var100

diag1  diag2  diag4

diag3  diag5  diag4     ...    diag800

diag4  diag16

 

I also have a separate table, with values for each of the diagnosis/procedure codes (altogether more than 10 000 different codes).

The values range from 1-40, so many diagnosis codes have the same value.

 

Table B

diagnosis_code     value

diag1                     12

diag2                      4

diag3                      21

diag4                      39

 

I would like to merge this into my original value for each variable, to get something like this as a result:

 

Table C

var1    var2    var3     …    var100  value_var1 value_var2 ...

diag1  diag2  diag4                              12               4

diag3  diag5  diag4     ...    diag800      21             13

diag4  diag16                                       39              19

 

Of course I can merge in the Table B for each of the 100 variables, but this will take too long.

 

Therefore I am thinking of constructing an array and checking for each of the values (its 40 possible values):

 

array diagnosis{i}  {100} var1 - var100{

if diagnosis{i} in list{all variables with value 1}

       then value_diagnosis{i} = 1 

...

if diagnosis{i} in list{all variables with value 40}

       then value_diagnosis{i} = 40

 

But this involves a lot of manual work - I am currenlty copy pasting the list of variables with a particular value.

Is there any more efficient way?

How could I create the list from the Table B in a better way?

 

I am using SAS Enterprise Guide, version 7.1

 

Many thanks indeed!

 

 


Accepted Solutions
Solution
‎04-18-2016 05:28 AM
Trusted Advisor
Posts: 1,118

Re: Merging on several variables

Posted in reply to SmallKatie

Hi @SmallKatie,

 

There are many situations where data processing is facilitated by datasets in a "long" format (many observations, but few variables).
The situation you describe is a case in point, but your Table A is in "wide" format (many variables, relatively few observations).

 

So, one approach would be to transpose your data and then match the values to the diagnosis codes (see SAS code below). For the latter, I use PROC SQL, but after prior sorting (or indexing) you can use a data step (with a MERGE statement) as well.

/* Create test data */

data have_A;
input patid (var1 var2 var3) ($);
cards;
1 diag1  diag2  diag4
2 diag3  diag5  diag4
3 diag4  diag16 .
;

data have_B;
input diagnosis_code $ value;
cards;
diag1  12
diag2   4
diag3  21
diag4  39
diag5  13
diag16 19
;

/* Reshape dataset HAVE_A from wide to long format */

proc transpose data=have_A out=trans_A(rename=(col1=diagnosis_code)) name=diag_var;
by patid;
var var:;
run;

/* Match values to diagnosis codes */

proc sql;
create table want_long as
select a.*, b.value
from trans_A a natural left join have_B b
order by patid, diag_var;
quit;

/* If really necessary, return to wide format */

proc transpose data=want_long out=want_trans(drop=_:) prefix=value_;
by patid;
var value;
id diag_var;
run;

data want_wide;
merge have_A
      want_trans;
by patid;
run;

A different approach that could be applied to both "long" and "wide" format data is to use a user-defined informat to assign the values:

/* Create informat from dataset HAVE_B */

data infmt;
retain fmtname '@diagv';
set have_B(rename=(diagnosis_code=start value=label));
run;

proc format cntlin=infmt;
run;

/* Assign values using the informat */

data want_wide2;
set have_A;
array       var[3]; /* Please replace 3 by 100 */
array value_var[3]; /* for your real data.     */
do i=1 to dim(var);
  value_var[i]=input(var[i], diagv.);
end;
drop i;
run;

The result is the same as dataset WANT_WIDE.

View solution in original post


All Replies
Solution
‎04-18-2016 05:28 AM
Trusted Advisor
Posts: 1,118

Re: Merging on several variables

Posted in reply to SmallKatie

Hi @SmallKatie,

 

There are many situations where data processing is facilitated by datasets in a "long" format (many observations, but few variables).
The situation you describe is a case in point, but your Table A is in "wide" format (many variables, relatively few observations).

 

So, one approach would be to transpose your data and then match the values to the diagnosis codes (see SAS code below). For the latter, I use PROC SQL, but after prior sorting (or indexing) you can use a data step (with a MERGE statement) as well.

/* Create test data */

data have_A;
input patid (var1 var2 var3) ($);
cards;
1 diag1  diag2  diag4
2 diag3  diag5  diag4
3 diag4  diag16 .
;

data have_B;
input diagnosis_code $ value;
cards;
diag1  12
diag2   4
diag3  21
diag4  39
diag5  13
diag16 19
;

/* Reshape dataset HAVE_A from wide to long format */

proc transpose data=have_A out=trans_A(rename=(col1=diagnosis_code)) name=diag_var;
by patid;
var var:;
run;

/* Match values to diagnosis codes */

proc sql;
create table want_long as
select a.*, b.value
from trans_A a natural left join have_B b
order by patid, diag_var;
quit;

/* If really necessary, return to wide format */

proc transpose data=want_long out=want_trans(drop=_:) prefix=value_;
by patid;
var value;
id diag_var;
run;

data want_wide;
merge have_A
      want_trans;
by patid;
run;

A different approach that could be applied to both "long" and "wide" format data is to use a user-defined informat to assign the values:

/* Create informat from dataset HAVE_B */

data infmt;
retain fmtname '@diagv';
set have_B(rename=(diagnosis_code=start value=label));
run;

proc format cntlin=infmt;
run;

/* Assign values using the informat */

data want_wide2;
set have_A;
array       var[3]; /* Please replace 3 by 100 */
array value_var[3]; /* for your real data.     */
do i=1 to dim(var);
  value_var[i]=input(var[i], diagv.);
end;
drop i;
run;

The result is the same as dataset WANT_WIDE.

New Contributor
Posts: 2

Re: Merging on several variables

Posted in reply to FreelanceReinhard

Thank you @FreelanceReinhard

 

 

Trusted Advisor
Posts: 1,118

Re: Merging on several variables

Posted in reply to SmallKatie

Hi @SmallKatie,

 

Glad to hear that my solution worked for you. I wouldn't have thought that you have >50 million observations. My background regarding patient data is clinical research, where large studies would have several thousand patients. I guess, your data is more related to health economics or insurance.

 

Indeed, for datasets of this size the assignment of values from (not too large) look-up tables using formats or informats is particularly convenient as it doesn't require sorting, indexing or transposing. Your specific case of a character-to-numeric assignment was ideal for a numeric informat.

 

 

Super User
Super User
Posts: 7,988

Re: Merging on several variables

Posted in reply to SmallKatie

I would also agree with @FreelanceReinhard, your problem is caused by the choice in data structure.  Whatever code you write against that type of dataset is going to be suboptimal.  Normalise your data, then process it.  If for some report at teh end you need transposed data then transpose it then.  Remember the data you program with should be structured for your benefit, to make your code efficient and simple.  There is a reason why CDISC for instance choose to use normalised structures, for example.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 394 views
  • 6 likes
  • 3 in conversation