Hi,
I have a large dataset (as in table A) with around 100 variables that represent different diagnosis/procedure codes. Each row represents one patient.
Table A
var1 var2 var3 … var100
diag1 diag2 diag4
diag3 diag5 diag4 ... diag800
diag4 diag16
I also have a separate table with values for each of the diagnosis/procedure codes (more than 10,000 different codes altogether).
The values range from 1-40, so many diagnosis codes have the same value.
Table B
diagnosis_code value
diag1 12
diag2 4
diag3 21
diag4 39
I would like to merge these values into my original table, one value for each variable, to get something like this as a result:
Table C
var1 var2 var3 … var100 value_var1 value_var2 ...
diag1 diag2 diag4 12 4
diag3 diag5 diag4 ... diag800 21 13
diag4 diag16 39 19
Of course I could merge Table B into my data once for each of the 100 variables, but this will take too long.
Therefore I am thinking of constructing an array and checking each variable against the 40 possible values:
array diagnosis{100} var1-var100;
do i = 1 to 100;
   if diagnosis{i} in (list of all codes with value 1)
      then value_diagnosis{i} = 1;
   ...
   if diagnosis{i} in (list of all codes with value 40)
      then value_diagnosis{i} = 40;
end;
But this involves a lot of manual work - I am currently copy-pasting the list of codes with a particular value.
Is there any more efficient way?
How could I create the list from the Table B in a better way?
I am using SAS Enterprise Guide, version 7.1
Many thanks indeed!
Hi @SmallKatie,
There are many situations where data processing is facilitated by datasets in a "long" format (many observations, but few variables).
The situation you describe is a case in point, but your Table A is in "wide" format (many variables, relatively few observations).
So, one approach would be to transpose your data and then match the values to the diagnosis codes (see SAS code below). For the latter, I use PROC SQL, but after prior sorting (or indexing) you can use a data step (with a MERGE statement) as well.
/* Create test data */
data have_A;
input patid (var1 var2 var3) ($);
cards;
1 diag1 diag2 diag4
2 diag3 diag5 diag4
3 diag4 diag16 .
;
data have_B;
input diagnosis_code $ value;
cards;
diag1 12
diag2 4
diag3 21
diag4 39
diag5 13
diag16 19
;
/* Reshape dataset HAVE_A from wide to long format */
proc transpose data=have_A out=trans_A(rename=(col1=diagnosis_code)) name=diag_var;
by patid;
var var:;
run;
/* Match values to diagnosis codes */
proc sql;
create table want_long as
select a.*, b.value
from trans_A a natural left join have_B b
order by patid, diag_var;
quit;
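For completeness, here is a sketch of the data step equivalent mentioned above: sort both inputs by the join key, then MERGE with an IN= flag to emulate the left join. Dataset and variable names (TRANS_A, HAVE_B, DIAGNOSIS_CODE, DIAG_VAR) follow the example code.

```sas
/* Data step alternative to the PROC SQL join */
proc sort data=trans_A;
  by diagnosis_code;
run;
proc sort data=have_B;
  by diagnosis_code;
run;

data want_long2;
  merge trans_A(in=a) have_B;
  by diagnosis_code;
  if a; /* keep only observations from TRANS_A (left join) */
run;

/* Restore the sort order used by the SQL version */
proc sort data=want_long2;
  by patid diag_var;
run;
```

Note that observations with a missing diagnosis code (e.g., patient 3's VAR3) simply get a missing value, as in the SQL version.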
/* If really necessary, return to wide format */
proc transpose data=want_long out=want_trans(drop=_:) prefix=value_;
by patid;
var value;
id diag_var;
run;
data want_wide;
merge have_A
want_trans;
by patid;
run;
A different approach that could be applied to both "long" and "wide" format data is to use a user-defined informat to assign the values:
/* Create informat from dataset HAVE_B */
data infmt;
retain fmtname '@diagv';
set have_B(rename=(diagnosis_code=start value=label));
run;
proc format cntlin=infmt;
run;
/* Assign values using the informat */
data want_wide2;
set have_A;
array var[3]; /* Please replace 3 by 100 */
array value_var[3]; /* for your real data. */
do i=1 to dim(var);
value_var[i]=input(var[i], diagv.);
end;
drop i;
run;
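One caveat with the informat approach: codes that occur in HAVE_A but not in HAVE_B would trigger "invalid data" notes in the log. A sketch of one way to avoid this is to add an OTHER range (HLO='O') to the CNTLIN dataset so that unknown codes are quietly assigned a missing value (dataset name INFMT2 is just an illustration):

```sas
/* Create informat with an OTHER range for unmatched codes */
data infmt2;
  retain fmtname '@diagv';
  set have_B(rename=(diagnosis_code=start value=label)) end=last;
  output;
  if last then do;
    hlo   = 'O'; /* "Other": catches all codes not listed in HAVE_B */
    label = .;   /* assign missing value to unknown codes           */
    output;
  end;
run;

proc format cntlin=infmt2;
run;
```

Alternatively, the ?? modifier in the INPUT function, i.e. input(var[i], ?? diagv.), suppresses the invalid-data messages without changing the informat.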
The result is the same as dataset WANT_WIDE.
Thank you @FreelanceReinh!
Your solution worked perfectly! I ended up using the second solution, as my data is quite big (more than 50 million observations) and transposing made it huge.
Thanks again!
Hi @SmallKatie,
Glad to hear that my solution worked for you. I wouldn't have thought that you have >50 million observations. My background regarding patient data is clinical research, where large studies would have several thousand patients. I guess your data is more related to health economics or insurance.
Indeed, for datasets of this size the assignment of values from (not too large) look-up tables using formats or informats is particularly convenient as it doesn't require sorting, indexing or transposing. Your specific case of a character-to-numeric assignment was ideal for a numeric informat.
I would also agree with @FreelanceReinh: your problem is caused by the choice of data structure. Whatever code you write against that type of dataset is going to be suboptimal. Normalise your data, then process it. If you need transposed data for some report at the end, transpose it then. Remember, the data you program with should be structured for your benefit, to make your code efficient and simple. There is a reason why CDISC, for instance, chose to use normalised structures.