Hi
I appreciate some help with this task, which I find a bit challenging
I have two tables:
1- Table 1: Containing patients and their diagnosis coded according to ICD code system (the table has several thousand rows in reality)
Patient | Diagnosis |
JOHN | A1 |
ALEX | A12 |
ZAK | K12 |
RICHARD | A25 |
SAM | G3 |
MANNY | H2 |
2- Table 2: basically contains a number of ICD diagnosis groups, for the sake of simplicity let us say we have three columns A, B, C each of which has rows representing a number of ICD diagnosis codes, here is a simplification:
Group A | Group B | Group C |
A12 | A21 | C13 |
A25 | A33 | J12 |
H12 | G1 | A1 |
K12 | K77 | L5 |
G1 | J12 | A12 |
D3 | A25 | G3 |
What I would like to do is to create a third table which is essentially Table 1 plus three more columns each for group A, B and C. The code would then go through every row and see if the ICD diagnosis in table 1 fits any of the diagnoses in Groups A, B and C and if so it would fill the corresponding cell with 1 otherwise 0. Here is the final table just to explain:
Patient | Diagnosis | Group A | Group B | Group C |
JOHN | A1 | 0 | 0 | 1 |
ALEX | A12 | 1 | 0 | 1 |
ZAK | K12 | 1 | 0 | 0 |
RICHARD | A25 | 1 | 1 | 0 |
SAM | G3 | 0 | 0 | 1 |
MANNY | H2 | 0 | 0 | 0 |
Could anyone please suggest how this can be done? The tables in reality are of course much much larger
Best regards
Ok, I think I did solve it at the end
This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly
1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"
2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table
proc sql;
create table want as
select a.*, b.flag
from patients as a
left join groups as b
on a.diagnosis=b.groupA;
quit;
3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"
then repeating this for Group B and Group C
Any comments or suggestions to this solution?
Kind regards
This code below gives you a normalised dataset which can be merged onto your original data to get group information. As to your required output having the 1's and 0's, this is an exact duplicate of the question here:
As there, I will say it again, the separate columns - groupa, groupb etc. do not add anything to data other than size, so why do you need it, by processing is far easier and quicker.
data group;
groupa="A12"; groupb="A21"; groupc="C13"; output;
groupa="A25"; groupb="A33"; groupc="J12"; output;
run;
proc transpose data=group out=t_group;
by group:;
var group:;
run;
Thank you RW9, though not sure this is the best solution as both tables are rather large. I would prefer if there can be a function to compare contentd
Thanks again
Best regards
Well, there is vname. However what you will end up with is lots of code to add no value. Merging is pretty quick, all SAS functions and SQL functions have by/class processing and that is the faster way to perform calculations/functions. So why bother going to all the hassel of creating many columns, then needing to keep lists of those columns to do things with them etc. when you could just create a normalised table and by group it?
Thanks again RW9
As mentioned above, Table 2 ( the diagnoses) is 8 groups x 40 different diagnoses, and during the course of the analysis we might need to add/remove some diagnoses. It is easier to do that through a table then edit there instead of a long code to include all these diagnoses or am I mistaken?
Best regards
I am not sure that a hash solution will meet the load of your data set. Ignore this if it is difficult.
Here is the approach:
Store Diagnosis of each Group(A,B,C) into a hash table while reading the second data set(TWO).
Then for each Diagnosis of first(ONE), check its presence in each hash table, and if present, flag it to one(1).
data one;
input Patient $ Diagnosis $;
datalines;
JOHN A1
ALEX A12
ZAK K12
RICHARD A25
SAM G3
MANNY H2
;
run;
data two;
input A$ B$ C$;
datalines;
A12 A21 C13
A25 A33 J12
H12 G1 A1
K12 K77 L5
G1 J12 A12
D3 A25 G3
;
run;
data want;
if _n_ = 1 then do;
if 0 then set two;
declare hash ha();
ha.definekey('A');
ha.definedone();
declare hash hb();
hb.definekey('B');
hb.definedone();
declare hash hc();
hc.definekey('C');
hc.definedone();
do until (eof);
set two end = eof;
ha.add();
hb.add();
hc.add();
end;
end;
set one;
array k
do i = 1 to dim(k);
k = 0;
end;
A = Diagnosis;
if ha.find() = 0 then k[1] = 1;
B = Diagnosis;
if hb.find() = 0 then k[2] = 1;
C = Diagnosis;
if hc.find() = 0 then k[3] = 1;
keep Patient Diagnosis ga gb gc;
run;
proc print data = want;
run;
Thank you all,
Here is some update:
Table 1 contains 1.25 million observations
Table 2 contains 8 groups and 40 different ICD diagnoses per row
I am not sure the solution suggested would manage the data load
Any other suggestions?
Thanks again
I would be tempted to use Table 2 to create custom format or informat to create the coded values. If the coded values are to be numeric then something like this might be a starting point.
proc format;
invalue groupa
'A12','A25','H12','K12','G1','D3' = 1
other=0;
run;
data want;
set have;
groupacode= input(diagnosis,groupa.);
run;
You would likely want to make a CNTLIN dataset from Table 2 and save the resulting formats in a permanent library.
The 1.25 million data set(ONE in my example) is not an issue because it is a DRIVER. Only the data from TWO, that too, only Diagnosis will be stored in the hash table.
I believe the solution can be tried with some sampled data.
Thank you datasp, will definitely try it
Do you think there are advantages to using your solution over proc sql as I suggested?
Thank you again
So to update
I did try datasp solution and it worked briliantly too
One further detail to complicate things:
Some if the ICD diagnosis in the patient table had an additional fourth digit:
example: some had a diagnosis code of A219, this is actually a subgroup of A21
in my solution, I bypassed this by substr:
proc sql;
create table want as
select a.*, b.flag
from patients as a
left join groups as b
on substr(a.diagnosis,1,3)=b.groupA;
quit;
Can it be done too in your solution datasp? It wouldnt work in the "format" type of solusion
Thanks again
Additional 4-th digit is not an issue. At present the use of $ sign is by default takes 8 characters. Even if it is longer than 8, use the appropriate width to $. No need
to substring it.
Ok, I think I did solve it at the end
This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly
1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"
2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table
proc sql;
create table want as
select a.*, b.flag
from patients as a
left join groups as b
on a.diagnosis=b.groupA;
quit;
3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"
then repeating this for Group B and Group C
Any comments or suggestions to this solution?
Kind regards
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.