Solved: Re: Categorising a table based on contents of columns in a second tabl...

ammarhm · Posted 06-09-2015 09:55 AM

Hi

I appreciate some help with this task, which I find a bit challenging

I have two tables:

1- Table 1: Containing patients and their diagnosis coded according to ICD code system (the table has several thousand rows in reality)

Patient	Diagnosis
JOHN	A1
ALEX	A12
ZAK	K12
RICHARD	A25
SAM	G3
MANNY	H2

2- Table 2: basically contains a number of ICD diagnosis groups, for the sake of simplicity let us say we have three columns A, B, C each of which has rows representing a number of ICD diagnosis codes, here is a simplification:

Group A	Group B	Group C
A12	A21	C13
A25	A33	J12
H12	G1	A1
K12	K77	L5
G1	J12	A12
D3	A25	G3

What I would like to do is to create a third table which is essentially Table 1 plus three more columns each for group A, B and C. The code would then go through every row and see if the ICD diagnosis in table 1 fits any of the diagnoses in Groups A, B and C and if so it would fill the corresponding cell with 1 otherwise 0. Here is the final table just to explain:

Patient	Diagnosis	Group A	Group B	Group C
JOHN	A1	0	0	1
ALEX	A12	1	0	1
ZAK	K12	1	0	0
RICHARD	A25	1	1	0
SAM	G3	0	0	1
MANNY	H2	0	0	0

Could anyone please suggest how this can be done? The tables in reality are of course much much larger

Best regards

ammarhm · Posted 06-09-2015 01:34 PM

Ok, I think I did solve it at the end

This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on a.diagnosis=b.groupA;

quit;

3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

then repeating this for Group B and Group C

Any comments or suggestions to this solution?

Kind regards

View solution in original post

RW9 · Posted 06-09-2015 10:17 AM

This code below gives you a normalised dataset which can be merged onto your original data to get group information. As to your required output having the 1's and 0's, this is an exact duplicate of the question here:

As there, I will say it again, the separate columns - groupa, groupb etc. do not add anything to data other than size, so why do you need it, by processing is far easier and quicker.

data group;

groupa="A12"; groupb="A21"; groupc="C13"; output;

groupa="A25"; groupb="A33"; groupc="J12"; output;

run;

proc transpose data=group out=t_group;

by group:;

var group:;

run;

ammarhm · Posted 06-09-2015 10:23 AM

Thank you RW9, though not sure this is the best solution as both tables are rather large. I would prefer if there can be a function to compare contentd

Thanks again

Best regards

RW9 · Posted 06-09-2015 10:29 AM

Well, there is vname. However what you will end up with is lots of code to add no value. Merging is pretty quick, all SAS functions and SQL functions have by/class processing and that is the faster way to perform calculations/functions. So why bother going to all the hassel of creating many columns, then needing to keep lists of those columns to do things with them etc. when you could just create a normalised table and by group it?

ammarhm · Posted 06-09-2015 12:16 PM

Thanks again RW9

As mentioned above, Table 2 ( the diagnoses) is 8 groups x 40 different diagnoses, and during the course of the analysis we might need to add/remove some diagnoses. It is easier to do that through a table then edit there instead of a long code to include all these diagnoses or am I mistaken?

Best regards

KachiM · Posted 06-09-2015 11:02 AM

I am not sure that a hash solution will meet the load of your data set. Ignore this if it is difficult.

Here is the approach:

Store Diagnosis of each Group(A,B,C) into a hash table while reading the second data set(TWO).

Then for each Diagnosis of first(ONE), check its presence in each hash table, and if present, flag it to one(1).

data one;

input Patient $ Diagnosis $;

datalines;

JOHN A1

ALEX A12

ZAK K12

RICHARD A25

SAM G3

MANNY H2

;

run;

data two;

input A$ B$ C$;

datalines;

A12 A21 C13

A25 A33 J12

H12 G1 A1

K12 K77 L5

G1 J12 A12

D3 A25 G3

;

run;

data want;

if _n_ = 1 then do;

if 0 then set two;

declare hash ha();

ha.definekey('A');

ha.definedone();

declare hash hb();

hb.definekey('B');

hb.definedone();

declare hash hc();

hc.definekey('C');

hc.definedone();

do until (eof);

set two end = eof;

ha.add();

hb.add();

hc.add();

end;

set one;

array k

ga gb gc;

do i = 1 to dim(k);

k = 0;

end;

A = Diagnosis;

if ha.find() = 0 then k[1] = 1;

B = Diagnosis;

if hb.find() = 0 then k[2] = 1;

C = Diagnosis;

if hc.find() = 0 then k[3] = 1;

keep Patient Diagnosis ga gb gc;

run;

proc print data = want;

run;

ammarhm · Posted 06-09-2015 12:13 PM

Thank you all,

Here is some update:

Table 1 contains 1.25 million observations

Table 2 contains 8 groups and 40 different ICD diagnoses per row

I am not sure the solution suggested would manage the data load

Any other suggestions?

Thanks again

ballardw · Posted 06-09-2015 01:18 PM

I would be tempted to use Table 2 to create custom format or informat to create the coded values. If the coded values are to be numeric then something like this might be a starting point.

proc format;

invalue groupa

'A12','A25','H12','K12','G1','D3' = 1

other=0;

run;

data want;

set have;

groupacode= input(diagnosis,groupa.);

run;

You would likely want to make a CNTLIN dataset from Table 2 and save the resulting formats in a permanent library.

KachiM · Posted 06-09-2015 01:41 PM

The 1.25 million data set(ONE in my example) is not an issue because it is a DRIVER. Only the data from TWO, that too, only Diagnosis will be stored in the hash table.

I believe the solution can be tried with some sampled data.

ammarhm · Posted 06-09-2015 01:44 PM

Thank you datasp, will definitely try it

Do you think there are advantages to using your solution over proc sql as I suggested?

Thank you again

ammarhm · Posted 06-09-2015 01:56 PM

So to update

I did try datasp solution and it worked briliantly too

One further detail to complicate things:

Some if the ICD diagnosis in the patient table had an additional fourth digit:

example: some had a diagnosis code of A219, this is actually a subgroup of A21

in my solution, I bypassed this by substr:

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on substr(a.diagnosis,1,3)=b.groupA;

quit;

Can it be done too in your solution datasp? It wouldnt work in the "format" type of solusion

Thanks again

KachiM · Posted 06-09-2015 02:24 PM

Additional 4-th digit is not an issue. At present the use of $ sign is by default takes 8 characters. Even if it is longer than 8, use the appropriate width to $. No need

to substring it.

ammarhm · Posted 06-09-2015 01:34 PM

Ok, I think I did solve it at the end

This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on a.diagnosis=b.groupA;

quit;

3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

then repeating this for Group B and Group C

Any comments or suggestions to this solution?

Kind regards

Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table

Re: Categorising a table based on contents of columns in a second table