## Categorising a table based on contents of columns in a second table

Solved
Regular Contributor
Posts: 169

# Categorising a table based on contents of columns in a second table

Hi

I appreciate some help with this task, which I find a bit challenging

I have two tables:

1- Table 1: Containing patients and their diagnosis coded according to ICD code system (the table has several thousand rows in reality)

 Patient Diagnosis JOHN A1 ALEX A12 ZAK K12 RICHARD A25 SAM G3 MANNY H2

2- Table 2: basically contains a number of ICD diagnosis groups, for the sake of simplicity let us say we have three columns A, B, C each of which has rows representing a number of ICD diagnosis codes, here is a simplification:

 Group A Group B Group C A12 A21 C13 A25 A33 J12 H12 G1 A1 K12 K77 L5 G1 J12 A12 D3 A25 G3

What I would like to do is to create a third table which is essentially Table 1 plus three more columns each for group A, B and C. The code would then go through every row and see if the ICD diagnosis in table 1 fits any of the diagnoses in Groups A, B and C and if so it would fill the corresponding cell with 1 otherwise 0. Here is the final table just to explain:

 Patient Diagnosis Group A Group B Group C JOHN A1 0 0 1 ALEX A12 1 0 1 ZAK K12 1 0 0 RICHARD A25 1 1 0 SAM G3 0 0 1 MANNY H2 0 0 0

Could anyone please suggest how this can be done? The tables in reality are of course much much larger

Best regards

Accepted Solutions
Solution
‎06-09-2015 01:34 PM
Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

Ok, I think I did solve it at the end

This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on a.diagnosis=b.groupA;

quit;

3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

then repeating this for Group B and Group C

Any comments or suggestions to this solution?

Kind regards

All Replies
Super User
Posts: 8,380

## Re: Categorising a table based on contents of columns in a second table

This code below gives you a normalised dataset which can be merged onto your original data to get group information.  As to your required output having the 1's and 0's, this is an exact duplicate of the question here:

As there, I will say it again, the separate columns - groupa, groupb etc. do not add anything to data other than size, so why do you need it, by processing is far easier and quicker.

data group;

groupa="A12"; groupb="A21"; groupc="C13"; output;

groupa="A25"; groupb="A33"; groupc="J12"; output;

run;

proc transpose data=group out=t_group;

by group:;

var group:;

run;

Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

Thank you RW9, though not sure this is the best solution as both tables are rather large. I would prefer if there can be a function to compare contentd

Thanks again

Best regards

Super User
Posts: 8,380

## Re: Categorising a table based on contents of columns in a second table

Well, there is vname.  However what you will end up with is lots of code to add no value.  Merging is pretty quick, all SAS functions and SQL functions have by/class processing and that is the faster way to perform calculations/functions.  So why bother going to all the hassel of creating many columns, then needing to keep lists of those columns to do things with them etc. when you could just create a normalised table and by group it?

Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

Thanks again RW9

As mentioned above, Table 2 ( the diagnoses) is 8 groups x 40 different diagnoses, and during the course of the analysis we might need to add/remove some diagnoses. It is easier to do that through a table then edit there instead of a long code to include all these diagnoses or am I mistaken?

Best regards

Super Contributor
Posts: 308

## Re: Categorising a table based on contents of columns in a second table

I  am not sure that a hash solution will meet the load of your data set. Ignore this if it is difficult.

Here is the approach:

Store Diagnosis of each Group(A,B,C) into a hash table while reading the second data set(TWO).

Then for each Diagnosis of first(ONE), check its presence in each hash table, and if present, flag it to one(1).

data one;

input Patient \$ Diagnosis \$;

datalines;

JOHN    A1

ALEX    A12

ZAK     K12

RICHARD A25

SAM     G3

MANNY   H2

;

run;

data two;

input A\$ B\$ C\$;

datalines;

A12 A21 C13

A25 A33 J12

H12  G1  A1

K12 K77  L5

G1  J12  A12

D3  A25   G3

;

run;

data want;

if _n_ = 1 then do;

if 0 then set two;

declare hash ha();

ha.definekey('A');

ha.definedone();

declare hash hb();

hb.definekey('B');

hb.definedone();

declare hash hc();

hc.definekey('C');

hc.definedone();

do until (eof);

set two end = eof;

end;

end;

set one;

array k

• ga gb gc;
•    do i = 1 to dim(k);

k = 0;

end;

A = Diagnosis;

if ha.find() = 0 then k[1] = 1;

B = Diagnosis;

if hb.find() = 0 then k[2] = 1;

C = Diagnosis;

if hc.find() = 0 then k[3] = 1;

keep Patient Diagnosis ga gb gc;

run;

proc print data = want;

run;

Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

Thank you all,

Here is some update:

Table 1 contains 1.25 million observations

Table 2 contains 8 groups and 40 different ICD diagnoses per row

I am not sure the solution suggested would manage the data load

Any other suggestions?

Thanks again

Super User
Posts: 11,810

## Re: Categorising a table based on contents of columns in a second table

I would be tempted to use Table 2 to create custom format or informat to create the coded values. If the coded values are to be numeric then something like this might be a starting point.

proc format;

invalue groupa

'A12','A25','H12','K12','G1','D3' = 1

other=0;

run;

data want;

set have;

groupacode= input(diagnosis,groupa.);

run;

You would likely want to make a CNTLIN dataset from Table 2 and save the resulting formats in a permanent library.

Super Contributor
Posts: 308

## Re: Categorising a table based on contents of columns in a second table

The 1.25 million data set(ONE in my example) is not an issue because it is a DRIVER. Only the data from TWO, that too, only Diagnosis will be stored in the hash table.

I believe the solution can be tried with some sampled data.

Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

Thank you datasp, will definitely try it

Do you think there are advantages to using your solution over proc sql as I suggested?

Thank you again

Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

So to update

I did try datasp solution and it worked briliantly too

One further detail to complicate things:

Some if the ICD diagnosis in the patient table had an additional fourth digit:

example: some had a diagnosis code of A219, this is actually a subgroup of A21

in my solution, I bypassed this by substr:

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on substr(a.diagnosis,1,3)=b.groupA;

quit;

Can it be done too in  your solution datasp? It wouldnt work in the "format" type of solusion

Thanks again

Super Contributor
Posts: 308

## Re: Categorising a table based on contents of columns in a second table

Additional 4-th digit is not an issue. At present the use of \$ sign is by default takes 8 characters. Even if it is longer than 8, use the appropriate width to \$. No need

to substring it.

Solution
‎06-09-2015 01:34 PM
Regular Contributor
Posts: 169

## Re: Categorising a table based on contents of columns in a second table

Ok, I think I did solve it at the end

This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on a.diagnosis=b.groupA;

quit;

3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

then repeating this for Group B and Group C

Any comments or suggestions to this solution?

Kind regards

🔒 This topic is solved and locked.