Programming the statistical procedures from SAS

Categorising a table based on contents of columns in a second table

Accepted Solution Solved
Reply
Regular Contributor
Posts: 163
Accepted Solution

Categorising a table based on contents of columns in a second table

Hi

I appreciate some help with this task, which I find a bit challenging

I have two tables:

1- Table 1: Containing patients and their diagnosis coded according to ICD code system (the table has several thousand rows in reality)

Patient

Diagnosis

JOHN

A1

ALEX

A12

ZAK

K12

RICHARD

A25

SAM

G3

MANNY

H2

2- Table 2: basically contains a number of ICD diagnosis groups, for the sake of simplicity let us say we have three columns A, B, C each of which has rows representing a number of ICD diagnosis codes, here is a simplification:

Group A

Group B

Group C

A12

A21

C13

A25

A33

J12

H12

G1

A1

K12

K77

L5

G1

J12

A12

D3

A25

G3

What I would like to do is to create a third table which is essentially Table 1 plus three more columns each for group A, B and C. The code would then go through every row and see if the ICD diagnosis in table 1 fits any of the diagnoses in Groups A, B and C and if so it would fill the corresponding cell with 1 otherwise 0. Here is the final table just to explain:

Patient

Diagnosis

Group A

Group B

Group C

JOHN

A1

0

0

1

ALEX

A12

1

0

1

ZAK

K12

1

0

0

RICHARD

A25

1

1

0

SAM

G3

0

0

1

MANNY

H2

0

0

0

Could anyone please suggest how this can be done? The tables in reality are of course much much larger

Best regards


Accepted Solutions
Solution
‎06-09-2015 01:34 PM
Regular Contributor
Posts: 163

Re: Categorising a table based on contents of columns in a second table

Ok, I think I did solve it at the end

This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on a.diagnosis=b.groupA;

quit;

3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

then repeating this for Group B and Group C

Any comments or suggestions to this solution?

Kind regards

View solution in original post


All Replies
Super User
Super User
Posts: 7,565

Re: Categorising a table based on contents of columns in a second table

This code below gives you a normalised dataset which can be merged onto your original data to get group information.  As to your required output having the 1's and 0's, this is an exact duplicate of the question here:

As there, I will say it again, the separate columns - groupa, groupb etc. do not add anything to data other than size, so why do you need it, by processing is far easier and quicker.

data group;

  groupa="A12"; groupb="A21"; groupc="C13"; output;

  groupa="A25"; groupb="A33"; groupc="J12"; output;

run;

proc transpose data=group out=t_group;

  by group:;

  var group:;

run;

Regular Contributor
Posts: 163

Re: Categorising a table based on contents of columns in a second table

Thank you RW9, though not sure this is the best solution as both tables are rather large. I would prefer if there can be a function to compare contentd

Thanks again

Best regards

Super User
Super User
Posts: 7,565

Re: Categorising a table based on contents of columns in a second table

Well, there is vname.  However what you will end up with is lots of code to add no value.  Merging is pretty quick, all SAS functions and SQL functions have by/class processing and that is the faster way to perform calculations/functions.  So why bother going to all the hassel of creating many columns, then needing to keep lists of those columns to do things with them etc. when you could just create a normalised table and by group it?

Regular Contributor
Posts: 163

Re: Categorising a table based on contents of columns in a second table

Thanks again RW9

As mentioned above, Table 2 ( the diagnoses) is 8 groups x 40 different diagnoses, and during the course of the analysis we might need to add/remove some diagnoses. It is easier to do that through a table then edit there instead of a long code to include all these diagnoses or am I mistaken?

Best regards

Super Contributor
Posts: 273

Re: Categorising a table based on contents of columns in a second table

I  am not sure that a hash solution will meet the load of your data set. Ignore this if it is difficult.

Here is the approach:

Store Diagnosis of each Group(A,B,C) into a hash table while reading the second data set(TWO).

Then for each Diagnosis of first(ONE), check its presence in each hash table, and if present, flag it to one(1).

data one;

input Patient $ Diagnosis $;

datalines;

JOHN    A1

ALEX    A12

ZAK     K12

RICHARD A25

SAM     G3

MANNY   H2

;

run;

data two;

input A$ B$ C$;

datalines;

A12 A21 C13

A25 A33 J12

H12  G1  A1

K12 K77  L5

G1  J12  A12

D3  A25   G3

;

run;

data want;

   if _n_ = 1 then do;

      if 0 then set two;

      declare hash ha();

      ha.definekey('A');

      ha.definedone();

      declare hash hb();

      hb.definekey('B');

      hb.definedone();

      declare hash hc();

      hc.definekey('C');

      hc.definedone();

      do until (eof);

         set two end = eof;

         ha.add();

         hb.add();

         hc.add();

      end;

   end;

   set one;

   array k

  • ga gb gc;
  •    do i = 1 to dim(k);

          k = 0;

       end;

       A = Diagnosis;

       if ha.find() = 0 then k[1] = 1;

       B = Diagnosis;

       if hb.find() = 0 then k[2] = 1;

       C = Diagnosis;

       if hc.find() = 0 then k[3] = 1;

    keep Patient Diagnosis ga gb gc;

    run;

    proc print data = want;

    run;

    Regular Contributor
    Posts: 163

    Re: Categorising a table based on contents of columns in a second table

    Thank you all,

    Here is some update:

    Table 1 contains 1.25 million observations Smiley Sad

    Table 2 contains 8 groups and 40 different ICD diagnoses per row Smiley Sad

    I am not sure the solution suggested would manage the data load Smiley Sad Smiley Sad

    Any other suggestions?

    Thanks again

    Super User
    Posts: 10,866

    Re: Categorising a table based on contents of columns in a second table

    I would be tempted to use Table 2 to create custom format or informat to create the coded values. If the coded values are to be numeric then something like this might be a starting point.

    proc format;

    invalue groupa

    'A12','A25','H12','K12','G1','D3' = 1

    other=0;

    run;

    data want;

         set have;

         groupacode= input(diagnosis,groupa.);

    run;

    You would likely want to make a CNTLIN dataset from Table 2 and save the resulting formats in a permanent library.

    Super Contributor
    Posts: 273

    Re: Categorising a table based on contents of columns in a second table

    The 1.25 million data set(ONE in my example) is not an issue because it is a DRIVER. Only the data from TWO, that too, only Diagnosis will be stored in the hash table.

    I believe the solution can be tried with some sampled data.

    Regular Contributor
    Posts: 163

    Re: Categorising a table based on contents of columns in a second table

    Thank you datasp, will definitely try it

    Do you think there are advantages to using your solution over proc sql as I suggested?

    Thank you again

    Regular Contributor
    Posts: 163

    Re: Categorising a table based on contents of columns in a second table

    So to update

    I did try datasp solution and it worked briliantly too

    One further detail to complicate things:

    Some if the ICD diagnosis in the patient table had an additional fourth digit:

    example: some had a diagnosis code of A219, this is actually a subgroup of A21

    in my solution, I bypassed this by substr:

    proc sql;

    create table want as

    select a.*, b.flag

    from patients as a

    left join groups as b

    on substr(a.diagnosis,1,3)=b.groupA;

    quit;

    Can it be done too in  your solution datasp? It wouldnt work in the "format" type of solusion

    Thanks again

    Super Contributor
    Posts: 273

    Re: Categorising a table based on contents of columns in a second table

    Additional 4-th digit is not an issue. At present the use of $ sign is by default takes 8 characters. Even if it is longer than 8, use the appropriate width to $. No need

    to substring it.

    Solution
    ‎06-09-2015 01:34 PM
    Regular Contributor
    Posts: 163

    Re: Categorising a table based on contents of columns in a second table

    Ok, I think I did solve it at the end

    This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

    1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

    2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

    proc sql;

    create table want as

    select a.*, b.flag

    from patients as a

    left join groups as b

    on a.diagnosis=b.groupA;

    quit;

    3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

    then repeating this for Group B and Group C

    Any comments or suggestions to this solution?

    Kind regards

    🔒 This topic is solved and locked.

    Need further help from the community? Please ask a new question.

    Discussion stats
    • 12 replies
    • 486 views
    • 7 likes
    • 4 in conversation