BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ammarhm
Lapis Lazuli | Level 10

Hi

I appreciate some help with this task, which I find a bit challenging

I have two tables:

1- Table 1: Containing patients and their diagnosis coded according to ICD code system (the table has several thousand rows in reality)

Patient

Diagnosis

JOHN

A1

ALEX

A12

ZAK

K12

RICHARD

A25

SAM

G3

MANNY

H2

2- Table 2: basically contains a number of ICD diagnosis groups, for the sake of simplicity let us say we have three columns A, B, C each of which has rows representing a number of ICD diagnosis codes, here is a simplification:

Group A

Group B

Group C

A12

A21

C13

A25

A33

J12

H12

G1

A1

K12

K77

L5

G1

J12

A12

D3

A25

G3

What I would like to do is to create a third table which is essentially Table 1 plus three more columns each for group A, B and C. The code would then go through every row and see if the ICD diagnosis in table 1 fits any of the diagnoses in Groups A, B and C and if so it would fill the corresponding cell with 1 otherwise 0. Here is the final table just to explain:

Patient

Diagnosis

Group A

Group B

Group C

JOHN

A1

0

0

1

ALEX

A12

1

0

1

ZAK

K12

1

0

0

RICHARD

A25

1

1

0

SAM

G3

0

0

1

MANNY

H2

0

0

0

Could anyone please suggest how this can be done? The tables in reality are of course much much larger

Best regards

1 ACCEPTED SOLUTION

Accepted Solutions
ammarhm
Lapis Lazuli | Level 10

Ok, I think I did solve it at the end

This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

proc sql;

create table want as

select a.*, b.flag

from patients as a

left join groups as b

on a.diagnosis=b.groupA;

quit;

3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

then repeating this for Group B and Group C

Any comments or suggestions to this solution?

Kind regards

View solution in original post

12 REPLIES 12
RW9
Diamond | Level 26 RW9
Diamond | Level 26

This code below gives you a normalised dataset which can be merged onto your original data to get group information.  As to your required output having the 1's and 0's, this is an exact duplicate of the question here:

As there, I will say it again, the separate columns - groupa, groupb etc. do not add anything to data other than size, so why do you need it, by processing is far easier and quicker.

data group;

  groupa="A12"; groupb="A21"; groupc="C13"; output;

  groupa="A25"; groupb="A33"; groupc="J12"; output;

run;

proc transpose data=group out=t_group;

  by group:;

  var group:;

run;

ammarhm
Lapis Lazuli | Level 10

Thank you RW9, though not sure this is the best solution as both tables are rather large. I would prefer if there can be a function to compare contentd

Thanks again

Best regards

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, there is vname.  However what you will end up with is lots of code to add no value.  Merging is pretty quick, all SAS functions and SQL functions have by/class processing and that is the faster way to perform calculations/functions.  So why bother going to all the hassel of creating many columns, then needing to keep lists of those columns to do things with them etc. when you could just create a normalised table and by group it?

ammarhm
Lapis Lazuli | Level 10

Thanks again RW9

As mentioned above, Table 2 ( the diagnoses) is 8 groups x 40 different diagnoses, and during the course of the analysis we might need to add/remove some diagnoses. It is easier to do that through a table then edit there instead of a long code to include all these diagnoses or am I mistaken?

Best regards

KachiM
Rhodochrosite | Level 12

I  am not sure that a hash solution will meet the load of your data set. Ignore this if it is difficult.

Here is the approach:

Store Diagnosis of each Group(A,B,C) into a hash table while reading the second data set(TWO).

Then for each Diagnosis of first(ONE), check its presence in each hash table, and if present, flag it to one(1).

data one;

input Patient $ Diagnosis $;

datalines;

JOHN    A1

ALEX    A12

ZAK     K12

RICHARD A25

SAM     G3

MANNY   H2

;

run;

data two;

input A$ B$ C$;

datalines;

A12 A21 C13

A25 A33 J12

H12  G1  A1

K12 K77  L5

G1  J12  A12

D3  A25   G3

;

run;

data want;

   if _n_ = 1 then do;

      if 0 then set two;

      declare hash ha();

      ha.definekey('A');

      ha.definedone();

      declare hash hb();

      hb.definekey('B');

      hb.definedone();

      declare hash hc();

      hc.definekey('C');

      hc.definedone();

      do until (eof);

         set two end = eof;

         ha.add();

         hb.add();

         hc.add();

      end;

   end;

   set one;

   array k

  • ga gb gc;
  •    do i = 1 to dim(k);

          k = 0;

       end;

       A = Diagnosis;

       if ha.find() = 0 then k[1] = 1;

       B = Diagnosis;

       if hb.find() = 0 then k[2] = 1;

       C = Diagnosis;

       if hc.find() = 0 then k[3] = 1;

    keep Patient Diagnosis ga gb gc;

    run;

    proc print data = want;

    run;

    ammarhm
    Lapis Lazuli | Level 10

    Thank you all,

    Here is some update:

    Table 1 contains 1.25 million observations Smiley Sad

    Table 2 contains 8 groups and 40 different ICD diagnoses per row Smiley Sad

    I am not sure the solution suggested would manage the data load Smiley Sad Smiley Sad

    Any other suggestions?

    Thanks again

    ballardw
    Super User

    I would be tempted to use Table 2 to create custom format or informat to create the coded values. If the coded values are to be numeric then something like this might be a starting point.

    proc format;

    invalue groupa

    'A12','A25','H12','K12','G1','D3' = 1

    other=0;

    run;

    data want;

         set have;

         groupacode= input(diagnosis,groupa.);

    run;

    You would likely want to make a CNTLIN dataset from Table 2 and save the resulting formats in a permanent library.

    KachiM
    Rhodochrosite | Level 12

    The 1.25 million data set(ONE in my example) is not an issue because it is a DRIVER. Only the data from TWO, that too, only Diagnosis will be stored in the hash table.

    I believe the solution can be tried with some sampled data.

    ammarhm
    Lapis Lazuli | Level 10

    Thank you datasp, will definitely try it

    Do you think there are advantages to using your solution over proc sql as I suggested?

    Thank you again

    ammarhm
    Lapis Lazuli | Level 10

    So to update

    I did try datasp solution and it worked briliantly too

    One further detail to complicate things:

    Some if the ICD diagnosis in the patient table had an additional fourth digit:

    example: some had a diagnosis code of A219, this is actually a subgroup of A21

    in my solution, I bypassed this by substr:

    proc sql;

    create table want as

    select a.*, b.flag

    from patients as a

    left join groups as b

    on substr(a.diagnosis,1,3)=b.groupA;

    quit;

    Can it be done too in  your solution datasp? It wouldnt work in the "format" type of solusion

    Thanks again

    KachiM
    Rhodochrosite | Level 12

    Additional 4-th digit is not an issue. At present the use of $ sign is by default takes 8 characters. Even if it is longer than 8, use the appropriate width to $. No need

    to substring it.

    ammarhm
    Lapis Lazuli | Level 10

    Ok, I think I did solve it at the end

    This is how I did it, might not be the best solution but it did work on my 1.25 million records and rather quickly

    1- Added a fourth column (flag) to table 2, filling all the cells with a value of "1"

    2- Used proc sql and a left join to lookup the contents of Group A from table 2 in Diagnosis in table 1, if a match if found the "1" from the additional column in table 2 is included in the new table

    proc sql;

    create table want as

    select a.*, b.flag

    from patients as a

    left join groups as b

    on a.diagnosis=b.groupA;

    quit;

    3- Final step: using "coalesce" to replace missing value in the "flag" column in the "want" table (ie non existing match) with "0"

    then repeating this for Group B and Group C

    Any comments or suggestions to this solution?

    Kind regards

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    What is ANOVA?

    ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

    Find more tutorials on the SAS Users YouTube channel.

    Discussion stats
    • 12 replies
    • 1806 views
    • 7 likes
    • 4 in conversation