working with data from two different tables at the same time to catego...

ronnie_h · Posted 08-16-2024 08:52 AM

I have a table called 'subjects' that has 86k observations - each observation represents a single individual - there is a variable called ICD_code e.g. 1244 which holds the icd code for their cause of death if they are dead or its left empty if the person is still alive.

I have another table called diseases which is 70 observations long - each observation represents a disease defined by a set of pairs of icd_codes

e.g.

'lung cancer' 3, 1237 7584 , 6854 7845, 4579 5879 ...

the first variable is the name of the disease grouping - lung cancer

the next is a number which is an index that gives the number of pairs of icd_codes that define the disease grouping.

Then i have a number of icd_code pairs each of which represent a range - each value is in a separate variable e.g. L1, U1, L2, U2, l3, u3 ...

I have created 70 new variables v1 - v70 in the subject table one for each disease grouping and set them all to zero.

What i need to do now is for each subject who is dead and has an icd_code - i need to check their ICD_code to see which of the 70 groupings contains that code and then set the value of the appropriate varable v* to 1.

So in my example of lung cancer because the icd_code value 1244 IS in the first pair of codes (i.e. its GE 1237 and LE 7584) I want to change the zero in the relevant V variable to a 1 to recognise this.

I can't work out how to do this! - i used to be a fortran programmer and in that language i coudl have done it - but i can;t seem to work out in SAS how to use data from two different tables at the same time - i can't join them together as they are not the same types of data.

Any suggestions greatly appreciated.

ballardw · Posted 08-16-2024 10:15 AM

There are going to be some details. In your Subjects data set what are the properties of the variable ICD_code, as in type, numeric or character, and if character the defined length.

This will be needed to correctly parse out details of your other dataset. If you can't provide actual example data then copy the text used to create the data and paste it into a text box opened on this forum with using the </> icon that appears above the main message windows.

Question: what are you really going to do with those 70 variables, especially when next week/month/report cycle someone adds another 1/3/17 diseases of interest?

Depending on what your actual ICD_codes look like it may be easier to get the disease directly into one variable instead of managing so many indicator variables.

SAS has multiple different look up tools: Formats, Proc SQL (which is most definitely not FORTRAN in any form) Hash objects or even a function (though I wouldn't start down that path unless everything is very static).

FWIW, the last FORTRAN program I wrote is used SAS to prototype the code because the IO was cleaner and the logic involved a bunch of date manipulation that was easier to test the logic in SAS before translating to FORTRAN.

Tom · Posted 08-16-2024 10:17 AM

Convert your second dataset into a FORMAT that converts ICDCODE into DISEASE_GROUPING. Then use the format to classify your larger dataset.

The first thing you will want to do is convert your wide structure into a tall structure so each disease/ICD pairing is on a separate observation. While you are doing that make a dataset that the PROC IMPORT can understand.

Let's assume that DISEASES looks like this with numeric ICD codes.

data diseases ;
  length disease $30 count l1-l5 u1-u5 8;
  array lower l1-l5;
  array upper u1-u5;
  infile cards dsd truncover;
  input disease count @;
  do count=1 to count;
    input lower[count] upper[count] @;
  end;
cards;
'lung cancer', 3, 1237 ,1245 , 6854 ,7845, 4579 ,5879 
;;;;

Then you would want to make a numeric format. So you could call it ICDGROUP perhaps? So something like this:

data format;
  set diseases end=eof;
  fmtname='ICDGROUP';
  array lower l1-l5;
  array upper u1-u5;
  label=disease;
  hlo=' ';
  do count=1 to count;
    start=lower[count];
    end=upper[count];
    output;
  end;
  if eof then do;
    call missing(start,end);
    label='UNKNOWN';
    hlo='O';
    output;
  end;
run;
proc format cntlin=format;
run;

Now let's make a SUBJECTS file.

data subjects ;
  input subjid icd_code;
cards;
1 1238
2 .
3 6850
4 6900
5 4580
;

And use the ICDGROUP format to make a DISEASE variable.

data want;
 set subjects;
 if icd_code then disease=put(icd_code,icdgroup.);
run;

Result

ronnie_h · Posted 08-16-2024 10:54 AM

i think i can follow that - i'll give it a try next week.

many thanks!

ronnie_h · Posted 08-20-2024 05:59 AM

hi - thanks for your help last week - i;ve been trying it out this morning but it didn;t quite work.

It gets to to the bit where it runs proc format but then complains i have repeated or overlapping ranges.

it is true that codes do occur in multiple disease groupings but not within a single disease group.

Have i missunderstood something - again!

i've enclosed the files so you can see

Many thanks

Tom · Posted 08-20-2024 08:53 AM

If you have multiple decodes for the same code then you need to make a multi-label format.

But I am not sure how you can do your original problem if one disease code could be included in two or more different disease groupings. Which disease grouping would you want to pick?

But you could convert your disease groupings into separate formats for each disease group. Which you could then use to detect if the a particular code indicates the presence of that disease group.

proc format;
value diseaseA 1-5 = 'YES' other='NO';
value diseaseB 3-7 = 'YES' other='NO';
run;

data want;
  set have;
  diseaseA = put(code,diseasea.);
  diseaseB = put(code,diseaseb.);
run;

ballardw · Posted 08-20-2024 02:06 PM

Your diseases file also has duplicate entries. For example Lung_cancer with range of 1620 1639 is repeated.

With so much duplication of codes maybe a return to your first approach is reasonable.

Maybe this is helpful. I made a small dummy set of just person and Icd_code as an example. Note this reads the code into a slightly different form (long beats wide most times).

data work.diseases;
  length disease $30 count 4 ;
  infile "x:\deaths_short.txt" delimiter=' ' dsd;
  input disease;
	input count;
 	do count2=1 to count;
  	  input lower upper;
     output;
  	end;
   drop count count2;
run;


data work.deaths;
   input personid icd_code;
datalines;
1  10
2  125
3  .
4  1622
5  2240
;
 
proc sql;
   create table work.cod as
   select a.*,b.disease,1 as dummy
   from work.deaths as a
        ,
        (select distinct disease, lower, upper from work.diseases) as b
   where b.lower le a.icd_code le b.upper
   order by a.personid,a.icd_code,b.disease
   ;
quit;

proc transpose data=work.cod 
               out=work.cause_of_death (drop=_name_)
;
   by personid icd_code;
   id disease;
   var dummy;
run;

I didn't see any clear description of how the "alive" people were treated. The above drops them at the work.cod step as they don't have a code in range.

ronnie_h · Posted 08-21-2024 04:20 AM

Hi thanks for this! - it was only yesterday that i realised lung cancer occurerd twice - it shouldn't have but i don;t think that should have been a show stopper. i will have to check the 'deaths_long'; file for duplicates - that has 170 groupings but obviously i want to get this working for the smaller file first.

i will give this a go today.

In answer to your last question i will be merging the file of dead people with an even bigger file (300K) of alive people who will have zero for each of the death grouping variables.

I still wish i could do this all in fortran!

ronnie_h · Posted 08-21-2024 06:33 AM

HI, i tried your suggestion and its getting closer but its not dealing with disease groups where there are more than one pair of codes - as while it must be reading all the pairs its only retaining the last pair in group.

e.g. in the diseases table there is only one line for all_mailignancies _ex_nmsc and that has codes 1740 and 2089.

'All_malignancies_ex_NMSC'
2
1400 1729
1740 2089

Can the final cause_of_death table have all of the disease groupings as columns in it - not just the ones where the persons have died of a code in that disease group? - so i can set those to zero?

That way it will be easy to merger into a single table these records with the other table of alive poeple who will have zeros in all the disease groups that i have already set up.

Thanks again for your invaluable help.

ballardw · Posted 08-21-2024 10:20 AM

@ronnie_h wrote:

HI, i tried your suggestion and its getting closer but its not dealing with disease groups where there are more than one pair of codes - as while it must be reading all the pairs its only retaining the last pair in group.

e.g. in the diseases table there is only one line for all_mailignancies _ex_nmsc and that has codes 1740 and 2089.

'All_malignancies_ex_NMSC'
2
1400 1729
1740 2089

Can the final cause_of_death table have all of the disease groupings as columns in it - not just the ones where the persons have died of a code in that disease group? - so i can set those to zero?

That way it will be easy to merger into a single table these records with the other table of alive poeple who will have zeros in all the disease groups that i have already set up.

Thanks again for your invaluable help.

Please show the code that YOU submitted, with the log, that only shows the "last" of a pairing.

when I ran my code with your short text file I had multiple ranges. All_cancers_ex_lung_leuk for example has 5 ranges in my work.diseases

Here are the first 20 observations from my work.diseases:

Obs    disease                     lower    upper

  1    All_causes                     10     9999
  2    All_malignancies             1400     2089
  3    All_malignancies_ex_NMSC     1400     1729
  4    All_malignancies_ex_NMSC     1740     2089
  5    All_cancers                  1400     2399
  6    All_cancers_ex_lung_leuk     1400     1619
  7    All_cancers_ex_lung_leuk     1640     2023
  8    All_cancers_ex_lung_leuk     2025     2030
  9    All_cancers_ex_lung_leuk     2032     2039
 10    All_cancers_ex_lung_leuk     2090     2399
 11    Lung_cancer                  1620     1639
 12    Breast_cancer                1740     1749
 13    Ovarian_cancer               1830     1839
 14    Uterine_cancer               1790     1829
 15    All_cancers_ex_leuk          1400     2023
 16    All_cancers_ex_leuk          2025     2030
 17    All_cancers_ex_leuk          2032     2039
 18    All_cancers_ex_leuk          2090     2399
 19    Mal_Neo_ex_leuk              1400     2023
 20    Mal_Neo_ex_leuk              2025     2030

Note: Showing inputs without actual output does not help with diagnosing a problem. For example, look very closely at your code. Do you have the OUTPUT statement inside the DO loop for Count2? Your description would match not having that statement.

ronnie_h · Posted 08-21-2024 11:00 AM

yes - you were right i had lost the 'output' statement - how did I miss that!

it now works and gives the output in the 'cause of death' file below.

Can i ask one more thing?

The reason i'm doing this is to prepare the data to go into another program that does poisson regression modelling to relate disease risk to radiation dose.

for that my table of data has to have a single line for each person - as it does in the current outout but i need all of the diseases groups across the top with a 1 in the column if the person died with an icd code in that disease group and a zero if they didn't.

i don;t need the actual disease name as the titles - just a name to indicate the order d1 - dn is fine as my analysis code won;t accept the varables as long names.

the crutial thing is that the disease variables really need to be in the same order as in the diseases file

Is that a difficult thing to do?

many thanks!

data diseases;
  length disease $30 count 4;
  infile "&sas_dir.\deaths_short.txt" delimiter=' ' dsd;
  input disease;
input count;
 	do count2=1 to count;
  	  input lower upper;
	  output;
  	end;
	drop count count2;
run;



data work.deaths; 
  input personid icd_code;
datalines;
1 10
2 125
3 .
4 1622
5 2240
;
run;

proc sql;
   create table work.cod as
   select a.*,b.disease,1 as dummy
   from work.deaths as a
        ,
        (select distinct disease, lower, upper from work.diseases) as b
   where b.lower le a.icd_code le b.upper
   order by a.personid,a.icd_code,b.disease
   ;
quit;

proc transpose data=work.cod 
               out=work.cause_of_death (drop=_name_)
;
   by personid icd_code;
   id disease;
   var dummy;
run;

ballardw · Posted 08-21-2024 12:07 PM

Already showed how to place the name of the disease as a variable so just need to add a different variable that "counts". NOTE: Better fix the duplicate disease entries prior to reading them.

data work.diseases;
  length disease $30 count 4;
  infile "X:\deaths_short.txt" delimiter=' ' dsd;
  input disease;
  retain dnum 0;
input count;
   dnum+1;
   IDvar= cats("D",put(dnum,z3.));
 	do count2=1 to count;
  	  input lower upper;
	  output;
  	end;
	drop count count2;
run;
data work.deaths; 
  input personid icd_code;
datalines;
1 10
2 125
3 .
4 1622
5 2240
;
run;

proc sql;
   create table work.cod as
   select a.*,b.disease, b.idvar, 1 as dummy
   from work.deaths as a
        ,
        (select distinct idvar, disease, lower, upper from work.diseases) as b
   where b.lower le a.icd_code le b.upper
   order by a.personid,a.icd_code,b.idvar
   ;
quit;

proc transpose data=work.cod 
               out=work.cause_of_death (drop=_name_)
;
   by personid icd_code;
   id idvar;
   var dummy;
run;

The RETAIN in the first data step keeps values of a variable across the data step boundary, so is one way to create a serial counter. I create IDVAR with 3 digits so a text value can SORT if needed easily.

If you need a set with "all" the Dnnn variables that's easy: Replace the 999 with actual number of disease codes that are in the disease file (Exercise for the interested reader: you can get that from the disease data set and place into a macro variable to be used here). The array creates the d variables in order before the SET statement.

Then when the values are read from the set that is the order they have. Then use the array to set the missing values to 0 if that is needed.

data work.final;
   array d(*) D001 - D999; 
   set work.cause_of_death
   ;
   do i=1 to dim(d);
     if missing( d[i]) then d[i]=0;
  end;
  drop i;
run;

ronnie_h · Posted 08-23-2024 08:37 AM

that did it! - many thanks - i was able to run it successfully for the 28K people with version 9 of the ICD coding of their death cause now i need to do the same with the 38K people who have an ICD10 version which is alpha numeric - i.e. c456

but i hope i can do that on my own - at least i want to try first - or i will never learn.

Many thanks for your help!

ballardw · Posted 08-23-2024 09:47 AM

@ronnie_h wrote:

that did it! - many thanks - i was able to run it successfully for the 28K people with version 9 of the ICD coding of their death cause now i need to do the same with the 38K people who have an ICD10 version which is alpha numeric - i.e. c456

but i hope i can do that on my own - at least i want to try first - or i will never learn.

Many thanks for your help!

I have worked a very little with ICD codes. I do know there have a been just enough differences, ICD-10 greatly expanding some areas, that you may encounter meaning differences in Dxxx names between two different disease variables just following my code. If the text of the disease name categories don't change, though the ICD10 have additional ones, it may be worth doing the process for both files with the disease name and then taking a step to "standarize" the the variable names if that Dxxx is truly desirable.

With a known list of values, such as from the disease data and the numbered version it is quite easy to generate Proc Datasets code to rename the variables.

ronnie_h · Posted 08-28-2024 10:55 AM

hi - thanks in no small part to you i have have made good progress with my coding and have managed to sort the icd10 stuff out anbd am now dealing with the doses - which was going well but i have one small probelm i don;t seem to be able to solve without using brut force!

I am trying to generate a table of the individuals doses - i have it in a long table called work.all_doses (work.all_doses.xlsx)

the possible range of years is 1937 to 2022.

i have changed it to a wide table using the transpose procedure - all ok so far! (work.all_doses_trans.xlsx)

but now i need to rearrange it so that the dose columns are in year order from 1937 to 2022 across the table - ideally with variable names d1937 - d2022.

i could do the reordering by brut force using retain id '1937'n '1938'n ... but i'd prefer a better way! and i still need to change the column names to d1937 - d2022 which i can't figure out how to do in one go and finally i need to replace all the missing values by zeros.

also - just to complicate things its not garenteed that every year will be represented in the starting table but i need every year in the final table

i tried to adapt the code you showed me earlier but the problem of generating the array beat me.

any suggestions ?

many thanks

proc sort data=work.all_doses;
by id;
run;

proc transpose data=work.all_doses 
               out=work.all_doses_trans (drop=_name_);
   by id;
   id dose_yr;
   var dose_value;
run;

data work.all_doses_revised;
	  retain id '1937'n ; 
	set work.all_doses_trans; 
run;

working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

Re: working with data from two different tables at the same time to categorise death codes.

SAS Innovate 2025: Call for Content

Classroom Training Available!