topic Re: working with data from two different tables at the same time to categorise death codes. in SAS Programming

working with data from two different tables at the same time to categorise death codes.

ronnie_h — Fri, 16 Aug 2024 12:52:17 GMT

I have a table called 'subjects' that has 86k observations - each observation represents a single individual - there is a variable called ICD_code e.g. 1244 which holds the icd code for their cause of death if they are dead or its left empty if the person is still alive.

I have another table called diseases which is 70 observations long - each observation represents a disease defined by a set of pairs of icd_codes

e.g.

'lung cancer' 3, 1237 7584 , 6854 7845, 4579 5879 ...

the first variable is the name of the disease grouping - lung cancer

the next is a number which is an index that gives the number of pairs of icd_codes that define the disease grouping.

Then i have a number of icd_code pairs each of which represent a range - each value is in a separate variable e.g. L1, U1, L2, U2, l3, u3 ...

I have created 70 new variables v1 - v70 in the subject table one for each disease grouping and set them all to zero.

What i need to do now is for each subject who is dead and has an icd_code - i need to check their ICD_code to see which of the 70 groupings contains that code and then set the value of the appropriate varable v* to 1.

So in my example of lung cancer because the icd_code value 1244 IS in the first pair of codes (i.e. its GE 1237 and LE 7584) I want to change the zero in the relevant V variable to a 1 to recognise this.

I can't work out how to do this! - i used to be a fortran programmer and in that language i coudl have done it - but i can;t seem to work out in SAS how to use data from two different tables at the same time - i can't join them together as they are not the same types of data.

Any suggestions greatly appreciated.

Re: working with data from two different tables at the same time to categorise death codes.

ballardw — Fri, 16 Aug 2024 14:15:42 GMT

There are going to be some details. In your Subjects data set what are the properties of the variable ICD_code, as in type, numeric or character, and if character the defined length.

This will be needed to correctly parse out details of your other dataset. If you can't provide actual example data then copy the text used to create the data and paste it into a text box opened on this forum with using the </> icon that appears above the main message windows.

Question: what are you really going to do with those 70 variables, especially when next week/month/report cycle someone adds another 1/3/17 diseases of interest?

Depending on what your actual ICD_codes look like it may be easier to get the disease directly into one variable instead of managing so many indicator variables.

SAS has multiple different look up tools: Formats, Proc SQL (which is most definitely not FORTRAN in any form) Hash objects or even a function (though I wouldn't start down that path unless everything is very static).

FWIW, the last FORTRAN program I wrote is used SAS to prototype the code because the IO was cleaner and the logic involved a bunch of date manipulation that was easier to test the logic in SAS before translating to FORTRAN.

Re: working with data from two different tables at the same time to categorise death codes.

Tom — Fri, 16 Aug 2024 14:17:37 GMT

Convert your second dataset into a FORMAT that converts ICDCODE into DISEASE_GROUPING. Then use the format to classify your larger dataset.

The first thing you will want to do is convert your wide structure into a tall structure so each disease/ICD pairing is on a separate observation. While you are doing that make a dataset that the PROC IMPORT can understand.

Let's assume that DISEASES looks like this with numeric ICD codes.

data diseases ;
  length disease $30 count l1-l5 u1-u5 8;
  array lower l1-l5;
  array upper u1-u5;
  infile cards dsd truncover;
  input disease count @;
  do count=1 to count;
    input lower[count] upper[count] @;
  end;
cards;
'lung cancer', 3, 1237 ,1245 , 6854 ,7845, 4579 ,5879 
;;;;

Then you would want to make a numeric format. So you could call it ICDGROUP perhaps? So something like this:

data format;
  set diseases end=eof;
  fmtname='ICDGROUP';
  array lower l1-l5;
  array upper u1-u5;
  label=disease;
  hlo=' ';
  do count=1 to count;
    start=lower[count];
    end=upper[count];
    output;
  end;
  if eof then do;
    call missing(start,end);
    label='UNKNOWN';
    hlo='O';
    output;
  end;
run;
proc format cntlin=format;
run;

Now let's make a SUBJECTS file.

data subjects ;
  input subjid icd_code;
cards;
1 1238
2 .
3 6850
4 6900
5 4580
;

And use the ICDGROUP format to make a DISEASE variable.

data want;
 set subjects;
 if icd_code then disease=put(icd_code,icdgroup.);
run;

Result

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Fri, 16 Aug 2024 14:54:20 GMT

i think i can follow that - i'll give it a try next week.

many thanks!

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Tue, 20 Aug 2024 09:59:17 GMT

hi - thanks for your help last week - i;ve been trying it out this morning but it didn;t quite work.

It gets to to the bit where it runs proc format but then complains i have repeated or overlapping ranges.

it is true that codes do occur in multiple disease groupings but not within a single disease group.

Have i missunderstood something - again!

i've enclosed the files so you can see

Many thanks

Re: working with data from two different tables at the same time to categorise death codes.

Tom — Tue, 20 Aug 2024 12:53:58 GMT

If you have multiple decodes for the same code then you need to make a multi-label format.

But I am not sure how you can do your original problem if one disease code could be included in two or more different disease groupings. Which disease grouping would you want to pick?

But you could convert your disease groupings into separate formats for each disease group. Which you could then use to detect if the a particular code indicates the presence of that disease group.

proc format;
value diseaseA 1-5 = 'YES' other='NO';
value diseaseB 3-7 = 'YES' other='NO';
run;

data want;
  set have;
  diseaseA = put(code,diseasea.);
  diseaseB = put(code,diseaseb.);
run;

Re: working with data from two different tables at the same time to categorise death codes.

ballardw — Tue, 20 Aug 2024 18:06:03 GMT

Your diseases file also has duplicate entries. For example Lung_cancer with range of 1620 1639 is repeated.

With so much duplication of codes maybe a return to your first approach is reasonable.

Maybe this is helpful. I made a small dummy set of just person and Icd_code as an example. Note this reads the code into a slightly different form (long beats wide most times).

data work.diseases;
  length disease $30 count 4 ;
  infile "x:\deaths_short.txt" delimiter=' ' dsd;
  input disease;
	input count;
 	do count2=1 to count;
  	  input lower upper;
     output;
  	end;
   drop count count2;
run;


data work.deaths;
   input personid icd_code;
datalines;
1  10
2  125
3  .
4  1622
5  2240
;
 
proc sql;
   create table work.cod as
   select a.*,b.disease,1 as dummy
   from work.deaths as a
        ,
        (select distinct disease, lower, upper from work.diseases) as b
   where b.lower le a.icd_code le b.upper
   order by a.personid,a.icd_code,b.disease
   ;
quit;

proc transpose data=work.cod 
               out=work.cause_of_death (drop=_name_)
;
   by personid icd_code;
   id disease;
   var dummy;
run;

I didn't see any clear description of how the "alive" people were treated. The above drops them at the work.cod step as they don't have a code in range.

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Wed, 21 Aug 2024 08:20:23 GMT

Hi thanks for this! - it was only yesterday that i realised lung cancer occurerd twice - it shouldn't have but i don;t think that should have been a show stopper. i will have to check the 'deaths_long'; file for duplicates - that has 170 groupings but obviously i want to get this working for the smaller file first.

i will give this a go today.

In answer to your last question i will be merging the file of dead people with an even bigger file (300K) of alive people who will have zero for each of the death grouping variables.

I still wish i could do this all in fortran!

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Wed, 21 Aug 2024 10:33:48 GMT

HI, i tried your suggestion and its getting closer but its not dealing with disease groups where there are more than one pair of codes - as while it must be reading all the pairs its only retaining the last pair in group.

e.g. in the diseases table there is only one line for all_mailignancies _ex_nmsc and that has codes 1740 and 2089.

'All_malignancies_ex_NMSC'
2
1400 1729
1740 2089

Can the final cause_of_death table have all of the disease groupings as columns in it - not just the ones where the persons have died of a code in that disease group? - so i can set those to zero?

That way it will be easy to merger into a single table these records with the other table of alive poeple who will have zeros in all the disease groups that i have already set up.

Thanks again for your invaluable help.

Re: working with data from two different tables at the same time to categorise death codes.

ballardw — Wed, 21 Aug 2024 14:20:28 GMT

@ronnie_h wrote:

HI, i tried your suggestion and its getting closer but its not dealing with disease groups where there are more than one pair of codes - as while it must be reading all the pairs its only retaining the last pair in group.

e.g. in the diseases table there is only one line for all_mailignancies _ex_nmsc and that has codes 1740 and 2089.

'All_malignancies_ex_NMSC'
2
1400 1729
1740 2089

Can the final cause_of_death table have all of the disease groupings as columns in it - not just the ones where the persons have died of a code in that disease group? - so i can set those to zero?

That way it will be easy to merger into a single table these records with the other table of alive poeple who will have zeros in all the disease groups that i have already set up.

Thanks again for your invaluable help.

Please show the code that YOU submitted, with the log, that only shows the "last" of a pairing.

when I ran my code with your short text file I had multiple ranges. All_cancers_ex_lung_leuk for example has 5 ranges in my work.diseases

Here are the first 20 observations from my work.diseases:

Obs    disease                     lower    upper

  1    All_causes                     10     9999
  2    All_malignancies             1400     2089
  3    All_malignancies_ex_NMSC     1400     1729
  4    All_malignancies_ex_NMSC     1740     2089
  5    All_cancers                  1400     2399
  6    All_cancers_ex_lung_leuk     1400     1619
  7    All_cancers_ex_lung_leuk     1640     2023
  8    All_cancers_ex_lung_leuk     2025     2030
  9    All_cancers_ex_lung_leuk     2032     2039
 10    All_cancers_ex_lung_leuk     2090     2399
 11    Lung_cancer                  1620     1639
 12    Breast_cancer                1740     1749
 13    Ovarian_cancer               1830     1839
 14    Uterine_cancer               1790     1829
 15    All_cancers_ex_leuk          1400     2023
 16    All_cancers_ex_leuk          2025     2030
 17    All_cancers_ex_leuk          2032     2039
 18    All_cancers_ex_leuk          2090     2399
 19    Mal_Neo_ex_leuk              1400     2023
 20    Mal_Neo_ex_leuk              2025     2030

Note: Showing inputs without actual output does not help with diagnosing a problem. For example, look very closely at your code. Do you have the OUTPUT statement inside the DO loop for Count2? Your description would match not having that statement.

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Wed, 21 Aug 2024 15:00:22 GMT

yes - you were right i had lost the 'output' statement - how did I miss that!

it now works and gives the output in the 'cause of death' file below.

Can i ask one more thing?

The reason i'm doing this is to prepare the data to go into another program that does poisson regression modelling to relate disease risk to radiation dose.

for that my table of data has to have a single line for each person - as it does in the current outout but i need all of the diseases groups across the top with a 1 in the column if the person died with an icd code in that disease group and a zero if they didn't.

i don;t need the actual disease name as the titles - just a name to indicate the order d1 - dn is fine as my analysis code won;t accept the varables as long names.

the crutial thing is that the disease variables really need to be in the same order as in the diseases file

Is that a difficult thing to do?

many thanks!

data diseases;
  length disease $30 count 4;
  infile "&sas_dir.\deaths_short.txt" delimiter=' ' dsd;
  input disease;
input count;
 	do count2=1 to count;
  	  input lower upper;
	  output;
  	end;
	drop count count2;
run;



data work.deaths; 
  input personid icd_code;
datalines;
1 10
2 125
3 .
4 1622
5 2240
;
run;

proc sql;
   create table work.cod as
   select a.*,b.disease,1 as dummy
   from work.deaths as a
        ,
        (select distinct disease, lower, upper from work.diseases) as b
   where b.lower le a.icd_code le b.upper
   order by a.personid,a.icd_code,b.disease
   ;
quit;

proc transpose data=work.cod 
               out=work.cause_of_death (drop=_name_)
;
   by personid icd_code;
   id disease;
   var dummy;
run;

Re: working with data from two different tables at the same time to categorise death codes.

ballardw — Wed, 21 Aug 2024 16:07:18 GMT

Already showed how to place the name of the disease as a variable so just need to add a different variable that "counts". NOTE: Better fix the duplicate disease entries prior to reading them.

data work.diseases;
  length disease $30 count 4;
  infile "X:\deaths_short.txt" delimiter=' ' dsd;
  input disease;
  retain dnum 0;
input count;
   dnum+1;
   IDvar= cats("D",put(dnum,z3.));
 	do count2=1 to count;
  	  input lower upper;
	  output;
  	end;
	drop count count2;
run;
data work.deaths; 
  input personid icd_code;
datalines;
1 10
2 125
3 .
4 1622
5 2240
;
run;

proc sql;
   create table work.cod as
   select a.*,b.disease, b.idvar, 1 as dummy
   from work.deaths as a
        ,
        (select distinct idvar, disease, lower, upper from work.diseases) as b
   where b.lower le a.icd_code le b.upper
   order by a.personid,a.icd_code,b.idvar
   ;
quit;

proc transpose data=work.cod 
               out=work.cause_of_death (drop=_name_)
;
   by personid icd_code;
   id idvar;
   var dummy;
run;

The RETAIN in the first data step keeps values of a variable across the data step boundary, so is one way to create a serial counter. I create IDVAR with 3 digits so a text value can SORT if needed easily.

If you need a set with "all" the Dnnn variables that's easy: Replace the 999 with actual number of disease codes that are in the disease file (Exercise for the interested reader: you can get that from the disease data set and place into a macro variable to be used here). The array creates the d variables in order before the SET statement.

Then when the values are read from the set that is the order they have. Then use the array to set the missing values to 0 if that is needed.

data work.final;
   array d(*) D001 - D999; 
   set work.cause_of_death
   ;
   do i=1 to dim(d);
     if missing( d[i]) then d[i]=0;
  end;
  drop i;
run;

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Fri, 23 Aug 2024 12:37:26 GMT

that did it! - many thanks - i was able to run it successfully for the 28K people with version 9 of the ICD coding of their death cause now i need to do the same with the 38K people who have an ICD10 version which is alpha numeric - i.e. c456

but i hope i can do that on my own - at least i want to try first - or i will never learn.

Many thanks for your help!

Re: working with data from two different tables at the same time to categorise death codes.

ballardw — Fri, 23 Aug 2024 13:47:54 GMT

@ronnie_h wrote:

that did it! - many thanks - i was able to run it successfully for the 28K people with version 9 of the ICD coding of their death cause now i need to do the same with the 38K people who have an ICD10 version which is alpha numeric - i.e. c456

but i hope i can do that on my own - at least i want to try first - or i will never learn.

Many thanks for your help!

I have worked a very little with ICD codes. I do know there have a been just enough differences, ICD-10 greatly expanding some areas, that you may encounter meaning differences in Dxxx names between two different disease variables just following my code. If the text of the disease name categories don't change, though the ICD10 have additional ones, it may be worth doing the process for both files with the disease name and then taking a step to "standarize" the the variable names if that Dxxx is truly desirable.

With a known list of values, such as from the disease data and the numbered version it is quite easy to generate Proc Datasets code to rename the variables.

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Wed, 28 Aug 2024 14:55:41 GMT

hi - thanks in no small part to you i have have made good progress with my coding and have managed to sort the icd10 stuff out anbd am now dealing with the doses - which was going well but i have one small probelm i don;t seem to be able to solve without using brut force!

I am trying to generate a table of the individuals doses - i have it in a long table called work.all_doses (work.all_doses.xlsx)

the possible range of years is 1937 to 2022.

i have changed it to a wide table using the transpose procedure - all ok so far! (work.all_doses_trans.xlsx)

but now i need to rearrange it so that the dose columns are in year order from 1937 to 2022 across the table - ideally with variable names d1937 - d2022.

i could do the reordering by brut force using retain id '1937'n '1938'n ... but i'd prefer a better way! and i still need to change the column names to d1937 - d2022 which i can't figure out how to do in one go and finally i need to replace all the missing values by zeros.

also - just to complicate things its not garenteed that every year will be represented in the starting table but i need every year in the final table

i tried to adapt the code you showed me earlier but the problem of generating the array beat me.

any suggestions ?

many thanks

proc sort data=work.all_doses;
by id;
run;

proc transpose data=work.all_doses 
               out=work.all_doses_trans (drop=_name_);
   by id;
   id dose_yr;
   var dose_value;
run;

data work.all_doses_revised;
	  retain id '1937'n ; 
	set work.all_doses_trans; 
run;

Re: working with data from two different tables at the same time to categorise death codes.

ballardw — Wed, 28 Aug 2024 15:42:53 GMT

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

If you have a numeric variable that want as the suffix you are most of the way there.

Try:

proc transpose data=work.all_doses 
               out=work.all_doses_trans (drop=_name_)
       prefix=D
;
   by id;
   id dose_yr;
   var dose_value;
run;

That will start the value with the letter D and append the dose_yr.

I seldom worry about the actual order of values in a data set and use a reporting procedure to write the values to something like Excel as needed. But in this case, assuming all of your years are present in the data you can use something like

data work.all_doses_revised;
	length d1937-d2022 8;
	set work.all_doses_trans; 
run;

The list will create all of the years even if some years are missing in the data. Which would mean one or more variables with all missing values. If that is not desirable you could use sublists on the length statement such as if 1957 for some reason had no values

length d1937-d1956 d1958-d2022 8;

Re: working with data from two different tables at the same time to categorise death codes.

ronnie_h — Wed, 04 Sep 2024 13:33:06 GMT

Hi,

Thanks to your help i've managed to get all the issues with the ICD codes in my dataset sorted and i've now got a basic file to test the Poisson regression risk modelling with - which is really really good.

I'm now moving on to dealing with some issues regarding the data quality and i would appreciate your suggestions as to how to tackle them as what i've learned so far doesn;t seem to give me any ideas.

Each of my 179Kish people have at least one annual dose measurement - sometimes many years of doses (these are radiation doses) with employers who participate in the study.

However, an individual may also have received doses with other employers not in the study prior to being employed by a participating employer.

For those people I don't know their annual doses prior to joing the study only the total dose received up until the year they joined the study - I call this the transfer dose.

below is an example of a typical person with a transfer dose - i have placed the transfer dose in the year prior to the first 'proper' dose and made it negative to identify it as such a dose.

e.g.
ID year Dose
8 1975 -11.0
8 1976 0.24
8 1977 0.57
8 1978 0.24
8 1979 0.13
8 1980 0.24
8 1981 0.13
8 1982 0.9
8 1983 0.79
8 1984 0.02

The problem is that if i use this persons data as it is then their initial dose rate is way to high and this will mess up the modelling.

What i want to do is create some years of simulated doses to better approximate what i expect this persons doses prior to 1976 to have been.

e.g.

year dose
1964 1.0
1965 1.0
1966 1.0
---
1975 1.0 - this being the year the trasfer dose was recorded

so that the simulated doses sum to the transfer dose value.

I have several criteria to use to decide how to spread the transfer dose over an appropriate number of years with appropriate values.

The main one is to calculate the average of the 'real' doses and then divide the transfer dose by that value to get a number of years (n). I would then record the average dose in the n years prior to the first real dose.

However, i do need to check if this means assigning simulated doses to years when they were less than 18 years of age - which would not be plausible and i would exclude the person.

i have created a table which i hope can be the basis of the calculations i need called work.correct_doses - see attached

Here is what think i want/need to do:

1) work through the table by ID until I find someone with a transfer dose - i.e. a negative dose value
2) read all their dose information - which is a variable number of lines.
3) calculate the average of the 'real' dose information
4) calculate how many year (n) i need to accumulate the transfer dose at the average real dose rate.
5) check if that number of years would take start of exposure before age 18
ie. is 'age at 1st real dose - n' < 18
6) if yes - mark the ID for exclusion
7) if no - create the required number of extra dose years and add them to the 'real' years of doses in the table.

😎 return to point 1 until end of file.

There are just over 7K poeple with transfer doses and the full dose table is 1.7million lines - so its a bit of a big job. i attach a smaller sample of the data!

So any suggestions are very welcomed - if only i could do this in fortran it would be done by tea time! SAS is so much more complicated.