Hello everyone, long time lurker, first time poster here.
I have a medical claims database with multiple encounters (up to 1,000 or so) per individual. The data set has 24 character diagnosis variables named diag_1 - diag_24, filled sequentially (the doctor listed between 0 and 24 diagnoses per encounter): if the doctor identified 8 diagnosis codes, they appear individually in diag_1, diag_2, and so on through diag_8, and diag_9 - diag_24 are blank. Please note that duplicate diagnosis codes often occur across observations for the same individual, but never within an observation.
My challenge is to create a continuous variable of "clinical complexity" of a certain type. This variable would be the total number of unique diagnoses appearing in any of the patient's records across the 24 diagnosis variables. I have the relevant list of diagnoses in a macro variable (&diag_list).
A highly simplified version of the data set would look like this:
data encounters(label='medical encounters');
infile datalines dsd truncover;
input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
datalines;
101 F341 F41 F340
101 F340 F49
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;;;;
Assuming all of these diagnosis codes are included in the &diag_list macro variable, I would like to create a continuous variable that would be 4 for study_id 101 and 5 for study_id 102.
I did find some code suggested by @Astounding on a similar question where the data set was only one observation per individual. This code counts unique diagnoses within observations, but not across observations as well:
data want;
  set have;
  array col {24} dsc_diag_1-dsc_diag_24;
  array new {24} $20 _temporary_;
  do _n_=1 to 24;
    new{_n_} = col{_n_};
  end;
  call sortc(of new{*});
  count = (new{1} > ' ');
  do _n_=2 to 24;
    if new{_n_} ne new{_n_-1} then count + 1;
  end;
run;
Any help with this would be greatly appreciated, I've looked at many posts to see if this topic had been covered yet. I'm sure it has but I cannot find anything that helps me in this particular situation. This is the first time I've been really stumped even after reading prior posts and various documents, SUGI's, etc.
Thanks!
One way:
data encounters(label='medical encounters');
  infile datalines truncover;
  input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
datalines;
101 F341 F41 F340
101 F340 F49
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;;;;

data long;
  set encounters;
  array d diag_: ;
  length onediag $ 200;
  do i=1 to dim(d);
    if not missing(d[i]) then do;
      onediag = d[i];
      output;
    end;
  end;
  keep study_id onediag;
run;

proc sql;
  create table dcount as
  select study_id, count(*) as diagcount
  from (select distinct study_id, onediag from long)
  group by study_id
  ;
quit;
I removed the DSD option from the INFILE statement because, with this data and the assigned lengths of your variables, it caused all of the data to be read as invalid.
The long data set is a bit different from what proc transpose would generate, with only one code per record.
Another way to do the count would be two proc freqs.
proc freq data=long noprint;
  tables study_id*onediag / out=temp;
run;

proc freq data=temp noprint;
  tables study_id / out=freqcount(drop=percent);
run;
Or sort Long to remove duplicates of study_id and onediag, then count.
Which approach is best depends on what else you might want to know about the data in the intermediate step.
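The sort-based variant might be sketched like this, reusing the long data set created above; the output name diagcount2 is illustrative:

```sas
/* Remove duplicate study_id/onediag pairs, then count codes per id. */
proc sort data=long out=long_nodup nodupkey;
  by study_id onediag;
run;

data diagcount2;
  set long_nodup;
  by study_id;
  if first.study_id then diagcount = 0;
  diagcount + 1;
  if last.study_id then output;
  keep study_id diagcount;
run;
```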
I found this to be the best and easiest solution to implement. I just added a few extra steps to make sure that individuals with 0 diagnoses were counted as 0 rather than missing.
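For anyone with the same issue, those extra steps could be sketched as follows, assuming the counts sit in dcount and the full id list in encounters, as in the example above:

```sas
/* Ensure individuals with no qualifying diagnoses get 0, not missing. */
proc sql;
  create table dcount_all as
  select a.study_id,
         coalesce(b.diagcount, 0) as diagcount
  from (select distinct study_id from encounters) as a
  left join dcount as b
    on a.study_id = b.study_id;
quit;
```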
A HASH object is very effective for counting unique values: use the variable as the key, enter each value into membership via .ADD(), and read the count of unique entries from .NUM_ITEMS.
Example:
Data for 100 patients having a random number of visits and a random number of random diagnoses.
data have;
  call streaminit(123);
  do patid = 1 to 100;
    date = today() - rand('integer',250,1300);
    top = rand('integer',5,1000);
    do index = 1 by 1 while (date <= today() and index <= top);
      array diag(24);
      do dindex = 1 to rand('integer',1,rand('integer', dim(diag)));
        * possible repeated diagnoses in a row do not matter in this example;
        * so do not try to prevent them;
        diag(dindex) = rand('integer',1,1000);
      end;
      visitid + 1;
      output;
      date + rand('integer',0,3);
      call missing (of diag(*));
    end;
  end;
  keep patid date diag: visitid;
  format date yymmdd10.;
run;
* visit count distribution, just a look see;
proc sql;
  create table freq1 as
  select visit_count, count(*) as freq
  from (select patid, count(*) as visit_count
        from have
        group by patid)
  group by visit_count
  ;
quit;
Per patient, count number of diagnoses matching the study list across all diagnoses of all visits.
%let study_diagnoses = 2,5,11,17,23,31,43,53,61,71,79,89;

data want;
  if _n_ = 1 then do;
    call missing(dx);
    declare hash study_dx();
    declare hash other_dx();
    study_dx.defineKey('dx');
    study_dx.defineDone();
    other_dx.defineKey('dx');
    other_dx.defineDone();
  end;
  do until (last.patid);
    set have;
    by patid;
    array dxs diag:;
    do index = 1 to dim(dxs) while (not missing(dxs(index)));
      dx = dxs(index);
      if dx in (&study_diagnoses) then rc = study_dx.add();
      else rc = other_dx.add();
    end;
  end;
  dx_in_study_count = study_dx.num_items;
  dx_not_in_study_count = other_dx.num_items;
  study_dx.clear();
  other_dx.clear();
  keep patid dx_:;
run;
For the sake of simplifying the sample data generation (in the spoiler), diagnoses are just integers. The assignment of the macro variable, whose value is the list of 'values of interest' needed for the study, is simplified likewise.
%let study_diagnoses = 2,5,11,17,23,31,43,53,61,71,79,89;
For the case of character diagnoses it might look like
%let study_diagnoses = 'F341', 'F41', 'F340', 'F998', 'F1234', 'F123', 'F12';
If the list of relevant codes has a more patterned or complex origin, you might build the macro variable with a DATA step and CALL SYMPUTX('<macrovar>', ...) or with PROC SQL's SELECT ... INTO :<macrovar>.
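The SQL route could be sketched as below; diag_ref and its character column code are hypothetical names standing in for wherever the relevant codes actually live:

```sas
/* Build a quoted, comma-separated code list from a reference table. */
proc sql noprint;
  select distinct quote(strip(code))
    into :study_diagnoses separated by ','
  from diag_ref;
quit;

%put &=study_diagnoses;
```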
dx is implicitly added to the PDV as a numeric variable; it is the host variable for the hash key and is used to 'extract' values from the dxs array when interacting with the hash object.
call missing(dx);
For the case of character diagnoses (DIAG:), the host variable type needs to correspond. Replace the CALL MISSING with a statement such as
length dx $8;
The elements of array dxs are the variables whose names start with diag:
array dxs diag:;
The diagnoses in my generated sample data are numeric, so the original sample code won't be appropriate to your actual data until the dx type is changed to character.
When looping over the dxs array, pull out a diagnosis, check for its relevance, and track its presence as a hash entry:
dx = dxs(index);
if dx in (&study_diagnoses) then rc = study_dx.add();
else rc = other_dx.add();
Hope this explanation clarifies the technique and makes it applicable to your use case.
First, transpose your dataset from wide to long, by study_id and a variable that identifies the encounter (remove the missing values with a where= dataset option). Then you can use a count(distinct) in SQL.
Long datasets are always easier to work with.
Hi Kurt,
Thanks for your reply! When you say long-form data set, I'm a bit confused, as there are already multiple observations for every individual (my definition of long). Do you mean essentially to transpose the 24 diagnosis variables, making the data set 23 variables 'less wide' and up to 23 observations longer per encounter?
If so, I think this makes good sense. Thanks for your input; I'm repeatedly asked to make this data one record per individual, so I did not think to make it longer.
Best,
Barrett
Here's an illustration:
data encounters(label='medical encounters');
infile datalines dsd dlm=" " truncover;
input study_id :$3. (diag_1-diag_3) (:$10.);
n = _n_; /* add an identifier for individual observations */
datalines;
101 F341 F41 F340
101 F340 F49
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;
proc transpose
data=encounters
out=long (drop=n _name_ where=(col1 ne " "))
;
by study_id n;
var diag:;
run;
proc sql;
create table want as
select
study_id,
count(distinct col1) as complexity
from long
group by study_id
;
quit;
You only need to add the selection of diagnosis codes; since you didn't show the layout of your macro variable, I did not include that.
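For instance, if &diag_list resolves to a quoted, comma-separated list such as 'F341','F41','F340' (an assumption, since its layout wasn't shown), the selection could be a simple WHERE clause:

```sas
/* Count only the diagnosis codes in the study list. */
proc sql;
  create table want as
  select
    study_id,
    count(distinct col1) as complexity
  from long
  where col1 in (&diag_list)
  group by study_id
  ;
quit;
```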
But you can see the advantage of the long structure for all kinds of counting, summing or other group-based analysis. It is also expected in SAS procedures; as an example, you can't tell SGPLOT to do horizontal sums before plotting.
I have encountered just two reasons for a wide layout: human consumption and regression analysis, where lots of yes/no indicators are needed for an individual object. So the wide layout usually shows up at the end of my programs, when needed.
Very helpful, thanks for these bits of wisdom!