BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
montgomerybarre
Obsidian | Level 7

Hello everyone, long time lurker, first time poster here. 

 

I have a medical claims data base with multiple encounters (up to 1000 or so) for each individual. In this data set I have 24 diagnosis character variables with the naming convention diag_1 - diag_24, and these variables are filled sequentially (doctor listed between 0 and 24 diagnoses per encounter) so that if the doctor identified 8 diagnosis codes, these would appear individually in diag_1, diag_2, etc. until diag_8, after which diag_9 - diag_24 would be blank. Please note that there are often duplicate diagnosis codes across the observations for the same individual, however never within observations.

 

My challenge is to create a continuous variable of "clinical complexity" of a certain type. This variable would include the total number of unique diagnoses included in any one of the patient's records in the 24 diagnosis variables. I have the relevant list of diagnoses in a macro variable (&diag_list).

 

A highly simplified version of the data set would look like this:

 data encounters(label='medical encounters');
   infile datalines dsd truncover;
   input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
 datalines;
 101 F341 F41 F340
 101 F340 F49 
 101 F341 F22 F12
 102 F4689 F410 F011
 102 F341 F410 F340
 ;;;;

Assuming all of these diagnosis codes are included in the $diag_list macro variable, I would like to create a continuous variable that would be 4 for study_id 101 and 5 for study_id 102.

 

I did find some code from @Astounding suggested on a similar question where the data set was only one observation per individual. This code works to count unique diagnoses within observations, but not across observations as well:

data want;
	set have;
	array col {24} dsc_diag_1-dsc_diag_24;
	array new {24} $20  _temporary_;
		do _n_=1 to 24;
		   new{_n_} = col{_n_};
		end;
	call sortc(of new{*});
	count = (new{1} > ' ');
		do _n_=2 to 24;
		   if new{_n_}  ne new{_n_-1} then count + 1;
		end;
run;

Any help with this would be greatly appreciated, I've looked at many posts to see if this topic had been covered yet. I'm sure it has but I cannot find anything that helps me in this particular situation. This is the first time I've been really stumped even after reading prior posts and various documents, SUGI's, etc.

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

One way:

data encounters(label='medical encounters');
   infile datalines  truncover;
   input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
datalines;
101 F341 F41 F340
101 F340 F49 
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;;;;

data long;
   set encounters;
   array d diag_: ;
   length onediag $ 200;
   do i=1 to dim(d);
      if not missing(d[i]) then do;
         onediag=d[i];
         output;
      end;
   end;
   keep study_id onediag;
run;

proc sql;
   create table dcount as
   select study_id, count(*) as diagcount
   from (select distinct study_id,onediag
          from long)
   group by study_id
   ;
quit;

I removed the DSD option from the Infile because with this data and the assigned lengths of your variable it caused all of the data to be "invalid".

The long set is a bit different than proc transpose would generate with only one code per record.

 

Another way to do the count would be two proc freqs.

proc freq data=long noprint;
   tables study_id*onediag/ out=temp;
run;
proc freq data=temp noprint;
   tables study_id /out=freqcount(drop=percent) ;
run;

or sort Long to remove duplicates of study_id onediag

Sort of depends on other things you might want to know about the data in between step.

View solution in original post

9 REPLIES 9
ballardw
Super User

One way:

data encounters(label='medical encounters');
   infile datalines  truncover;
   input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
datalines;
101 F341 F41 F340
101 F340 F49 
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;;;;

data long;
   set encounters;
   array d diag_: ;
   length onediag $ 200;
   do i=1 to dim(d);
      if not missing(d[i]) then do;
         onediag=d[i];
         output;
      end;
   end;
   keep study_id onediag;
run;

proc sql;
   create table dcount as
   select study_id, count(*) as diagcount
   from (select distinct study_id,onediag
          from long)
   group by study_id
   ;
quit;

I removed the DSD option from the Infile because with this data and the assigned lengths of your variable it caused all of the data to be "invalid".

The long set is a bit different than proc transpose would generate with only one code per record.

 

Another way to do the count would be two proc freqs.

proc freq data=long noprint;
   tables study_id*onediag/ out=temp;
run;
proc freq data=temp noprint;
   tables study_id /out=freqcount(drop=percent) ;
run;

or sort Long to remove duplicates of study_id onediag

Sort of depends on other things you might want to know about the data in between step.

montgomerybarre
Obsidian | Level 7


I found this to be the best and easiest solution to implement. I just added a few extra steps to make sure that individuals with 0 diagnoses were identified as 0 and not as missing.

RichardDeVen
Barite | Level 11

A HASH object is very effective for counting unique values (.NUM_ITEMS) when the variable is a key, and for entering into membership via .ADD()

 

  • A logic statement with a compiled IN operation is said to be one of the fastest methods for checking set membership.
    • Use IN to see if a doctors diagnosis is one that is of interest (for say a study)
  • Use HASH to track unique diagnosis in the study list
  • Use another HASH to track non-study diagnoses (for yucks)

Example:

Data for 1,000 patients having random number of visits and random number of random diagnoses.

Spoiler
data have;
  call streaminit(123);
  do patid = 1 to 100;
    date = today() - rand('integer',250,1300);
    top = rand('integer',5,1000);
    do index = 1 by 1 while (date <= today() and index <= top);
      array diag(24);
      do dindex = 1 to rand('integer',1,rand('integer', dim(diag)));
        * possible repeated diagnosis in row dont matter in this example;
        * so dont try to prevent them;
        diag(dindex) = rand('integer',1,1000);
      end;
      visitid + 1;
      output;
      date + rand('integer',0,3);
      call missing (of diag(*));
    end;
  end;
  keep patid date diag: visitid;
  format date yymmdd10.;
run;

* visit count distribution, just a look see;
proc sql;
create table freq1 as
select
visit_count, count(*) as freq from
( select
patid, count(*) as visit_count from have group by patid
)
group by visit_count
;

Per patient, count number of diagnoses matching the study list across all diagnoses of all visits.

%let study_diagnoses = 2,5,11,17,23,31,43,53,61,71,79,89;

data want;
  if _n_ = 1 then do;
    call missing(dx);
    declare hash study_dx();
    declare hash other_dx();
    study_dx.defineKey('dx');
    study_dx.defineDone();
    other_dx.defineKey('dx');
    other_dx.defineDone();
  end;

  do until (last.patid);

    set have;
    by patid;

    array dxs diag:;

    do index = 1 to dim(dxs) while (not missing(dxs(index)));

      dx = dxs(index);
      if dx in (&study_diagnoses)
        then rc = study_dx.add();
        else rc = other_dx.add();

    end;

  end;

  dx_in_study_count = study_dx.num_items;
  dx_not_in_study_count = other_dx.num_items;

  study_dx.clear();
  other_dx.clear();

  keep patid dx_:;
run;

RichardADeVenezia_0-1591994707065.png

 

montgomerybarre
Obsidian | Level 7
I can't get this to work. The main reason is that I'm not sure what your dx variable is referring to? I have 24 diagnosis variables that are included in your program with the diag:, therefore I'm not sure what the dx and dxs are referring to? SAS tells me I'm defining an array with zero elements, and I agree!
RichardDeVen
Barite | Level 11

For the sake of simplifying sample data generation (in the spoiler) diagnoses are just integers.  Likewise simplifying the assignment of the macro variable whose value is a list of the 'values of interest' needed for some study.

 

%let study_diagnoses = 2,5,11,17,23,31,43,53,61,71,79,89;

 

For the case of character diagnoses it might look like

 

%let study_diagnoses = `F341', 'F41', 'F340', 'F998', 'F1234', 'F123', 'F12';

 

If the list (i.e the relevant ones) has a more patterned or complex origin you might be using DATA / SYMPUT('<macrovar>', or SQL / INTO :<macrovar>

 

dx is implicitly added to the PDV as a number, and is the host variable for the hash key and will be used to 'extract' values from the dxs array when interacting with the hash object.

call missing(dx);

For the case of character diagnoses (DIAG:) the host variable type needs to correspond. Replace the call MISSING with a statement such as

 

length dx $8;

 

 

The elements of array dxs are the variables whose names start with diag

 array dxs diag:;

The diagnoses is my generated sample data are numeric, so the original sample code won't be appropriate to your actual data until the dx type is changed to character.

 

When looping over the dxs array pull out a diagnosis, check for it's relevance, and track its presence as a hash entry.

      dx = dxs(index);
      if dx in (&study_diagnoses)
        then rc = study_dx.add();
        else rc = other_dx.add();

 

 

Hope this explanation clarifies the technique and makes it applicable to your use case.

 

Kurt_Bremser
Super User

First, transpose your dataset from wide to long, by study_id and a variable that identifies the encounter (remove the missing values with a where= dataset option). Then you can use a count(distinct) in SQL.

Long datasets are always easier to work with.

montgomerybarre
Obsidian | Level 7

Hi Kurt,

 

Than ks for your reply! When you say long form data set, I'm a bit confused as there are already multiple observations for every individual (my definition of long). Do you mean essentially to transpose the 24 diagnosis variables to basically make the data set 23 variables 'less-wide' and 23 observations longer?

 

If so, I think this makes good sense. Thanks for your input; I'm repeatedly asked to make this data one record per individual, so I did not think to make it longer.

 

Best,

Barrett

Kurt_Bremser
Super User

Here's an illustration:

data encounters(label='medical encounters');
infile datalines dsd dlm=" " truncover;
input study_id :$3. (diag_1-diag_3) (:$10.);
n = _n_; /* add an identifier for individual observations */
datalines;
101 F341 F41 F340
101 F340 F49 
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;

proc transpose
  data=encounters
  out=long (drop=n _name_ where=(col1 ne " "))
;
by study_id n;
var diag:;
run;

proc sql;
create table want as
  select
    study_id,
    count(distinct col1) as complexity
  from long
  group by study_id
;
quit;

You only need to add the selection of diagnosis codes; since you didn't show the layout of your macro variable, I did not include that.

But you can see the advantage of the long structure for all kinds of counting, summing or other group-based analysis. It is also expected in SAS procedures; as an example, you can't tell SGPLOT to do horizontal sums before plotting.

I have encountered just two reasons for a wide layout: human consumption and regression analysis, where lots of yes/no indicators are needed for an individual object. So the wide layout usually shows up at the end of my programs, when needed.

montgomerybarre
Obsidian | Level 7

Very helpful, thanks for these bits of wisdom!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1815 views
  • 7 likes
  • 4 in conversation