montgomerybarre
Obsidian | Level 7

Hello everyone, long time lurker, first time poster here. 

 

I have a medical claims database with multiple encounters (up to 1,000 or so) for each individual. The data set has 24 character diagnosis variables named diag_1 - diag_24, and these variables are filled sequentially (the doctor listed between 0 and 24 diagnoses per encounter), so if the doctor identified 8 diagnosis codes they appear in diag_1 through diag_8, and diag_9 - diag_24 are blank. Please note that duplicate diagnosis codes often occur across observations for the same individual, but never within a single observation.

 

My challenge is to create a continuous "clinical complexity" variable of a certain type: the total number of unique diagnoses appearing in any of the patient's records across the 24 diagnosis variables. I have the relevant list of diagnoses in a macro variable (&diag_list).

 

A highly simplified version of the data set would look like this:

 data encounters(label='medical encounters');
   infile datalines dsd truncover;
   input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
 datalines;
 101 F341 F41 F340
 101 F340 F49 
 101 F341 F22 F12
 102 F4689 F410 F011
 102 F341 F410 F340
 ;;;;

Assuming all of these diagnosis codes are included in the &diag_list macro variable, I would like to create a continuous variable that would be 6 for study_id 101 and 5 for study_id 102.

 

I did find some code suggested by @Astounding on a similar question where the data set had only one observation per individual. That code counts unique diagnoses within an observation, but not across observations as well:

data want;
   set have;
   array col {24} dsc_diag_1-dsc_diag_24;
   array new {24} $20 _temporary_;
   do _n_=1 to 24;
      new{_n_} = col{_n_};
   end;
   call sortc(of new{*});
   count = (new{1} > ' ');
   do _n_=2 to 24;
      if new{_n_} ne new{_n_-1} then count + 1;
   end;
run;

Any help with this would be greatly appreciated. I've looked at many posts to see if this topic has been covered; I'm sure it has, but I cannot find anything that helps in this particular situation. This is the first time I've been really stumped even after reading prior posts, documentation, SUGI papers, etc.

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

One way:

data encounters(label='medical encounters');
   infile datalines  truncover;
   input study_id:8. diag_1:$200. diag_2:$200. diag_3:$200.;
datalines;
101 F341 F41 F340
101 F340 F49 
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;;;;

data long;
   set encounters;
   array d diag_: ;
   length onediag $ 200;
   do i=1 to dim(d);
      if not missing(d[i]) then do;
         onediag=d[i];
         output;
      end;
   end;
   keep study_id onediag;
run;

proc sql;
   create table dcount as
   select study_id, count(*) as diagcount
   from (select distinct study_id,onediag
          from long)
   group by study_id
   ;
quit;

I removed the DSD option from the INFILE statement because, without a DLM= option, DSD makes the delimiter a comma; with this space-delimited data and the assigned lengths of your variables it caused all of the data to be "invalid".

The LONG data set is a bit different from what PROC TRANSPOSE would generate; it has only one code per record.
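
If the count should be restricted to the codes in your &diag_list macro variable, one possibility (just a sketch on my part, assuming &diag_list resolves to a comma-separated list of quoted codes such as 'F341','F41',...) is to add the check in the LONG step:

data long;
   set encounters;
   array d diag_: ;
   length onediag $ 200;
   do i=1 to dim(d);
      /* keep only non-missing codes that appear in the study list */
      if not missing(d[i]) and d[i] in (&diag_list) then do;
         onediag=d[i];
         output;
      end;
   end;
   keep study_id onediag;
run;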

 

Another way to do the count would be two proc freqs.

proc freq data=long noprint;
   tables study_id*onediag/ out=temp;
run;
proc freq data=temp noprint;
   tables study_id /out=freqcount(drop=percent) ;
run;

Or sort LONG with NODUPKEY on study_id and onediag to remove the duplicates, then count per study_id in a data step (see the sketch below).

Which you use sort of depends on what else you might want to know about the data in the intermediate step.
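
A minimal sketch of that sort-based variant, using the same LONG data set as above (the data step count is my addition):

proc sort data=long out=long_nodup nodupkey;
   by study_id onediag;
run;

data dcount2;
   set long_nodup;
   by study_id;
   if first.study_id then diagcount = 0;  /* reset the counter for each new study_id */
   diagcount + 1;                         /* sum statement: retained across records */
   if last.study_id then output;          /* one record per study_id with the unique count */
   keep study_id diagcount;
run;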


9 REPLIES
montgomerybarre
Obsidian | Level 7


I found ballardw's approach to be the best and easiest solution to implement. I just added a few extra steps to make sure that individuals with 0 diagnoses were counted as 0 rather than left missing.
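
For reference, one way those extra steps might look; this is only a guess at the approach (it assumes the DCOUNT table from the accepted solution and the original ENCOUNTERS data), not the poster's actual code:

proc sql;
   create table want as
   select a.study_id,
          coalesce(b.diagcount, 0) as diagcount  /* missing count means zero qualifying diagnoses */
   from (select distinct study_id from encounters) as a
        left join dcount as b
        on a.study_id = b.study_id
   ;
quit;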

RichardDeVen
Barite | Level 11

A HASH object is very effective for counting unique values (via .NUM_ITEMS) when the variable is a key, with set membership entered via .ADD().

 

  • A logic statement with a compiled IN operator is said to be one of the fastest methods for checking set membership.
    • Use IN to see if a doctor's diagnosis is one that is of interest (for, say, a study).
  • Use a HASH to track unique diagnoses in the study list.
  • Use another HASH to track non-study diagnoses (for yucks).

Example:

Data for 100 patients, each having a random number of visits and a random number of random diagnoses.

Spoiler
data have;
  call streaminit(123);
  do patid = 1 to 100;
    date = today() - rand('integer',250,1300);
    top = rand('integer',5,1000);
    do index = 1 by 1 while (date <= today() and index <= top);
      array diag(24);
      do dindex = 1 to rand('integer',1,rand('integer', dim(diag)));
        * repeated diagnoses within a row do not matter in this example;
        * so there is no need to prevent them;
        diag(dindex) = rand('integer',1,1000);
      end;
      visitid + 1;
      output;
      date + rand('integer',0,3);
      call missing (of diag(*));
    end;
  end;
  keep patid date diag: visitid;
  format date yymmdd10.;
run;

* visit count distribution, just a look see;
proc sql;
   create table freq1 as
   select visit_count, count(*) as freq
   from (select patid, count(*) as visit_count
         from have
         group by patid)
   group by visit_count
   ;
quit;

Per patient, count number of diagnoses matching the study list across all diagnoses of all visits.

%let study_diagnoses = 2,5,11,17,23,31,43,53,61,71,79,89;

data want;
  if _n_ = 1 then do;
    call missing(dx);
    declare hash study_dx();
    declare hash other_dx();
    study_dx.defineKey('dx');
    study_dx.defineDone();
    other_dx.defineKey('dx');
    other_dx.defineDone();
  end;

  do until (last.patid);

    set have;
    by patid;

    array dxs diag:;

    do index = 1 to dim(dxs) while (not missing(dxs(index)));

      dx = dxs(index);
      if dx in (&study_diagnoses)
        then rc = study_dx.add();
        else rc = other_dx.add();

    end;

  end;

  dx_in_study_count = study_dx.num_items;
  dx_not_in_study_count = other_dx.num_items;

  study_dx.clear();
  other_dx.clear();

  keep patid dx_:;
run;


 

montgomerybarre
Obsidian | Level 7
I can't get this to work. The main reason is that I'm not sure what your dx variable refers to. I have 24 diagnosis variables, which your program picks up with diag:, so I'm not sure what dx and dxs refer to. SAS tells me I'm defining an array with zero elements, and I agree!
RichardDeVen
Barite | Level 11

For the sake of simplifying sample data generation (in the spoiler), diagnoses are just integers. The same simplification applies to the assignment of the macro variable whose value is the list of "values of interest" needed for the study.

 

%let study_diagnoses = 2,5,11,17,23,31,43,53,61,71,79,89;

 

For the case of character diagnoses it might look like

 

%let study_diagnoses = 'F341', 'F41', 'F340', 'F998', 'F1234', 'F123', 'F12';

 

If the list (i.e., the relevant codes) has a more patterned or complex origin, you might build the macro variable in a DATA step with CALL SYMPUTX, or in PROC SQL with SELECT ... INTO :<macrovar>.
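
For instance, a sketch assuming the relevant codes sit in a lookup table named study_codes with a character variable code (both names hypothetical):

proc sql noprint;
   /* build a comma-separated list of quoted codes, e.g. "F341","F41",... */
   select quote(strip(code))
      into :study_diagnoses separated by ','
      from study_codes;
quit;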

 

dx is implicitly added to the PDV as a numeric variable by CALL MISSING; it is the host variable for the hash key, and it is used to "extract" values from the dxs array when interacting with the hash objects.

call missing(dx);

For the case of character diagnoses (DIAG:) the host variable type needs to correspond. Replace the call MISSING with a statement such as

 

length dx $8;

 

 

The elements of array dxs are the variables whose names start with diag

 array dxs diag:;

The diagnoses in my generated sample data are numeric, so the original sample code won't be appropriate for your actual data until the dx type is changed to character.

 

When looping over the dxs array, pull out a diagnosis, check it for relevance, and track its presence as a hash entry.

      dx = dxs(index);
      if dx in (&study_diagnoses)
        then rc = study_dx.add();
        else rc = other_dx.add();
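
Putting those changes together, a rough sketch adapted to character codes might look like this (my assumptions: diag_1-diag_24 are character, HAVE is sorted by study_id, and &diag_list resolves to a comma-separated list of quoted codes):

data want;
  length dx $8;                        /* character host variable; size it to fit your codes */
  if _n_ = 1 then do;
    declare hash study_dx();
    study_dx.defineKey('dx');
    study_dx.defineDone();
  end;

  do until (last.study_id);
    set have;
    by study_id;
    array dxs diag_1-diag_24;
    do index = 1 to dim(dxs) while (not missing(dxs(index)));
      dx = dxs(index);
      if dx in (&diag_list) then rc = study_dx.add();  /* rc is nonzero if dx is already in the hash */
    end;
  end;

  complexity = study_dx.num_items;     /* unique study-list diagnoses for this patient */
  study_dx.clear();                    /* reset for the next patient */

  keep study_id complexity;
run;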

 

 

Hope this explanation clarifies the technique and makes it applicable to your use case.

 

Kurt_Bremser
Super User

First, transpose your dataset from wide to long, by study_id and a variable that identifies the encounter (remove the missing values with a where= dataset option). Then you can use a count(distinct) in SQL.

Long datasets are always easier to work with.

montgomerybarre
Obsidian | Level 7

Hi Kurt,

 

Thanks for your reply! When you say long-form data set, I'm a bit confused, as there are already multiple observations for every individual (my definition of long). Do you mean essentially to transpose the 24 diagnosis variables, making the data set 23 variables "less wide" and up to 23 observations longer per encounter?

 

If so, I think this makes good sense. Thanks for your input; I'm repeatedly asked to make this data one record per individual, so I did not think to make it longer.

 

Best,

Barrett

Kurt_Bremser
Super User

Here's an illustration:

data encounters(label='medical encounters');
infile datalines dsd dlm=" " truncover;
input study_id :$3. (diag_1-diag_3) (:$10.);
n = _n_; /* add an identifier for individual observations */
datalines;
101 F341 F41 F340
101 F340 F49 
101 F341 F22 F12
102 F4689 F410 F011
102 F341 F410 F340
;

proc transpose
  data=encounters
  out=long (drop=n _name_ where=(col1 ne " "))
;
by study_id n;
var diag:;
run;

proc sql;
create table want as
  select
    study_id,
    count(distinct col1) as complexity
  from long
  group by study_id
;
quit;

You only need to add the selection of diagnosis codes; since you didn't show the layout of your macro variable, I did not include that.
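
For example, if the macro variable resolves to a comma-separated list of quoted codes (an assumption on my part), the selection could be added as a WHERE clause in the final SQL step:

proc sql;
create table want as
  select
    study_id,
    count(distinct col1) as complexity
  from long
  where col1 in (&diag_list)   /* keep only the codes of interest */
  group by study_id
;
quit;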

But you can see the advantage of the long structure for all kinds of counting, summing, or other group-based analysis. It is also what SAS procedures generally expect; as an example, you can't tell SGPLOT to do horizontal sums before plotting.

I have encountered just two reasons for a wide layout: human consumption and regression analysis, where lots of yes/no indicators are needed for an individual object. So the wide layout usually shows up at the end of my programs, when needed.

montgomerybarre
Obsidian | Level 7

Very helpful, thanks for these bits of wisdom!
