BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
DMMD
Obsidian | Level 7

Hi,

 

Related to: https://communities.sas.com/t5/Base-SAS-Programming/summarize-diagnosis-multiple-columns-and-multipl...

 

I have a very large data set (>3 million observations) for which would like to count the number of patients (not total observations) have one of four possible diagnosis codes. Patients could have multiple observations denoting multiple visits and have up to ten diagnosis variable fields.

 

My four codes of interest are: 72632 or 72631 or 71922 or 71923.

 

I would like to create a command that searches DX01 to DX10 for all observations for each patient for presence of the above codes.

 

Example:

Patientid      dx01    dx02    dx03    dx04    dx05    …       dx10    date

1             72632   72631   71922   71923                          5/5/2009

1              72632   71922   70000   70001                          6/6/2008

2              70001   70000                                                    1/1/2007

2              71900   71901                                                  2/2/2007

3              72632   71900                                                   2/2/2004

3              72631   70000                                                   3/3/2004

3              71922                                                                4/4/2004

 

From the above example, I would like to produce output like the following

1. By Patient ID

                               Patient ID            Has (72632 OR 72631 OR 71922 OR 71923) [1=Yes, 2=No]

                                            1             1

                                           2              0

                                           3              1

2. By Total Number of Patients
                               Total Count of Patients with (72632 OR 72631 OR 71922 OR 71923)

                                          2

 

Separately, I would also like a count of the number of observations each patient has for the diagnoses (72632 OR 72631 OR 71922 OR 71923) that produces output like:

                              Patient ID            Has (72632 OR 72631 OR 71922 OR 71923)             Number of Visits Addressing Diagnosis

                                            1             1                                                                                   2

                                           2              0                                                                                   0

                                           3              1                                                                                    2

Thanks for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User
data inter (keep=id diag);
set have;
array diags {*} dx1-dx10;
diag = 0;
do i = 1 to dim(diags);
  if diags{i} in (72632,72631,71922 71923) then diag = 1;
end;
run;

proc sort data=inter;
by id;
run;

data result (keep=id hasdiag visitcount) totalcount (keep=total_patients);
set inter end=done;
retain total_patients 0 visitcount hasdiag;
if first.id
then do;
  visitcount = 0;
  hasdiag = 0;
end;
if diag then do;
  visitcount + 1;
  hasdiag = 1;
end;
if last.id then do;
  output result;
  total_patients + hasdiag;
end;
if done then output totalcount;
run;

If your initial dataset is already sorted by id, you can omit the sort step.

 

Some edits done for typos.

View solution in original post

25 REPLIES 25
DarthPathos
Lapis Lazuli | Level 10

Hi

 

I am sure there are more efficient ways to handle this, but here's how I would do it using SQL:

 

 

proc sql;
/* This will put the data into categories of Yes = 1 and No = 0 */ create table work.DATA2 as  select patient_id,  case when (dx1 in (72632, 72631, 71922, 71923)                or (dx2 in (72632, 72631, 71922, 71923)  . . .             or (dx10 in (72632, 72631, 71922, 71923))  then 1  else 0 end as Yes_No from work.DATA;
/* This will sum the Yes_No column to give you a TOTAL number of instances that met
the Yes criteria above */ select sum(Yes_No) as Total_Count from work.DATA2; 
/* This will give you the total per patient ID */ select patient_ID, sum(Yes_No) as Pt_Count from work.DATA2 group by patient_ID order by patient_ID; quit;

This may require some fine tuning, and there are plenty of papers and articles (including a few written by me!) on PROC SQL, but this should get you started.


Good luck and happy coding!

Chris

Has my article or post helped? Please mark as Solution or Like the article!
Shmuel
Garnet | Level 18

Alternative code using data step:

 

%let ds_check = 72632 72631 71922 71923;  /* easy to change list if needed */

 

data want;

 set have;

  by patient_id;

       retain yes_no;

       if first.patient_id then yes_no = 0;

       

       array dx dx1-dx10;

       do i=1 to 10;

            if dx(i) in (&ds_check) then yes_no = 1;

        end;

 

        if last.patient_id then output;

        keep patient_id  yes_no;

        label yes_no = " HAS (&ds_check)";

run;

 

The TOTAL number can be achieved by various ways

     (e.g. by PROC MEANS or PROC SQL or PROC PRINT with SUM )

 

 

 

 

DMMD
Obsidian | Level 7

How do I deal with leading zeros in my diagnoses categories. For example, DX01, DX02, DX03...Dx10?

 

Thank you so much!

Shmuel
Garnet | Level 18

you may change the array definition into:

ARRAY DX DX01-DX10;

 

or alternatively do:

 

  SET WANT (Rename=(dx01=dx1 dx02=dx2 ... dx09=dx9));

DarthPathos
Lapis Lazuli | Level 10

Unless I'm misunderstanding, the DX01 is a column header and at least for the SQL version, won't have an impact.  The only time you'd need to worry about leading 0s is if they're in your data (00745454 versus 745454).  Sometimes the number of leading 0s is important, and then you would have to treat the column as TEXT rather than NUMERIC.

 

hope this helps!

Chris

Has my article or post helped? Please mark as Solution or Like the article!
Kurt_Bremser
Super User

@DMMD wrote:

How do I deal with leading zeros in my diagnoses categories. For example, DX01, DX02, DX03...Dx10?

 

Thank you so much!


Just include the zero in the variable list. Just tested this short example:

data have;
input x01 x02 x03;
cards;
1 2 3
;
run;

data want;
set have;
array xx {*} x01-x03;
do i = 1 to dim(xx);
  put xx{i}=;
end;
run;
DMMD
Obsidian | Level 7

Fantastic! The problem was the diagnoses codes were character, not number variables as there are some codes starting with letters. Putting the codes in quotations allowed it to work. Thank you so much! This helped to answer #1 and #3 of my output.

DarthPathos
Lapis Lazuli | Level 10

Hey!  I think the issue is the PROC FREQ; if I remember correctly, PROC FREQ counts instances, or something.  At a quick glance, your SQL looks good, and i'm not seeing any issues with your brackets or other logic.  I'll wait and see what @Shmuel says, and if I can i will do some testing tomorrow when I'm at work.

 

Good luck 🙂
Chris

Has my article or post helped? Please mark as Solution or Like the article!
DMMD
Obsidian | Level 7

Thanks @DarthPathos, @Shmuel, and @Kurt_Bremser! SAS can be a stressful experience to say the least.

 

Best,

David

DarthPathos
Lapis Lazuli | Level 10
Yeah, I've got to admit i'm stumped LOL. Hopefully one of my esteemed co-users can point you in the right direction, and if I think of anything I'll you know. One of the benefits of being on the community - you get a wide variety of views 😄
Don't give up - I've been using SAS for 6 years, and have come a long, long way. You'll get there!
Chris
Has my article or post helped? Please mark as Solution or Like the article!
Shmuel
Garnet | Level 18

Within my code ( @shmuel ) I conunt per EUPIDE. 

 

It seems that same eupide can exist in more than one observation.

 

@Kurt_Bremser counts each observation.

 

I have no time now to check the 3rd program.

Shmuel
Garnet | Level 18

I have analized the 3 programs , here is the explanation for the differences:

 

Let's have an input like:

        EUPIDE      YES-NO         my remark                      counts by:  SH        KB          DP 

              1               0                non of dx in list                                                0            1

              1               1               one or more dx in list                                       1            1               

              1               1                                                                           1          1            1  

                                                        TOTAL                                         1          2            3  

 

in words:

         I count each EUPIDE once only. Therefore my total is the least and I have missings in freq output.

         @Kurt_Bremser assigns each line as 0 or 1 and sums the colomn yes-no.

         @DarthPathos counts each line as 1 if any line of same EUPIDE is 1.

 

@DMMD now it is your turn to choose which of the 3 fits more to you.

            

      

DMMD
Obsidian | Level 7

Thanks so much @Shmuel for helping to explain the function of each code.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 2367 views
  • 15 likes
  • 4 in conversation