Hi all:
The question I need to answer is: for people who have a defined DX1 value (in this case, 5990), what are the most frequent codes that exist across DX2 through DX30? I'm trying to build out a profile of the "most common case" based on these codes. In other words, for people with 5990, the ten most frequent codes are "A (freq, x%) ,B (freq, x%) etc..." I think I need two steps:
Many thanks for your help! (I use 9.4, if that helps.)
DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | DX7 | DX8 |
5990 | 2859 | 1302 | V5869 | ||||
5990 | |||||||
5990 | 2768 | 42789 | 49390 | 25000 | 2720 | V5869 | V5866 |
5990 | 4149 | 25000 | 4019 | V5866 | V5869 | ||
5990 | |||||||
5990 | 4019 | 5715 | 45621 | ||||
5990 | 7881 | 4019 | 2720 | 53081 | V5869 | V4589 | V5866 |
5990 | 42731 | 25000 | V5869 | ||||
5990 | 27650 | V5869 | 53081 | 4019 | 2720 | ||
5990 | 78701 | 78791 | |||||
5990 | 4019 | 2720 | V5869 | ||||
5990 | 78820 | 78097 | 29420 | V1309 | |||
5990 | 413 | 4264 | 78607 | V1254 | 33829 | ||
5990 | 40391 | 5856 | V4511 | V4573 | V1051 | 1970 | V1011 |
Well, the simplest way would be to normalise your data and run a proc freq on it:
data have; infile datalines dlm="," dsd; input dx1 $ dx2 $ dx3 $ dx4 $ dx5 $ dx6 $ dx7 $ dx8 $; datalines; 5990,2859,1302,V5869,,,, 5990,,,,,,,, 5990,2768,42789,49390,25000,2720,V5869,V5866 ; run; data inter (keep=dx1 v); set have; array vals{7} dx2-dx8; do i=1 to 7; if vals{i} ne "" then do; v=vals{i}; output; end; end; run;
/* Note ouptu is basic and output window only, so modify to your needs */ proc freq data=inter; run;
And for future reference please post test data in the form of a datastep so we don't have to guess at your data's structure.
"determine the most frequent values across all 29 variables and across all other observations. "
If you show the required output for the sample data, it is possible to visualize what you want as in the above line. Then a better way can be shown.
Thanks for your replies. Below is the datastep (if that helps; I can provide more detailed data if needed) I used to create my data and also the chart I'd like to be able to populate. DX1 is used as the initial filter to create the derrivative dataset. From that secondary dataset (workdata.discharge_5990), I'd like to be able to determine the most frequent DX codes that appear in DX2 - DX30 for all observations, rather than just a single row.
The dataset "workdata.discharge_5990" has more than 100,000 observations, if that helps guide the approach. It does not have a variable that functions as a unique identifier though all the observations are unique.
Top Ten Most Common DX for Paients with DX1 of 5990 | ||
N | % | |
DX A | ||
DX B | ||
DX C | ||
DX D | ||
DX E | ||
DX F | ||
DX G | ||
DX H | ||
DX I | ||
DX J |
data workdata.discharge_5990;
set workdata.emergency_department;
where dx1 = '5990';
run;
Do the Top 10 DX refers to all DXs(amalgamated)? In other words, can DX2 to DX30 be clubbed for counting purpose?
Great question. Yes, the Top 10 DX refers to counts from all DXs combined (DX2-DX30). They can definitely be clubbed for counting purposes.
Must "5990" appear only in DX1, not in any of the other codes?
The very first reply you received contains the right approach. It might still need a little work, along the lines of:
But it's a very straightforward, accurate way to approach the problem. Did you attempt to use it?
Run next short code and check output dataset.
You may find it easy to find in what you looking for:
proc freq data=have; /* or data=have(where=(dx1=5990)); */
table dx1 * (dx2 - dx30); /* try also, depending sas version: dx1 * (dx: ) */
output out=freq;
run;
Thanks for replying. I ran this code - seems to give a distribution of the most frequent DX2 codes across all observations (and so on for DX3-30). I think the next step is figuring out which DX values are the mostcommon across DX2-30 (not just within each DX). Does that make sense?
A simple way, but I am not sure that this will meet your ultimate goal.
Break DX2 to DX30 as 30 single STRING variables and save it to a data set. You may ignore the missing values. Sort the data set by STRING in descending order and take first 10 rows.
Does next code helps you ?
data have;
infile datalines truncover;
input dx1 $ dx2 $ dx3 $ dx4 $ dx5 $ dx6 $ dx7 $ dx8 $ ;
/*DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 */
DATALINES;
5990 2859 1302 V5869
5990
5990 2768 42789 49390 25000 2720 V5869 V5866
5990 4149 25000 4019 V5866 V5869
5990
5990 4019 5715 45621
5990 7881 4019 2720 53081 V5869 V4589 V5866
5990 42731 25000 V5869
5990 27650 V5869 53081 4019 2720
5990 78701 78791
5990 4019 2720 V5869
5990 78820 78097 29420 V1309
5990 413 4264 78607 V1254 33829
5990 40391 5856 V4511 V4573 V1051 1970 V1011
; run;
data temp;
set have;
by dx1;
array dx $ dx2-dx8;
do i=1 to 7;
if dx(i) ne ' ' then do;
dxn = dx(i);
output;
end; end;
keep dx1 dxn;
run;
proc sql;
create table freq as
select dx1, dxn, count(dxn) as freq
from temp
group by dx1, dxn;
quit;
proc sort data=freq; by descending freq; run;
proc print data=freq(obs=10); run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.