This is a large dataset, including over 100k obs. What I need to do is to classify individuals into body system categories based on their icd-9 categories.
For example, this is a sample code.
if substr(dx{i},1,3) in ('080', '081', '082'), then mar_head=1
Therefore, I have generated 31 different categories.
After that, I need to generate a frequency table of this population, if that can be listed in an order based on the frequency of each different cateogory. For example, lung diseases is the first major problem; head diseases is the second most frequent problem. That is what I expect. Since it is a large dataset, and there are so many categories. I use transpose.
proc transpose data=sam1 out=samid;
var mar_head--mar_gyne;
by id; run;
I then have a table generated with newid, previousid, and the name of formal variable, including mar_head, mar_colon.
Then, I hope to get advice from you how to count the frequency of those variables, which is now listed under one column, called “name of formal variable”.
I want to have a following table:
Body system percent N
Musculoskeletal 26.2% 12345
Respiratory 25.5% 12322
Neurological 24.8% 11112
Mental health 18.6% 9687
Respiratory 14.6% 9456
Eye Diseases 4.4% 3233
Digestive 4.3% 2459
Gynecological 6.2% 4562
All Other 10.3% 5688
I do not know how to do this. Also, I need to only count each individual once. Thank you for your advice
You wrote that all of the mar_xxxx variables had value 0. So far, you have only shown code snippets which set mar_xxxx to 1 under certain conditions. What code do you use to set them to 0? When does this happen? In the same data step that contains the IF/THEN statements we've seen? If so, before those IF/THEN statements or after?
PROC FREQ?
Please post some sample data and expected output.
It's easy to say count an individual once, but what does that mean? What happens when people have multiple issues-highly likely as a lot of issues are related.
There are various methods for generating tables, its likely your company has their own standards/macro libararies. You can also find some examples in the PHUSE library:
https://github.com/phuse-org/phuse-scripts/wiki/Standard-Script-Index
Without test data (as a datastep) its hard to provide code.
It looks like you are making indicator variables. You can summarize the indicators directly to get N DENOM and PCT.
data mar;
input id:$3. trt mar_head mar_gyne;
cards;
001 1 . 1
001 1 0 1
001 1 0 1
002 2 . .
003 2 1 0
003 2 1 1
004 1 . .
005 1 . .
;;;;
run;
proc summary data=mar nway;
class id trt;
output out=max max(mar_:)=;
run;
proc stdize missing=0 reponly data=max out=max0;
var mar_:;
run;
ods select none;
ods output Summary=Summary;
proc means data=max0 nway stackods sum n mean;
class trt;
var mar_:;
run;
ods select all;
proc print label;
format sum f5. mean percentn.;
label n='Denom' sum='N' mean='Pct';
run;
Thanks. What does "trt" stand for?
Hi @Bal23,
Another option would be not to create 31 indicator variables, but only one variable containing the body system code (e.g. 1, 2, ..., 31 with a format assigning the codes to body system names).
Example:
if substr(dx{i},1,3) in ('080', '081', '082') then bscode=8;
Without any intermediate datasets you can then obtain the desired counts and percentages using PROC SQL: count(distinct id) gives the number of individuals in the respective group. Your sample table (with percentages summing up to 134.9) indicates that overlaps between the body systems are anticipated. I assume that the percentages refer to some comprehensive dataset (called ALLIDS below) which may contain individuals who don't occur in any of the body system categories (like IDs 31 and 33 in the example below). If all of your individuals are classified, you can get the total from dataset HAVE.
/* Create sample data */
proc format;
value bscfmt
5='Musculoskeletal'
6='Neurological'
7='Respiratory'
31='All Other';
run;
data have;
bscode=5;
do id=1 to 20;
output;
if id>17 then output; /* IDs with 2 muscul. diseases */
end;
bscode=7;
do id=16 to 30;
output;
end;
bscode=6;
do id=16 to 36 by 2;
output;
end;
bscode=31;
do id=35 to 48;
output;
end;
format bscode bscfmt.;
label bscode='Body system';
run;
data allids;
do id=1 to 48;
output;
end;
run;
/* Perform counts and sort the results */
proc sql;
select count(distinct id) into :total
from allids;
create table bscounts as
select bscode, count(distinct id) as N,
calculated N/&total as percent format=percent8.1
from have
group by bscode
order by (bscode=31), N desc; /* This keeps category "All Other" at the end. */
quit;
proc print data=bscounts label noobs;
run;
Thank you very much. I like it. I have not tried yet, because I find there is something wrong with my code. That is, all these variables I generated, including mar_colon, mar_head, have value "0";
would any of you suggest what that happens?
In the IF/THEN statement in your initial post you had a comma between the condition and "then". If it's this, the log should contain error messages.
You will need to show example program and data.
no comma, no error message
below is the sample code
if substr(dx{i},1,2) in ('77')
or substr(dx{i},1,6) in ('EYES -', 'EYES-G')
then do;
mar_eye=1;
end;
run;
proc contents data=have;
run; /*I realize I generate numerical variables, instead of categorical one that I prefer. Would you please let me know how to fix that as well?*/
proc means data=have;
var mar_:;
run;
Still too little information, for me to do anything but guess.
There is no syntax error in your code provided that the array variables dx{i} are character variables.
You wrote that "all these variables" mar_xxx have value 0. This means that there must be a substantial issue which should be easy to find: Just print an arbitrary dx{i} value from your dataset without formatting (format _all_;) and then check why the corresponding IF condition is not met. (Possible reasons include leading blanks, upper/lower case mismatch, formatted vs. unformatted values, ...)
Thank you very much. I am sorry I do not understand it very well. Would you please explain a little bit more? Or give an example like a sample sas code>
I do find some notes from the log
Character values have been converted to numeric values at the places given by:
(Line):(Column).
2286:25 2653:12
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
These automatic conversion messages can be harmless in some cases and indicate serious issues in others. Therefore, I recommend to avoid them whenever possible by performing the conversions explicitly:
Regarding the ICD-9 codes, all you have shown so far are those IF conditions involving applications of the SUBSTR function to array elements dx{i}. Array elements, however, are valid references only during a data step where the array is defined. So, the real names of the variables you referred to as dx{i} could be dx1, dx2 etc., but could be completely different as well, e.g. icdcode_BL, icdcode_wk4, ....
Please just look up some of the said real names from the array statement, let us know of what type and length these variables are (see PROC CONTENTS output for the dataset the variables belong to) and print a few unformatted values of these variables:
proc print data=have(obs=5);
format _all_;
var dx1 dx2 icdcode_BL; /* or whatever the real names are */
run;
Proc contents
they are all numerical, with length of 8
proc print data=have(obs=5);
format _all_;
var dx1 dx2 icdcode_BL; /* or whatever the real names are */
run;
I did run this, the values of all those var are missing
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.