Solved: Re: how to count the frequency of variables that I have generated

Bal23 · Posted 02-23-2016 11:18 AM

This is a large dataset, including over 100k obs. What I need to do is to classify individuals into body system categories based on their icd-9 categories.

For example, this is a sample code.
if substr(dx{i},1,3) in ('080', '081', '082'), then mar_head=1

Therefore, I have generated 31 different categories.

After that, I need to generate a frequency table of this population, if that can be listed in an order based on the frequency of each different cateogory. For example, lung diseases is the first major problem; head diseases is the second most frequent problem. That is what I expect. Since it is a large dataset, and there are so many categories. I use transpose.

proc transpose data=sam1 out=samid;
var mar_head--mar_gyne;
by id; run;

I then have a table generated with newid, previousid, and the name of formal variable, including mar_head, mar_colon.

Then, I hope to get advice from you how to count the frequency of those variables, which is now listed under one column, called “name of formal variable”.

I want to have a following table:
Body system percent N
Musculoskeletal 26.2% 12345
Respiratory 25.5% 12322
Neurological 24.8% 11112
Mental health 18.6% 9687
Respiratory 14.6% 9456
Eye Diseases 4.4% 3233
Digestive 4.3% 2459
Gynecological 6.2% 4562
All Other 10.3% 5688

I do not know how to do this. Also, I need to only count each individual once. Thank you for your advice

FreelanceReinh · Posted 02-24-2016 02:34 PM

You wrote that all of the mar_xxxx variables had value 0. So far, you have only shown code snippets which set mar_xxxx to 1 under certain conditions. What code do you use to set them to 0? When does this happen? In the same data step that contains the IF/THEN statements we've seen? If so, before those IF/THEN statements or after?

View solution in original post

Reeza · Posted 02-23-2016 11:22 AM

PROC FREQ?

Please post some sample data and expected output.

It's easy to say count an individual once, but what does that mean? What happens when people have multiple issues-highly likely as a lot of issues are related.

RW9 · Posted 02-23-2016 11:29 AM

There are various methods for generating tables, its likely your company has their own standards/macro libararies. You can also find some examples in the PHUSE library:

https://github.com/phuse-org/phuse-scripts/wiki/Standard-Script-Index

Without test data (as a datastep) its hard to provide code.

data_null__ · Posted 02-23-2016 12:35 PM

It looks like you are making indicator variables. You can summarize the indicators directly to get N DENOM and PCT.

data mar;
   input id:$3. trt mar_head mar_gyne;
   cards;
001 1 . 1
001 1 0 1
001 1 0 1
002 2 . .
003 2 1 0
003 2 1 1 
004 1 . .
005 1 . .
;;;;
   run;
proc summary data=mar nway;
   class id trt;
   output out=max max(mar_:)=;
   run;
proc stdize missing=0 reponly data=max out=max0;
   var mar_:;
   run;

ods select none;
ods output Summary=Summary;
proc means data=max0 nway stackods sum n mean;
   class trt;
   var mar_:;
   run;
ods select all;
proc print label;
   format sum f5. mean percentn.;
   label n='Denom' sum='N' mean='Pct';
   run;

Bal23 · Posted 02-23-2016 03:05 PM

Thanks. What does "trt" stand for?

FreelanceReinh · Posted 02-23-2016 01:43 PM

Hi @Bal23,

Another option would be not to create 31 indicator variables, but only one variable containing the body system code (e.g. 1, 2, ..., 31 with a format assigning the codes to body system names).

Example:

if substr(dx{i},1,3) in ('080', '081', '082') then bscode=8;

Without any intermediate datasets you can then obtain the desired counts and percentages using PROC SQL: count(distinct id) gives the number of individuals in the respective group. Your sample table (with percentages summing up to 134.9) indicates that overlaps between the body systems are anticipated. I assume that the percentages refer to some comprehensive dataset (called ALLIDS below) which may contain individuals who don't occur in any of the body system categories (like IDs 31 and 33 in the example below). If all of your individuals are classified, you can get the total from dataset HAVE.

/* Create sample data */

proc format;
value bscfmt
5='Musculoskeletal'
6='Neurological'
7='Respiratory'
31='All Other';
run;

data have;
bscode=5;
do id=1 to 20;
  output;
  if id>17 then output; /* IDs with 2 muscul. diseases */
end;
bscode=7;
do id=16 to 30;
  output;
end;
bscode=6;
do id=16 to 36 by 2;
  output;
end;
bscode=31;
do id=35 to 48;
  output;
end;
format bscode bscfmt.;
label bscode='Body system';
run;

data allids;
do id=1 to 48;
  output;
end;
run;

/* Perform counts and sort the results */

proc sql;
select count(distinct id) into :total 
from allids;

create table bscounts as
select bscode, count(distinct id) as N,
       calculated N/&total as percent format=percent8.1
from have
group by bscode
order by (bscode=31), N desc; /* This keeps category "All Other" at the end. */
quit;

proc print data=bscounts label noobs;
run;

Bal23 · Posted 02-23-2016 03:08 PM

Thank you very much. I like it. I have not tried yet, because I find there is something wrong with my code. That is, all these variables I generated, including mar_colon, mar_head, have value "0";

would any of you suggest what that happens?

FreelanceReinh · Posted 02-23-2016 03:15 PM

In the IF/THEN statement in your initial post you had a comma between the condition and "then". If it's this, the log should contain error messages.

data_null__ · Posted 02-23-2016 03:15 PM

You will need to show example program and data.

Bal23 · Posted 02-23-2016 04:10 PM

no comma, no error message

below is the sample code

if substr(dx{i},1,2) in ('77')

or substr(dx{i},1,6) in ('EYES -', 'EYES-G')
then do;
mar_eye=1;
end;

run;

proc contents data=have;

run; /*I realize I generate numerical variables, instead of categorical one that I prefer. Would you please let me know how to fix that as well?*/

proc means data=have;

var mar_:;

run;

data_null__ · Posted 02-23-2016 04:20 PM

Still too little information, for me to do anything but guess.

FreelanceReinh · Posted 02-23-2016 04:27 PM

There is no syntax error in your code provided that the array variables dx{i} are character variables.

You wrote that "all these variables" mar_xxx have value 0. This means that there must be a substantial issue which should be easy to find: Just print an arbitrary dx{i} value from your dataset without formatting (format _all_;) and then check why the corresponding IF condition is not met. (Possible reasons include leading blanks, upper/lower case mismatch, formatted vs. unformatted values, ...)

Bal23 · Posted 02-24-2016 09:36 AM

Thank you very much. I am sorry I do not understand it very well. Would you please explain a little bit more? Or give an example like a sample sas code>

I do find some notes from the log

Character values have been converted to numeric values at the places given by:
(Line):(Column).
2286:25 2653:12
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).

FreelanceReinh · Posted 02-24-2016 10:09 AM

These automatic conversion messages can be harmless in some cases and indicate serious issues in others. Therefore, I recommend to avoid them whenever possible by performing the conversions explicitly:

numeric to character typically by means of the PUT function (and a format as the second function argument)
character to numeric typically by means of the INPUT function (and an informat as the second function argument).

Regarding the ICD-9 codes, all you have shown so far are those IF conditions involving applications of the SUBSTR function to array elements dx{i}. Array elements, however, are valid references only during a data step where the array is defined. So, the real names of the variables you referred to as dx{i} could be dx1, dx2 etc., but could be completely different as well, e.g. icdcode_BL, icdcode_wk4, ....

Please just look up some of the said real names from the array statement, let us know of what type and length these variables are (see PROC CONTENTS output for the dataset the variables belong to) and print a few unformatted values of these variables:

proc print data=have(obs=5);
format _all_;
var dx1 dx2 icdcode_BL; /* or whatever the real names are */
run;

Bal23 · Posted 02-24-2016 10:21 AM

Proc contents

they are all numerical, with length of 8

proc print data=have(obs=5);
format _all_;
var dx1 dx2 icdcode_BL; /* or whatever the real names are */
run;

I did run this, the values of all those var are missing

Registration is open

SAS Training: Just a Click Away