Hello all! I am a new user to SAS and was trying to manipulate a dataset to produce tables with prevalence rates.
I have several variables like disease status (1,0), age (binned into categories), location (various sites), and year (binned by calender year). I was trying to create a single variable that would represent the prevalence rate so I could easily produce 2x2 tables with prevalence rate by various other vars (year, age, location etc).
In my data step I was trying to do this the following way:
data prevalence;
set work.import;
length Hivstatus $ 10;
if hivpos =1 then Hivstatus = "Positive";
if hivpos =0 then Hivstatus = "Negtive";
if hivpos =1 then Hivpositive = "HIV Positive";
if hivpos =0 then Hivnegative = "HIV Negative";
if hivpos in (0,1) then HIVtotal = "All Cases";
if hivpos = 1 then HIVcases = "HIV Positive Cases";
HIVprevelence = HIVcases/HIVtotal;
It ends up producing a variable "HIVprevelence" where all the data are listed as "missing". Im not sure if there the code is incorrect or if I am thinking about doing this the wrong way. It very difficult to get prevalence rates with multiple variables using proc freq.
Any help appreciated!
data prevalence;
set work.import;
length Hivstatus $ 15;
if hivpos =1 then Hivstatus = "Positive";
if hivpos =0 then Hivstatus = "Negtive";
if hivpos =1 then Hivpositive = "HIV Positive";
if hivpos =0 then Hivnegative = "HIV Negative";
if hivpos in (0,1) then HIVtotal = "All Cases";
if hivpos = 1 then HIVcases = "HIV Positive Cases";
HIVprevelence = HIVcases/HIVtotal;
run;
You try to do a numeric operation with character values. Maxim 2: Read Your Log. You will find the NOTE's about conversion and invalid data.
I guess(!) you want to sum all cases vs. all positive cases:
proc sql;
select
sum(hivpos) as hivcases,
count(hivpos) as hivtotal,
(calculated hivcases / calculated hivtotal) as hivprevalence
from work.import;
quit;
(assuming that hivpos is either zero, one or missing)
or, in a data step
data want;
set have end=done;
retain hivcases hivtotal 0;
hivtotal + 1;
if hivpos = 1 then hivcases + 1;
if done;
hivprevalence = hivcases / hivtotal;
keep hivcases hivtotal hivprevalence;
run;
Hey @KurtBremser thanks so much! That was super helpful - I was able to get things working with some minor adjustments to my code.
I am now have three new vars:
hivcases hivtotal hivprevalence
I removed the keep statement so that they were added to my existing dataset.
I am having trouble getting the output I intended now, initially I was thinking something like
Proc freq; tables hivprevelence*timeyear; run;
which would give me a table with the total prevalence for the disease by year. However, obviously now hivprevelence is not a 1/0 variable and so I used the proc univariate step instead.
proc univariate data=work.import; class timeyear; var hivprevalence; run;
This gives me the mean for hivprevelence (which is the prevalence) but it is buried in a mountain of other stuff that is not very useful. I was looking to build an output that is more flexible and table oriented (something more like pivot tables in excel). I tried adding the freq option to proc univariate but that didn't get me too far.
Here is a basic example of the types of tables I am thinking of generating:
Thank you for helping further my learning! Its much appreciated!
Please post a sample of your import dataset in a data step with datalines, so we can readily use it, and what you expect to get out of that data.
Hey sure - I just made up some code to reflect the variables of interest related to figuring out the prevalence:
data Prevalence;
input date age hivpos location $ ;
datalines;
09/27/19 56 1 A
03/12/15 92 0 B
12/05/16 39 1 D
;
I have categorized some of the variables as follows:
if year(date) = 2014 then Year = "2014";
else if year(date) = 2015 then Year = "2015";
else if year(date) = 2016 then Year = "2016";
else if year(date) = 2017 then Year = "2017";
else if year(date) = 2018 then Year = "2018";
else if year(date) = 2019 then Year= "2019";
length matage_category $15;
if . < age < 20 then age_category = "<20";
else if 20 <= age <= 40 then age_category = "20 to 40";
else if 40 < age <= 60 then age_category = "41 to 60";
else if age > 60 then age_category = ">61";
else if age = . then age_category = "Unknown";
Proc Freq will give me tables with frequencies by age_category / time year / location but divides the tables into HIV=1 and HIV=0 so in principle I could just divide the the cells in each table to get the prevalence however I thought there must be an easy way to internalize this type of calculation. So using the code from the previous post I created a new variable "hivprevalence" which I want to display in tables by the stratified variables above so the output looks something like:
I tried using:
proc means data=prevelence n mean; Class year age_category; var hivprevalence; run;
This seemed to get things close - but the hivprevelence calculation is not correct. It dosent seem to be taking the number of hivpos=1 cases and dividing by the total hivpos=0 cases for each cell in the table. Any thoughts / help would be greatly appreciated!
Anyone have any suggestions or thoughts on this?
Much appreciated!
First, use formats on your variables for grouping:
proc format;
value agebin
. = 'Unknown'
0 -< 20 = "<20"
20 - 40 = "20 to 40"
40 <- 60 = "41 to 60"
60 <- high = ">61"
;
run;
data Prevalence;
input date age hivpos location $ ;
format date year4. age agebin.;
datalines;
09/27/19 56 1 A
03/12/15 92 0 B
12/05/16 39 1 D
;
These groups will take effect if the variables are used as CLASS in procedures.
Or you use formatted values as GROUP BY in a SQL SELECT.
If you provide a larger dataset where sums and counts per group make sense, and the expected output, I can test the code against that.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.