Hello all! I am a new user to SAS and was trying to manipulate a dataset to produce tables with prevalence rates.
I have several variables like disease status (1,0), age (binned into categories), location (various sites), and year (binned by calender year). I was trying to create a single variable that would represent the prevalence rate so I could easily produce 2x2 tables with prevalence rate by various other vars (year, age, location etc).
In my data step I was trying to do this the following way:
data prevalence;
set work.import;
length Hivstatus $ 10;
if hivpos =1 then Hivstatus = "Positive";
if hivpos =0 then Hivstatus = "Negtive";
if hivpos =1 then Hivpositive = "HIV Positive";
if hivpos =0 then Hivnegative = "HIV Negative";
if hivpos in (0,1) then HIVtotal = "All Cases";
if hivpos = 1 then HIVcases = "HIV Positive Cases";
HIVprevelence = HIVcases/HIVtotal;
It ends up producing a variable "HIVprevelence" where all the data are listed as "missing". Im not sure if there the code is incorrect or if I am thinking about doing this the wrong way. It very difficult to get prevalence rates with multiple variables using proc freq.
Any help appreciated!
data prevalence;
set work.import;
length Hivstatus $ 15;
if hivpos =1 then Hivstatus = "Positive";
if hivpos =0 then Hivstatus = "Negtive";
if hivpos =1 then Hivpositive = "HIV Positive";
if hivpos =0 then Hivnegative = "HIV Negative";
if hivpos in (0,1) then HIVtotal = "All Cases";
if hivpos = 1 then HIVcases = "HIV Positive Cases";
HIVprevelence = HIVcases/HIVtotal;
run;
You try to do a numeric operation with character values. Maxim 2: Read Your Log. You will find the NOTE's about conversion and invalid data.
I guess(!) you want to sum all cases vs. all positive cases:
proc sql;
select
sum(hivpos) as hivcases,
count(hivpos) as hivtotal,
(calculated hivcases / calculated hivtotal) as hivprevalence
from work.import;
quit;
(assuming that hivpos is either zero, one or missing)
or, in a data step
data want;
set have end=done;
retain hivcases hivtotal 0;
hivtotal + 1;
if hivpos = 1 then hivcases + 1;
if done;
hivprevalence = hivcases / hivtotal;
keep hivcases hivtotal hivprevalence;
run;
Hey @KurtBremser thanks so much! That was super helpful - I was able to get things working with some minor adjustments to my code.
I am now have three new vars:
hivcases hivtotal hivprevalence
I removed the keep statement so that they were added to my existing dataset.
I am having trouble getting the output I intended now, initially I was thinking something like
Proc freq; tables hivprevelence*timeyear; run;
which would give me a table with the total prevalence for the disease by year. However, obviously now hivprevelence is not a 1/0 variable and so I used the proc univariate step instead.
proc univariate data=work.import; class timeyear; var hivprevalence; run;
This gives me the mean for hivprevelence (which is the prevalence) but it is buried in a mountain of other stuff that is not very useful. I was looking to build an output that is more flexible and table oriented (something more like pivot tables in excel). I tried adding the freq option to proc univariate but that didn't get me too far.
Here is a basic example of the types of tables I am thinking of generating:
Thank you for helping further my learning! Its much appreciated!
Please post a sample of your import dataset in a data step with datalines, so we can readily use it, and what you expect to get out of that data.
Hey sure - I just made up some code to reflect the variables of interest related to figuring out the prevalence:
data Prevalence;
input date age hivpos location $ ;
datalines;
09/27/19 56 1 A
03/12/15 92 0 B
12/05/16 39 1 D
;
I have categorized some of the variables as follows:
if year(date) = 2014 then Year = "2014";
else if year(date) = 2015 then Year = "2015";
else if year(date) = 2016 then Year = "2016";
else if year(date) = 2017 then Year = "2017";
else if year(date) = 2018 then Year = "2018";
else if year(date) = 2019 then Year= "2019";
length matage_category $15;
if . < age < 20 then age_category = "<20";
else if 20 <= age <= 40 then age_category = "20 to 40";
else if 40 < age <= 60 then age_category = "41 to 60";
else if age > 60 then age_category = ">61";
else if age = . then age_category = "Unknown";
Proc Freq will give me tables with frequencies by age_category / time year / location but divides the tables into HIV=1 and HIV=0 so in principle I could just divide the the cells in each table to get the prevalence however I thought there must be an easy way to internalize this type of calculation. So using the code from the previous post I created a new variable "hivprevalence" which I want to display in tables by the stratified variables above so the output looks something like:
I tried using:
proc means data=prevelence n mean; Class year age_category; var hivprevalence; run;
This seemed to get things close - but the hivprevelence calculation is not correct. It dosent seem to be taking the number of hivpos=1 cases and dividing by the total hivpos=0 cases for each cell in the table. Any thoughts / help would be greatly appreciated!
Anyone have any suggestions or thoughts on this?
Much appreciated!
First, use formats on your variables for grouping:
proc format;
value agebin
. = 'Unknown'
0 -< 20 = "<20"
20 - 40 = "20 to 40"
40 <- 60 = "41 to 60"
60 <- high = ">61"
;
run;
data Prevalence;
input date age hivpos location $ ;
format date year4. age agebin.;
datalines;
09/27/19 56 1 A
03/12/15 92 0 B
12/05/16 39 1 D
;
These groups will take effect if the variables are used as CLASS in procedures.
Or you use formatted values as GROUP BY in a SQL SELECT.
If you provide a larger dataset where sums and counts per group make sense, and the expected output, I can test the code against that.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.