BookmarkSubscribeRSS Feed
u49592411
Calcite | Level 5

Hello all! I am a new user to SAS and was trying to manipulate a dataset to produce tables with prevalence rates. 

 

I have several variables like disease status (1,0), age (binned into categories), location (various sites), and year (binned by calender year). I was trying to create a single variable that would represent the prevalence rate so I could easily produce 2x2 tables with prevalence rate by various other vars (year, age, location etc). 

 

In my data step I was trying to do this the following way: 

 

data prevalence;
set work.import;


length Hivstatus $ 10;
if hivpos =1 then Hivstatus = "Positive";
if hivpos =0 then Hivstatus = "Negtive";

 

if hivpos =1 then Hivpositive = "HIV Positive";
if hivpos =0 then Hivnegative = "HIV Negative";

 

if hivpos in (0,1) then HIVtotal = "All Cases";
if hivpos = 1 then HIVcases = "HIV Positive Cases";

 

HIVprevelence = HIVcases/HIVtotal; 

 

 

It ends up producing a variable "HIVprevelence" where all the data are listed as "missing". Im not sure if there the code is incorrect or if I am thinking about doing this the wrong way. It very difficult to get prevalence rates with multiple variables using proc freq. 

 

Any help appreciated!

 

 

data prevalence;
set work.import;

length Hivstatus $ 15;
if hivpos =1 then Hivstatus = "Positive";
if hivpos =0 then Hivstatus = "Negtive";

if hivpos =1 then Hivpositive = "HIV Positive";
if hivpos =0 then Hivnegative = "HIV Negative";

if hivpos in (0,1) then HIVtotal = "All Cases";
if hivpos = 1 then HIVcases = "HIV Positive Cases";

HIVprevelence = HIVcases/HIVtotal; 

run; 

 

7 REPLIES 7
Kurt_Bremser
Super User

You try to do a numeric operation with character values. Maxim 2: Read Your Log. You will find the NOTE's about conversion and invalid data.

 

I guess(!) you want to sum all cases vs. all positive cases:

proc sql;
select
  sum(hivpos) as hivcases,
  count(hivpos) as hivtotal,
  (calculated hivcases / calculated hivtotal) as hivprevalence
from work.import;
quit;

(assuming that hivpos is either zero, one or missing)

or, in a data step

data want;
set have end=done;
retain hivcases hivtotal 0;
hivtotal + 1;
if hivpos = 1 then hivcases + 1;
if done;
hivprevalence = hivcases / hivtotal;
keep hivcases hivtotal hivprevalence;
run;
u49592411
Calcite | Level 5

Hey @KurtBremser thanks so much! That was super helpful - I was able to get things working with some minor adjustments to my code. 

 

I am now have three new vars: 

 

hivcases hivtotal hivprevalence

 

 I removed the keep statement so that they were added to my existing dataset. 

 

I am having trouble getting the output I intended now, initially I was thinking something like 

Proc freq; 
tables hivprevelence*timeyear; 
run; 

which would give me a table with the total prevalence for the disease by year. However, obviously now hivprevelence is not a 1/0 variable and so I used the proc univariate step instead. 

 

proc univariate data=work.import; 
class timeyear;
var hivprevalence;
run; 

This gives me the mean for hivprevelence (which is the prevalence) but it is buried in a mountain of other stuff that is not very useful. I was looking to build an output that is more flexible and table oriented (something more like pivot tables in excel). I tried adding the freq option to proc univariate but that didn't get me too far. 

 

Here is a basic example of the types of tables I am thinking of generating: 

 

Screen Shot 2020-08-30 at 5.20.05 PM.png

 

Thank you for helping further my learning! Its much appreciated!

Kurt_Bremser
Super User

Please post a sample of your import dataset in a data step with datalines, so we can readily use it, and what you expect to get out of that data.

u49592411
Calcite | Level 5

Hey sure - I just made up some code to reflect the variables of interest related to figuring out the prevalence: 

 

data Prevalence;
   input date age hivpos location $ ;
   datalines;
09/27/19 56 1 A 
03/12/15 92 0 B
12/05/16 39 1 D
;

I have categorized some of the variables as follows: 
if year(date) = 2014 then Year = "2014";
else if year(date) = 2015 then Year = "2015";
else if year(date) = 2016 then Year = "2016";
else if year(date) = 2017 then Year = "2017";
else if year(date) = 2018 then Year = "2018";
else if year(date) = 2019 then Year= "2019";

length matage_category $15;
   if      . <  age <  20 then age_category = "<20";
   else if 20 <= age <=  40 then age_category = "20 to 40";
   else if 40   < age <=  60 then age_category = "41 to 60";
   else if age > 60 then age_category = ">61";
   else if        age = . then age_category = "Unknown";

 

Proc Freq will give me tables with frequencies by age_category / time year / location but divides the tables into HIV=1 and HIV=0 so in principle I could just divide the the cells in each table to get the prevalence however I thought there must be an easy way to internalize this type of calculation. So using the code from the previous post I created a new variable "hivprevalence" which I want to display in tables by the stratified variables above so the output looks something like: 

Screen Shot 2020-09-01 at 5.11.21 PM.png

I tried using: 

proc means data=prevelence n mean;
Class year age_category;
var hivprevalence;
run; 

This seemed to get things close - but the hivprevelence calculation is not correct. It dosent seem to be taking the number of hivpos=1 cases and dividing by the total hivpos=0 cases for each cell in the table. Any thoughts / help would be greatly appreciated!

 

 

 

u49592411
Calcite | Level 5

Anyone have any suggestions or thoughts on this?

 

Much appreciated!

 

 

Kurt_Bremser
Super User

First, use formats on your variables for grouping:

proc format;
value agebin
  . = 'Unknown'
  0 -< 20 = "<20"
  20 - 40 = "20 to 40"
  40 <- 60 = "41 to 60"
  60 <- high = ">61"
;
run;

data Prevalence;
input date age hivpos location $ ;
format date year4. age agebin.;
datalines;
09/27/19 56 1 A 
03/12/15 92 0 B
12/05/16 39 1 D
;

These groups will take effect if the variables are used as CLASS in procedures.

Or you use formatted values as GROUP BY in a SQL SELECT.

 

If you provide a larger dataset where sums and counts per group make sense, and the expected output, I can test the code against that.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3139 views
  • 0 likes
  • 2 in conversation