Graphing 70-year smoking prevalence from data containing date ranges for smokers

Posts: 11

Hi,

I have a dataset (n=100,000+) with personID, gender, smoking status (Y/N/null), and the beginyear and endyear of the period during which each smoker smoked. I'm trying to structure the file so that I can do a frequency distribution on smoking status for every year between 1940 and 2009, and I'm thinking the easiest way to do that is to create a column for each year (1940, 1941, ..., 2009).

For each personID, I want to indicate the years they smoked using these rules. If smoking status is null, the value for every year column would be null. If smoking status is 'N', the value for every year column would be 'N'. If smoking status is 'Y', the value would be 'Y' for the years between beginyear and endyear (inclusive), 'N' for all years before beginyear, and null for all years after endyear. I'd like to run the frequency for all records, and also separately by gender. Can anyone help with code for this, or suggest a better approach to produce a graph of smoking prevalence for each year from 1940 to 2009? Thanks in advance.

Super User
Posts: 11,343

Re: Graphing 70-year smoking prevalence from data containing date ranges for smokers

Do NOT create a column for each year, especially if you want to graph the distribution. It is better to create one record for each year of smoking history. Then Year is available as an axis variable, and you can display frequencies, bar charts, or box plots for each year, which is a real pain to do if you have to reference 70 different variables.
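As a rough illustration of the one-record-per-year layout, here is a minimal sketch that applies the rules from the original question. It assumes an input data set called HAVE (a hypothetical name) with the variables personID, gender, smoke, beginyear and endyear as described in the question. The output name LONG and the variable STATUS are also made up for this example, and treating a smoker with a missing beginyear or endyear as unknown for every year is an assumption, not something the question specifies.

```sas
/* Sketch only: HAVE, LONG and STATUS are hypothetical names.          */
/* Assumes smoke is character ('Y', 'N' or blank) and beginyear and    */
/* endyear are numeric.                                                */
data long;
   set have;
   length status $1;
   do year = 1940 to 2009;
      if smoke = 'N' then status = 'N';
      else if smoke = 'Y' then do;
         /* Assumption: no usable dates means status is unknown */
         if missing(beginyear) or missing(endyear) then status = ' ';
         else if year < beginyear then status = 'N';   /* before starting */
         else if year <= endyear then status = 'Y';    /* smoking period  */
         else status = ' ';                            /* after endyear   */
      end;
      else status = ' ';                               /* smoke is blank  */
      output;
   end;
   keep personID gender year status;
run;

/* Row percentages give the share of Y and N within each year,   */
/* separately for each gender. Blank STATUS values are excluded  */
/* by default; add the MISSING option to count them.             */
proc freq data=long;
   tables gender*year*status / nocol nopercent;
run;
```

With 100,000+ people and 70 years this produces roughly 7 million short records, which should still be manageable for PROC FREQ.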

Please do not post "data" in Excel files. Many users here don't want to download Excel files because of the virus potential, and others have such files blocked by security software. Also, if you give us Excel we have to create a SAS data set ourselves, and because Excel places no constraints on what a cell can hold, the result we end up with may not have the same variable types (numeric or character), or even the same values, as yours.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

And what kind of graph are you envisioning?

Occasional Contributor
Posts: 11

Re: Graphing 70-year smoking prevalence from data containing date ranges for smokers

I'm sorry, but I have not been able to get the macro to work, and I am unable to attach a sample SAS data set. I am hoping someone will be able to respond based on the table below. It's a fairly simple dataset.

For data presentation, the output can be either a table with the % of smokers or a line graph showing the same. Thank you.

personID  gender  smoke  begin  end
8005      M
6305      F
7767      M
1435      M       N
8516      M       Y      1948   1966
9481      M
1659      F
2812      M
9282      F
9910      M
6619      M       Y      1963   1988
8120      F
5549      M
5567      F
7987      F
5012      F
7110      M

Super User
Posts: 11,343

Re: Graphing 70-year smoking prevalence from data containing date ranges for smokers

So how do you expect to include the PersonIDs that have neither Y nor N for the smoke variable?

What years do you expect an N for the smoke variable to represent? If we treat all of them as covering 1940 to 2009, we may well be including years in which that individual was not alive. Or including smoking in the cradle...

If your example data is representative of your full 100,000+ records, you appear to have year information for only about 12% of the records (2 of 17).

If you had a more complete data set I might do something like this to get one record per year:

data example;
   input personid $ sex $ smoke $ startyear endyear;
datalines;
1  F  Y  1960  1969
1  F  N  1970  1995
2  M  Y  1945  1986 
3  M  Y  1954  1973
3  M  N  1974  1998
4  F  Y  1958  1988
5  M  Y  1961  1989
;
run;

/* Expand each period into one record per year */
data want;
   set example;
   do year=startyear to endyear;
      output;
   end;
   drop startyear endyear;
run;

/* Cross-tabulate smoking status by year */
proc freq data=want;
   tables year*smoke;
run;

This would allow a number of different types of analysis. You could also create graphs from the data summarized by year, using the counts or percentages of interest.
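One possible way to get the yearly percentages and a line graph from WANT is sketched below. It reuses the variables year, sex and smoke from the example above. The names PREV_BY_SEX, N_RECORDS and PCT_SMOKER are invented for this illustration, and the percentage is computed only over the records that exist for a given year, so it is only as meaningful as the underlying history is complete.

```sas
/* Sketch only: PREV_BY_SEX, N_RECORDS and PCT_SMOKER are hypothetical names. */
/* (smoke = 'Y') evaluates to 1 or 0, so its mean is the share of             */
/* smoking records among all records for that year and sex.                   */
proc sql;
   create table prev_by_sex as
   select year,
          sex,
          count(*)                 as n_records,
          100 * mean(smoke = 'Y')  as pct_smoker
   from want
   group by year, sex
   order by sex, year;
quit;

/* One line per sex, year on the horizontal axis */
proc sgplot data=prev_by_sex;
   series x=year y=pct_smoker / group=sex;
   yaxis label="Percent smoking";
run;
```

Dropping sex from the SELECT and GROUP BY clauses, and the GROUP= option from the SERIES statement, would give a single overall line instead.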
