Hi!
I'm the new user of Base SAS Programming.
I have got a stuck at a problem.
Question: Which Year Maximum standard deviation from the mean compared to the standard deviation from the mean whole dataset.
Note: This has to be done only using proc tabulate instead of using proc means, proc freq, proc sql or marco.
data xyz;
infile datalines missover turncate;
input year casualties @@;
datalines;
1980 1 1989 23
1981 3 1990 103
1985 45 1991 56
1980 85 1983 78
1989 23 1983 85
1990 21 1980 56
1984 22 1984 35
1989 56 1981 14
....
1985 26 1989 12
;
/* individual calculation */
proc tabulate data=xyz out=abc;
class year;
var casualties;
tables year, casualties*std;
run;
/* whole dataset calculation*/
proc tabulate data=xyz out=pqr;
var casualties;
tables casualties*std;
run;
I don't know what to do next to compare these two datasets (abc or pqr) to get my final required output. I need your assistance.
Hi:
Well, here's what TABULATE would give on a limited number of years:
The line with "All" above is the overall for these years. 2006 and 2009 and a few other years are higher than the overall number. What would you envision as the FINAL report? I don't understand what you mean when you say you: "want a report that shows the comparison of each year std to the overall Std and also display the greater values in the output."
What would your "comparison of each year to the overall" look like? Is there a formula, how would this look in a small table such as the one I show above, where you can see a few years and then see the overall. And, "display the greater values in the output" which greater values? Only show the highest std values?
Cynthia
Hi:
I'm not sure you need 2 PROC TABULATE steps. You could just do this:
tables year all='Overall', casualties*std;
which would put the overall STD at the bottom of the rows for each year.
But I don't understand what your desired output is. Do you want a report that shows the comparison of each year to the overall STD?
Also, I would not expect that you get any output at all, because I would expect you are seeing this error message in the log:
ERROR 23-2: Invalid option name TURNCATE.
because "turncate" is not a valid option for the INFILE statement.
Cynthia
Hi:
Well, here's what TABULATE would give on a limited number of years:
The line with "All" above is the overall for these years. 2006 and 2009 and a few other years are higher than the overall number. What would you envision as the FINAL report? I don't understand what you mean when you say you: "want a report that shows the comparison of each year std to the overall Std and also display the greater values in the output."
What would your "comparison of each year to the overall" look like? Is there a formula, how would this look in a small table such as the one I show above, where you can see a few years and then see the overall. And, "display the greater values in the output" which greater values? Only show the highest std values?
Cynthia
Yes, Only shows the highest std values with their respective year.
I posting my actual sas code may you get more idea what I'm looking for.
options dlcreatedir;
proc options option=dlcreatedir ;
run;
filename File "/folders/myfolders/raw_files/Airplane_Crashes_and_Fatalities_Since_1908.csv" ;
libname Lib "/folders/myfolders/DataSet";
/*Fetching Data from external file*/
data Lib.Airplanes(label="DETAILS:Airplanes Crashes since 1908");
infile File missover dsd lrecl=32767 firstobs=2 dlm=',' ;
input Date mmddyy10. +1 Time:$10. Location:$100. Operator: $50.
Flight: $50. Route: $100. Plane_Type: $100. Registration: $20. cn_In:$20. Aboard
Fatalities Ground Summary: $1000. ;
format Date mmddyy10.;
/* Extracting Year, Month and Day from Date */
Year=year(Date);
/* Trimming route */
Route=trim(Route);
/* Calculating Survivors */
Survivor=Aboard-Fatalities;
/* Calculating Decade */
Decade=floor(Year/10)*10;
drop Date;
run;
/***** SORTING AIRPLANES DATASET IN ASCENDING ORDER ******/
proc sort data=Lib.Airplanes;
by Year;
run;
/*//-----------------------CASUALTIES--------------// */
/*
* QUESTION:1) YEAR WHICH HAD THE MEAN NUMBER OF CRASHES CLOSEST TO MEDIAN NUMBER OF
* CRASHES ?
*/
/************** YEARS ********************/
ods select none; *NO PRINT FOR LINE 41 STATEMENT;
proc tabulate data=Lib.Airplanes out=WORK.YearlyCrashes( rename=(N=Crash)) ;
class Year;
table Year,N;
run;
ods select all; * PRINT FOR LINE 46 STATEMENT ;
proc tabulate data=WORK.YearlyCrashes;
var Crash Year ;
table (Year*Mean=''*F=4.)(Crash='No. of plane Crashes'*Mean*F=4.)(Year*Median=''*F=4.)
(Crash='No. of plane Crashes'*Median*F=4.)/rts=20.;
title3 color=RED bold 'Year which had the Mean number of crashes closest to Median number of
crashes';
run;
/* QUESTION :2) WHICH YEAR HAD THE MAXIMUM STANDARD DEVIATION FROM THE MEAN COMPARED TO THE STANDARD
* DEVIATION FROM THE MEAN WHOLE DATASET?
*/
proc tabulate data=Lib.Airplanes out=Work.Std(rename= Fatalities_std=SYstd);
var Fatalities;
class Year;
tables (Year='' All='Overall'),Fatalities='Causalties'* Std='Standard Deviation'*F=5./box='Year';
run;
/***** calculating std(individual)>std(overall) *********/
data Work.stdfinal;
set Work.std;
retain std_001 Year_00 0;
if(SYstd > last.SYstd) then
do;
std_001=SYstd;
Year_001=Year;
end;
keep std_001 Year_001;
run;
/******** printing yearly standard deviation(s) > overall standard deviations*******/
proc tabulate data=work.stdfinal;
class Year_001;
var std_001;
tables Year_001='', std_001='highest Standard Deviation'*sum=''*F=3./box='Year';
title color=RED bold ' YEAR WHICH HAD THE MAXIMUM STANDARD DEVIATION FROM THE MEAN COMPARED TO THE STANDARD DEVIATION FROM THE MEAN WHOLE DATASET';
run;
@Saurabh291989 wrote:
Yes, Only shows the highest std values with their respective year.
I posting my actual sas code may you get more idea what I'm looking for.
Code that does not actually work used to tell us "more idea" of what you are looking for still requires mind reading because we would still be guessing at what you might want.
Pick a small example that you can do by hand, show the input data and the actual desired result and whether the result should be data set (to be read by machines for further processing) or a report to be read by people.
Hi:
In this output, there are report rows for 6 years, where the values are higher than the overall amount? Now what?
What would you envision seeing next? Would you want to have PROC TABULATE highlight these cells automatically (with a user-defined format) instead of doing it manually? Would you ONLY want to see the highlighted rows on the report where the values are higher than the overall value? I still don't understand 1) what you want the final report to look like and 2) how you want the overall number to be used in determining what rows should appear on the report.
cynthia
yes, Cynthia, I want only those rows for the years that are highlighted in yellow for the report.
Hi:
Based on the code you posted, I don't understand your use of "last.SYstd" in the DATA step or what that logic is doing. I would expect you to get a NOTE in the log that NOTE: Variable last.SYstd is uninitialized. and that the program wouldn't work as you desire.
Since you have to use a DATA step to generate your STDFinal data, with only the rows you want, I'd probably have taken a different approach and just used a DATA step to make a Macro variable to hold the value of the the overall std and then use that to create WORK.STDFinal.
Do you want to see the final TABULATE with a difference between the overall the the year? Or see the years appearing in descending order from greatest STD to lowest, for the rows selected? Just some other ideas.
Cynthia
Hi: For last. to work, you have to be using BY group processing, but I don't see that you need BYGroup processing here.
Given the data that you posted, a simple MACRO variable should allow you to create a report like this:
This is the approach I would take, as shown here:
Notice that the output is report #6. I did a few extra steps along the way, to illustrate various approaches and outputs.
Here's the final code that creates all the outputs.
** read data;
data xyz;
length date 8 time $10 location $100 operator $50 flt_num $50
route $100 type $100 registration $20 cn_In $20
aboard casualties ground 8 summary $1000;
infile "<path to file>\Airplane_Crashes_and_Fatalities_Since_1908.csv" dlm=',' dsd firstobs=2;
input Date : anydtdte. time $ location $ operator $ flt_num $
Route $ Type $ Registration $ cn_In aboard casualties ground summary $;
if date gt . then year = year(date);
Survivor=Aboard-casualties;
Decade=floor(Year/10)*10;
format Date mmddyy10.;
run;
title;
/* individual calculation */
proc tabulate data=xyz out=work.std;
title '1) Original TABULATE makes WORK.STD';
class year;
var casualties;
tables year all, casualties*std / box='Individual Calculation';
run;
title;
proc print data=work.std;
title '2) Data WORK.STD _TYPE_ =1 are year std and _TYPE_ = 0 is overall';
run;
** make a macro variable that represents the overall std;
data _null_;
set work.std;
if _type_ = '0' then do;
call symputx('ovstd',put(casualties_std,8.4),'G');
end;
run;
%put The overall STD is &ovstd;
** now make a format to highlight cells gt overall;
proc format;
value c_std &ovstd-high = 'yellow'
other = 'white';
run;
** repeat original tabulate, and just highlight the years gt overall;
proc tabulate data=xyz f=8.4;
title '3) Highlight desired values to show on final report';
title2 'Show all rows, first highlighting starts in year 1972';
class year;
var casualties;
tables year all='Overall',
casualties*std=' '*{s={background=c_std.}}
/ box='Individual Calculation';
run;
** create WORK.STDFINAL from WORK.STD from only _TYPE_=1 rows;
** where casualties_std > overall std;
** using the macro variable created from WORK.STD;
data Work.stdfinal;
set Work.std;
where _type_ = '1';
retain overall &ovstd;
if casualties_std > overall then do;
difference = casualties_std - overall;
output;
end;
run;
proc print data=work.stdfinal;
title '4) After using Macro Variable for Overall STD';
var year casualties_std overall difference;
run;
** FINAL tabulate, show only the years and the difference;
proc tabulate data=work.stdfinal f=8.4;
title1 color=RED bold
"5) STANDARD DEVIATION FOR DATASET WAS: &ovstd";
title2 color=RED bold
'YEARS WHICH HAD THE MAXIMUM STANDARD DEVIATION FROM THE MEAN COMPARED TO THE STANDARD DEVIATION FROM THE MEAN WHOLE DATASET';
class Year ;
var casualties_std difference;
tables Year=' ',
casualties_std='highest Standard Deviation'*sum=' ' difference*sum=' '
/box='Year';
run;
** show report in descending order of casualties_std value;
proc sort data=stdfinal;
by descending casualties_std;
run;
proc tabulate data=work.stdfinal f=8.4;
title1 color=RED bold
"6) STANDARD DEVIATION FOR DATASET WAS: &ovstd";
title2 color=RED bold
'YEARS WHICH HAD THE MAXIMUM STANDARD DEVIATION FROM THE MEAN COMPARED TO THE STANDARD DEVIATION FROM THE MEAN WHOLE DATASET';
class Year /order=data;
var casualties_std difference;
tables Year=' ',
casualties_std='highest Standard Deviation'*sum=' ' difference*sum=' '
/box='Year';
run;
title;
Hope this helps,
Cynthia
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.