Hi all,
I have a question about the "Challenger" practice in SAS Programming 1, Lesson 5 - Analyzing and Reporting on Data.
The image below shows the solution to the "Challenger" practice of the topic "Creating Summary Reports and Data".
My question:
Why didn't the PROC MEANS step create output with the top 3 parks grouped by REGION and YEAR?
The value 188594 is the third-highest number of park visitors in the Alaska region, and it comes from June 2010.
If I wanted to sum the total visitors by YEAR and REGION, what would the code look like?
Regards,
Siroo
Accepted Solutions
The value of MONTH is not considered at all by the PROC, since you never referenced it.
It is using the values of VISITORS and PARKNAME from the three observations with the largest values of VISITORS within each group of observations defined by the combination of REGION and YEAR.
The only reason it looks like it has anything to do with months is that your dataset has a MONTH variable to distinguish the multiple observations per region and year combination.
If you had daily counts instead of monthly counts (so 365 observations per region per year instead of just 12), then the top 3 would be the top daily counts.
If you want to see the month that corresponds to the values of VISITORS that you are outputting, add it to the list of variables to select.
idgroup(max(Visitors) out[3] (Visitors ParkName Month)=)
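For context, here is a sketch of how that option might sit in the complete step. It assumes the course table pg1.np_multiyr really does contain a Month variable, as described in this thread. The SUM= statistic on the same OUTPUT statement gives the total visitors for each Region and Year combination, which is separate from the top 3 individual observations that IDGROUP returns.

```sas
/* Sketch only: assumes pg1.np_multiyr has Region, Year, Month, ParkName, Visitors */
proc means data=pg1.np_multiyr noprint;
  var Visitors;
  class Region Year;
  ways 2;                                  /* keep only the Region*Year combinations */
  output out=top3parks(drop=_freq_ _type_)
    sum=TotalVisitors                      /* total visitors per Region and Year */
    idgroup(max(Visitors) out[3] (Visitors ParkName Month)=);
run;

/* Inspect the result: one row per Region and Year */
proc print data=top3parks;
run;
```

Because no names follow the equal sign, the output columns keep their original names with a numeric suffix, for example Visitors_1 to Visitors_3 and Month_1 to Month_3.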
The whole point of IDGROUP is to let you output some of the individual values behind the aggregate values that the normal options on the OUTPUT statement create. From your description, it sounds like the third-largest number of visitors to the Alaska region during 2010 occurred in June.
Here is an example you can run that does not require the datasets from that course.
The CLASS statement will group the data by SEX, and the two IDGROUP options will get some of the detail information for the two tallest and two shortest students in each group.
proc summary data=sashelp.class nway ;
class sex;
var height ;
output out=summary max=max min=min
idgroup (max(height) out[2] (name height)=tall_name tall_height )
idgroup (min(height) out[2] (name height)=short_name short_height )
;
run;
Obs  Sex  _TYPE_  _FREQ_  max   min   tall_name_1  tall_name_2  tall_height_1  tall_height_2  short_name_1  short_name_2  short_height_1  short_height_2
1    F    1       9       66.5  51.3  Mary         Barbara      66.5           65.3           Joyce         Louise        51.3            56.3
2    M    1       10      72.0  57.3  Philip       Alfred       72.0           69.0           James         Thomas        57.3            57.5
So you can see that the tallest boy is Philip and the shortest girl is Joyce. But you can also see that the second-shortest boy is Thomas and the second-tallest girl is Barbara.
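To make it clearer what the IDGROUP option with MAX is selecting, the same "top 2 per group" idea can be sketched with a sort and a DATA step. This is only an illustration of the concept, not what PROC SUMMARY does internally. The dataset names sorted and top2 and the variable rank are made up for the example, and tied heights might be ordered differently than IDGROUP would order them.

```sas
/* Order each sex from tallest to shortest */
proc sort data=sashelp.class out=sorted;
  by sex descending height;
run;

/* Keep the first two observations in each sex group */
data top2;
  set sorted;
  by sex;
  if first.sex then rank = 0;   /* restart the counter for each group */
  rank + 1;                     /* sum statement: rank is retained across rows */
  if rank <= 2;                 /* subsetting IF keeps the two tallest per group */
  keep sex name height rank;
run;

proc print data=top2;
run;
```

The result is in long form (one row per selected student) rather than the wide form that IDGROUP produces, but it picks whole observations by their HEIGHT value in the same way, so any other variable on those observations comes along with them.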
Hi Tom,
Let's assume the dataset comes with the variables REGION, MONTH, YEAR, PARKNAME and VISITORS.
The code below returns the top 3 visitor counts by REGION, YEAR and MONTH.
proc means data=pg1.np_multiyr noprint;
var Visitors;
class Region Year;
ways 2;
output out=top3parks(drop=_freq_ _type_) sum=TotalVisitors
idgroup(max(Visitors) out[3] (Visitors ParkName)=);
run;
SAS output from running the above code:
According to the raw data below, 193,116 is the visitor count for Alaska in the 8th month of 2010.
I am just wondering: since the code only classifies VISITORS by REGION and YEAR, why would MONTH be considered when there is no MONTH variable in the code?