Solved: Re: Calculating Average while grouping data?

Krueger · Posted 01-28-2020 12:25 PM

I'm trying to figure out how to group the below data so that I can get the Average number of ID's for each time period based on the Week Week (Mon-Sun). The example below is only Wednesday as I was trying to keep this simple while testing out solutions.

DATA Have;
	INPUT LOCATION ID $2. Day:MMDDYY9. Month:Date8. Hour:time.;
	FORMAT Day WeekDate3. Month Date8. Hour timeampm5.;
	DATALINES;
1 A 7/3/2019  1-Jul-19 08:00
1 A 7/3/2019  1-Jul-19 09:00
1 A 7/3/2019  1-Jul-19 10:00
1 A 7/3/2019  1-Jul-19 11:00
1 A 7/3/2019  1-Jul-19 12:00
1 A 7/10/2019 1-Jul-19 08:00
1 A 7/10/2019 1-Jul-19 09:00
1 A 7/10/2019 1-Jul-19 10:00
1 A 7/10/2019 1-Jul-19 11:00
1 A 7/10/2019 1-Jul-19 12:00
1 A 7/17/2019 1-Jul-19 08:00
1 A 7/17/2019 1-Jul-19 09:00
1 A 7/17/2019 1-Jul-19 10:00
1 A 7/17/2019 1-Jul-19 11:00
1 A 7/17/2019 1-Jul-19 12:00
1 B 7/3/2019  1-Jul-19 08:00
1 B 7/3/2019  1-Jul-19 09:00
1 B 7/3/2019  1-Jul-19 10:00
1 B 7/3/2019  1-Jul-19 11:00
1 B 7/3/2019  1-Jul-19 12:00
1 B 7/10/2019 1-Jul-19 08:00
1 B 7/10/2019 1-Jul-19 09:00
1 B 7/10/2019 1-Jul-19 10:00
1 B 7/10/2019 1-Jul-19 11:00
1 B 7/10/2019 1-Jul-19 12:00
1 B 7/17/2019 1-Jul-19 08:00
1 B 7/17/2019 1-Jul-19 09:00
1 B 7/17/2019 1-Jul-19 10:00
1 B 7/17/2019 1-Jul-19 11:00
1 B 7/17/2019 1-Jul-19 12:00
1 C 7/03/2019 1-Jul-19 08:00
1 C 7/03/2019 1-Jul-19 09:00
1 C 7/10/2019 1-Jul-19 08:00
1 C 7/10/2019 1-Jul-19 09:00
1 C 7/17/2019 1-Jul-19 08:00
1 C 7/17/2019 1-Jul-19 09:00
;
RUN;
	
DATA WANT;
	INPUT LOCATION Day Hour:time. AVG;
	FORMAT hour timeampm5.;
	DATALINES;
1 WED 08:00 3
1 WED 09:00 3
1 WED 10:00 2
1 WED 11:00 2
1 WED 12:00 2
;
RUN;

Below is one of my latest attempts which I think(?) will work (once I fix the error) but I'm not sure how to implement another step for this "fix".

PROC REPORT DATA = have;
2167  title attempt;
2168  column location day hour id;
2169  define location / group;
2170  define day / group;
2171  define hour / group;
2172  define id / analysis mean;
2173  run;

ERROR: id is an ANALYSIS variable but not numeric.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE REPORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

I understand ID is a character. What I don't know how to do is to group this to get an accurate SUM out of these or maybe I just need to convert it to numeric? I'm not sure. Any help would be appreciated.

Sajid01 · Posted 01-28-2020 03:12 PM

I have updated it. It gives the average number of occurrence on a given day of the week.
Code is self explanatory.
Replace my proc sql by the following two steps:

/* This counts the occurence */

proc sql;
create table have3 as
select Location ,day format mmddyy9., dow , put(Hour, tod5.) as Hour, count(hour) as no_on_day from have2
group by Location,day, dow,Hour;
quit;

/*

The number average number of occurrences on a given location, time and day.

dow is day o f week.

*/
proc sql;
select location, dow, hour, avg(no_on_day) as avg from have3
group by Location, dow,hour;
quit;

I have written the code to match the appearance of your question.

View solution in original post

Tom · Posted 01-28-2020 12:28 PM

Are you tying to count observations? Or distinct values of ID?

Krueger · Posted 01-28-2020 12:31 PM

I want to know the Average number of ID's that occur based on Location, WeekDay (Monday-Sunday) and Hour (8AM-12PM). I then want my output to look like my want dataset so that I have Location, WeekDay and Hour listed with all of the Averages as the last column.

In my prior datastep I removed the duplicates and have it down to only distinct values by Location, WeekDay and Hour for each ID.

So in the above test data there are 3 ID's...

ID= C is the only ID that has only hours 8AM & 9AM

ID's A & B have all hours listed.

Averages for these would be 8AM = 3, 9AM = 3, 10AM-12PM = 2.

ballardw · Posted 01-28-2020 12:59 PM

I'm afraid I can't tell exactly what you want when you refer to "week" since you only show Wed.

To get a value that represents a the start of a week on Monday you can apply the Intnx function to a date.

Example:

data work.example;
   do date = '01Jan2020'd to '31MAR2020'd;
      weekmonday = intnx('week.2',date,0,'b');
      output;
   end;
   format date weekmonday date9.;
run;

The intnx function increments date, time or datetime values. The first parameter is the interval of interest, week in your problem. Since you want the week to start on Monday, which is the 2nd day of the week use a shift index of .2. The 0 is how many weeks, ie the current week and 'b' yields the first day of the given week. So you can create a new variable to indicate the first day of the week of interest. How to display that value depends but you haven't given an example of how the week is to be shown or actually used.

Something like:

   
DATA Have;
	INPUT LOCATION ID $2. Day:MMDDYY9. Month:Date8. Hour:time.;
	FORMAT Day WeekDate3. Month Date8. Hour timeampm5.;
   weekmonday = intnx('week.2',day,0,'b');
   format weekmonday date9.;
	DATALINES;
1 A 7/3/2019  1-Jul-19 08:00
1 A 7/3/2019  1-Jul-19 09:00
1 A 7/3/2019  1-Jul-19 10:00
1 A 7/3/2019  1-Jul-19 11:00
1 A 7/3/2019  1-Jul-19 12:00
1 A 7/10/2019 1-Jul-19 08:00
1 A 7/10/2019 1-Jul-19 09:00
1 A 7/10/2019 1-Jul-19 10:00
1 A 7/10/2019 1-Jul-19 11:00
1 A 7/10/2019 1-Jul-19 12:00
1 A 7/17/2019 1-Jul-19 08:00
1 A 7/17/2019 1-Jul-19 09:00
1 A 7/17/2019 1-Jul-19 10:00
1 A 7/17/2019 1-Jul-19 11:00
1 A 7/17/2019 1-Jul-19 12:00
1 B 7/3/2019  1-Jul-19 08:00
1 B 7/3/2019  1-Jul-19 09:00
1 B 7/3/2019  1-Jul-19 10:00
1 B 7/3/2019  1-Jul-19 11:00
1 B 7/3/2019  1-Jul-19 12:00
1 B 7/10/2019 1-Jul-19 08:00
1 B 7/10/2019 1-Jul-19 09:00
1 B 7/10/2019 1-Jul-19 10:00
1 B 7/10/2019 1-Jul-19 11:00
1 B 7/10/2019 1-Jul-19 12:00
1 B 7/17/2019 1-Jul-19 08:00
1 B 7/17/2019 1-Jul-19 09:00
1 B 7/17/2019 1-Jul-19 10:00
1 B 7/17/2019 1-Jul-19 11:00
1 B 7/17/2019 1-Jul-19 12:00
1 C 7/03/2019 1-Jul-19 08:00
1 C 7/03/2019 1-Jul-19 09:00
1 C 7/10/2019 1-Jul-19 08:00
1 C 7/10/2019 1-Jul-19 09:00
1 C 7/17/2019 1-Jul-19 08:00
1 C 7/17/2019 1-Jul-19 09:00
;
RUN;

And the count you want is likely something like this:

proc freq data=have noprint;
   tables weekmonday*hour*location/out=work.count1 nopercent;
run;
proc freq data=work.count1 noprint;
   tables weekmonday*hour/out=work.count2 nopercent;
run;

However I'm not sure since you didn't provide much of an example of what you really want as far as having multiple values per "week".

Krueger · Posted 01-28-2020 02:16 PM

Thanks for the feedback I appreciate it.

I'll try to do a better job explaining...

Ultimately I'm trying to determine from 1/1/2019-12/31/2019 the average number of ID's by location and hour for a given WeekDay (Monday-Sunday). The final Dataset would look similar to this:

Given we are dealing with Averages these (AVG column) will likely be 1.1056, 2.223 etc.. but again just trying to keep it simple.

In the Have dataset each date (7/3/2019, 7/10/2019, etc..) are all Wednesdays. My actual dataset has days 1/1/2019-12/31/2019. I'm needing all of these dates grouped into the WeekDate format however so what I want is it grouped Monday-Sunday. So if I put in 1/2/2019 or even 12/25/2019 (both are Wednesday's) it would also apply to the above group.

So from there... we have unique ID's for each location and on a timeframe from 8AM-8PM (in this example I kept it down to 8AM-12PM). ID A may only appear 2 times out of the potential 5 (in my example) hours for any particular Wednesday, but may appear 3 times another Wednesday; both of which are unique entries which will be counted towards the average. So back to my original set we can see ID C appears during ALL of the Wednesdays provided, but only appears for 8AM & 9AM.

We know there are 3 Wednesdays being calculated: 7/3/2019, 7/10/2019, 7/17/2019.

We know we have 3 ID's: A,B,C.

For hours 8AM & 9AM users A, B & C have entries for each Wednesday.

For Hours 10AM-12PM users A & B have entries for each Wednesday...ID C does not have entries so they are not counted.

So we know on 7/3/2019, 7/10/2019 and 7/17/2019 there are 3 unique ID's for 8AM & 9AM and 9 total entries (3, 3, 3).

However on the same dates above there are only 2 unique ID's for 10AM-12PM and only 6 entries (2, 2, 2).

I want the average of these entries so we then know based off that information that the Average for Wednesday 8AM & 9AM = 3 and 10AM-12PM = 2.

Now if we added another Wednesday with only 1 unique ID for 8AM there would be 4 unique ID's for 8AM and 10 total entries. At this point the average for Wednesday at 8AM would change from 3 to 2.5 (3, 3, 3, 1).

Hopefully this makes more sense although I feel like I'm complicating it further in an attempt to explain it.

PaigeMiller · Posted 01-28-2020 03:01 PM

@Krueger wrote:

We know there are 3 Wednesdays being calculated: 7/3/2019, 7/10/2019, 7/17/2019.

We know we have 3 ID's: A,B,C.

For hours 8AM & 9AM users A, B & C have entries for each Wednesday.

For Hours 10AM-12PM users A & B have entries for each Wednesday...ID C does not have entries so they are not counted.

So we know on 7/3/2019, 7/10/2019 and 7/17/2019 there are 3 unique ID's for 8AM & 9AM and 9 total entries (3, 3, 3).

However on the same dates above there are only 2 unique ID's for 10AM-12PM and only 6 entries (2, 2, 2).

I want the average of these entries so we then know based off that information that the Average for Wednesday 8AM & 9AM = 3 and 10AM-12PM = 2.

So as you have described it, you need to do 3 distinct calculations

Count number of times each HOUR appears
Count number of distinct ID values
Divide item #1 by item #3

Thus, to make this easiest to understand, the code needs three distinct calculations (maybe someone smart can do this in one step, but logically it is easier to understand if we program each of these as distinct steps in SAS)

DATA Have;
	INPUT LOCATION ID $2. Day:MMDDYY9. Month:Date8. Hour:time.;
	FORMAT Day WeekDate3. Month Date8. Hour timeampm5.;
	DATALINES;
1 A 7/3/2019  1-Jul-19 08:00
1 A 7/3/2019  1-Jul-19 09:00
1 A 7/3/2019  1-Jul-19 10:00
1 A 7/3/2019  1-Jul-19 11:00
1 A 7/3/2019  1-Jul-19 12:00
1 A 7/10/2019 1-Jul-19 08:00
1 A 7/10/2019 1-Jul-19 09:00
1 A 7/10/2019 1-Jul-19 10:00
1 A 7/10/2019 1-Jul-19 11:00
1 A 7/10/2019 1-Jul-19 12:00
1 A 7/17/2019 1-Jul-19 08:00
1 A 7/17/2019 1-Jul-19 09:00
1 A 7/17/2019 1-Jul-19 10:00
1 A 7/17/2019 1-Jul-19 11:00
1 A 7/17/2019 1-Jul-19 12:00
1 B 7/3/2019  1-Jul-19 08:00
1 B 7/3/2019  1-Jul-19 09:00
1 B 7/3/2019  1-Jul-19 10:00
1 B 7/3/2019  1-Jul-19 11:00
1 B 7/3/2019  1-Jul-19 12:00
1 B 7/10/2019 1-Jul-19 08:00
1 B 7/10/2019 1-Jul-19 09:00
1 B 7/10/2019 1-Jul-19 10:00
1 B 7/10/2019 1-Jul-19 11:00
1 B 7/10/2019 1-Jul-19 12:00
1 B 7/17/2019 1-Jul-19 08:00
1 B 7/17/2019 1-Jul-19 09:00
1 B 7/17/2019 1-Jul-19 10:00
1 B 7/17/2019 1-Jul-19 11:00
1 B 7/17/2019 1-Jul-19 12:00
1 C 7/03/2019 1-Jul-19 08:00
1 C 7/03/2019 1-Jul-19 09:00
1 C 7/10/2019 1-Jul-19 08:00
1 C 7/10/2019 1-Jul-19 09:00
1 C 7/17/2019 1-Jul-19 08:00
1 C 7/17/2019 1-Jul-19 09:00
;
RUN;
	
/* Item 1 -- calculate the number of observations for each hour */
proc summary data=have nway;
	class location day hour;
	var month;
	output out=n n=count;
run;

/* Item 2 -- determine the number of distinct ID values */
proc sql;
	create table distinct_id as select count(distinct id) as distinct_id from have;
quit;

/* Item 3 -- divide */
data want;
	if _n_=1 then set distinct_id;
	set n;
	average=count/distinct_id;
run;

--
Paige Miller

Krueger · Posted 01-28-2020 03:06 PM

Thanks for breaking this down and explaining it. I think this will work for what I'm after. Going to implement into live data before accepting as solution.

Krueger · Posted 01-28-2020 03:30 PM

This worked for the example data but the live data it didn't account for the different weekdays. If I added 7/4/2019 for example (Thursday) Item 2 only accounts for the 3 ID's still. I did implement a group by to this section which fixed that part but wasn't sure how to update Item 3 to accommodate the different grouping's so it would divide correctly. Adding a by statement(?) to item3 didn't seem to help although I'm sure the solution to this is simple.

Thank you again though for your input I definitely learned a lot from this!

Sajid01 · Posted 01-28-2020 02:40 PM

DATA Have;
INPUT LOCATION ID $2. Day:MMDDYY9. Month:Date8. Hour:time.;
FORMAT Day WeekDate3. Month Date8. Hour timeampm5.;
DATALINES;
1 A 7/3/2019 1-Jul-19 08:00
1 A 7/3/2019 1-Jul-19 09:00
1 A 7/3/2019 1-Jul-19 10:00
1 A 7/3/2019 1-Jul-19 11:00
1 A 7/3/2019 1-Jul-19 12:00
1 A 7/10/2019 1-Jul-19 08:00
1 A 7/10/2019 1-Jul-19 09:00
1 A 7/10/2019 1-Jul-19 10:00
1 A 7/10/2019 1-Jul-19 11:00
1 A 7/10/2019 1-Jul-19 12:00
1 A 7/17/2019 1-Jul-19 08:00
1 A 7/17/2019 1-Jul-19 09:00
1 A 7/17/2019 1-Jul-19 10:00
1 A 7/17/2019 1-Jul-19 11:00
1 A 7/17/2019 1-Jul-19 12:00
1 B 7/3/2019 1-Jul-19 08:00
1 B 7/3/2019 1-Jul-19 09:00
1 B 7/3/2019 1-Jul-19 10:00
1 B 7/3/2019 1-Jul-19 11:00
1 B 7/3/2019 1-Jul-19 12:00
1 B 7/10/2019 1-Jul-19 08:00
1 B 7/10/2019 1-Jul-19 09:00
1 B 7/10/2019 1-Jul-19 10:00
1 B 7/10/2019 1-Jul-19 11:00
1 B 7/10/2019 1-Jul-19 12:00
1 B 7/17/2019 1-Jul-19 08:00
1 B 7/17/2019 1-Jul-19 09:00
1 B 7/17/2019 1-Jul-19 10:00
1 B 7/17/2019 1-Jul-19 11:00
1 B 7/17/2019 1-Jul-19 12:00
1 C 7/03/2019 1-Jul-19 08:00
1 C 7/03/2019 1-Jul-19 09:00
1 C 7/10/2019 1-Jul-19 08:00
1 C 7/10/2019 1-Jul-19 09:00
1 C 7/17/2019 1-Jul-19 08:00
1 C 7/17/2019 1-Jul-19 09:00
;
RUN;

data have2;
set have;
dow= put(day,downame.);
run;

proc sql;
select Location , dow as day , put(Hour, tod5.) as Hour, count(hour) as count from have2
group by Location, dow,Hour;
quit;

The result will be

LOCATION	day	Hour	Count
1	Wednesday	8:00	9
1	Wednesday	9:00	9
1	Wednesday	10:00	6
1	Wednesday	11:00	6
1	Wednesday	12:00	6

Krueger · Posted 01-28-2020 02:50 PM

Definitely on the right track. I need the average though but this gives me a place to experiment further with.

Thanks!

Sajid01 · Posted 01-28-2020 03:12 PM

I have updated it. It gives the average number of occurrence on a given day of the week.
Code is self explanatory.
Replace my proc sql by the following two steps:

/* This counts the occurence */

proc sql;
create table have3 as
select Location ,day format mmddyy9., dow , put(Hour, tod5.) as Hour, count(hour) as no_on_day from have2
group by Location,day, dow,Hour;
quit;

/*

The number average number of occurrences on a given location, time and day.

dow is day o f week.

*/
proc sql;
select location, dow, hour, avg(no_on_day) as avg from have3
group by Location, dow,hour;
quit;

I have written the code to match the appearance of your question.

Krueger · Posted 01-28-2020 03:27 PM

This ended up working perfectly as it also accounts for different WeekDays besides wednesday. Thank you!

Registration is open

SAS Training: Just a Click Away