Solved: Calculating totals grouping by date

wlierman · Posted 04-29-2022 01:29 AM

I am continuing from an earlier question.

My data input set looks like the following:

ID County Race_Ethnicity NAICS_Sector Sector_Type Date

1056200 Yamhill Goods Construction 10Jun2020

1054324 Baker Goods Agriculture Forestry 10Jun2020

2314536 Grant Goods Construction 10Jun2020 .

.

2145735 Curry Goods Construction 22Jun2020

.

176834 Columbia Goods Manufacturing 13Nov2020

170034 Columbia Goods Manufacturing 13Nov2020

198784 Columbia Goods Manufacturing 13Nov2020

and so on. The main characteristic is that Contact ID can appear any number of times for any particular date - it is dependent on the success of the contact tracing.

I want to keep the fields shown but for the next sql code I need to have the total IDs

for each date totaled. Again the number of contacts for any date depends on the contact tracing effort and success of those efforts.

I used the following code.

Proc sql noprint;
CREATE Table SAS_EMPL.POE_Goods_Trend AS
   SELECT Contact_Person_ID,
          County_1,
		  Race_Ethnicity,
		  NAICS_Sector,
		  Sector_Type,
		  datepart(Created_On) AS Date label = "Date" format=date9.,
		  count(distinct Contact_Person_ID) AS N
   FROM SAS_EMPL.POE_GOODS_FINI
   Where '01JAN20'd <= Calculated Date <= '31MAR22'd
   And NAICS_Sector = "GOODS"
   Group by date;
   *Order by 1;
   quit;

However it didn't give what I need

This what I got:

Date N

04JUN2020 1

05JUN2020 1

09JUN2020 1

11JUN2020 1

This what I need/want:

Date N

04JUN2020 1

05JUN2020 1

09JUN2020 3

11JUN2020 6

So I need the output to look like the last little snippet for all the dates that are in the

dataset (the cumulative total for each date).

Thank you.

For information the log accompanying the above code is

 Proc sql noprint;
328  CREATE Table SAS_EMPL.POE_Goods_Trend AS
329     SELECT Contact_Person_ID,
330            County_1,
331            Race_Ethnicity,
332            NAICS_Sector,
333            Sector_Type,
334            datepart(Created_On) AS Date label = "Date" format=date9.,
335            count(distinct Contact_Person_ID) AS N
336     FROM SAS_EMPL.POE_GOODS_FINI
337     Where '01JAN20'd <= Calculated Date <= '31MAR22'd
338     And NAICS_Sector = "GOODS"
339     Group by date;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Compressing data set SAS_EMPL.POE_GOODS_TREND decreased size by 60.00 percent.
      Compressed is 4 pages; un-compressed would require 10 pages.
NOTE: Table SAS_EMPL.POE_GOODS_TREND created, with 2467 rows and 7 columns.

340     *Order by 1;
341     quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

Thank you for your help in answering this specific request.

wlierman

Kurt_Bremser · Posted 05-01-2022 03:28 PM

Then you will never get your expected data as you posted it. You can only get that with

select date, count(...) as count
from have
group by date;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

andreas_lds · Posted 04-29-2022 01:57 AM

Please post data in usable form (data step with datalines, or proc sql with insert), so that we can run the code you have provided.

The expected output doesn't match the data posted and i don't see "I want to keep the fields shown" in the expected output, too.

wlierman · Posted 04-29-2022 01:50 PM

Okay here is the input data set (in xlsx format)

Created_On	Contact_Person_ID	NAICS_Sector
7/15/2020 9:38	1000850	GOODS
10/16/2020 10:14	1011546	GOODS
9/22/2020 8:41	1012367	GOODS
6/13/2020 9:30	1012517	GOODS
11/2/2020 8:52	1014369	GOODS
8/11/2020 9:55	101964	GOODS
10/13/2020 11:33	1025507	GOODS
12/5/2020 9:35	1026268	GOODS
10/31/2020 11:28	1027330	GOODS
9/1/2020 10:12	1028370	GOODS
10/15/2020 10:57	1050427	GOODS
12/2/2020 8:15	1050453	GOODS

The code that I used is as follows

Proc sql noprint;
CREATE Table SAS_EMPL.POE_Goods_Trend_2 AS
   SELECT Contact_Person_ID,
		  NAICS_Sector,
		  datepart(Created_On) AS Date label = "Date" format=date9.,
		  count(distinct Contact_Person_ID) AS N
   FROM SAS_EMPL.POE_GOODS_FINI_2
   Where '01JAN20'd <= datepart(Created_On) <= '30APR22'd
   And NAICS_Sector = "GOODS"
   Group by date
   Order by date;
   quit;

The output dataset (partial and again in xlsx format) is

Contact_Person_ID	NAICS_Sector	Date	N
1056200	GOODS	10-Jun-20	1
1069490	GOODS	13-Jun-20	2
1012517	GOODS	13-Jun-20	2
1074374	GOODS	14-Jun-20	6
1072985	GOODS	14-Jun-20	6
1074445	GOODS	14-Jun-20	6
1074884	GOODS	14-Jun-20	6
951206	GOODS	14-Jun-20	6
1074312	GOODS	14-Jun-20	6
1077220	GOODS	15-Jun-20	1
1079736	GOODS	16-Jun-20	1
1083450	GOODS	17-Jun-20	1
1089620	GOODS	18-Jun-20	5
1090413	GOODS	18-Jun-20	5
9991088408	GOODS	18-Jun-20	5
1090459	GOODS	18-Jun-20	5
1087981	GOODS	18-Jun-20	5
1092420	GOODS	19-Jun-20	2
9991088337	GOODS	19-Jun-20	2
1098412	GOODS	20-Jun-20	7
1098376	GOODS	20-Jun-20	7
1098407	GOODS	20-Jun-20	7
1097263	GOODS	20-Jun-20	7
1080505	GOODS	20-Jun-20	7
1098432	GOODS	20-Jun-20	7
1081349	GOODS	20-Jun-20	7
1101656	GOODS	21-Jun-20	3
1102603	GOODS	21-Jun-20	3
666589	GOODS	21-Jun-20	3
1107510	GOODS	22-Jun-20	4
1106453	GOODS	22-Jun-20	4
1107392	GOODS	22-Jun-20	4
1107703	GOODS	22-Jun-20	4
933588	GOODS	23-Jun-20	5
1110589	GOODS	23-Jun-20	5
1110335	GOODS	23-Jun-20	5
1109851	GOODS	23-Jun-20	5
1109712	GOODS	23-Jun-20	5

This not what I want.  I want this result

Date      N
10-Jun-20 1
13-Jun-20   2
14-Jun-20   6
15-Jun-20   1
16-Jun-20   1
17-Jun-20   1
15-Jun-20   3
18-Jun-20 5

and so on for the
entire data
set

I need the cumulative total of contacts by each distinct date. Notice here that there is a lone 15-JUN-20 1 in the 4th obs but should be in the

second to last line so that 15-JUN-20 should be 4

What changes in the code can produce that total / cumulative count of contacts (N) for each distinct / unique date.

Thank you

Kurt_Bremser · Posted 05-01-2022 06:46 AM

I ask once again (and for the very last time) why do you select additional variables, when all you want is the date and the count?

PS I edited your post down to a size that the browsers can easily handle. No need to swamp the post with hundreds of lines when a few are sufficient to illustrate your issue. Nobody answers to a post that takes more than a minute to load.

And you were specifically asked to provide data in a usable form (as a data step with datalines or a proc sql with insert). A copy/paste directly out of Excel is not usable (it forces us to put import code around it, making assumptions about things we must know). Right now we can't tell if you have dates and times or datetimes, and if those are stored as SAS numeric values or strings.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wlierman · Posted 05-01-2022 03:18 PM

I need those variables in subsequent calculations so I carry them along to help speed up the process.
I used the usual data step process to get the output that I need for the time series charts.
Thank you for your help.

wlierman

Kurt_Bremser · Posted 05-01-2022 03:28 PM

Then you will never get your expected data as you posted it. You can only get that with

select date, count(...) as count
from have
group by date;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wlierman · Posted 05-01-2022 03:36 PM

Right.

Thank you for that insight.

wlierman

Tom · Posted 05-01-2022 05:35 PM

Your example data does not have any replication of dates.

So first thing is to fix that.

Here is how to supply example data. I modified the data to have multiple contacts on the same day.

data have;
  input Created_On :datetime. Contact_Person_ID :$20. NAICS_Sector :$20.;
  format Created_On datetime19. ;
cards;
15JUL2020:09:38:00 1000850 GOODS
15OCT2020:10:14:00 1011546 GOODS
22SEP2020:08:41:00 1012367 GOODS
13JUN2020:09:30:00 1012517 GOODS
02NOV2020:08:52:00 1014369 GOODS
11AUG2020:09:55:00  101964 GOODS
15OCT2020:11:33:00 1025507 GOODS
05DEC2020:09:35:00 1026268 GOODS
31OCT2020:11:28:00 1027330 GOODS
01SEP2020:10:12:00 1028370 GOODS
15OCT2020:10:57:00 1050427 GOODS
02DEC2020:08:15:00 1050453 GOODS
;

Now your code should work fine.

proc sql ;
create table want AS
  select Contact_Person_ID
       , NAICS_Sector
       , datepart(Created_On) AS Date format=date9.
       , count(distinct Contact_Person_ID) AS N
  from have
  where '01JAN2020'd <= datepart(Created_On) <= '30APR2022'd
    and NAICS_Sector = "GOODS"
  group by date
  order by date
;
quit;

Results

       Contact_
       Person_     NAICS_
Obs       ID       Sector         Date    N

  1    1012517     GOODS     13JUN2020    1
  2    1000850     GOODS     15JUL2020    1
  3    101964      GOODS     11AUG2020    1
  4    1028370     GOODS     01SEP2020    1
  5    1012367     GOODS     22SEP2020    1
  6    1025507     GOODS     15OCT2020    3
  7    1050427     GOODS     15OCT2020    3
  8    1011546     GOODS     15OCT2020    3
  9    1027330     GOODS     31OCT2020    1
 10    1014369     GOODS     02NOV2020    1
 11    1050453     GOODS     02DEC2020    1
 12    1026268     GOODS     05DEC2020    1

So you can see that 15OCT2020 has 3 different customers.

PS It does seem silly to order the data by DATE and not have DATE as the FIRST variable in the dataset. It makes it hard for humans to review the dataset to check if it was created properly.

wlierman · Posted 05-01-2022 11:14 PM

Thank you for your explanation and set up.
I really appreciate that.
wlierman

wlierman · Posted 05-23-2022 03:06 PM

Thank you very much.

This is exactly what I was looking for.

This should also be labeled as SOLUTION

Kurt_Bremser · Posted 04-29-2022 03:51 AM

You cannot get only two columns when you SELECT 7. Correct that first.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Re: Calculating totals grouping by date

Classroom Training Available!