I am continuing from an earlier question.
My data input set looks like the following:
ID County Race_Ethnicity NAICS_Sector Sector_Type Date
1056200 Yamhill Goods Construction 10Jun2020
1054324 Baker Goods Agriculture Forestry 10Jun2020
2314536 Grant Goods Construction 10Jun2020 .
.
2145735 Curry Goods Construction 22Jun2020
.
.
176834 Columbia Goods Manufacturing 13Nov2020
170034 Columbia Goods Manufacturing 13Nov2020
198784 Columbia Goods Manufacturing 13Nov2020
and so on. The main characteristic is that Contact ID can appear any number of times for any particular date - it is dependent on the success of the contact tracing.
I want to keep the fields shown but for the next sql code I need to have the total IDs
for each date totaled. Again the number of contacts for any date depends on the contact tracing effort and success of those efforts.
I used the following code.
Proc sql noprint;
CREATE Table SAS_EMPL.POE_Goods_Trend AS
SELECT Contact_Person_ID,
County_1,
Race_Ethnicity,
NAICS_Sector,
Sector_Type,
datepart(Created_On) AS Date label = "Date" format=date9.,
count(distinct Contact_Person_ID) AS N
FROM SAS_EMPL.POE_GOODS_FINI
Where '01JAN20'd <= Calculated Date <= '31MAR22'd
And NAICS_Sector = "GOODS"
Group by date;
*Order by 1;
quit;
However it didn't give what I need
This what I got:
Date N
04JUN2020 1
05JUN2020 1
09JUN2020 1
09JUN2020 1
09JUN2020 1
11JUN2020 1
11JUN2020 1
11JUN2020 1
11JUN2020 1
11JUN2020 1
11JUN2020 1
This what I need/want:
Date N
04JUN2020 1
05JUN2020 1
09JUN2020 3
11JUN2020 6
So I need the output to look like the last little snippet for all the dates that are in the
dataset (the cumulative total for each date).
Thank you.
For information the log accompanying the above code is
Proc sql noprint;
328 CREATE Table SAS_EMPL.POE_Goods_Trend AS
329 SELECT Contact_Person_ID,
330 County_1,
331 Race_Ethnicity,
332 NAICS_Sector,
333 Sector_Type,
334 datepart(Created_On) AS Date label = "Date" format=date9.,
335 count(distinct Contact_Person_ID) AS N
336 FROM SAS_EMPL.POE_GOODS_FINI
337 Where '01JAN20'd <= Calculated Date <= '31MAR22'd
338 And NAICS_Sector = "GOODS"
339 Group by date;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Compressing data set SAS_EMPL.POE_GOODS_TREND decreased size by 60.00 percent.
Compressed is 4 pages; un-compressed would require 10 pages.
NOTE: Table SAS_EMPL.POE_GOODS_TREND created, with 2467 rows and 7 columns.
340 *Order by 1;
341 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
Thank you for your help in answering this specific request.
wlierman
Then you will never get your expected data as you posted it. You can only get that with
select date, count(...) as count
from have
group by date;
Please post data in usable form (data step with datalines, or proc sql with insert), so that we can run the code you have provided.
The expected output doesn't match the data posted and i don't see "I want to keep the fields shown" in the expected output, too.
Okay here is the input data set (in xlsx format)
Created_On Contact_Person_ID NAICS_Sector
7/15/2020 9:38 1000850 GOODS
10/16/2020 10:14 1011546 GOODS
9/22/2020 8:41 1012367 GOODS
6/13/2020 9:30 1012517 GOODS
11/2/2020 8:52 1014369 GOODS
8/11/2020 9:55 101964 GOODS
10/13/2020 11:33 1025507 GOODS
12/5/2020 9:35 1026268 GOODS
10/31/2020 11:28 1027330 GOODS
9/1/2020 10:12 1028370 GOODS
10/15/2020 10:57 1050427 GOODS
12/2/2020 8:15 1050453 GOODS
The code that I used is as follows
Proc sql noprint;
CREATE Table SAS_EMPL.POE_Goods_Trend_2 AS
SELECT Contact_Person_ID,
NAICS_Sector,
datepart(Created_On) AS Date label = "Date" format=date9.,
count(distinct Contact_Person_ID) AS N
FROM SAS_EMPL.POE_GOODS_FINI_2
Where '01JAN20'd <= datepart(Created_On) <= '30APR22'd
And NAICS_Sector = "GOODS"
Group by date
Order by date;
quit;
The output dataset (partial and again in xlsx format) is
Contact_Person_ID NAICS_Sector Date N
1056200 GOODS 10-Jun-20 1
1069490 GOODS 13-Jun-20 2
1012517 GOODS 13-Jun-20 2
1074374 GOODS 14-Jun-20 6
1072985 GOODS 14-Jun-20 6
1074445 GOODS 14-Jun-20 6
1074884 GOODS 14-Jun-20 6
951206 GOODS 14-Jun-20 6
1074312 GOODS 14-Jun-20 6
1077220 GOODS 15-Jun-20 1
1079736 GOODS 16-Jun-20 1
1083450 GOODS 17-Jun-20 1
1089620 GOODS 18-Jun-20 5
1090413 GOODS 18-Jun-20 5
9991088408 GOODS 18-Jun-20 5
1090459 GOODS 18-Jun-20 5
1087981 GOODS 18-Jun-20 5
1092420 GOODS 19-Jun-20 2
9991088337 GOODS 19-Jun-20 2
1098412 GOODS 20-Jun-20 7
1098376 GOODS 20-Jun-20 7
1098407 GOODS 20-Jun-20 7
1097263 GOODS 20-Jun-20 7
1080505 GOODS 20-Jun-20 7
1098432 GOODS 20-Jun-20 7
1081349 GOODS 20-Jun-20 7
1101656 GOODS 21-Jun-20 3
1102603 GOODS 21-Jun-20 3
666589 GOODS 21-Jun-20 3
1107510 GOODS 22-Jun-20 4
1106453 GOODS 22-Jun-20 4
1107392 GOODS 22-Jun-20 4
1107703 GOODS 22-Jun-20 4
933588 GOODS 23-Jun-20 5
1110589 GOODS 23-Jun-20 5
1110335 GOODS 23-Jun-20 5
1109851 GOODS 23-Jun-20 5
1109712 GOODS 23-Jun-20 5
This not what I want. I want this result
Date N
10-Jun-20 1
13-Jun-20 2
14-Jun-20 6
15-Jun-20 1
16-Jun-20 1
17-Jun-20 1
15-Jun-20 3
18-Jun-20 5
and so on for the
entire data
set
I need the cumulative total of contacts by each distinct date. Notice here that there is a lone 15-JUN-20 1 in the 4th obs but should be in the
second to last line so that 15-JUN-20 should be 4
What changes in the code can produce that total / cumulative count of contacts (N) for each distinct / unique date.
Thank you
I ask once again (and for the very last time) why do you select additional variables, when all you want is the date and the count?
PS I edited your post down to a size that the browsers can easily handle. No need to swamp the post with hundreds of lines when a few are sufficient to illustrate your issue. Nobody answers to a post that takes more than a minute to load.
And you were specifically asked to provide data in a usable form (as a data step with datalines or a proc sql with insert). A copy/paste directly out of Excel is not usable (it forces us to put import code around it, making assumptions about things we must know). Right now we can't tell if you have dates and times or datetimes, and if those are stored as SAS numeric values or strings.
Then you will never get your expected data as you posted it. You can only get that with
select date, count(...) as count
from have
group by date;
Right.
Thank you for that insight.
wlierman
Your example data does not have any replication of dates.
So first thing is to fix that.
Here is how to supply example data. I modified the data to have multiple contacts on the same day.
data have;
input Created_On :datetime. Contact_Person_ID :$20. NAICS_Sector :$20.;
format Created_On datetime19. ;
cards;
15JUL2020:09:38:00 1000850 GOODS
15OCT2020:10:14:00 1011546 GOODS
22SEP2020:08:41:00 1012367 GOODS
13JUN2020:09:30:00 1012517 GOODS
02NOV2020:08:52:00 1014369 GOODS
11AUG2020:09:55:00 101964 GOODS
15OCT2020:11:33:00 1025507 GOODS
05DEC2020:09:35:00 1026268 GOODS
31OCT2020:11:28:00 1027330 GOODS
01SEP2020:10:12:00 1028370 GOODS
15OCT2020:10:57:00 1050427 GOODS
02DEC2020:08:15:00 1050453 GOODS
;
Now your code should work fine.
proc sql ;
create table want AS
select Contact_Person_ID
, NAICS_Sector
, datepart(Created_On) AS Date format=date9.
, count(distinct Contact_Person_ID) AS N
from have
where '01JAN2020'd <= datepart(Created_On) <= '30APR2022'd
and NAICS_Sector = "GOODS"
group by date
order by date
;
quit;
Results
Contact_ Person_ NAICS_ Obs ID Sector Date N 1 1012517 GOODS 13JUN2020 1 2 1000850 GOODS 15JUL2020 1 3 101964 GOODS 11AUG2020 1 4 1028370 GOODS 01SEP2020 1 5 1012367 GOODS 22SEP2020 1 6 1025507 GOODS 15OCT2020 3 7 1050427 GOODS 15OCT2020 3 8 1011546 GOODS 15OCT2020 3 9 1027330 GOODS 31OCT2020 1 10 1014369 GOODS 02NOV2020 1 11 1050453 GOODS 02DEC2020 1 12 1026268 GOODS 05DEC2020 1
So you can see that 15OCT2020 has 3 different customers.
PS It does seem silly to order the data by DATE and not have DATE as the FIRST variable in the dataset. It makes it hard for humans to review the dataset to check if it was created properly.
Thank you very much.
This is exactly what I was looking for.
This should also be labeled as SOLUTION
You cannot get only two columns when you SELECT 7. Correct that first.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.