I have the following dataset. I would like to extract first and last date per city and doing a count by city as well
Have:
City | dates |
New York | 1/5/2019 |
Boston | 4/20/2019 |
Washington DC | 8/20/2019 |
Washington DC | 9/1/2019 |
Washington DC | 10/19/2019 |
Washington DC | 12/13/2019 |
Hartford | 5/5/2019 |
Pittsburgh | 6/4/2019 |
Philadelphia | 4/6/2019 |
Philadelphia | 6/14/2019 |
Philadelphia | 9/8/2019 |
Want
City | Count | firstDate | LastDate |
New York | 1 | 1/5/2019 | 1/5/2019 |
Boston | 1 | 4/20/2019 | 4/20/2019 |
Washington DC | 4 | 8/20/2019 | 12/13/2019 |
Hartford | 1 | 5/5/2019 | 5/5/2019 |
Pittsburgh | 1 | 6/4/2019 | 6/4/2019 |
Philadelphia | 3 | 4/6/2019 | 9/8/2019 |
Possibly also percent per city? like NY will have 1/11, Boston 1/11 (9.1%), etc
Here is waht i have done so far
proc sql;
select distinct city, count(*) as total_count, min(dates) as firstDate, max(dates) as lastDate
from table group by city;quit;
However when I run it, all the dates are converted into NUMBERS, not dates. How do I add n/total % ?
THanks!!
Add formats to display dates correctly, and use a preliminary step to get the overall count.
proc sql noprint;
select nobs into :tc
from dictionary.tables
where libname = "WORK" and memname = "TABLE";
select
city,
count(*) as total_count,
calculated total_count / &tc. as percent format=percent7.2,
min(dates) as firstDate format=e8601da10.,
max(dates) as lastDate format=e8601da10.
from table
group by city;
quit;
Edit: added "calculated" in the SQL to prevent ERROR
total_count / &tc. as percent format=percent7.2,
I always point out that this calculation fails to produce the proper answer if there are missing values in the data.
@PaigeMiller wrote:
total_count / &tc. as percent format=percent7.2,I always point out that this calculation fails to produce the proper answer if there are missing values in the data.
Why?
Missing values would just end up as another group, with a correct count. The user can then decide to filter them out, or not.
See this:
proc sql noprint;
select nobs into :tc
from dictionary.tables
where libname = "WORK" and memname = "CLASS";
create table want as
select
sex,
count(*) as total_count,
calculated total_count / &tc. as percent format=percent7.2
from class
group by sex;
quit;
proc print data=want noobs;
run;
Result:
Sex total_count percent 1 5.00% F 9 45.0% M 10 50.0%
@Kurt_Bremser wrote:
@PaigeMiller wrote:
total_count / &tc. as percent format=percent7.2,I always point out that this calculation fails to produce the proper answer if there are missing values in the data.
Why?
Missing values would just end up as another group, with a correct count. The user can then decide to filter them out, or not.
Because people don't realize they have to make a decision to account for missings (or they don't even realize what their code is doing with missings). I have seen examples where people code count(*) in PROC SQL and they get the wrong answer. Or they develop the code on data that doesn't have missings, and then it fails later when run on code with missings. I consider using count(*) a dangerous thing to do.
You have asked several very similar questions. Please tell us if you want a data set (for further manipulation or analysis) or a report that people read.
Most of your questions if a report is needed can be answered by one of the reporting procedures like Proc Report or Proc tabulate:
data have; infile datalines dlm=','; input City :$25. dates mmddyy10.; format dates mmddyy10.; datalines; New York,1/5/2019 Boston,4/20/2019 Washington DC,8/20/2019 Washington DC,9/1/2019 Washington DC,10/19/2019 Washington DC,12/13/2019 Hartford,5/5/2019 Pittsburgh,6/4/2019 Philadelphia,4/6/2019 Philadelphia,6/14/2019 Philadelphia,9/8/2019 ; proc tabulate data=have; class city; var dates; tables city , dates=' '*(n='Count' colpctn='%' min='First Date'*f=mmddyy10. max='Last Date'*f=mmddyy10.) ; run;
If you do not specify a specific date format for a calculated result then it will default to BEST. Since SAS date values are the number of days from 1 Jan 1960 then unformatted values look like 21340 instead of a formatted date like 06/05/2018 or 05Jun2018.
PROC SUMMARY has features that make this easy even when the dates are character in iso8601 format.
data have;
infile cards dsd;
input City :$25. dates mmddyy10.;
format dates yymmdd10.;
dtc = put(dates,yymmdd10.);
cards;
New York,1/5/2019
Boston,4/20/2019
Washington DC,8/20/2019
Washington DC,9/1/2019
Washington DC,10/19/2019
Washington DC,12/13/2019
Hartford,5/5/2019
Pittsburgh,6/4/2019
Philadelphia,4/6/2019
Philadelphia,6/14/2019
Philadelphia,9/8/2019
;;;;
proc summary data=have nway missing;
class city;
output out=need(drop=_type_ rename=(_freq_=Count))
idgroup(min(dtc) out(dtc)=Start)
idgroup(max(dtc) out(dtc)=End)
;
run;
Hi @radhikaa4
You can try this:
proc sql;
create table want as
select a.City,
a.Count,
a.Count/b.total as Percent format=Percent8.1,
a.firstDate format=MMDDYY10.,
a.LastDate format=MMDDYY10.
from (select City,
count(City) as Count,
min(Dates) as FirstDate,
max(Dates) as LastDate
from table
group by City) as a full join
(select distinct City, count(*) as total from table) as b
on a.City = b.City
;
quit;
You can apply a format (e.g. MMDDYY10.) directly after the calculation so that dates can be human-readable : e.g. min(Dates) as FirstDate format = MMDDYY10.
To compute percent, I have merged you first query (alias 'a') with another another one (alias 'b') that retrieve the total number of records.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.