04-05-2016 08:43 AM - edited 04-05-2016 10:14 AM
I want to take a group of data and summarize it and put the values in a new column. Consider
data test; format sn $2. ctry $3. date date9. code $3.; input sn ctry lat long date code; datalines; 1 US 1.25 2.13 20155 abc 1 US 1.26 2.12 20156 abc 1 US 1.32 2.15 20157 def 1 EU 2.53 3.12 20175 xyz 1 EU 2.53 2.13 20176 xyz 2 EU 5.67 10.2 20156 fgh 2 EU 5.68 10.5 20158 fgh ; run;
I want to summarize by sn by finding the mean for lat and long, the date range, the mode of code, the last code, and give me the count of occurrences. The following data summarized would create a new table that has:
sn avg_lat avg_long start_dt end_dt code_mode code_last cnt 1 1.277 2.133 20155 20157 abc def 3 2 5.675 10.35 20156 20158 fgh fgh 2
I know how to do this in Python's pandas, but I cannot figure it out in SAS. I know there is proc means but I don't think (which means I don't know how to use proc means to do this if possible) it would make a table like this. Also, proc means can't do a count.
Does this make sense?
I was able to find a way do everything but the mode of code and last code for the data.
proc sql; create table summarized_test as select sn, ctry, avg(lat) as avg_lat, avg(long) as avg_long, min(date) as start_dt, max(date) as end_dt, count(date) as cnt from test group by sn, ctry; quit;
I also added a ctry line which I forgot about in the OP which was my mistake.
04-05-2016 09:20 AM
You want proc means:proc means data=test; by sn date; var lat long; output out=want mean=mean mode=mode min=min max=max; run;
That looks fishy. What about LONG. You only provide names for LAT.
04-05-2016 09:18 AM
The MODE of a character field will be difficult for PROC MEANS. Otherwise, all things are possible. First, the questions, though. Should CNT be incremented even if LAT (and/or LONG) have a missing value? If DATE has a missing value, should that be ignored when computing the START_DT?
04-05-2016 09:34 AM - edited 04-05-2016 09:40 AM
@Astounding, I have executed a where statement on real data set excluding these instances you speak of.
04-05-2016 09:38 AM
04-05-2016 10:03 AM
Here's a PROC MEANS that gives you all statistics for the numeric fields. Working with the CODE will require different tools.
proc means data=test;
var lat long date;
output out=want (drop=_type_ rename=(_freq_ = cnt)) mean = avg_lat avg_long min(date) = start_dt max(date) = end_dt;
proc print data=want (obs=5);
See if this makes sense, and we can look at CODE separately.
04-05-2016 10:54 PM
data have; input Patient Dosedate: mmddyy10.; format Dosedate mmddyy10.; cards; 101 01/01/2015 101 01/02/2015 101 01/03/2015 101 01/04/2015 102 01/05/2015 102 01/06/2015 102 01/08/2015 102 01/10/2015 ; run; data want; merge have have(firstobs=2 rename=(Patient=_Patient Dosedate=_Dosedate)); output; if Patient=_Patient then do; do i=Dosedate+1 to _Dosedate-1; Dosedate=.;miss_date=i;output; end; end; format miss_date mmddyy10.; drop i _:; run;
04-05-2016 10:55 PM
data test; format sn $2. ctry $3. date date9. code $3.; input sn ctry lat long date code; datalines; 1 US 1.25 2.13 20155 abc 1 US 1.26 2.12 20156 abc 1 US 1.32 2.15 20157 def 1 EU 2.53 3.12 20175 xyz 1 EU 2.53 2.13 20176 xyz 2 EU 5.67 10.2 20156 fgh 2 EU 5.68 10.5 20158 fgh ; run; proc sql; create table temp as select *,count(*) as count from test group by sn,ctry,code; create table want as select sn, ctry, avg(lat) as avg_lat, avg(long) as avg_long, min(date) as start_dt, max(date) as end_dt, (select distinct code from temp where sn=a.sn and ctry=a.ctry having count=max(count)) as code_mode, (select distinct code from temp where sn=a.sn and ctry=a.ctry having date=max(date) ) as code_last, count(date) as cnt from temp as a group by sn, ctry; quit;