Calculate descriptive statistics by row using column data from other t...

j262byuu · Posted 02-12-2020 03:31 PM

I have two tables like below

Table 1

ID    day start end
21005 3   4004  4012
21005 3   4031  4055
21005 4   4671  4679
21005 4   4689  4792
21005 4   4803  4828
21005 4   4837  4860
21005 4   5301  5424
21005 6   8068  8084
21005 6   8114  8148
21005 7   9239  9256
21005 7   9311  9326
21005 7   9497  9515
21005 7   9535  9574
21005 7   9615  9627
21005 7   9679  9694
21006 3   3738  3752
21006 5   7065  7077
21006 6   8374  8402
21007 1   732   742
21007 1   773   785
21007 1   831   853
21007 1   865   876
21007 1   880   894
...................

Table 2

ID    day time concentration
21005 1   1    562
...................
21005 2   1    90
...................
21005 3   1    2258
...................
21005 3   4004 1981
21005 3   4005 3077
21005 3   4006 1371
21005 3   4007 1990
21005 3   4008 1869
21005 3   4009 934
21005 3   4010 413
21005 3   4011 2506
21005 3   4012 2948
...................
21005 3   4031 2266
...................

and I would like to create a summary table using table 1 and table 2 data

summary table

ID    day start end  min max  mean    median sum
21005 3   4004  4012 413 3077 1898.78 1981   17089
................................................

basically summary table = calculate the min max mean median and sum of concentration using start and end time in table 1, by ID and day. I have no idea about how to do this ...

I tried to merge table 2 to table 1 by id and day, then calculate the descriptive stats, but there's multiple start and end for same day

svh · Posted 02-13-2020 08:11 PM

I think this is pretty do-able with PROC SQL because it is very good for producing summary statistics at different strata in the data set. However, it seems to me that you need an extra identifier in the first data set (which I've called endpoints) that can differentiate different periods of time within ID and day. I've made that extra variable and called it "Period" and it should be unique for every unique value of ID, day, and Start. Then you need to join the tables with a WHERE statement that categorizes the different times from the concentration data set in the periods defined in the endpoints data set.

I've added my own fake data to a section of the "Table 2" to illustrate how this might work:

data endpoints;
input ID $   day $ start end period;
cards;
21005 3   4004  4012 1
21005 3   4031  4055 2
;
run;
data concentration;
input ID $ day $ time concentration;
cards;
21005 3   4004 1981
21005 3   4005 3077
21005 3   4006 1371
21005 3   4007 1990
21005 3   4008 1869
21005 3   4009 934
21005 3   4010 413
21005 3   4011 2506
21005 3   4012 2948
21005 3   4031 222
21005 3   4032 444
21005 3   4033 666
;
run;

proc sql;
   create table want as
   select distinct a.ID, a.day, a.start, a.end, min(b.concentration) as min, max(b.concentration) as max, mean(b.concentration) as mean, median(b.concentration) as median, sum(b.concentration) as sum
   from endpoints a, concentration b
   where a.id = b.id and a.day = b.day and (a.start <= b.time <=a.end)
   group by a.id, a.day, a.period;
   quit; 
proc print;
run;

Calculate descriptive statistics by row using column data from other table

Re: Calculate descriptive statistics by row using column data from other table

Calculate descriptive statistics by row using column data from other table

Re: Calculate descriptive statistics by row using column data from other table

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!