I have a large dataset that has includes values of a chemical concentration, the date of testing, the site of testing, and a lot of other variable not immediately relevant to this question. Some of these values are from samples taken on the same day; sometimes up to ~100 tests/day. So, for example, you end up with a data set that looks something like this:
SiteID SampleDate Concentration
1 1/1/2000 1
1 1/1/2000 2
1 1/1/2000 3
1 1/1/2001 1
1 1/1/2002 1
1 1/1/2003 1
2 1/1/2000 1
2 1/1/2000 2
2 1/1/2001 1
2 1/1/2001 2
3 1/1/2000 1
3 1/1/2002 1
4 1/1/2001 1
4 1/1/2002 1
4 1/1/2003 1
4 1/1/2003 2
4 1/1/2004 1
5 1/1/2003 1
In subsequent analyses I just need the median value from each day, so I am trying to 1) determine the median concentration per site per day and 2) create a new output data set that uses the calculated median value rather than the original concentration and displays the number of samples taken on the same date; using the example data above, something like this:
SiteID SampleDate Med_Conc SampleN
1 1/1/2000 2 3
1 1/1/2001 1 1
1 1/1/2002 1 1
1 1/1/2003 1 1
2 1/1/2000 1.5 2
2 1/1/2001 1.5 2
3 1/1/2000 1 1
3 1/1/2002 1 1
4 1/1/2001 1 1
4 1/1/2002 1 1
4 1/1/2003 1.5 2
4 1/1/2004 1 1
5 1/1/2003 1 1
I'm at a loss on how to do this, so any help would be greatly appreciated.