06-25-2016 03:43 AM
I have some data, which I need to aggregate from a household level to district level. This I have done with proc summary. The trouble is that I need one of the original variables when I'm doing my regression: The standard errors are clustered at the household level. I'm using the proc surveyreg to do the regression and I need to specify the cluster to be at the household level. This variable is not available as I have aggregated the information. How can I do my regression?
I'm using SAS 9.4.
06-25-2016 04:54 AM
You could remerge your aggregated data back with the original. That would repeat the group-level data for every member of the group, but from what I understand that is what you're after.
In many cases PROC SQL would do that for you automatically, leading to the famous note
NOTE: The query requires remerging summary statistics back with the original data.
Nothing is keeping you from joining the summary dataset back with its origin:
proc sql;
select h.household_id, h.district_id, h.income, s.income_avg
from household h, summary s
where h.district_id = s.district_id;
quit;
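For the automatic remerge mentioned above, a minimal sketch could look like the following. It assumes the same illustrative names as the join (a dataset household with household_id, district_id and income); the output table name household_remerged is hypothetical. Selecting non-grouped columns alongside a summary function is what makes PROC SQL remerge and write the note to the log.

```sas
/* Sketch only: dataset, table and variable names are assumptions for illustration */
proc sql;
  create table household_remerged as
  select household_id,
         district_id,
         income,
         mean(income) as income_avg  /* district mean, repeated on every household row */
  from household
  group by district_id;
quit;
```

The result keeps one row per household, so the household identifier is still available for later steps.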
06-25-2016 05:23 AM
But there are only about 600 observations (districts) in the new data set and, let's say, 50,000 observations in the old one (households). I can't see how these two could be merged.
06-25-2016 06:50 AM
No, they will not be merged 1:1, obviously. But there is a 1:n relationship that will easily be merged with the code I suggested. The result would have the same 50,000 household records that you started with. The district data will be repeated.
If that does not meet your requirements, then please give us examples of the data you have and the result you want.
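Once the district-level values sit on the household rows, the clustered regression could be sketched roughly as below. This assumes the remerged result was saved as a table, hypothetically named household_remerged, and the MODEL statement is only a placeholder for your actual dependent and independent variables.

```sas
/* Sketch only: table name and model variables are placeholders */
proc surveyreg data=household_remerged;
  cluster household_id;        /* standard errors clustered at the household level */
  model income = income_avg;   /* replace with your own regression specification */
run;
```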
06-26-2016 04:03 PM
I'm not a statistician, so I'm probably out of bounds here.
But why can't you do your analysis on the original data, where you have the necessary variables?
If you need to go with the aggregated data, how would you expect to match the cluster variable to the district? There are surely multiple households per district, otherwise it wouldn't be a problem. You need a business rule to choose the proper cluster level, or else it will be random.