
How to aggregate data and still keep some of the original information

Pejak — Sat, 25 Jun 2016 07:43:02 GMT

I have some data that I need to aggregate from the household level to the district level, which I have done with PROC SUMMARY. The trouble is that I still need one of the original variables for my regression: the standard errors are clustered at the household level. I'm using PROC SURVEYREG for the regression and need to specify the household as the cluster. That variable is no longer available because I have aggregated the data. How can I run my regression?

I'm using SAS 9.4.

Re: How to aggregate data and still keep some of the original information

jklaverstijn — Sat, 25 Jun 2016 08:54:41 GMT

You could remerge your aggregated data back with the original. That would repeat the group-level data for every member of the group, but from what I understand that is what you're after.

In many cases PROC SQL would do that for you automatically, leading to the famous note:

NOTE: The query requires remerging summary statistics back with the original data.
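As a minimal sketch of that automatic remerge, assuming a household-level table named household with the columns household_id, district_id and income (names used for illustration in this thread, not taken from the poster's real data):

```sas
proc sql;
   /* household_id and income are neither grouped nor summarized,   */
   /* so PROC SQL remerges mean(income) onto every household row    */
   /* and writes the remerging NOTE to the log.                     */
   create table household_plus as
   select household_id, district_id, income,
          mean(income) as income_avg
     from household
    group by district_id;
quit;
```

The output keeps one row per household, with the district mean repeated within each district.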

Nothing is keeping you from joining the summary dataset back with its origin:

proc sql;
   /* 1:n join: each household row picks up its district's summary value */
   select h.household_id, h.district_id, h.income, s.income_avg
     from household h, summary s
    where h.district_id = s.district_id;
quit;
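Once the district-level value sits on every household row, the household identifier is still there for the CLUSTER statement. A rough sketch, assuming the joined result is stored as household_plus and that the regression has a dependent variable called outcome (both names are hypothetical, as is the choice of income_avg as the only regressor):

```sas
proc sql;
   /* store the remerged rows instead of just printing them */
   create table household_plus as
   select h.*, s.income_avg
     from household h left join summary s
       on h.district_id = s.district_id;
quit;

proc surveyreg data=household_plus;
   cluster household_id;          /* cluster at the household level      */
   model outcome = income_avg;    /* outcome is a hypothetical variable  */
run;
```

The left join keeps every household even if a district were missing from the summary table; whether that is the right behavior depends on your data.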

Re: How to aggregate data and still keep some of the original information

Pejak — Sat, 25 Jun 2016 09:23:26 GMT

But there are only about 600 observations (districts) in the new data set and, let's say, 50,000 observations (households) in the old one. I can't see these two being merged.

Re: How to aggregate data and still keep some of the original information

jklaverstijn — Sat, 25 Jun 2016 10:50:25 GMT

Hi @Pejak

No, they will not be merged 1:1, obviously. But there is a 1:n relationship that can easily be merged with the code I suggested. The result would have the same 50,000 household records that you started with. The district data will be repeated.
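To make the 1:n behavior concrete, here is a small self-contained sketch with made-up data (five households in two districts; all values are invented for illustration). It uses a DATA step match-merge rather than PROC SQL, which should give the same shape of result:

```sas
/* toy household-level data, already in district_id order */
data household;
   input household_id district_id income;
   datalines;
1 10 100
2 10 200
3 20 300
4 20 500
5 20 700
;
run;

/* one row per district with the mean income */
proc summary data=household nway;
   class district_id;
   var income;
   output out=summary(drop=_type_ _freq_) mean=income_avg;
run;

/* match-merge: the single district row is repeated for each household */
data household_plus;
   merge household summary;
   by district_id;
run;
```

household_plus still has five rows, one per household; income_avg is the same for all households within a district.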

If that does not meet your requirements, then please give us examples of the data you have and the data you want.

regards,

- Jan.

Re: How to aggregate data and still keep some of the original information

LinusH — Sun, 26 Jun 2016 20:03:56 GMT

I'm not a statistician, so I'm probably out of bounds here.

But why can't you do your analysis on the original data, where you have the necessary variables?

If you need to go with the aggregated data, how would you expect to match the cluster variable to the district? There are surely multiple clusters per district, otherwise it wouldn't be a problem. You need a business rule to choose the proper cluster level, or else it will be random.