Pejak
Calcite | Level 5

I have some data which I need to aggregate from the household level to the district level. I have done this with PROC SUMMARY. The trouble is that I need one of the original variables when I run my regression: the standard errors are clustered at the household level. I'm using PROC SURVEYREG for the regression, and I need to specify the cluster at the household level. That variable is no longer available because I have aggregated the data. How can I do my regression?

 

 

I'm using SAS 9.4.

4 REPLIES
jklaverstijn
Rhodochrosite | Level 12

You could remerge your aggregated data back with the original. That would repeat the group-level data for every member of the group, but from what I understand that is what you're after.

 

In many cases PROC SQL would do that for you automatically, leading to the famous note

 

NOTE: The query requires remerging summary statistics back with the original data.
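As a minimal sketch of the automatic remerge, assuming a dataset named household with the columns household_id, district_id and income (names borrowed from the join in this reply; the output table name remerged is hypothetical), a query like the following selects detail columns next to a summary function and should produce that note in the log:

```sas
/* Assumption: HOUSEHOLD has one row per observation with
   household_id, district_id and income. REMERGED is a hypothetical name. */
proc sql;
   create table remerged as
   select household_id,
          district_id,
          income,
          mean(income) as income_avg   /* district mean, repeated on every row */
     from household
        group by district_id;          /* non-grouped columns in SELECT trigger the remerge */
quit;
```

Every original row is kept, and income_avg carries the district-level value on each of them.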

Nothing is keeping you from joining the summary dataset back with its origin:

 

proc sql;
   /* 1:n join: every household row picks up the summary values of its district */
   select h.household_id, h.district_id, h.income, s.income_avg
     from household h, summary s
        where h.district_id=s.district_id;
quit;
Pejak
Calcite | Level 5

But there are only about 600 observations (districts) in the new data set and, let's say, 50,000 observations in the old one (households). I can't see how these two could be merged.

jklaverstijn
Rhodochrosite | Level 12

Hi @Pejak

 

No, they will obviously not be merged 1:1. But there is a 1:n relationship that can easily be merged with the code I suggested. The result would have the same 50,000 household records that you started with, with the district data repeated for every household in the district.
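To make that concrete, here is a rough sketch of how the remerged data could feed the regression. It assumes the household and summary datasets from the join above, plus a hypothetical dependent variable called outcome on the household data; the table name remerged and the model itself are only placeholders for whatever the real analysis uses.

```sas
/* Assumptions: HOUSEHOLD and SUMMARY exist as in the join above;
   OUTCOME is a hypothetical dependent variable on HOUSEHOLD. */
proc sql;
   create table remerged as
   select h.household_id, h.district_id, h.outcome, s.income_avg
     from household h, summary s
        where h.district_id = s.district_id;
quit;

proc surveyreg data=remerged;
   cluster household_id;          /* standard errors clustered by household */
   model outcome = income_avg;    /* district-level regressor, repeated per row */
run;
```

Because the join keeps household_id on every row, the CLUSTER statement has the variable it needs while the district-level values are still available as regressors.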

 

If that does not meet your requirements, then please give us examples of the data that you have and the result that you want.

 

regards,

- Jan.

LinusH
Tourmaline | Level 20

I'm not a statistician, so I'm probably out of bounds here.

But why can't you do your analysis on the original data, where you have the necessary variables?

If you need to go for the aggregated data, how would you expect to match the cluster variable to the district? There are surely multiple households per district, otherwise it wouldn't be a problem. You need a business rule to choose the proper cluster level, or else it will be random.

Data never sleeps
