
# Comparing 2 datasets

I have two data sets that are geocoded to the "small area" level, one covering 2003-2007 and the other 2008-2012.
The two sets were geocoded in different ways. I'm trying to concatenate them and see whether death rates differ between the two periods.
I have this code:
data mort03_07;
set data.mortsarea99_09_newrace;
if NMRes=1;
if 2003<=year<=2007;
sarea134=sarea133;
if sarea133=100 then sarea134=99;

geo=2; *not geocoded;
if 1<=sarea134<=108 then geo=1; *geocoded;

run;
data mort08_12;
set final.death99_13geo_14ungeo_ibis_std;
if NMRes=1;
if 2008<=year<=2012;

geo=2; *not geocoded;
if 1<=sarea134<=108 then geo=1; *geocoded;

run;

proc summary data=mort03_07;
var x geo;
class fipscode;
output out=numgeo1 sum(geo)=numgeo sum(x)=totnum;
run;

proc summary data=mort08_12;
var x geo;
class fipscode;
output out=numgeo2 sum(geo)=numgeo sum(x)=totnum;
run;

data numge01;
set numgeo1;
period=1;
run;
data numge02;
set numgeo2;
period=2;
run;

data numgeo;
set numge01 numge02;
geopct=numgeo/totnum;
run;

proc print data=numgeo1; title 'geocoded, period1';
proc print data=numgeo2; title 'geocoded, period2';
proc print data=numgeo; title 'geocoded, both periods';
run;

When I run proc print data=numgeo; title 'geocoded, both periods';
I get geopct > 1 (because numgeo > totnum).
I'm not sure what I'm doing wrong here.
Thank you,
Ruzeina

Accepted Solutions
Solution
‎03-22-2016 11:19 AM
Posts: 1,256

## Re: Comparing 2 datasets

Hi @mayasak,

Without sample data it's a bit hard to say, but my first guess is that the coding of variable GEO is not ideal: if NUMGEO is meant to be the number of geocoded items, the code for "not geocoded" should be 0, not 2. With the current coding, each non-geocoded record adds 2 to SUM(GEO), so NUMGEO is likely to be too large, leading to incorrectly large values of GEOPCT, possibly GEOPCT>1, as you've observed.
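To illustrate the suggestion, here is a minimal sketch on made-up data. The data set names TOY and NUMGEO_TOY and the five records are hypothetical, and I am assuming X is simply a per-record count of 1, since the posted code does not show how X is defined. A SAS comparison returns 1 when true and 0 when false, so GEO can be assigned directly from the range check.

```sas
/* Hypothetical toy data: one county, 3 geocoded and 2 non-geocoded deaths */
data toy;
  input fipscode sarea134;
  geo = (1 <= sarea134 <= 108);  /* 1 = geocoded, 0 = not geocoded */
  x = 1;                         /* assumption: X counts each record once */
  datalines;
1 5
1 17
1 108
1 .
1 999
;
run;

/* Same summary as in the question; NWAY keeps only the per-county rows */
proc summary data=toy nway;
  class fipscode;
  var x geo;
  output out=numgeo_toy sum(geo)=numgeo sum(x)=totnum;
run;

data numgeo_toy;
  set numgeo_toy;
  geopct = numgeo / totnum;  /* proportion geocoded, now between 0 and 1 */
run;

proc print data=numgeo_toy;
  title 'geocoded, toy data';
run;
```

With the 0/1 coding this should give NUMGEO=3, TOTNUM=5 and GEOPCT=0.6. With the original 1/2 coding, the same five records would sum to 3*1 + 2*2 = 7, hence GEOPCT = 7/5 = 1.4.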

All Replies
Contributor
Posts: 35

## Re: Comparing 2 datasets

Yes, you're right. We're dealing with counts here. Thanks a lot. I have another question, if you don't mind. As I said, I need to see whether death rates in small areas differ because of the different geocoding in the two data sets (period 1, years 2003-2007, and period 2, years 2008-2012). Do you have any thoughts on how I can do that? So far I've calculated the percentage of geocoded records in each county (not small area) and run ANOVA tests with "period" and "sarea" as independent variables and "cause of death" as the dependent variable (I couldn't include the interaction terms because there were 0 degrees of freedom for error).

Thank you.

Posts: 1,256

## Re: Comparing 2 datasets

As this is a completely different question, it will be better if you open a new thread for it. To do this, you should select a different forum within the SAS Support Communities: Analytics --> SAS Statistical Procedures.

There you will attract a more targeted audience. Also, it will be helpful to describe your data a little more (types of variables and their meaning). I am not familiar with geocoding and its implications for epidemiological research questions.
