The code below works fine to to link the nearest geographic points given the condition race, gender and age match. However, it keeps crashing at larger counties such as 059. Is there anyway not to create a physical data &dist? Interim dist data seems to take up lot of space since it's involved with cartesian product.
I appreciate your help.
%macro models(dist,county,data);
proc sql;
create table &dist as
select t1.*,
t1.latitude as x1_registry,t1.longitude as y1_registry,
t2.latitude as x2_synthetic,t2.longitude as y2_synthetic,
t2.sp_hh_id, t2.sp_id_p,
t2.agecat as s_agecat, t2.racecat as s_racecat,
t2.hh_income, t2.sex as s_sex, t1.agecat as c_agecat,
t1.racecat as c_racecat, t1.sex as c_sex,
geodist(x1_registry, y1_registry, x2_synthetic, y2_synthetic) as distance
from reg_pancreas4(where=(county=&county)) t1 join p.synpop3(where=(county=&county)) t2
on t1.agecat=t2.agecat and t1.sex=t2.sex and t1.racecat=t2.racecat
order by display_id, calculated distance;
quit;
/*hh id*/
data p.&data(compress=yes);
array mag(11000) $;
set &dist;
retain i needone mag:;
by display_id;
if _n_ eq 1 then i=0;
if first.display_id then needone=1;
if needone then do;
if sp_hh_id not in mag then do;
i+1;
needone=0;
mag(i)=sp_hh_id;
output;
end;
end;
drop mag: needone i;
run;
%mend models;
%models(dist059,'059',county059);
You should modify your macro first.
1. In the first SQL query, create the table with a fixed name. You don't need it past that step so delete it at the end of the step using PROC DATASETS.
2. Remove the parameters for dist and data, you don't really need them. Add one for SEX.
3. At the end of your macro (before %MEND) add the PROC APPEND to append the data sets
4. After the Append, delete any data set you don't need after the loop finishes. Makes sure that the master data set is empty. If you run it a few times you'll want to delete it first and then recreate it for the full run or you may end up with duplicates.
To call the macro for each combination, see the code below. It's verbose on purpose, it could be done in a single data step so you can follow each step.
1. Create list of all gender/sex
2. Make a string that looks like your macro call
3. Execute macro using CALL EXECUTE and passing the string from #2 to the function.
As for deleting data sets - I'll leave that to you. untested:
*assumes this table has all the county/sex combinations needed;
proc sql;
create table report_list as
select distinct county, sex
from p.synpop3
order by 1, 2;
quit;
*create string variable;
data call_macro;
set report_list;
models(county,data)
str = catt('%models(',
county,
', '
sex,
');');
run;
*execute macro;
data _null_;
set call_macro (obs=2);
call execute(str);
run;
I'm trying to stratify further by sex: (where=(county=&county and sex='1')). I'll post how it went
Why not simply create a view instead?
Any hints for how? I will use the resulting datasets for modelling eventually
How many points do you have? Are those latitude and longitude specific to anything?
I did this as part of my MSc thesis (over a decade ago) and I used a double set with the distance matrix already created to quickly find the nearest neighbor. It's possible this or a hash solution would work as well. Both are beyond my current SAS skills but perhaps someone else can help with that approach if you think it may work. This will only work if the distance matrix/table is a fixed one - that you can consistently reference.
If it's zip codes, maybe you can download the distance tables here and just use as a lookup rather than calculation?
http://www.nber.org/data/zip-code-distance-database.html
And/or consider QGIS. It's designed for this type of analysis.
And a last thought...can you filter the data sets you're creating distances for such that you only go X KM outside state and then use GINSIDE or something to specifically limit the points you're using. The odds of a NY zip for example being the closest to Hawaii is 0. Or a distance limit of some sort to drop irrelevant results.
Hi Reeza, Nassau county (NY) has 1,339,532 people and I'm running by gender which halves the size. reg_panreas data has 3500 people needed to be linked to the people in Nassau county by nearest geodist. both datasets have lat and long of individual residential homes. This method is more intuitive to me since I have no time to learn supposedly great methods you mentioned here. It just finished running without crashing as I write this with gender stratified. I guess I just have to keep cleaning the SAS temp folder and concatenate all stratified datasets later.
@Cruise wrote:
Hi Reeza, Nassau county (NY) has 1,339,532 people and I'm running by gender which halves the size. reg_panreas data has 3500 people needed to be linked to the people in Nassau county by nearest geodist. both datasets have lat and long of individual residential homes. This method is more intuitive to me since I have no time to learn supposedly great methods you mentioned here. It just finished running without crashing as I write this with gender stratified. I guess I just have to keep cleaning the SAS temp folder and concatenate all stratified datasets later.
Fair enough. Use CALL EXECUTE to call the macro for all your age/gender stratification levels and add a step to drop temporary data sets that you don't need. If you can append the results as you go through it's probably a good idea, then less smaller data sets floating around.
You should modify your macro first.
1. In the first SQL query, create the table with a fixed name. You don't need it past that step so delete it at the end of the step using PROC DATASETS.
2. Remove the parameters for dist and data, you don't really need them. Add one for SEX.
3. At the end of your macro (before %MEND) add the PROC APPEND to append the data sets
4. After the Append, delete any data set you don't need after the loop finishes. Makes sure that the master data set is empty. If you run it a few times you'll want to delete it first and then recreate it for the full run or you may end up with duplicates.
To call the macro for each combination, see the code below. It's verbose on purpose, it could be done in a single data step so you can follow each step.
1. Create list of all gender/sex
2. Make a string that looks like your macro call
3. Execute macro using CALL EXECUTE and passing the string from #2 to the function.
As for deleting data sets - I'll leave that to you. untested:
*assumes this table has all the county/sex combinations needed;
proc sql;
create table report_list as
select distinct county, sex
from p.synpop3
order by 1, 2;
quit;
*create string variable;
data call_macro;
set report_list;
models(county,data)
str = catt('%models(',
county,
', '
sex,
');');
run;
*execute macro;
data _null_;
set call_macro (obs=2);
call execute(str);
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.