Solved: Re: Geodist and insufficient memory

Cruise · Posted 09-12-2018 04:56 PM

The code below works fine to to link the nearest geographic points given the condition race, gender and age match. However, it keeps crashing at larger counties such as 059. Is there anyway not to create a physical data &dist? Interim dist data seems to take up lot of space since it's involved with cartesian product.

I appreciate your help.


%macro models(dist,county,data);
proc sql;
  create table &dist as
    select t1.*,
    t1.latitude as x1_registry,t1.longitude as y1_registry,
    t2.latitude as x2_synthetic,t2.longitude as y2_synthetic,
    t2.sp_hh_id, t2.sp_id_p,
    t2.agecat as s_agecat, t2.racecat as s_racecat,
    t2.hh_income, t2.sex as s_sex, t1.agecat as c_agecat,
    t1.racecat as c_racecat, t1.sex as c_sex,
    geodist(x1_registry, y1_registry, x2_synthetic, y2_synthetic) as distance
      from reg_pancreas4(where=(county=&county)) t1 join p.synpop3(where=(county=&county)) t2
       on t1.agecat=t2.agecat and t1.sex=t2.sex and t1.racecat=t2.racecat
       order by display_id, calculated distance;
quit;

/*hh id*/
data p.&data(compress=yes);
  array mag(11000) $;
  set &dist;
  retain i needone mag:;
  by display_id;
  if _n_ eq 1 then i=0;
  if first.display_id then needone=1;
  if needone then do;
    if sp_hh_id not in mag then do;
      i+1;
      needone=0;
      mag(i)=sp_hh_id;
      output;
    end;
  end;
drop mag: needone i;
run; 

%mend models;
%models(dist059,'059',county059);

Reeza · Posted 09-12-2018 05:39 PM

You should modify your macro first.

1. In the first SQL query, create the table with a fixed name. You don't need it past that step so delete it at the end of the step using PROC DATASETS.

2. Remove the parameters for dist and data, you don't really need them. Add one for SEX.

3. At the end of your macro (before %MEND) add the PROC APPEND to append the data sets

4. After the Append, delete any data set you don't need after the loop finishes. Makes sure that the master data set is empty. If you run it a few times you'll want to delete it first and then recreate it for the full run or you may end up with duplicates.

To call the macro for each combination, see the code below. It's verbose on purpose, it could be done in a single data step so you can follow each step.

1. Create list of all gender/sex

2. Make a string that looks like your macro call

3. Execute macro using CALL EXECUTE and passing the string from #2 to the function.

As for deleting data sets - I'll leave that to you. untested:

*assumes this table has all the county/sex combinations needed;
proc sql;
create table report_list as
select distinct county, sex
from p.synpop3
order by 1, 2;
quit;

*create string variable;
data call_macro;
set report_list;
models(county,data)
str = catt('%models(',
               county, 
              ', '
              sex,
             ');');
run;

*execute macro;
data _null_;
set call_macro (obs=2);
call execute(str);
run;

View solution in original post

Cruise · Posted 09-12-2018 04:59 PM

I'm trying to stratify further by sex: (where=(county=&county and sex='1')). I'll post how it went

PeterClemmensen · Posted 09-12-2018 05:05 PM

Why not simply create a view instead?

The DATA to DATA Step Macro
Blog: SASnrd

Cruise · Posted 09-12-2018 05:08 PM

@PeterClemmensen

Any hints for how? I will use the resulting datasets for modelling eventually

Reeza · Posted 09-12-2018 05:09 PM

How many points do you have? Are those latitude and longitude specific to anything?

I did this as part of my MSc thesis (over a decade ago) and I used a double set with the distance matrix already created to quickly find the nearest neighbor. It's possible this or a hash solution would work as well. Both are beyond my current SAS skills but perhaps someone else can help with that approach if you think it may work. This will only work if the distance matrix/table is a fixed one - that you can consistently reference.

If it's zip codes, maybe you can download the distance tables here and just use as a lookup rather than calculation?

http://www.nber.org/data/zip-code-distance-database.html

And/or consider QGIS. It's designed for this type of analysis.

And a last thought...can you filter the data sets you're creating distances for such that you only go X KM outside state and then use GINSIDE or something to specifically limit the points you're using. The odds of a NY zip for example being the closest to Hawaii is 0. Or a distance limit of some sort to drop irrelevant results.

Cruise · Posted 09-12-2018 05:21 PM

Hi Reeza, Nassau county (NY) has 1,339,532 people and I'm running by gender which halves the size. reg_panreas data has 3500 people needed to be linked to the people in Nassau county by nearest geodist. both datasets have lat and long of individual residential homes. This method is more intuitive to me since I have no time to learn supposedly great methods you mentioned here. It just finished running without crashing as I write this with gender stratified. I guess I just have to keep cleaning the SAS temp folder and concatenate all stratified datasets later.

Reeza · Posted 09-12-2018 05:25 PM

@Cruise wrote:

Hi Reeza, Nassau county (NY) has 1,339,532 people and I'm running by gender which halves the size. reg_panreas data has 3500 people needed to be linked to the people in Nassau county by nearest geodist. both datasets have lat and long of individual residential homes. This method is more intuitive to me since I have no time to learn supposedly great methods you mentioned here. It just finished running without crashing as I write this with gender stratified. I guess I just have to keep cleaning the SAS temp folder and concatenate all stratified datasets later.

Fair enough. Use CALL EXECUTE to call the macro for all your age/gender stratification levels and add a step to drop temporary data sets that you don't need. If you can append the results as you go through it's probably a good idea, then less smaller data sets floating around.

Cruise · Posted 09-12-2018 05:26 PM

Reeza, please show me in the code, would you?

Reeza · Posted 09-12-2018 05:39 PM