Matching people if geographically closest in distance and some other d...

Cruise · Posted 11-20-2017 05:07 PM

Matching people if geographically closest in distance and some other demographics:

I have two datasets 'one' and 'two'.

Dataset 'one' has 20 million people and their location (latitude and longitude), gender and race/ethnicity (synthetic data).

Dataset 'two' has 120,000 people and their location (latitude and longitude), gender and race/ethnicity.

Dataset 'one' includes dataset 'two'. I'd like take as 'matched' enough if following conditions met:

- Shortest distance between individual in 'one' and 'two' datasets and

- if gender and race/ethnicity matches.

Below code works fine. However, would you agree if working code does the job true to my logic? Please help if my program would narrow down to cases as matched as explained above?

proc sql;
  create table want as
    select  
      s.longitude*1 as s_long,
      s.latitude*1  as s_lat,
      c.longitude*1 as c_long,
      c.latitude*1  as c_lat,
      c.sex         as c_sex,
      s.sex         as s_sex,
      geodist(calculated s_lat, calculated s_long, 
      calculated c_lat, calculated c_long) as distance 
    from two c inner join one s
	on c.sex=s.sex and c.race=s.race
	group by c.uniq_id 
    having min(calculated distance)=calculated distance;
quit;

Reeza · Posted 11-20-2017 05:19 PM

You're about to do, 20, 000, 000 X 120, 000 = 2.4 Trillion calculations. Do you have the resources to handle that many calculations?

art297 · Posted 11-20-2017 05:44 PM

Like @Reeza mentioned,if your system has sufficient resources to accomplish the task, since dataset one contains dataset two, won't you be finding each individual's own record in the other dataset? I think you need a condition (and, obviously, the data fields in each file) to exclude such matches.

Art, CEO, AnalystFinder.com

mkeintz · Posted 11-21-2017 12:26 AM

As @reeeza points out, your program will do 2.4 trillion comparisons.

But you can reduce that by a lot. Let's say you have 2 sexes and 4 races, i.e. 8 combinations. You could first divide dataset ONE and dataset TWO into 8 datasets each. Then you could run 8 sql CREATE TABLE statements on much smaller datasets.

In fact, if you had 8 equally sized combinations of race and sex, you would only need to perform 8*(120,000/8)*(20,000,000/8)=20,000,000 calculations.

Then you could deal with @art297's concern by removing the dataset TWO records from dataset ONE before running the proc sql. (Do you have a unique ID variable?).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 11-21-2017 10:35 AM

Thanks a lot guys.

I totally agree with you on the need to stratify the data for multiple sql sessions. @art297 and @mkeintz I have no issue of exact duplicates between two datasets. Because, one of datasets is the National Synthetic Population 2010 dataset by RTI https://www.rti.org/impact/synthpop where they created a synthetic dot for every person. However, those dots are spatially spread randomly within the block group but not intended to represent the exact location of each individuals. My goal is to point out to the synthetic 'dot-person' as it matched to actual human in my data by the nearest geographical distance in between while matching on some demographics.

Hope I wrote clear enough!

Cruise · Posted 11-21-2017 04:22 PM

Yes Mkeintz, both data has unique individual identifiers

Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Catch up on SAS Innovate 2026

Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Re: Matching people if geographically closest in distance and some other demographics

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away