Solved
Contributor
Posts: 40

# Using hash tables for pairwise comparing

[ Edited ]

I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.

The following code works if I specify one serial number:

``````data test;
infile datalines delimiter = ',';
input serial lat long;
datalines;
1, 58.6732, 118.2569
1, 58.9852, 118.2568
1, 58.9865, 118.3156
1, 75.259, -52.0023
1, 75.2689, -52.125
2, 10.123, 15.256
2, 10.2365, 14.2698
2, 11.3569, 15.3489
3, 23.2569, 0.2659
3, 22.6789, 0.2788
3, 12.3569, 35.126
3, 12.3789, 34.679
;
run;

%let r_km = 6371; /*radius of the earth in km (3959 miles)*/

proc fcmp outlib = funcs.haversine;
function haversine(lat1, long1, lat2, long2); /*function name*/
/*function definition*/
dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 +
return(dist); /*what to return*/
endsub;
run;
options cmplib = funcs; /*needed to access the function*/

data test;
set test;
cnt = _n_;

/*pairwise compare distance*/
data results (drop = _rc dt);
/*set up hash table*/
if _n_ = 1 then do;
if 0 then
set test(rename = (lat = lat1 long = long2));
declare hash h1(dataset:"test(rename = (lat = lat1 long = long2, multidata&colon; 'y');
_rc = h1.definekey('cnt', 'serial');
_rc = h1.definedata('lat1', 'long1');
_rc = h1.definedone();
end;
if serial = /*first serial number*/ then accum + 1;/*but then the serial number needs to change, I think a macro wouldn't be ideal though*/
do _i = (cnt + 1) to (cnt + accum);
_rc = h1.find(key: _i, key: serial);
hdist = haversine(lat, long, lat1, long1);
output;
end;
run;``````
`` ``

Can this be generalized to run on whole dataset comparing distance only for the specific serial numbers, recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.

The gist is, I need to find the distance between gps location 1 with 2, 2 with 3, ... for serial number 1. Then when I get to serial number 2, I need to repeat this process and so on.

This is a rough sketch of the what I am trying to achieve. I added comments where I explained the intent but haven't figured out the process to achieve the desired results.

``````/*Tried setting up a macro for the serial loop but this doesn't work
Here is what I tried for that*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/

proc fcmp outlib = funcs.haversine;
function haversine(lat1, long1, lat2, long2); /*function name*/
/*function definition*/
dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 +
return(dist); /*what to return*/
endsub;
run;
options cmplib = funcs; /*needed to access the function*/

data test;
infile datalines delimiter = ',';
input serial lat long;
label serial = 'serial'
lat = 'lat'
long = 'long';
datalines;
1, 58.6732, 118.2569
1, 58.9852, 118.2568
1, 58.9865, 118.3156
1, 75.259, -52.0023
1, 75.2689, -52.125
2, 10.123, 15.256
2, 10.2365, 14.2698
2, 11.3569, 15.3489
3, 23.2569, 0.2659
3, 22.6789, 0.2788
3, 12.3569, 35.126
3, 12.3789, 34.679
;
run;

data test;
set test;
input cnt;
label cnt = 'cnt';
cnt = _n_;
run;

data serials (keep = serial);
set test;
run;

proc sort nodupkey data = serials;
by serial;
run;

data _null_;
set serials nobs = n;
call symput ('num', n);
run;

options mprint; /*full output for macro execution so you know something is happening*/

%macro serial_loop;
%local i sn;
%do i = 1 %to &num;
data _null_;
set serials (obs = &i firstobs = &i);
call symputx ('sn', serial);
run;
if serial = "&sn" then
accum + 1;
/*find the distance between the ith and ith + 1 coordinates for each gps location*/
do _i = (cnt + 1) to (cnt + accum);
_rc = h1.find(key: _i, key: ser_no);
if _rc = 0 then do;
hdist = haversine(lat, long, lat1, long1);
output;
end;
end;
%end;
%mend serial_loop;
data results (drop = _rc);
/*set up hash table*/
set test;
if _n_ = 1 then do;
if 0 then
set test(rename = (lat = lat1 long = long1));
declare hash h1(dataset: "test(rename = (lat = lat1
long = long1))",
multidata&colon; 'y');
_rc = h1.definekey('cnt', 'serial');
_rc = h1.definedata('lat', 'long');
_rc = h1.definedone();
end;
%serial_loop;
run;

ERROR: Variable lat is not on file WORK.TEST.``````

Accepted Solutions
Solution
‎03-08-2016 10:09 AM
Contributor
Posts: 40

## Re: Using hash tables for pairwise comparing

I have answered my own question. I used SAS' geodist for the example solution.

``````data test;
infile datalines delimiter = ',';
input serial lat long date;
format date date9.;
datalines;
1, 58.6732, 118.2569, 20225
1, 58.9852, 118.2568, 20226
1, 58.9865, 118.3156, 20227
1, 75.259, -52.0023,  20228
1, 75.2689, -52.125,  20229
2, 10.123, 15.256,    22111
2, 10.2365, 14.2698,  22112
2, 11.3569, 15.3489,  22113
3, 23.2569, 0.2659,   21256
3, 22.6789, 0.2788,   21257
3, 12.3569, 35.126,   21258
3, 12.3789, 34.679,   21259
;
run;

data test;
set test;
i = _n_;
run;

proc sql _method;
create table results (drop = sn date1 j)
as select
a.*,
b.serial as sn label = 'sn',
b.lat as lat1 label = 'lat1',
b.long as long1 label = 'long1',
b.date as date1 label = 'date1',
b.i as j,
geodist(a.lat, a.long, b.lat, b.long) as gdist
from test as a
inner join test as b
on a.serial = b.serial
where a.i > b.i;
quit;

proc sort results;
by serial date;
run;``````

All Replies
Posts: 3,852

## Re: Using hash tables for pairwise comparing

Guessing I would say yes.  I think you will need to supply data to go with your code get anyone interested.  Also show what the output might look like for a "new" number where process is started over.

Contributor
Posts: 40

## Re: Using hash tables for pairwise comparing

I added a sample data set

Super User
Posts: 13,563

## Re: Using hash tables for pairwise comparing

dwsmith wrote:

I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.

Are you comparing this only to other items with the same serial number?

I am not sure what this means at all:

``recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.``

It will really help if you can provide an example data set with a small number of records, say 5 to 10 and the minimum number of variables to work with and what the output should look like. What little I think I understand might be done with a reflexive join in Proc Sql since it looks like you are only using one data set.

Contributor
Posts: 40

## Re: Using hash tables for pairwise comparing

I added a sample data set.

Solution
‎03-08-2016 10:09 AM
Contributor
Posts: 40

## Re: Using hash tables for pairwise comparing

I have answered my own question. I used SAS' geodist for the example solution.

``````data test;
infile datalines delimiter = ',';
input serial lat long date;
format date date9.;
datalines;
1, 58.6732, 118.2569, 20225
1, 58.9852, 118.2568, 20226
1, 58.9865, 118.3156, 20227
1, 75.259, -52.0023,  20228
1, 75.2689, -52.125,  20229
2, 10.123, 15.256,    22111
2, 10.2365, 14.2698,  22112
2, 11.3569, 15.3489,  22113
3, 23.2569, 0.2659,   21256
3, 22.6789, 0.2788,   21257
3, 12.3569, 35.126,   21258
3, 12.3789, 34.679,   21259
;
run;

data test;
set test;
i = _n_;
run;

proc sql _method;
create table results (drop = sn date1 j)
as select
a.*,
b.serial as sn label = 'sn',
b.lat as lat1 label = 'lat1',
b.long as long1 label = 'long1',
b.date as date1 label = 'date1',
b.i as j,
geodist(a.lat, a.long, b.lat, b.long) as gdist
from test as a
inner join test as b
on a.serial = b.serial
where a.i > b.i;
quit;

proc sort results;
by serial date;
run;``````
Super User
Posts: 13,563

## Re: Using hash tables for pairwise comparing

And that solution uses a reflexive join ...

Contributor
Posts: 40

## Re: Using hash tables for pairwise comparing

I am not sure if this solution is good for a data set with 15million rows though.

🔒 This topic is solved and locked.