DATA Step, Macro, Functions and more

Using hash tables for pairwise comparing

Accepted Solution Solved
Reply
Contributor
Posts: 40
Accepted Solution

Using hash tables for pairwise comparing

[ Edited ]

I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.

 

The following code works if I specify one serial number:

data test;
 infile datalines delimiter = ',';
 input serial lat long;
 datalines;
 1, 58.6732, 118.2569
 1, 58.9852, 118.2568
 1, 58.9865, 118.3156
 1, 75.259, -52.0023
 1, 75.2689, -52.125
 2, 10.123, 15.256
 2, 10.2365, 14.2698
 2, 11.3569, 15.3489
 3, 23.2569, 0.2659
 3, 22.6789, 0.2788
 3, 12.3569, 35.126
 3, 12.3789, 34.679
 ;
run;

%let rad = constant('pi') / 180; /*degrees to radians*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/

proc fcmp outlib = funcs.haversine;
 function haversine(lat1, long1, lat2, long2); /*function name*/
 /*function definition*/
 dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 + 
 cos(lat1 * &rad) * cos(lat2 * &rad) * sin((long2 - long1) * &rad / 2)**2))), 0.000001);
 return(dist); /*what to return*/
 endsub;
run;
options cmplib = funcs; /*needed to access the function*/

data test;
 set test;
 cnt = _n_;

	/*pairwise compare distance*/
	data results (drop = _rc dt);
		/*set up hash table*/
		if _n_ = 1 then do;
			if 0 then 
				set test(rename = (lat = lat1 long = long2));
			declare hash h1(dataset:"test(rename = (lat = lat1 long = long2, multidata: 'y');
			_rc = h1.definekey('cnt', 'serial');
			_rc = h1.definedata('lat1', 'long1');
			_rc = h1.definedone();
		end;
		if serial = /*first serial number*/ then accum + 1;/*but then the serial number needs to change, I think a macro wouldn't be ideal though*/
		do _i = (cnt + 1) to (cnt + accum);
			_rc = h1.find(key: _i, key: serial);
			hdist = haversine(lat, long, lat1, long1);
			output;
		end;
	run;
 

Can this be generalized to run on whole dataset comparing distance only for the specific serial numbers, recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.

 

The gist is, I need to find the distance between gps location 1 with 2, 2 with 3, ... for serial number 1. Then when I get to serial number 2, I need to repeat this process and so on.

 

This is a rough sketch of the what I am trying to achieve. I added comments where I explained the intent but haven't figured out the process to achieve the desired results.

 

/*Tried setting up a macro for the serial loop but this doesn't work
  Here is what I tried for that*/
%let rad = constant('pi') / 180; /*degrees to radians*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/

proc fcmp outlib = funcs.haversine;
 function haversine(lat1, long1, lat2, long2); /*function name*/
 /*function definition*/
 dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 + 
 cos(lat1 * &rad) * cos(lat2 * &rad) * sin((long2 - long1) * &rad / 2)**2))), 0.000001);
 return(dist); /*what to return*/
 endsub;
run;
options cmplib = funcs; /*needed to access the function*/

data test;
 infile datalines delimiter = ',';
 input serial lat long;
 label serial = 'serial'
 lat = 'lat'
 long = 'long';
 datalines;
 1, 58.6732, 118.2569
 1, 58.9852, 118.2568
 1, 58.9865, 118.3156
 1, 75.259, -52.0023
 1, 75.2689, -52.125
 2, 10.123, 15.256
 2, 10.2365, 14.2698
 2, 11.3569, 15.3489
 3, 23.2569, 0.2659
 3, 22.6789, 0.2788
 3, 12.3569, 35.126
 3, 12.3789, 34.679
 ;
run;

data test;
 set test;
 input cnt;
 label cnt = 'cnt';
 cnt = _n_;
run;

data serials (keep = serial);
 set test;
run;

proc sort nodupkey data = serials;
 by serial;
run;

data _null_;
 set serials nobs = n;
 call symput ('num', n);
run;

options mprint; /*full output for macro execution so you know something is happening*/

%macro serial_loop;
%local i sn;
%do i = 1 %to #
 data _null_;
 set serials (obs = &i firstobs = &i);
 call symputx ('sn', serial);
 run;
 if serial = "&sn" then
 accum + 1;
 /*find the distance between the ith and ith + 1 coordinates for each gps location*/
 do _i = (cnt + 1) to (cnt + accum);
 _rc = h1.find(key: _i, key: ser_no);
 if _rc = 0 then do;
 hdist = haversine(lat, long, lat1, long1);
 output;
 end;
 end;
%end;
%mend serial_loop;
data results (drop = _rc);
 /*set up hash table*/
 set test;
 if _n_ = 1 then do;
 if 0 then 
 set test(rename = (lat = lat1 long = long1));
 declare hash h1(dataset: "test(rename = (lat = lat1 
 long = long1))",
 multidata: 'y');
 _rc = h1.definekey('cnt', 'serial');
 _rc = h1.definedata('lat', 'long');
 _rc = h1.definedone();
 end;
 %serial_loop;
run;
 
ERROR: Variable lat is not on file WORK.TEST.

 

 


Accepted Solutions
Solution
‎03-08-2016 10:09 AM
Contributor
Posts: 40

Re: Using hash tables for pairwise comparing

I have answered my own question. I used SAS' geodist for the example solution.

 

data test;
 infile datalines delimiter = ',';
 input serial lat long date;
 format date date9.;
 datalines;
 1, 58.6732, 118.2569, 20225
 1, 58.9852, 118.2568, 20226
 1, 58.9865, 118.3156, 20227
 1, 75.259, -52.0023,  20228
 1, 75.2689, -52.125,  20229
 2, 10.123, 15.256,    22111
 2, 10.2365, 14.2698,  22112
 2, 11.3569, 15.3489,  22113
 3, 23.2569, 0.2659,   21256
 3, 22.6789, 0.2788,   21257
 3, 12.3569, 35.126,   21258
 3, 12.3789, 34.679,   21259
 ;
run;

data test;
	set test;
	i = _n_;
run;

proc sql _method;
	create table results (drop = sn date1 j) 
	as select
		a.*,
		b.serial as sn label = 'sn',
		b.lat as lat1 label = 'lat1',
		b.long as long1 label = 'long1',
		b.date as date1 label = 'date1',
		b.i as j,
		geodist(a.lat, a.long, b.lat, b.long) as gdist
	from test as a
	inner join test as b
	on a.serial = b.serial
	where a.i > b.i;
quit; 

proc sort results;
	by serial date;
run;

View solution in original post


All Replies
Respected Advisor
Posts: 3,777

Re: Using hash tables for pairwise comparing

Guessing I would say yes.  I think you will need to supply data to go with your code get anyone interested.  Also show what the output might look like for a "new" number where process is started over.

Contributor
Posts: 40

Re: Using hash tables for pairwise comparing

I added a sample data set

Super User
Posts: 10,497

Re: Using hash tables for pairwise comparing


dwsmith wrote:

I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.

 


Are you comparing this only to other items with the same serial number?

 

I am not sure what this means at all:

recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.

It will really help if you can provide an example data set with a small number of records, say 5 to 10 and the minimum number of variables to work with and what the output should look like. What little I think I understand might be done with a reflexive join in Proc Sql since it looks like you are only using one data set.

 

Contributor
Posts: 40

Re: Using hash tables for pairwise comparing

I added a sample data set.

Solution
‎03-08-2016 10:09 AM
Contributor
Posts: 40

Re: Using hash tables for pairwise comparing

I have answered my own question. I used SAS' geodist for the example solution.

 

data test;
 infile datalines delimiter = ',';
 input serial lat long date;
 format date date9.;
 datalines;
 1, 58.6732, 118.2569, 20225
 1, 58.9852, 118.2568, 20226
 1, 58.9865, 118.3156, 20227
 1, 75.259, -52.0023,  20228
 1, 75.2689, -52.125,  20229
 2, 10.123, 15.256,    22111
 2, 10.2365, 14.2698,  22112
 2, 11.3569, 15.3489,  22113
 3, 23.2569, 0.2659,   21256
 3, 22.6789, 0.2788,   21257
 3, 12.3569, 35.126,   21258
 3, 12.3789, 34.679,   21259
 ;
run;

data test;
	set test;
	i = _n_;
run;

proc sql _method;
	create table results (drop = sn date1 j) 
	as select
		a.*,
		b.serial as sn label = 'sn',
		b.lat as lat1 label = 'lat1',
		b.long as long1 label = 'long1',
		b.date as date1 label = 'date1',
		b.i as j,
		geodist(a.lat, a.long, b.lat, b.long) as gdist
	from test as a
	inner join test as b
	on a.serial = b.serial
	where a.i > b.i;
quit; 

proc sort results;
	by serial date;
run;
Super User
Posts: 10,497

Re: Using hash tables for pairwise comparing

And that solution uses a reflexive join ...

Contributor
Posts: 40

Re: Using hash tables for pairwise comparing

I am not sure if this solution is good for a data set with 15million rows though.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 342 views
  • 0 likes
  • 3 in conversation