I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.
The following code works if I specify one serial number:
data test;
infile datalines delimiter = ',';
input serial lat long;
datalines;
1, 58.6732, 118.2569
1, 58.9852, 118.2568
1, 58.9865, 118.3156
1, 75.259, -52.0023
1, 75.2689, -52.125
2, 10.123, 15.256
2, 10.2365, 14.2698
2, 11.3569, 15.3489
3, 23.2569, 0.2659
3, 22.6789, 0.2788
3, 12.3569, 35.126
3, 12.3789, 34.679
;
run;
%let rad = constant('pi') / 180; /*degrees to radians*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/
proc fcmp outlib = funcs.haversine;
function haversine(lat1, long1, lat2, long2); /*function name*/
/*function definition*/
dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 +
cos(lat1 * &rad) * cos(lat2 * &rad) * sin((long2 - long1) * &rad / 2)**2))), 0.000001);
return(dist); /*what to return*/
endsub;
run;
options cmplib = funcs; /*needed to access the function*/
data test;
set test;
cnt = _n_;
/*pairwise compare distance*/
data results (drop = _rc dt);
/*set up hash table*/
if _n_ = 1 then do;
if 0 then
set test(rename = (lat = lat1 long = long2));
declare hash h1(dataset:"test(rename = (lat = lat1 long = long2, multidata: 'y');
_rc = h1.definekey('cnt', 'serial');
_rc = h1.definedata('lat1', 'long1');
_rc = h1.definedone();
end;
if serial = /*first serial number*/ then accum + 1;/*but then the serial number needs to change, I think a macro wouldn't be ideal though*/
do _i = (cnt + 1) to (cnt + accum);
_rc = h1.find(key: _i, key: serial);
hdist = haversine(lat, long, lat1, long1);
output;
end;
run;
Can this be generalized to run on whole dataset comparing distance only for the specific serial numbers, recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.
The gist is, I need to find the distance between gps location 1 with 2, 2 with 3, ... for serial number 1. Then when I get to serial number 2, I need to repeat this process and so on.
This is a rough sketch of the what I am trying to achieve. I added comments where I explained the intent but haven't figured out the process to achieve the desired results.
/*Tried setting up a macro for the serial loop but this doesn't work
Here is what I tried for that*/
%let rad = constant('pi') / 180; /*degrees to radians*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/
proc fcmp outlib = funcs.haversine;
function haversine(lat1, long1, lat2, long2); /*function name*/
/*function definition*/
dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 +
cos(lat1 * &rad) * cos(lat2 * &rad) * sin((long2 - long1) * &rad / 2)**2))), 0.000001);
return(dist); /*what to return*/
endsub;
run;
options cmplib = funcs; /*needed to access the function*/
data test;
infile datalines delimiter = ',';
input serial lat long;
label serial = 'serial'
lat = 'lat'
long = 'long';
datalines;
1, 58.6732, 118.2569
1, 58.9852, 118.2568
1, 58.9865, 118.3156
1, 75.259, -52.0023
1, 75.2689, -52.125
2, 10.123, 15.256
2, 10.2365, 14.2698
2, 11.3569, 15.3489
3, 23.2569, 0.2659
3, 22.6789, 0.2788
3, 12.3569, 35.126
3, 12.3789, 34.679
;
run;
data test;
set test;
input cnt;
label cnt = 'cnt';
cnt = _n_;
run;
data serials (keep = serial);
set test;
run;
proc sort nodupkey data = serials;
by serial;
run;
data _null_;
set serials nobs = n;
call symput ('num', n);
run;
options mprint; /*full output for macro execution so you know something is happening*/
%macro serial_loop;
%local i sn;
%do i = 1 %to #
data _null_;
set serials (obs = &i firstobs = &i);
call symputx ('sn', serial);
run;
if serial = "&sn" then
accum + 1;
/*find the distance between the ith and ith + 1 coordinates for each gps location*/
do _i = (cnt + 1) to (cnt + accum);
_rc = h1.find(key: _i, key: ser_no);
if _rc = 0 then do;
hdist = haversine(lat, long, lat1, long1);
output;
end;
end;
%end;
%mend serial_loop;
data results (drop = _rc);
/*set up hash table*/
set test;
if _n_ = 1 then do;
if 0 then
set test(rename = (lat = lat1 long = long1));
declare hash h1(dataset: "test(rename = (lat = lat1
long = long1))",
multidata: 'y');
_rc = h1.definekey('cnt', 'serial');
_rc = h1.definedata('lat', 'long');
_rc = h1.definedone();
end;
%serial_loop;
run;
ERROR: Variable lat is not on file WORK.TEST.
I have answered my own question. I used SAS' geodist for the example solution.
data test;
infile datalines delimiter = ',';
input serial lat long date;
format date date9.;
datalines;
1, 58.6732, 118.2569, 20225
1, 58.9852, 118.2568, 20226
1, 58.9865, 118.3156, 20227
1, 75.259, -52.0023, 20228
1, 75.2689, -52.125, 20229
2, 10.123, 15.256, 22111
2, 10.2365, 14.2698, 22112
2, 11.3569, 15.3489, 22113
3, 23.2569, 0.2659, 21256
3, 22.6789, 0.2788, 21257
3, 12.3569, 35.126, 21258
3, 12.3789, 34.679, 21259
;
run;
data test;
set test;
i = _n_;
run;
proc sql _method;
create table results (drop = sn date1 j)
as select
a.*,
b.serial as sn label = 'sn',
b.lat as lat1 label = 'lat1',
b.long as long1 label = 'long1',
b.date as date1 label = 'date1',
b.i as j,
geodist(a.lat, a.long, b.lat, b.long) as gdist
from test as a
inner join test as b
on a.serial = b.serial
where a.i > b.i;
quit;
proc sort results;
by serial date;
run;
Guessing I would say yes. I think you will need to supply data to go with your code get anyone interested. Also show what the output might look like for a "new" number where process is started over.
I added a sample data set
@dwsmith wrote:
I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.
Are you comparing this only to other items with the same serial number?
I am not sure what this means at all:
recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.
It will really help if you can provide an example data set with a small number of records, say 5 to 10 and the minimum number of variables to work with and what the output should look like. What little I think I understand might be done with a reflexive join in Proc Sql since it looks like you are only using one data set.
I added a sample data set.
I have answered my own question. I used SAS' geodist for the example solution.
data test;
infile datalines delimiter = ',';
input serial lat long date;
format date date9.;
datalines;
1, 58.6732, 118.2569, 20225
1, 58.9852, 118.2568, 20226
1, 58.9865, 118.3156, 20227
1, 75.259, -52.0023, 20228
1, 75.2689, -52.125, 20229
2, 10.123, 15.256, 22111
2, 10.2365, 14.2698, 22112
2, 11.3569, 15.3489, 22113
3, 23.2569, 0.2659, 21256
3, 22.6789, 0.2788, 21257
3, 12.3569, 35.126, 21258
3, 12.3789, 34.679, 21259
;
run;
data test;
set test;
i = _n_;
run;
proc sql _method;
create table results (drop = sn date1 j)
as select
a.*,
b.serial as sn label = 'sn',
b.lat as lat1 label = 'lat1',
b.long as long1 label = 'long1',
b.date as date1 label = 'date1',
b.i as j,
geodist(a.lat, a.long, b.lat, b.long) as gdist
from test as a
inner join test as b
on a.serial = b.serial
where a.i > b.i;
quit;
proc sort results;
by serial date;
run;
And that solution uses a reflexive join ...
I am not sure if this solution is good for a data set with 15million rows though.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.