I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.
The following code works if I specify one serial number:
data test;
infile datalines delimiter = ',';
input serial lat long;
datalines;
1, 58.6732, 118.2569
1, 58.9852, 118.2568
1, 58.9865, 118.3156
1, 75.259, -52.0023
1, 75.2689, -52.125
2, 10.123, 15.256
2, 10.2365, 14.2698
2, 11.3569, 15.3489
3, 23.2569, 0.2659
3, 22.6789, 0.2788
3, 12.3569, 35.126
3, 12.3789, 34.679
;
run;
%let rad = constant('pi') / 180; /*degrees to radians*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/
proc fcmp outlib = funcs.haversine;
function haversine(lat1, long1, lat2, long2); /*function name*/
/*function definition*/
dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 +
cos(lat1 * &rad) * cos(lat2 * &rad) * sin((long2 - long1) * &rad / 2)**2))), 0.000001);
return(dist); /*what to return*/
endsub;
run;
options cmplib = funcs; /*needed to access the function*/
data test;
set test;
cnt = _n_;
/*pairwise compare distance*/
data results (drop = _rc dt);
/*set up hash table*/
if _n_ = 1 then do;
if 0 then
set test(rename = (lat = lat1 long = long2));
declare hash h1(dataset:"test(rename = (lat = lat1 long = long2, multidata: 'y');
_rc = h1.definekey('cnt', 'serial');
_rc = h1.definedata('lat1', 'long1');
_rc = h1.definedone();
end;
if serial = /*first serial number*/ then accum + 1;/*but then the serial number needs to change, I think a macro wouldn't be ideal though*/
do _i = (cnt + 1) to (cnt + accum);
_rc = h1.find(key: _i, key: serial);
hdist = haversine(lat, long, lat1, long1);
output;
end;
run;
Can this be generalized to run on whole dataset comparing distance only for the specific serial numbers, recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.
The gist is, I need to find the distance between gps location 1 with 2, 2 with 3, ... for serial number 1. Then when I get to serial number 2, I need to repeat this process and so on.
This is a rough sketch of the what I am trying to achieve. I added comments where I explained the intent but haven't figured out the process to achieve the desired results.
/*Tried setting up a macro for the serial loop but this doesn't work
Here is what I tried for that*/
%let rad = constant('pi') / 180; /*degrees to radians*/
%let r_km = 6371; /*radius of the earth in km (3959 miles)*/
proc fcmp outlib = funcs.haversine;
function haversine(lat1, long1, lat2, long2); /*function name*/
/*function definition*/
dist = round(&r_km * 2 * arsin(min(1, sqrt(sin((lat2 - lat1) * &rad / 2)**2 +
cos(lat1 * &rad) * cos(lat2 * &rad) * sin((long2 - long1) * &rad / 2)**2))), 0.000001);
return(dist); /*what to return*/
endsub;
run;
options cmplib = funcs; /*needed to access the function*/
data test;
infile datalines delimiter = ',';
input serial lat long;
label serial = 'serial'
lat = 'lat'
long = 'long';
datalines;
1, 58.6732, 118.2569
1, 58.9852, 118.2568
1, 58.9865, 118.3156
1, 75.259, -52.0023
1, 75.2689, -52.125
2, 10.123, 15.256
2, 10.2365, 14.2698
2, 11.3569, 15.3489
3, 23.2569, 0.2659
3, 22.6789, 0.2788
3, 12.3569, 35.126
3, 12.3789, 34.679
;
run;
data test;
set test;
input cnt;
label cnt = 'cnt';
cnt = _n_;
run;
data serials (keep = serial);
set test;
run;
proc sort nodupkey data = serials;
by serial;
run;
data _null_;
set serials nobs = n;
call symput ('num', n);
run;
options mprint; /*full output for macro execution so you know something is happening*/
%macro serial_loop;
%local i sn;
%do i = 1 %to #
data _null_;
set serials (obs = &i firstobs = &i);
call symputx ('sn', serial);
run;
if serial = "&sn" then
accum + 1;
/*find the distance between the ith and ith + 1 coordinates for each gps location*/
do _i = (cnt + 1) to (cnt + accum);
_rc = h1.find(key: _i, key: ser_no);
if _rc = 0 then do;
hdist = haversine(lat, long, lat1, long1);
output;
end;
end;
%end;
%mend serial_loop;
data results (drop = _rc);
/*set up hash table*/
set test;
if _n_ = 1 then do;
if 0 then
set test(rename = (lat = lat1 long = long1));
declare hash h1(dataset: "test(rename = (lat = lat1
long = long1))",
multidata: 'y');
_rc = h1.definekey('cnt', 'serial');
_rc = h1.definedata('lat', 'long');
_rc = h1.definedone();
end;
%serial_loop;
run;
ERROR: Variable lat is not on file WORK.TEST.
I have answered my own question. I used SAS' geodist for the example solution.
data test;
infile datalines delimiter = ',';
input serial lat long date;
format date date9.;
datalines;
1, 58.6732, 118.2569, 20225
1, 58.9852, 118.2568, 20226
1, 58.9865, 118.3156, 20227
1, 75.259, -52.0023, 20228
1, 75.2689, -52.125, 20229
2, 10.123, 15.256, 22111
2, 10.2365, 14.2698, 22112
2, 11.3569, 15.3489, 22113
3, 23.2569, 0.2659, 21256
3, 22.6789, 0.2788, 21257
3, 12.3569, 35.126, 21258
3, 12.3789, 34.679, 21259
;
run;
data test;
set test;
i = _n_;
run;
proc sql _method;
create table results (drop = sn date1 j)
as select
a.*,
b.serial as sn label = 'sn',
b.lat as lat1 label = 'lat1',
b.long as long1 label = 'long1',
b.date as date1 label = 'date1',
b.i as j,
geodist(a.lat, a.long, b.lat, b.long) as gdist
from test as a
inner join test as b
on a.serial = b.serial
where a.i > b.i;
quit;
proc sort results;
by serial date;
run;
Guessing I would say yes. I think you will need to supply data to go with your code get anyone interested. Also show what the output might look like for a "new" number where process is started over.
I added a sample data set
@dwsmith wrote:
I have a data set that has a list of serial numbers. For each serial number, I want to run the same calculation without using a macro to loop through each serial number and create a table. There are 37k serial numbers.
Are you comparing this only to other items with the same serial number?
I am not sure what this means at all:
recognizing a new number, and starting the process again. Otherwise, I need a macro to subset the major data set 37k times which seems inefficient.
It will really help if you can provide an example data set with a small number of records, say 5 to 10 and the minimum number of variables to work with and what the output should look like. What little I think I understand might be done with a reflexive join in Proc Sql since it looks like you are only using one data set.
I added a sample data set.
I have answered my own question. I used SAS' geodist for the example solution.
data test;
infile datalines delimiter = ',';
input serial lat long date;
format date date9.;
datalines;
1, 58.6732, 118.2569, 20225
1, 58.9852, 118.2568, 20226
1, 58.9865, 118.3156, 20227
1, 75.259, -52.0023, 20228
1, 75.2689, -52.125, 20229
2, 10.123, 15.256, 22111
2, 10.2365, 14.2698, 22112
2, 11.3569, 15.3489, 22113
3, 23.2569, 0.2659, 21256
3, 22.6789, 0.2788, 21257
3, 12.3569, 35.126, 21258
3, 12.3789, 34.679, 21259
;
run;
data test;
set test;
i = _n_;
run;
proc sql _method;
create table results (drop = sn date1 j)
as select
a.*,
b.serial as sn label = 'sn',
b.lat as lat1 label = 'lat1',
b.long as long1 label = 'long1',
b.date as date1 label = 'date1',
b.i as j,
geodist(a.lat, a.long, b.lat, b.long) as gdist
from test as a
inner join test as b
on a.serial = b.serial
where a.i > b.i;
quit;
proc sort results;
by serial date;
run;
And that solution uses a reflexive join ...
I am not sure if this solution is good for a data set with 15million rows though.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.