I have data with groups and geographic coordinates. It looks like this:
Group | Location | Date | Lat | Long |
A | 13 | 1/4/2024 | 32.58497 | -95.5545 |
A | 16 | 4/2/2024 | 34.07736 | -99.9309 |
A | 25 | 1/15/2023 | 35.6381 | -104.508 |
A | 256 | 8/1/2024 | 37.27033 | -109.294 |
A | 22 | 1/1/2025 | . | . |
A | 125 | 3/4/2023 | 40.76247 | -119.535 |
A | 126 | 1/8/2023 | 42.62939 | -125.009 |
A | 214 | 10/3/2023 | 44.58182 | -130.735 |
B | 532 | 11/4/2023 | . | . |
B | 6543 | 12/1/2024 | 48.75903 | -142.984 |
B | 235 | 1/5/2025 | 50.99219 | -149.533 |
B | 241 | 6/3/2023 | 53.32764 | -156.382 |
B | 524 | 5/12/2024 | 55.77004 | -163.544 |
I have this code, which sorts by Group and Date and then, for each group, computes the cumulative distance at each location as well as the distance from the prior location. The wrench in this is the missing data: I don't want to compute the distance between two locations if they are not successive. For instance, if a location is missing its coordinates, then the distance values for that location AND the next one should be null, and the cumulative distance should be held constant until two locations after the one with the missing data. How can I modify this code to do that? (I've sketched the result I expect for group B below the code.) I also do the same thing with the days between locations, but there is no missing data to deal with there.
proc sort data=my_data;
    by Group Date;
run;

data my_data_new;
    set my_data;
    by group;
    cum_dist+geodist(lat, long, lag(lat), lag(long));
    if first.group then cum_dist = 0;
run;

data my_data_new1;
    set my_data_new;
    dist_prev = sum(cum_dist, -lag1(cum_dist));
    if first.group then dist_prev = .;
    days_prev = abs(dif(Date));
    if first.group then days_prev = .;
run;
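To make the rule concrete, here is what I would expect for group B once it is sorted by date (the actual distances are shown symbolically as d1 and d2 since I haven't computed them):
Group | Location | Date | dist_prev | cum_dist |
B | 241 | 6/3/2023 | . | 0 |
B | 532 | 11/4/2023 | . | 0 (held) |
B | 524 | 5/12/2024 | . | 0 (held) |
B | 6543 | 12/1/2024 | d1 | d1 |
B | 235 | 1/5/2025 | d2 | d1 + d2 |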
Hello @RandoDando,
I think you're almost there. Just store the value returned by the GEODIST function in variable dist_prev rather than deriving dist_prev from the cumulative distances. Then the missing coordinates take care of themselves: GEODIST returns a missing value when any argument is missing, so dist_prev is missing both for the record without coordinates and for the next record (whose LAG values are missing), and the sum statement ignores missing values, which holds cum_dist constant until two locations after the gap. You can also omit the second DATA step (which is missing a BY statement!) and do all calculations in the first, as shown below:
/* Create sample data for demonstration */
data have;
    call streaminit(27182818);
    do g=1 to 5;
        date=rand('integer','01JAN2022'd,'31DEC2022'd);
        do location=1 to 30;
            group=byte(64+g);
            date+rand('integer',50);
            if rand('bern',0.7) then do;
                lat=rand('uniform',30,60);
                long=rand('uniform',-165,-95);
            end;
            else call missing(lat, long);
            output;
        end;
    end;
    format date yymmdd10.;
run;

/* Compute distances, cumulative distances and time differences */
data want;
    set have;
    by group;
    dist_prev=geodist(lat, long, lag(lat), lag(long));
    cum_dist+dist_prev;
    days_prev=dif(date);
    if first.group then do;
        cum_dist = 0;
        call missing(dist_prev, days_prev);
    end;
run;
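If you want to see the missing-value handling with your own rows, here is a quick check using the group B records from your post (just a sketch on top of the step above; nothing in it is required for the solution):
/* Read the group B sample rows, sort them, and apply the same logic */
data sample;
    infile datalines dlm='|';
    input group :$1. location date :mmddyy10. lat long;
    format date mmddyy10.;
datalines;
B|532|11/4/2023|.|.
B|6543|12/1/2024|48.75903|-142.984
B|235|1/5/2025|50.99219|-149.533
B|241|6/3/2023|53.32764|-156.382
B|524|5/12/2024|55.77004|-163.544
;

proc sort data=sample;
    by group date;
run;

data check;
    set sample;
    by group;
    /* missing coordinates => missing dist_prev for this and the next record */
    dist_prev=geodist(lat, long, lag(lat), lag(long));
    /* sum statement ignores missing values => cum_dist is held constant */
    cum_dist+dist_prev;
    days_prev=dif(date);
    if first.group then do;
        cum_dist=0;
        call missing(dist_prev, days_prev);
    end;
run;

proc print data=check noobs;
run;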
OK, I was just handed another wrench. How about if I have TWO dates (a start_date and an end_date) and I want days_prev to be the difference between the start date and the end_date of the previous record?
EDIT: got it.
days_prev = start_date - lag(end_date);
if first.group then days_prev = .;
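For anyone who finds this later, here is how that line slots into the combined DATA step (a sketch, assuming the input data set now carries start_date and end_date instead of the single date variable):
data want;
    set have;
    by group;
    dist_prev=geodist(lat, long, lag(lat), lag(long));
    cum_dist+dist_prev;
    /* days from the previous record's end_date to this record's start_date */
    days_prev=start_date - lag(end_date);
    if first.group then do;
        cum_dist=0;
        call missing(dist_prev, days_prev);
    end;
run;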