I am no sure how to describe this, but I have a data set of home sales including foreclosures. I want to count the number of foreclosures near each observation based on distance and time as well as if the foreclosure meets one other, dummy criteria.
Below is the data.
h is a dummy variable for quality
f is a dummy variable for foreclosure sale
X and Y are lat/long.
t is the month the foreclosure sale
Count all the other observations that meet these critera:
Within distance: say 1 mile (I know how to use GEODIST to find this)
Within time: say 30 days so abs(t_subject - t_obs) < 30)
Where f=1
I also want to split this count between obs where h=0 and h=1
So for the first datapoint. I want to find all other datapoints that are within 1 mile of
| -113.219 | 37.18241 |
And within 30 days of 20239 or 20209 to 20269
Where f=1.
I want a count for when h=0 and h=1.
Here's the data:
| h | f | x | y | t |
| 0 | 1 | -113.219 | 37.18241 | 20239 |
| 1 | 0 | -113.751 | 37.73728 | 20404 |
| 1 | 0 | -113.124 | 37.69568 | 20428 |
| 0 | 1 | -113.862 | 37.70729 | 20194 |
| 1 | 0 | -113.373 | 37.14169 | 20200 |
| 1 | 0 | -113.485 | 37.52134 | 20724 |
| 1 | 0 | -114.051 | 36.89753 | 20395 |
| 1 | 0 | -113.666 | 37.45601 | 20410 |
| 1 | 0 | -113.421 | 36.86348 | 20511 |
| 1 | 1 | -113.59 | 37.08619 | 20652 |
| 1 | 1 | -113.904 | 37.40063 | 20098 |
| 0 | 0 | -113.997 | 37.13753 | 20050 |
| 0 | 0 | -113.458 | 37.27519 | 20590 |
| 0 | 0 | -113.457 | 37.45815 | 20403 |
| 0 | 0 | -113.732 | 37.14784 | 20252 |
| 0 | 1 | -113.673 | 37.03961 | 20818 |
| 1 | 1 | -113.861 | 37.75376 | 20859 |
| 0 | 0 | -113.793 | 36.85021 | 20730 |
| 1 | 0 | -113.488 | 37.4593 | 20451 |
| 1 | 0 | -113.415 | 37.26769 | 20074 |
My 2 cents. Should be reasonably fast as well.
data have;
input h f x y t;
datalines;
0 1 -113.219 37.18241 20239
1 0 -113.751 37.73728 20404
1 0 -113.124 37.69568 20428
0 1 -113.862 37.70729 20194
1 0 -113.373 37.14169 20200
1 0 -113.485 37.52134 20724
1 0 -114.051 36.89753 20395
1 0 -113.666 37.45601 20410
1 0 -113.421 36.86348 20511
1 1 -113.59 37.08619 20652
1 1 -113.904 37.40063 20098
0 0 -113.997 37.13753 20050
0 0 -113.458 37.27519 20590
0 0 -113.457 37.45815 20403
0 0 -113.732 37.14784 20252
0 1 -113.673 37.03961 20818
1 1 -113.861 37.75376 20859
0 0 -113.793 36.85021 20730
1 0 -113.488 37.4593 20451
1 0 -113.415 37.26769 20074
;
data want(drop = xx yy tt);
if _N_ = 1 then do;
dcl hash hh(dataset : "have(rename = (x = xx y = yy t = tt) where = (f = 1))");
hh.definekey("h");
hh.definedata("xx", "yy", "tt");
hh.definedone();
dcl hiter i("hh");
end;
set have;
xx = .; yy = .; tt = .; c = 0;
do while (i.next() = 0);
if 0 < geodist(y, x, yy, xx, 'm') < 1 & abs(t - tt) < 30 then c + 1;
end;
run;
If that's the entire data set, 20-30 observations, I would use a Cartesian join to get all possible combinations in a data set and then do the 1 miles and 30 days comparison on this data set. I would also add some sort of ID variable to each row, it could be as simple as sequence number.
UNTESTED CODE
proc sql;
create table cartesian as select a.id,a.h,a.f,a.x,a.y,a.t,
b.id as id2,b.h as h2,b.f as f2,b.x as x2,b.y as y2,b.t as t2
from have as a,have as b
order by a.id,b.id;
quit;
Now if you really have thousands of records, this probably won't work because maybe your computer can't handle this large problem, or is going to take a long time.
Yes, I actually have 250k observations. I have used the Cartesian join and other methods, but they all consume days worth of resources. I am trying to find an easier solution. I was wondering if there was a solution with do loops.
Well, there are methods that might work when you have 250k observations, such as a double-do loop, and hash objects, but I will leave that type of programming up to others.
My 2 cents. Should be reasonably fast as well.
data have;
input h f x y t;
datalines;
0 1 -113.219 37.18241 20239
1 0 -113.751 37.73728 20404
1 0 -113.124 37.69568 20428
0 1 -113.862 37.70729 20194
1 0 -113.373 37.14169 20200
1 0 -113.485 37.52134 20724
1 0 -114.051 36.89753 20395
1 0 -113.666 37.45601 20410
1 0 -113.421 36.86348 20511
1 1 -113.59 37.08619 20652
1 1 -113.904 37.40063 20098
0 0 -113.997 37.13753 20050
0 0 -113.458 37.27519 20590
0 0 -113.457 37.45815 20403
0 0 -113.732 37.14784 20252
0 1 -113.673 37.03961 20818
1 1 -113.861 37.75376 20859
0 0 -113.793 36.85021 20730
1 0 -113.488 37.4593 20451
1 0 -113.415 37.26769 20074
;
data want(drop = xx yy tt);
if _N_ = 1 then do;
dcl hash hh(dataset : "have(rename = (x = xx y = yy t = tt) where = (f = 1))");
hh.definekey("h");
hh.definedata("xx", "yy", "tt");
hh.definedone();
dcl hiter i("hh");
end;
set have;
xx = .; yy = .; tt = .; c = 0;
do while (i.next() = 0);
if 0 < geodist(y, x, yy, xx, 'm') < 1 & abs(t - tt) < 30 then c + 1;
end;
run;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.