I am no sure how to describe this, but I have a data set of home sales including foreclosures. I want to count the number of foreclosures near each observation based on distance and time as well as if the foreclosure meets one other, dummy criteria.
Below is the data.
h is a dummy variable for quality
f is a dummy variable for foreclosure sale
X and Y are lat/long.
t is the month the foreclosure sale
Count all the other observations that meet these critera:
Within distance: say 1 mile (I know how to use GEODIST to find this)
Within time: say 30 days so abs(t_subject - t_obs) < 30)
Where f=1
I also want to split this count between obs where h=0 and h=1
So for the first datapoint. I want to find all other datapoints that are within 1 mile of
-113.219 | 37.18241 |
And within 30 days of 20239 or 20209 to 20269
Where f=1.
I want a count for when h=0 and h=1.
Here's the data:
h | f | x | y | t |
0 | 1 | -113.219 | 37.18241 | 20239 |
1 | 0 | -113.751 | 37.73728 | 20404 |
1 | 0 | -113.124 | 37.69568 | 20428 |
0 | 1 | -113.862 | 37.70729 | 20194 |
1 | 0 | -113.373 | 37.14169 | 20200 |
1 | 0 | -113.485 | 37.52134 | 20724 |
1 | 0 | -114.051 | 36.89753 | 20395 |
1 | 0 | -113.666 | 37.45601 | 20410 |
1 | 0 | -113.421 | 36.86348 | 20511 |
1 | 1 | -113.59 | 37.08619 | 20652 |
1 | 1 | -113.904 | 37.40063 | 20098 |
0 | 0 | -113.997 | 37.13753 | 20050 |
0 | 0 | -113.458 | 37.27519 | 20590 |
0 | 0 | -113.457 | 37.45815 | 20403 |
0 | 0 | -113.732 | 37.14784 | 20252 |
0 | 1 | -113.673 | 37.03961 | 20818 |
1 | 1 | -113.861 | 37.75376 | 20859 |
0 | 0 | -113.793 | 36.85021 | 20730 |
1 | 0 | -113.488 | 37.4593 | 20451 |
1 | 0 | -113.415 | 37.26769 | 20074 |
My 2 cents. Should be reasonably fast as well.
data have;
input h f x y t;
datalines;
0 1 -113.219 37.18241 20239
1 0 -113.751 37.73728 20404
1 0 -113.124 37.69568 20428
0 1 -113.862 37.70729 20194
1 0 -113.373 37.14169 20200
1 0 -113.485 37.52134 20724
1 0 -114.051 36.89753 20395
1 0 -113.666 37.45601 20410
1 0 -113.421 36.86348 20511
1 1 -113.59 37.08619 20652
1 1 -113.904 37.40063 20098
0 0 -113.997 37.13753 20050
0 0 -113.458 37.27519 20590
0 0 -113.457 37.45815 20403
0 0 -113.732 37.14784 20252
0 1 -113.673 37.03961 20818
1 1 -113.861 37.75376 20859
0 0 -113.793 36.85021 20730
1 0 -113.488 37.4593 20451
1 0 -113.415 37.26769 20074
;
data want(drop = xx yy tt);
if _N_ = 1 then do;
dcl hash hh(dataset : "have(rename = (x = xx y = yy t = tt) where = (f = 1))");
hh.definekey("h");
hh.definedata("xx", "yy", "tt");
hh.definedone();
dcl hiter i("hh");
end;
set have;
xx = .; yy = .; tt = .; c = 0;
do while (i.next() = 0);
if 0 < geodist(y, x, yy, xx, 'm') < 1 & abs(t - tt) < 30 then c + 1;
end;
run;
If that's the entire data set, 20-30 observations, I would use a Cartesian join to get all possible combinations in a data set and then do the 1 miles and 30 days comparison on this data set. I would also add some sort of ID variable to each row, it could be as simple as sequence number.
UNTESTED CODE
proc sql;
create table cartesian as select a.id,a.h,a.f,a.x,a.y,a.t,
b.id as id2,b.h as h2,b.f as f2,b.x as x2,b.y as y2,b.t as t2
from have as a,have as b
order by a.id,b.id;
quit;
Now if you really have thousands of records, this probably won't work because maybe your computer can't handle this large problem, or is going to take a long time.
Yes, I actually have 250k observations. I have used the Cartesian join and other methods, but they all consume days worth of resources. I am trying to find an easier solution. I was wondering if there was a solution with do loops.
Well, there are methods that might work when you have 250k observations, such as a double-do loop, and hash objects, but I will leave that type of programming up to others.
My 2 cents. Should be reasonably fast as well.
data have;
input h f x y t;
datalines;
0 1 -113.219 37.18241 20239
1 0 -113.751 37.73728 20404
1 0 -113.124 37.69568 20428
0 1 -113.862 37.70729 20194
1 0 -113.373 37.14169 20200
1 0 -113.485 37.52134 20724
1 0 -114.051 36.89753 20395
1 0 -113.666 37.45601 20410
1 0 -113.421 36.86348 20511
1 1 -113.59 37.08619 20652
1 1 -113.904 37.40063 20098
0 0 -113.997 37.13753 20050
0 0 -113.458 37.27519 20590
0 0 -113.457 37.45815 20403
0 0 -113.732 37.14784 20252
0 1 -113.673 37.03961 20818
1 1 -113.861 37.75376 20859
0 0 -113.793 36.85021 20730
1 0 -113.488 37.4593 20451
1 0 -113.415 37.26769 20074
;
data want(drop = xx yy tt);
if _N_ = 1 then do;
dcl hash hh(dataset : "have(rename = (x = xx y = yy t = tt) where = (f = 1))");
hh.definekey("h");
hh.definedata("xx", "yy", "tt");
hh.definedone();
dcl hiter i("hh");
end;
set have;
xx = .; yy = .; tt = .; c = 0;
do while (i.next() = 0);
if 0 < geodist(y, x, yy, xx, 'm') < 1 & abs(t - tt) < 30 then c + 1;
end;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.