Dear all,
Thank you for the many responses. My hope was that there would be an easy method that constructed an interpolation(x, y) function (as is the case in the examples in Python, R, MATLAB and C++ shown in the original question). If I read the different responses correctly, this is not possible at all, am I correct with this?
The popular G3GRID method that is suggested does not seem to meet my criteria. The reason is that I have irregularly spaced datapoints that I wish to determine. This appears to be impossible using this method.
The MI method seems promising at first sight. However I only see examples where the imputation is done by either the "mean value" or by a "regression result". The documentation of this method does not discuss "interpolation".
My current "quick and dirty" workaround is by using the "proc discrim" procedure, to essentially do 1-nearest neighbour interpolation. This gives me the value of the nearest point as the "interpolation". This is a workable solution at the moment for me, although it is far from ideal.
As a recap: I have a series of regularly spaced (x0, y0, z0) values. I also have a (very large) series of irregularly spaced (x, y) values for which I want to determine the corresponding z values.
@NickVe wrote:
Dear all,
Thank you for the many responses. My hope was that there would be an easy method that constructed an interpolation(x, y) function (as is the case in the examples in Python, R, MATLAB and C++ shown in the original question). If I read the different responses correctly, this is not possible at all, am I correct with this?
PROC GLM does this. Which was mentioned earlier in this thread.
If I read the documentation correctly, "proc GLM" only provides a "linear fit", not an "interpolation" as is asked in the original question. Please correct me if I misread this.
> If I read the different responses correctly, this is not possible at all, am I correct with this?
I would not say that is a correct statement. It is possible by using programming statements. The difficulty seems to be that the built-in SAS procedures that we have suggested have been rejected by you for various reasons. If I remember correctly, you do not want a global (or even local) statistical fit but want a bilinear fit based on values at the corner of the cell (in a regular grid) in which each point lives.
I've written about how to use 2-D binning to find the cell that each point belongs to. If you are a programmer, such a routine can be programmed in the SAS/IML language by using the definition of bilinear interpolation.
This 2D binning article looks indeed promosing workaround to generate a regular grid from my irregularly spaced data. This could then be combined with the G3GRID method proposed above. Note however that this is still a workaround and technically not a solution to my problem.
The reason for rejecting the previous methods are that they do not answer the question:
The closest solution to my problem I have found seems to be the "proc discrim" method, which I use to make a 1NN interpolation. The advantage is that this meets the criteria of my question:
For me personally this is the most viable solution at the moment. Unfortunately this only fits the nearest point, and does not perform linear interpolation.
I understand that I could write my own program in SAS to solve this issue, however that was not the question. I wanted to avoid "re-inventing the wheel" by working with the optimized SAS procedures as much as possible. My belief was that these "interpolation functionalities" would also be available in SAS since they are available in for example MATLAB, R, Python and C++ (see examples).
PROC GLM does both a fit, and an interpolation based on the fit. In PROC GLM language, the interpolation is called a prediction.
Maybe @StatDave know some PROC for your requirement ?
Also if it was space data , could check MDS Procedure.
You can get proc krige2d to interpolate if you choose a variogram that goes to zero at distance=0. In this example, you can see that the estimates at the grid points match the grid point values and that the other estimates are interpolations of the grid point values.
data Grid;
call streaminit(12345);
do x = 0 to 10 by 2;
do y = 0 to 10 by 2;
z = abs(x-5) + abs(y-5) + round(rand("Normal"), 0.1);
output;
end;
end;
run;
data New;
do x = 1 to 9 by 1;
do y = 1 to 9 by 1;
output;
end;
end;
run;
/* 2.83 = sqrt(8), approx. The length of the grid diagonal. */
proc krige2d data=grid outest=want plots=none;
coordinates xcoord=x ycoord=y;
grid griddata=new xcoord=x ycoord=y;
predict var=z radius=2.83 minpoints=4 maxpoints=8;
model form=cubic range=2.83 scale=1;
run;
/* Check a few data points */
data both;
set grid want(rename=(gxc=x gyc=y));
if x <= 4 and y <= 4;
run;
proc sort data=both; by x y; run;
proc print data=both; var x y z estimate; run;
Obs. x y z ESTIMATE 1 0 0 10.3 . 2 0 2 9.1 . 3 0 4 6.8 . 4 1 1 . 8.57500 5 1 2 . 8.02992 6 1 3 . 6.75000 7 1 4 . 5.48173 8 2 0 7.9 . 9 2 1 . 7.43705 10 2 2 7.0 . 11 2 2 . 7.00000 12 2 3 . 5.54158 13 2 4 4.1 . 14 2 4 . 4.10000 15 3 1 . 5.82500 16 3 2 . 5.23316 17 3 3 . 4.02500 18 3 4 . 2.84404 19 4 0 4.9 . 20 4 1 . 4.25116 21 4 2 3.5 . 22 4 2 . 3.50000 23 4 3 . 2.55311 24 4 4 1.5 . 25 4 4 . 1.50000
Hi @NickVe,
Another option would be to perform linear interpolation twice (once in x-direction, then in y-direction or vice versa) using one of the two procedures suggested in the SAS Usage Note you linked to. I think PROC EXPAND would be superior for this approach, but I don't have a SAS/ETS license, so I use PROC TRANSREG.
But first a DATA step approach:
/* Create example data
HAVE1: rectangular x-y grid with known values z=f(x,y)
HAVE2: random points (x,y) in the interior of the grid
*/
%let xmesh=0.5;
%let ymesh=1.25;
%let xmin=1;
%let xmax=4;
%let ymin=0;
%let ymax=5;
data have1;
do x=&xmin to &xmax by &xmesh;
do y=&ymin to &ymax by &ymesh;
z=10*sin((x**2-1.618*x*y+3.142*y**2-40)/10);
output;
end;
end;
run;
data have2;
call streaminit(27182818);
do _n_=1 to 100;
x=rand('uniform',&xmin, &xmax);
y=rand('uniform',&ymin, &ymax);
output;
end;
run;
/* Bilinear interpolation in a DATA step (and a preliminary PROC SQL step) */
proc sql;
create view vhave2 as
select x, y, floor(x/&xmesh)*&xmesh as x1,
floor(y/&ymesh)*&ymesh as y1
from have2
order by x1, y1, x, y;
quit;
data want(keep=x y z);
if _n_=1 then do;
dcl hash h(multidata:'y');
h.definekey('x1','y1');
h.definedata('f');
h.definedone();
call missing(x1,y1,f);
do until(lr);
set have1 end=lr;
if x<&xmax & y<&ymax then h.add(key:x, key:y, data:z);
if x<&xmax & y>&ymin then h.add(key:x, key:y-&ymesh, data:z);
if x>&xmin & y<&ymax then h.add(key:x-&xmesh, key:y, data:z);
if x>&xmin & y>&ymin then h.add(key:x-&xmesh, key:y-&ymesh, data:z);
end;
end;
set vhave2;
by x1 y1;
if first.y1 then do;
rc=h.find();
if rc=0 then do;
f11=f;
h.find_next(); f12=f;
h.find_next(); f21=f;
h.find_next(); f22=f;
end;
end;
if rc=0 then do;
x2=x1+&xmesh;
y2=y1+&ymesh;
z = (f11*x2*y2-f12*x2*y1-f21*x1*y2+f22*x1*y1
-(f11 *y2-f12 *y1-f21 *y2+f22 *y1)*x
-(f11*x2 -f12*x2 -f21*x1 +f22*x1 ) *y
+(f11 -f12 -f21 +f22 )*x*y)/&xmesh/&ymesh;
end;
else z=.;
retain rc f:;
run;
proc sort data=want;
by x y;
run;
The PROC SQL step assigns the (x,y) pairs to the lower left corner (x1,y1) of their [x1, x2)×[y1,y2) rectangle in the grid. (Note that this means a marginal limitation: No interpolation would be done for points in HAVE2 with x=&xmax or y=&ymax, if any.) For each of these (x1,y1) pairs the hash object contains the known values of the function at all four corners of the rectangle, f11=f(x1,y1), f12=f(x1,y2), f21=f(x2,y1) and f22=f(x2,y2). These are used in the formula (as per Wikipedia article) to compute the interpolated value z≈f(x,y). (Alternatively, PROC IML could be used, but I don't have a SAS/IML license.)
EDIT: If dataset HAVE1 (the grid) is small and HAVE2 is large (millions of observations), the performance can be improved by omitting the preliminary PROC SQL step and not using BY-group processing:
data want(keep=x y z);
if _n_=1 then do;
dcl hash h(multidata:'y');
h.definekey('x1','y1');
h.definedata('f');
h.definedone();
call missing(x1,y1,f);
do until(lr);
set have1 end=lr;
if x<&xmax & y<&ymax then h.add(key:x, key:y, data:z);
if x<&xmax & y>&ymin then h.add(key:x, key:y-&ymesh, data:z);
if x>&xmin & y<&ymax then h.add(key:x-&xmesh, key:y, data:z);
if x>&xmin & y>&ymin then h.add(key:x-&xmesh, key:y-&ymesh, data:z);
end;
end;
set have2;
x1=floor(x/&xmesh)*&xmesh;
y1=floor(y/&ymesh)*&ymesh;
if h.find()=0 then do;
f11=f;
h.find_next(); f12=f;
h.find_next(); f21=f;
h.find_next(); f22=f;
x2=x1+&xmesh;
y2=y1+&ymesh;
z = (f11*x2*y2-f12*x2*y1-f21*x1*y2+f22*x1*y1
-(f11 *y2-f12 *y1-f21 *y2+f22 *y1)*x
-(f11*x2 -f12*x2 -f21*x1 +f22*x1 ) *y
+(f11 -f12 -f21 +f22 )*x*y)/&xmesh/&ymesh;
end;
else z=.;
run;
In this situation, the time saved by omitting the sort (ORDER BY clause) outweighs the time needed for many additional hash table lookups.
--- End of EDIT ---
As it turned out, the approach using PROC TRANSREG in two linear interpolations is prone to error, e.g., if data points fall on grid lines. The reason is that the NKNOTS= option of the MODEL statement of the second PROC TRANSREG step should ideally have individual values per BY group, which is impossible. The setting NKNOTS=0 (used in the draft code below) works for "generic" sets of points such as the random points in the above HAVE2 dataset, though. To mitigate the issues (and to reduce unnecessary calculations) more complex preparations of the input dataset would be needed. PROC EXPAND (SAS/ETS) is presumably more suitable for this approach.
Here is the draft code:
/* Bilinear interpolation via two linear interpolations with PROC TRANSREG (not recommended!) */
%let nyknots=%sysevalf((&ymax-&ymin)/&ymesh-1);
/* Prepare linear interpolation in y-direction */
proc sql;
create view have_y as
select x, y, z from have1
union
select floor(x/&xmesh)*&xmesh as x, y, . as z from have2
union
select ceil(x/&xmesh)*&xmesh as x, y, . as z from have2;
quit;
/* Perform linear interpolation in y-direction */
proc transreg data=have_y noprint;
by x;
model identity(z)=spline(y / degree=1 nknots=&nyknots);
output out=linintp_y(keep=x y pz rename=(pz=z)) predicted;
run;
/* Prepare linear interpolation in x-direction */
proc sql;
create view have_x as
select y, x, z from linintp_y
union
select y, x, . as z from have2;
quit;
/* Perform linear interpolation in x-direction */
proc transreg data=have_x noprint;
by y;
model identity(z)=spline(x / degree=1 nknots=0); /* <--- CAUTION: nknots=0 can cause incorrect results, */
output out=linintp_x(keep=x y pz rename=(pz=z)) predicted; /* e.g., if data points fall on grid lines! */
run;
/* Retrieve results of bilinear interpolation */
proc sql;
create table want2 as
select a.x, a.y, a.z
from linintp_x a, have2 b
where a.x=b.x & a.y=b.y
order by x, y;
quit;
/* Compare with results from DATA step approach */
proc compare data=want c=want2 criterion=1e-10;
run;
Here is an implementation of bilinear regression in interpolation in SAS. It uses SAS/IML and assumes the data are on a regular grid.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.