Solved: Re: proc distance with missing data

SeanZ · Posted 02-17-2017 05:25 PM

Hi, I have a general question regarding proc distance. I have a dataset of five variables, a, b, c, d, e, and there are many observations, say, 100. I would like to get a distance matrix between each of 100. All five variables are used in calculation. But there are some missing values in a - e. Now I still want to get distance between each pair and ignore ONLY missing value(s) in that observationn, but not the whole observation.

For example, obs 1 has only a, b, c and d, and obs 2 only has a and c. Then I would like to calculate the distance based on a and c.

How to adjust that in proc distance? I also need to standardize these variables.

Thanks.

PGStats · Posted 02-19-2017 12:22 AM

So this is more a programming problem than a statistical problem. You could do something along these lines:

data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') < 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;

PG

View solution in original post

mkeintz · Posted 02-17-2017 06:18 PM

The what do you actually mean by "distance". At first I though you meant euclidean distiance in the 5-dimensional space with each point at it (a,b,c,d,e) dimension. Are you saying that you want a 2-dimensional subspace of some distance just for a subset of points?

What does it mean for distance X1 -- X2 to be larger/smaller/equal to distance X1 -- X3 when X1 and X2 are 5-dimensional values, but X3 is only 2 dimensinal. What research purposes can be served.

Perhaps you should look at the correlations of the complete points, and use that to make inference about missing elements of the incomplete points.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SeanZ · Posted 02-17-2017 06:23 PM

Hi mkeintz,

Your first thought is quite right

"At first I though you meant euclidean distiance in the 5-dimensional space with each point at it (a,b,c,d,e) dimension. Are you saying that you want a 2-dimensional subspace of some distance just for a subset of points?"

In 100 observations, I only care the distance between 1st obs and all others, and try to find the nearest neighbor for the 1st obs. In that sense, I want the 2-dimensional euclidean distance for the example above.

Do you have any thought on how to do this?

Thanks.

mkeintz · Posted 02-17-2017 09:25 PM

If you really only want the 2-dimensional distance (using dimension a and b) from point P1 to each of P2 ..P99, that's only 100 distances - no need to determine all the pairwise distances, so a simple DATA step should work.

data want (drop=a1 b1 where=(dist_to_1^=.)); 
  set have; 
  retain a1 b1 ; 
  if _n_=1 then do;a1=a; b1=b;end; 
  else dist_to_1=sqrt((a1-a)**2+((b1-b)**2); 
run; 
proc sort;
  by dist_to_1;
run;

Dataset WANT will have 1 fewer observations than have. And after the sort the first obs will be the closest to the first obs of have.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SeanZ · Posted 02-18-2017 02:24 AM

Thanks. In this case, I would like the code to automatically use all nonmissing values in each observation to calculate distance. 2-dimensional distance is just an example if only nonmissing data exist in this two dimensions. If for example, observation 1 and observation 5 have nonmissing data in all five dimension, I would like to get distance in five dimension.

PGStats · Posted 02-17-2017 11:23 PM

If you standardize your data with :

proc stdize data=have out=haveFilled method=mad oprefix sprefix=s;
var a -- e;
run;

the data will be centered around the median and rescaled with a robust measure of dispersion (the MAD). Missing values will be replaced with zeros (the new median).

PG

SeanZ · Posted 02-18-2017 02:21 AM

Thanks PGStats, my problem is that replacing missing values with zeros is not appropriate in the context. It's hard to predict what missings should be. Therefore, I will like to ignore missing data and use all available data in observations to calculate distance.

PGStats · Posted 02-18-2017 12:13 PM

Can you share a bit more about the context, in general terms?

PG

SeanZ · Posted 02-18-2017 02:35 PM

The problem is shown as below.

I have a dataset of eight variables, firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics a - e. Treat is dummy variable taking 1 if a firm is a treated firm and 0 if a firm is a controlled firm. For each year there is only ONE treated firm, and multiple controlled firms.

Now for each year each treated firm, I'd like to find a control firm that is the most similar to the treated firm based on a-e. That's why I want to calculate Euclidean distance between treated firm and other control firms.

The challenge here is that not all treated/controlled firms have full values for a -e. My idea is to use all available information for each treated firms to calculate distance. For example, if treated firm 001 has all values on a, b, c, d, e, then I would like to calculate distance between 001 and all controlled firms that have all a-e information in a particular year. if treated firm 002 has only values on a, b, c, then I would like to calculate distance between treated 002 and controlled firms that have available information on a, b, c.

Do you have any thoughts on this? Thanks.

mkeintz · Posted 02-18-2017 12:26 PM

@SeanZ wrote:

Thanks PGStats, my problem is that replacing missing values with zeros is not appropriate in the context. It's hard to predict what missings should be. Therefore, I will like to ignore missing data and use all available data in observations to calculate distance.

@SeanZ: Are you are saying that a set of two-dimensional distances with no missing data is superior for finding the nearest neighbor to point 1, than would be estimating missing instances of other dimensions and then using 5-dimensional distances.

I mean, if the 5-dimensional metric would be suitable for getting eucliden distance if only there were no missing values for some dimensions, then shouldn't there be an acceptable tool for estimating the missing values using correlations among a b c d and e? Even if the proc stdsize suggested by @PGStats is not exactly right.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SeanZ · Posted 02-18-2017 02:36 PM

Hi mkeintz.

The problem is shown as below.

I have a dataset of eight variables, firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics a - e. Treat is dummy variable taking 1 if a firm is a treated firm and 0 if a firm is a controlled firm. For each year there is only ONE treated firm, and multiple controlled firms.

Now for each year each treated firm, I'd like to find a control firm that is the most similar to the treated firm based on a-e. That's why I want to calculate Euclidean distance between treated firm and other control firms.

The challenge here is that not all treated/controlled firms have full values for a -e. My idea is to use all available information for each treated firms to calculate distance. For example, if treated firm 001 has all values on a, b, c, d, e, then I would like to calculate distance between 001 and all controlled firms that have all a-e information in a particular year. if treated firm 002 has only values on a, b, c, then I would like to calculate distance between treated 002 and controlled firms that have available information on a, b, c.

Do you have any thoughts on this? Thanks.

PGStats · Posted 02-19-2017 12:22 AM

So this is more a programming problem than a statistical problem. You could do something along these lines:

data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') < 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;

PG

Classroom Training Available!