DATA Step, Macro, Functions and more

proc distance with missing data

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 122
Accepted Solution

proc distance with missing data

Hi, I have a general question regarding proc distance. I have a dataset of five variables, a, b, c, d, e, and there are many observations, say, 100. I would like to get a distance matrix between each of 100. All five variables are used in calculation. But there are some missing values in a - e. Now I still want to get distance between each pair and ignore ONLY missing value(s) in that observationn, but not the whole observation.

 

For example, obs 1 has only a, b, c and d, and obs 2 only has a and c. Then I would like to calculate the distance based on a and c.

 

How to adjust that in proc distance? I also need to standardize these variables.

 

Thanks.


Accepted Solutions
Solution
‎02-23-2017 05:05 AM
Respected Advisor
Posts: 4,923

Re: proc distance with missing data

So this is more a programming problem than a statistical problem. You could do something along these lines:

 

data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') < 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;
PG

View solution in original post


All Replies
Trusted Advisor
Posts: 1,019

Re: proc distance with missing data

The what do you actually mean  by "distance".   At first I though you meant euclidean distiance in the 5-dimensional space with each point at it (a,b,c,d,e) dimension.   Are you saying that you want a 2-dimensional subspace of some distance just for a subset of points?

 

What does it mean for  distance X1 -- X2   to be larger/smaller/equal to  distance X1 -- X3  when  X1 and X2 are 5-dimensional values, but X3 is only 2 dimensinal.   What research purposes can be served.

 

Perhaps you should look at the correlations of the complete points, and use that to make inference about missing elements of the incomplete points.

 

Frequent Contributor
Posts: 122

Re: proc distance with missing data

Hi mkeintz,

 

Your first thought is quite right 

 

"At first I though you meant euclidean distiance in the 5-dimensional space with each point at it (a,b,c,d,e) dimension.   Are you saying that you want a 2-dimensional subspace of some distance just for a subset of points?"

 

In 100 observations, I only care the distance between 1st obs and all others, and try to find the nearest neighbor for the 1st obs. In that sense, I want the 2-dimensional euclidean distance for the example above.

 

Do you have any thought on how to do this?

 

Thanks.

Trusted Advisor
Posts: 1,019

Re: proc distance with missing data

[ Edited ]

If you really only want the 2-dimensional distance (using dimension a and b) from point P1 to  each of P2 ..P99, that's only 100 distances - no need to determine all the pairwise distances, so a simple DATA step  should work.

 

data want (drop=a1 b1 where=(dist_to_1^=.)); 
  set have; 
  retain a1 b1 ; 
  if _n_=1 then do;a1=a; b1=b;end; 
  else dist_to_1=sqrt((a1-a)**2+((b1-b)**2); 
run; 
proc sort;
by dist_to_1;
run;

 

Dataset WANT will have 1 fewer observations than have. And after the sort the first obs will be the closest to  the first obs of have.

Frequent Contributor
Posts: 122

Re: proc distance with missing data

Thanks. In this case, I would like the code to automatically use all nonmissing values in each observation to calculate distance. 2-dimensional distance is just an example if only nonmissing data exist in this two dimensions. If for example, observation 1 and observation 5 have nonmissing data in all five dimension, I would like to get distance in five dimension.

Respected Advisor
Posts: 4,923

Re: proc distance with missing data

If you standardize your data with :

 

proc stdize data=have out=haveFilled method=mad oprefix sprefix=s;
var a -- e;
run;

the data will be centered around the median and rescaled with a robust measure of dispersion (the MAD). Missing values will be replaced with zeros (the new median).  

PG
Frequent Contributor
Posts: 122

Re: proc distance with missing data

Thanks PGStats, my problem is that replacing missing values with zeros is not appropriate in the context. It's hard to predict what missings should be. Therefore, I will like to ignore missing data and use all available data in observations to calculate distance. 

Respected Advisor
Posts: 4,923

Re: proc distance with missing data

Can you share a bit more about the context, in general terms?

PG
Frequent Contributor
Posts: 122

Re: proc distance with missing data

The problem is shown as below.

 

I have a dataset of eight variables, firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics a - e. Treat is dummy variable taking 1 if a firm is a treated firm and 0 if a firm is a controlled firm. For each year there is only ONE treated firm, and multiple controlled firms. 

 

Now for each year each treated firm, I'd like to find a control firm that is the most similar to the treated firm based on a-e. That's why I want to calculate Euclidean distance between treated firm and other control firms.

 

The challenge here is that not all treated/controlled firms have full values for a -e. My idea is to use all available information for each treated firms to calculate distance. For example, if treated firm 001 has all values on a, b, c, d, e, then I would like to calculate distance between 001 and all controlled firms that have all a-e information in a particular year. if treated firm 002 has only values on a, b, c, then I would like to calculate distance between treated 002 and controlled firms that have available information on a, b, c.

 

Do you have any thoughts on this? Thanks.

Trusted Advisor
Posts: 1,019

Re: proc distance with missing data

[ Edited ]

SeanZ wrote:

Thanks PGStats, my problem is that replacing missing values with zeros is not appropriate in the context. It's hard to predict what missings should be. Therefore, I will like to ignore missing data and use all available data in observations to calculate distance. 


 

@SeanZ:  Are you are saying that a set of two-dimensional distances with no missing data is superior for finding the nearest neighbor to point 1, than would be estimating missing instances of other dimensions and then using 5-dimensional distances.

 

I mean, if the 5-dimensional metric would be suitable for getting eucliden distance if only there were no missing values for some dimensions, then shouldn't there be an acceptable tool for estimating the missing values using correlations among a b c d and e?   Even if the proc stdsize suggested by @PGStats is not exactly right.

 

 

Frequent Contributor
Posts: 122

Re: proc distance with missing data

Hi mkeintz.

 

The problem is shown as below.

 

I have a dataset of eight variables, firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics a - e. Treat is dummy variable taking 1 if a firm is a treated firm and 0 if a firm is a controlled firm. For each year there is only ONE treated firm, and multiple controlled firms. 

 

Now for each year each treated firm, I'd like to find a control firm that is the most similar to the treated firm based on a-e. That's why I want to calculate Euclidean distance between treated firm and other control firms.

 

The challenge here is that not all treated/controlled firms have full values for a -e. My idea is to use all available information for each treated firms to calculate distance. For example, if treated firm 001 has all values on a, b, c, d, e, then I would like to calculate distance between 001 and all controlled firms that have all a-e information in a particular year. if treated firm 002 has only values on a, b, c, then I would like to calculate distance between treated 002 and controlled firms that have available information on a, b, c.

 

Do you have any thoughts on this? Thanks.

Solution
‎02-23-2017 05:05 AM
Respected Advisor
Posts: 4,923

Re: proc distance with missing data

So this is more a programming problem than a statistical problem. You could do something along these lines:

 

data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') < 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;
PG
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 287 views
  • 5 likes
  • 3 in conversation