BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SeanZ
Obsidian | Level 7

Hi, I have a general question regarding proc distance. I have a dataset of five variables, a, b, c, d, e, and there are many observations, say, 100. I would like to get a distance matrix between each of 100. All five variables are used in calculation. But there are some missing values in a - e. Now I still want to get distance between each pair and ignore ONLY missing value(s) in that observationn, but not the whole observation.

 

For example, obs 1 has only a, b, c and d, and obs 2 only has a and c. Then I would like to calculate the distance based on a and c.

 

How to adjust that in proc distance? I also need to standardize these variables.

 

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

So this is more a programming problem than a statistical problem. You could do something along these lines:

 

data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') < 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;
PG

View solution in original post

11 REPLIES 11
mkeintz
PROC Star

The what do you actually mean  by "distance".   At first I though you meant euclidean distiance in the 5-dimensional space with each point at it (a,b,c,d,e) dimension.   Are you saying that you want a 2-dimensional subspace of some distance just for a subset of points?

 

What does it mean for  distance X1 -- X2   to be larger/smaller/equal to  distance X1 -- X3  when  X1 and X2 are 5-dimensional values, but X3 is only 2 dimensinal.   What research purposes can be served.

 

Perhaps you should look at the correlations of the complete points, and use that to make inference about missing elements of the incomplete points.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SeanZ
Obsidian | Level 7

Hi mkeintz,

 

Your first thought is quite right 

 

"At first I though you meant euclidean distiance in the 5-dimensional space with each point at it (a,b,c,d,e) dimension.   Are you saying that you want a 2-dimensional subspace of some distance just for a subset of points?"

 

In 100 observations, I only care the distance between 1st obs and all others, and try to find the nearest neighbor for the 1st obs. In that sense, I want the 2-dimensional euclidean distance for the example above.

 

Do you have any thought on how to do this?

 

Thanks.

mkeintz
PROC Star

If you really only want the 2-dimensional distance (using dimension a and b) from point P1 to  each of P2 ..P99, that's only 100 distances - no need to determine all the pairwise distances, so a simple DATA step  should work.

 

data want (drop=a1 b1 where=(dist_to_1^=.)); 
  set have; 
  retain a1 b1 ; 
  if _n_=1 then do;a1=a; b1=b;end; 
  else dist_to_1=sqrt((a1-a)**2+((b1-b)**2); 
run; 
proc sort;
by dist_to_1;
run;

 

Dataset WANT will have 1 fewer observations than have. And after the sort the first obs will be the closest to  the first obs of have.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SeanZ
Obsidian | Level 7

Thanks. In this case, I would like the code to automatically use all nonmissing values in each observation to calculate distance. 2-dimensional distance is just an example if only nonmissing data exist in this two dimensions. If for example, observation 1 and observation 5 have nonmissing data in all five dimension, I would like to get distance in five dimension.

PGStats
Opal | Level 21

If you standardize your data with :

 

proc stdize data=have out=haveFilled method=mad oprefix sprefix=s;
var a -- e;
run;

the data will be centered around the median and rescaled with a robust measure of dispersion (the MAD). Missing values will be replaced with zeros (the new median).  

PG
SeanZ
Obsidian | Level 7

Thanks PGStats, my problem is that replacing missing values with zeros is not appropriate in the context. It's hard to predict what missings should be. Therefore, I will like to ignore missing data and use all available data in observations to calculate distance. 

PGStats
Opal | Level 21

Can you share a bit more about the context, in general terms?

PG
SeanZ
Obsidian | Level 7

The problem is shown as below.

 

I have a dataset of eight variables, firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics a - e. Treat is dummy variable taking 1 if a firm is a treated firm and 0 if a firm is a controlled firm. For each year there is only ONE treated firm, and multiple controlled firms. 

 

Now for each year each treated firm, I'd like to find a control firm that is the most similar to the treated firm based on a-e. That's why I want to calculate Euclidean distance between treated firm and other control firms.

 

The challenge here is that not all treated/controlled firms have full values for a -e. My idea is to use all available information for each treated firms to calculate distance. For example, if treated firm 001 has all values on a, b, c, d, e, then I would like to calculate distance between 001 and all controlled firms that have all a-e information in a particular year. if treated firm 002 has only values on a, b, c, then I would like to calculate distance between treated 002 and controlled firms that have available information on a, b, c.

 

Do you have any thoughts on this? Thanks.

mkeintz
PROC Star

@SeanZ wrote:

Thanks PGStats, my problem is that replacing missing values with zeros is not appropriate in the context. It's hard to predict what missings should be. Therefore, I will like to ignore missing data and use all available data in observations to calculate distance. 


 

@SeanZ:  Are you are saying that a set of two-dimensional distances with no missing data is superior for finding the nearest neighbor to point 1, than would be estimating missing instances of other dimensions and then using 5-dimensional distances.

 

I mean, if the 5-dimensional metric would be suitable for getting eucliden distance if only there were no missing values for some dimensions, then shouldn't there be an acceptable tool for estimating the missing values using correlations among a b c d and e?   Even if the proc stdsize suggested by @PGStats is not exactly right.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SeanZ
Obsidian | Level 7

Hi mkeintz.

 

The problem is shown as below.

 

I have a dataset of eight variables, firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics a - e. Treat is dummy variable taking 1 if a firm is a treated firm and 0 if a firm is a controlled firm. For each year there is only ONE treated firm, and multiple controlled firms. 

 

Now for each year each treated firm, I'd like to find a control firm that is the most similar to the treated firm based on a-e. That's why I want to calculate Euclidean distance between treated firm and other control firms.

 

The challenge here is that not all treated/controlled firms have full values for a -e. My idea is to use all available information for each treated firms to calculate distance. For example, if treated firm 001 has all values on a, b, c, d, e, then I would like to calculate distance between 001 and all controlled firms that have all a-e information in a particular year. if treated firm 002 has only values on a, b, c, then I would like to calculate distance between treated 002 and controlled firms that have available information on a, b, c.

 

Do you have any thoughts on this? Thanks.

PGStats
Opal | Level 21

So this is more a programming problem than a statistical problem. You could do something along these lines:

 

data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') < 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;
PG

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1832 views
  • 5 likes
  • 3 in conversation