<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: proc distance with missing data in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334122#M75425</link>
    <description>&lt;P&gt;The problem is described below.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have a dataset of eight variables: firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics, a-e. Treat is a dummy variable equal to 1 if a firm is a treated firm and 0 if it is a control firm. For each year there is only ONE treated firm and multiple control firms.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now, for each year's treated firm, I'd like to find the control firm that is most similar to the treated firm based on a-e. That's why I want to calculate the Euclidean distance between the treated firm and each control firm.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The challenge is that not all treated/control firms have complete values for a-e. My idea is to use all the information available for each treated firm to calculate the distance. For example, if treated firm 001 has values for all of a, b, c, d, e, then I would like to calculate the distance between 001 and all control firms that have complete a-e information in that year. If treated firm 002 has values only for a, b, c, then I would like to calculate the distance between 002 and the control firms that have values for a, b, c.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you have any thoughts on this? Thanks.&lt;/P&gt;</description>
    <pubDate>Sat, 18 Feb 2017 19:35:26 GMT</pubDate>
    <dc:creator>SeanZ</dc:creator>
    <dc:date>2017-02-18T19:35:26Z</dc:date>
    <item>
      <title>proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/333986#M75348</link>
      <description>&lt;P&gt;Hi, I have a general question regarding proc distance. I have a dataset of five variables, a, b, c, d, e, and many observations, say 100. I would like to get a distance matrix for all pairs of the 100 observations, using all five variables in the calculation. But there are some missing values in a-e. I still want the distance for each pair, ignoring ONLY the missing value(s) in an observation rather than dropping the whole observation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For example, obs 1 has only a, b, c and d, and obs 2 only has a and c. Then I would like to calculate the distance based on a and c.&lt;/P&gt;
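&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As a rough illustration of the calculation described above (a sketch only, not PROC DISTANCE syntax, and with no standardization), a DATA step could compute each pairwise distance from just the variables that are nonmissing in both observations. It assumes the data set is called HAVE and contains a-e; the names pair_dist, obs_i, obs_j, n_used and dist are made up for the example.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data pair_dist;
  set have nobs=n;                 /* outer observation of the pair */
  array x{5} a b c d e;
  array y{5} a2 b2 c2 d2 e2;       /* same variables from the paired observation */
  obs_i = _n_;
  do j = obs_i + 1 to n;
    set have(rename=(a=a2 b=b2 c=c2 d=d2 e=e2)) point=j;
    obs_j = j;                     /* POINT= variables are not written out */
    ss = 0;
    n_used = 0;
    do k = 1 to 5;
      /* use a variable only when both observations have it */
      if not missing(x{k}) and not missing(y{k}) then do;
        ss = ss + (x{k} - y{k})**2;
        n_used = n_used + 1;
      end;
    end;
    if n_used then dist = sqrt(ss);
    else dist = .;                 /* no variable in common */
    output;
  end;
  keep obs_i obs_j n_used dist;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;n_used records how many variables went into each distance, since distances built from different numbers of variables are not directly comparable.&lt;/P&gt;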
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;How can I do that in proc distance? I also need to standardize these variables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Feb 2017 22:25:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/333986#M75348</guid>
      <dc:creator>SeanZ</dc:creator>
      <dc:date>2017-02-17T22:25:03Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/333990#M75352</link>
      <description>&lt;P&gt;What do you actually mean by "distance"? At first I thought you meant Euclidean distance in the 5-dimensional space, with each point located at its (a,b,c,d,e) coordinates. Are you saying that you want a distance in a 2-dimensional subspace just for a subset of points?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What does it mean for the distance X1 -- X2 to be larger than, smaller than, or equal to the distance X1 -- X3 when X1 and X2 are 5-dimensional values but X3 is only 2-dimensional? What research purpose would that serve?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Perhaps you should look at the correlations among the complete points, and use them to make inferences about the missing elements of the incomplete points.&lt;/P&gt;
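&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A minimal sketch of that idea, assuming SAS/STAT is available and a-e are roughly multivariate normal: PROC MI fills in the missing values from a model fitted to the observed data. The output data set name and the seed are placeholders.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* One imputed copy of HAVE; a-e have no missing values in the output */
proc mi data=have out=have_imputed nimpute=1 seed=20170217 noprint;
  var a b c d e;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With NIMPUTE=1 each missing value gets a single random draw rather than a best estimate, so the nearest neighbor may change with a different seed.&lt;/P&gt;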
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Feb 2017 23:18:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/333990#M75352</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-02-17T23:18:42Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/333994#M75354</link>
      <description>&lt;P&gt;Hi mkeintz,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your first thought is quite right:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;"&lt;SPAN&gt;At first I thought you meant Euclidean distance in the 5-dimensional space, with each point located at its (a,b,c,d,e) coordinates. Are you saying that you want a distance in a 2-dimensional subspace just for a subset of points?"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Among the 100 observations, I only care about the distance between the 1st obs and all the others, and I am trying to find the nearest neighbor of the 1st obs. In that sense, I want the 2-dimensional Euclidean distance for the example above.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Do you have any thoughts on how to do this?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Thanks.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Feb 2017 23:23:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/333994#M75354</guid>
      <dc:creator>SeanZ</dc:creator>
      <dc:date>2017-02-17T23:23:45Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334018#M75367</link>
      <description>&lt;P&gt;If you really only want the 2-dimensional distance (using dimensions a and b) from point P1 to each of P2 ... P100, that's only 99 distances. There is no need to determine all the pairwise distances, so a simple DATA step should work.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Keep a and b from the first observation, then compute each later
   observation's distance to it. */
data want (drop=a1 b1 where=(dist_to_1^=.));
  set have;
  retain a1 b1;
  if _n_=1 then do; a1=a; b1=b; end;
  else dist_to_1=sqrt((a1-a)**2+(b1-b)**2);
run;

/* The nearest neighbor of observation 1 sorts to the top. */
proc sort data=want;
  by dist_to_1;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dataset WANT will have 1 fewer observation than HAVE (fewer still if a or b is missing in some observations, since those distances are missing too). After the sort, the first obs of WANT will be the one closest to the first obs of HAVE.&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 02:49:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334018#M75367</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-02-18T02:49:17Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334027#M75373</link>
      <description>&lt;P&gt;If you standardize your data with:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc stdize data=have out=haveFilled method=mad replace oprefix sprefix=s;
var a -- e;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;the data will be centered on the median and rescaled by a robust measure of dispersion (the MAD, median absolute deviation). With the REPLACE option, missing values are replaced with zeros (the new median).&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 04:23:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334027#M75373</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2017-02-18T04:23:11Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334037#M75376</link>
      <description>&lt;P&gt;Thanks PGStats. My problem is that replacing missing values with zeros is not appropriate in this context. It's hard to predict what the missing values should be. Therefore, I would like to ignore the missing data and use all the available data in each observation to calculate the distance.&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 07:21:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334037#M75376</guid>
      <dc:creator>SeanZ</dc:creator>
      <dc:date>2017-02-18T07:21:24Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334038#M75377</link>
      <description>&lt;P&gt;Thanks. In this case, I would like the code to automatically use all nonmissing values in each observation to calculate the distance. The 2-dimensional distance is just an example for when nonmissing data exist in only those two dimensions. If, for example, observation 1 and observation 5 have nonmissing data in all five dimensions, I would like the distance in five dimensions.&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 07:24:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334038#M75377</guid>
      <dc:creator>SeanZ</dc:creator>
      <dc:date>2017-02-18T07:24:51Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334103#M75411</link>
      <description>&lt;P&gt;Can you share a bit more about the context, in general terms?&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 17:13:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334103#M75411</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2017-02-18T17:13:16Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334107#M75413</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/52375"&gt;@SeanZ&lt;/a&gt; wrote:&lt;BR /&gt;
&lt;P&gt;Thanks PGStats. My problem is that replacing missing values with zeros is not appropriate in this context. It's hard to predict what the missing values should be. Therefore, I would like to ignore the missing data and use all the available data in each observation to calculate the distance.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/52375"&gt;@SeanZ&lt;/a&gt;: Are you saying that a set of two-dimensional distances with no missing data is better for finding the nearest neighbor of point 1 than estimating the missing values in the other dimensions and then using 5-dimensional distances?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I mean, if the 5-dimensional metric would be suitable for getting Euclidean distance if only there were no missing values in some dimensions, then shouldn't there be an acceptable tool for estimating the missing values using the correlations among a, b, c, d and e? Even if the proc stdize approach suggested by &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt; is not exactly right.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 17:46:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334107#M75413</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-02-18T17:46:01Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334122#M75425</link>
      <description>&lt;P&gt;The problem is described below.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have a dataset of eight variables: firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics, a-e. Treat is a dummy variable equal to 1 if a firm is a treated firm and 0 if it is a control firm. For each year there is only ONE treated firm and multiple control firms.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now, for each year's treated firm, I'd like to find the control firm that is most similar to the treated firm based on a-e. That's why I want to calculate the Euclidean distance between the treated firm and each control firm.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The challenge is that not all treated/control firms have complete values for a-e. My idea is to use all the information available for each treated firm to calculate the distance. For example, if treated firm 001 has values for all of a, b, c, d, e, then I would like to calculate the distance between 001 and all control firms that have complete a-e information in that year. If treated firm 002 has values only for a, b, c, then I would like to calculate the distance between 002 and the control firms that have values for a, b, c.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you have any thoughts on this? Thanks.&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 19:35:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334122#M75425</guid>
      <dc:creator>SeanZ</dc:creator>
      <dc:date>2017-02-18T19:35:26Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334123#M75426</link>
      <description>&lt;P&gt;Hi mkeintz.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The problem is described below.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have a dataset of eight variables: firm ID, year, treat, a, b, c, d, e. For each firm in each year, there are five firm characteristics, a-e. Treat is a dummy variable equal to 1 if a firm is a treated firm and 0 if it is a control firm. For each year there is only ONE treated firm and multiple control firms.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now, for each year's treated firm, I'd like to find the control firm that is most similar to the treated firm based on a-e. That's why I want to calculate the Euclidean distance between the treated firm and each control firm.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The challenge is that not all treated/control firms have complete values for a-e. My idea is to use all the information available for each treated firm to calculate the distance. For example, if treated firm 001 has values for all of a, b, c, d, e, then I would like to calculate the distance between 001 and all control firms that have complete a-e information in that year. If treated firm 002 has values only for a, b, c, then I would like to calculate the distance between 002 and the control firms that have values for a, b, c.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you have any thoughts on this? Thanks.&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 19:36:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334123#M75426</guid>
      <dc:creator>SeanZ</dc:creator>
      <dc:date>2017-02-18T19:36:27Z</dc:date>
    </item>
    <item>
      <title>Re: proc distance with missing data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334165#M75445</link>
      <description>&lt;P&gt;So this is more a programming problem than a statistical problem. You could do something along these lines:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
call streaminit(687687);
array v{*} a b c d e;
do year = 2000 to 2003;
    do firm = 1 to 10;
        treat = mod(firm,10) = mod(year,10);
        do i = 1 to dim(v);
            if rand('uniform') &amp;lt; 0.1 then call missing(v{i});
            else v{i} = rand('normal');
            end;
        output;
        end;
    end;
drop i;
run;

proc sort data=have; by year descending treat firm; run;

data want;
array v{*} a b c d e;
array _v{*} _a _b _c _d _e;
merge 
    have (where=(treat) rename=(a=_a b=_b c=_c d=_d e=_e firm=treatedFirm))
    have ;
by year;
drop = treat;
do _i = 1 to dim(v);
    if missing(_v{_i}) then call missing(v{_i});
    else if missing(v{_i}) then drop = 1;
    else v{_i} = v{_i} - _v{_i};
    end;
distance = euclid(of v{*});
drop _: ;
run;

proc sql;
select year, treatedFirm, firm, distance
from want 
where not drop
group by year
having distance=min(distance);
quit;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 19 Feb 2017 05:22:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/proc-distance-with-missing-data/m-p/334165#M75445</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2017-02-19T05:22:14Z</dc:date>
    </item>
  </channel>
</rss>

