topic Re: Euclidean Distance Without SAS/STATS in Statistical Procedures

Euclidean Distance Without SAS/STATS

jawhitmire — Wed, 28 Nov 2018 23:12:06 GMT

Current version: 9.04.01M5P091317

Thank you for viewing this post.

I do not have access to SAS/STATS and thus, cannot use "proc distance", but need to calculate the Euclidean distance between multi-dimensional points. A point is represented by each column of data.

The first column of the dataset is a list of compounds for a batch number (batchno) equal to 0 with name _0.

The remaining 22583 columns (not shown) are other batches labeled _375, _376, _377,...,_22583 and stem from a transpose of the original dataset.

There are 11 observations per batch (column), all type numeric 8, and there are no missing values.

The goal is to calculate the Euclidean distance between batch _0 and all the other batches to determine where batch _0 came from based on the minimum distance.

The square of the difference is calculated for some batches in the source code shown below. Each line creates a new column in the table view. Then the sum can be calculated using "proc means".

I acknowledged that the square root and minimum distance are also necessary to complete the process.

data euclid;
	set fewerbatches;
		b375=((_0-_375))**2;
		b376=((_0-_376))**2;
		b377=((_0-_375))**2;
		b378=((_0-_376))**2;
		b379=((_0-_375))**2;
		b380=((_0-_376))**2;
		b381=((_0-_375))**2;
		b382=((_0-_376))**2;
		b383=((_0-_375))**2;
		b384=((_0-_376))**2;
		b385=((_0-_375))**2;
		b386=((_0-_376))**2;
		b387=((_0-_375))**2;
		b388=((_0-_376))**2;
		b389=((_0-_375))**2;
		b390=((_0-_376))**2;
	run;

proc means data=euclid sum;
	var b375;
	run;

It is clear to me that an array or macro would reduce the potential size of the code. For example, I reduced the size and created an array:

array btchs(142) b375-b515;

This created several more columns of missing data. Attempts to fill the columns with difference of squares was partially successful, but equally as redundant as what I have shown above.

Again, thank you in advance for the assistance.

Jane

Re: Euclidean Distance Without SAS/STATS

Reeza — Thu, 29 Nov 2018 00:15:05 GMT

So your picture is what you have and what do you want as output. Please post data as text, include what you have and show your desired output. I would also suggest not a macro but either SQL or a data step, but I think a data step is enough here. Euclidean distance is a straightforward calculation, so it’s just finding the fastest option. I suspect a data transformation would be needed as well.

Re: Euclidean Distance Without SAS/STATS

ballardw — Thu, 29 Nov 2018 00:24:37 GMT

I think that you want something based on this:

data want;
 set have;
 array a {375:22583} _375-_22583;
 array b {375:22583} b375 - b22583;
 do i= 375 to 22583;
   b[i] = ( _0 - a[i])**2;
 end;
run;

This form of defining the definition allows use of 375 to reference the first item in the array, 376 the second and so on.

I would try this with MUCH reduced set variables and indices as there is the potential to create some really outrageous sets.

I suspect that the transpose wasn't needed if there is some identification information you didn't share as by group processing getting the initial value as the first of a by group and retaining it for use with the later batches may have been an alternate approach:

data have;
   input group value;
datalines;
1  23
1  345
1  567
2  1.3
2  1.9
2  .4
3  555
3  666
3  777
;
run;

data want;
   set have;
   by group;
   retain firstval;
   if first.group then firstval=value;
   else b = (firstval - value)**2;
run;

Re: Euclidean Distance Without SAS/STATS

PGStats — Thu, 29 Nov 2018 04:23:34 GMT

Even with proc distance you would want to work with the compounds as columns and batches as rows. Using a datastep, read the first batch on first iteration into a retained array and then read other batches into another array. Substract the _0 batch compound values and compute the euclidian norm with function EUCLID().

Re: Euclidean Distance Without SAS/STATS

jawhitmire — Thu, 29 Nov 2018 15:55:49 GMT

Thank you for the quick reply. I reduced the size of the file and the following code worked to square each variable. Cheers!

data euclid;

set fewerbatches;

array a {375:400} _375-_400;

array b {375:400} b375-b400;

do i=375 to 400;

b[i]=(_0-a[i])**2;

end;

drop i;

run;

The original file had batches by rows with no grouping variable as each batch is unique.

Now that I have the difference of squares, I need to find the sum of each batch column, b375-b400, take the square root of the sum, then perhaps use proc means to find the minimum Euclidean distance. Based on feedback received from 3 SAS community experts, it seems the next step is easier when I transpose dataset "Euclid" back to rows?

Thank you to all for the quick reply and assistance.