Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Euclidean Distance Without SAS/STATS

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 11-28-2018 06:12 PM
(907 views)

Current version: 9.04.01M5P091317

Thank you for viewing this post.

I do not have access to SAS/STATS and thus, cannot use "proc distance", but need to calculate the Euclidean distance between multi-dimensional points. A point is represented by each column of data.

The first column of the dataset is a list of compounds for a batch number (batchno) equal to 0 with name _0.

The remaining 22583 columns (not shown) are other batches labeled _375, _376, _377,...,_22583 and stem from a transpose of the original dataset.

There are 11 observations per batch (column), all type numeric 8, and there are no missing values.

The goal is to calculate the Euclidean distance between batch _0 and all the other batches to determine where batch _0 came from based on the minimum distance.

The square of the difference is calculated for some batches in the source code shown below. Each line creates a new column in the table view. Then the sum can be calculated using "proc means".

I acknowledged that the square root and minimum distance are also necessary to complete the process.

```
data euclid;
set fewerbatches;
b375=((_0-_375))**2;
b376=((_0-_376))**2;
b377=((_0-_375))**2;
b378=((_0-_376))**2;
b379=((_0-_375))**2;
b380=((_0-_376))**2;
b381=((_0-_375))**2;
b382=((_0-_376))**2;
b383=((_0-_375))**2;
b384=((_0-_376))**2;
b385=((_0-_375))**2;
b386=((_0-_376))**2;
b387=((_0-_375))**2;
b388=((_0-_376))**2;
b389=((_0-_375))**2;
b390=((_0-_376))**2;
run;
proc means data=euclid sum;
var b375;
run;
```

It is clear to me that an array or macro would reduce the potential size of the code. For example, I reduced the size and created an array:

array btchs(142) b375-b515;

This created several more columns of missing data. Attempts to fill the columns with difference of squares was partially successful, but equally as redundant as what I have shown above.

Again, thank you in advance for the assistance.

Jane

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I think that you want something based on this:

data want; set have; array a {375:22583} _375-_22583; array b {375:22583} b375 - b22583; do i= 375 to 22583; b[i] = ( _0 - a[i])**2; end; run;

This form of defining the definition allows use of 375 to reference the first item in the array, 376 the second and so on.

I would try this with MUCH reduced set variables and indices as there is the potential to create some really outrageous sets.

I suspect that the transpose wasn't needed if there is some identification information you didn't share as by group processing getting the initial value as the first of a by group and retaining it for use with the later batches may have been an alternate approach:

data have; input group value; datalines; 1 23 1 345 1 567 2 1.3 2 1.9 2 .4 3 555 3 666 3 777 ; run; data want; set have; by group; retain firstval; if first.group then firstval=value; else b = (firstval - value)**2; run;

4 REPLIES 4

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

So your picture is what you have and what do you want as output. Please post data as text, include what you have and show your desired output. I would also suggest not a macro but either SQL or a data step, but I think a data step is enough here. Euclidean distance is a straightforward calculation, so it’s just finding the fastest option. I suspect a data transformation would be needed as well.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I think that you want something based on this:

data want; set have; array a {375:22583} _375-_22583; array b {375:22583} b375 - b22583; do i= 375 to 22583; b[i] = ( _0 - a[i])**2; end; run;

This form of defining the definition allows use of 375 to reference the first item in the array, 376 the second and so on.

I would try this with MUCH reduced set variables and indices as there is the potential to create some really outrageous sets.

I suspect that the transpose wasn't needed if there is some identification information you didn't share as by group processing getting the initial value as the first of a by group and retaining it for use with the later batches may have been an alternate approach:

data have; input group value; datalines; 1 23 1 345 1 567 2 1.3 2 1.9 2 .4 3 555 3 666 3 777 ; run; data want; set have; by group; retain firstval; if first.group then firstval=value; else b = (firstval - value)**2; run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you for the quick reply. I reduced the size of the file and the following code worked to square each variable. Cheers!

**data** euclid;

set fewerbatches;

array a {**375**:**400**} _375-_400;

array b {**375**:**400**} b375-b400;

do i=**375** to **400**;

b[i]=(_0-a[i])****2**;

end;

drop i;

**run**;

The original file had batches by rows with no grouping variable as each batch is unique.

Now that I have the difference of squares, I need to find the sum of each batch column, b375-b400, take the square root of the sum, then perhaps use proc means to find the minimum Euclidean distance. Based on feedback received from 3 SAS community experts, it seems the next step is easier when I transpose dataset "Euclid" back to rows?

Thank you to all for the quick reply and assistance.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Even with proc distance you would want to work with the compounds as columns and batches as rows. Using a datastep, read the first batch on first iteration into a retained array and then read other batches into another array. Substract the _0 batch compound values and compute the euclidian norm with function EUCLID().

PG

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.