jawhitmire
Quartz | Level 8

Current version: 9.04.01M5P091317

 

Thank you for viewing this post.

I do not have access to SAS/STAT and thus cannot use "proc distance", but I need to calculate the Euclidean distance between multi-dimensional points.  Each column of data represents a point.

The first column of the dataset is a list of compounds for a batch number (batchno) equal to 0 with name _0.

 

[Attached image 1.JPG: screenshot of the dataset's first column, _0]

The remaining 22583 columns (not shown) are other batches labeled _375, _376, _377,...,_22583 and stem from a transpose of the original dataset.

There are 11 observations per batch (column), all type numeric 8, and there are no missing values.

The goal is to calculate the Euclidean distance between batch _0 and all the other batches to determine where batch _0 came from based on the minimum distance.

The squared difference is calculated for some batches in the source code shown below.  Each line creates a new column in the table view.  The sum can then be calculated using "proc means".

I acknowledge that the square root and minimum distance are also necessary to complete the process.

 

data euclid;
	set fewerbatches;
	b375=(_0-_375)**2;
	b376=(_0-_376)**2;
	b377=(_0-_377)**2;
	b378=(_0-_378)**2;
	b379=(_0-_379)**2;
	b380=(_0-_380)**2;
	b381=(_0-_381)**2;
	b382=(_0-_382)**2;
	b383=(_0-_383)**2;
	b384=(_0-_384)**2;
	b385=(_0-_385)**2;
	b386=(_0-_386)**2;
	b387=(_0-_387)**2;
	b388=(_0-_388)**2;
	b389=(_0-_389)**2;
	b390=(_0-_390)**2;
run;

proc means data=euclid sum;
	var b375;
	run;
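For completeness, here is one hedged sketch of the remaining steps described above (summing each column across the 11 observations, taking the square root, and finding the minimum), assuming the squared-difference columns b375-b390 produced by the data step:

```sas
/* Sum each squared-difference column across the 11 observations */
proc means data=euclid sum noprint;
	var b375-b390;
	output out=sums(drop=_type_ _freq_) sum=;
run;

/* Take the square root of each sum to get the Euclidean distances */
data dists;
	set sums;
	array b{*} b375-b390;
	do i = 1 to dim(b);
		b[i] = sqrt(b[i]);
	end;
	drop i;
run;

/* Transpose so each batch becomes a row, then sort: the first
   observation is the batch with the minimum distance to _0 */
proc transpose data=dists out=long(rename=(col1=distance)) name=batch;
run;

proc sort data=long;
	by distance;
run;
```

The dataset names sums, dists, and long are only illustrative; the same pattern scales to the full b375-b22583 range.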

 

It is clear to me that an array or macro would reduce the size of the code.  For example, I reduced the batch range and created an array:

array btchs(141) b375-b515;

This created several more columns of missing data.  Attempts to fill the columns with squared differences were partially successful, but equally as redundant as what I have shown above.

 

Again, thank you in advance for the assistance.

 

Jane

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

I think that you want something based on this:

 

data want;
 set have;
 array a {375:22583} _375-_22583;
 array b {375:22583} b375 - b22583;
 do i= 375 to 22583;
   b[i] = ( _0 - a[i])**2;
 end;
run;

This form of array definition allows you to use 375 to reference the first item in the array, 376 the second, and so on.

 

I would try this with a MUCH reduced set of variables and indices, as there is the potential to create some really outrageously large data sets.

 

I suspect that the transpose wasn't needed if there is some identification information you didn't share.  BY-group processing may have been an alternate approach: get the initial value as the first of a BY group and retain it for use with the later batches:

data have;
   input group value;
datalines;
1  23
1  345
1  567
2  1.3
2  1.9
2  .4
3  555
3  666
3  777
;
run;

data want;
   set have;
   by group;
   retain firstval;
   if first.group then firstval=value;
   else b = (firstval - value)**2;
run;
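To finish the distance calculation from this layout, a possible sketch (assuming the want dataset and b variable from the step above, where b is missing on each group's first record and PROC MEANS ignores missing values) sums the squared differences within each group and takes the square root:

```sas
/* Sum the squared differences within each BY group */
proc means data=want sum noprint;
	by group;
	var b;
	output out=dist(drop=_type_ _freq_) sum=sumsq;
run;

/* The Euclidean distance for each group is the square root of that sum */
data dist;
	set dist;
	distance = sqrt(sumsq);
run;
```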


4 REPLIES 4
Reeza
Super User
So your picture shows what you have; what do you want as output?  Please post data as text, including what you have, and show your desired output.  I would suggest not a macro but either SQL or a data step, though I think a data step is enough here.  Euclidean distance is a straightforward calculation, so it's just a matter of finding the fastest option.  I suspect a data transformation would be needed as well.
jawhitmire
Quartz | Level 8

Thank you for the quick reply.  I reduced the size of the file, and the following code worked to square each difference.  Cheers!

 

data euclid;
	set fewerbatches;
	array a {375:400} _375-_400;
	array b {375:400} b375-b400;
	do i=375 to 400;
		b[i]=(_0-a[i])**2;
	end;
	drop i;
run;

 

The original file had batches by rows with no grouping variable as each batch is unique.

 

Now that I have the squared differences, I need to find the sum of each batch column, b375-b400, take the square root of each sum, and then perhaps use proc means to find the minimum Euclidean distance.  Based on feedback from three SAS community experts, it seems the next step is easier if I transpose dataset "Euclid" back to rows?
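One way to sketch that next step without transposing by hand (assuming the b375-b400 columns from the code above) is to let PROC MEANS produce the column sums, transpose the one-row result so each batch becomes a row, and sort by distance:

```sas
/* Column sums of the squared differences */
proc means data=euclid sum noprint;
	var b375-b400;
	output out=sums(drop=_type_ _freq_) sum=;
run;

/* One row per batch: batch name in BATCH, sum in SUMSQ */
proc transpose data=sums out=long(rename=(col1=sumsq)) name=batch;
run;

data long;
	set long;
	distance = sqrt(sumsq);
run;

/* After sorting, the first observation is the closest batch */
proc sort data=long;
	by distance;
run;

proc print data=long(obs=1);
run;
```

The dataset names sums and long are illustrative only.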

 

Thank you to all for the quick reply and assistance.

PGStats
Opal | Level 21

Even with proc distance you would want to work with the compounds as columns and batches as rows.  Using a DATA step, read the first batch on the first iteration into a retained array, then read the other batches into another array.  Subtract the _0 batch compound values and compute the Euclidean norm with the EUCLID() function.
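A minimal sketch of that approach, assuming the untransposed data has one row per batch with the 11 compound values in columns c1-c11 and batch _0 on the first row (the dataset and variable names here are hypothetical):

```sas
data distances;
	set batches;                      /* one row per batch, compounds in c1-c11 */
	array c{11} c1-c11;
	array ref{11} r1-r11;             /* retained copy of the _0 batch */
	array d{11} d1-d11;               /* differences for the current batch */
	retain r1-r11;
	if _n_ = 1 then do i = 1 to 11;   /* first row holds batch _0 */
		ref[i] = c[i];
	end;
	else do;
		do i = 1 to 11;
			d[i] = ref[i] - c[i];
		end;
		/* EUCLID() returns the Euclidean norm of its arguments */
		distance = euclid(of d1-d11);
	end;
	drop i r1-r11 d1-d11;
run;
```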

PG


Discussion stats
  • 4 replies
  • 1057 views
  • 0 likes
  • 4 in conversation