BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
windyboo
Calcite | Level 5

Hello,

I have a problem that might be solved be a loop (I guess).

I must calculate the square root of the sommation of the differences (square) between the weights invested between two banks. The application of the equation 1 at the page 5: http://www.ccfr.org.cn/cicf2010/papers/20091210013047.pdf

I have 20 banks. I have 20 tables with 11 columns. One to identify the years and ten others to identify the weights invested in each of the ten industries (the sommation on one row of the ten weights is equal to 1, of course).

The structure of 20 tables is the same. I must take each pair of banks to apply the equation. My result should be a number for each year for two banks compared.

Can you help me with the code or its structure?

Thanks a lot.

1 ACCEPTED SOLUTION

Accepted Solutions
Vince28_Statcan
Quartz | Level 8

In this case, a slight tweak to the example code i've listed in post#7 will get you the desired output.

Sadly, proc distance requires a character variable for ID and doesn't compute square root so it will have to be done in 3 steps (would be 4 if your single dataset wasn't sorted by year/id)

step1 create a character id

data have;

     set have;

     bank_id="_"||put(ultimateparentid, best8.);

run;

step2 calculate distance

proc distance data=have out=want method=sqeuclid;

     var interval (w_:); /* use w_: to represent the group of all variables starting with w_*/

     id bank_id;

     by year;

run;

step3 calculate square root

data want;

     set want;

     array dist {*} _: ; /* as proc distance creates a square matrix with variables named after bank_id and we used _ to preceed all numeric bank ID, we can use _: to index all such created variables through an array */

     do over dist;

          if dist NE . then dist=sqrt(dist); /* I believe this is the appropriate syntax for do over but its been a while, at worse, you can use do i=1 to dim(dist) and index dist{i} inside the loop */

     end;

run;

You will notice a variable containing all _id and 20 variables _id with a lower triangular matrix of calculated distances. it's up to you to convert that in a different format if you want.

View solution in original post

15 REPLIES 15
Reeza
Super User

How many observations per table?

Reeza
Super User

You should also post a sample of what your data looks like and the sample output...I can read the paper but don't actually want to Smiley Happy

windyboo
Calcite | Level 5

I will.

windyboo
Calcite | Level 5


18 rows, 11 columns - the column of the years is just to identify.

Vince28_Statcan
Quartz | Level 8

At first I thought "IML JOB" but since you don't want it output in a matrix format and just want the lower(or upper) triangle of calculations, it is probably not worth learning proc iml. Esp. since a big chunk of the work is data manipulation and not quite the element-wise squared difference of vectors. Anyway, here's how I would tackle the problem:

STEP 1. Preprocess your 20  per bank files to obtain 18 per year files and change the year variable to a bank code or name variable. This can be done reasonably easily using the IN= dataset option in a single data step like

data y1995 y1996 y1997 /* ... */ y2013;

     set bank1(IN=b1) bank2(in=b1) /* ... */ bank20(in=b20);

     array bcode {*} b1-b20;

     case (yearvariable);

          when (1995) then output y1995;

          /*...*/

          when (2013 then output y2013;

          /* you can use an otherwise statement for outstanding year data for validation */

     end;

run;

STEP 2 - Calculate each statistic and output that to a new dataset with bank-indicator variables

/* use macro to loop over over the 18 new datasets, providing the logic example below */

data want1995(keep=b1-b20 outputstat);

     array outer {30} _temporary_;

     array dv {*} b1-b20 w1-w10 /* there should be exactly 30 variables, the 20 bank indicators and the 10 weights adjust weight variable names accordingly */

     do i=1 to outerobs;

          set y1995(nobs=outerobs drop=yearvariable);

          do k=1 to 30;

               outer{k}=dv{k}; /*store outer record in a temporary array to have it available while looping over all subsequent rows in the file */

          end;

            

          do j=(i+1) to innerobs;

               set y1995(nobs=innerobs);

               do k=1 to 20;

                    dv{k}=max(dv{k}, outer{k}); /*replace one column of 0 with a 1 to indicate which 2 banks contribute to the output calculation */

               end;

               temp=0; /*initialize Sum of Squares*/

               do k=21 to 30;

                    temp=temp+(dv{k}-outer{k})**2;

               end;

               outputstat=sqrt(temp);

               output;

          end;

     end;

run;

This is all untested but it provides a logical path on how to handle the looping and data manipulation. The resulting wantYYYY dataset will have n(n-1)/2 rows for each possible pairs with 20 "bank indicators" variables, 18 of which are set to 0, 2 of which are set to 1 and a single output statistic.

Hope this helps  

Vince

Reeza
Super User

Wouldn't there be a way to set all the datasets together and then use by groups and proc distance with squared euclidean distance?

This would avoid loops and makes it much easier to understand.

Need data though Smiley Happy

Vince28_Statcan
Quartz | Level 8

I didn't know about that procedure but it appears you can. I haven't figured out a way to have the square root calculated directly in the procedure but it can calculate the SS easily for the lower triangle by group.

Something such as:

data bankset1;

     input year w1 w2 w3 w4 w5;

     datalines;

2000 0.1 0.2 0.2 0.4 0.1

2001 0.2 0.1 0.2 0.4 0.1

2002 0.5 0.1 0.1 0.1 0.2

;

run;

data bankset2;

     input year w1 w2 w3 w4 w5;

     datalines;

2000 0.2 0.1 0.1 0.5 0.1

2001 0.2 0.2 0.2 0.2 0.2

2002 0.3 0.3 0.1 0.1 0.2

;

run;

data bankset3;

     input year w1 w2 w3 w4 w5;

     datalines;

2000 0.2 0.2 0.2 0.3 0.1

2001 0.7 0.1 0.1 0.0 0.1

2002 0.5 0.1 0.3 0.0 0.1

;

run;

%macro m();

data have;

     %do i=1 %to 3;

     set bankset&i.;

     bank_id="bank&i";

     output;

     %end;

%mend;

%m();

proc sort data=have;

     by year bank_id;

run;

proc distance data=have out=want method=sqeuclid;

     var interval (w1-w5);

     id bank_id;

     by year;

run;

just gotta calculate square root!

Thanks Reeza for the lesson

Reeza
Super User

You don't need the macro either Vince, assuming SAS 9.2+ Smiley Happy

data have;

set bankset1-bankset3 indsname=source;

bank_id=source;

run;

windyboo
Calcite | Level 5

Thanks Vince and Reeza for your answers.

Sorry for the delay. I rebuild a new table with all data.

I take a printscreen to show you what it looks like.

My table has 360 rows (20 banks x 18 years) and 12 columns.

printscreen_20banks.png

I create a column of ids (UltimateParentID). So, my code must do this:

(Legend: Variable, UltimateParentID, year)

square root [((W_AG,7052,1995)-(W_AG,7135,1995))^2+ ((W_MIN,7052,1995)-(W_MIN,7135,1995))^2+((W_CON,7052,1995)-(W_CON,7135,1995))^2+((W_TRAN,7052,1995)-(W_TRAN,7135,1995))^2+

((W_WHOLE,7052,1995)-(W_WHOLE,7135,1995))^2+((W_RE,7052,1995)-(W_RE,7135,1995))^2+((W_SER,7052,1995)-(W_SER,7135,1995))^2+((W_FIN,7052,1995)-(W_FIN,7135,1995))^2+

((W_PUBLIC,7052,1995)-(W_PUBLIC,7135,1995))^2+((W_Autre,7052,1995)-(W_Autre,7135,1995))^2)]

This calculation must be repeated between all group of two banks for each year.

n(n-1)/2=190 results per year

I made clarifications. I think my problem was incorrectly formulated.

Vince28_Statcan
Quartz | Level 8

In this case, a slight tweak to the example code i've listed in post#7 will get you the desired output.

Sadly, proc distance requires a character variable for ID and doesn't compute square root so it will have to be done in 3 steps (would be 4 if your single dataset wasn't sorted by year/id)

step1 create a character id

data have;

     set have;

     bank_id="_"||put(ultimateparentid, best8.);

run;

step2 calculate distance

proc distance data=have out=want method=sqeuclid;

     var interval (w_:); /* use w_: to represent the group of all variables starting with w_*/

     id bank_id;

     by year;

run;

step3 calculate square root

data want;

     set want;

     array dist {*} _: ; /* as proc distance creates a square matrix with variables named after bank_id and we used _ to preceed all numeric bank ID, we can use _: to index all such created variables through an array */

     do over dist;

          if dist NE . then dist=sqrt(dist); /* I believe this is the appropriate syntax for do over but its been a while, at worse, you can use do i=1 to dim(dist) and index dist{i} inside the loop */

     end;

run;

You will notice a variable containing all _id and 20 variables _id with a lower triangular matrix of calculated distances. it's up to you to convert that in a different format if you want.

windyboo
Calcite | Level 5

Thanks a lot Vince.

I tried your codes. I have a problem with the step 3 like you seem to expect.

''over'' is underlined. The number 184 appears below and it tells me that ''The instruction DO OVER cannot be operated with tables that have explicit inferior indexes''. 

I tried to figure out by myself. It's my first complex code so I didn't find an answer yet. If you have an idea to make it work, thank you again!

Vince28_Statcan
Quartz | Level 8

I'm sorry about it. Oddly enough the short documentation I have about the DO OVER statement requires an array defined through explicit variables range which is what _: is supposed to define. Hopefully it is not an issue with _: picking up variable classes like _numeric_ or _character_ (but I would doubt it). Nonetheless, you can simply replace the entire do; end; block by the equivalent block below where all array referencing has been explicitely done with variable i. You may want to add a drop i; statement at the end of your datastep to remove it from your final dataset.

do i=1 to dim(dist);

     if dist{i} NE . then dist{i}=sqrt(dist{i});

end;

windyboo
Calcite | Level 5

Thank you so much! It works!

Reeza
Super User

I think if you use Method=euclid rather than method=sqeuclid; you can remove that last step.

Euclidean Distance measurement:

Euclidean distance - Wikipedia, the free encyclopedia

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 4031 views
  • 13 likes
  • 4 in conversation