DATA Step, Macro, Functions and more

loop on multiple tables

Accepted Solution Solved
Reply
Contributor
Posts: 62
Accepted Solution

loop on multiple tables

Hello,

I have a problem that might be solved be a loop (I guess).

I must calculate the square root of the sommation of the differences (square) between the weights invested between two banks. The application of the equation 1 at the page 5: http://www.ccfr.org.cn/cicf2010/papers/20091210013047.pdf

I have 20 banks. I have 20 tables with 11 columns. One to identify the years and ten others to identify the weights invested in each of the ten industries (the sommation on one row of the ten weights is equal to 1, of course).

The structure of 20 tables is the same. I must take each pair of banks to apply the equation. My result should be a number for each year for two banks compared.

Can you help me with the code or its structure?

Thanks a lot.


Accepted Solutions
Solution
‎10-30-2013 01:17 PM
Super Contributor
Posts: 339

Re: loop on multiple tables

In this case, a slight tweak to the example code i've listed in post#7 will get you the desired output.

Sadly, proc distance requires a character variable for ID and doesn't compute square root so it will have to be done in 3 steps (would be 4 if your single dataset wasn't sorted by year/id)

step1 create a character id

data have;

     set have;

     bank_id="_"||put(ultimateparentid, best8.);

run;

step2 calculate distance

proc distance data=have out=want method=sqeuclid;

     var interval (w_Smiley Happy; /* use w_: to represent the group of all variables starting with w_*/

     id bank_id;

     by year;

run;

step3 calculate square root

data want;

     set want;

     array dist {*} _: ; /* as proc distance creates a square matrix with variables named after bank_id and we used _ to preceed all numeric bank ID, we can use _: to index all such created variables through an array */

     do over dist;

          if dist NE . then dist=sqrt(dist); /* I believe this is the appropriate syntax for do over but its been a while, at worse, you can use do i=1 to dim(dist) and index dist{i} inside the loop */

     end;

run;

You will notice a variable containing all _id and 20 variables _id with a lower triangular matrix of calculated distances. it's up to you to convert that in a different format if you want.

View solution in original post


All Replies
Super User
Posts: 19,789

Re: loop on multiple tables

How many observations per table?

Super User
Posts: 19,789

Re: loop on multiple tables

You should also post a sample of what your data looks like and the sample output...I can read the paper but don't actually want to Smiley Happy

Contributor
Posts: 62

Re: loop on multiple tables

I will.

Contributor
Posts: 62

Re: loop on multiple tables


18 rows, 11 columns - the column of the years is just to identify.

Super Contributor
Posts: 339

Re: loop on multiple tables

At first I thought "IML JOB" but since you don't want it output in a matrix format and just want the lower(or upper) triangle of calculations, it is probably not worth learning proc iml. Esp. since a big chunk of the work is data manipulation and not quite the element-wise squared difference of vectors. Anyway, here's how I would tackle the problem:

STEP 1. Preprocess your 20  per bank files to obtain 18 per year files and change the year variable to a bank code or name variable. This can be done reasonably easily using the IN= dataset option in a single data step like

data y1995 y1996 y1997 /* ... */ y2013;

     set bank1(IN=b1) bank2(in=b1) /* ... */ bank20(in=b20);

     array bcode {*} b1-b20;

     case (yearvariable);

          when (1995) then output y1995;

          /*...*/

          when (2013 then output y2013;

          /* you can use an otherwise statement for outstanding year data for validation */

     end;

run;

STEP 2 - Calculate each statistic and output that to a new dataset with bank-indicator variables

/* use macro to loop over over the 18 new datasets, providing the logic example below */

data want1995(keep=b1-b20 outputstat);

     array outer {30} _temporary_;

     array dv {*} b1-b20 w1-w10 /* there should be exactly 30 variables, the 20 bank indicators and the 10 weights adjust weight variable names accordingly */

     do i=1 to outerobs;

          set y1995(nobs=outerobs drop=yearvariable);

          do k=1 to 30;

               outer{k}=dv{k}; /*store outer record in a temporary array to have it available while looping over all subsequent rows in the file */

          end;

            

          do j=(i+1) to innerobs;

               set y1995(nobs=innerobs);

               do k=1 to 20;

                    dv{k}=max(dv{k}, outer{k}); /*replace one column of 0 with a 1 to indicate which 2 banks contribute to the output calculation */

               end;

               temp=0; /*initialize Sum of Squares*/

               do k=21 to 30;

                    temp=temp+(dv{k}-outer{k})**2;

               end;

               outputstat=sqrt(temp);

               output;

          end;

     end;

run;

This is all untested but it provides a logical path on how to handle the looping and data manipulation. The resulting wantYYYY dataset will have n(n-1)/2 rows for each possible pairs with 20 "bank indicators" variables, 18 of which are set to 0, 2 of which are set to 1 and a single output statistic.

Hope this helps  

Vince

Super User
Posts: 19,789

Re: loop on multiple tables

Posted in reply to Vince28_Statcan

Wouldn't there be a way to set all the datasets together and then use by groups and proc distance with squared euclidean distance?

This would avoid loops and makes it much easier to understand.

Need data though Smiley Happy

Super Contributor
Posts: 339

Re: loop on multiple tables

I didn't know about that procedure but it appears you can. I haven't figured out a way to have the square root calculated directly in the procedure but it can calculate the SS easily for the lower triangle by group.

Something such as:

data bankset1;

     input year w1 w2 w3 w4 w5;

     datalines;

2000 0.1 0.2 0.2 0.4 0.1

2001 0.2 0.1 0.2 0.4 0.1

2002 0.5 0.1 0.1 0.1 0.2

;

run;

data bankset2;

     input year w1 w2 w3 w4 w5;

     datalines;

2000 0.2 0.1 0.1 0.5 0.1

2001 0.2 0.2 0.2 0.2 0.2

2002 0.3 0.3 0.1 0.1 0.2

;

run;

data bankset3;

     input year w1 w2 w3 w4 w5;

     datalines;

2000 0.2 0.2 0.2 0.3 0.1

2001 0.7 0.1 0.1 0.0 0.1

2002 0.5 0.1 0.3 0.0 0.1

;

run;

%macro m();

data have;

     %do i=1 %to 3;

     set bankset&i.;

     bank_id="bank&i";

     output;

     %end;

%mend;

%m();

proc sort data=have;

     by year bank_id;

run;

proc distance data=have out=want method=sqeuclid;

     var interval (w1-w5);

     id bank_id;

     by year;

run;

just gotta calculate square root!

Thanks Reeza for the lesson

Super User
Posts: 19,789

Re: loop on multiple tables

Posted in reply to Vince28_Statcan

You don't need the macro either Vince, assuming SAS 9.2+ Smiley Happy

data have;

set bankset1-bankset3 indsname=source;

bank_id=source;

run;

Contributor
Posts: 62

Re: loop on multiple tables

Thanks Vince and Reeza for your answers.

Sorry for the delay. I rebuild a new table with all data.

I take a printscreen to show you what it looks like.

My table has 360 rows (20 banks x 18 years) and 12 columns.

printscreen_20banks.png

I create a column of ids (UltimateParentID). So, my code must do this:

(Legend: Variable, UltimateParentID, year)

square root [((W_AG,7052,1995)-(W_AG,7135,1995))^2+ ((W_MIN,7052,1995)-(W_MIN,7135,1995))^2+((W_CON,7052,1995)-(W_CON,7135,1995))^2+((W_TRAN,7052,1995)-(W_TRAN,7135,1995))^2+

((W_WHOLE,7052,1995)-(W_WHOLE,7135,1995))^2+((W_RE,7052,1995)-(W_RE,7135,1995))^2+((W_SER,7052,1995)-(W_SER,7135,1995))^2+((W_FIN,7052,1995)-(W_FIN,7135,1995))^2+

((W_PUBLIC,7052,1995)-(W_PUBLIC,7135,1995))^2+((W_Autre,7052,1995)-(W_Autre,7135,1995))^2)]

This calculation must be repeated between all group of two banks for each year.

n(n-1)/2=190 results per year

I made clarifications. I think my problem was incorrectly formulated.

Solution
‎10-30-2013 01:17 PM
Super Contributor
Posts: 339

Re: loop on multiple tables

In this case, a slight tweak to the example code i've listed in post#7 will get you the desired output.

Sadly, proc distance requires a character variable for ID and doesn't compute square root so it will have to be done in 3 steps (would be 4 if your single dataset wasn't sorted by year/id)

step1 create a character id

data have;

     set have;

     bank_id="_"||put(ultimateparentid, best8.);

run;

step2 calculate distance

proc distance data=have out=want method=sqeuclid;

     var interval (w_Smiley Happy; /* use w_: to represent the group of all variables starting with w_*/

     id bank_id;

     by year;

run;

step3 calculate square root

data want;

     set want;

     array dist {*} _: ; /* as proc distance creates a square matrix with variables named after bank_id and we used _ to preceed all numeric bank ID, we can use _: to index all such created variables through an array */

     do over dist;

          if dist NE . then dist=sqrt(dist); /* I believe this is the appropriate syntax for do over but its been a while, at worse, you can use do i=1 to dim(dist) and index dist{i} inside the loop */

     end;

run;

You will notice a variable containing all _id and 20 variables _id with a lower triangular matrix of calculated distances. it's up to you to convert that in a different format if you want.

Contributor
Posts: 62

Re: loop on multiple tables

Posted in reply to Vince28_Statcan

Thanks a lot Vince.

I tried your codes. I have a problem with the step 3 like you seem to expect.

''over'' is underlined. The number 184 appears below and it tells me that ''The instruction DO OVER cannot be operated with tables that have explicit inferior indexes''. 

I tried to figure out by myself. It's my first complex code so I didn't find an answer yet. If you have an idea to make it work, thank you again!

Super Contributor
Posts: 339

Re: loop on multiple tables

I'm sorry about it. Oddly enough the short documentation I have about the DO OVER statement requires an array defined through explicit variables range which is what _: is supposed to define. Hopefully it is not an issue with _: picking up variable classes like _numeric_ or _character_ (but I would doubt it). Nonetheless, you can simply replace the entire do; end; block by the equivalent block below where all array referencing has been explicitely done with variable i. You may want to add a drop i; statement at the end of your datastep to remove it from your final dataset.

do i=1 to dim(dist);

     if dist{i} NE . then dist{i}=sqrt(dist{i});

end;

Contributor
Posts: 62

Re: loop on multiple tables

Posted in reply to Vince28_Statcan

Thank you so much! It works!

Super User
Posts: 19,789

Re: loop on multiple tables

I think if you use Method=euclid rather than method=sqeuclid; you can remove that last step.

Euclidean Distance measurement:

Euclidean distance - Wikipedia, the free encyclopedia

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 809 views
  • 13 likes
  • 4 in conversation