Solved: Re: OLS with no intercept by pairs of observations

rfrancis · Posted 05-16-2020 07:37 PM

Hello, I am working to obtain OLS parameter estimates (with no intercept) for each possible pair of observations within subgroups. Let's say the year 1991 represents a group, then let's say there are two subgroups within 1991; the first subgroup contains 3 observations, which produces 3 possible pairs or combination (1-2, 2-3 and 1-3); and the second subgroup contains 4 observations, which produces 6 possible pairs or combinations (1-2, 1-3, 1-4, 2-3, 2-4 and 3-4). The goal is to generate OLS parameter estimates for a model such as Y = B1 + B2 + e (i.e., two indep vars and no intercept). If this sounds like the Theil-Sen method, you are spot on. However, note that this is the "multivariate version", which requires more than computing the slope for y = mx + b. A fish much larger than I from the academic world has determined that OLS with no intercept and two observations will generate Theil-Sen slopes for B1 and B2 mentioned earlier (a modified r-square is necessary for OLS with no intercept, and will typically be between 99.5% and 100% when using the big fish methodology).. I have the "data step" SAS code from the big fish, but having witnessed the power and efficiency of IML, I set out to improve the lengthy processing time of the "data step" code. The IML code I developed below uses the OLS code from one of Rick Wicklin's blog, along with a loop developed by a SAS Community member. It works at the group level, but I lack the IML knowledge to make it work at the subgroup level. I believe the key is to insert a DO loop within the existing DO loop to capture the dynamics of the subgroups. This will require reconstructing a new X matrix for each possible pair within a subgroup in order to compute the two slope values. If interested or helpful, I can supply the code from the big fish. Thank you for any ideas or suggestions! Rick

data untrimmed;
     input group lag2cvrank lagcfo_ts cfo_ts lag2cfo_ts lag2acr_ts;
     cards;
     1991 0 155 175 165 35
     1991 0 200 225 250 75
     1991 0 75 125 135 65
     1991 1 350 375 400 55
     1991 1 155 175 165 85 
     1991 1 200 225 250 100
     1991 1 75 125 135 125
     ;
  run;

/* find unique BY-group combinations */
proc freq data=untrimmed;
tables group*lag2cvrank / out=FreqOut;
run;

proc iml;
start regress(XY);
        *c = allcomb(nrow(XY),2);       /* all "N choose 2" combinations of pairs */
		*c_rows=nrow(c);
		group = XY[1,4];        /* extract group from XY, to be used as a BY variable later */
        lag2cvrank = XY[1,5];   /* extract lag2cvrank from XY, to be used as a BY variable later */
		/* Extract x from XY */
		X = XY[c[i],{1 2}];       /* extract X from XY */ /* extract pairs, i.e, each combo in C */
		/* Extract y from XY */
		Y = XY[c[,],3];       /* extract Y from XY */
		xpx = x`*x;    /*cross-products*/
		xpy = x`*y;
		/*solve linear system*/
		/*Solution 1: compute inverse with INV (inefficient)*/
		xpxi = inv(xpx);    /*form inverse crossproducts*/
		b = xpxi*xpy;    /*solve for parameter estimates*/
		* Or a better solution ***;
		/*Solution 2: compute solution with SOLVE. More efficient*/
		b = (solve(xpx, xpy))`;    /* solve for parameter estimates*/
		t = nrow(XY);           /* number of rows in XY */
		group_col = group;
       	       	lag2cvrank_col = lag2cvrank;
		return (b || group_col || lag2cvrank_col);	
		end;	
  finish;

  /* read the BY groups */
  use FreqOut nobs NumGroups;
  read all var {group lag2cvrank};
  close FreqOut;

  use work.untrimmed;      
  create ts_fcst1_IML var {m m1 group_col lag2cvrank_col};
  setin work.untrimmed;    
  setout ts_fcst1_IML;

  inVarNames = {"lag2cfo_ts" "lag2acr_ts" "lagcfo_ts" "group" "lag2cvrank"};
  do i = 1 to NumGroups;                    /* for each BY group */
     read all var inVarNames into XY 
         where(group=(group[i]) & lag2cvrank=(lag2cvrank[i]));
		print xy;
	  /* X contains data for i_th group; analyze it */
	  c = allcomb(nrow(XY),2);       /* all "N choose 2" combinations of pairs */
		c_rows=nrow(c);	
	  do i = 1 to c_rows;  /* this is my feeble attempt at the new DO loop */
		G = regress(XY);
	 	/* extract the columns of the matrix */
     	        m=G[,1]; m1=G[,2]; group_col=G[,3]; lag2cvrank_col=G[,4];
                append;
	  end;
  end;

  close work.untrimmed;
  close ts_fcst1_IML;
quit;

data ts_fcst1_IML;
	set ts_fcst1_IML;
	proc print;run;

rfrancis · Posted 05-18-2020 09:55 AM

I developed the code below to solve this problem. The downside is that IML processing time is infinitely greater than than data-step processing time. Thank you SAS Community! Rick

data untrimmed;
     input group lag2cvrank lagcfo_ts cfo_ts lag2cfo_ts lag2acr_ts;
     cards;
     1991 0 155 175 165 35
     1991 0 200 225 250 75
	 1991 0 75 125 135 65
     1991 1 350 375 400 55
     1991 1 155 175 165 85 
     1991 1 200 225 250 100
     1991 1 75 125 135 125
	 1992 3 100 250 175 35 
	 1992 3 125 175 250 100
	 1992 3 200 225 300 175
     ;
  run;


/* find unique BY-group combinations */
proc freq data=untrimmed;
tables group*lag2cvrank / out=FreqOut;
run;

proc iml;

start regress(XY);
        *c = allcomb(nrow(XY),2);       /* all "N choose 2" combinations of pairs */
		*c_rows=nrow(c);
		c = allcomb(nrow(XY),2);       /* all "N choose 2" combinations of pairs */
	    c_rows=nrow(c);
		b_storage_first={0 0};
		do i = 1 to c_rows;
			XY_NEW=XY[c[i,],]; 
			group = XY[1,4];        /* extract group from XY, to be used as a BY variable later */
       		lag2cvrank = XY[1,5];   /* extract lag2cvrank from XY, to be used as a BY variable later */

			/* Extract x from XY */
			X = XY_NEW[,{1 2}];       /* extract X from XY */ /* extract pairs, i.e, each combo in C */
			
			/* Extract y from XY */
			Y = XY_NEW[,3];       /* extract Y from XY */
			xpx = x`*x;    /*cross-products*/
			xpy = x`*y;
			
			/*solve linear system*/
			/*Solution 1: compute inverse with INV (inefficient)*/
			xpxi = inv(xpx);    /*form inverse crossproducts*/
			b = xpxi*xpy;    /*solve for parameter estimates*/
			* Or a better solution ***;
			/*Solution 2: compute solution with SOLVE. More efficient*/
			b = (solve(xpx, xpy))`;    /* solve for parameter estimates*/
			t = nrow(XY_NEW);           /* number of rows in XY */
			group_col      = J(c_rows, 1, group);
			lag2cvrank_col = J(c_rows, 1, lag2cvrank);
			If i=1 then b_storage=b_storage_first+b; 
			else b_storage=b_storage//b;
			print b_storage;
			end;
			return (b_storage || group_col || lag2cvrank_col);	
finish;

  /* read the BY groups */
  use FreqOut nobs NumGroups;
  read all var {group lag2cvrank};
  close FreqOut;

  use work.untrimmed;      
  create ts_fcst1_IML var {group_col lag2cvrank_col med_b1 med_b2};*m m1;
  setin work.untrimmed;    
  setout ts_fcst1_IML;

  inVarNames = {"lag2cfo_ts" "lag2acr_ts" "lagcfo_ts" "group" "lag2cvrank"};
  do i = 1 to NumGroups;                    /* for each BY group */
     read all var inVarNames into XY 
        where(group=(group[i]) & lag2cvrank=(lag2cvrank[i]));
		/* X contains data for i_th group; analyze it */
	     G = regress(XY);
	 	/* extract the columns of the matrix */
		m=G[1,1]; m1=G[1,2]; group_col=G[1,3]; lag2cvrank_col=G[1,4]; /* Save matrix components into variables */
		/* Compute median slope values from all pairs for group/lag2cvrank combinations */
		med_b1=median(m); 
		med_b2=median(m1); 
		append;
	
  end;

  close work.untrimmed;
  close ts_fcst1_IML;
quit;

data ts_fcst1_IML;
	set ts_fcst1_IML;
	proc print;run;

View solution in original post

Rick_SAS · Posted 05-17-2020 08:55 AM

Since it sounds like this is an academic research project, I encourage you to think about the statistical computation you want to make for each combination from the subgroups. Suppose that cases 1 and 2 are being used for the first subgroup and cases 3 and 4 are being used for the second subgroup. How do you estimate B1 and B2 for those observations? Be sure you completely understand the answer and discuss it with your advisor if you have any questions.

1. Write a function that performs that computation to make sure you really understand it. Make sure it also works when cases 1&2 are used for the first subgroup and cases 1&2 are used for the second subgroup.

2. Now think about how to handle all combinations from the subgroups. Use the ALLCOMB function on each subgroup to generate all pairwise combinations from subgroup 1 and from subgroup 2.

3. Leverage the function you wrote in Step 1 to get all estimates over all pairwise combinations from the subgroups.

4. Try to encapsulate steps 1-3 into a single function. The input to the function is a group. The output is the set of Sen-Theil estimates for the subgroups.

5. After you have successfully completed steps 1-4, you can leverage your existing program to read each group, call the function in Step 4, and write the estimates to a SAS data set.

You are creating a big program, so you must break it into smaller pieces and then implement each small piece as efficiently as possible. In computer science lingo, this is sometimes called top-down design and bottom-up construction. Make sure you understand the underlying math/stats at each step before you begin to write any code.

Good luck!

rfrancis · Posted 05-18-2020 09:55 AM

I developed the code below to solve this problem. The downside is that IML processing time is infinitely greater than than data-step processing time. Thank you SAS Community! Rick

data untrimmed;
     input group lag2cvrank lagcfo_ts cfo_ts lag2cfo_ts lag2acr_ts;
     cards;
     1991 0 155 175 165 35
     1991 0 200 225 250 75
	 1991 0 75 125 135 65
     1991 1 350 375 400 55
     1991 1 155 175 165 85 
     1991 1 200 225 250 100
     1991 1 75 125 135 125
	 1992 3 100 250 175 35 
	 1992 3 125 175 250 100
	 1992 3 200 225 300 175
     ;
  run;


/* find unique BY-group combinations */
proc freq data=untrimmed;
tables group*lag2cvrank / out=FreqOut;
run;

proc iml;

start regress(XY);
        *c = allcomb(nrow(XY),2);       /* all "N choose 2" combinations of pairs */
		*c_rows=nrow(c);
		c = allcomb(nrow(XY),2);       /* all "N choose 2" combinations of pairs */
	    c_rows=nrow(c);
		b_storage_first={0 0};
		do i = 1 to c_rows;
			XY_NEW=XY[c[i,],]; 
			group = XY[1,4];        /* extract group from XY, to be used as a BY variable later */
       		lag2cvrank = XY[1,5];   /* extract lag2cvrank from XY, to be used as a BY variable later */

			/* Extract x from XY */
			X = XY_NEW[,{1 2}];       /* extract X from XY */ /* extract pairs, i.e, each combo in C */
			
			/* Extract y from XY */
			Y = XY_NEW[,3];       /* extract Y from XY */
			xpx = x`*x;    /*cross-products*/
			xpy = x`*y;
			
			/*solve linear system*/
			/*Solution 1: compute inverse with INV (inefficient)*/
			xpxi = inv(xpx);    /*form inverse crossproducts*/
			b = xpxi*xpy;    /*solve for parameter estimates*/
			* Or a better solution ***;
			/*Solution 2: compute solution with SOLVE. More efficient*/
			b = (solve(xpx, xpy))`;    /* solve for parameter estimates*/
			t = nrow(XY_NEW);           /* number of rows in XY */
			group_col      = J(c_rows, 1, group);
			lag2cvrank_col = J(c_rows, 1, lag2cvrank);
			If i=1 then b_storage=b_storage_first+b; 
			else b_storage=b_storage//b;
			print b_storage;
			end;
			return (b_storage || group_col || lag2cvrank_col);	
finish;

  /* read the BY groups */
  use FreqOut nobs NumGroups;
  read all var {group lag2cvrank};
  close FreqOut;

  use work.untrimmed;      
  create ts_fcst1_IML var {group_col lag2cvrank_col med_b1 med_b2};*m m1;
  setin work.untrimmed;    
  setout ts_fcst1_IML;

  inVarNames = {"lag2cfo_ts" "lag2acr_ts" "lagcfo_ts" "group" "lag2cvrank"};
  do i = 1 to NumGroups;                    /* for each BY group */
     read all var inVarNames into XY 
        where(group=(group[i]) & lag2cvrank=(lag2cvrank[i]));
		/* X contains data for i_th group; analyze it */
	     G = regress(XY);
	 	/* extract the columns of the matrix */
		m=G[1,1]; m1=G[1,2]; group_col=G[1,3]; lag2cvrank_col=G[1,4]; /* Save matrix components into variables */
		/* Compute median slope values from all pairs for group/lag2cvrank combinations */
		med_b1=median(m); 
		med_b2=median(m1); 
		append;
	
  end;

  close work.untrimmed;
  close ts_fcst1_IML;
quit;

data ts_fcst1_IML;
	set ts_fcst1_IML;
	proc print;run;

IanWakeling · Posted 05-19-2020 08:56 AM

You say the IML solution is taking a lot longer, but there are probably ways in which it could be made more efficient. It is generally a bad idea to 'grow' a matrix inside a loop with syntax like:

b_storage=b_storage//b;

as this makes a new matrix for each iteration, and moves data around in memory unnecessarily. It is better to create b_storage at it final size before the loop, and then write one row at a time.

I am guessing the data step solution you mention uses a short cut to compute regression coefficients directly, since there are only two points and no error. You could do the same in IML to speed things up. For an even faster solution, it could be vectorized to eliminate the loop over the combinations.

Rick_SAS · Posted 05-19-2020 09:04 AM

To learn more about Ian's comment, see the article "Friends don't let friends concatenate results inside a loop."

I'll also mention that you are solving the regression problem twice in every loop. You can delete the unnecessary statements for Solution 1.

rfrancis · Posted 05-19-2020 09:36 AM

Thanks Rick! Your link provides the how that I need. Solving the regression problem twice is definitely increasing the processing time. Thank again! Rick

rfrancis · Posted 05-19-2020 09:33 AM

Thank you Ian. No question that my code is inefficient, just simply have little knowledge for IML. Creating the matrix at it's final size would be great if I only knew how to make this happen. The data step code that I mentioned consists of PROC REG inside of a macro loop, nothing special. With help from a system administrator to increase MEMSIZE to 12 GB, the data step code executes fairly quickly. And I suspect that the IML code with a few tweaks would be comparable. I apologize if it seemed I was criticizing IML for the slow processing time. Thanks again for your time! Rick

IanWakeling · Posted 05-19-2020 10:49 AM

My suggestion is to declare the storage immediately before the loop like so:

b_storage = j(c_rows, 2, .);

That is a matrix with as many rows as combinations, 2 columns and filled with missing values. Once you have the matrix b in the loop, then copy it to the ith row of the storage matrix as follows:

b_storage[i, ] = b;

The if/then/else statement can then be deleted, as can the definition of matrix b_storage_first.

Incidentally the statements that define group_col and lag2cvrank_col can be moved outside of the loop, they are being overwritten on every iteration of the loop - this will save a bit more time.

Finally, have you thought what happens if you have two data rows with identical pairs of x-values?

rfrancis · Posted 05-19-2020 11:44 AM

Thanks Ian! I see how you are updating the matrix in the loop, makes a lotta sense. The nature of the data (10-K filings) makes duplicate values pretty remote. I have posed this very question to the big dog that developed the data step code, and he is unconcerned to say the least. But your point is well-taken nonetheless. Thank you again for your time and willingness to help. Rick

Register Today!