BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Sinistrum
Quartz | Level 8

Dear community,

 

I would please like to ask for your help concerning the following issue.

 

In fact, it is a follow up from this question of mine, which has been resolved greatly;

Given I want to output the data underlying the cdfplot from proc quantreg, QuantPred-columns appear to get truncated at 150 observations (i.e., QuantPred1 - QuantPred150 are contained, though no further QuantPredn, such that n > 150).

 

It would be great, if you could please provide help with respect to increase this maximum. Is this maybe some kind of option?

 

Sample-Code is embedded in the spoiler;

Spoiler
data	pro_fit;
	seed 	=	1;
	call	streaminit(seed);
	do		i	=	1	to		1000;
		x1	= 	rand ('normal',	10,	2);
		x2	=	rand ('normal',	5,	1);
		y1	=	1	+	2 * x1 + 5 * x2 + rand ('normal',	0,	0.5);
	output;
	end;
	drop seed;
run;
data	pro_predict;
	seed 	=	2;
	call	streaminit(seed);
	do		i	=	1	to		200; 			/*		I.e., the data set has 200 observations		*/
		x1	= 	rand ('normal',	10,	2);
		x2	=	rand ('normal',	5,	1);
		y1	=	51;
	output;
	end;
	drop seed;
run;
proc  quantreg
      data  		=     pro_fit
      ci        	=           none;
      model
            y1    = x1 x2
			/
			quantlev
				=
					fqpr
						(
							n	=	100
						)
						;
      conddist
			hr
			testdata
				(
					so	
					hr
					hf
				)
					=		pro_predict 
            plot  	=         
							(
								cdfplot	
							)
							;
			ods	output
				cdfplot		=			cdftest_pred		/*		Here lies the issue, for it contains only 150 columns of the type I require		*/
			;
run;

 

Edit:

The "truncation" may I call it scales with the fqpr (n=n_i) option. E.g., increasing n_i = 100 from the above example, hidden in the spoiler, to n_i = 200 decreases the reported values to 72.

 

Yours sincerely.

Mario

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

I talked this over with my friend and he agrees that you can work around the limitation of the CDFPLOT technique, but you'll need some way to predict the CDF for the test data by using the estimates in the ProcessEst table. This requires matrix multiplication so I will use PROC IML, although you can also implement it in the DATA step.

 

Your example uses only main effects for continuous indep variables, but I will show how to also handle the case where you have CLASS and/or use the EFFECT statement.   

The basic idea is to use the OUTDESIGN= option on the PROC QUANTSELECT to generate the design matrix for the model. You can then predict CDFs for the testing obs as beta*X`. As I said, you don't really need the OUTDESIGN= trick if you have only continuous indep vars.  The following program has lots of comments, so hopefully it is clear.

 

/* Generate training obs. */
data  pro_fit;
   seed  =  1;
   call  streaminit(seed);
   role = 'fit ';               /* Add a role variable to indicate training obs. */
   do    i  =  1  to    1000;
      x1 =  rand ('normal',   10,   2);
      x2 =  rand ('normal',   5, 1);
      y1 =  1  +  2 * x1 + 5 * x2 + rand ('normal',   0, 0.5);
   output;
   end;
   drop seed;
run;

/* Generate testing obs. */
data  pro_predict;
   seed  =  2;
   call  streaminit(seed);
   role = 'test';              /* Add a role variable to indicate testing obs. */
   do    i  =  1  to    200;
      x1 =  rand ('normal',   10,   2);
      x2 =  rand ('normal',   5, 1);
   output;
   end;
   drop seed;
run;

/* Merge training and testing data.
   The merging process assigns missing response values to testing obs because pro_predict does not have y1. */
data pro_merge;
    set pro_fit pro_predict;
    drop i;
run; 

/* Use PROC QUANTSELECT to fit quantile process model and output design matrix for the merged data set.
   Because y1 has missing values for all testing obs. The fitted model only relies on the obs in pro_fit. */
ods exclude all;
proc  quantselect
   data  =  pro_merge
        /* Use OUTDESIGN= data set option to output design matrix. */
        outdesign(ADDINPUTVARS) = pro_outDesign;
   model
      y1    = x1 x2/selection=none quantlev=fqpr(n=40);
   /* Use ODS OUTPUT statement to output quantile-process model. */  
        ods output ProcessEst=process_Est; /* this contains the betas */
run;
ods exclude none;

/* Use PROC IML to compute CDFs for testing obs */
proc iml;
   /* Read quantile-process parameter estimates into the beta matrix
      which contains an extra quantile-level column as the first column.*/
   use process_Est;
   read all var _all_ into beta[colname=betaVarNames];
   close;

   /* Show the column names for beta. */
   print betaVarNames;

   /* Get the number of columns in beta. */
   nPar = ncol(beta); 

   /* Save the quantile-level column into qntLev vector. */
   qntLev       = beta[,1]; 

   /* Drop the quantile-level column from beta. */
   beta         = beta[,2:nPar]; 
   betaVarNames = betaVarNames[2:nPar];
   nPar         = nPar-1;

   /* Read ONLY test data. Compute CDF as beta*X` */
   use pro_outDesign where(role='test');
   read all var betaVarNames into x;
   close;
   
   /* Predict CDFs for the testing data. */
   cdfs = qntLev||beta*t(x);
   
   /* Make column names for the CDFs matrix whose first column contains the quantile levels. */
   predNames = 'quantLevel'||('quantPred1':('quantPred'+ (char(nrow(x)))));
   
   /* Output CDFs into a SAS data set. */
   create pro_predicted from cdfs[colname=predNames];
   append from cdfs;
   close;
quit;

 

 

View solution in original post

4 REPLIES 4
Rick_SAS
SAS Super FREQ

I talked this over with my friend and he agrees that you can work around the limitation of the CDFPLOT technique, but you'll need some way to predict the CDF for the test data by using the estimates in the ProcessEst table. This requires matrix multiplication so I will use PROC IML, although you can also implement it in the DATA step.

 

Your example uses only main effects for continuous indep variables, but I will show how to also handle the case where you have CLASS and/or use the EFFECT statement.   

The basic idea is to use the OUTDESIGN= option on the PROC QUANTSELECT to generate the design matrix for the model. You can then predict CDFs for the testing obs as beta*X`. As I said, you don't really need the OUTDESIGN= trick if you have only continuous indep vars.  The following program has lots of comments, so hopefully it is clear.

 

/* Generate training obs. */
data  pro_fit;
   seed  =  1;
   call  streaminit(seed);
   role = 'fit ';               /* Add a role variable to indicate training obs. */
   do    i  =  1  to    1000;
      x1 =  rand ('normal',   10,   2);
      x2 =  rand ('normal',   5, 1);
      y1 =  1  +  2 * x1 + 5 * x2 + rand ('normal',   0, 0.5);
   output;
   end;
   drop seed;
run;

/* Generate testing obs. */
data  pro_predict;
   seed  =  2;
   call  streaminit(seed);
   role = 'test';              /* Add a role variable to indicate testing obs. */
   do    i  =  1  to    200;
      x1 =  rand ('normal',   10,   2);
      x2 =  rand ('normal',   5, 1);
   output;
   end;
   drop seed;
run;

/* Merge training and testing data.
   The merging process assigns missing response values to testing obs because pro_predict does not have y1. */
data pro_merge;
    set pro_fit pro_predict;
    drop i;
run; 

/* Use PROC QUANTSELECT to fit quantile process model and output design matrix for the merged data set.
   Because y1 has missing values for all testing obs. The fitted model only relies on the obs in pro_fit. */
ods exclude all;
proc  quantselect
   data  =  pro_merge
        /* Use OUTDESIGN= data set option to output design matrix. */
        outdesign(ADDINPUTVARS) = pro_outDesign;
   model
      y1    = x1 x2/selection=none quantlev=fqpr(n=40);
   /* Use ODS OUTPUT statement to output quantile-process model. */  
        ods output ProcessEst=process_Est; /* this contains the betas */
run;
ods exclude none;

/* Use PROC IML to compute CDFs for testing obs */
proc iml;
   /* Read quantile-process parameter estimates into the beta matrix
      which contains an extra quantile-level column as the first column.*/
   use process_Est;
   read all var _all_ into beta[colname=betaVarNames];
   close;

   /* Show the column names for beta. */
   print betaVarNames;

   /* Get the number of columns in beta. */
   nPar = ncol(beta); 

   /* Save the quantile-level column into qntLev vector. */
   qntLev       = beta[,1]; 

   /* Drop the quantile-level column from beta. */
   beta         = beta[,2:nPar]; 
   betaVarNames = betaVarNames[2:nPar];
   nPar         = nPar-1;

   /* Read ONLY test data. Compute CDF as beta*X` */
   use pro_outDesign where(role='test');
   read all var betaVarNames into x;
   close;
   
   /* Predict CDFs for the testing data. */
   cdfs = qntLev||beta*t(x);
   
   /* Make column names for the CDFs matrix whose first column contains the quantile levels. */
   predNames = 'quantLevel'||('quantPred1':('quantPred'+ (char(nrow(x)))));
   
   /* Output CDFs into a SAS data set. */
   create pro_predicted from cdfs[colname=predNames];
   append from cdfs;
   close;
quit;

 

 

Sinistrum
Quartz | Level 8

Well, what should I write?

 

I am so amazed... So grateful... Thank you so much.

Thank you very much indeed. This is breathtaking.

 

It shows so much effort,

teaches to much new concept, and

completely resolves my issues.

 

I hope - as I need them, too - I am now able to obtain density estimates, too.

The cdf is obtained by cdfs = qntLev||beta*t(x);

though, obtaining pdfs will involve some consecutive steps.

Rick_SAS
SAS Super FREQ

You can get the PDFs from the CDFs as a first difference approximation to the derivative. For example, in the DATA step you can write:
PDF = dif(CDF) / dif(X);     /* ( CDF(X_{n+1}) - CDF(X_{n}) ) / (X_{n+1}) - X_{n}) */

 

Data NormalCDF;
do x = -3 to 3 by 0.1;
   CDF = cdf("Normal", x);
   output;
end;
run;
/* get PDF as finite difference approximation of CDF */
data PDF;
set NormalCDF;
PDF_FD = dif(CDF) / dif(x);     /* finite difference approx */
PDF_Exact = pdf("Normal", x);
run;

proc sgplot data=PDF;
   series x=x y=PDF_FD;
   series x=x y=PDF_Exact;
run;

For more on finite-difference derivatives, see "The DIF function: Compute lagged differences and finite differences"

Sinistrum
Quartz | Level 8

Oh my, thank you very much once again.

 

This certainly is an option;

However, I "need" to replicate the "ordinary" output 1:1, which obtains an estimate for the pdf using kernel density estimation based on the cdf in the quantile-level grid. In particular, my application requires an triangular Kernel function.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 621 views
  • 3 likes
  • 2 in conversation