Dear community,
I would please like to ask for your help concerning the following issue.
In fact, it is a follow up from this question of mine, which has been resolved greatly;
Given I want to output the data underlying the cdfplot from proc quantreg, QuantPred-columns appear to get truncated at 150 observations (i.e., QuantPred1 - QuantPred150 are contained, though no further QuantPredn, such that n > 150).
It would be great, if you could please provide help with respect to increase this maximum. Is this maybe some kind of option?
Sample-Code is embedded in the spoiler;
data pro_fit;
seed = 1;
call streaminit(seed);
do i = 1 to 1000;
x1 = rand ('normal', 10, 2);
x2 = rand ('normal', 5, 1);
y1 = 1 + 2 * x1 + 5 * x2 + rand ('normal', 0, 0.5);
output;
end;
drop seed;
run;
data pro_predict;
seed = 2;
call streaminit(seed);
do i = 1 to 200; /* I.e., the data set has 200 observations */
x1 = rand ('normal', 10, 2);
x2 = rand ('normal', 5, 1);
y1 = 51;
output;
end;
drop seed;
run;
proc quantreg
data = pro_fit
ci = none;
model
y1 = x1 x2
/
quantlev
=
fqpr
(
n = 100
)
;
conddist
hr
testdata
(
so
hr
hf
)
= pro_predict
plot =
(
cdfplot
)
;
ods output
cdfplot = cdftest_pred /* Here lies the issue, for it contains only 150 columns of the type I require */
;
run;
Edit:
The "truncation" may I call it scales with the fqpr (n=n_i) option. E.g., increasing n_i = 100 from the above example, hidden in the spoiler, to n_i = 200 decreases the reported values to 72.
Yours sincerely.
Mario
I talked this over with my friend and he agrees that you can work around the limitation of the CDFPLOT technique, but you'll need some way to predict the CDF for the test data by using the estimates in the ProcessEst table. This requires matrix multiplication so I will use PROC IML, although you can also implement it in the DATA step.
Your example uses only main effects for continuous indep variables, but I will show how to also handle the case where you have CLASS and/or use the EFFECT statement.
The basic idea is to use the OUTDESIGN= option on the PROC QUANTSELECT to generate the design matrix for the model. You can then predict CDFs for the testing obs as beta*X`. As I said, you don't really need the OUTDESIGN= trick if you have only continuous indep vars. The following program has lots of comments, so hopefully it is clear.
/* Generate training obs. */
data pro_fit;
seed = 1;
call streaminit(seed);
role = 'fit '; /* Add a role variable to indicate training obs. */
do i = 1 to 1000;
x1 = rand ('normal', 10, 2);
x2 = rand ('normal', 5, 1);
y1 = 1 + 2 * x1 + 5 * x2 + rand ('normal', 0, 0.5);
output;
end;
drop seed;
run;
/* Generate testing obs. */
data pro_predict;
seed = 2;
call streaminit(seed);
role = 'test'; /* Add a role variable to indicate testing obs. */
do i = 1 to 200;
x1 = rand ('normal', 10, 2);
x2 = rand ('normal', 5, 1);
output;
end;
drop seed;
run;
/* Merge training and testing data.
The merging process assigns missing response values to testing obs because pro_predict does not have y1. */
data pro_merge;
set pro_fit pro_predict;
drop i;
run;
/* Use PROC QUANTSELECT to fit quantile process model and output design matrix for the merged data set.
Because y1 has missing values for all testing obs. The fitted model only relies on the obs in pro_fit. */
ods exclude all;
proc quantselect
data = pro_merge
/* Use OUTDESIGN= data set option to output design matrix. */
outdesign(ADDINPUTVARS) = pro_outDesign;
model
y1 = x1 x2/selection=none quantlev=fqpr(n=40);
/* Use ODS OUTPUT statement to output quantile-process model. */
ods output ProcessEst=process_Est; /* this contains the betas */
run;
ods exclude none;
/* Use PROC IML to compute CDFs for testing obs */
proc iml;
/* Read quantile-process parameter estimates into the beta matrix
which contains an extra quantile-level column as the first column.*/
use process_Est;
read all var _all_ into beta[colname=betaVarNames];
close;
/* Show the column names for beta. */
print betaVarNames;
/* Get the number of columns in beta. */
nPar = ncol(beta);
/* Save the quantile-level column into qntLev vector. */
qntLev = beta[,1];
/* Drop the quantile-level column from beta. */
beta = beta[,2:nPar];
betaVarNames = betaVarNames[2:nPar];
nPar = nPar-1;
/* Read ONLY test data. Compute CDF as beta*X` */
use pro_outDesign where(role='test');
read all var betaVarNames into x;
close;
/* Predict CDFs for the testing data. */
cdfs = qntLev||beta*t(x);
/* Make column names for the CDFs matrix whose first column contains the quantile levels. */
predNames = 'quantLevel'||('quantPred1':('quantPred'+ (char(nrow(x)))));
/* Output CDFs into a SAS data set. */
create pro_predicted from cdfs[colname=predNames];
append from cdfs;
close;
quit;
I talked this over with my friend and he agrees that you can work around the limitation of the CDFPLOT technique, but you'll need some way to predict the CDF for the test data by using the estimates in the ProcessEst table. This requires matrix multiplication so I will use PROC IML, although you can also implement it in the DATA step.
Your example uses only main effects for continuous indep variables, but I will show how to also handle the case where you have CLASS and/or use the EFFECT statement.
The basic idea is to use the OUTDESIGN= option on the PROC QUANTSELECT to generate the design matrix for the model. You can then predict CDFs for the testing obs as beta*X`. As I said, you don't really need the OUTDESIGN= trick if you have only continuous indep vars. The following program has lots of comments, so hopefully it is clear.
/* Generate training obs. */
data pro_fit;
seed = 1;
call streaminit(seed);
role = 'fit '; /* Add a role variable to indicate training obs. */
do i = 1 to 1000;
x1 = rand ('normal', 10, 2);
x2 = rand ('normal', 5, 1);
y1 = 1 + 2 * x1 + 5 * x2 + rand ('normal', 0, 0.5);
output;
end;
drop seed;
run;
/* Generate testing obs. */
data pro_predict;
seed = 2;
call streaminit(seed);
role = 'test'; /* Add a role variable to indicate testing obs. */
do i = 1 to 200;
x1 = rand ('normal', 10, 2);
x2 = rand ('normal', 5, 1);
output;
end;
drop seed;
run;
/* Merge training and testing data.
The merging process assigns missing response values to testing obs because pro_predict does not have y1. */
data pro_merge;
set pro_fit pro_predict;
drop i;
run;
/* Use PROC QUANTSELECT to fit quantile process model and output design matrix for the merged data set.
Because y1 has missing values for all testing obs. The fitted model only relies on the obs in pro_fit. */
ods exclude all;
proc quantselect
data = pro_merge
/* Use OUTDESIGN= data set option to output design matrix. */
outdesign(ADDINPUTVARS) = pro_outDesign;
model
y1 = x1 x2/selection=none quantlev=fqpr(n=40);
/* Use ODS OUTPUT statement to output quantile-process model. */
ods output ProcessEst=process_Est; /* this contains the betas */
run;
ods exclude none;
/* Use PROC IML to compute CDFs for testing obs */
proc iml;
/* Read quantile-process parameter estimates into the beta matrix
which contains an extra quantile-level column as the first column.*/
use process_Est;
read all var _all_ into beta[colname=betaVarNames];
close;
/* Show the column names for beta. */
print betaVarNames;
/* Get the number of columns in beta. */
nPar = ncol(beta);
/* Save the quantile-level column into qntLev vector. */
qntLev = beta[,1];
/* Drop the quantile-level column from beta. */
beta = beta[,2:nPar];
betaVarNames = betaVarNames[2:nPar];
nPar = nPar-1;
/* Read ONLY test data. Compute CDF as beta*X` */
use pro_outDesign where(role='test');
read all var betaVarNames into x;
close;
/* Predict CDFs for the testing data. */
cdfs = qntLev||beta*t(x);
/* Make column names for the CDFs matrix whose first column contains the quantile levels. */
predNames = 'quantLevel'||('quantPred1':('quantPred'+ (char(nrow(x)))));
/* Output CDFs into a SAS data set. */
create pro_predicted from cdfs[colname=predNames];
append from cdfs;
close;
quit;
Well, what should I write?
I am so amazed... So grateful... Thank you so much.
Thank you very much indeed. This is breathtaking.
It shows so much effort,
teaches to much new concept, and
completely resolves my issues.
I hope - as I need them, too - I am now able to obtain density estimates, too.
The cdf is obtained by cdfs = qntLev||beta*t(x);
though, obtaining pdfs will involve some consecutive steps.
You can get the PDFs from the CDFs as a first difference approximation to the derivative. For example, in the DATA step you can write:
PDF = dif(CDF) / dif(X); /* ( CDF(X_{n+1}) - CDF(X_{n}) ) / (X_{n+1}) - X_{n}) */
Data NormalCDF;
do x = -3 to 3 by 0.1;
CDF = cdf("Normal", x);
output;
end;
run;
/* get PDF as finite difference approximation of CDF */
data PDF;
set NormalCDF;
PDF_FD = dif(CDF) / dif(x); /* finite difference approx */
PDF_Exact = pdf("Normal", x);
run;
proc sgplot data=PDF;
series x=x y=PDF_FD;
series x=x y=PDF_Exact;
run;
For more on finite-difference derivatives, see "The DIF function: Compute lagged differences and finite differences"
Oh my, thank you very much once again.
This certainly is an option;
However, I "need" to replicate the "ordinary" output 1:1, which obtains an estimate for the pdf using kernel density estimation based on the cdf in the quantile-level grid. In particular, my application requires an triangular Kernel function.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.