Posted 12-03-2018 03:43 PM
Hi,

I've been tasked (voluntold) to convert some Stata code to SAS. It uses NPREGRESS, which appears to be a nonparametric regression.

I'm looking at the GAMPL procedure in SAS to see if it's the equivalent. If anyone has thoughts on which proc may be appropriate, that would be highly appreciated. I just need pointers to the procs I should be looking at.

Thanks!

I'm filing "voluntold" away for future use.

Meanwhile, I personally have no clue. But @Rick_SAS has this pertinent blog post that might be useful. And he is wonderfully wise.

I hope this helps!

SAS has many nonparametric and semi-parametric procedures. Please tell us:

1. How many explanatory variables you have and whether any are classification variables

2. What is the nature of the response variable? Continuous? Counts? Binary?

GAMPL is one choice, as is the ADAPTIVEREG procedure (click here for a discussion and example of 2-D regression of a binary response). For 1-D or 2-D data and a continuous response, I like the nonparametric smoothers in the LOESS procedure.

Thanks, Rick. I've been playing around with GAMPL and was planning to look into ADAPTIVEREG as well as LOESS, though if LOESS can only handle 2-D data, that won't work.

The response is continuous and for explanatory variables I was hoping for 10 to 300 (kitchen sink). I do have about 3 million observations.

It looks like npregress is a kernel regression, and from your blog post, SAS doesn't do kernel regression for multivariate data, at least out of the box.
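For anyone following along who hasn't met the term, here is a rough sketch of what a kernel regression computes. It is written in Python purely as an illustration and shows the simplest local-constant (Nadaraya-Watson) form with a Gaussian kernel. It is not npregress's actual estimator or bandwidth selection, and it is not anything built into SAS. The function name `kernel_regress` and the bandwidth value are made up for the example.

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, bandwidth):
    """Local-constant (Nadaraya-Watson) estimate with a Gaussian kernel.

    x_train: (n_train, d) array, y_train: (n_train,) array,
    x_query: (n_query, d) array. Hypothetical helper for illustration only.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_2d(np.asarray(x_query, dtype=float))
    # Squared Euclidean distance from every query point to every training row.
    diff = x_query[:, None, :] - x_train[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)            # shape (n_query, n_train)
    # Nearby rows get weights near 1, distant rows get weights near 0.
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    # The prediction is the weighted average of the observed responses.
    return weights @ y_train / weights.sum(axis=1)

# Toy data: a noisy sine curve in one dimension.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.1, size=500)
fit = kernel_regress(x, y, [[0.0], [1.0]], bandwidth=0.2)
```

Note that every prediction looks at every training row, so this naive version does work proportional to the number of query points times the number of training rows.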

Personally, I'd recommend ADAPTIVEREG for a problem that has several hundred parameters. The GAMPL procedure solves a big optimization problem and is mostly focused on 1-D transformations of each individual variable. The ADAPTIVEREG procedure should obtain a solution faster while still being flexible in fitting the data.

Kernel regression is old technology. I do not recommend it for high-dimensional problems.

Awesome, thanks Rick! One last quick question: with 3 million rows, my data set is 13GB (too big), and my computer has 16GB of RAM. Will that be enough to run this? It took 45 minutes with one variable. Guess I'll try and see if it explodes 🙂
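As a back-of-the-envelope check on those numbers, the arithmetic can be sketched like this (Python, just as a calculator). It assumes "13GB" means decimal gigabytes and that the model variables are stored as ordinary 8-byte SAS numerics. The helper `numeric_footprint_gb` is a made-up name. Whether ADAPTIVEREG needs all of the data in memory at once is not something this estimate settles.

```python
rows = 3_000_000
file_bytes = 13 * 10**9          # assumption: "13GB" read as decimal gigabytes

# Average size of one observation across all variables in the data set.
bytes_per_row = file_bytes / rows

def numeric_footprint_gb(n_rows, n_vars, bytes_per_value=8):
    """Raw size in GB of n_vars numeric columns (8 bytes each by default)."""
    return n_rows * n_vars * bytes_per_value / 10**9

small_model = numeric_footprint_gb(rows, 10)    # 10 explanatory variables
large_model = numeric_footprint_gb(rows, 300)   # 300 explanatory variables

print(f"{bytes_per_row:.0f} bytes per row on average")
print(f"10 variables:  {small_model:.2f} GB")
print(f"300 variables: {large_model:.2f} GB")
```

On these assumptions, the raw values for 300 numeric variables come to roughly 7 GB, so keeping only the columns that go into the model is likely to matter more than the size of the full file.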

Thanks, I'll keep working on it and let you know how it goes.

