I have a large dataset that combines multiple rows for each of multiple units (zip codes). I need to build one regression tree for each unit and use it to impute missing values in the DV. I understand that PROC HPSPLIT does not implement "by" processing, so I thought of using a macro to split the data, build the tree, score the missing values, and append the results to an output data set.
That is how I came across the question "Splitting the dataset using macros" and its recommendation not to use macros. Can anyone suggest a better approach?
The version of SAS is 9.4, SAS/STAT 15.2
Thanks in advance.
There are missing values among the explanatory variables but that is precisely the reason I am considering trees for this task (so I do not need to throw observations away).
SAS has some built-in methods of imputing missing values, such as PROC MI. Whether or not this would work for you, I can't say.
I am only aware of HPSPLIT, which, as you mentioned, does not allow processing by group. This is precisely the reason I thought I may have to use macros.
Yes, I don't personally have qualms about creating a macro to do BY group processing in this case where the PROC you want to use does not have a BY statement. I think the disadvantages of using macros here would be very few and minor.
If needed, I could try more complex methods (e.g., random forests), but I am afraid it may not be feasible because the dataset has over 100,000,000 observations and a few hundred groups.
Perhaps SAS Viya, which has the ability to distribute the task between many different machines and also has PROC TREESPLIT, might be a good way to get the results you want in a short amount of time.
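For reference, here is a minimal sketch of the macro-based workaround discussed in this reply (split, build tree, score, append). It is an illustration under stated assumptions, not a tested solution: the macro name, the data set names (work.have, work.imputed), the variable names (zip, y, x1-x3), and the imputed_flag variable are all hypothetical; the BY variable is assumed to be character; and the prediction is assumed to be named P_ followed by the target name in the scoring code that PROC HPSPLIT writes.

```sas
/* Sketch only: all data set, variable, and macro names are hypothetical. */
%macro impute_by_group(data=, out=, byvar=zip, target=y, inputs=x1 x2 x3);
   %local i ngrp scorefile;

   /* The scoring code is rewritten for each group in the WORK directory */
   %let scorefile = %sysfunc(pathname(work))/tree_score.sas;

   /* One macro variable per distinct group value (assumes a character BY variable) */
   proc sql noprint;
      select distinct &byvar into :grp1- from &data;
   quit;
   %let ngrp = &sqlobs;

   /* Start from an empty output data set */
   %if %sysfunc(exist(&out)) %then %do;
      proc delete data=&out;
      run;
   %end;

   %do i = 1 %to &ngrp;
      /* Split: keep the rows of one group */
      data work._grp;
         set &data;
         where &byvar = "&&grp&i";
      run;

      /* Build the tree for this group and save DATA step scoring code */
      proc hpsplit data=work._grp;
         model &target = &inputs;
         code file="&scorefile";
      run;

      /* Score: fill in the target only where it was missing.
         Assumes the generated code names the prediction P_<target>. */
      data work._scored;
         set work._grp;
         %include "&scorefile";
         imputed_flag = missing(&target);
         if imputed_flag then &target = P_&target;
      run;

      /* Append this group's rows to the output data set */
      proc append base=&out data=work._scored force;
      run;
   %end;
%mend impute_by_group;

%impute_by_group(data=work.have, out=work.imputed);
```

With over 100,000,000 rows, each WHERE subset above is a separate pass through the full data set, so an index on the group variable or a single up-front split may be worth testing. A group in which every value of the target is missing cannot be fit and would need separate handling.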
Thanks for the prompt reply. We are clear on your point 1.
About point 2, I meant I want to impute the dependent variable, so I would use the trees for scoring.
There are missing values among the explanatory variables but that is precisely the reason I am considering trees for this task (so I do not need to throw observations away).
You don't say what PROC or other functions in SAS you are going to use to create regression trees.
For example, if you are going to use PROC HPSPLIT, there is no BY statement, so some sort of macro is probably unavoidable. Other methods of creating regression trees may have the BY statement, so we definitely need to know how you are planning to do this.
Macros have advantages and disadvantages. Without a clearer explanation of what you are planning to do, there's really no way to discuss them.
Thanks for the reply. I did not name a specific tool precisely because I am open to suggestions. I am only aware of HPSPLIT, which, as you mentioned, does not allow processing by group. This is precisely the reason I thought I may have to use macros.
If needed, I could try more complex methods (e.g., random forests), but I am afraid it may not be feasible because the dataset has over 100,000,000 observations and a few hundred groups.
Would you happen to be aware of alternatives to HPSPLIT that could handle this volume of information by group?