Thank you @Reeza for this question. Your statement is correct. The combinations will 0, 1, and 0&1 together versus 0, 1, and 0&1 together. I have 12 variables (corvar1 through corvar12), and each corvar takes the values of 0 and 1. And the other dummy, allmdummy, takes the values of 0 and 1 too. I want to be able to run the same regression for every single combination, which amounts to 108. All the possible combinations for the first variable will look like the following:
corvar1=0 and allmdummy=0,
corvar1=0 and allmdummy=1
corvar1=0 and allmdummy= 0 or 1
corvar1=1 and allmdummy=0,
corvar1=1 and allmdummy=1
corvar1=1 and allmdummy= 0 or 1
corvar1=0,1, allmdummy=0,
corvar1=0,1 allmdummy=1
corvar1=0,1 allmdummy= 0 or 1
Ideally, I will be able to create a unique sampleid for each combination and then run proc reg and proc means command for each combination (sampleid). I hope this helps. Please let me know if you have any follow-up questions.
Thank you so much!
@finans_sas wrote:
Thank you @Reeza for this question. Your statement is correct. The combinations will 0, 1, and 0&1 together versus 0, 1, and 0&1 together. I have 12 variables (corvar1 through corvar12), and each corvar takes the values of 0 and 1. And the other dummy, allmdummy, takes the values of 0 and 1 too. I want to be able to run the same regression for every single combination, which amounts to 108. All the possible combinations for the first variable will look like the following:
corvar1=0 and allmdummy=0,
corvar1=0 and allmdummy=1
corvar1=0 and allmdummy= 0 or 1
corvar1=1 and allmdummy=0,
corvar1=1 and allmdummy=1
corvar1=1 and allmdummy= 0 or 1
corvar1=0,1, allmdummy=0,
corvar1=0,1 allmdummy=1
corvar1=0,1 allmdummy= 0 or 1
Ideally, I will be able to create a unique sampleid for each combination and then run proc reg and proc means command for each combination (sampleid). I hope this helps. Please let me know if you have any follow-up questions.
Thank you so much!
This is a trivial change to the solution I posted earlier.
You're approaching inefficiently in my view.
You should transpose your data such that all of the variables you want to filter by are one variable with another variable holding the variable name.
Then I think you can run two or three regressions using a BY statement instead of macros or loops anywhere.
Your data should be something like this, with the transpose replicating your data for you. The overlaps are handled with the different multiple regressions.
VariableName VariableValue AllMDummy ........
ColVar1   0  0
ColVar2 1 0
...Then run:
proc reg data=transposed;
by variableName VariableValue;
.....;
run;
proc reg data=transposed;
by variableName VariableValue ALLMDummy;
.....;
run;
proc reg data=originalData;
*NO BY statement, your 0/1 * 0/1 results in all included but use originalData;
run;
@finans_sas wrote:
Thank you @Reeza for this question. Your statement is correct. The combinations will 0, 1, and 0&1 together versus 0, 1, and 0&1 together. I have 12 variables (corvar1 through corvar12), and each corvar takes the values of 0 and 1. And the other dummy, allmdummy, takes the values of 0 and 1 too. I want to be able to run the same regression for every single combination, which amounts to 108. All the possible combinations for the first variable will look like the following:
corvar1=0 and allmdummy=0,
corvar1=0 and allmdummy=1
corvar1=0 and allmdummy= 0 or 1
corvar1=1 and allmdummy=0,
corvar1=1 and allmdummy=1
corvar1=1 and allmdummy= 0 or 1
corvar1=0,1, allmdummy=0,
corvar1=0,1 allmdummy=1
corvar1=0,1 allmdummy= 0 or 1
Ideally, I will be able to create a unique sampleid for each combination and then run proc reg and proc means command for each combination (sampleid). I hope this helps. Please let me know if you have any follow-up questions.
Thank you so much!
Thank you so much for sharing this other perspective, @Reeza . I need to think about how it will work with the data. The starting dataset has around 50 million rows of data (another one that I need to apply the same code to has about 250 million).
@finans_sas wrote:
Thank you so much for sharing this other perspective, @Reeza . I need to think about how it will work with the data. The starting dataset has around 50 million rows of data (another one that I need to apply the same code to has about 250 million).
That's an exciting twist to the story... If you take one of the approaches where you expand the data to have one record per SampleID, with 108 different sampleIDs, do you have a sense of how much that will increase the number of rows? Each record can't be in all 108 samples, so it wouldn't be 50M*108, but could it be 50M*20?
That's getting big. I don't work with big data much, but I'd be inclined to try Tom's approach of just running 108 PROC REG steps (one for each subset), rather than expanding the data into billions of rows. And hopefully you've got enough memory...
@finans_sas wrote:
Thank you so much for sharing this other perspective, @Reeza . I need to think about how it will work with the data. The starting dataset has around 50 million rows of data (another one that I need to apply the same code to has about 250 million).
Hello, @finans_sas
When important pieces of information are not included in your problem statement, and then revealed to us later, this leads to a frustrating process where code presented prior to these disclosures may not be best. So far, you have not told us in the original problem statement that you are doing this "the results are robust to different ways of measuring the variable of interest"; that the values of the variables in the code are 0,1 variables and not continuous variables; and that 50 million records are in the data set. It would be better to state this in the first post on this problem. In addition, I want to know what other important pieces of information haven't yet been mentioned.
With regards to "the results are robust to different ways of measuring the variable of interest":
There are bootstrap and jackknife methods available in SAS which can perhaps achieve this without running all possible combinations of regressions here. There are also methods where you random sample, instead of trying every possible combination of everything.
In any event, I still think my solution which passes through the data set once, rather than passes through the data set 108 times is the way to go, and I feel even more convinced of this with the revelation that you have 50 million pieces of data in the data set. Any solution that uses macros and looping will be orders of magnitude slower.
You are absolutely right, @PaigeMiller. Next time If I have another question, I will make sure to include all the important details. I sincerely apologize for this. Thank you so much for all your help and the code that your proposed. It worked like a charm.
Just curious, when you expand your 50M row dataset, how many records do you end up with?
Hi @Quentin ! I actually ended up running the code three times (4 corvars at a time) since the computer kept running out of disk space. If I recall correctly, it was 20 billion or so for each run.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
