You keep mentioning a "subsequent analysis" or "core analysis" to be done after you "retain variables" or pick "important variables".
I don't really think you can talk about retaining variables or picking important variables in any meaningful way unless we know what the "subsequent analysis" or "core analysis" is, and you haven't told us.
You can pick x3, x17 and x2309 as being the "important variables" for "Subsequent Analysis-1", but if you end up doing "Subsequent Analysis-2", those variables could be relatively meaningless.
So I guess there are two issues: one is the hugeness of the data, and the second is the proper analysis. Maybe one dictates the other, or maybe not; I don't know.
Why do you keep mentioning STATA here? Shouldn't we be discussing SAS?
I am not familiar with STATA. A quick search informs me that STATA has a CART module. I suggest you start there. First, remove any transformed variables from your dataset, since CART is insensitive to monotonic transformations (logs, powers, dummy variables, etc.). Then, if what's left of your data still doesn't fit in STATA, subsample. That should allow you to see what's meaningful and what's not.
This being a SAS forum, I should also mention that JMP has a decent CART module as well.
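The point about monotonic transformations can be checked with a small sketch. This is plain Python rather than STATA, SAS or JMP code, the data values are made up, and `best_split` is a hypothetical helper that searches for the single best regression-tree split on one predictor. Because a split depends only on the ordering of the predictor, the raw and log-transformed versions partition the observations identically.

```python
import math

def sum_sq(y, idx):
    """Sum of squared deviations of y from its mean over the given indices."""
    mean = sum(y[i] for i in idx) / len(idx)
    return sum((y[i] - mean) ** 2 for i in idx)

def best_split(x, y):
    """Return (sse, left_indices) for the best single split of y on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        # A split is only possible between two distinct predictor values.
        if x[order[k - 1]] == x[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = sum_sq(y, left) + sum_sq(y, right)
        if best is None or sse < best[0]:
            best = (sse, frozenset(left))
    return best

# Made-up data: the response jumps once x exceeds 8.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
y = [1.1, 0.9, 1.0, 1.2, 3.1, 2.9, 3.0, 3.2]
log_x = [math.log(v) for v in x]  # a monotonic transformation of x

sse_raw, left_raw = best_split(x, y)
sse_log, left_log = best_split(log_x, y)

print("same partition:", left_raw == left_log)
print("same error:", math.isclose(sse_raw, sse_log))
```

Keeping both `x` and `log_x` in the dataset therefore adds a column without adding anything a tree could use, which is why dropping the transformed copies first is a cheap way to shrink the data.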
PG
PGStats makes a good suggestion as well, one that probably works in many analysis situations.
However, the original question remains so vague and undefined that I can't reconcile all the good suggestions in this thread with it. Specifically, CART requires a dependent variable, while the original problem asked about principal components, which cannot make use of a dependent variable. In fact, if you are going to do a future analysis with a dependent variable, principal components probably isn't a good first step; and if you are going to do an analysis that doesn't have a dependent variable, then CART probably doesn't fit.
So I'm coming to the conclusion that picking an analysis method at this point makes no sense without much more information from the original poster.
Dear SAS Friends
Sure, as some of you have mentioned, my statistical protocol is not precise as of now. The intention of this project is to produce an accurate report on the health situation of a sub-Saharan African country, focusing on the “socioeconomic determinants of health inequality and inequity”. The work is to be based on the protocol previously applied in two studies: one in Europe (link-doi: 10.1056/NEJMsa0707519) and another in North Africa (link-doi: 10.1186/1475-9276-10-23).
Preliminary data understanding revealed that evaluating all the 4K variables will definitely be a hard job. As noted in my earlier communication, I was most interested in splitting the data arbitrarily into small sets, followed by grouping (where I suggested the use of principal component analysis).
My goal in sharing was to get hints on simple and efficient ways of summarizing or grouping these variables with limited error risks. I am following this discussion with keen interest and hope to refine my methodology for the work. Therefore, your suggestions towards this end would be highly appreciated.
Thafu
The links to those two studies don't appear to work, so we can't know what is in them.
Also, I don't know why your reply appears in such a small font, but could you please avoid using such a small font in the future? Thank you.
http://www.nejm.org/doi/full/10.1056/NEJMsa0707519
and
http://www.equityhealthj.com/content/10/1/23
those worked for me.
PG
Thanks for the links. I'll read these a little more carefully as time permits, but a quick browse indicates that they are rather light on the methodology used. In any event, Thafu says:
My goal in sharing was to get hints on simple and efficient ways of summarizing or grouping these variables with limited error risks.
Yeah, I agree, we'd all like that in complicated empirical analyses of huge data sets. I don't think you are going to be able to accomplish all of these things simultaneously, and I don't think anyone has ever accomplished all of them simultaneously on a complicated empirical analysis of a large data set, since "simple and efficient" and "limited error risk" tend to pull in opposite directions. And the idea of arbitrary splitting of the data, as I have said 32 times before, seems to be a terrible idea: it goes in the opposite direction of simple, of efficient, and of limited error risk.
Now, if the problem is that you don't have enough computer power to handle 3997 variables in one analysis, then you ought to consider doing a PCA where you don't extract all 3997 components from the data. You could extract, for example, the first 20 components, which ought to give you a good starting point without straining your computer's memory too terribly. I can advise on how to do that in SAS. If that is still too taxing for your computer, I would advise you to get a better computer. I would also advise you to abandon the idea of arbitrary splitting of the data as the first step, even though you have mentioned it several times now.
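To illustrate what extracting only the first few components means, here is a minimal sketch in plain Python (not SAS code, and not meant for 3997 real variables). It assumes power iteration with deflation as the extraction technique, which is one way among several to get leading components without a full decomposition. The data are synthetic, and `covariance_matrix`, `mat_vec` and `top_components` are hypothetical helper names.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of observation rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def mat_vec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def top_components(cov, k, iters=500, seed=0):
    """Return the k leading (eigenvalue, eigenvector) pairs of cov."""
    rng = random.Random(seed)
    p = len(cov)
    work = [row[:] for row in cov]  # copy, so cov itself is not modified
    found = []
    for _component in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _step in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v gives its eigenvalue.
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        found.append((eigenvalue, v))
        # Deflation: remove the component just found before the next pass.
        for a in range(p):
            for b in range(p):
                work[a][b] -= eigenvalue * v[a] * v[b]
    return found

# Synthetic data: 6 variables driven by 2 latent factors plus small noise.
rng = random.Random(42)
rows = []
for _obs in range(200):
    f1, f2 = 3.0 * rng.gauss(0, 1), 1.5 * rng.gauss(0, 1)
    rows.append([f + 0.1 * rng.gauss(0, 1) for f in (f1, f1, f1, f2, f2, f2)])

cov = covariance_matrix(rows)
components = top_components(cov, k=2)  # extract only 2 of the 6 components

total_variance = sum(cov[j][j] for j in range(len(cov)))
explained = sum(ev for ev, _vec in components) / total_variance
print("share of variance in the first 2 components:", round(explained, 4))
```

The design choice is that work and memory scale with the number of components requested rather than with the number of variables, which is the reason for stopping at, say, 20 components instead of all 3997.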
So I have now read the NEJM article online -- I assume that it is a summary of a larger article that I do not have access to -- and I fail to see any connection to the study described by Thafu.
I see disconnects on the following points:
As one other piece of advice, it seems to me that this is probably not the kind of study that one can properly advise on via an internet forum. I think Thafu needs to sit down with a consulting statistician and describe the entire project from one end to the other, starting with the data he has (or the data he'd like to collect), then describing where he'd like to go, and eventually discussing the hypotheses that will be tested. I think it is impossible to jump into the middle and say, "how do I summarize the data?" without understanding the entirety of the project. Be prepared to spend a fair amount of time and a number of iterations trying to get to the final result. There is no turnkey approach, nor is there likely to be a straight line to the final result; there is not likely to be one single approach where someone can say that if you do this, then you're going to get the answer.
To all contributors, namely PaigeMiller, PGStats, Jaap Karman, stat@sas, RW9, data_null_; and LinusH: thank you very much.
I have gained sufficient insight on the issues and I will continue on from here. Should I need further aid, I shall be in touch.
Sincerely
Thafu