PaigeMiller
Diamond | Level 26

You keep mentioning a "subsequent analysis" or "core analysis" to be done after you "retain variables" or pick "important variables".

I don't really think you can talk about retaining variables or picking important variables in any meaningful way unless we know what the "subsequent analysis" or "core analysis" is, and you haven't told us.

You can pick x3, x17 and x2309 as being the "important variables" for "Subsequent Analysis-1", but if you end up doing "Subsequent Analysis-2", those variables could be relatively meaningless.


So I guess there are two issues: one is the hugeness of the data, and the second is the proper analysis. Maybe one dictates the other, or maybe not; I don't know.


Why do you keep mentioning STATA here? Shouldn't we be discussing SAS?

--
Paige Miller
PGStats
Opal | Level 21

I am not familiar with STATA. A quick search informs me that STATA has a CART module. I suggest you start there. First, remove any transformed variables from your dataset, since CART is insensitive to monotonic transformations (logs, powers, dummy variables, etc.). Then, if what's left of your data still doesn't fit in STATA, subsample. That should allow you to see what's meaningful and what's not.
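To illustrate why a transformed copy of a variable adds nothing for a tree, here is a minimal Python sketch (not STATA or SAS code). It assumes a single regression split chosen by summed squared error, and the function name `best_split` and the toy data are hypothetical. Because a strictly increasing transformation keeps the rows in the same order, the best split sends the same rows to the same side.

```python
import math


def best_split(xs, ys):
    """Return the set of row indices sent to the left branch by the single
    split on xs that minimises the summed squared error of ys."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])

    def sse(idx):
        # Squared error of ys around their mean within one branch.
        mean = sum(ys[i] for i in idx) / len(idx)
        return sum((ys[i] - mean) ** 2 for i in idx)

    best = None
    for cut in range(1, len(order)):
        # A split can only fall between two distinct x values.
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, frozenset(left))
    return best[1] if best is not None else frozenset()


# Toy data: y jumps once x exceeds 4.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [0.1, 0.2, 0.1, 0.9, 1.1, 1.0]

# Compare the partition chosen on x with the one chosen on log(x).
same_partition = best_split(x, y) == best_split([math.log(v) for v in x], y)
```

Only the ordering of the values matters to the split search, so `x` and `log(x)` carry the same information for the tree, and keeping both just inflates the variable count.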

This being a SAS forum, I should also mention that JMP has a decent CART module as well.

PG

PaigeMiller
Diamond | Level 26

makes a good suggestion as well, that probably works in many analysis situations

However, the original question remains so vague and undefined that I can't reconcile all the good suggestions in this thread with the original question. Specifically, CART requires a dependent variable, while the original problem asked about principal components, which cannot make use of a dependent variable. In fact, if you are going to do a future analysis with a dependent variable, principal components probably isn't a good first step; and if you are going to do an analysis that doesn't have a dependent variable, then CART probably doesn't fit.

So I'm coming to the conclusion that the whole idea of picking an analysis method at this point makes no sense without much more information from the original poster.

--
Paige Miller
thafu
Calcite | Level 5

Dear SAS Friends

Sure, as some of you have mentioned, my statistical protocol is not precise as of now. The intention of this project is to produce an accurate report on the health situation of a sub-Saharan African country, focusing on the “socioeconomic determinants of health inequality and inequity”. The work is to be based on the protocol previously applied in two studies: one in Europe (link-doi: 10.1056/NEJMsa0707519) and another in North Africa (link-doi:10.1186/1475-9276-10-23).

Preliminary data understanding revealed that evaluating all the 4K variables will definitely be a hard job. As noted in my earlier communication, I was most interested in splitting the data arbitrarily into small sets, followed by grouping (where I suggested the use of principal component analysis).

My goal sharing was to get hint on simple and efficient ways of summarizing or grouping these variables with limited error risks. I am following this discussion with keen interest and hope to refine my methodology for the work. Therefore, your suggestions towards this end would be highly appreciated.

Thafu

PaigeMiller
Diamond | Level 26

The links to those two studies don't appear to work, so we can't know what is in them.

Also, I don't know why your reply appears in such small font, but could you please avoid using such a small font in the future? Thank you.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Thanks for the links. I'll read these a little more carefully as time permits, but a quick browse indicates that these links are rather light on the methodology used. In any event, Thafu says:

My goal sharing was to get hint on simple and efficient ways of summarizing or grouping these variables with limited error risks.

Yeah, I agree, we'd all like that in complicated empirical analyses of huge data sets. I don't think you are going to be able to accomplish all of these things simultaneously. I don't think anyone has ever accomplished all of these things simultaneously in a complicated empirical analysis of a large data set, as "simple and efficient" and "limited error risks" seem to be opposites. And the idea of arbitrary splitting of the data, as I have said 32 times before, seems to be a terrible idea, and definitely in the opposite direction of simple, opposite direction of efficient, and opposite direction of limited error risk.

Now, if the problem is that you don't have enough computer power to handle 3997 variables in one analysis, then you ought to consider doing a PCA where you don't extract all 3997 components from the data; you could extract, for example, the first 20 components, which ought to give you a good starting point, without straining your computer's memory too terribly. I can advise on how to do that in SAS. If that is still too taxing for your computer, I would advise you to get a better computer. I would advise you to abandon the idea of arbitrary splitting of the data as the first step, even though you have mentioned it several times now.
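As a rough illustration of extracting only the leading components, here is a minimal Python sketch (not SAS code). It assumes power iteration with deflation on the covariance matrix, which is one simple way to stop after k components and not necessarily what any particular package does internally. The function names `covariance`, `mat_vec` and `top_components` and the toy data are hypothetical. In SAS itself, the `N=` option of PROC PRINCOMP limits the number of components computed.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of the columns of rows (equal-length lists)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_components(rows, k, iters=500, seed=0):
    """Return the k largest eigenvalues and unit eigenvectors of the
    covariance matrix, found one at a time, without computing the rest."""
    rng = random.Random(seed)
    cov = covariance(rows)
    p = len(cov)
    values, vectors = [], []
    for _ in range(k):
        # Start from a random unit vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: the variance explained by this component.
        lam = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        values.append(lam)
        vectors.append(v)
        # Deflation: remove this component so the next pass finds the next one.
        for i in range(p):
            for j in range(p):
                cov[i][j] -= lam * v[i] * v[j]
    return values, vectors


# Toy data with 3 variables; ask for only the first 2 components.
rows = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
values, vectors = top_components(rows, k=2)
```

Pure Python like this is only meant to show the idea on a toy matrix; with thousands of variables you would rely on the statistical package's own routine and simply cap the number of components it computes.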

--
Paige Miller
PaigeMiller
Diamond | Level 26

So I have now read the NEJM article online -- I assume that it is a summary of a larger article that I do not have access to -- and I fail to see any connection to the study described by Thafu.

I see disconnects on the following points:

  1. The NEJM article online doesn't get into a lot of detail about data summarization, which is what we have been discussing primarily
  2. The NEJM article online seems to be, at least based on my reading of it, using a subject matter approach to finding meaningful variables for analysis, whereas Thafu continues to describe an empirical approach of finding important variables amongst his 3997 original variables
  3. The methods of analysis mentioned in the NEJM article online are regression and Poisson regression, meaning that Principal Components analysis is quite simply an inappropriate (or to use Thafu's word, not "efficient") method to summarize data prior to performing regression or Poisson regression

As one other piece of advice, it seems to me that this is probably not the kind of study that one can properly advise via an internet forum. I think Thafu needs to sit down with a consulting statistician, describe the entire project from one end to the other, starting with the data he has (or the data he'd like to collect) and then describing where he'd like to go, eventually discussing the hypotheses that will be tested. I think it is impossible to jump into the middle and say, "how do I summarize the data?" without understanding the entirety of the project. Be prepared to spend a fair amount of time and a number of iterations trying to get to the final result. There is no turnkey approach, nor is there likely to be a straight line to the final result ... there is not likely to be one single approach that someone can advise that if you do this, then you're going to get the answer.

--
Paige Miller
thafu
Calcite | Level 5

To all contributors, namely PaigeMiller, PGStats, Jaap Karman, stat@sas, RW9, data_null_; and LinusH: thank you very much.

I have gained sufficient insight on the issues and I will continue on from here. Should I need further aid, I shall be in touch.

Sincerely

Thafu
