Beanpot
Fluorite | Level 6

Hello,

I have a dataset with approximately 7,000 observations. There is one binary outcome variable (true/false) where about 2,000 observations have the outcome (=true). I suspect there may be combinations of the outcome positive observations which may be significant for certain variables but I have no way to know which variables in order to categorize them. I'd like to have SAS test combinations of the observations with the outcome to see which observations might be similar to each other based on the variables. Is there a way to have SAS categorize the "true" observations into smaller groups to test? I suspect this would be some sort of factorial (ex. observations 1+2, 1+3, 1+4, ... , n! where it only uses observations with the outcome of interest).

11 REPLIES
PaigeMiller
Diamond | Level 26

I am struggling to understand the problem, and I really don't know what you are trying to do. It seems like you want to throw away some observations and keep others to achieve ... well, I don't really know but it sounds like you want to find observations to do a test ... what test? Shouldn't you keep all observations and do the test on all of them? 

 

Maybe you could write a more complete and clear problem statement, stating clearly what you are planning to learn from this data and what tests you might consider in the analysis.

--
Paige Miller
Beanpot
Fluorite | Level 6

So the issue is that the variables, while having the same outcome, may have different biological activities/pathways which lead to the outcome, but I don't know which activity/pathway. If I knew which acted similarly to one another, I could create a subset including only these observations. In lieu of that, I'm trying to figure out which combinations of observations with the outcome yield significant results for the predictor variables, in order to determine these subgroups and later test them empirically to determine the mode of action.

Quentin
Super User

I'm also having a hard time understanding your question.  The first two sentences are clear, but I can't understand "I suspect there may be combinations of the outcome positive observations which may be significant for certain variables" 

 

In addition to more description of the problem, example data might help. Perhaps an example dataset with, say, four variables (your one outcome/dependent variable and three variables that are predictors/independent variables). Example data should help you explain what you mean by "significant for certain variables" and "which observations might be similar to each other based on the variables."

Tom
Super User

Are you asking how to do CLUSTER analysis?

quickbluefish
Barite | Level 11

Are you trying to do some sort of bootstrapping?  I'm also not really following, but you might start by assigning a random number to all of your TRUE observations, separating them out, sorting them by that random number, and then just take chunks of them each time through the loop.  I'm sure there's some sort of fancy SURVEYSELECT way also.  

 


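%let chunksize=1000;

/* Split HAVE into TRUE and FALSE; each TRUE row gets a random sort key r. */
data 
    false (drop=r)
    true
    ;
set have end=last;   /* END= is a SET statement option, not a data set option */
if _n_=1 then call streaminit(14561436);
if x="true" then do;
  r=rand('uniform');
  ntrue+1;           /* sum statement: ntrue starts at 0 and is retained */
  output true;
end;
else output false;
if last then call symputx("nchunks", ceil(ntrue/&chunksize));
drop ntrue;
run;

/* Sorting by the random key shuffles the TRUE rows. */
proc sort data=true; by r; run;

%macro run_chunks;
    %local cnum startrow;
    %let startrow=1;
    %do cnum=1 %to &nchunks;
        /* Take the next block of &chunksize shuffled rows. */
        data sample_true;
        set true;
        if &startrow<=_N_<(&startrow+&chunksize);
        run;

        %let startrow=%eval(&startrow+&chunksize);

        /* other stuff for your model... */
    %end;
%mend run_chunks;
%run_chunks

...not tested. The %DO loop only works inside a macro, hence the wrapper.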
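
For the SURVEYSELECT route mentioned above, here is a minimal sketch. It assumes the TRUE data set created by the DATA step in this post and that TRUE has at least 1,000 rows (the question mentions about 2,000). Unlike the loop, it draws repeated simple random samples rather than disjoint chunks, so a row can appear in more than one sample. The output name TRUE_SAMPLES and the N= and REPS= values are placeholders, not anything from the question.

```sas
/* Sketch only: TRUE_SAMPLES, N=, and REPS= are illustrative choices. */
proc surveyselect data=true out=true_samples
    method=srs      /* simple random sampling without replacement */
    n=1000          /* rows per sample */
    reps=5          /* number of independent samples */
    seed=14561436;
run;

/* Each sample is identified by the variable REPLICATE in the output. */
proc freq data=true_samples;
    tables replicate;
run;
```

A later modeling step could then use BY REPLICATE to run once per sample instead of looping in a macro.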

Beanpot
Fluorite | Level 6

Essentially but the clusters are unknown/undefined.

Tom
Super User

@Beanpot wrote:

Essentially but the clusters are unknown/undefined.


Isn't that what a cluster analysis intends to find out? Instead of trying to fit a model that reduces the distance between the predicted value and the observed value, you try to figure out how to classify the observations into clusters to reduce the within-cluster distance.

 

SAS has a number of PROCS for this.
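
As one illustration, a minimal sketch using PROC STDIZE and PROC FASTCLUS on the outcome-positive rows only. The data set HAVE, the outcome variable X, the numeric predictors V1-V3, and the choice of four clusters are all assumptions made for the example, not details from the question. PROC CLUSTER or another procedure may well suit the real data better.

```sas
/* Sketch only: HAVE, X, V1-V3, and MAXCLUSTERS=4 are hypothetical. */

/* Standardize the predictors so no single variable dominates the distances. */
proc stdize data=have(where=(x="true")) out=true_std method=std;
    var v1-v3;
run;

/* k-means-style clustering; CLUSTER in the output holds the assigned group. */
proc fastclus data=true_std maxclusters=4 maxiter=100 out=true_clusters;
    var v1-v3;
run;

/* Compare predictor means across the clusters that were found. */
proc means data=true_clusters mean std n;
    class cluster;
    var v1-v3;
run;
```

The number of clusters is a guess here; in practice it would be chosen by comparing several values of MAXCLUSTERS= and by subject-matter knowledge.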

 

But if you think you did not directly measure the important variables then perhaps you need to consider latent variable analysis instead.

 

In any case you need to ask for STATISTICAL help and not PROGRAMMING help. Once you know what analysis you need to perform you can ask for help on how to program it.

PaigeMiller
Diamond | Level 26

I said: "Describe the data and describe what you would like to learn from this data, in relatively simple terms so that those of us not involved in this field of study can understand what the goals are. Do not try to describe the algorithm/mathematical/computational/statistical steps you might follow."

 

@Tom said: "In any case you need to ask for STATISTICAL help and not PROGRAMMING help. Once you know what analysis you need to perform you can ask for help on how to program it."

 

Please don't ignore these requests. This is, in my opinion, the only way to move forward.

--
Paige Miller
Ksharp
Super User

Plotting a mosaic graph is a very good start:

https://blogs.sas.com/content/iml/2013/11/04/create-mosaic-plots-in-sas-by-using-proc-freq.html

 

ods graphics on;   /* the mosaic plot is an ODS Graphics plot */
proc freq data=sashelp.heart order=freq;
tables status*bp_status / plots=mosaicplot;
run;

[Image: mosaic plot of Status by BP_Status from sashelp.heart]

 

PaigeMiller
Diamond | Level 26

Digging into the problem as stated 

 

I suspect there may be combinations of the outcome positive observations which may be significant for certain variables ...

 

Observations are never significant in the standard usage of the statistical term "significant". However, variables can be significant in a specified model or specified statistical test (which could be a simple t-test), but you don't ever in any statistical practice (that I am aware of) select certain observations that make the variable(s) significant. Furthermore, if all you select are observations where the outcome variable is positive, then no variable will ever be significant, as there is no variability in the outcome to predict or test.

 

May I make a suggestion? Describe the data and describe what you would like to learn from this data, in relatively simple terms so that those of us not involved in this field of study can understand what the goals are. Do not try to describe the algorithm/mathematical/computational/statistical steps you might follow.

--
Paige Miller