topic RE: Splitting large data by variables in SAS Programming

RE: Splitting large data by variables

thafu — Sun, 18 May 2014 17:05:35 GMT

Hello friends,

I need some help with data management. I have a large dataset of 17375 observations and 3997 variables. I wish to split this data into three sets of 17375 observations and roughly 1333 variables each, while retaining all the observations and the unique identification code for future re-merging.

I would like help developing the SAS code to do the splitting.

Thanks in advance, I would appreciate your assistance.

Fred

Re: RE: Splitting large data by variables

PGStats — Sun, 18 May 2014 18:58:30 GMT

3997 variables, that's a lot of variables indeed. But splitting them arbitrarily into three sets might not be the best strategy. It might be better to organize your data differently and keep them in the same dataset. What are these variables?

Re: RE: Splitting large data by variables

jakarman — Sun, 18 May 2014 19:28:16 GMT

Just guessing with a calculation.
17K observations is not much; 4K variables is. Most DBMS systems do not support that many columns. 17K * 4K * 8 bytes is about 550 MB, which is still not big. Unless long character variables are part of the dataset, you do not get to the 32-bit / 2 GB limit. As PGStats is asking what these variables are: what is the real reason for wanting to split it up?
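The back-of-envelope size estimate can be checked in a DATA step. This is a minimal sketch that assumes every variable is stored as an 8-byte numeric, which may not hold for the real survey file; the counts are the ones quoted in the original post. It works out to roughly 556 MB.

```sas
/* Rough size estimate; assumes all variables are 8-byte numerics */
data _null_;
   nobs  = 17375;              /* observations, from the original post */
   nvars = 3997;               /* variables, from the original post    */
   bytes = nobs * nvars * 8;   /* uncompressed data portion only       */
   mb    = bytes / 1e6;
   put bytes= comma15. mb= 8.1;
run;
```

Running PROC CONTENTS on the actual data set would report the true observation length, which is the better number if character variables are present.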

Re: RE: Splitting large data by variables

LinusH — Mon, 19 May 2014 07:34:02 GMT

I agree, having that many variables is inconvenient in many ways. Imagine having to write programs that address all the variables by name. The only use case I've seen is data mining that needs the data stacked in variables/columns.

So without knowing your requirements, my guess is that you are better off transposing your data in some way.
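One possible way to transpose from wide to long is PROC TRANSPOSE. The sketch below uses sashelp.class as a stand-in, with NAME playing the role of the unique identifier; the output names class_long, variable and value are made up for illustration, and this only works as shown for numeric variables.

```sas
/* Wide-to-long sketch: one row per (identifier, variable) pair */
proc sort data=sashelp.class out=class_sorted;
   by name;
run;

proc transpose data=class_sorted
               out=class_long(rename=(_name_=variable col1=value));
   by name;                  /* identifier kept on every output row */
   var age height weight;    /* columns to stack into rows          */
run;
```

Whether a long layout suits the survey data depends on what the 3997 variables actually are, which the thread has not yet established.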

Re: RE: Splitting large data by variables

RW9 — Mon, 19 May 2014 08:24:41 GMT

Completely agree with all previous posts, just wanted to add that you could reduce large datasets into smaller ones using RDBMS theory.

Re: RE: Splitting large data by variables

data_null__ — Mon, 19 May 2014 10:55:51 GMT

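/* The input data set and the key variable(s) to include in every output data set. */
/* AgeAtStart only stands in for an ID here; use your unique identifier instead.   */
%let data=sashelp.heart;
%let keys=ageatstart;

/* Step 1: list the variable names and positions (VARNUM), withOUT the KEYS */
proc contents noprint out=sansid(keep=name varnum) data=&data(drop=&keys);
   run;

/* Step 2: put the names in VARNUM order, i.e. their order in the data set */
proc sort data=sansid;
   by varnum;
   run;

/* Step 3: create 3 approximately equal groups; PROC RANK numbers them 0, 1 and 2 */
proc rank data=sansid out=sansid groups=3;
   var varnum;
   ranks group;
   run;

/* Step 4: for each GROUP, generate a data set name (vgroup0, vgroup1, vgroup2)   */
/* with a KEEP= data set option holding the keys plus a name range variable list  */
/* (first--last) built from the first and last name in the group.                 */
/* The generated code is written to a temporary file.                             */
filename codegen temp;
data _null_;
   file codegen;
   set sansid;
   by group;
   /* the trailing @@ holds the output line so the last name lands on the same line */
   if first.group then put +3 'vgroup' group '(keep=' "&keys" +1 name '--' @@;
   if last.group then put +1 name ')';
   run;

/* Step 5: create the new data sets; %INC inserts the generated names and KEEP= lists */
data
   %inc codegen / source2;
   ;

   /* This merge is important to ensure the keys are on the left and not in the name ranges */
   merge &data(keep=&keys) &data(drop=&keys);
   run;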
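For the "future re-merging" part of the question, a minimal sketch follows. It uses sashelp.class as a stand-in because NAME is unique there; part1, part2 and rejoined are hypothetical names, and with the survey data the unique identification code would replace NAME.

```sas
/* Split by variables, keeping the key in every piece */
data part1(keep=name sex age) part2(keep=name height weight);
   set sashelp.class;
run;

/* Each piece must be sorted by the key before a match-merge */
proc sort data=part1; by name; run;
proc sort data=part2; by name; run;

/* Re-merge on the key */
data rejoined;
   merge part1 part2;
   by name;
run;

/* Optional check that nothing was lost */
proc compare base=sashelp.class compare=rejoined;
run;
```

The match-merge only reconstructs the original rows if the key is truly unique in every piece, which is why the identification code has to be kept in all of the splits.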


Re: RE: Splitting large data by variables

thafu — Mon, 19 May 2014 11:14:39 GMT

Thank you all for your generosity,

To respond to your questions: this is government household surveillance data. The study objective is to "investigate the socio-economic determinants of health inequality and inequity". The first step was to arbitrarily break this large dataset into smaller ones, then assess which variables are important to use. The selected variables could then be consolidated into a few, a manageable number, via principle component analysis. Finally, I am to merge the consolidated datasets and perform the core analysis of the study.

I also wished to split this dataset to allow alternative statistical work in STATA 13, in which I am more competent, but which is limited in the amount of data it can handle.

Warm regards

Re: RE: Splitting large data by variables

thafu — Mon, 19 May 2014 11:33:34 GMT

Hello data_null_,

Thanks for the code. For clarity, I have one request: could you please add brief comments to your code so that I can follow it?

Regards,

Fred

Re: RE: Splitting large data by variables

jakarman — Mon, 19 May 2014 11:50:09 GMT

thafu, you say this is government household surveillance data. Your first job will be to understand the data.
I assume the records are organized by household. The large number of variables could be caused by measurements repeated over time.
Those could possibly be evaluated with a time-series analysis, given one predictor.

Once your data is cleaned and organized that way, you can take the next step: hypothesis testing, or the predictive analytics common in data mining.

How you work with your data may differ between those two.
The data mining approach needs one or several target values on which you train and validate. That requires partitioning your data.

I am missing that in your question.

Re: RE: Splitting large data by variables

PaigeMiller — Mon, 19 May 2014 13:07:40 GMT

thafu wrote:

The first step was to arbitrarily break this large dataset into smaller ones, then assess which variables are important to use. The selected variables could then be consolidated into a few, a manageable number, via principle component analysis. Finally, I am to merge the consolidated datasets and perform the core analysis of the study.

In my opinion, if the goal is to find "important variables" by some statistical method like principal (not principle) components analysis, then you don't want to split the data at all, you want to run the analysis on the ENTIRE data set. I realize that this may cause problems if your computer doesn't have enough memory, but there are algorithms that would allow principal components to extract a few components (instead of all 3997 components) that would be much less likely to cause issues where you run out of memory.

Splitting the dataset into "arbitrary" thirds is simply the wrong way to go here with any statistical procedure. Furthermore, the unspecified "core analysis" of this study could greatly suffer depending on how you select these "important variables", and it WILL greatly suffer if you select these "important variables" from arbitrary thirds of the data.
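As a rough illustration of asking for only a few components, PROC PRINCOMP accepts an N= option. This is a sketch only: sashelp.heart stands in for the survey data, the variable list and N=3 are arbitrary, and pc_scores and pc_stats are made-up output names. It is not necessarily the memory-saving algorithm alluded to above, which the thread does not name.

```sas
/* Keep only the first 3 principal components, computed on the whole data set */
proc princomp data=sashelp.heart n=3 out=pc_scores outstat=pc_stats;
   var ageatstart height weight diastolic systolic cholesterol;
run;
```

The OUT= data set holds the component scores alongside the original variables, so the identifier stays attached to each observation.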

Re: RE: Splitting large data by variables

stat_sas — Mon, 19 May 2014 15:24:23 GMT

As the subject-matter specialist, you know best which variables should be used for data reduction. Maybe that is the reason for splitting the files, or maybe you want to retain only the numeric variables in one of the split files in order to apply PCA. If you are considering principal components, then you will have to rely on the principal components instead of the original variables in further analyses. If you want to retain the original variables in the analysis, please try exploring proc varclus.
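A minimal PROC VARCLUS sketch, for orientation only: sashelp.heart stands in for the survey data, and the variable list and MAXCLUSTERS=3 are arbitrary choices rather than recommendations.

```sas
/* Group correlated numeric variables into at most 3 clusters */
proc varclus data=sashelp.heart maxclusters=3 short;
   var ageatstart height weight diastolic systolic cholesterol mrw smoking;
run;
```

The cluster summary lists which original variables fall into each cluster, so one representative per cluster can be kept instead of a derived component.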

Re: RE: Splitting large data by variables

PaigeMiller — Mon, 19 May 2014 16:01:48 GMT

stat@sas wrote:

As the subject-matter specialist, you know best which variables should be used for data reduction. Maybe that is the reason for splitting the files

Thafu admitted splitting the data into three groups was "arbitrary", I can't see how this corresponds to using any subject matter expertise

Furthermore with 3997 variables, I don't see how anyone can use subject matter expertise to pick out the important ones that will matter in a subsequent "core analysis", but that's just me — the whole thing screams "empirical" to me

Re: RE: Splitting large data by variables

stat_sas — Mon, 19 May 2014 16:30:10 GMT

@PaigeMiller - thanks for the clarification. Arbitrary grouping will make the analysis more complicated. How a survey is designed is up to the subject-matter specialist. It is normal practice to put introductory questions at the beginning of the questionnaire, questions relating to the subject in the middle, and demographic questions at the end. Questions in the middle section of a survey usually yield numeric variables, which may be useful for analysis, while questions at the start and end provide classification variables.

Re: RE: Splitting large data by variables

jakarman — Mon, 19 May 2014 17:05:56 GMT

@stat@sas: PaigeMiller is worried about the first steps of the analysis, and I agree with him. thafu is focusing on the coding work, probably because he is not experienced in SAS.

Re: RE: Splitting large data by variables

thafu — Mon, 19 May 2014 17:42:25 GMT

OK, I get the concerns projected.

I am at the initial stage of this project, currently trying to understand the entire data before deciding which variables to retain for the subsequent analysis.

Actually, the enormity (bigness) of the data, and the fact that it cannot be read in its current form by the STATA version in my possession, are what prompted my request for arbitrary subdivision.

I should probably have asked whether there are better ways to handle such enormous data, maybe by automatically grouping related variables, without having to go through the ~4K variables visually and then code the selection by hand.

Re: RE: Splitting large data by variables

PaigeMiller — Mon, 19 May 2014 18:01:12 GMT

You keep mentioning a "subsequent analysis" or "core analysis" to be done after you "retain variables" or pick "important variables".

I don't really think you can talk about retaining variables or picking important variables in any meaningful way unless we know what the "subsequent analysis" or "core analysis" is, and you haven't told us.

You can pick x3, x17 and x2309 as being the "important variables" for "Subsequent Analysis-1", but if you end up doing "Subsequent Analysis-2", those variables could be relatively meaningless.

So I guess there are two issues: one is the hugeness of the data, and the second is the proper analysis. Maybe one dictates the other, or maybe not; I don't know.

Why do you keep mentioning STATA here? Shouldn't we be discussing SAS?

Re: RE: Splitting large data by variables

PGStats — Mon, 19 May 2014 19:25:35 GMT

I am not familiar with STATA. A quick search informs me that STATA has a CART module. I suggest you start there. First, remove any transformed variables from your dataset; CART is insensitive to monotonic transformations (logs, powers, dummy variables, etc). Then, if what's left of your data still doesn't fit in STATA, subsample. That should allow you to see what's meaningful and what's not.
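If the subsampling is done on the SAS side, one option is PROC SURVEYSELECT. A sketch under assumptions: sashelp.heart stands in for the survey data, and the sample size of 1000, the seed and the name heart_sample are arbitrary.

```sas
/* Simple random sample of 1000 observations, without replacement */
proc surveyselect data=sashelp.heart out=heart_sample
                  method=srs sampsize=1000 seed=20140519;
run;
```

Fixing SEED= makes the draw reproducible. Note that this reduces observations, not variables, so it does not by itself get around a limit on the number of columns.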

This being a SAS forum, I should also mention that JMP has a decent CART module as well.

Re: RE: Splitting large data by variables

PaigeMiller — Tue, 20 May 2014 12:32:58 GMT

PGStats makes a good suggestion as well, one that probably works in many analysis situations.

However, the original question remains so vague and undefined that I can't reconcile all the good suggestions in this thread with the original question. Specifically, CART requires a dependent variable, while the original problem asked about principal components, which cannot make use of a dependent variable; and in fact if you are going to do a future analysis with a dependent variable, principal components probably isn't a good first step, and if you are going to do an analysis that doesn't have a dependent variable, then CART probably doesn't fit.

So I'm coming to the conclusion that the whole idea of picking an analysis method at this point makes no sense here without much more additional information from the original poster.

Re: RE: Splitting large data by variables

thafu — Wed, 21 May 2014 17:00:42 GMT

Dear SAS Friends

Sure, as some of you have mentioned, my statistical protocol is not yet precise. The intention of this project is to produce an accurate report on the health situation of a sub-Saharan African country, focusing on the “socioeconomic determinants of health inequality and inequity”. The work is to be based on the protocol previously applied in two studies: one in Europe (link-doi: 10.1056/NEJMsa0707519) and another in North Africa (link-doi:10.1186/1475-9276-10-23).

Preliminary examination of the data revealed that evaluating all 4K variables will definitely be a hard job. As noted in my earlier communication, I was most interested in slicing the data arbitrarily into small sets, followed by grouping (where I suggested the use of principal component analysis).

My goal in sharing was to get hints on simple and efficient ways of summarizing or grouping these variables with limited risk of error. I am following this discussion with keen interest and hope to refine my methodology for the work. Therefore, your suggestions towards this end would be highly appreciated.

Thafu

Re: RE: Splitting large data by variables

PaigeMiller — Wed, 21 May 2014 17:46:16 GMT

The links to those two studies don't appear to work, so we can't know what is in them.

Also, I don't know why your reply appears in such small font, but could you please avoid using such a small font in the future? Thank you