RE: Splitting large data by variables

Reply
Occasional Contributor
Posts: 8

RE: Splitting large data by variables

Hello friends,

I need some help in data management, I have a large dataset of  17375 observations and 3997 variables. I wish to split this date into three sets of 17375 observations 1333 variables, while retaining all the observations and the unique identification code for future re-merging.

I wish to get help in developing this SAS code for doing the splitting

Thanks in advance, I would appreciate your assistance

Fred

Respected Advisor
Posts: 4,641

Re: RE: Splitting large data by variables

3997 variables, that's a lot of variables indeed. But splitting them arbitrarily into three sets might not be the best strategy. It might be better to organize your data differently and keep them in the same dataset. What are these variables?

PG

PG
Valued Guide
Posts: 3,208

Re: RE: Splitting large data by variables

Just guessing with a calculation.
17K observations is not much 4k variables is. Most DBMS systems do not support that amount of columns. 17k * 4K * 8bytes is about 640Mb still not big. Unless longer characters are part of the dataset you do not get to the 32-bit / 2Gb limit. As PGStats is asking what these variables are, what is the real reason to want a split up?     

---->-- ja karman --<-----
Super User
Posts: 5,255

Re: RE: Splitting large data by variables

I agree with , having that amount of variables is inconvenient in many ways. Imagine how to write programs to address all the variable by name. The oinly use case I've seen is with data mining that needs the data stacked in variables/columns.

So without knowing your requirements, my guess is that you are better off transposing your data in some way.

Data never sleeps
Super User
Super User
Posts: 7,392

Re: RE: Splitting large data by variables

Completely agree with all previous posts, just wanted to add that you could reduce large datasets into smaller ones using RDBMS theory.

Respected Advisor
Posts: 3,777

Re: RE: Splitting large data by variables

/* The input data set and key variable(s) to include in all data sets */
%let data=sashelp.heart;
%let keys=ageatstart;
/* A list of variable names withOUT the KEYS*/
proc contents noprint out=sansid(keep=name varnum) data=&data(drop=&keys);
   run;
/* put them in varnum order */
proc sort data=sansid;
   by varnum;
   run;
/* create 3 approximately equal groups */    
proc rank out=sansid groups=3;
  
var varnum;
   ranks group;
   run;
/* Generate data set name with keep= data set option with a */
/* name range variable list from the first and last name in each GROUP*/
/* write the generated code to a file*/
filename codegen temp;
data _null_;
  
file codegen;
   set sansid;
   by group;
   if first.group then put +3 'vgroup' group '(keep=' "&keys" +1 name '--' @@;
   if last.group then put +1 name ')';
  
run;
/* create the new data sets*/
data
  
%inc codegen / source2;
   ;
   /*This merge is important to unsure the keys are on the left and not in the name ranges*/    
   merge &data(keep=&keys) &data(drop=&keys);
   run;

Message was edited by: data _null_

Occasional Contributor
Posts: 8

Re: RE: Splitting large data by variables


Hello data_null_,

Thanks for the codes. However, for clarity, i have one request, could you please add brief descriptors to your codes to enable me follow through

Regards

Fred

Occasional Contributor
Posts: 8

Re: RE: Splitting large data by variables

Thank you all for your generosity,

To respond to your questions; this is a government data of household surveillance. The study objective is to "investigate the socio-economic déterminants of health inequality and inequity". The first step was to arbitrary break this large dataset into smaller ones, then asses the important variables for use. The selected varaibles could then be consolidate into afew manageble number via the principle component analysis. Finally, i am to merge the consolidated dataset and perform the core analysis of the study.

I also wished to split this dataset to enable alternative statistical manipulation in STATA 13 Platform, which i have more compétence, but the Platform has limited amount of data it can handle.

Warm regards

Trusted Advisor
Posts: 1,608

Re: RE: Splitting large data by variables

The first step was to arbitrary break this large dataset into smaller ones, then asses the important variables for use. The selected varaibles could then be consolidate into afew manageble number via the principle component analysis. Finally, i am to merge the consolidated dataset and perform the core analysis of the study.

In my opinion, if the goal is to find "important variables" by some statistical method like principal (not principle) components analysis, then you don't want to split the data at all, you want to run the analysis on the ENTIRE data set. I realize that this may cause problems if your computer doesn't have enough memory, but there are algorithms that would allow principal components to extract a few components (instead of all 3997 components) that would be much less likely to cause issues where you run out of memory.

Splitting the dataset into "arbitrary" thirds is simply the wrong way to go here with any statistical procedure. Furthermore, the unspecified "core analysis" of this study could greatly suffer depending on how you select these "important variables", and it WILL greatly suffer if you select these "important variables" from arbitrary thirds of the data.

Valued Guide
Posts: 3,208

Re: RE: Splitting large data by variables

thafu,  You are saying it is government data of household surveillance. Your first job will be understanding the data.
I assume the records are organized by households. The big number of variables could be caused by some repetition of measurements by time.
Those could be evaluated as a time-series analysis possible given one predictor.

Having your cleaned optimized that way you can do a next step. Hypothesis testing or using the predictive analytics common with data mining.

The way you are going to do things with your data may be different on those two.
The data mining approach is needing one or several target values on wich you are going to train  and validate. A separation on your data is needed with that.

I am missing that in your question.

---->-- ja karman --<-----
Trusted Advisor
Posts: 1,204

Re: RE: Splitting large data by variables

Being a subject matter specialist you better know which variables should be used for data reduction and may be this is the reason for splitting files or  you want to retain only numeric variables in one of the splitted files to apply PCA. If you are considering using principal components then you will have to rely on the principal components instead of original variables for further analyses. If you are looking to retain original variables in the analysis please try to explore proc varclus.

Trusted Advisor
Posts: 1,608

Re: RE: Splitting large data by variables

stat@sas wrote:

Being a subject matter specialist you better know which variables should be used for data reduction and may be this is the reason for splitting files

Thafu admitted splitting the data into three groups was "arbitrary", I can't see how this corresponds to using any subject matter expertise

Furthermore with 3997 variables, I don't see how anyone can use subject matter expertise to pick out the important ones that will matter in a subsequent "core analysis", but that's just me — the whole thing screams "empirical" to me

Trusted Advisor
Posts: 1,204

Re: RE: Splitting large data by variables

@PaigeMiller - thanks for clarification. Arbitrary grouping will make analysis more complicated. In designing surveys this is up to the subject matter specialist how to design surveys. This is a normal practice to put introductory questions in the begining, questions relating to subject in the middle and demo questions at the end on the questionnaire. Questions in the middle section of survey usually contain numeric variables which may be useful for analysis, while questions in the start and end provide classification variables.

Valued Guide
Posts: 3,208

Re: RE: Splitting large data by variables

@ stat@sas as PaigeMiller is worried about the first steps for the analyses I agree with him. The Focus shows on the coding work but not being experienced in SAS is the reason.                 

---->-- ja karman --<-----
Occasional Contributor
Posts: 8

Re: RE: Splitting large data by variables

OK, I get the concerns projected.

I am at the initial stage of this project, currently trying to understand the entire data before deciding which variables to retain for the subsequent analysis.

Actually, the enormity(biggness) of the data and the inability to be  evaluated (read) in its current form, in the STATA version in my possession, are what prompted my request for arbiterary subdivision.

I should probably have asked if there are any better ways to handle such enormous data, may be by automatically grouping related variables without having to go through variables visually and then physically coding for the selection from the ~4K variables.

Ask a Question
Discussion stats
  • 23 replies
  • 685 views
  • 4 likes
  • 8 in conversation