05-18-2014 01:05 PM
I need some help with data management. I have a large dataset of 17,375 observations and 3,997 variables. I wish to split this data into three sets of 17,375 observations and roughly 1,333 variables each, while retaining all the observations and the unique identification code in every set for future re-merging.
I would like help developing the SAS code to do this splitting.
Thanks in advance; I would appreciate your assistance.
05-18-2014 02:58 PM
3997 variables, that's a lot of variables indeed. But splitting them arbitrarily into three sets might not be the best strategy. It might be better to organize your data differently and keep them in the same dataset. What are these variables?
05-18-2014 03:28 PM
Just guessing with a calculation.
17K observations is not much; 4K variables is. Most DBMS systems do not support that many columns. 17,375 × 3,997 × 8 bytes is roughly 550 MB, still not big. Unless long character variables are part of the dataset, you will not hit the 32-bit / 2 GB limit. As PGStats is asking: what are these variables, and what is the real reason for wanting a split?
05-19-2014 03:34 AM
I agree with the others: having that many variables is inconvenient in many ways. Imagine writing programs that address all the variables by name. The only use case I've seen is data mining, which needs the data stacked in variables/columns.
So without knowing your requirements, my guess is that you are better off transposing your data in some way.
05-19-2014 06:55 AM
Message was edited by: data _null_
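For reference, a minimal sketch of one way to do the requested three-way split. Every name here is an assumption, not taken from the (edited) post: the input dataset is called `have`, the unique identifier is `id`, and the 1332-per-set counts assume 3,996 non-ID variables. It also assumes each space-separated list of names fits within the 32,767-character limit of a DATA step character variable.

```sas
/* 1. Get the variable names in their stored order.
   HAVE and ID are hypothetical names -- adjust to your data. */
proc contents data=have out=vars(keep=name varnum) noprint;
run;

proc sort data=vars;
    by varnum;
run;

/* 2. Build three space-separated name lists of ~1332 variables each,
      leaving the identifier out of all of them */
data _null_;
    set vars end=last;
    where upcase(name) ne 'ID';        /* WHERE filters before _N_ counts */
    length l1 l2 l3 $32767;
    retain l1 l2 l3;
    if      _n_ <= 1332 then l1 = catx(' ', l1, name);
    else if _n_ <= 2664 then l2 = catx(' ', l2, name);
    else                     l3 = catx(' ', l3, name);
    if last then do;
        call symputx('list1', l1);
        call symputx('list2', l2);
        call symputx('list3', l3);
    end;
run;

/* 3. Write the three parts, each keeping the identifier for re-merging */
data part1; set have; keep id &list1; run;
data part2; set have; keep id &list2; run;
data part3; set have; keep id &list3; run;
```

Re-merging later would then be a sort of each part by `id` followed by a `MERGE part1 part2 part3; BY id;` step.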
05-19-2014 07:33 AM
Thanks for the code. For clarity, I have one request: could you please add brief comments to your code so that I can follow along?
05-19-2014 07:14 AM
Thank you all for your generosity.
To respond to your questions: this is government data from a household surveillance study. The study objective is to "investigate the socio-economic determinants of health inequality and inequity". The first step was to arbitrarily break this large dataset into smaller ones, then assess the important variables for use. The selected variables could then be consolidated into a few manageable ones via principal component analysis. Finally, I am to merge the consolidated dataset and perform the core analysis of the study.
I also wished to split this dataset to enable alternative statistical manipulation on the Stata 13 platform, with which I am more competent, but that platform limits the amount of data it can handle.
05-19-2014 09:07 AM
The first step was to arbitrarily break this large dataset into smaller ones, then assess the important variables for use. The selected variables could then be consolidated into a few manageable ones via principal component analysis. Finally, I am to merge the consolidated dataset and perform the core analysis of the study.
In my opinion, if the goal is to find "important variables" by some statistical method like principal components analysis, then you don't want to split the data at all; you want to run the analysis on the ENTIRE data set. I realize that this may cause problems if your computer doesn't have enough memory, but there are algorithms that extract just a few principal components (instead of all 3997) and are much less likely to run out of memory.
Splitting the dataset into "arbitrary" thirds is simply the wrong way to go here with any statistical procedure. Furthermore, the unspecified "core analysis" of this study could greatly suffer depending on how you select these "important variables", and it WILL greatly suffer if you select these "important variables" from arbitrary thirds of the data.
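To illustrate the point about extracting only a few components from the full data, here is a hedged sketch. The dataset name `have`, the identifier `id`, and the choice of 10 components are illustrative assumptions, not part of the original discussion; if the identifier is numeric, list the analysis variables explicitly instead of using `_NUMERIC_` so the ID is not analyzed.

```sas
/* Sketch: PCA on the whole dataset at once, keeping only the first 10
   component scores plus the identifier for later re-merging */
proc princomp data=have n=10 out=scores(keep=id prin1-prin10);
    var _numeric_;
run;
```

The `N=` option limits how many components are computed and written to the `OUT=` dataset, which is what keeps the output manageable.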
05-19-2014 07:50 AM
thafu, you say this is government household surveillance data. Your first job will be understanding the data.
I assume the records are organized by household. The large number of variables could be caused by repeated measurements over time.
Those could be evaluated as a time series, possibly given one predictor.
Having your data cleaned and organized that way, you can take the next step: hypothesis testing, or the predictive analytics common in data mining.
How you handle your data may differ between those two.
The data mining approach needs one or several target variables on which you train and validate; a separation of your data (e.g. training/validation) is needed for that.
I am missing that in your question.
05-19-2014 11:24 AM
As the subject matter specialist, you are best placed to know which variables should be used for data reduction; maybe that is the reason for splitting the files, or perhaps you want to retain only the numeric variables in one of the split files to apply PCA. If you use principal components, you will have to rely on the components instead of the original variables for further analyses. If you want to retain original variables in the analysis, please explore PROC VARCLUS.
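A minimal sketch of the PROC VARCLUS suggestion. The dataset name `have` and the `MAXCLUSTERS=50` cap are illustrative assumptions only:

```sas
/* Sketch: group correlated numeric variables into clusters; the SHORT
   option trims the printed output for a run with thousands of variables */
proc varclus data=have maxclusters=50 short;
    var _numeric_;
run;
```

From each cluster, one representative variable is typically retained (the output's 1-R² ratio is a common criterion: lower is better), which keeps the original variables interpretable, unlike principal component scores.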
05-19-2014 12:01 PM
As the subject matter specialist, you are best placed to know which variables should be used for data reduction; maybe that is the reason for splitting the files
Thafu admitted splitting the data into three groups was "arbitrary"; I can't see how this corresponds to using any subject matter expertise.
Furthermore, with 3997 variables, I don't see how anyone can use subject matter expertise to pick out the important ones that will matter in a subsequent "core analysis". But that's just me; the whole thing screams "empirical" to me.
05-19-2014 12:30 PM
@PaigeMiller - thanks for the clarification. Arbitrary grouping will make the analysis more complicated. In designing surveys, it is up to the subject matter specialist how to structure the questionnaire. Normal practice is to put introductory questions at the beginning, questions relating to the subject in the middle, and demographic questions at the end. The middle section of the survey usually contains the numeric variables that may be useful for analysis, while the opening and closing questions provide classification variables.
05-19-2014 01:42 PM
OK, I take the concerns raised on board.
I am at the initial stage of this project, currently trying to understand the entire dataset before deciding which variables to retain for the subsequent analysis.
Actually, the sheer size of the data, and my inability to read it in its current form in the Stata version I have, are what prompted my request for an arbitrary subdivision.
I should probably have asked whether there are better ways to handle such an enormous dataset, perhaps by automatically grouping related variables, without having to go through the variables visually and then hand-code the selection from the ~4K variables.
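On selecting related variables without typing every name: if the survey export uses shared name prefixes (common in questionnaire data), SAS name-prefix lists can pull out a block in one statement. The prefix `inc` and the names `have` and `id` below are hypothetical:

```sas
/* Sketch: keep the identifier plus every variable whose name starts
   with INC, without listing the variables individually */
data income_block;
    set have;
    keep id inc: ;
run;
```

The same colon shorthand works in VAR statements of analysis procedures, so thematic blocks can be analyzed one at a time while the full dataset stays intact.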