thafu
Calcite | Level 5

Hello friends,

I need some help with data management. I have a large dataset of 17,375 observations and 3,997 variables. I wish to split this data into three sets, each with all 17,375 observations and roughly 1,333 variables, while retaining the unique identification code in every set for future re-merging.

I would appreciate help developing the SAS code for this splitting.

Thanks in advance for your assistance.

Fred

23 REPLIES
PGStats
Opal | Level 21

3997 variables, that's a lot of variables indeed. But splitting them arbitrarily into three sets might not be the best strategy. It might be better to organize your data differently and keep them in the same dataset. What are these variables?

PG
jakarman
Barite | Level 11

Just guessing with a calculation.
17K observations is not much; 4K variables is. Most DBMS systems do not support that number of columns. 17K * 4K * 8 bytes is about 550 MB, still not big. Unless longer character variables are part of the dataset, you will not hit the 32-bit 2 GB limit. As PGStats is asking what these variables are: what is the real reason for wanting a split-up?
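A quick back-of-the-envelope check of that size estimate, done in SAS itself (the only inputs are the dimensions from the original post, assuming all-numeric 8-byte values):

/* observations x variables x 8 bytes per numeric value, in decimal MB */
data _null_;
   size_mb = 17375 * 3997 * 8 / 1e6;
   put 'Approximate size: ' size_mb 8.1 ' MB';
run;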

---->-- ja karman --<-----
LinusH
Tourmaline | Level 20

I agree with the previous posters; having that number of variables is inconvenient in many ways. Imagine how to write programs that address all the variables by name. The only use case I've seen is with data mining that needs the data stacked in variables/columns.

So without knowing your requirements, my guess is that you are better off transposing your data in some way.
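For example, a minimal sketch of a wide-to-long transpose (the data set name have, the key variable hhid, and the q: name prefix are hypothetical, since we don't know your actual names; it also assumes the q: variables are all numeric and the data are sorted by hhid):

/* one row per household becomes one row per household/variable pair */
proc transpose data=have out=long(rename=(_name_=varname col1=value));
   by hhid;
   var q:;
run;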

Data never sleeps
RW9
Diamond | Level 26

Completely agree with all previous posts; just wanted to add that you could reduce large datasets into smaller, related ones using RDBMS (normalization) theory.
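A minimal sketch of that idea, splitting one wide table into topic-based tables that all carry the key so they can be re-merged later (the data set name have, the key hhid, and the variable groupings are hypothetical):

/* one pass over the wide table writes several narrow, re-mergeable tables */
data demographics(keep=hhid age sex region)
     income(keep=hhid wage pension transfers)
     health(keep=hhid bmi chronic visits);
   set have;
run;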

data_null__
Jade | Level 19
/* The input data set and the key variable(s) to include in all output data sets */
%let data=sashelp.heart;
%let keys=ageatstart;

/* A list of variable names, withOUT the KEYS */
proc contents noprint out=sansid(keep=name varnum) data=&data(drop=&keys);
   run;

/* Put them in VARNUM (position) order */
proc sort data=sansid;
   by varnum;
   run;

/* Create 3 approximately equal groups of variables */
proc rank data=sansid out=sansid groups=3;
   var varnum;
   ranks group;
   run;

/* Generate a data set name with a KEEP= data set option built from a */
/* name-range variable list (first name -- last name) for each GROUP, */
/* and write the generated code to a file                             */
filename codegen temp;
data _null_;
   file codegen;
   set sansid;
   by group;
   if first.group then put +3 'vgroup' group '(keep=' "&keys" +1 name '--' @@;
   if last.group then put +1 name ')';
   run;

/* Create the new data sets by %INCLUDEing the generated code */
data
   %inc codegen / source2;
   ;
   /* This merge is important to ensure the keys are on the left and not in the name ranges */
   merge &data(keep=&keys) &data(drop=&keys);
   run;

Message was edited by: data _null_

thafu
Calcite | Level 5


Hello data_null_,

Thanks for the code. However, for clarity I have one request: could you please add brief descriptions to your code so that I can follow along?

Regards

Fred

thafu
Calcite | Level 5

Thank you all for your generosity,

To respond to your questions: this is government data from household surveillance. The study objective is to "investigate the socio-economic determinants of health inequality and inequity". The first step was to arbitrarily break this large dataset into smaller ones, then assess the important variables for use. The selected variables could then be consolidated into a few manageable components via principal component analysis. Finally, I am to merge the consolidated dataset and perform the core analysis of the study.

I also wished to split this dataset to enable alternative statistical manipulation on the STATA 13 platform, in which I have more competence, but that platform limits the amount of data it can handle.

Warm regards

PaigeMiller
Diamond | Level 26

The first step was to arbitrarily break this large dataset into smaller ones, then assess the important variables for use. The selected variables could then be consolidated into a few manageable components via principal component analysis. Finally, I am to merge the consolidated dataset and perform the core analysis of the study.

In my opinion, if the goal is to find "important variables" by some statistical method like principal (not principle) components analysis, then you don't want to split the data at all; you want to run the analysis on the ENTIRE data set. I realize this may cause problems if your computer doesn't have enough memory, but there are algorithms that extract just a few principal components (instead of all 3997), and those are much less likely to run out of memory.
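For instance, a minimal sketch that asks for only a handful of components on the full data (the data set name have is hypothetical, and in practice you would exclude the identifier from the VAR list):

/* compute only the first 10 principal components from the numeric variables */
proc princomp data=have n=10 out=scores;
   var _numeric_;
run;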

Splitting the dataset into "arbitrary" thirds is simply the wrong way to go here with any statistical procedure. Furthermore, the unspecified "core analysis" of this study could greatly suffer depending on how you select these "important variables", and it WILL greatly suffer if you select these "important variables" from arbitrary thirds of the data.

--
Paige Miller
jakarman
Barite | Level 11

thafu, you are saying it is government data from household surveillance. Your first job will be understanding the data.
I assume the records are organized by household. The big number of variables could be caused by some repetition of measurements over time.
Those could be evaluated with a time-series analysis, possibly given one predictor.

Having your data cleaned and organized that way, you can take the next step: hypothesis testing, or the predictive analytics common in data mining.

The way you handle your data may differ between those two.
The data mining approach needs one or several target values on which you are going to train and validate; a separation of your data is needed for that (see the sketch below).

I am missing that in your question.
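A minimal sketch of such a train/validate separation (the data set name have, the 70/30 split rate, and the seed are hypothetical):

/* randomly flag ~70% of rows for training; OUTALL keeps every row */
/* and adds a Selected variable (1 = training, 0 = validation)     */
proc surveyselect data=have out=split samprate=0.7 outall seed=20140901;
run;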

---->-- ja karman --<-----
stat_sas
Ammonite | Level 13

Being a subject matter specialist, you know best which variables should be used for data reduction, and maybe that is the reason for splitting the files; or perhaps you want to retain only the numeric variables in one of the split files to apply PCA. If you are considering principal components, then you will have to rely on the components instead of the original variables for further analyses. If you are looking to retain the original variables in the analysis, please explore PROC VARCLUS.
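A minimal sketch of that approach (the data set name have and the cluster count are hypothetical):

/* group the numeric variables into up to 50 clusters of related variables; */
/* one representative per cluster can then stand in for its group           */
proc varclus data=have maxclusters=50 short;
   var _numeric_;
run;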

PaigeMiller
Diamond | Level 26

stat_sas wrote:

Being a subject matter specialist, you know best which variables should be used for data reduction, and maybe that is the reason for splitting the files

Thafu admitted splitting the data into three groups was "arbitrary"; I can't see how this corresponds to using any subject matter expertise.

Furthermore, with 3997 variables, I don't see how anyone can use subject matter expertise to pick out the important ones that will matter in a subsequent "core analysis", but that's just me; the whole thing screams "empirical" to me.

--
Paige Miller
stat_sas
Ammonite | Level 13

@PaigeMiller - thanks for the clarification. Arbitrary grouping will make the analysis more complicated. In survey design, it is up to the subject matter specialist how to structure the questionnaire. It is normal practice to put introductory questions at the beginning, questions relating to the subject in the middle, and demographic questions at the end of the questionnaire. Questions in the middle section usually contain numeric variables which may be useful for analysis, while questions at the start and end provide classification variables.

jakarman
Barite | Level 11

@stat_sas: since PaigeMiller is worried about the first steps of the analysis, I agree with him. The focus appears to be on the coding work, but limited experience with SAS may be the reason.

---->-- ja karman --<-----
thafu
Calcite | Level 5

OK, I take the concerns raised.

I am at the initial stage of this project, currently trying to understand the entire dataset before deciding which variables to retain for the subsequent analysis.

Actually, the sheer size of the data and the inability to evaluate (read) it in its current form in the STATA version in my possession are what prompted my request for an arbitrary subdivision.

I should probably have asked whether there are better ways to handle such enormous data, perhaps by automatically grouping related variables without having to go through the variables visually and then hand-coding the selection from the ~4K variables.

