BookmarkSubscribeRSS Feed
Barbara_66
Calcite | Level 5

I've been asked to help someone split a dataset into smaller datasets by groups of variables.  The input dataset is 1 million records and 12,000 variables.  All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work.  Is there a way to split this into about 12 datasets with about 1000 variables in each? 

10 REPLIES 10
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Its not going to be any quicker spltting the file up, in fact, taking into account the time to split the dataset, read/write, and loop over the datasets, its most likely to be longer.  Why do you have a dataset with 12000 variables?  100 variable is way too much for a dataset.  Consider approaching that with an eye to  re-modelling the data.

Barbara_66
Calcite | Level 5
It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.
Barbara_66
Calcite | Level 5
This example splits the data into groups by observations, but I want to maintain all the observations and create datasets that only have 10% of the variables.
data_null__
Jade | Level 19

@Barbara_66 wrote:
It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.

Why do you think you need to sort the data? 

ballardw
Super User

@Barbara_66 wrote:

I've been asked to help someone split a dataset into smaller datasets by groups of variables.  The input dataset is 1 million records and 12,000 variables.  All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work.  Is there a way to split this into about 12 datasets with about 1000 variables in each? 


 

Note: I hope the two sets are sorted by the same variables before you start.

 

 

If there are specific variables of interest then you can reduce the problem by only bringing in the variables of concern and/or using WITH and VAR options or using the FIRSTOBS and OBS or WHERE dataset options

Some examples:

proc compare base   =dataset1 (where=(groupvar='3'))
             compare=dataset2 (where=(groupvar='3'))
;
   var basevar1 basevar2;
   with comparevar1 comparevar2;
run;

proc compare base   =dataset1 (where=(groupvar='3'))
             compare=dataset2 (where=(groupvar='3'))
;
run;

proc compare base   =dataset1 (firstobs=10000 obs=20000)
             compare=dataset2 (firstobs=10000 obs=20000)
;
run;

proc compare base   =dataset1 (keep=var1 var2 var3 var4)
             compare=dataset2 (keep=var1 var2 var3 var4)
;
run;

Also which output options are they using? NO one is going to manually look through millions of lines of output.

 

FriedEgg
SAS Employee

%let lib=SASHELP;
%let data=CARS;
%let keys=MAKE MODEL;
%let splits=3;

%macro qlist(list,dlm=%str( ));
(%unquote(%str(%")%qsysfunc(tranwrd(&list,%str( ),%str("&dlm")))%str(%")))
%mend;

data _null_;
call streaminit(23463);
array v[&splits] $ 32767 _temporary_;

do until (done);
set sashelp.vcolumn end=done;
where libname="%upcase(&lib)" and
memname="%upcase(&data)" and
upcase(name) not in %qlist(&keys);

i=(ceil(&splits * rand('uniform')));
v[i] = catx(' ', v[i], name);
end;

do i=1 to &splits;
call execute('data ' || cats("&data",i) || ';'
|| "set &lib..&data (keep=&keys " || strip(v[i]) || ');'
|| 'run;');
call symputx(cats('keep',i),v[i]);
end;
run;

rogerjdeangelis
Barite | Level 11
* splitting sashelp.cars into 'make' datasets;

MAKE             Frequency
-----------------------------
Acura                   7
Audi                   19
BMW                    20
Buick                   9
Cadillac                8
...
Suzuki                  8
Toyota                 28
Volkswagen             15
Volvo                  12


%symdel make; * just in case;
data _null_;
   set sashelp.cars;
   by make;
   if first.make then do;
     call symputx('make',make);
     rc=dosubl('
        data &make;
           set sashelp.cars(where=(make="&make"));
        run;quit;
     ');
   end;
run;quit;

SYMBOLGEN:  Macro variable MAKE resolves to Acura
SYMBOLGEN:  Macro variable MAKE resolves to Acura
NOTE: There were 7 observations read from the data set SASHELP.CARS.
      WHERE make='Acura';
...
SYMBOLGEN:  Macro variable MAKE resolves to Volvo
SYMBOLGEN:  Macro variable MAKE resolves to Volvo
NOTE: There were 12 observations read from the data set SASHELP.CARS.
      WHERE make='Volvo';



FriedEgg
SAS Employee
@rogerdeangelis, this is a vertical split. User wanted a horizontal split. I made the same mistake originally. However, as many have already commented, splitting this file at all makes little sense and there are numerous ways the OP can approach their actual problem without splitting.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2707 views
  • 1 like
  • 6 in conversation