Re: How do I split a dataset into smaller data set by variables

Barbara_66 · Posted 02-09-2017 09:38 AM

I've been asked to help someone split a dataset into smaller datasets by groups of variables. The input dataset is 1 million records and 12,000 variables. All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work. Is there a way to split this into about 12 datasets with about 1000 variables in each?

RW9 · Posted 02-09-2017 09:51 AM

Its not going to be any quicker spltting the file up, in fact, taking into account the time to split the dataset, read/write, and loop over the datasets, its most likely to be longer. Why do you have a dataset with 12000 variables? 100 variable is way too much for a dataset. Consider approaching that with an eye to re-modelling the data.

Barbara_66 · Posted 02-09-2017 10:08 AM

It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.

FriedEgg · Posted 02-09-2017 10:21 AM

http://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

Barbara_66 · Posted 02-09-2017 10:33 AM

This example splits the data into groups by observations, but I want to maintain all the observations and create datasets that only have 10% of the variables.

FriedEgg · Posted 02-09-2017 10:40 AM

Sorry, I missed that.

You should sort using tagsort if you are running out of space.

https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146878.htm

data_null__ · Posted 02-09-2017 12:04 PM

@Barbara_66 wrote:
It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.

Why do you think you need to sort the data?

ballardw · Posted 02-09-2017 10:43 AM

@Barbara_66 wrote:

I've been asked to help someone split a dataset into smaller datasets by groups of variables. The input dataset is 1 million records and 12,000 variables. All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work. Is there a way to split this into about 12 datasets with about 1000 variables in each?

Note: I hope the two sets are sorted by the same variables before you start.

If there are specific variables of interest then you can reduce the problem by only bringing in the variables of concern and/or using WITH and VAR options or using the FIRSTOBS and OBS or WHERE dataset options

Some examples:

proc compare base   =dataset1 (where=(groupvar='3'))
             compare=dataset2 (where=(groupvar='3'))
;
   var basevar1 basevar2;
   with comparevar1 comparevar2;
run;

proc compare base   =dataset1 (where=(groupvar='3'))
             compare=dataset2 (where=(groupvar='3'))
;
run;

proc compare base   =dataset1 (firstobs=10000 obs=20000)
             compare=dataset2 (firstobs=10000 obs=20000)
;
run;

proc compare base   =dataset1 (keep=var1 var2 var3 var4)
             compare=dataset2 (keep=var1 var2 var3 var4)
;
run;

Also which output options are they using? NO one is going to manually look through millions of lines of output.

FriedEgg · Posted 02-09-2017 11:05 AM

%let lib=SASHELP;
%let data=CARS;
%let keys=MAKE MODEL;
%let splits=3;

%macro qlist(list,dlm=%str( ));
(%unquote(%str(%")%qsysfunc(tranwrd(&list,%str( ),%str("&dlm")))%str(%")))
%mend;

data _null_;
call streaminit(23463);
array v[&splits] $ 32767 _temporary_;

do until (done);
set sashelp.vcolumn end=done;
where libname="%upcase(&lib)" and
memname="%upcase(&data)" and
upcase(name) not in %qlist(&keys);

i=(ceil(&splits * rand('uniform')));
v[i] = catx(' ', v[i], name);
end;

do i=1 to &splits;
call execute('data ' || cats("&data",i) || ';'
|| "set &lib..&data (keep=&keys " || strip(v[i]) || ');'
|| 'run;');
call symputx(cats('keep',i),v[i]);
end;
run;

rogerjdeangelis · Posted 02-09-2017 12:49 PM

* splitting sashelp.cars into 'make' datasets;

MAKE             Frequency
-----------------------------
Acura                   7
Audi                   19
BMW                    20
Buick                   9
Cadillac                8
...
Suzuki                  8
Toyota                 28
Volkswagen             15
Volvo                  12


%symdel make; * just in case;
data _null_;
   set sashelp.cars;
   by make;
   if first.make then do;
     call symputx('make',make);
     rc=dosubl('
        data &make;
           set sashelp.cars(where=(make="&make"));
        run;quit;
     ');
   end;
run;quit;

SYMBOLGEN:  Macro variable MAKE resolves to Acura
SYMBOLGEN:  Macro variable MAKE resolves to Acura
NOTE: There were 7 observations read from the data set SASHELP.CARS.
      WHERE make='Acura';
...
SYMBOLGEN:  Macro variable MAKE resolves to Volvo
SYMBOLGEN:  Macro variable MAKE resolves to Volvo
NOTE: There were 12 observations read from the data set SASHELP.CARS.
      WHERE make='Volvo';

FriedEgg · Posted 02-09-2017 12:57 PM

@rogerdeangelis, this is a vertical split. User wanted a horizontal split. I made the same mistake originally. However, as many have already commented, splitting this file at all makes little sense and there are numerous ways the OP can approach their actual problem without splitting.

Register Today!

SAS Training: Just a Click Away