DATA Step, Macro, Functions and more

How do I split a dataset into smaller data set by variables

Reply
New Contributor
Posts: 3

How do I split a dataset into smaller data set by variables

I've been asked to help someone split a dataset into smaller datasets by groups of variables.  The input dataset is 1 million records and 12,000 variables.  All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work.  Is there a way to split this into about 12 datasets with about 1000 variables in each? 

Super User
Super User
Posts: 7,401

Re: How do I split a dataset into smaller data set by variables

Its not going to be any quicker spltting the file up, in fact, taking into account the time to split the dataset, read/write, and loop over the datasets, its most likely to be longer.  Why do you have a dataset with 12000 variables?  100 variable is way too much for a dataset.  Consider approaching that with an eye to  re-modelling the data.

New Contributor
Posts: 3

Re: How do I split a dataset into smaller data set by variables

It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.
Trusted Advisor
Posts: 1,300

Re: How do I split a dataset into smaller data set by variables

New Contributor
Posts: 3

Re: How do I split a dataset into smaller data set by variables

This example splits the data into groups by observations, but I want to maintain all the observations and create datasets that only have 10% of the variables.
Trusted Advisor
Posts: 1,300

Re: How do I split a dataset into smaller data set by variables

Sorry, I missed that.

You should sort using tagsort if you are running out of space.

https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146878.htm
Respected Advisor
Posts: 3,777

Re: How do I split a dataset into smaller data set by variables


Barbara_66 wrote:
It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.

Why do you think you need to sort the data? 

Super User
Posts: 10,500

Re: How do I split a dataset into smaller data set by variables


Barbara_66 wrote:

I've been asked to help someone split a dataset into smaller datasets by groups of variables.  The input dataset is 1 million records and 12,000 variables.  All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work.  Is there a way to split this into about 12 datasets with about 1000 variables in each? 


 

Note: I hope the two sets are sorted by the same variables before you start.

 

 

If there are specific variables of interest then you can reduce the problem by only bringing in the variables of concern and/or using WITH and VAR options or using the FIRSTOBS and OBS or WHERE dataset options

Some examples:

proc compare base   =dataset1 (where=(groupvar='3'))
             compare=dataset2 (where=(groupvar='3'))
;
   var basevar1 basevar2;
   with comparevar1 comparevar2;
run;

proc compare base   =dataset1 (where=(groupvar='3'))
             compare=dataset2 (where=(groupvar='3'))
;
run;

proc compare base   =dataset1 (firstobs=10000 obs=20000)
             compare=dataset2 (firstobs=10000 obs=20000)
;
run;

proc compare base   =dataset1 (keep=var1 var2 var3 var4)
             compare=dataset2 (keep=var1 var2 var3 var4)
;
run;

Also which output options are they using? NO one is going to manually look through millions of lines of output.

 

Trusted Advisor
Posts: 1,300

Re: How do I split a dataset into smaller data set by variables

%let lib=SASHELP;
%let data=CARS;
%let keys=MAKE MODEL;
%let splits=3;

%macro qlist(list,dlm=%str( ));
(%unquote(%str(%")%qsysfunc(tranwrd(&list,%str( ),%str("&dlm")))%str(%")))
%mend;

data _null_;
call streaminit(23463);
array v[&splits] $ 32767 _temporary_;

do until (done);
set sashelp.vcolumn end=done;
where libname="%upcase(&lib)" and
memname="%upcase(&data)" and
upcase(name) not in %qlist(&keys);

i=(ceil(&splits * rand('uniform')));
v[i] = catx(' ', v[i], name);
end;

do i=1 to &splits;
call execute('data ' || cats("&data",i) || ';'
|| "set &lib..&data (keep=&keys " || strip(v[i]) || ');'
|| 'run;');
call symputx(cats('keep',i),v[i]);
end;
run;

Valued Guide
Posts: 505

Re: How do I split a dataset into smaller data set by variables

* splitting sashelp.cars into 'make' datasets;

MAKE             Frequency
-----------------------------
Acura                   7
Audi                   19
BMW                    20
Buick                   9
Cadillac                8
...
Suzuki                  8
Toyota                 28
Volkswagen             15
Volvo                  12


%symdel make; * just in case;
data _null_;
   set sashelp.cars;
   by make;
   if first.make then do;
     call symputx('make',make);
     rc=dosubl('
        data &make;
           set sashelp.cars(where=(make="&make"));
        run;quit;
     ');
   end;
run;quit;

SYMBOLGEN:  Macro variable MAKE resolves to Acura
SYMBOLGEN:  Macro variable MAKE resolves to Acura
NOTE: There were 7 observations read from the data set SASHELP.CARS.
      WHERE make='Acura';
...
SYMBOLGEN:  Macro variable MAKE resolves to Volvo
SYMBOLGEN:  Macro variable MAKE resolves to Volvo
NOTE: There were 12 observations read from the data set SASHELP.CARS.
      WHERE make='Volvo';



Trusted Advisor
Posts: 1,300

Re: How do I split a dataset into smaller data set by variables

@rogerdeangelis, this is a vertical split. User wanted a horizontal split. I made the same mistake originally. However, as many have already commented, splitting this file at all makes little sense and there are numerous ways the OP can approach their actual problem without splitting.
Ask a Question
Discussion stats
  • 10 replies
  • 159 views
  • 1 like
  • 6 in conversation