I've been asked to help someone split a dataset into smaller datasets by groups of variables. The input dataset is 1 million records and 12,000 variables. All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work. Is there a way to split this into about 12 datasets with about 1000 variables in each?
Its not going to be any quicker spltting the file up, in fact, taking into account the time to split the dataset, read/write, and loop over the datasets, its most likely to be longer. Why do you have a dataset with 12000 variables? 100 variable is way too much for a dataset. Consider approaching that with an eye to re-modelling the data.
@Barbara_66 wrote:
It's not my data. It won't be quicker to split it up, but I can't even sort this dataset because I run out of workspace.
Why do you think you need to sort the data?
@Barbara_66 wrote:
I've been asked to help someone split a dataset into smaller datasets by groups of variables. The input dataset is 1 million records and 12,000 variables. All they want to do is run a proc compare between 2 versions of this dataset, but it is too large for this to work. Is there a way to split this into about 12 datasets with about 1000 variables in each?
Note: I hope the two sets are sorted by the same variables before you start.
If there are specific variables of interest then you can reduce the problem by only bringing in the variables of concern and/or using WITH and VAR options or using the FIRSTOBS and OBS or WHERE dataset options
Some examples:
proc compare base =dataset1 (where=(groupvar='3')) compare=dataset2 (where=(groupvar='3')) ; var basevar1 basevar2; with comparevar1 comparevar2; run; proc compare base =dataset1 (where=(groupvar='3')) compare=dataset2 (where=(groupvar='3')) ; run; proc compare base =dataset1 (firstobs=10000 obs=20000) compare=dataset2 (firstobs=10000 obs=20000) ; run; proc compare base =dataset1 (keep=var1 var2 var3 var4) compare=dataset2 (keep=var1 var2 var3 var4) ; run;
Also which output options are they using? NO one is going to manually look through millions of lines of output.
%let lib=SASHELP;
%let data=CARS;
%let keys=MAKE MODEL;
%let splits=3;
%macro qlist(list,dlm=%str( ));
(%unquote(%str(%")%qsysfunc(tranwrd(&list,%str( ),%str("&dlm")))%str(%")))
%mend;
data _null_;
call streaminit(23463);
array v[&splits] $ 32767 _temporary_;
do until (done);
set sashelp.vcolumn end=done;
where libname="%upcase(&lib)" and
memname="%upcase(&data)" and
upcase(name) not in %qlist(&keys);
i=(ceil(&splits * rand('uniform')));
v[i] = catx(' ', v[i], name);
end;
do i=1 to &splits;
call execute('data ' || cats("&data",i) || ';'
|| "set &lib..&data (keep=&keys " || strip(v[i]) || ');'
|| 'run;');
call symputx(cats('keep',i),v[i]);
end;
run;
* splitting sashelp.cars into 'make' datasets;
MAKE Frequency
-----------------------------
Acura 7
Audi 19
BMW 20
Buick 9
Cadillac 8
...
Suzuki 8
Toyota 28
Volkswagen 15
Volvo 12
%symdel make; * just in case;
data _null_;
set sashelp.cars;
by make;
if first.make then do;
call symputx('make',make);
rc=dosubl('
data &make;
set sashelp.cars(where=(make="&make"));
run;quit;
');
end;
run;quit;
SYMBOLGEN: Macro variable MAKE resolves to Acura
SYMBOLGEN: Macro variable MAKE resolves to Acura
NOTE: There were 7 observations read from the data set SASHELP.CARS.
WHERE make='Acura';
...
SYMBOLGEN: Macro variable MAKE resolves to Volvo
SYMBOLGEN: Macro variable MAKE resolves to Volvo
NOTE: There were 12 observations read from the data set SASHELP.CARS.
WHERE make='Volvo';
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.