lingcx
Obsidian | Level 7

Hi all,

 

This is my first post asking for help. I have about 1000 medium-sized datasets (each has 400k obs and 7 vars: one unique ID and 6 other vars). I would like to merge all 1000 together. I tried two methods. The first simply cannot run to completion because of the huge numbers of obs and vars. The second took a long time to run (more than 12 hours), especially as the "final" dataset gets larger and larger toward the end.

 

I'm wondering whether there is a much more efficient way to merge all 1000-ish datasets together. I thought about splitting these into ten batches and merging the 10 semi-final datasets, but I'm not sure whether it would help.

 

If you have a better idea how to proceed with this, please please let me know!! Much appreciated. 

 

Method 1: 
Data final;
merge day: ;
by ID;
run;

 

Method 2:

%macro merge(data);
Data final;
merge final c.&data.;
by ID;
run;
%Mend;


%merge(day1);
%merge(day2);
%merge(day4);
%merge(day5);
%merge(day7);
%merge(day8);
%merge(day11);
%merge(day13);
%merge(day20);
%merge(day24);
....
%merge(day10000);

ACCEPTED SOLUTION
ChrisNZ
Tourmaline | Level 20

This got me thinking and I did a few more tests.

 

SAS gets very inefficient when processing hundreds of tables concurrently with a BY statement, and the elapsed (and CPU) time grows much faster than linearly as more tables are added.

Much better to split the processing as suggested in the first post.

 

This is even truer for the SET statement than for the MERGE statement.

 

 

%*********** C R E A T E   S A M P L E **********;
%let nbobs=2000;
%let nbtab=1000;

%* Create sample;
data %macro loop; %local i; %do i=1 %to &nbtab;
       HAVE&i.(rename=(J=J&i. K=K&i. L=L&i. M=M&i.) compress=no where=(I<3 or I>&i.))
     %end; %mend;
     %loop;
  do I=1 to &nbobs;
    J=I; K=I; L=I; M=I;
    output;
  end;
run;
%************ T E S T   M E R G E   B Y *************;
%* merge 100 tables  ==> 0.7 seconds;
data WANT; merge HAVE1-HAVE100;  by I; run;
%* merge 200 tables  ==> 2 seconds;
data WANT; merge HAVE1-HAVE200;  by I; run;
%* merge 500 tables  ==> 15 seconds;
data WANT; merge HAVE1-HAVE500;  by I; run;
%* merge 1000 tables ==> 65 seconds;
data WANT; merge HAVE1-HAVE1000; by I; run;

%* Break the merge into 200-table blocks ==> total time 10 seconds;
data T1; merge HAVE1  -HAVE200;  by I; run;
data T2; merge HAVE201-HAVE400;  by I; run;
data T3; merge HAVE401-HAVE600;  by I; run;
data T4; merge HAVE601-HAVE800;  by I; run;
data T5; merge HAVE801-HAVE1000; by I; run;
data T6; merge T1-T5;            by I; run;

%************** T E S T   S E T   B Y **************;
%* set 50 tables  ==> 4 seconds;
data WANT; set HAVE1-HAVE50;  by I; run;
%* set 100 tables ==> 26 seconds;
data WANT; set HAVE1-HAVE100; by I; run;
%* set 200 tables ==> 180 seconds;
data WANT; set HAVE1-HAVE200; by I; run;

%* Break the set into 40-table blocks ==> total time 9 seconds;
data T1; set HAVE1  -HAVE40;  by I; run;
data T2; set HAVE41 -HAVE80;  by I; run;
data T3; set HAVE81 -HAVE120; by I; run;
data T4; set HAVE121-HAVE160; by I; run;
data T5; set HAVE161-HAVE200; by I; run;
data T6; set T1-T5;           by I; run;

 

@lingcx You have to test with your own data in your own environment, but you should find something similar.

 

If you go the UPDATE route, splitting should speed up the prior SET step. Multiple MERGEs may still be faster though.

 


16 REPLIES
Astounding
PROC Star

I just have to ask ... are you certain that MERGE is the right SAS tool to combine these data sets?  That would imply that ID is the only variable that is common to all your data sets, and that you want a final data set with approximately 400K observations and 6000 variables.

 

If all that sounds right, then your idea of batches is a good one. You could certainly test one batch of 100 data sets. Even if that's not the ideal split, you could save the results and not have to repeat the process for those 100 when trying smaller batches.
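A minimal sketch of the batching idea, assuming (as in your Method 2) the members are named DAY1-DAY1000 in library C, every dataset is sorted by ID, and an existence check skips the missing day numbers:

%macro merge_batch(from, to, out);
  %local i;
  data &out;
    merge
      %do i=&from %to &to;
        %if %sysfunc(exist(c.day&i.)) %then c.day&i.;
      %end;
    ;
    by ID;
  run;
%mend merge_batch;

%* ten batches of 100, then one final merge of the ten results;
%merge_batch(1,100,batch1)
%merge_batch(101,200,batch2)
%* ... ;
%merge_batch(901,1000,batch10)

data final;
  merge batch1-batch10;
  by ID;
run;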

lingcx
Obsidian | Level 7

Unfortunately, that's exactly what I want - a huge dataset with ~400k obs and 6000 vars. 

Reeza
Super User

I suspect that appending these is the better solution. Do you have any sort of naming convention for the data sets?

 


@lingcx wrote:

Hi all,

 

 (each has 400k obs and 7 vars including one unique ID and 6 other vars).



Your method will not work as written: once SAS merges in the data, identically named variables are overwritten, so this doesn't make sense to me, unless each dataset has 6 differently named variables, which I suspect is not the case. A minimal demo of the overwrite is below.
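A minimal demo with two hypothetical one-row datasets:

data a; id=1; x=10; run;
data b; id=1; x=99; run;

data both;
  merge a b;
  by id;
run;
%* BOTH ends up with x=99 : the value from B silently overwrites the value from A;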

 




 

If you really do need it wide, you can still append and then transpose afterwards. 
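A hedged sketch of the transpose-back step, assuming you have first stacked everything into a tall table TALL with variables ID, VARNAME (holding names like day1_v1), and VALUE:

proc sort data=tall;
  by id;
run;

proc transpose data=tall out=wide(drop=_name_);
  by id;          /* one output row per ID */
  id varname;     /* each distinct VARNAME becomes a column */
  var value;
run;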

 

 

lingcx
Obsidian | Level 7

@Reeza wrote:

I suspect that appending these is the better solution. Do you have any sort of naming convention for the data sets?

 


Your method will not work as written: once SAS merges in the data, identically named variables are overwritten, so this doesn't make sense to me, unless each dataset has 6 differently named variables, which I suspect is not the case.

 

I do have 6 vars named differently but systematically in each of the 1000 datasets. For example, dataset 'day1' has ID, day1_v1, day1_v2, day1_v3, day1_v4, day1_v5, and day1_v6, and similarly, dataset 'day2' has day2_v1, day2_v2, day2_v3, day2_v4, day2_v5, and day2_v6, etc.

 


@Reeza wrote:

If you really do need it wide, you can still append and then transpose afterwards.

 

How much faster do you think it will be if I append and transpose? Transpose seems like a time-consuming procedure to me too...

 

 


 

Tom
Super User Tom
Super User

@lingcx wrote:
I do have 6 vars named differently but systematically in each of the 1000 datasets. For example, dataset 'day1' has ID, day1_v1, day1_v2, day1_v3, day1_v4, day1_v5, and day1_v6, and similarly, dataset 'day2' has day2_v1, day2_v2, day2_v3, day2_v4, day2_v5, and day2_v6, etc.

If so, then it should be better to generate a tall data set with just three (or four) variables. Here is a version that assumes you have the structure above. It will parse the DAY number and VERSION number out of the variable name. It should also save a lot of space, since you will not need to store missing values.

 

data want ;
  set day: ;      /* read all the DAYn datasets */
  array d day: ;  /* every variable whose name starts with DAY */
  length vname $32 ;
  do i=1 to dim(d);
    if not missing(d(i)) then do;
      vname = vname(d(i));                              /* e.g. day12_v3 */
      day = input(substr(scan(vname,1,'_'),4),32.);     /* -> 12 */
      version = input(substr(scan(vname,2,'_'),2),32.); /* -> 3 */
      value = d(i);
      output;
    end;
  end;
  keep id day version value ;
run;

If the source datasets are already sorted by ID then you could add a BY ID statement after the SET statement and SAS will keep the data sorted by ID.

 

But if opening 1000 datasets at once makes it run too slowly, then you might try transposing them one by one into the tall structure and using PROC APPEND to add each dataset to the end of a master table holding all the data.

%macro do_one(in, base=all);

data next;
  set &in ;
  array d day: ;
  length vname $32 ;
  do i=1 to dim(d);
    if not missing(d(i)) then do;
      vname =vname(d(i));
      day = input(substr(scan(vname,1,'_'),4),32.);
      version = input(substr(scan(vname,2,'_'),2),32.);
      value = d(i);
      output;
    end;
  end;
  keep id day version value ;
run;
proc append base=&base data=next force ;
run;
%mend do_one;

%do_one(day1);
%do_one(day2);
....

If the names are DAY1 to DAY1000, then the series of calls can also easily be scripted into a macro.
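For example, a sketch of such a driver, assuming the members sit in WORK and using an existence check to skip the missing day numbers:

%macro do_all;
  %local i;
  %do i=1 %to 1000;
    %if %sysfunc(exist(day&i.)) %then %do;
      %do_one(day&i.)
    %end;
  %end;
%mend do_all;

%do_all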

ballardw
Super User

Provide an example of maybe 3 or 4 rows of data from day1 and day2, and then what the result should look like after the "merge".

 

If all you do is put the day2 data below day1, then you are appending, and that can be done with (assuming your macro calls used the correct data set names):

 

data final;
   set day1 - day1000;
run;

 

 

 

AlanC
Barite | Level 11

Let me say this: columns (vars) are bad, rows are good.

 

I would look into transposing them and then figure out logic that works on rows. 6000 columns is going to cause you all types of trouble. 

 

I have had a client with 42,000 vars so it is possible. Personally, I created one with millions of columns to see if it was possible. That said, it is HUGELY discouraged. Don't go down that path. 

 

Transpose each one first, then use a SET statement to pull them into a simple columnar structure.
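A minimal sketch of that idea for one dataset, assuming day1 holds ID plus six numeric variables day1_v1-day1_v6 (the naming described earlier) and is sorted by ID:

proc transpose data=day1 out=day1_tall(rename=(_name_=varname col1=value));
  by id;
  var day1_: ;    /* the six day1_* variables */
run;

Repeat (or macro-loop) over the 1000 datasets, then SET the tall pieces together.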

https://github.com/savian-net
lingcx
Obsidian | Level 7

I see. Thanks! Will give it a try!

PGStats
Opal | Level 21

Going wide is the wrong way to go, as others have said. But I think it can be done somewhat efficiently with this approach:

 

data set1;
input id x;
datalines;
1 99
2 999
3 9999
;

data set2;
input id y;
datalines;
1 88
2 888
3 8888
;

data all;
set set: ;
by id;
run;

data want;
update all(obs=0) all; /* zero-obs master supplies the structure; UPDATE collapses each id to one row */
by id;
run;
PG
lingcx
Obsidian | Level 7

Thanks a lot! Your solution works a bit better, but given my huge numbers of observations, a "set" step could take a long time too... I really need to reduce my datasets at least in one dimension if not both. 

mkeintz
PROC Star

The problem with merging or SETting 1,000 datasets is that SAS will set up 1,000 buffers in memory. However, if all the datasets have the same variables, then the OPEN=DEFER option can help. It tells SAS to re-use the same buffer for each incoming dataset in sequence. Here's an example using 3 datasets created from sashelp.stocks:

 

data st1 (where=(stock='IBM'))
     st2 (where=(stock='Intel'))
     st3 (where=(stock='Microsoft')) ;
  set sashelp.stocks;
run;

data need/view=need;
  set st: open=defer;
run;

proc sort data=need out=want;
  by stock date;
run;

 

 

If the original 1,000 datasets each had one stock and are already sorted by date, why not use this, which would avoid the proc sort?

 

data want;
  set st: open=defer;
  by stock date;
run;

 

This can't be done because the "BY stock date" statement means that SAS has to provide a buffer for each dataset - i.e. "open=defer" will not be honored.

lingcx
Obsidian | Level 7

Thanks for the reply! This is good to know! Unfortunately, my datasets have differently named variables and I need a "merge"; your method is great for concatenating datasets, though.

mkeintz
PROC Star

Does each of your datasets have unique variables? Or are there groups of datasets with identical vars? If it's the latter, you could still use "open=defer" within the larger groups.

ChrisNZ
Tourmaline | Level 20

Here is a quick benchmark.

 

I can't run your volume of data or I'll hear some screaming, but it may help. Four solutions are tried.

1- @PGStats's update data step

2- Same update step, but using a view instead of creating the full concatenated dataset

3- Merge by

4- Hash table

 

With 1000 obs and 300 tables, merge and hash are fastest:

%let nbobs=1000;
%let nbtab=300;

%* Create sample;
data %macro loop; %local i; %do i=1 %to &nbtab;
     HAVE&i.(rename=(J=J&i.) compress=no where=(I<3 or I>&i.))
     %end; %mend; 
     %loop;  
  do I=1 to &nbobs;
    J=I;
    output;
  end;
run;
   
%* Update;
data ALL;
  set HAVE: ;
  by I;
run;
data WANT1;
  update ALL(obs=0) ALL;
  by I;
run;
/*    real time           1:18.45      set
      user cpu time       1:16.98
      real time           2.73 seconds update
      user cpu time       2.21 seconds */

%* View + Update;
data ALL ;
  set HAVE: ;
  stop;
run;
data _V/view=_V;
  set HAVE: ;
  by I;
run;
data WANT2;
  update ALL _V;
  by I;
run;
/*    real time           1:21.17
      user cpu time       1:18.40 */

%* Merge;
data WANT3;
  merge HAVE: ;
  by I;
run;
/*    real time           1.20 seconds
      user cpu time       0.76 seconds */

%* Hash;
data _null_;
  if _N_=1 then do;
    dcl hash OUT(ordered:'a');
    OUT.definekey ('I');
    OUT.definedata('I' %macro loop; %local i; %do i=1 %to &nbtab.; ,"J&i" %end; %mend; %loop );
    OUT.definedone();
  end;
  set HAVE:(keep=I) ;                       %* Read key;
  RC=OUT.find();                            %* Get stored values from hash;   
  set HAVE: end=LASTOBS indsname=INDSNAME;  %* Update new value; 
  if INDSNAME ne lag(INDSNAME) then do;     %* The first time a dataset is read, the PDV is reset. ;
     select(INDSNAME);                      %*   in that case reload the hash values and add new value;
     %macro loop; %local i; %do i=1 %to &nbtab.;
     when("WORK.HAVE&i") do; TMP=J&i; RC=OUT.find(); J&i=TMP; end;
     %end; %mend; %loop 
     otherwise;
     end;
  end;
  OUT.replace();               %* save new values;
  if LASTOBS then OUT.output(dataset:'WANT4');
run;
/*    real time           1.60 seconds
      user cpu time       1.23 seconds */

 

When increasing to 2k obs and 1k tables, the results scale similarly.

 

 

I don't know your BUFSIZE, but you could try setting BUFNO to 100 or so, if you have enough memory.
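For example (a hypothetical value; tune it to your memory and BUFSIZE):

options bufno=100;  /* more I/O buffers per open dataset: fewer, larger reads */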

 

As mentioned, wide tables are not the most efficient to process.

 

 
