🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
texasmfp
Lapis Lazuli | Level 10

My program generates a single observation and, as the last step, appends it to a results file (RESULTS1). The results file starts empty. I want to save every individual result but limit the size of the results file to 100,000 observations. The observation generated by the 100,001st iteration would automatically be written to a new (empty) results file (RESULTS2), again up to 100,000 observations. The 200,001st observation would go to another new results file (let's call it RESULTS3), and so on. In the end, if I run 350,000 iterations, I will end up with RESULTS1 (100,000 obs), RESULTS2 (100,000 obs), RESULTS3 (100,000 obs), and RESULTS4 (50,000 obs).

 

Think of it as: I have five gallons of results that pour into a container one drop at a time, but I have pint-sized containers, so when one is full, the conveyor belt moves to the next container and starts to fill it (never missing a drop).

 

How can I limit the size of the results file, keep it, and then start a new one with a sequential name?

 

Thanks

 

proc append base=RESULTS1 data=newestresult force;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
texasmfp
Lapis Lazuli | Level 10

I modified Astounding's suggestion and it works well.

 

proc append FORCE base=RESULT&result_set data=newestresult;
run; 

%if %eval(&i/(&result_set * 50000))>1 %then %let result_set = %eval(&result_set + 1);

I also added a subsequent macro to join all of the pieces:

 

%let num_files = &result_set;

%macro join;
%do i=1 %to &num_files;
    %let padded_number = %sysfunc(putn(&i, 4.0));

    proc append base=full_set data=RESULT&padded_number;
    run;

    proc datasets lib=work nolist;
        delete RESULT&padded_number;
    quit;
%end;
%mend join;
                                                                                                                               
%join

Tom, I didn't get to test yours, but I am sure it works similarly.

 

Much thanks, once again, to SAS Communities!


8 REPLIES
mkeintz
PROC Star

Are you saying your program will append one observation at a time, with the expectation that you will end up with multiple datasets of 100,000 observations each?   That is the opposite of efficient.  It may be much better to think about how to append in larger batches rather than worrying immediately about dividing the result into datasets with a fixed maximum size.

 

Perhaps you can tell us about the program that generates one result observation at a time.

 

But as to your question: capping data set sizes at 100,000 is a much easier task.  Let's say you have a data set HAVE with some large number of obs, but you know there are no more than 600,000 obs (i.e. you need no more than 6 WANT datasets).  Then:

 

data want1 want2 want3 want4 want5 want6;
  set have end=end_of_have nobs=n_have;
  if _n_<=100000 then output want1; else
  if _n_<=200000 then output want2; else
  if _n_<=300000 then output want3; else
  if _n_<=400000 then output want4; else
  if _n_<=500000 then output want5; else
  output want6;

  if end_of_have;
  if n_have <=500000 then call execute ('proc delete data=want6;run;');
  if n_have <=400000 then call execute ('proc delete data=want5;run;');
  if n_have <=300000 then call execute ('proc delete data=want4;run;');
  if n_have <=200000 then call execute ('proc delete data=want3;run;');
  if n_have <=100000 then call execute ('proc delete data=want2;run;');
run;

The above is a fairly straightforward program.  But as I said, appending one at a time - at the scale you contemplate - is an invitation to lots and lots of waiting.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
texasmfp
Lapis Lazuli | Level 10
Thanks. Yes, I know that appending one observation at a time over 100,000+ iterations means waiting. My testing shows that I can minimize the wait time by limiting the size of the appended dataset.

However, your suggested program, which works beautifully, solves a problem I do not have. I do not have a dataset named HAVE that has already been filled with 600,000 observations and that I then want to split into 6 pieces. I never want HAVE to grow beyond 100,000 observations. I want to stop appending to HAVE at the 100,000th iteration and start appending to a new file HAVE2, then HAVE3, etc., at the append stage.

The program that generates the single observation to be appended does not matter. I am looking for a general solution that can apply to any program, not a specific one.
Astounding
PROC Star

Actually, some of the details of your program matter.

 

Presumably, you are using macro language to iterate hundreds of thousands of times.  If that's incorrect, this approach may not apply.

 

Within the macro program, you might have (and could certainly add) a macro variable to keep track of which iteration you are working on.

 

And presumably you could add a small amount of code to this section:

 

proc append base=RESULTS1 data=newestresult force;
run;

Change it to look like this:


proc append base=RESULTS&result_set data=newestresult force;
run;

Then all you would need to do is to use your "number of iterations" macro variable.  Very early, outside the loop, add:

 

%let result_set = 1;

Then just after PROC APPEND, add some logic to increment &RESULT_SET.  One possibility:

 

%if %eval(&n_iterations / 100000 * 100000) = &n_iterations then 
%let result_set = %eval(&result_set + 1); 

 

texasmfp
Lapis Lazuli | Level 10

Yes, it's a macro that is running the iterations. Your suggestion makes sense. I just need to figure out how to create an iteration counter (which I believe you have called n_iterations).  Thanks
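For reference, if the iterations are driven by an ordinary %DO loop, the loop index itself can serve as that counter. A minimal sketch (the macro and variable names here are illustrative, not from the thread):

```sas
%macro simulate(n_iterations);
  %local i;
  %do i = 1 %to &n_iterations;   %* &i plays the role of the iteration counter ;
    /* ... steps that generate and append the single observation go here ... */
  %end;
%mend simulate;
```

If the loop is a %DO %UNTIL instead, the counter has to be initialized before the loop and incremented by hand with %LET and %EVAL.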

texasmfp
Lapis Lazuli | Level 10
Shouldn't the statement logic be:

%if %eval(&n_iterations / (&result_set * 100000)) > 1 %then
%let result_set = %eval(&result_set + 1);
Astounding
PROC Star

That logic would need to switch from %EVAL to %SYSEVALF.  The %EVAL function performs integer arithmetic, so it drops any remainders.  In fact, that's the reason why the original logic that I posted works:

 

%if %eval(&n_iterations / 100000 * 100000) = &n_iterations then 
%let result_set = %eval(&result_set + 1); 

When &n_iterations is 100001, the division:

 

&n_iterations / 100000

 

produces 1 as the result.  The remainder is dropped.  Multiplying by 100000 generates 100000 as the result, which is not equal to 100001.  

 

There are a few ways to get working logic out of this.  Just make sure the logic increments whenever &N_ITERATIONS is a multiple of 100000.
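One such variant (a sketch along the same lines, not posted in the thread) uses the MOD function, which increments the counter exactly when &N_ITERATIONS is a multiple of 100,000:

```sas
%* roll over to the next results data set after every 100,000th append ;
%if %sysfunc(mod(&n_iterations, 100000)) = 0 %then
  %let result_set = %eval(&result_set + 1);
```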

Tom
Super User

Just keep count.

Not sure how you are currently ending your iterations, but here is one way to use two loops: one just increments the target dataset name, and the other counts how many iterations you have done for the current set.

%let set=1;
%let done=0;
%do %until(&done);
  %let inner_loop=1;
  %do %until(&done or &inner_loop>100000);
    /* your processing here,
       including logic to set macro variable DONE=1 when it is time to stop looping */
    proc append base=RESULTS&set data=newestresult force;
    run;
    %let inner_loop=%eval(&inner_loop+1);
  %end; %* inner loop ;
  %let set=%eval(&set+1);
%end; %* outer loop ;

Discussion stats
  • 8 replies
  • 967 views
  • 2 likes
  • 4 in conversation