topic Re: Optimization in SAS Programming

Optimization

saivenkat — Mon, 24 Sep 2018 18:04:02 GMT

Dears,

I am looking for optimal code to identify the missing values in a variable that has unique values

Ex: I have a dataset that is similar to A1, where 70 million numbers are missed in the range beginning from 1 to 2.1578 billion records

Following step simulated with 6 missing nos instead of 70+ million for simplifying your understanding

Data A1;

do i=1 to 2157895433;

if i ^in(9,13,14,87,100,101) then output;
end;
run;

I wanted to list down the nos that don't exist between 1 and 2157895433

I have following algorithm in hand but it doesn't seem efficient, would appreciate if you could pass a tip to write optimal code

proc sql;
select count(*) format best32., min(i) format best32., max(i) format best32. into
:mv_cnt, :mv_min, :mv_max from a1;
quit;

proc sort data=a1(keep=i) out=a1_sorted;
by i;
run;

data mis_a1;
do i=1 to 2157895433;
output;
end;

run;

data a2_tgt;
merge a1_sorted(in = a) mis_a1(in=b);
by i;
if b and not a;
run;

Re: Optimization

AMSAS — Mon, 24 Sep 2018 18:38:12 GMT

Hi,

Ok, this makes the assumption that your input is already sorted by i

Data A1;
   do i=1 to 200;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ; 
	retain cntr 1 ; /* counter, we will use this to compare to the input data */
	set a1 ;
	if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
		do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
		                      /* Cntr=13 i=15, so this will loop from 13 to 14 */
			output missing ;  /* output the missing numbers */
		end ;
		cntr=i ; /* set cntr to be the new value of i e.g. 15 */
	end ;
	cntr+1 ; /* increment cntr */
run ;

Re: Optimization

mkeintz — Mon, 24 Sep 2018 18:48:34 GMT

First of all, I would drop the PROC SQL, since the subsequent proc sort will make it easy to get min, max, and count.

proc sort data=a1;

by I;

run;

data _null_;

set a1 nobs=n_a1;

call symput('mv_cnt',trim(put(n_a1,12.)));

call symput('mv_min',trim(put(I,12.)));

set a1 point=n_a1;

call symput('mv_max',trim(put(I,12.)));

stop;

run;

But which part of your sequence is the slow part? If the problem is to take a complete data step and output only records that are not gaps in some other dataset, you can do the proc sort stuff. But what about (1) sorting the gap file only, and (2) using it in a hash object for lookup. This would relieve you of sorting the main file from which you want to delete the 70 million records.

/* If a1 is not sorted then this is necessary */
proc sort data=a1;
  by i;
run;

/* Make a dataset of gaps - about 70 million records */
data gaps (keep=i) /* about 70 million expected*/
  set a1 (rename=(i=j));
  lagj=coalesce(lag{j},0);
  if j-lagj^=1 then do i=lagj+1 to j-1;
    output;
  end;
  if end_of_need=1 and j< 2157895433 then do i=j+1 to  2157895433;
    output;
  end;
run;

data want;
  set have;
  if _n_=1 then do;
    declare hash h(dataset:'gaps');
	  h.definekey('i');
	  h.definedata('i');
	  h.definedone();
  end;
  if h.find()^=0;
run;

The benefit here is that you don't have to sort the data set to be purged.

Re: Optimization

saivenkat — Mon, 24 Sep 2018 19:14:32 GMT

I have 2 different tables and both the tables have couple of billion records.. It is required to fetch the records from table1, if the record doesn't exist in table2. I felt it is best to find the missing values in the range that belongs to table2. Fianlly fetch the records from table1 if it matches with missing values or beyond the range.

Thanks for your prompt response!