Dears,
I am looking for optimal code to identify the missing values in a variable that has unique values
Ex: I have a dataset that is similar to A1, where 70 million numbers are missed in the range beginning from 1 to 2.1578 billion records
Following step simulated with 6 missing nos instead of 70+ million for simplifying your understanding
Data A1;
do i=1 to 2157895433;
if i ^in(9,13,14,87,100,101) then output;
end;
run;
I wanted to list down the nos that don't exist between 1 and 2157895433
I have following algorithm in hand but it doesn't seem efficient, would appreciate if you could pass a tip to write optimal code
proc sql;
select count(*) format best32., min(i) format best32., max(i) format best32. into
:mv_cnt, :mv_min, :mv_max from a1;
quit;
proc sort data=a1(keep=i) out=a1_sorted;
by i;
run;
data mis_a1;
do i=1 to 2157895433;
output;
end;
run;
data a2_tgt;
merge a1_sorted(in = a) mis_a1(in=b);
by i;
if b and not a;
run;
Hi,
Ok, this makes the assumption that your input is already sorted by i
Data A1;
do i=1 to 200;
if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ;
retain cntr 1 ; /* counter, we will use this to compare to the input data */
set a1 ;
if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
/* Cntr=13 i=15, so this will loop from 13 to 14 */
output missing ; /* output the missing numbers */
end ;
cntr=i ; /* set cntr to be the new value of i e.g. 15 */
end ;
cntr+1 ; /* increment cntr */
run ;
Hi,
Ok, this makes the assumption that your input is already sorted by i
Data A1;
do i=1 to 200;
if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ;
retain cntr 1 ; /* counter, we will use this to compare to the input data */
set a1 ;
if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
/* Cntr=13 i=15, so this will loop from 13 to 14 */
output missing ; /* output the missing numbers */
end ;
cntr=i ; /* set cntr to be the new value of i e.g. 15 */
end ;
cntr+1 ; /* increment cntr */
run ;
First of all, I would drop the PROC SQL, since the subsequent proc sort will make it easy to get min, max, and count.
proc sort data=a1;
by I;
run;
data _null_;
set a1 nobs=n_a1;
call symput('mv_cnt',trim(put(n_a1,12.)));
call symput('mv_min',trim(put(I,12.)));
set a1 point=n_a1;
call symput('mv_max',trim(put(I,12.)));
stop;
run;
But which part of your sequence is the slow part? If the problem is to take a complete data step and output only records that are not gaps in some other dataset, you can do the proc sort stuff. But what about (1) sorting the gap file only, and (2) using it in a hash object for lookup. This would relieve you of sorting the main file from which you want to delete the 70 million records.
/* If a1 is not sorted then this is necessary */
proc sort data=a1;
by i;
run;
/* Make a dataset of gaps - about 70 million records */
data gaps (keep=i) /* about 70 million expected*/
set a1 (rename=(i=j));
lagj=coalesce(lag{j},0);
if j-lagj^=1 then do i=lagj+1 to j-1;
output;
end;
if end_of_need=1 and j< 2157895433 then do i=j+1 to 2157895433;
output;
end;
run;
data want;
set have;
if _n_=1 then do;
declare hash h(dataset:'gaps');
h.definekey('i');
h.definedata('i');
h.definedone();
end;
if h.find()^=0;
run;
The benefit here is that you don't have to sort the data set to be purged.
I have 2 different tables and both the tables have couple of billion records.. It is required to fetch the records from table1, if the record doesn't exist in table2. I felt it is best to find the missing values in the range that belongs to table2. Fianlly fetch the records from table1 if it matches with missing values or beyond the range.
Thanks for your prompt response!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.