BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
saivenkat
Obsidian | Level 7

Dears,

 

I am looking for optimal code to identify the missing values in a variable that has unique values

 

Ex: I have a dataset that is similar to A1, where 70 million numbers are missed in the range beginning from 1 to  2.1578 billion records 

Following step simulated with 6 missing nos instead of 70+ million for simplifying your understanding

Data A1;

 

   do i=1 to 2157895433;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;

I wanted to list down the nos that don't exist between 1 and 2157895433

 

I have following algorithm in hand but it doesn't seem efficient, would appreciate if you could pass a tip to write optimal code


proc sql;
select count(*) format best32., min(i) format best32., max(i) format best32. into
:mv_cnt, :mv_min, :mv_max from a1;
quit;

 

proc sort data=a1(keep=i) out=a1_sorted;
by i;
run;

 

data mis_a1;
do i=1 to 2157895433;
output;
end;

run;

 

data a2_tgt;
merge a1_sorted(in = a) mis_a1(in=b);
by i;
if b and not a;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
AMSAS
SAS Super FREQ

Hi,

 

Ok, this makes the assumption that your input is already sorted by i

 

 

Data A1;
   do i=1 to 200;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ; 
	retain cntr 1 ; /* counter, we will use this to compare to the input data */
	set a1 ;
	if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
		do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
		                      /* Cntr=13 i=15, so this will loop from 13 to 14 */
			output missing ;  /* output the missing numbers */
		end ;
		cntr=i ; /* set cntr to be the new value of i e.g. 15 */
	end ;
	cntr+1 ; /* increment cntr */
run ;

View solution in original post

3 REPLIES 3
AMSAS
SAS Super FREQ

Hi,

 

Ok, this makes the assumption that your input is already sorted by i

 

 

Data A1;
   do i=1 to 200;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ; 
	retain cntr 1 ; /* counter, we will use this to compare to the input data */
	set a1 ;
	if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
		do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
		                      /* Cntr=13 i=15, so this will loop from 13 to 14 */
			output missing ;  /* output the missing numbers */
		end ;
		cntr=i ; /* set cntr to be the new value of i e.g. 15 */
	end ;
	cntr+1 ; /* increment cntr */
run ;
mkeintz
PROC Star

First of all, I would drop the PROC SQL, since the subsequent proc sort will make it easy to get min, max, and count.

 

proc sort data=a1;

by I;

run;

 

data _null_;

set a1 nobs=n_a1;

call symput('mv_cnt',trim(put(n_a1,12.)));

call symput('mv_min',trim(put(I,12.)));

set a1 point=n_a1;

call symput('mv_max',trim(put(I,12.)));

stop;

run;

 

 

But which part of your sequence is the slow part?  If the problem is to take a complete data step and output only records that are not gaps in some other dataset, you can do the proc sort stuff.  But what about (1) sorting the gap file only, and (2) using it in a hash object for lookup.  This would relieve you of sorting the main file from which you want to delete the 70 million records.

 

/* If a1 is not sorted then this is necessary */
proc sort data=a1;
  by i;
run;

/* Make a dataset of gaps - about 70 million records */
data gaps (keep=i) /* about 70 million expected*/
  set a1 (rename=(i=j));
  lagj=coalesce(lag{j},0);
  if j-lagj^=1 then do i=lagj+1 to j-1;
    output;
  end;
  if end_of_need=1 and j< 2157895433 then do i=j+1 to  2157895433;
    output;
  end;
run;

data want;
  set have;
  if _n_=1 then do;
    declare hash h(dataset:'gaps');
	  h.definekey('i');
	  h.definedata('i');
	  h.definedone();
  end;
  if h.find()^=0;
run;

The benefit here is that you don't have to sort the data set to be purged.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
saivenkat
Obsidian | Level 7

I have 2 different tables and both the tables have couple of billion records.. It is required to fetch the records from table1, if the record doesn't exist in table2. I felt it is best to find the missing values in the range that belongs to table2. Fianlly fetch the records from table1 if it matches with missing values or beyond the range.

 

Thanks for your prompt response!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 989 views
  • 3 likes
  • 3 in conversation