BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
saivenkat
Obsidian | Level 7

Dears,

 

I am looking for optimal code to identify the missing values in a variable that has unique values

 

Ex: I have a dataset that is similar to A1, where 70 million numbers are missed in the range beginning from 1 to  2.1578 billion records 

Following step simulated with 6 missing nos instead of 70+ million for simplifying your understanding

Data A1;

 

   do i=1 to 2157895433;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;

I wanted to list down the nos that don't exist between 1 and 2157895433

 

I have following algorithm in hand but it doesn't seem efficient, would appreciate if you could pass a tip to write optimal code


proc sql;
select count(*) format best32., min(i) format best32., max(i) format best32. into
:mv_cnt, :mv_min, :mv_max from a1;
quit;

 

proc sort data=a1(keep=i) out=a1_sorted;
by i;
run;

 

data mis_a1;
do i=1 to 2157895433;
output;
end;

run;

 

data a2_tgt;
merge a1_sorted(in = a) mis_a1(in=b);
by i;
if b and not a;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
AMSAS
SAS Super FREQ

Hi,

 

Ok, this makes the assumption that your input is already sorted by i

 

 

Data A1;
   do i=1 to 200;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ; 
	retain cntr 1 ; /* counter, we will use this to compare to the input data */
	set a1 ;
	if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
		do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
		                      /* Cntr=13 i=15, so this will loop from 13 to 14 */
			output missing ;  /* output the missing numbers */
		end ;
		cntr=i ; /* set cntr to be the new value of i e.g. 15 */
	end ;
	cntr+1 ; /* increment cntr */
run ;

View solution in original post

3 REPLIES 3
AMSAS
SAS Super FREQ

Hi,

 

Ok, this makes the assumption that your input is already sorted by i

 

 

Data A1;
   do i=1 to 200;

    if i ^in(9,13,14,87,100,101) then output;
end;
run;
data missing ; 
	retain cntr 1 ; /* counter, we will use this to compare to the input data */
	set a1 ;
	if cntr ne i then do ; /* if counter doesn't match i then we have found 1 or more missing numbers */
		do miss=cntr to i-1 ; /* This loop accounts for consecutive missing numbers, e.g. 13 & 14 */
		                      /* Cntr=13 i=15, so this will loop from 13 to 14 */
			output missing ;  /* output the missing numbers */
		end ;
		cntr=i ; /* set cntr to be the new value of i e.g. 15 */
	end ;
	cntr+1 ; /* increment cntr */
run ;
mkeintz
PROC Star

First of all, I would drop the PROC SQL, since the subsequent proc sort will make it easy to get min, max, and count.

 

proc sort data=a1;

by I;

run;

 

data _null_;

set a1 nobs=n_a1;

call symput('mv_cnt',trim(put(n_a1,12.)));

call symput('mv_min',trim(put(I,12.)));

set a1 point=n_a1;

call symput('mv_max',trim(put(I,12.)));

stop;

run;

 

 

But which part of your sequence is the slow part?  If the problem is to take a complete data step and output only records that are not gaps in some other dataset, you can do the proc sort stuff.  But what about (1) sorting the gap file only, and (2) using it in a hash object for lookup.  This would relieve you of sorting the main file from which you want to delete the 70 million records.

 

/* If a1 is not sorted then this is necessary */
proc sort data=a1;
  by i;
run;

/* Make a dataset of gaps - about 70 million records */
data gaps (keep=i) /* about 70 million expected*/
  set a1 (rename=(i=j));
  lagj=coalesce(lag{j},0);
  if j-lagj^=1 then do i=lagj+1 to j-1;
    output;
  end;
  if end_of_need=1 and j< 2157895433 then do i=j+1 to  2157895433;
    output;
  end;
run;

data want;
  set have;
  if _n_=1 then do;
    declare hash h(dataset:'gaps');
	  h.definekey('i');
	  h.definedata('i');
	  h.definedone();
  end;
  if h.find()^=0;
run;

The benefit here is that you don't have to sort the data set to be purged.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
saivenkat
Obsidian | Level 7

I have 2 different tables and both the tables have couple of billion records.. It is required to fetch the records from table1, if the record doesn't exist in table2. I felt it is best to find the missing values in the range that belongs to table2. Fianlly fetch the records from table1 if it matches with missing values or beyond the range.

 

Thanks for your prompt response!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 955 views
  • 3 likes
  • 3 in conversation