Re: check if variables from a list exist in a table

MART1 · Posted 01-05-2022 12:58 PM

Hello

I need to check a table against a list of variables, to identify those records in the table where any of the variables is contained.

In my dummy example below, I need to identify those records in SASHELP.CARS where either "coupe", "hatch", "quattro" or "caravan" is contained in MODEL.

This works fine but in my real data I have about 13 million records, with about 8 million variables; SAS always times out before it completes the task.

I was hoping there would be a better way to do this?

many thanks

  /*create lookup table with keys to be searched*/

data LOOKUP;
input KEY $;

datalines;
coupe
hatch
quattro
caravan
;
run;

/*create table where output will be saved*/

PROC SQL;
CREATE TABLE OUTPUT
(
MAKE char (20),
MODEL char (40)
);

%macro makereport(Var=);

/*marco loops*/

PROC SQL;
CREATE TABLE OUTPUT_AUX AS
    SELECT MAKE, MODEL
FROM SASHELP.CARS
WHERE UPPER(MODEL) contains UPPER(&Var);

QUIT;

/*append result to output table*/

PROC APPEND BASE=OUTPUT DATA=OUTPUT_AUX FORCE;

RUN;

%mend makereport;

data _null_;
                set WORK.LOOKUP;
                string = cats('%makereport(Var="',KEY,'");');
                call execute(string);
run;

PaigeMiller · Posted 01-05-2022 01:06 PM

PROC SQL;
    CREATE TABLE OUTPUT_AUX AS SELECT MAKE, MODEL
    FROM SASHELP.CARS
    WHERE upcase(MODEL) in (select upcase(key) from lookup);
QUIT;

No macros needed. I can't test it on a data set that large, but 13 million records shouldn't really be problem these days. It's not clear why you mention you have 8 million variables, as those are not used in this problem at all.

--
Paige Miller

MART1 · Posted 01-05-2022 01:15 PM

Thanks @PaigeMiller

It does not work though; it must be a CONTAINS rather than IN.

I mentioned the 8 million to give an idea of the size (of course I cannot paste here); when I run it I get an "out of memory " type message.

thanks

ballardw · Posted 01-05-2022 03:00 PM

@MART1 wrote:

Thanks @PaigeMiller

It does not work though; it must be a CONTAINS rather than IN.

I mentioned the 8 million to give an idea of the size (of course I cannot paste here); when I run it I get an "out of memory " type message.

thanks

How many of the 8 million variables are actually involved in the comparison? Reduce the size to what is actually needed such as the comparison variables and enough others to uniquely identify each record. Then if needed combine that back to the original data.

Also, how many distinct values of the comparison variables are involved?

How many variables are compared to the look up values? Just how many look up values are there?

And perhaps if the values of the variables aren't sensitive show some actual examples of what you are searching for in some actual values.

ballardw · Posted 01-05-2022 01:16 PM

Your description of "I need to identify those records in SASHELP.CARS where either "coupe", "hatch", "quattro" or "caravan" is contained in MODEL" is a "Check if VALUES" from a list exist.

It would help to show what you expect but consider:

proc sql;
   create table found as
   select b.make, b.model
   from lookup as a, sashelp.cars as b
   where find(b.model, a.key,'i')>0;
run;

The above code compares all records in the lookup data set with the SASHELP.Cars (the comma between the alias A and the Sashelp.cars data set name) and when the string of the Key in the lookup is found in the Model variable, ignoring case (the 'i' in the Find function) keeps the make and model.

Note: if your KEY could possibly appear twice in a single Model you would add the Distinct predicate to the Select to get only one result per Make/model combination

proc sql;
   create table found as
   select distinct b.make, b.model
   from lookup as a, sashelp.cars as b
   where find(b.model, a.key,'i')>0;
run;

PaigeMiller · Posted 01-05-2022 01:34 PM

Probably the code from @ballardw is better, but still a Cartesian join in SQL with 13 million records may be slow. Whether it is slower than the solution I present is unknown.

%macro dothis;
proc sql noprint;
    select quote(trim(key)) into :keys separated by ' ' from lookup;
quit;

data want;
    set sashelp.cars;
    if %do i=1 %to %sysfunc(countw(&keys));
        find(model,%scan(&keys,&i,%str( )),'i')>0
        %if &i<%sysfunc(countw(&keys)) %then or;
        %end;
    then output;
run;
%mend;
%dothis

--
Paige Miller

MART1 · Posted 01-06-2022 12:08 PM

Thanks @PaigeMiller

unfortunately the macro "keys" exceeds the maximum length (65k)
thanks

yabwon · Posted 01-05-2022 02:52 PM

Maybe data step array could be solution?

data LOOKUP;
input KEY $;
datalines;
coupe
hatch
quattro
caravan
;
run;

data _null_;
  call symputX("N",nobs,"G");        
  stop;
  set lookup nobs=nobs;
run;




data want;
    /* load keys int an array */
    array lookup[&N.] $ 12 _temporary_;
    do _N_ = 1 by 1 until(end);
      set lookup end=end;
      lookup[_N_] = KEY; drop KEY;
    end; 

    /* loop over data set */
    do until(EOF);
      set sashelp.cars end=EOF;

      do _N_ = 1 to &N.;
        if find(model,lookup[_N_],'it') then 
          do; /* if model contains key do output... */
            output;
            leave; /* ...and leave the loop since there is no point to check next */
          end;
      end;
    end;

  stop;
run;

Everything in one data step with no macro loops.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ghosh · Posted 01-05-2022 03:06 PM

In your first line if you mean "values" rather than "variables" then the following will likely work in an 8 million record database. A macro variable can hold about 65K so you could fit in a large amount of values


proc sql noprint;                              
 select key into :to_inc separated by '|'
 from lookup;
quit;
%put &=to_inc;

Data want;
length match 3;
 Set sashelp.cars;
    if prxmatch("/&to_inc/i",model) 
 	Then MATCH=1;
 	else MATCH=0;
if match;
run;

MART1 · Posted 01-06-2022 11:46 AM

Thanks @ghosh

unfortunately the length of the macro exceeds 65k (maybe a can try breaking up)

thanks

ghosh · Posted 01-06-2022 12:35 PM

Yes, with this approach you could break up the list. How may "words" do you have in your lookup list.

Frankly if I were you I would try the proc format approach shown by @SASKiwi
It's also described in this paper, see p.2
https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p102-26.pdf

MART1 · Posted 01-06-2022 12:50 PM

Hi @ghosh

I tried to break it up, but in order to work it would need far too many partitions (the lookup list contains all UK postcodes, 8 M "words", each 6 or 7 digits).

I'll try the proc format approach (never seen it before but I'll give it a go!)

thanks

SASKiwi · Posted 01-05-2022 04:01 PM

Here's another efficient way that doesn't have the limitations of lists:

data LOOKUP;
input start $;
FmtName = '$Lookup' ;
if _n_ = 1 then do;
  HLO = 'O';
  label = 'N';
  output;
end;
HLO = '';
label = 'Y';
output;
datalines;
SEDAN
WAGON
SUV
;
run;

proc format cntlin = LOOKUP;
run;

data want;
  set SASHELP.CARS;
  where put(upcase(type), $Lookup.) = 'Y';
run;

andreas_lds · Posted 01-06-2022 01:02 AM

You are mixing terms, making it difficult to follow the problem description. You are talking about finding variables in dataset, but actually it seems that you want to check if one/some/all variables contain one value of list of values.

To make our lives simpler, please post some observations of the data you have as working sas code (data step with datalines). The code posted by @yabwon seems to solve the problem.

MART1 · Posted 01-06-2022 11:39 AM

Thanks @andreas_lds

In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.

In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.

The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.

Translated to my dummy example, I wanted to find those records in CARS where model contain either "quattro" or "coupe"...

If it does, the record needs to be added to the final output.

I have used @yabwon code; it works perfectly if I use a small subset of the postcode, but when I use the full list (8M) it takes a long time (tried a few times but I had to stop it after 3 hours).

Maybe I'm just too "demanding" and this would take quite some time?

Thanks

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away