BookmarkSubscribeRSS Feed
MART1
Quartz | Level 8

Hello

 

I need to check a table against a list of variables, to identify those records in the table where any of the variables is contained.

In my dummy example below, I need to identify those records in SASHELP.CARS where either "coupe", "hatch", "quattro" or "caravan" is contained in MODEL.

 

This works fine but in my real data I have about 13 million records, with about 8 million variables; SAS always times out before it completes the task.

 

I was hoping there would be a better way to do this?

many thanks

 

  /*create lookup table with keys to be searched*/

data LOOKUP;
input KEY $;

datalines;
coupe
hatch
quattro
caravan
;
run;

/*create table where output will be saved*/

PROC SQL;
CREATE TABLE OUTPUT
(
MAKE char (20),
MODEL char (40)
);

%macro makereport(Var=);

/*marco loops*/

PROC SQL;
CREATE TABLE OUTPUT_AUX AS
    SELECT MAKE, MODEL
FROM SASHELP.CARS
WHERE UPPER(MODEL) contains UPPER(&Var);

QUIT;

/*append result to output table*/

PROC APPEND BASE=OUTPUT DATA=OUTPUT_AUX FORCE;

RUN;

%mend makereport;

data _null_;
                set WORK.LOOKUP;
                string = cats('%makereport(Var="',KEY,'");');
                call execute(string);
run;
 

 

18 REPLIES 18
PaigeMiller
Diamond | Level 26
PROC SQL;
    CREATE TABLE OUTPUT_AUX AS SELECT MAKE, MODEL
    FROM SASHELP.CARS
    WHERE upcase(MODEL) in (select upcase(key) from lookup);
QUIT;

No macros needed. I can't test it on a data set that large, but 13 million records shouldn't really be problem these days. It's not clear why you mention you have 8 million variables, as those are not used in this problem at all.

--
Paige Miller
MART1
Quartz | Level 8

Thanks @PaigeMiller 

 

It does not work though; it must be a CONTAINS rather than IN.

 

I mentioned the 8 million to give an idea of the size (of course I cannot paste here); when I run it I get an "out of memory " type message.

 

thanks 

ballardw
Super User

@MART1 wrote:

Thanks @PaigeMiller 

 

It does not work though; it must be a CONTAINS rather than IN.

 

I mentioned the 8 million to give an idea of the size (of course I cannot paste here); when I run it I get an "out of memory " type message.

 

thanks 


How many of the 8 million variables are actually involved in the comparison? Reduce the size to what is actually needed such as the comparison variables and enough others to uniquely identify each record. Then if needed combine that back to the original data.

 

Also, how many distinct values of the comparison variables are involved?

 

How many variables are compared to the look up values? Just how many look up values are there?

And perhaps if the values of the variables aren't sensitive show some actual examples of what you are searching for in some actual values.

 

 

 

 

ballardw
Super User

Your description of "I need to identify those records in SASHELP.CARS where either "coupe", "hatch", "quattro" or "caravan" is contained in MODEL" is a "Check if VALUES" from a list exist.

 

It would help to show what you expect but consider:

proc sql;
   create table found as
   select b.make, b.model
   from lookup as a, sashelp.cars as b
   where find(b.model, a.key,'i')>0;
run;

The above code compares all records in the lookup data set with the SASHELP.Cars (the comma between the alias A and the Sashelp.cars data set name) and when the string of the Key in the lookup is found in the Model variable, ignoring case (the 'i' in the Find function) keeps the make and model.

Note: if your KEY could possibly appear twice in a single Model you would add the Distinct predicate to the Select to get only one result per Make/model combination

proc sql;
   create table found as
   select distinct b.make, b.model
   from lookup as a, sashelp.cars as b
   where find(b.model, a.key,'i')>0;
run;

 

PaigeMiller
Diamond | Level 26

Probably the code from @ballardw is better, but still a Cartesian join in SQL with 13 million records may be slow. Whether it is slower than the solution I present is unknown. 

 

%macro dothis;
proc sql noprint;
    select quote(trim(key)) into :keys separated by ' ' from lookup;
quit;

data want;
    set sashelp.cars;
    if %do i=1 %to %sysfunc(countw(&keys));
        find(model,%scan(&keys,&i,%str( )),'i')>0
        %if &i<%sysfunc(countw(&keys)) %then or;
        %end;
    then output;
run;
%mend;
%dothis
--
Paige Miller
MART1
Quartz | Level 8
Thanks @PaigeMiller

unfortunately the macro "keys" exceeds the maximum length (65k)
thanks
yabwon
Onyx | Level 15

Maybe data step array could be solution?

data LOOKUP;
input KEY $;
datalines;
coupe
hatch
quattro
caravan
;
run;

data _null_;
  call symputX("N",nobs,"G");        
  stop;
  set lookup nobs=nobs;
run;




data want;
    /* load keys int an array */
    array lookup[&N.] $ 12 _temporary_;
    do _N_ = 1 by 1 until(end);
      set lookup end=end;
      lookup[_N_] = KEY; drop KEY;
    end; 

    /* loop over data set */
    do until(EOF);
      set sashelp.cars end=EOF;

      do _N_ = 1 to &N.;
        if find(model,lookup[_N_],'it') then 
          do; /* if model contains key do output... */
            output;
            leave; /* ...and leave the loop since there is no point to check next */
          end;
      end;
    end;

  stop;
run;

Everything in one data step with no macro loops.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ghosh
Barite | Level 11

In your first line if you mean "values" rather than "variables" then the following will likely work in an 8 million record database.  A macro  variable can hold about 65K so you could fit in a large amount of values 


proc sql noprint;                              
 select key into :to_inc separated by '|'
 from lookup;
quit;
%put &=to_inc;

Data want;
length match 3;
 Set sashelp.cars;
    if prxmatch("/&to_inc/i",model) 
 	Then MATCH=1;
 	else MATCH=0;
if match;
run;
MART1
Quartz | Level 8

Thanks @ghosh 

 

unfortunately the length of the macro exceeds 65k (maybe a can try breaking up)

 

thanks 

ghosh
Barite | Level 11
Yes, with this approach you could break up the list. How may "words" do you have in your lookup list.

Frankly if I were you I would try the proc format approach shown by @SASKiwi
It's also described in this paper, see p.2
https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p102-26.pdf

MART1
Quartz | Level 8

Hi @ghosh 

 

I tried to break it up, but in order to work it would need far too many partitions (the lookup list contains all UK postcodes, 8 M "words", each 6 or 7 digits).

 

I'll try the proc format approach (never seen it before but I'll give it a go!)

thanks

SASKiwi
PROC Star

Here's another efficient way that doesn't have the limitations of lists:

data LOOKUP;
input start $;
FmtName = '$Lookup' ;
if _n_ = 1 then do;
  HLO = 'O';
  label = 'N';
  output;
end;
HLO = '';
label = 'Y';
output;
datalines;
SEDAN
WAGON
SUV
;
run;

proc format cntlin = LOOKUP;
run;

data want;
  set SASHELP.CARS;
  where put(upcase(type), $Lookup.) = 'Y';
run; 
andreas_lds
Jade | Level 19

You are mixing terms, making it difficult to follow the problem description. You are talking about finding variables in dataset, but actually it seems that you want to check if one/some/all variables contain one value of list of values.

To make our lives simpler, please post some observations of the data you have as working sas code (data step with datalines). The code posted by @yabwon seems to solve the problem.

MART1
Quartz | Level 8

Thanks @andreas_lds 

 

In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.

 

In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.

 

The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.

 

Translated to my dummy example, I wanted to find those records in CARS where model contain either "quattro" or "coupe"...

If it does, the record needs to be added to the final output.

 

I have used  @yabwon code; it works perfectly if I use a small subset of the postcode, but when I use the full list (8M) it takes a long time (tried a few times but I had to  stop it after 3 hours).

 

Maybe I'm just too "demanding" and this would take quite some time?

Thanks

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 6652 views
  • 4 likes
  • 9 in conversation