Hello
I need to check a table against a list of variables, to identify those records in the table where any of the variables is contained.
In my dummy example below, I need to identify those records in SASHELP.CARS where either "coupe", "hatch", "quattro" or "caravan" is contained in MODEL.
This works fine but in my real data I have about 13 million records, with about 8 million variables; SAS always times out before it completes the task.
I was hoping there would be a better way to do this?
many thanks
/*create lookup table with keys to be searched*/ data LOOKUP; input KEY $; datalines; coupe hatch quattro caravan ; run; /*create table where output will be saved*/ PROC SQL; CREATE TABLE OUTPUT ( MAKE char (20), MODEL char (40) ); %macro makereport(Var=); /*marco loops*/ PROC SQL; CREATE TABLE OUTPUT_AUX AS SELECT MAKE, MODEL FROM SASHELP.CARS WHERE UPPER(MODEL) contains UPPER(&Var); QUIT; /*append result to output table*/ PROC APPEND BASE=OUTPUT DATA=OUTPUT_AUX FORCE; RUN; %mend makereport; data _null_; set WORK.LOOKUP; string = cats('%makereport(Var="',KEY,'");'); call execute(string); run;
PROC SQL;
CREATE TABLE OUTPUT_AUX AS SELECT MAKE, MODEL
FROM SASHELP.CARS
WHERE upcase(MODEL) in (select upcase(key) from lookup);
QUIT;
No macros needed. I can't test it on a data set that large, but 13 million records shouldn't really be problem these days. It's not clear why you mention you have 8 million variables, as those are not used in this problem at all.
Thanks @PaigeMiller
It does not work though; it must be a CONTAINS rather than IN.
I mentioned the 8 million to give an idea of the size (of course I cannot paste here); when I run it I get an "out of memory " type message.
thanks
@MART1 wrote:
Thanks @PaigeMiller
It does not work though; it must be a CONTAINS rather than IN.
I mentioned the 8 million to give an idea of the size (of course I cannot paste here); when I run it I get an "out of memory " type message.
thanks
How many of the 8 million variables are actually involved in the comparison? Reduce the size to what is actually needed such as the comparison variables and enough others to uniquely identify each record. Then if needed combine that back to the original data.
Also, how many distinct values of the comparison variables are involved?
How many variables are compared to the look up values? Just how many look up values are there?
And perhaps if the values of the variables aren't sensitive show some actual examples of what you are searching for in some actual values.
Your description of "I need to identify those records in SASHELP.CARS where either "coupe", "hatch", "quattro" or "caravan" is contained in MODEL" is a "Check if VALUES" from a list exist.
It would help to show what you expect but consider:
proc sql; create table found as select b.make, b.model from lookup as a, sashelp.cars as b where find(b.model, a.key,'i')>0; run;
The above code compares all records in the lookup data set with the SASHELP.Cars (the comma between the alias A and the Sashelp.cars data set name) and when the string of the Key in the lookup is found in the Model variable, ignoring case (the 'i' in the Find function) keeps the make and model.
Note: if your KEY could possibly appear twice in a single Model you would add the Distinct predicate to the Select to get only one result per Make/model combination
proc sql; create table found as select distinct b.make, b.model from lookup as a, sashelp.cars as b where find(b.model, a.key,'i')>0; run;
Probably the code from @ballardw is better, but still a Cartesian join in SQL with 13 million records may be slow. Whether it is slower than the solution I present is unknown.
%macro dothis;
proc sql noprint;
select quote(trim(key)) into :keys separated by ' ' from lookup;
quit;
data want;
set sashelp.cars;
if %do i=1 %to %sysfunc(countw(&keys));
find(model,%scan(&keys,&i,%str( )),'i')>0
%if &i<%sysfunc(countw(&keys)) %then or;
%end;
then output;
run;
%mend;
%dothis
Maybe data step array could be solution?
data LOOKUP;
input KEY $;
datalines;
coupe
hatch
quattro
caravan
;
run;
data _null_;
call symputX("N",nobs,"G");
stop;
set lookup nobs=nobs;
run;
data want;
/* load keys int an array */
array lookup[&N.] $ 12 _temporary_;
do _N_ = 1 by 1 until(end);
set lookup end=end;
lookup[_N_] = KEY; drop KEY;
end;
/* loop over data set */
do until(EOF);
set sashelp.cars end=EOF;
do _N_ = 1 to &N.;
if find(model,lookup[_N_],'it') then
do; /* if model contains key do output... */
output;
leave; /* ...and leave the loop since there is no point to check next */
end;
end;
end;
stop;
run;
Everything in one data step with no macro loops.
Bart
In your first line if you mean "values" rather than "variables" then the following will likely work in an 8 million record database. A macro variable can hold about 65K so you could fit in a large amount of values
proc sql noprint;
select key into :to_inc separated by '|'
from lookup;
quit;
%put &=to_inc;
Data want;
length match 3;
Set sashelp.cars;
if prxmatch("/&to_inc/i",model)
Then MATCH=1;
else MATCH=0;
if match;
run;
Thanks @ghosh
unfortunately the length of the macro exceeds 65k (maybe a can try breaking up)
thanks
Hi @ghosh
I tried to break it up, but in order to work it would need far too many partitions (the lookup list contains all UK postcodes, 8 M "words", each 6 or 7 digits).
I'll try the proc format approach (never seen it before but I'll give it a go!)
thanks
Here's another efficient way that doesn't have the limitations of lists:
data LOOKUP;
input start $;
FmtName = '$Lookup' ;
if _n_ = 1 then do;
HLO = 'O';
label = 'N';
output;
end;
HLO = '';
label = 'Y';
output;
datalines;
SEDAN
WAGON
SUV
;
run;
proc format cntlin = LOOKUP;
run;
data want;
set SASHELP.CARS;
where put(upcase(type), $Lookup.) = 'Y';
run;
You are mixing terms, making it difficult to follow the problem description. You are talking about finding variables in dataset, but actually it seems that you want to check if one/some/all variables contain one value of list of values.
To make our lives simpler, please post some observations of the data you have as working sas code (data step with datalines). The code posted by @yabwon seems to solve the problem.
Thanks @andreas_lds
In the "real world", my dataset contains a list of users (column A) with their associated address (column B). The address may or may not contain the postcode. This dataset contains about 13M records.
In another dataset I have a list of all UK postcodes (distinct list); it's about 8M records.
The goal is to identify those users in the first dataset who have a postcode (any postcode) in their address.
Translated to my dummy example, I wanted to find those records in CARS where model contain either "quattro" or "coupe"...
If it does, the record needs to be added to the final output.
I have used @yabwon code; it works perfectly if I use a small subset of the postcode, but when I use the full list (8M) it takes a long time (tried a few times but I had to stop it after 3 hours).
Maybe I'm just too "demanding" and this would take quite some time?
Thanks
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.