Hi everyone,
I would like to shorten this code, but couldn't find a working solution yet.
data Dataset_new;
set Dataset_1 Dataset_2 Dataset_3;
if whichc('Y1', of A1-A100, C, F) or
...
whichc('Y100', of A1-A100, C, F) or
whichc('X', of A1-A100, C, F);
run;
I want to extract observations, which have ONE OF the listed VALUES (Y1-Y100, X) in ONE OF the listed VARIABLES (A1-A100, C, F).
I wonder, if it is possible to sum up all the values in one statement (for example in one "WHICHC"-statement) as they all refer to the same set of variables.
Listing multiple values within one "WHICHC"-statement didn't work, no matter, how I tried to separate or combine them (for example by comma, space, "OR"-Operator etc.)
I also tried the "IF FIND(CAT())" combination instead of "IF WHICHC()" but listing multiple values in one statement didn't work there either.
Inserting an IN()-function and list the values therein, was also not successful.
Often I didn't receive any error messages, but the extracted observations were totally wrong.
Many thanks in advance,
Curt
Below a coding approach that will select rows where at least one variable contains a string Y1 ... Y100.
data have;
infile datalines truncover;
input (a1-a5 c f) ($);
datalines;
y1 . . . . x z
b b b b b
b a c d y3
d d d y2 y5
d d d d d y100
d d d d d y101
;
data want;
set have;
array vars {*} a1-a5 c f;
if prxmatch('/\b(y\d{1,2}|y100)\b/oi',strip(catx(' ',of vars[*])));
run;
proc print data=want;
run;
One always suspects data structure but here is one approach that works. Not 'multiple values' but placing them into a structure that shortens the code. I only provide 5 variables as that is all that is needed to show the result.
data have; input (a1-a5) ( $); datalines; y1 . . . . b b b b b b a c d y3 d d d y2 y5 ; data want; set have; array a (*) a1-a5; array t (5) $ 2 _temporary_ ('y1','y2','y3','y4','y5'); do i= 1 to dim(t); if whichc(t[i],of a(*))>0 then do; output; leave; end; end; drop i; run;
The T array would would require a definition for the length of the values large enough to hold the longest value. The _temporary_ definition means the variables are not written to the data set and stay the same as the initial set values unless you change them. If your actual values for y are sequential as shown you could load the array or replace it with a loop building the text of the search items. But real world data is seldom that nice so the temporary array should work though it will be long statement.
The "trick" is instead of using subsetting IF is the explicit Output when a condition is true coupled with the LEAVE instruction, again conditional, to exit the Do loop early.
The A array would have all the variables you want to search in so could include your C and D or other variables. I'm just too lazy to make complicated data to demonstrate.
I do include one example line with two values that are in the search for list to demonstrate only one gets written to the output set.
As a couple of asides: If you know that some of your values are more likely to occur than others then placing them first in the temporary array will improve efficiency of the code.
Also, if you wanted to indicate which value was considered as the match then you could a line like:
Found = t[i];
just before the Output statement; Define the length of Found to hold any expected value.
A further (but only minor) simplification of ballardw's elegant solution would be to replace the IF condition
whichc(t[i],of a(*))>0
by
t[i] in a
Below a coding approach that will select rows where at least one variable contains a string Y1 ... Y100.
data have;
infile datalines truncover;
input (a1-a5 c f) ($);
datalines;
y1 . . . . x z
b b b b b
b a c d y3
d d d y2 y5
d d d d d y100
d d d d d y101
;
data want;
set have;
array vars {*} a1-a5 c f;
if prxmatch('/\b(y\d{1,2}|y100)\b/oi',strip(catx(' ',of vars[*])));
run;
proc print data=want;
run;
That was very useful;
thanks to everyone, especially to you ballardw for your detailed explanations!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.