I am trying to count the number of times a particular string occurs over all my observations across 10 variables.
To test this, I've just been doing it over one variable (with the thinking that I could use a loop/macro to go over the others), but my solution is extremely inefficient. Part of the problem is that there are ~500 different strings I need to test for over 650,000 observations. Using SAS EG 7.1
Here is an abbreviated version of what I have:
data have;
input check a1a a2a a3a a1b a2b;
datalines;
a1a . . . . .;
a1a . . . . .;
a2a . . . . .;
a1b . . . . .;
a1b . . . . .;
run;
data want;
set have;
array progs a1a -- a2b;
array new x1 - x5;
do i = 1 to dim(progs);
if vname(progs{i}) = check then do;
progs{i} = 1;
new{i} + progs{i};
end;
end;
run;
I have taken the names of 'progs' (a1a, a2a, etc.) in this case from another data set and transposed and merged them to get them to appear here as variables because I wasn't sure how else to make them easily accessible as a list (perhaps an %Array?) and am trying very hard to avoid typing all 500 or so of them.
I also have the 2 step addition because when using progs{i} + 1; instead of progs{i} = 1; (I also tried using the sum() function, to no avail), I would end up with random 1s in a sea of missings. Making it 1 and adding the value to a separate variable does give me what I need, though.
What I ultimately want (in a report would also be fine) is the last data line to be:
a1b 2 1 . 2 .
Any help streamlining/reworking this would be greatly appreciated.
Your current description is confusing.
What does the list of search terms you want to find look like? Is it essentially a single variable dataset with 500 observations?
What does the data where you are going to search for the terms look like? Is it also a tall table with a single variable to try to locate the values within? Or something else? If it has multiple variables that need to be searched, why? What is the meaning of the multiple variables?
What is the criteria for finding the search term? Is it anywhere in the string or does it need to match a word in the string? Does CASE of the letter matter?
What is the output you want? Is it a count of the number of observations matching each string? Can an single observation match multiple strings? If the data to be searched has multiple variables what does it mean if the same term appears in more than one of the variables? Does that count as one observation found, or multiple hits found?
A small example with just a few observations and a couple of search terms would help a lot in explaining what you have. Please provide expected result for the provided sample data.
The sample data set I include is representative of what I have (there are other variables in it, but they aren't used for this analysis). The variable 'check' is the list I want to count (~650,000 observations) and the variables a1a, a2a, etc. are the search terms (~500 variations). I want to know how many times each of those terms occurs over the observations in 'check'.
What does the list of search terms you want to find look like? Is it essentially a single variable dataset with 500 observations?
Yes. I then transpose and merge it to get the example data set I provided.
What does the data where you are going to search for the terms look like? Is it also a tall table with a single variable to try to locate the values within? Or something else? If it has multiple variables that need to be searched, why? What is the meaning of the multiple variables?
The data I am searching is also tall, but has a lot of related and unrelated variables. They could be dropped to accomplish this, in which case it could be a single variable set as well.
There are 12 variables I need to search; each one corresponds to a ranked choice and the search terms are coded values for those choices. Since they are ranked, I will look at them separately, so I can just search over the single column (as in the example).
What is the criteria for finding the search term? Is it anywhere in the string or does it need to match a word in the string? Does CASE of the letter matter?
The strings must match, case does not matter (but the set is clean and that should not be an issue). There is very little chance of strings that don't match one of the search terms showing up.
What is the output you want? Is it a count of the number of observations matching each string? Can an single observation match multiple strings? If the data to be searched has multiple variables what does it mean if the same term appears in more than one of the variables? Does that count as one observation found, or multiple hits found?
The sample output is what I'm looking for - just a simple count of how many times each search term shows up. A single observation is only able to match one string (by the nature of the search terms - they are distinct and match the observations exactly), and thus can only be counted once.
The expected result above was just the last line, but doesn't correctly represent the code. Here is what the output set looks like:
check a1a a2a a3a a1b a2b x1 x2 x3 x4 x5
a1a 1 . . . . 1 . . . .
a1a 1 . . . . 2 . . . .
a2a . 1 . . . 2 1 . . .
a1b . . . 1 . 2 1 . 1 .
a1b . . . 1 . 2 1 . 2 .
I can't provide the actual data (NDA... also the 500 variable width would make it difficult), but this sample data is very close to the actual data.As single variable lists, they look more like this:
check progs
a1a a1a
a1a a2a
a2a a3a
a1b a1b
a1b a2b
... ...
Thanks for taking a look!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.