BookmarkSubscribeRSS Feed
RyanJB
Obsidian | Level 7

I am trying to count the number of times a particular string occurs over all my observations across 10 variables.

 

To test this, I've just been doing it over one variable (with the thinking that I could use a loop/macro to go over the others), but my solution is extremely inefficient. Part of the problem is that there are ~500 different strings I need to test for over 650,000 observations. Using SAS EG 7.1

 

Here is an abbreviated version of what I have:

 

data have;
	input check a1a a2a a3a a1b a2b;
	datalines;
	a1a . . . . .;
	a1a . . . . .;
	a2a . . . . .;
	a1b . . . . .;
	a1b . . . . .;
run;

data want;
	set have;	
	array progs a1a -- a2b;
	array new x1 - x5;
	do i = 1 to dim(progs);
		if vname(progs{i}) = check then do; 
			progs{i} = 1;
			new{i} + progs{i};
		end;
	end;
run;

I have taken the names of 'progs' (a1a, a2a, etc.) in this case from another data set and transposed and merged them to get them to appear here as variables because I wasn't sure how else to make them easily accessible as a list (perhaps an %Array?) and am trying very hard to avoid typing all 500 or so of them.

 

I also have the 2 step addition because when using progs{i} + 1; instead of progs{i} = 1; (I also tried using the sum() function, to no avail), I would end up with random 1s in a sea of missings. Making it 1 and adding the value to a separate variable does give me what I need, though.

 

What I ultimately want (in a report would also be fine) is the last data line to be:

a1b 2 1 . 2 .

 

Any help streamlining/reworking this would be greatly appreciated.

2 REPLIES 2
Tom
Super User Tom
Super User

Your current description is confusing.

 

What does the list of search terms you want to find look like?  Is it essentially a single variable dataset with 500 observations?

 

What does the data where you are going to search for the terms look like?  Is it also a tall table with a single variable to try to locate the values within?  Or something else?  If it has multiple variables that need to be searched, why?  What is the meaning of the multiple variables?

 

What is the criteria for finding the search term?  Is it anywhere in the string or does it need to match a word in the string? Does CASE of the letter matter?

 

What is the output you want?  Is it a count of the number of observations matching each string?  Can an single observation match multiple strings?  If the data to be searched has multiple variables what does it mean if the same term appears in more than one of the variables?  Does that count as one observation found, or multiple hits found?

 

A small example with just a few observations and a couple of search terms would help a lot in explaining what you have. Please provide expected result for the provided sample data.

RyanJB
Obsidian | Level 7

The sample data set I include is representative of what I have (there are other variables in it, but they aren't used for this analysis). The variable 'check' is the list I want to count (~650,000 observations) and the variables a1a, a2a, etc. are the search terms (~500 variations). I want to know how many times each of those terms occurs over the observations in 'check'.

What does the list of search terms you want to find look like? Is it essentially a single variable dataset with 500 observations?

 

Yes. I then transpose and merge it to get the example data set I provided.

What does the data where you are going to search for the terms look like? Is it also a tall table with a single variable to try to locate the values within? Or something else? If it has multiple variables that need to be searched, why? What is the meaning of the multiple variables?

The data I am searching is also tall, but has a lot of related and unrelated variables. They could be dropped to accomplish this, in which case it could be a single variable set as well.

 

There are 12 variables I need to search; each one corresponds to a ranked choice and the search terms are coded values for those choices. Since they are ranked, I will look at them separately, so I can just search over the single column (as in the example).

 

What is the criteria for finding the search term? Is it anywhere in the string or does it need to match a word in the string? Does CASE of the letter matter?

The strings must match, case does not matter (but the set is clean and that should not be an issue). There is very little chance of strings that don't match one of the search terms showing up.

What is the output you want? Is it a count of the number of observations matching each string? Can an single observation match multiple strings? If the data to be searched has multiple variables what does it mean if the same term appears in more than one of the variables? Does that count as one observation found, or multiple hits found?

The sample output is what I'm looking for - just a simple count of how many times each search term shows up. A single observation is only able to match one string (by the nature of the search terms - they are distinct and match the observations exactly), and thus can only be counted once.

 

The expected result above was just the last line, but doesn't correctly represent the code. Here is what the output set looks like:

check  a1a  a2a  a3a  a1b  a2b  x1  x2  x3  x4  x5
a1a     1    .    .    .    .    1   .   .   .   .
a1a     1    .    .    .    .    2   .   .   .   .
a2a     .    1    .    .    .    2   1   .   .   .
a1b     .    .    .    1    .    2   1   .   1   .
a1b     .    .    .    1    .    2   1   .   2   .

I can't provide the actual data (NDA... also the 500 variable width would make it difficult), but this sample data is very close to the actual data.As single variable lists, they look more like this:

 

check                            progs
a1a                               a1a
a1a                               a2a
a2a                               a3a
a1b                               a1b
a1b                               a2b
... ...

Thanks for taking a look!

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 452 views
  • 0 likes
  • 2 in conversation