Dear all,
I apologize for asking a similar question to one I asked sometime ago, but it does not work with string variables and that is what i need now.
I have a variable VAR which includes a series of words. I need to create a new variable multiple new variables which correspond to the frequencies of certain words in VAR.
In the example below, starting out with VAR, I need to create NEWRED, NEWBLUE and NEWYELLOW, which would have the values as indicated.
Example:
OBS | VAR | NEWRED | NEWBLUE | NEWYELLOW |
1 | red blue red red blue yellow | 3 | 2 | 1 |
2 | red | 1 | 0 | 0 |
3 | blue yellow | 0 | 1 | 1 |
Thanks in advance for your attention.
E
data have;
input OBS VAR & $50. ;*NEWRED NEWBLUE NEWYELLOW;
cards;
1 red blue red red blue yellow 3 2 1
2 red 1 0 0
3 blue yellow 0 1 1
;
data want;
set have;
NEWRED=count(var,'red');
NEWBLUE=count(var,'blue');
newyellow=count(var,'yellow');
run;
OBS | VAR | NEWRED | NEWBLUE | newyellow |
---|---|---|---|---|
1 | red blue red red blue yellow | 3 | 2 | 1 |
2 | red | 1 | 0 | 0 |
3 | blue yellow | 0 | 1 | 1 |
data have;
input OBS VAR & $50. ;*NEWRED NEWBLUE NEWYELLOW;
cards;
1 red blue red red blue yellow 3 2 1
2 red 1 0 0
3 blue yellow 0 1 1
;
data want;
set have;
NEWRED=count(var,'red');
NEWBLUE=count(var,'blue');
newyellow=count(var,'yellow');
run;
OBS | VAR | NEWRED | NEWBLUE | newyellow |
---|---|---|---|---|
1 | red blue red red blue yellow | 3 | 2 | 1 |
2 | red | 1 | 0 | 0 |
3 | blue yellow | 0 | 1 | 1 |
Thank you, this works well and it's efficient.
I should have, however, specified a bit further, since what I also need is a NEW variable that counts the occurrence of multiple colors.
See below variable called REDANDBLUE. Of course, i could calculate it in a further step, as a sum of NEWRED and NEWBLUE, but I wonder whether it could be done in one step. I tried NEWRED=count(var,'red' 'blue') and I have tried NEWRED=count(var,'red', 'blue'), but neither works.
Again, sorry for not posting the whole thing at once.
e
OBS | VAR | NEWRED | NEWBLUE | NEWYELLOW | REDANDBLUE |
1 | red blue red red blue yellow | 3 | 2 | 1 | 5 |
2 | red | 1 | 0 | 0 | 1 |
3 | blue yellow | 0 | 1 | 1 | 1 |
HI @emaneman Add another assignment statement in the same datastep-
data want;
set have;
NEWRED=count(var,'red');
NEWBLUE=count(var,'blue');
newyellow=count(var,'yellow');
REDANDBLUE=sum(newred,newblue);
run;
It looks like you just need to add.
REDANDBLUE=NEWRED+NEWBLUE;
@emaneman wrote:
Thank you, this works well and it's efficient.
I should have, however, specified a bit further, since what I also need is a NEW variable that counts the occurrence of multiple colors.
See below variable called REDANDBLUE. Of course, i could calculate it in a further step, as a sum of NEWRED and NEWBLUE, but I wonder whether it could be done in one step. I tried NEWRED=count(var,'red' 'blue') and I have tried NEWRED=count(var,'red', 'blue'), but neither works.
Again, sorry for not posting the whole thing at once.
e
OBS VAR NEWRED NEWBLUE NEWYELLOW REDANDBLUE 1 red blue red red blue yellow 3 2 1 5 2 red 1 0 0 1 3 blue yellow 0 1 1 1
Why is "RedAndBlue" = 1 in the 2nd and 3rd obs, both don't have "RED BLUE" as value.
The varname is a bit misleading. I don't mean that both red and blue should appear for REDANDBLUE to have a value or 1 or more, but simply counting the number of either red or blue in VAR. As it has been pointed out, I can of course do the sum of the variables RED and BLUE, which I can create with the COUNT function, but I was wondering whether the COUNT function can count different occurrences (eg. red, blue) in one go. But it does not seem to be the case, since REDBLUE=count(var,'red', 'blue') does not work.
@emaneman wrote:
The varname is a bit misleading. I don't mean that both red and blue should appear for REDANDBLUE to have a value or 1 or more, but simply counting the number of either red or blue in VAR. As it has been pointed out, I can of course do the sum of the variables RED and BLUE, which I can create with the COUNT function, but I was wondering whether the COUNT function can count different occurrences (eg. red, blue) in one go. But it does not seem to be the case, since REDBLUE=count(var,'red', 'blue') does not work.
Well, have a look at the docs of the function "count" ... the function is just not working that way. Other have already posted a solution, just adding the counts for blue and red will solve the problem.
Try this
data have;
input VAR $50.;
datalines;
red blue red red blue yellow
red
blue yellow
;
data want;
set have;
newred = count(VAR, 'red');
newblue = count(VAR, 'blue');
newyellow = count(VAR, 'yellow');
run;
When discussing about occurrence of one string within another you ALWAYS have to include whether this is supposed to include composite words that may contain your target values. So if you are interested in "red" do you count "redfish" or "toured" as an occurence?
I realize that your actual data is likely to be more complex and use different strings but the shorter your target and/or more common the letters in the target the more likely to have that target string appear in a composite word or part of another word.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.