Hi SAS Pros,
I am facing a QUESTION!
I am doing an almost entire character dataset. I found some texts with space after it. Specifically, I want to remove all the trailing space after text. I don't care about the leading space or the extra space in between. I have tried trim(var name), trimn(var name), strip(var name), and compress(var name). But none of them worked for my data. I don't why it doesn't.
Also, I am curious that why SAS cannot recognize the text with a trailing space, e.g. Egg; bread; carrot; and Egg; bread; carrot; (there is a trailing space after). I found the texts with trailing spaces were not successfully coded.
Thank you so much for any help!
Best regards,
C
Show us the code you have tried, by pasting your code into the box that appears when you click on the "running man" icon.
Show us a portion of your data as SAS data step code.
Thank you Paige and all of you who responded me with this thread!!!
I think we won't know what the exact issue it is until we see the portion of the data. Enclosed is the portion of the variable with trailing space from what I understand. Actually, it could be leading spaces or tab or return/enter or whatever. You can see from the Excel file that I uploaded. If you find there are two 'identical' values, there must be one without a trailing space and one with it, such as 'Atrial Fibrillation' and 'Atrial Fibrillation '. When I created dummy variable, SAS can only recognize the first 'Atrial Fibrillation'---the one without the trailing space after text. Same issues happened with 'Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema '. Because the variable doesn't have the value of this one without the trailing space, it only showed up once, but it still has a trailing space that I want to remove.
I have used the following syntax to generate the dummies, but realized that SAS doesn't recognize those ones with trailing spaces:
data have;
set want;
if significant_conditions in
('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',
'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ',
'Atrial Fibrillation',
'Atrial Fibrillation ',
'Cardiomegaly',
'Cardiomegaly ',
'Atherosclerotic and Hypertensive Cardiovascular Disease ',
'Atherosclerotic and Hypertensive Cardiovascular Disease') then heart_disease=1;
if significant_conditions in
('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',
'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ') then Alcoholism=1;
if significant_conditions in
('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',
'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ') then obesity=1;
if significant_conditions in
('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',
'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ') then respiratory_disease=1;
run;
If you really need to test if 'XXX' is different than 'XXX ' then you will probably need to change the spaces to some other character and then test. But I cannot think of any real situation where that would be needed, and your example program does not need it. In your current code you are repeating the same string to test in the IN list. Having the same string twice in the list does not make the result of the IN operator any different. Prove it to yourself by writing a simple test program.
data have ;
input string $20.;
cards;
xxx
abc
;
data want;
set have;
test1 = string in ('xxx');
test2 = string in ('xxx ');
test3 = string in ('xxx ');
run;
Results:
Obs string test1 test2 test3 1 xxx 1 1 1 2 abc 0 0 0
Please show some actual data for the variable significant_conditions in the dataset have and what result you want out.
Repeating the code I've already posted here to answer your question. This code will also work for your medical data - and it will clean-up your strings by replacing any combination of characters that are not a letter, number, underscore or dash with a blank.
data have;
infile datalines truncover dlm='|';
input id lunch :$100.;
datalines4;
1|egg; bread; carrot;
2|cheese; veggie juice; cucumber;
3|noodles; three pieces of pork ribs; lettuce
4|some bread stick; salmon; mango juice
5|milk; cookie; tomato;
6|bread; carrot; egg;
;;;;
data have_long(drop=_:);
length id 8;
set have(rename=(id=group_id));
length item $60;
do _i=1 by 1;
item=scan(lunch,_i,';');
item=prxchange('s/[^\w-]+/ /',-1,strip(item));
item=upcase(strip(item));
if missing(item) then leave;
id+1;
output;
end;
run;
proc format;
value $foodcat (multilabel)
'BREAD' = 'Carbon'
'CARROT' = 'Fiber'
'CHEESE' = 'Protein'
'COOKIE' = 'Carbon'
'CUCUMBER' = 'Fiber'
'EGG' = 'Protein'
'LETTUCE' = 'Fiber'
'MANGO JUICE' = 'Fiber'
'MANGO JUICE' = 'Juice'
'MILK' = 'Protein'
'NOODLES' = 'Carbon'
'SALMON' = 'Protein'
'SOME BREAD STICK' = 'Carbon'
'THREE PIECES OF PORK RIBS' = 'Protein'
'TOMATO' = 'Fiber'
'VEGGIE JUICE' = 'Fiber'
'VEGGIE JUICE' = 'Juice'
;
run;
proc summary data=have_long n nway;
class group_id;
class item / mlf;
id lunch;
format item $foodcat.;
output out=long_n;
run;
proc transpose data=long_n out=want(drop=_:) ;
by group_id lunch;
id item;
var _freq_;
run;
proc print data=want;
run;
Just create a list of distinct terms from the have_long table and then use this list to replace the format in above code.
The multilabel option in the format allows you to assign more than one category to a single term. The category names become the SAS variable names when transposing the data so make sure to use strings that are valid to use as SAS variable names.
proc sql;
select distinct item
from have_long;
quit;
I believe you're looking for compbl(). Solution approach to "clean-up" your data here where you're asking the same question but give us more information.
I may misunderstand how SAS stores strings.
SAS stores fixed-length character strings. If the string is shorter than the variable, its value is padded with trailing spaces. For this reason, when comparing values, SAS omits trailing spaces, thus 'A' is equal to 'A '.
Your question is not at all clear. You posted this in you quesiton:
Egg; bread; carrot;
and
Egg; bread; carrot;
Both of those two strings separated by AND look exactly the same as you can tell when posted using a monospaced font.
And problems like this are exactly why I suggested making a "long" set with single values and no delimiters. If your data source is so inconsistent then it really is up to you make it consistent. Or browbeat the source to fix the problems at the source.
I suspect it will be much easier to clean the data to a consistent behavior than the code around all of the garbage in long inconsistently delimited strings.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.