Re: Why trim/strip/compress statements don't work for my data when rem...

CynthiaWei · Posted 06-27-2020 08:39 PM

Hi SAS Pros,

I am facing a QUESTION!

I am doing an almost entire character dataset. I found some texts with space after it. Specifically, I want to remove all the trailing space after text. I don't care about the leading space or the extra space in between. I have tried trim(var name), trimn(var name), strip(var name), and compress(var name). But none of them worked for my data. I don't why it doesn't.

Also, I am curious that why SAS cannot recognize the text with a trailing space, e.g. Egg; bread; carrot; and Egg; bread; carrot; (there is a trailing space after). I found the texts with trailing spaces were not successfully coded.

Thank you so much for any help!

Best regards,

C

PaigeMiller · Posted 06-27-2020 09:00 PM

Show us the code you have tried, by pasting your code into the box that appears when you click on the "running man" icon.

Show us a portion of your data as SAS data step code.

--
Paige Miller

CynthiaWei · Posted 06-28-2020 03:33 PM

Thank you Paige and all of you who responded me with this thread!!!

I think we won't know what the exact issue it is until we see the portion of the data. Enclosed is the portion of the variable with trailing space from what I understand. Actually, it could be leading spaces or tab or return/enter or whatever. You can see from the Excel file that I uploaded. If you find there are two 'identical' values, there must be one without a trailing space and one with it, such as 'Atrial Fibrillation' and 'Atrial Fibrillation '. When I created dummy variable, SAS can only recognize the first 'Atrial Fibrillation'---the one without the trailing space after text. Same issues happened with 'Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema '. Because the variable doesn't have the value of this one without the trailing space, it only showed up once, but it still has a trailing space that I want to remove.

I have used the following syntax to generate the dummies, but realized that SAS doesn't recognize those ones with trailing spaces:

data have;

set want;

if significant_conditions in

('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',

'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ',

'Atrial Fibrillation',

'Atrial Fibrillation ',

'Cardiomegaly',

'Cardiomegaly ',

'Atherosclerotic and Hypertensive Cardiovascular Disease ',

'Atherosclerotic and Hypertensive Cardiovascular Disease') then heart_disease=1;

if significant_conditions in

('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',

'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ') then Alcoholism=1;

if significant_conditions in

('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',

'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ') then obesity=1;

if significant_conditions in

('Chronic Alcoholism; Atherosclerotic and Hypertensive Cardiovascular Disease; Emphysema ',

'Cardiomegaly; Arteriosclerotic Cardiovascular Disease; Morbid Obesity ') then respiratory_disease=1;

run;

Tom · Posted 06-28-2020 04:57 PM

If you really need to test if 'XXX' is different than 'XXX ' then you will probably need to change the spaces to some other character and then test. But I cannot think of any real situation where that would be needed, and your example program does not need it. In your current code you are repeating the same string to test in the IN list. Having the same string twice in the list does not make the result of the IN operator any different. Prove it to yourself by writing a simple test program.

data have ;
  input string $20.;
cards;
xxx
abc
;

data want;
  set have;
   test1 = string  in ('xxx');
   test2 = string in ('xxx     ');
   test3 = string in ('xxx                 ');
run;

Results:

Obs    string    test1    test2    test3

 1      xxx        1        1        1
 2      abc        0        0        0

Please show some actual data for the variable significant_conditions in the dataset have and what result you want out.

Patrick · Posted 06-28-2020 07:13 PM

@CynthiaWei

Repeating the code I've already posted here to answer your question. This code will also work for your medical data - and it will clean-up your strings by replacing any combination of characters that are not a letter, number, underscore or dash with a blank.

data have;
  infile datalines truncover dlm='|';
  input id lunch :$100.;
  datalines4;
1|egg; bread; carrot;
2|cheese;	veggie juice; cucumber;
3|noodles; three pieces	of pork ribs; lettuce 
4|some bread   stick; salmon; mango juice
5|milk; cookie; tomato;	
6|bread; carrot; egg;
;;;;

data have_long(drop=_:);
  length id 8;
  set have(rename=(id=group_id));
  length item $60;
  do _i=1 by 1;
    item=scan(lunch,_i,';');
    item=prxchange('s/[^\w-]+/ /',-1,strip(item));
    item=upcase(strip(item));
    if missing(item) then leave;
    id+1;
    output;
  end;
run;

proc format;
  value $foodcat (multilabel)
    'BREAD' = 'Carbon'
    'CARROT' = 'Fiber'
    'CHEESE' = 'Protein'
    'COOKIE' = 'Carbon'
    'CUCUMBER' = 'Fiber'
    'EGG' = 'Protein'
    'LETTUCE' = 'Fiber'
    'MANGO JUICE' = 'Fiber'
    'MANGO JUICE' = 'Juice'
    'MILK' = 'Protein'
    'NOODLES' = 'Carbon'
    'SALMON' = 'Protein'
    'SOME BREAD STICK' = 'Carbon'
    'THREE PIECES OF PORK RIBS' = 'Protein'
    'TOMATO' = 'Fiber'
    'VEGGIE JUICE' = 'Fiber'
    'VEGGIE JUICE' = 'Juice'
    ;
run;

proc summary data=have_long n nway;
  class group_id;
  class item / mlf;
  id lunch;
  format item $foodcat.;
  output out=long_n;
run; 

proc transpose data=long_n out=want(drop=_:) ;
  by group_id lunch;
  id item;
  var _freq_;
run;

proc print data=want;
run;

Just create a list of distinct terms from the have_long table and then use this list to replace the format in above code.

The multilabel option in the format allows you to assign more than one category to a single term. The category names become the SAS variable names when transposing the data so make sure to use strings that are valid to use as SAS variable names.

proc sql;
  select distinct item
  from have_long;
quit;

Patrick · Posted 06-27-2020 09:05 PM

I believe you're looking for compbl(). Solution approach to "clean-up" your data here where you're asking the same question but give us more information.

ChrisNZ · Posted 06-27-2020 10:53 PM

I may misunderstand how SAS stores strings.

SAS stores fixed-length character strings. If the string is shorter than the variable, its value is padded with trailing spaces. For this reason, when comparing values, SAS omits trailing spaces, thus 'A' is equal to 'A '.

High-Performance SAS Coding - Third Edition

Reeza · Posted 06-27-2020 11:32 PM

Your strings actually had leading spaces not trailing if you're referring to your previous question. You can see them by printing out your values using a hex format and seeing a 20 as the first digits which is the hex code for a space.

proc freq data=have;
table item;
format item $hex32.;
run;

Tom · Posted 06-27-2020 11:32 PM

Your question is not at all clear. You posted this in you quesiton:

Egg; bread; carrot; 
and 
Egg; bread; carrot;

Both of those two strings separated by AND look exactly the same as you can tell when posted using a monospaced font.

Kurt_Bremser · Posted 06-27-2020 11:39 PM

See Maxim 46.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 06-28-2020 01:27 PM

And problems like this are exactly why I suggested making a "long" set with single values and no delimiters. If your data source is so inconsistent then it really is up to you make it consistent. Or browbeat the source to fix the problems at the source.

I suspect it will be much easier to clean the data to a consistent behavior than the code around all of the garbage in long inconsistently delimited strings.

Why trim/strip/compress statements don't work for my data when removing trailing space after text?

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Re: Why trim/strip/compress statements don't work for my data when removing trailing space after tex

Registration is open

SAS Training: Just a Click Away