Data cleaning techniques : duplicate proc freq variable

uopsouthpaw · Posted 04-29-2019 05:10 PM

I have some pretty messy text data that I need to clean for consistency (case, spaces, spelling, etc.). I'm using proc freq to check my progress with the entries, and DEATH shows up twice in my frequency table. I'd appreciate any guidance. Here is an example of where I'm going:

data import1_2;
set Import1 (rename=('Harm Code'n = 'Harm Code: Raw'n));
'Harm Code'n = compbl(strip(left(upcase('Harm Code: Raw'n))));

...

run;

*Split observations by delimiter;

Data Import1_Harms (rename=(new='Harm Code 2'n));

length new $50.;
set Import1_2;
do i=1 by 1 while(scan('Harm Code'n,i,',') ^=' ');
new=scan('Harm Code'n,i,',');
output;
end;
run;

proc freq data=Import1_Harms;
table 'Harm Code 2'n / missing;
run;

Harm Code 2	Frequency	Percent	Cumulative Frequency	Cumulative Percent
ABNORMAL BLOOD LOSS	7	0.37	7	0.37
ACCESS SITE COMPLICATIONS	5	0.27	13	0.69
CONVERSION	1	0.05	251	13.33
DEATH	31	1.65	287	15.24
DEATH (AAA RELATED)	4	0.21	291	15.45
DEATH (AAA)	6	0.32	297	15.77
DEATH (INCONCLUSIVE)	2	0.11	299	15.88
DEATH (INDETERMINATE)	1	0.05	300	15.93
…….
AS1	1	0.05	902	47.9
BL3	2	0.11	904	48.01
CMP	5	0.27	909	48.27
COMPLICATIONS	10	0.53	919	48.81
CONVERSION	1	0.05	920	48.86
CTI	1	0.05	923	49.02
DEATH	13	0.69	936	49.71
DEATH(UNKNOWN CAUSE)	1	0.05	953	50.61

Death (and other values) show up twice in the frequency list. Does this have something to do with the original format and field value?

Thanks,

Wes

Reeza · Posted 04-29-2019 05:27 PM

Use COMPRESS() to remove invisible blanks such as tabs or returns. You can use the modifiers to specify invisible blanks.
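
A minimal sketch of that suggestion, using made-up sample data (the data set names messy and cleaned and the variable x are hypothetical, not from the original post). It assumes the stray characters are control characters such as tabs, which the 'c' modifier of COMPRESS() adds to the list of characters to remove; STRIP() then handles ordinary leading and trailing blanks:

```sas
/* Hypothetical sample data: three values that look alike when printed */
data messy;
   length x $25;
   x = 'DEATH';           output;
   x = '09'x || 'DEATH';  output;  /* leading tab character */
   x = ' DEATH';          output;  /* leading blank */
run;

data cleaned;
   set messy;
   /* Second argument left empty; 'c' removes control characters
      (tab, carriage return, line feed). STRIP removes leading and
      trailing blanks but keeps blanks inside the value. */
   x = strip(compress(x, , 'c'));
run;

proc freq data=cleaned;
   tables x / missing;
run;
```

Note that the 's' modifier would also remove ordinary blanks inside a value such as ABNORMAL BLOOD LOSS, so 'c' is probably the safer choice for multi-word entries.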

ballardw · Posted 04-29-2019 05:39 PM

You may only need to remove leading blanks. The sort order is apparently affected by whatever is causing your problem. A space sorts before any letter. Proc freq and other procedures do not display leading spaces by default, but the values are still distinct and the order of the output is affected.

data example;
  length x $25 ;
  x=' Death'; output;
  x='Accident';output;
  x='Death'; output;
run;

proc freq; run;

data example2;
   set example;
   /* removes leading spaces among other things*/
   x=strip(x);
run;

proc freq data=example2;run;

Other procedures let you use style options to reveal the leading spaces:

proc tabulate data=example;
   class x;
   classlev x /style=[Asis=on];
   table x,n;
run;

ASIS=on tells the procedure not to remove the leading space for display.
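
Applied to a splitting step like the one in the question, the fix might look like the sketch below. It assumes the raw values are separated by a comma followed by a blank, so that SCAN() returns the second and later pieces with a leading space; the data set names raw and split and the variable harm are hypothetical stand-ins for the poster's data:

```sas
/* Hypothetical sample data with comma-plus-blank delimiters */
data raw;
   length harm $50;
   harm = 'DEATH, CONVERSION';  output;
   harm = 'CONVERSION, DEATH';  output;
run;

data split;
   length new $50;
   set raw;
   /* Loop over the comma-delimited pieces until SCAN returns blank */
   do i = 1 by 1 while (scan(harm, i, ',') ^= ' ');
      /* STRIP drops the blank left behind after each comma */
      new = strip(scan(harm, i, ','));
      output;
   end;
   drop i;
run;

proc freq data=split;
   tables new / missing;
run;
```

With the STRIP() call in place, each distinct value should collapse to a single row in the frequency table; without it, the pieces that followed a comma keep their leading blank and are counted separately.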
