I have some pretty messy text data that I need to clean for consistency (case, spaces, spelling, etc.). I'm using PROC FREQ to check my progress as I work through the entries, and DEATH shows up twice in my frequency table. I'd appreciate any guidance. Here is an example of what I'm doing:
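data import1_2;
    /* RENAME= needs its old=new pairs wrapped in parentheses */
    set Import1 (rename=('Harm Code'n = 'Harm Code: Raw'n));
    /* Uppercase, trim, and collapse repeated blanks */
    'Harm Code'n = compbl(strip(left(upcase('Harm Code: Raw'n))));
    ...
run;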
*Split observations by delimiter;
data Import1_Harms (rename=(new='Harm Code 2'n));
    length new $50.;
    set Import1_2;
    do i=1 by 1 while(scan('Harm Code'n,i,',') ^=' ');
        new=scan('Harm Code'n,i,',');
        output;
    end;
run;

proc freq data=Import1_Harms;
    table 'Harm Code 2'n / missing;
run;
Harm Code 2 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
ABNORMAL BLOOD LOSS | 7 | 0.37 | 7 | 0.37 |
ACCESS SITE COMPLICATIONS | 5 | 0.27 | 13 | 0.69 |
CONVERSION | 1 | 0.05 | 251 | 13.33 |
DEATH | 31 | 1.65 | 287 | 15.24 |
DEATH (AAA RELATED) | 4 | 0.21 | 291 | 15.45 |
DEATH (AAA) | 6 | 0.32 | 297 | 15.77 |
DEATH (INCONCLUSIVE) | 2 | 0.11 | 299 | 15.88 |
DEATH (INDETERMINATE) | 1 | 0.05 | 300 | 15.93 |
……. | | | | |
AS1 | 1 | 0.05 | 902 | 47.9 |
BL3 | 2 | 0.11 | 904 | 48.01 |
CMP | 5 | 0.27 | 909 | 48.27 |
COMPLICATIONS | 10 | 0.53 | 919 | 48.81 |
CONVERSION | 1 | 0.05 | 920 | 48.86 |
CTI | 1 | 0.05 | 923 | 49.02 |
DEATH | 13 | 0.69 | 936 | 49.71 |
DEATH(UNKNOWN CAUSE) | 1 | 0.05 | 953 | 50.61 |
DEATH (and other values, such as CONVERSION) shows up twice in the frequency list. Does this have something to do with the original format and field value?
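One check that might narrow this down, assuming the two DEATH rows differ by a character that does not show in the printed table (for example a leading blank left over after splitting on the comma, which is only a guess at this point): rerun the frequency table on the matching rows with a hex format, so every byte of the stored value is visible. This is a minimal sketch using the same dataset and variable as above; the `$hex40.` width is an arbitrary choice that shows the first 20 bytes.

```sas
* Diagnostic sketch: list the DEATH-like values as hex to expose hidden characters;
proc freq data=Import1_Harms;
    where 'Harm Code 2'n contains 'DEATH';
    table 'Harm Code 2'n / missing;
    * $hex40. prints the first 20 bytes of each value as hexadecimal pairs;
    format 'Harm Code 2'n $hex40.;
run;
```

In an ASCII session DEATH is 4445415448 and a blank is 20, so a row whose hex string begins with 20 would point to a leading space, and any other byte in front of 44 would point to some other non-printing character.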
Thanks,
Wes