Hi All:
I've got (another) quick question that I hope someone has another simple answer for.
I would like to take a character string and do a frequency analysis on each character in the string over about 1 million
records. This is so that I can see how many special characters, symbols, numbers, etc. are in the string. The reason
is that I am trying to show that the data coming from an external source that is supposed to be "clean" is crap (i.e., there
are special characters in the name fields and things like that).
If I can figure out how to do this for 1 string, I plan to carry it over to all of the character strings in the table.
Thanks to all in advance.
What do you consider special characters?
Perhaps COUNTC would be helpful; it has options to exclude/include groups of characters that may fit your definition of special characters.
Sample data would help.
data_null:
I'm currently just looking at blank strings and the results of the ANYPUNCT and NOTPRINT functions.
What I need is a count of every character so I can see exactly which (and how many) of each character there are in the string.
Do you need to know the individual count for each of the characters in the group of characters defined by ANYPUNCT and NOTPRINT? COUNTC could give you a count for each group, or for both groups combined, for each record.
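To make the COUNTC idea concrete, here is a minimal, untested sketch. The data set name "have" and the variable name "string" are hypothetical; the 'p' modifier counts punctuation characters and 'c' counts control (non-printable) characters:

```sas
data counts;
   set have;                           /* "have" and "string" are placeholder names */
   n_punct   = countc(string, , 'p');  /* punctuation characters per record         */
   n_control = countc(string, , 'c');  /* control (non-printable) characters        */
run;
```

This gives one pair of counts per record rather than a per-character breakdown.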
There may be issues with encoding. If ASCII or EBCIDIC, something like this might help:
retain a1-a255 0;
array a {255} a1-a255;     /* one counting bin per ASCII code 1-255    */
do j = 1 to length(string);
   n = rank(substr(string, j, 1));
   if n > 0 then a{n} + 1; /* RANK returns the character's ASCII code  */
end;
to count the characters by value in one specific string variable.
I'm not sure how to go about the DBCS
ballardw:
Not to worry about EBCDIC or DBCS - the whole thing is in ASCII.
Your solution looks interesting.... I will give it at try as time permits and let you know.
(also - my IBM System/370 Reference Summary GX20-1850-3 circa 1976 spells it: EBCDIC)
Typos.
Basically you have counting bins. One concern I have is that with a million records and any significant length and/or number of string variables, you may exceed the storage limit, at least for common characters such as e, r, s, t, l, and n.
Can you modify your technique to just count the characters in ANYPUNCT and NOTPRINT groups?
The fun part is printing a report with nonprintable characters being desired ...
This is a way to start building the list of wanted punctuation characters; something similar would work for NOTPRINT.
data test;
   do dec = 0 to 255;
      byte = byte(dec);
      hex = put(dec, hex2.);
      anypunct = anypunct(byte);
      output;
   end;
run;
There are a number of ways you could use this information to count the number of occurrences of each character in some strings, but how are you going to print ASCII 7 (the bell, IIRC) for use...
And DBCS are a complete mystery to me for such things.
OK, simple it is. You could try to create a separate observation for each character (throwing away characters that you consider "correct"). For example:
data test;
   set have;
   length character $ 1;
   string = compress(string, , 'adps'); /* possibly experiment here */
   if string > ' ' then do _n_ = 1 to length(string);
      character = substr(string, _n_, 1);
      output;
   end;
   keep character;
run;
proc freq data=test;
   tables character;
run;
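Regarding the earlier point about printing non-printable characters: one possible workaround (a sketch, not tested) is to let PROC FREQ display each character's hex code instead of the raw character, using the $HEXw. format:

```sas
proc freq data=test;
   tables character;
   format character $hex2.;  /* show each character as its 2-digit hex code */
run;
```

This way ASCII 7 shows up as "07" in the report rather than ringing the bell.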
Of course, this detects individual characters only and not context errors. "SAS-based" might be OK, while "SA-Sbased" is not. But this program can't tell the difference.
Good luck.
Hi All:
Thanks for all the great ideas, but I think I may have bitten off more than I can chew and I'm going to have to add some constraints.
I was thinking along the lines of creating a 256 element array and doing a frequency of each character in my string based on the
hex value but, as noted, the numbers may get too large.
I think I might be able to get away with a frequency count of all non-alpha characters in the field, something like:
do i = 1 to length(field);
   if indexc(upcase(substr(field, i, 1)), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') = 0 then do;
      character(i) = substr(field, i, 1);
      count(i) + 1;
   end;
end;
(off the cuff so please ignore syntax).
@Astounding,
The OP has over one million obs. It would be horrible to change the data structure from wide to long; that could generate a table with one billion obs. I would recommend using a hash table, if the OP has enough memory.
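For what it's worth, here is an untested sketch of the hash-table approach: one pass over the data, accumulating a count per distinct character in memory, then writing the counts out once at the end. The names "have", "string", and "char_counts" are placeholders:

```sas
data _null_;
   length character $ 1;
   declare hash h();                  /* in-memory counting table        */
   h.defineKey('character');
   h.defineData('character', 'count');
   h.defineDone();
   do until (done);
      set have end=done;
      do _n_ = 1 to length(string);
         character = substr(string, _n_, 1);
         if h.find() ne 0 then count = 0; /* first time we see this char */
         count + 1;
         h.replace();                 /* store the updated count         */
      end;
   end;
   h.output(dataset: 'char_counts'); /* one row per distinct character  */
run;
```

The output stays at one row per distinct character (at most 256 for single-byte data), regardless of how many input records there are.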
Xia Keshan
I understand, but in this case the "long" data set won't be very long. Notice that I used COMPRESS to remove letters, numbers, spaces, and punctuation. Whatever is left would become part of the output, one character at a time. The desired character set may be more (or less) than that, which is why I indicated that it might be necessary to experiment with the right third parameter to COMPRESS.
Astounding:
Part of this exercise is to count all the punctuation, spaces, special characters, etc. This way we can review the source data with
the knowledge of how much cleansing each field needs.
OK then, here are a couple of ideas to consider.
Do you really need 256 counters for each character variable? Perhaps you could group some of those such as all digits, all lowercase letters, all uppercase letters?
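The grouping idea could look something like this rough, untested sketch, again using COUNTC with class modifiers (data set and variable names are hypothetical):

```sas
data grouped;
   set have;
   n_digit = countc(string, , 'd');   /* digits            */
   n_lower = countc(string, , 'l');   /* lowercase letters */
   n_upper = countc(string, , 'u');   /* uppercase letters */
   n_other = length(string) - (n_digit + n_lower + n_upper);
run;
```

That reduces 256 counters per variable to a handful, at the cost of losing the per-character detail.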
If the counters become excessively large, does that really matter?
Good luck.