10-23-2014 03:13 PM
I've got (another) quick question that I hope someone has another simple answer for.
I would like to take a character string and do a frequency analysis on each character in the string over about 1 million
records. This is so that I can see how many special characters, symbols, numbers, etc. are in the string. The reason
is that I am trying to show that the data coming from an external source that is supposed to be "clean" is crap (i.e., there
are special characters in the name fields and things like that).
If I can figure out how to do this for 1 string, I plan to carry it over to all of the character strings in the table.
Thanks to all in advance.
10-23-2014 03:30 PM
What do you consider special characters?
Perhaps COUNTC would be helpful; it has options to exclude/include groups of characters that may fit your definition of special characters.
Sample data would help.
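For example (a sketch only — HAVE and STRING are placeholder names, not from the thread), COUNTC's 'k' modifier inverts the list built by the other modifiers, so one call can count everything that is not a letter, digit, or space:

```sas
data counts;
   set have;
   /* k = count characters NOT in the list; a = alpha, d = digit, s = space */
   n_special = countc(string, , 'kads');
run;
```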
10-23-2014 03:42 PM
I'm currently just looking at blank strings and the results of the ANYPUNCT and NOTPRINT functions.
What I need is a count of every character so I can see exactly which (and how many) of each character there are in the string.
10-23-2014 03:52 PM
Do you need to know the individual count for each of the characters in the groups defined by ANYPUNCT and NOTPRINT? COUNTC could give you an overall count per record for each group, or for both combined.
10-23-2014 03:38 PM
There may be issues with encoding. If ASCII or EBCIDIC, something like this might help:
retain a0-a255 0;
array a {0:255} a0-a255;
do j = 1 to length(string);
   a{rank(substr(string, j, 1))} + 1;
end;
to count the characters by value in one specific string variable.
I'm not sure how to go about the DBCS
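Putting the pieces together as a hedged sketch (HAVE, STRING, and CHARFREQ are placeholder names): accumulate one retained bin per ASCII code with RANK, then write one row per character code at end of file:

```sas
data charfreq;
   set have end=last;
   array a {0:255} a0-a255;               /* sum statement below implies RETAIN */
   do j = 1 to length(string);
      a{rank(substr(string, j, 1))} + 1;  /* RANK maps a character to 0-255 */
   end;
   if last then do dec = 0 to 255;
      count = a{dec};
      if count > 0 then output;           /* one row per character code seen */
   end;
   keep dec count;
run;
```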
10-23-2014 03:49 PM
Not to worry about EBCDIC or DBCS - the whole thing is in ASCII.
Your solution looks interesting.... I will give it a try as time permits and let you know.
(also - my IBM System/370 Reference Summary GX20-1850-3 circa 1976 spells it: EBCDIC)
10-23-2014 03:53 PM
Basically you have counting bins. One concern I have: with a million records and any significant length and/or number of string variables, you may exceed the storage limit of the counters, at least for common characters such as e, r, s, t, l, and n.
10-23-2014 07:37 PM
The fun part is printing a report with nonprintable characters being desired ...
This is a way to start building the list of wanted punctuation characters; something similar would work for NOTPRINT.
do dec=0 to 255;
There are a number of ways you could use this information to count the number of occurrences of each character in some strings, but how are you going to print ASCII 7 (the bell, IIRC) in a report...
And DBCS are a complete mystery to me for such things.
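A fuller sketch of that loop (the _NULL_ step and the PUNCTLIST variable are my additions, not from the post): walk all 256 codes, keep the ones ANYPUNCT flags, and print the result to the log:

```sas
data _null_;
   length punctlist $ 64;          /* room for the punctuation set */
   do dec = 0 to 255;
      if anypunct(byte(dec)) then
         punctlist = cats(punctlist, byte(dec));
   end;
   put punctlist=;                 /* the collected punctuation characters */
run;
```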
10-23-2014 04:47 PM
OK, simple it is. You could try to create a separate observation for each character (throwing away characters that you consider "correct"). For example:
data test;
   set have;  /* HAVE = your input data set */
   length character $ 1;
   string = compress(string, , 'adps'); /* possibly experiment here */
   if string > ' ' then do _n_ = 1 to length(string);
      character = substr(string, _n_, 1);
      output;
   end;
run;
proc freq data=test; tables character; run;
Of course, this detects individual characters only and not context errors. "SAS-based" might be OK, while "SA-Sbased" is not. But this program can't tell the difference.
10-24-2014 08:34 AM
Thanks for all the great ideas, but I think I may have bitten off more than I can chew, and I'm going to have to add some constraints.
I was thinking along the lines of creating a 256 element array and doing a frequency of each character in my string based on the
hex value but, as noted, the numbers may get too large.
I think I might be able to get away with a frequency count of all non-alpha characters in the field, something like:
do i = 1 to length(field);
   if indexc(upcase(substr(field, i, 1)), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') = 0 then do;
(off the cuff so please ignore syntax).
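Finished off, that idea might look like this sketch (FIELD, HAVE, and NONALPHA are placeholder names): output each non-alpha character and let PROC FREQ do the counting:

```sas
data nonalpha;
   set have;
   length character $ 1;
   do i = 1 to length(field);
      character = substr(field, i, 1);
      /* keep only characters that are not A-Z after upcasing */
      if indexc(upcase(character), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') = 0 then output;
   end;
   keep character;
run;

proc freq data=nonalpha order=freq;
   tables character;
run;
```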
10-24-2014 11:31 AM
The OP has over one million obs. It would be horrible to change the data structure from wide to long; that could generate a table with a billion obs. I would recommend using a hash table, if the OP has enough memory.
10-28-2014 10:28 AM
I understand, but in this case the "long" data set won't be very long. Notice that I used COMPRESS to remove letters, numbers, spaces, and punctuation. Whatever is left would become part of the output, one character at a time. The desired character set may be more (or less) than that, which is why I indicated that it might be necessary to experiment with the right third parameter to COMPRESS.
10-28-2014 10:46 AM
Part of this exercise is to count all the punctuation, spaces, special characters, etc. This way we can review the source data with
the knowledge of how much cleansing each field needs.
10-28-2014 11:27 AM
OK then, here are a couple of ideas to consider.
Do you really need 256 counters for each character variable? Perhaps you could group some of those such as all digits, all lowercase letters, all uppercase letters?
If the counters become excessively large, does that really matter?
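That grouping could be done with COUNTC modifiers in a single pass — a sketch, with HAVE and STRING again as placeholder names ('u' = uppercase, 'l' = lowercase, 'd' = digits, 's' = whitespace, 'k' = invert):

```sas
data grouped;
   set have;
   n_upper = countc(string, , 'u');
   n_lower = countc(string, , 'l');
   n_digit = countc(string, , 'd');
   n_space = countc(string, , 's');
   n_other = countc(string, , 'kads'); /* anything not alphanumeric or space */
run;
```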