OS2Rules
Obsidian | Level 7

Hi All:

I've got (another) quick question that I hope someone has another simple answer for.

I would like to take a character string and do a frequency analysis on each character in the string over about 1 million records.  This is so that I can see how many special characters, symbols, numbers, etc. are in the string.  The reason is that I am trying to show that the data coming from an external source that is supposed to be "clean" is crap (i.e., there are special characters in the name fields and things like that).

If I can figure out how to do this for 1 string, I plan to carry it over to all of the character strings in the table.

Thanks to all in advance.

17 REPLIES
data_null__
Jade | Level 19

What do you consider special characters? 

Perhaps COUNTC would be helpful; it has options to exclude/include groups of characters that may fit your definition of special characters.
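For instance, a minimal sketch (the variable names are placeholders; the 'i' modifier ignores case and 'v' counts characters that do NOT appear in the list):

n_special = countc(string, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ', 'iv');   /* anything that is not a letter, digit, or blank */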

Sample data would help.

OS2Rules
Obsidian | Level 7

data_null:

I'm currently just looking at blank strings and the results of the ANYPUNCT and NOTPRINT functions.

What I need is a count of every character so I can see exactly which characters (and how many of each) there are in the string.

data_null__
Jade | Level 19

Do you need to know the individual count for each of the characters in the groups defined by ANYPUNCT and NOTPRINT?  COUNTC could give you a count for each group, or for both groups combined, for each record.

ballardw
Super User

There may be issues with encoding. If ASCII or EBCIDIC, something like this might help:

retain a1-a255 0;
array a{255} a1-a255;
do j = 1 to length(string);
   n = rank(substr(string, j, 1));   /* position of the character in the collating sequence */
   if n >= 1 then a{n} = a{n} + 1;   /* bump that character's counting bin */
end;

to count the characters by value in one specific string variable.
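A fuller sketch over a whole table might look like this (the data set name have and variable name string are assumptions):

data char_counts;
   set have end=eof;
   array a{255} a1-a255;
   retain a1-a255 0;
   do j = 1 to lengthn(string);          /* lengthn returns 0 for an empty string */
      n = rank(substr(string, j, 1));    /* ASCII code of the j-th character */
      if n >= 1 then a{n} = a{n} + 1;
   end;
   if eof then output;                   /* write only the final totals */
   keep a1-a255;
run;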

I'm not sure how to go about DBCS, though.

OS2Rules
Obsidian | Level 7

ballardw:

Not to worry about EBCDIC or DBCS - the whole thing is in ASCII.

Your solution looks interesting.... I will give it a try as time permits and let you know.

(also - my IBM System/370 Reference Summary GX20-1850-3 circa 1976 spells it:  EBCDIC)

ballardw
Super User

Typos.

Basically you have counting bins. One concern I have is that, with a million records and any significant length and/or number of string variables, you may exceed the storage limit of the counters, at least for common characters such as e, r, s, t, l, and n.

data_null__
Jade | Level 19

Can you modify your technique to just count the characters in ANYPUNCT and NOTPRINT groups?
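For instance, something like this inside the loop (an untested sketch reusing the a1-a255 bins from above):

length c $ 1;
c = substr(string, j, 1);
if anypunct(c) or notprint(c) then do;   /* count only punctuation and nonprintables */
   n = rank(c);
   if n >= 1 then a{n} = a{n} + 1;
end;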

ballardw
Super User

The fun part is printing a report when the nonprintable characters are the very ones you want to show ...

This is a way to start building the list of wanted punctuation characters; something similar would work for NOTPRINT.

data test;
   do dec = 0 to 255;
      byte = byte(dec);             /* the character with this code */
      hex = put(dec, hex2.);        /* its two-digit hex value */
      anypunct = anypunct(byte);    /* nonzero if it is a punctuation character */
      output;
   end;
run;

There are a number of ways you could use this information to count the occurrences of each character in your strings, but how you are going to print ASCII 7 (the bell, IIRC) in a report is another question...
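One hedged way around the printing problem, using the test table above: report the hex value rather than the raw byte, since hex digits are always printable.

proc print data=test;
   where anypunct or notprint(byte);   /* punctuation or nonprintable rows only */
   var dec hex;
run;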

And DBCS are a complete mystery to me for such things.

Astounding
PROC Star

OK, simple it is.  You could try to create a separate observation for each character (throwing away characters that you consider "correct").  For example:

data test;
   set have;
   length character $ 1;
   string = compress(string, , 'adps');   /* possibly experiment here */
   if string > ' ' then do _n_ = 1 to length(string);
      character = substr(string, _n_, 1);
      output;
   end;
   keep character;
run;

proc freq data=test;
   tables character;
run;
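If nonprintable characters make the PROC FREQ output hard to read, one possible variation (my own addition, not part of the original suggestion) is to display each character as its hex code:

proc freq data=test;
   tables character;
   format character $hex2.;   /* nonprintables show up as readable hex codes */
run;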

Of course, this detects individual characters only and not context errors.  "SAS-based" might be OK, while "SA-Sbased" is not.  But this program can't tell the difference.

Good luck.

OS2Rules
Obsidian | Level 7

Hi All:

Thanks for all the great ideas, but I think I may have bitten off more than I can chew and I'm going to have to add some constraints.

I was thinking along the lines of creating a 256-element array and doing a frequency of each character in my string based on the hex value but, as noted, the numbers may get too large.

I think I might be able to get away with a frequency count of all non-alpha characters in the field, something like:

array character{200} $ 1;   /* assuming a maximum field length of 200 */
array count{200};
do i = 1 to length(field);
   if indexc(upcase(substr(field, i, 1)), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') = 0 then do;
      character{i} = substr(field, i, 1);   /* the non-alpha character at position i */
      count{i} + 1;                         /* sum statement: tally hits at this position */
   end;
end;

(off the cuff, so treat it as a sketch rather than tested code).

Ksharp
Super User

@Astounding,

The OP has over one million obs. It is horrible to change the data structure from wide to long; that could generate a table with one billion obs. I would recommend using a hash table, if the OP has enough memory.

Xia Keshan
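A minimal sketch of that hash idea (the data set name have and variable name string are assumptions):

data _null_;
   if _n_ = 1 then do;
      length character $ 1 count 8;
      declare hash h(ordered: 'a');
      h.defineKey('character');
      h.defineData('character', 'count');
      h.defineDone();
   end;
   set have end=eof;
   do j = 1 to lengthn(string);
      character = substr(string, j, 1);
      if h.find() ne 0 then count = 0;   /* first time this character is seen */
      count = count + 1;
      h.replace();                       /* store the updated count */
   end;
   if eof then h.output(dataset: 'char_freq');   /* one row per distinct character */
run;

This keeps one row per distinct character rather than one row per character occurrence, so memory use is bounded by the size of the character set.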

Astounding
PROC Star

I understand, but in this case the "long" data set won't be very long.  Notice that I used COMPRESS to remove letters, numbers, spaces, and punctuation.  Whatever is left would become part of the output, one character at a time.  The desired character set may be more (or less) than that, which is why I indicated that it might be necessary to experiment with the right third parameter to COMPRESS.


OS2Rules
Obsidian | Level 7

Astounding:

Part of this exercise is to count all the punctuation, spaces, special characters, etc.  This way we can review the source data with the knowledge of how much cleansing each field needs.

Astounding
PROC Star

OK then, here are a couple of ideas to consider.

Do you really need 256 counters for each character variable?  Perhaps you could group some of those, such as all digits, all lowercase letters, all uppercase letters?
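For what it's worth, one way that grouping might look, using COMPRESS's 'k' (keep) modifier together with the class modifiers (the variable names are placeholders):

n_alpha = lengthn(compress(field, , 'ka'));   /* letters only */
n_digit = lengthn(compress(field, , 'kd'));   /* digits only */
n_space = lengthn(compress(field, , 'ks'));   /* whitespace only */
n_punct = lengthn(compress(field, , 'kp'));   /* punctuation only */
n_other = lengthn(field) - sum(n_alpha, n_digit, n_space, n_punct);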

If the counters become excessively large, does that really matter?

Good luck.

