Help using Base SAS procedures

Frequency of Characters in a string

Reply
Super Contributor
Posts: 358

Frequency of Characters in a string

Hi All:

I've got (another) quick question that I hope someone as another simple answer for.

I would like to take a character string and do a frequency analysis on each character in the string over about 1 million

records.  This is so that I can see how many special characters, symbols, numbers, etc. are in the string.  The reason

is that I am trying to show that the data coming from an external source that is supposed to be "clean" is crap (ie: there

are special characters in the name fields and things like that).

If I can figure out how to do this for 1 string, I plan to carry it over to all of the character strings in the table.

Thanks to all in advance.

Respected Advisor
Posts: 3,777

Re: Frequency of Characters in a string

What do you consider special characters? 

Perhaps COUNTC would be helpful it has options to exclude/include groups of characters that may fit your definition of special characters.

Sample data would help.

Super Contributor
Posts: 358

Re: Frequency of Characters in a string

data_null:

I'm currently just looking at blank strings and the results of the ANYPUNCT and NOTPRINT functions.

What I need is a count of every character so I can see exactly which (and how many) of each character there are in the string.

Respected Advisor
Posts: 3,777

Re: Frequency of Characters in a string

Do you need to know the individual count for each of the characters in the group of characters defined by ANYPUNCT and NOTPRINT?  Because COUNTC could give you for each or both groups overall for each record.

Super User
Posts: 10,550

Re: Frequency of Characters in a string

There may be issues with encoding. If ASCII or EBCIDIC something like this might help:

Retain a1 - a255 0;

array char A1 -A255;

do j = 1 to length(string);

     n= rank(substr(string,j,1));

     A= A+1;

end;

to count the characters by value in one specific string variable.

I'm not sure how to go about the DBCS

Super Contributor
Posts: 358

Re: Frequency of Characters in a string

ballardw:

Not to worry about EBCDIC or DBCS - the whole thing is in ASCII.

Your solution looks interesting.... I will give it at try as time permits and let you know.

(also - my IBM System/370 Reference Summary GX20-1850-3 circa 1976 spells it:  EBCDIC)

Super User
Posts: 10,550

Re: Frequency of Characters in a string

Typos.

Basically you have counting bins. One concern I have, is with a million records and any significant length and/or number of string variables you may exceed the storage limit, at least for common characters such as e,r,s,t,l and n,

Respected Advisor
Posts: 3,777

Re: Frequency of Characters in a string

Can you modify your technique to just count the characters in ANYPUNCT and NOTPRINT groups?

Super User
Posts: 10,550

Re: Frequency of Characters in a string

The fun part is printing a report with nonprintable characters being desired ...

This is a way to build start building the wanted punctuation characters, similar would work for NOTPRINT.

data test;

do dec=0 to 255;

   byte=byte(dec);

   hex=put(dec,hex2.);

   anypunct=anypunct(byte);

   output;

end;

run;

A number of ways could use this information to count the number of occurrences of each character in some strings but how you are going to print ASCII 7 (the bell IIRC) for use...

And DBCS are a complete mystery to me for such things.

Super User
Posts: 5,099

Re: Frequency of Characters in a string

OK, simple it is.  You could try to create a separate observation for each character (throwing away characters that you consider "correct").  For example:

data test;

   set have;

   length character $ 1;

   string = compress(string,, 'adps');   /* possibly experiment here */

   if string > ' ' then do _n_=1 to length(string);

      character = substr(string, _n_, 1);

      output;

   end;

   keep character;

run;

proc freq data=test;

   tables character;

run;

Of course, this detects individual characters only and not context errors.  "SAS-based" might be OK, while "SA-Sbased" is not.  But this program can't tell the difference.

Good luck.

Super Contributor
Posts: 358

Re: Frequency of Characters in a string

Hi All:

Thanks for all the great ideas, but I think I may have bit off more than I can chew and I'm going to have to add some constraints.

I was thinking along the lines of creating a 256 element array and doing a frequency of each character in my string based on the

hex value but, as noted, the numbers may get too large.

I think I might be able to get away with a frequency count of all non-alpha characters in the field, something like:

do I=1 to length(field);

    if indexc(upcase(substr(field,i,1),'ABCDEFGHIJKLMNOPQRSTUVWXYZ'))= 0 then do;

        character(i)=substr(field,i,1);

        count(I)+1;

        end;

    end;

(off the cuff so please ignore syntax).

Super User
Posts: 9,691

Re: Frequency of Characters in a string

@Astounding,

OP has over one millions obs . It is horrible to change the data structure from wide to long . That could generate a table which have one billion obs . I would recommend to use Hash Table , if OP have enough memory .

Xia Keshan

Super User
Posts: 5,099

Re: Frequency of Characters in a string

I understand, but in this case the "long" data set won't be very long.  Notice that I used COMPRESS to remove letters, numbers, spaces, and punctuation.  Whatever is left would become part of the output, one character at a time.  The desired character set may be more (or less) than that, which is why I indicated that it might be necessary to experiment with the right third parameter to COMPRESS.


Super Contributor
Posts: 358

Re: Frequency of Characters in a string

Astounding:

Part of this exercise is to count all the punctuation, spaces, special characters, etc.  This way we can review the source data with

the knowledge of how much cleansing each field needs.

Super User
Posts: 5,099

Re: Frequency of Characters in a string

OK then, here are a couple of ideas to consider.

Do you really need 256 counters for each character variable?  Perhaps you could group some of those such as all digits, all lowercase letters, all uppercase letters?

If the counters become excessively large, does that really matter?

Good luck.

Ask a Question
Discussion stats
  • 17 replies
  • 467 views
  • 1 like
  • 6 in conversation