OS2Rules
Obsidian | Level 7

Hi All:

I've got (another) quick question that I hope someone has another simple answer for.

I would like to take a character string and do a frequency analysis on each character in the string over about 1 million records.  This is so that I can see how many special characters, symbols, numbers, etc. are in the string.  The reason is that I am trying to show that the data coming from an external source that is supposed to be "clean" is crap (i.e., there are special characters in the name fields and things like that).

If I can figure out how to do this for 1 string, I plan to carry it over to all of the character strings in the table.

Thanks to all in advance.

17 REPLIES
data_null__
Jade | Level 19

What do you consider special characters? 

Perhaps COUNTC would be helpful; it has options to exclude/include groups of characters that may fit your definition of special characters.
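For instance, a minimal sketch (the variable names are placeholders; the 'i' modifier ignores case and 'v' counts characters that do NOT appear in the list):

n_special = countc(string, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ', 'iv');   /* anything that is not a letter, digit, or blank */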

Sample data would help.

OS2Rules
Obsidian | Level 7

data_null:

I'm currently just looking at blank strings and the results of the ANYPUNCT and NOTPRINT functions.

What I need is a count of every character so I can see exactly which characters (and how many of each) there are in the string.

data_null__
Jade | Level 19

Do you need to know the individual count for each of the characters in the groups defined by ANYPUNCT and NOTPRINT?  COUNTC could give you a count for each group, or for both groups combined, for each record.

ballardw
Super User

There may be issues with encoding. If ASCII or EBCIDIC, something like this might help:

retain a1-a255 0;
array a{255} a1-a255;
do j = 1 to length(string);
   n = rank(substr(string, j, 1));   /* position of the character in the collating sequence */
   if n >= 1 then a{n} = a{n} + 1;   /* bump that character's counting bin */
end;

to count the characters by value in one specific string variable.
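A fuller sketch over a whole table might look like this (the data set name have and variable name string are assumptions):

data char_counts;
   set have end=eof;
   array a{255} a1-a255;
   retain a1-a255 0;
   do j = 1 to lengthn(string);          /* lengthn returns 0 for an empty string */
      n = rank(substr(string, j, 1));    /* ASCII code of the j-th character */
      if n >= 1 then a{n} = a{n} + 1;
   end;
   if eof then output;                   /* write only the final totals */
   keep a1-a255;
run;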

I'm not sure how to go about DBCS, though.

OS2Rules
Obsidian | Level 7

ballardw:

Not to worry about EBCDIC or DBCS - the whole thing is in ASCII.

Your solution looks interesting.... I will give it a try as time permits and let you know.

(also - my IBM System/370 Reference Summary GX20-1850-3 circa 1976 spells it:  EBCDIC)

ballardw
Super User

Typos.

Basically you have counting bins. One concern I have is that, with a million records and any significant length and/or number of string variables, you may exceed the storage limit of the counters, at least for common characters such as e, r, s, t, l, and n.

data_null__
Jade | Level 19

Can you modify your technique to just count the characters in ANYPUNCT and NOTPRINT groups?
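For instance, something like this inside the loop (an untested sketch reusing the a1-a255 bins from above):

length c $ 1;
c = substr(string, j, 1);
if anypunct(c) or notprint(c) then do;   /* count only punctuation and nonprintables */
   n = rank(c);
   if n >= 1 then a{n} = a{n} + 1;
end;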

ballardw
Super User

The fun part is printing a report when the nonprintable characters are the very ones you want to show ...

This is a way to start building the list of wanted punctuation characters; something similar would work for NOTPRINT.

data test;
   do dec = 0 to 255;
      byte = byte(dec);             /* the character with this code */
      hex = put(dec, hex2.);        /* its two-digit hex value */
      anypunct = anypunct(byte);    /* nonzero if it is a punctuation character */
      output;
   end;
run;

There are a number of ways you could use this information to count the occurrences of each character in your strings, but how you are going to print ASCII 7 (the bell, IIRC) in a report is another question...
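One hedged way around the printing problem, using the test table above: report the hex value rather than the raw byte, since hex digits are always printable.

proc print data=test;
   where anypunct or notprint(byte);   /* punctuation or nonprintable rows only */
   var dec hex;
run;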

And DBCS are a complete mystery to me for such things.

Astounding
PROC Star

OK, simple it is.  You could try to create a separate observation for each character (throwing away characters that you consider "correct").  For example:

data test;
   set have;
   length character $ 1;
   string = compress(string, , 'adps');   /* possibly experiment here */
   if string > ' ' then do _n_ = 1 to length(string);
      character = substr(string, _n_, 1);
      output;
   end;
   keep character;
run;

proc freq data=test;
   tables character;
run;
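If nonprintable characters make the PROC FREQ output hard to read, one possible variation (my own addition, not part of the original suggestion) is to display each character as its hex code:

proc freq data=test;
   tables character;
   format character $hex2.;   /* nonprintables show up as readable hex codes */
run;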

Of course, this detects individual characters only and not context errors.  "SAS-based" might be OK, while "SA-Sbased" is not.  But this program can't tell the difference.

Good luck.

OS2Rules
Obsidian | Level 7

Hi All:

Thanks for all the great ideas, but I think I may have bitten off more than I can chew and I'm going to have to add some constraints.

I was thinking along the lines of creating a 256-element array and doing a frequency of each character in my string based on the hex value but, as noted, the numbers may get too large.

I think I might be able to get away with a frequency count of all non-alpha characters in the field, something like:

array character{200} $ 1;   /* assuming a maximum field length of 200 */
array count{200};
do i = 1 to length(field);
   if indexc(upcase(substr(field, i, 1)), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') = 0 then do;
      character{i} = substr(field, i, 1);   /* the non-alpha character at position i */
      count{i} + 1;                         /* sum statement: tally hits at this position */
   end;
end;

(off the cuff, so treat it as a sketch rather than tested code).

Ksharp
Super User

@Astounding,

The OP has over one million obs. It is horrible to change the data structure from wide to long; that could generate a table with one billion obs. I would recommend using a hash table, if the OP has enough memory.

Xia Keshan
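A minimal sketch of that hash idea (the data set name have and variable name string are assumptions):

data _null_;
   if _n_ = 1 then do;
      length character $ 1 count 8;
      declare hash h(ordered: 'a');
      h.defineKey('character');
      h.defineData('character', 'count');
      h.defineDone();
   end;
   set have end=eof;
   do j = 1 to lengthn(string);
      character = substr(string, j, 1);
      if h.find() ne 0 then count = 0;   /* first time this character is seen */
      count = count + 1;
      h.replace();                       /* store the updated count */
   end;
   if eof then h.output(dataset: 'char_freq');   /* one row per distinct character */
run;

This keeps one row per distinct character rather than one row per character occurrence, so memory use is bounded by the size of the character set.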

Astounding
PROC Star

I understand, but in this case the "long" data set won't be very long.  Notice that I used COMPRESS to remove letters, numbers, spaces, and punctuation.  Whatever is left would become part of the output, one character at a time.  The desired character set may be more (or less) than that, which is why I indicated that it might be necessary to experiment with the right third parameter to COMPRESS.


OS2Rules
Obsidian | Level 7

Astounding:

Part of this exercise is to count all the punctuation, spaces, special characters, etc.  This way we can review the source data with the knowledge of how much cleansing each field needs.

Astounding
PROC Star

OK then, here are a couple of ideas to consider.

Do you really need 256 counters for each character variable?  Perhaps you could group some of those, such as all digits, all lowercase letters, all uppercase letters?
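For what it's worth, one way that grouping might look, using COMPRESS's 'k' (keep) modifier together with the class modifiers (the variable names are placeholders):

n_alpha = lengthn(compress(field, , 'ka'));   /* letters only */
n_digit = lengthn(compress(field, , 'kd'));   /* digits only */
n_space = lengthn(compress(field, , 'ks'));   /* whitespace only */
n_punct = lengthn(compress(field, , 'kp'));   /* punctuation only */
n_other = lengthn(field) - sum(n_alpha, n_digit, n_space, n_punct);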

If the counters become excessively large, does that really matter?

Good luck.

