Solved: Re: Count each type of character

Jbraun · Posted 04-28-2022 07:24 AM

I have data like this (simplified version):

ID Text

1 ABC

2 ABCD

3 ABCDE

I wish to count how many times each type of character appears. But: the actual data it is a big dataset and I don't know what types of characters that may appear. In the example above, A, B, C, D and E appear, but in the actual data any unicode character may appear. The purpose is to remove certain characters since they seem to cause trouble when I try to import it in SQL, but it seems that I can import it in SAS. Therefore I try to figure find out which characters I need to remove to be able to import it in SQL. I found countc which allows me to specify a particular character I wish to count, which would work fine for a limited number of characters, but in this case I expect to get a table of many hundred lines in the output since I have several million lines in the raw data.

Any help would be very appreciated

Kind regards,

Jacob

andreas_lds · Posted 04-28-2022 07:44 AM

You should create a list of chars that don't cause any trouble, so that they can be removed with kcompress before variable text is analysed.

This could solve your problem:

data _null_;
   set have end= jobDone;
   
   length 
      OneChar $ 2 /*  not sure if a length of 4 is required */
      Count 8
   ;
   
   if _n_ = 1 then do;
      declare hash h(ordered: 'yes');
      h.defineKey('OneChar');
      h.defineData('OneChar', 'Count');
      h.defineDone();
   end;
   
   do i = 1 to lengthn(Text);
      OneChar = ksubstr(Text, i, 1);
      
      if h.find() ^= 0 then do;
         Count = 1;
         h.add();
      end;
      else do;
         Count = Count + 1;
         h.replace();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.counter');
   end;
run;

View solution in original post

andreas_lds · Posted 04-28-2022 07:44 AM

You should create a list of chars that don't cause any trouble, so that they can be removed with kcompress before variable text is analysed.

This could solve your problem:

data _null_;
   set have end= jobDone;
   
   length 
      OneChar $ 2 /*  not sure if a length of 4 is required */
      Count 8
   ;
   
   if _n_ = 1 then do;
      declare hash h(ordered: 'yes');
      h.defineKey('OneChar');
      h.defineData('OneChar', 'Count');
      h.defineDone();
   end;
   
   do i = 1 to lengthn(Text);
      OneChar = ksubstr(Text, i, 1);
      
      if h.find() ^= 0 then do;
         Count = 1;
         h.add();
      end;
      else do;
         Count = Count + 1;
         h.replace();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.counter');
   end;
run;

Jbraun · Posted 04-28-2022 07:48 AM

Awesome, thanks!

PaigeMiller · Posted 04-28-2022 07:49 AM

You can create a new data set of just single characters, then let PROC FREQ have all the fun

data intermediate;
    set have;
    length char $ 1;
    do i=1 to length(text);
        char=substr(text,i,1);
        output;
    end;
    drop i text;
run;
proc freq data=intermediate;
    tables char/noprint out=want;
run;

--
Paige Miller

Jbraun · Posted 04-28-2022 07:51 AM

Much appreciated!

Tom · Posted 04-28-2022 09:53 AM

Wouldn't it be easier to just understand what is happening?

Sounds like you have data in UTF-8 encoding and you are trying to write it to a database that is using some other encoding. What encoding is it using? Is it a single byte encoding? Like LATIN1?

If it is a single byte encoding the remove the characters that are not available in that encoding.

You might start with getting the list of invalid characters.

For example this will create a new variable named BAD with the characters that will not map to LATIN1 and only keep the observations that have at least one bad character.

data test;
  set have;
  bad = kcompress(text, kcvt(collate(0,256),'LATIN1','UTF-8'),'k');
  if bad ne ' ';
run;

Jbraun · Posted 04-28-2022 03:02 PM

The data is undocumentet so I don't know exactly what could hide in the file, but thanks for the suggestion

Tom · Posted 04-28-2022 03:45 PM

@Jbraun wrote:
The data is undocumentet so I don't know exactly what could hide in the file, but thanks for the suggestion

Your original description of the problem made it sound like the problem was not the source data, but the limitations of the target you were trying to push the data into. If the target database supported UTF-8 characters and your SAS session is running with UTF-8 support then reading files with any encoding should work fine.

It is when you try to go the other way that you end up trying to stuff ten pounds of **** into a five pound bag.