BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Jbraun
Fluorite | Level 6

I have data like this (simplified version):

 

ID     Text

1       ABC

2       ABCD

3       ABCDE

 

I wish to count how many times each type of character appears. But: the actual data it is a big dataset and I don't know what types of characters that may appear. In the example above, A, B, C, D and E appear, but in the actual data any unicode character may appear. The purpose is to remove certain characters since they seem to cause trouble when I try to import it in SQL, but it seems that I can import it in SAS. Therefore I try to figure find out which characters I need to remove to be able to import it in SQL. I found countc which allows me to specify a particular character I wish to count, which would work fine for a limited number of characters, but in this case I expect to get a table of many hundred lines in the output since I have several million lines in the raw data. 

 

Any help would be very appreciated

 

Kind regards,

Jacob

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

You should create a list of chars that don't cause any trouble, so that they can be removed with kcompress before variable text is analysed.

 

This could solve your problem:

data _null_;
   set have end= jobDone;
   
   length 
      OneChar $ 2 /*  not sure if a length of 4 is required */
      Count 8
   ;
   
   if _n_ = 1 then do;
      declare hash h(ordered: 'yes');
      h.defineKey('OneChar');
      h.defineData('OneChar', 'Count');
      h.defineDone();
   end;
   
   do i = 1 to lengthn(Text);
      OneChar = ksubstr(Text, i, 1);
      
      if h.find() ^= 0 then do;
         Count = 1;
         h.add();
      end;
      else do;
         Count = Count + 1;
         h.replace();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.counter');
   end;
run;

View solution in original post

7 REPLIES 7
andreas_lds
Jade | Level 19

You should create a list of chars that don't cause any trouble, so that they can be removed with kcompress before variable text is analysed.

 

This could solve your problem:

data _null_;
   set have end= jobDone;
   
   length 
      OneChar $ 2 /*  not sure if a length of 4 is required */
      Count 8
   ;
   
   if _n_ = 1 then do;
      declare hash h(ordered: 'yes');
      h.defineKey('OneChar');
      h.defineData('OneChar', 'Count');
      h.defineDone();
   end;
   
   do i = 1 to lengthn(Text);
      OneChar = ksubstr(Text, i, 1);
      
      if h.find() ^= 0 then do;
         Count = 1;
         h.add();
      end;
      else do;
         Count = Count + 1;
         h.replace();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.counter');
   end;
run;
Jbraun
Fluorite | Level 6
Awesome, thanks!
PaigeMiller
Diamond | Level 26

You can create a new data set of just single characters, then let PROC FREQ have all the fun

 

data intermediate;
    set have;
    length char $ 1;
    do i=1 to length(text);
        char=substr(text,i,1);
        output;
    end;
    drop i text;
run;
proc freq data=intermediate;
    tables char/noprint out=want;
run;
--
Paige Miller
Jbraun
Fluorite | Level 6

Much appreciated!

Tom
Super User Tom
Super User

Wouldn't it be easier to just understand what is happening?

Sounds like you have data in UTF-8 encoding and you are trying to write it to a database that is using some other encoding.  What encoding is it using? Is it a single byte encoding?  Like LATIN1?

 

If it is a single byte encoding the remove the characters that are not available in that encoding.

 

You might start with getting the list of invalid characters.

For example this will create a new variable named BAD with the characters that will not map to LATIN1 and only keep the observations that have at least one bad character.

data test;
  set have;
  bad = kcompress(text, kcvt(collate(0,256),'LATIN1','UTF-8'),'k');
  if bad ne ' ';
run;
Jbraun
Fluorite | Level 6
The data is undocumentet so I don't know exactly what could hide in the file, but thanks for the suggestion
Tom
Super User Tom
Super User

@Jbraun wrote:
The data is undocumentet so I don't know exactly what could hide in the file, but thanks for the suggestion

Your original description of the problem made it sound like the problem was not the source data, but the limitations of the target you were trying to push the data into.   If the target database supported UTF-8 characters and your SAS session is running with UTF-8 support then reading files with any encoding should work fine. 

 

It is when you try to go the other way that you end up trying to stuff ten pounds of **** into a five pound bag.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 558 views
  • 4 likes
  • 4 in conversation