<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Hidden blanks in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819383#M323439</link>
    <description>&lt;P&gt;So you want to split the string into strings of exactly two characters (not two bytes)?&amp;nbsp; You want the splits to be overlapping? So a string like 'ABCD' becomes 'AB', 'BC', 'CD' and 'D '?&lt;/P&gt;
&lt;P&gt;Use KSUBSTR() to limit each term to only two characters. Make sure to use KTRANSLATE() since you are dealing with potentially multibyte characters.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set have;
/* A Unicode character can use up to 4 bytes, so two characters need up to 8 */
  length term $8 ; 
  do index=1 to klength(string);
    term = ktranslate(ksubstr(string||' ',index,2),'_',' ');
    output;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 21 Jun 2022 16:53:06 GMT</pubDate>
    <dc:creator>Tom</dc:creator>
    <dc:date>2022-06-21T16:53:06Z</dc:date>
    <item>
      <title>Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819350#M323422</link>
      <description>&lt;P&gt;Hi all&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dataset which is the result of&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;scraping text from HTML pages and&lt;/LI&gt;&lt;LI&gt;then splitting each text into bigrams (sets of two characters)&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;So each row is a combination of a term (for example "ba", "ca", "de", etc.) and the length of the term (which is max 2 and min 0 if we are extracting the last characters from a word)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Example dataset&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="15.png" style="width: 940px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/72493i89743A03465F1EB2/image-size/large?v=v2&amp;amp;px=999" role="button" title="15.png" alt="15.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, to avoid having spaces in the bigrams while still keeping track of them, I'm replacing them with underscores&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=""&gt;proc sql;
create table test01 as
select *, length(_term_) as length_bigr,
length(_term_2) as length_bigr2
from move.language_det_NGRAMS
where iden=1;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The problem is, I'm getting weird results as output, as you can see in the image above&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;_term_2 is filled with underscores up to its full length of 200&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm not sure what is causing this. Could it be some hidden HTML formatting or character? Can anyone tell me whether the text derived from the HTML still needs some pre-processing to get rid of hidden spaces or something similar?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Bests&lt;/P&gt;&lt;P&gt;D&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 15:21:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819350#M323422</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T15:21:45Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819353#M323425</link>
      <description>&lt;P&gt;How is _TERM_2 created in data set &lt;FONT face="courier new,courier"&gt;move.language_det_NGRAMS&lt;/FONT&gt;? You don't show us the code where it is created.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 15:33:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819353#M323425</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2022-06-21T15:33:02Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819354#M323426</link>
      <description>&lt;P&gt;Somewhere along the way, _term_2 was defined with a length of 200. SAS character variables are always padded with blanks up to their defined length (if they contain only blanks, the value is missing by definition).&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 15:32:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819354#M323426</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-06-21T15:32:54Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819359#M323430</link>
      <description>&lt;P&gt;Sorry, _term_2 is created as follows:&lt;/P&gt;&lt;PRE&gt;&lt;BR /&gt;&lt;CODE class=""&gt;data move.language_det_NGRAMS;
set move.language_det_NGRAMS (drop=_term_2);
_term_2=tranwrd(compress(_term_,,'kw')," ","_");
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 15:55:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819359#M323430</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T15:55:08Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819366#M323433</link>
      <description>&lt;P&gt;So you never set a length for _TERM_2.&amp;nbsp; Instead you forced SAS to GUESS how you wanted it defined.&lt;/P&gt;
&lt;P&gt;It should define it to match how _TERM_ is defined.&amp;nbsp; So if you are seeing _TERM_2 be defined as length $200 then _TERM_ was probably defined as length $200 also.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What is the purpose of the COMPRESS() function call?&amp;nbsp; If you want to just remove leading/trailing spaces you could use STRIP().&lt;/P&gt;
&lt;P&gt;Why use TRANWRD() when you are just converting single characters?&amp;nbsp; You can use TRANSLATE() for that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What if you use something like this instead:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data move.language_det_NGRAMS;
  set move.language_det_NGRAMS (drop=_term_2);
  length _term_2 $2 ;
  _term_2=translate(strip(_term_),'_',' ');
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:14:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819366#M323433</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2022-06-21T16:14:44Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819370#M323434</link>
      <description>&lt;P&gt;So this is how _term_ is generated from the text:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=""&gt;data move.language_det_NGRAMS (where=(_i_=2));
   set viyadrop.language_detect_train2;
   _tmpStr_ = cleaned;

   do while (klength(_tmpStr_)&amp;gt;0); 
   /* n-gram size: the smaller of the remaining length and 2 */
      _maxN_=min(klength(_tmpStr_), 2); 
    
      do _i_=_maxN_ to _maxN_;
    /* extract the first _i_ characters from position 1 (a bigram when _i_ is 2) */
         _term_ = ksubstr(_tmpStr_, 1, _i_);
         output;  
      end;  
   
  /* after extracting the bigram at position 1,
  drop the first character to move on to the next position */
   if klength(_tmpStr_)&amp;gt;1 then _tmpStr_ = ksubstr(_tmpStr_, 2);  
      else _tmpStr_ = '';
   end;
  
   keep iden class _term_ _i_;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:26:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819370#M323434</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T16:26:30Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819372#M323435</link>
      <description>&lt;P&gt;I can't use STRIP(). Take the text "A bird" as an example: the set of bigrams should be "A_" - "_b" - "bi" - "ir" - "rd". If I use STRIP(), the "A_" bigram will come out as "A", because the blank is removed in the strip step.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:29:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819372#M323435</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T16:29:44Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819376#M323437</link>
      <description>&lt;P&gt;Also, I have tried the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=""&gt;data test;
set move.language_det_NGRAMS  ( obs=20 drop=_term_2);
_term_2=translate(compbl(_term_),'_',' ');
run;

proc sql;
create table test01 as
select *, 
length(_term_) as length_bigr,
length(_term_2) as length_bigr2
from test
where iden=1;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But for some reason the result is not as expected. Example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="15.png" style="width: 636px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/72496i3468043A951C5BC2/image-size/large?v=v2&amp;amp;px=999" role="button" title="15.png" alt="15.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In bigrams like the one in line 4, which is just a blank followed by "h", I would expect the code above to output "_h". Instead an underscore is also added on the right side of the character, even though the length of _term_ is only 2.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:36:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819376#M323437</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T16:36:47Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819379#M323438</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/71355"&gt;@dcortell&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;I can't use STRIP(). Take the text "A bird" as an example: the set of bigrams should be "A_" - "_b" - "bi" - "ir" - "rd". If I use STRIP(), the "A_" bigram will come out as "A", because the blank is removed in the strip step.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Then use TRIM()&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:43:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819379#M323438</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2022-06-21T16:43:58Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819383#M323439</link>
      <description>&lt;P&gt;So you want to split the string into strings of exactly two characters (not two bytes)?&amp;nbsp; You want the splits to be overlapping? So a string like 'ABCD' becomes 'AB', 'BC', 'CD' and 'D '?&lt;/P&gt;
&lt;P&gt;Use KSUBSTR() to limit each term to only two characters. Make sure to use KTRANSLATE() since you are dealing with potentially multibyte characters.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set have;
/* A Unicode character can use up to 4 bytes, so two characters need up to 8 */
  length term $8 ; 
  do index=1 to klength(string);
    term = ktranslate(ksubstr(string||' ',index,2),'_',' ');
    output;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
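&lt;P&gt;As a rough check, the same step can be run against the "A bird" example from earlier in the thread. This is only a sketch: the dataset HAVE, the variable STRING and its length of $20 are assumptions made for illustration, not the real data.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* hypothetical one-row input, not the real data */
data have;
  length string $20;
  string = 'A bird';
run;

data want;
  set have;
  length term $8;
  do index = 1 to klength(string);
    /* cut two characters first, then swap blanks for underscores */
    term = ktranslate(ksubstr(string||' ', index, 2), '_', ' ');
    output;
  end;
run;

proc print data=want;
  var index term;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Because the blanks are translated before the value is stored in TERM, the padding SAS adds up to the declared length stays blank. Note that the last row should be the final character followed by an underscore, which may or may not be wanted.&lt;/P&gt;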
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:53:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819383#M323439</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2022-06-21T16:53:06Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819384#M323440</link>
      <description>&lt;P&gt;Sorry, but it is the same: if I use Trim() or Strip(), the trailing blanks will be removed, and the bigram "A ", which should become "A_", would come out of the intermediate trim step as "A" because the trailing space is removed. The problem is not removing trailing or leading blanks in the bigram, but understanding why they seem to get "multiplied" when we then use the Translate(_term_,"_","") function.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:50:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819384#M323440</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T16:50:23Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819385#M323441</link>
      <description>&lt;P&gt;From the documentation of the &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lefunctionsref/p0pgemqcslm9uen1tvr5gcrusgrw.htm" target="_blank" rel="noopener"&gt;TRANWRD Function&lt;/A&gt;:&lt;/P&gt;
&lt;H3 id="n1c7k94n7xs6v0n11husarb03l9b" class="xisDoc-title"&gt;Length of Returned Variable&lt;/H3&gt;
&lt;P class="xisDoc-paragraph"&gt;In a DATA step, if the TRANWRD function returns a value to a variable that has not previously been assigned a length, that variable is given a length of &lt;FONT color="#FF0000"&gt;200 bytes&lt;/FONT&gt;. You can use the LENGTH statement, before calling TRANWRD, to change the length of the value.&lt;/P&gt;
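&lt;P&gt;A small sketch of that rule, using made-up variable names (BIGRAM, PADDED and SHORT are assumptions for illustration). VLENGTH() reports the defined length of a variable, so going by the documentation quoted above the log should show 200 for the undeclared variable and 2 for the declared one.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  /* hypothetical variables, declared with a length of 2 */
  length bigram short $2;
  bigram = 'A ';
  /* PADDED has no declared length, so TRANWRD fixes it at its default */
  padded = tranwrd(bigram, ' ', '_');
  /* SHORT was declared before the call, so it keeps its length */
  short = tranwrd(bigram, ' ', '_');
  len_padded = vlength(padded);
  len_short = vlength(short);
  put len_padded= len_short= short=;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Any later step that translates blanks in PADDED would also hit the blanks used to pad it out to its defined length, which is one way a two-character term can end up followed by a long run of underscores.&lt;/P&gt;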
&lt;P class="xisDoc-paragraph"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 16:52:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819385#M323441</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-06-21T16:52:39Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819392#M323443</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/11562"&gt;@Kurt_Bremser&lt;/a&gt;&amp;nbsp;, OK, but switching to translate() in combination with compbl(), as in the example above, still adds blanks where there should be none, and the TRANWRD default does not explain that. As per the documentation:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;"&lt;SPAN&gt;In a DATA step, if the TRANSLATE function returns a value to a variable that has not previously been assigned a length, then that variable is given the length of the first argument."&lt;/SPAN&gt; &lt;SPAN&gt;So with this approach too, looking at the example in line 4 where we have a " h" bigram, translate(compbl(_term_),'_',' ') should output "_h", but we get "_h_" instead, with an extra underscore added to the right of the character.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 21 Jun 2022 17:05:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819392#M323443</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T17:05:38Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819396#M323445</link>
      <description>&lt;P&gt;Perhaps you will find this example a bit enlightening as to what is happening:&lt;/P&gt;
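&lt;PRE&gt;data junk;
   length _term_ $10;
   _term_='ha';
   /* x inherits length 10, so 'ha' is stored with 8 trailing blanks */
   x = compbl(_term_);
   /* the padding blanks are translated too, giving 'ha________' */
   _term_1=translate(x,'_',' ');
   /* STRIP removes the padding before TRANSLATE sees it, giving 'ha' */
   _term_2=translate(strip(x),'_',' ');
   _term_3=translate(strip(compbl(_term_)),'_',' ');
run;&lt;/PRE&gt;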
&lt;P&gt;_term_1 is basically what you are doing.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 17:22:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819396#M323445</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2022-06-21T17:22:36Z</dc:date>
    </item>
    <item>
      <title>Re: Hidden blanks</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819399#M323446</link>
      <description>&lt;P&gt;Terrific. This solves the problem perfectly. Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 17:35:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hidden-blanks/m-p/819399#M323446</guid>
      <dc:creator>dcortell</dc:creator>
      <dc:date>2022-06-21T17:35:24Z</dc:date>
    </item>
  </channel>
</rss>

