Solved: Re: Hidden blanks

dcortell · Posted 06-21-2022 11:21 AM

Hi all

I have a dataset that results from scraping text from HTML pages and then splitting each text into bigrams (sets of two characters).

So each line is a combination of a term (for example "ba", "ca", "de", etc.) and the length of the term (which is at most 2 and at least 0 when we are extracting the last characters of a word).

Example dataset

Now, to avoid having spaces in the bigrams while still keeping track of them, I'm replacing them with underscores.

proc sql;
create table test01 as
select *, length(_term_) as length_bigr,
length(_term_2) as length_bigr2
from move.language_det_NGRAMS
where iden=1;
quit;

The problem is, I'm getting weird results as output, as you can see in the image above

The _term_2 is filled with underscores till reaching the max length for a text var (200)

I'm not sure what is causing this. Could it be some hidden HTML formatting or character? Can someone share whether any further pre-processing is needed on the text derived from the HTML to get rid of hidden spaces or something similar?

Bests

D

PaigeMiller · Posted 06-21-2022 11:32 AM

How is _TERM_2 created in data set move.language_det_NGRAMS? You don't show us the code where it is created.

--
Paige Miller

Kurt_Bremser · Posted 06-21-2022 11:32 AM

Somewhere along the way, _term_2 was defined with a length of 200. SAS character variables are always padded with blanks up to their defined length (if they contain only blanks, the value is missing by definition).
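
One way to check this on the data itself is to compare the defined (storage) length of a variable with the length of its non-blank content. The step below is only a sketch: it assumes the data set and variable names from the original post, and it uses VLENGTH() for the defined length and LENGTH() for the trimmed length.

```sas
/* Sketch: compare storage length with trimmed length for a few rows. */
/* Data set and variable names are taken from the original post.      */
data _null_;
  set move.language_det_NGRAMS (obs=5);
  defined_len = vlength(_term_);   /* bytes reserved for the variable */
  trimmed_len = length(_term_);    /* position of the last non-blank character */
  put _term_= defined_len= trimmed_len=;
run;
```

If defined_len is much larger than 2, every value of _term_ carries trailing blanks, and any function that replaces blanks will also replace that padding.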

dcortell · Posted 06-21-2022 11:55 AM

Sorry, _term_2 is created as follows:


data move.language_det_NGRAMS;
set move.language_det_NGRAMS (drop=_term_2);
_term_2=tranwrd(compress(_term_,,'kw')," ","_");
run;

Tom · Posted 06-21-2022 12:14 PM

So you never set a length for _TERM_2. Instead you forced SAS to GUESS how you wanted it defined.

It should define it to match how _TERM_ is defined. So if you are seeing _TERM_2 be defined as length $200 then _TERM_ was probably defined as length $200 also.

What is the purpose of the COMPRESS() function call? If you want to just remove leading/trailing spaces you could use STRIP().

Why use TRANWRD() when you are just converting single characters? You can use TRANSLATE() for that.

What if you use something like this instead:

data move.language_det_NGRAMS;
  set move.language_det_NGRAMS (drop=_term_2);
  length _term_2 $2 ;
  _term_2=translate(strip(_term_),'_',' ');
run;

dcortell · Posted 06-21-2022 12:26 PM

So this is how _term_ is generated from the text:

data move.language_det_NGRAMS (where=(_i_=2));
   set viyadrop.language_detect_train2;
   _tmpStr_ = cleaned;

   do while (klength(_tmpStr_)>0);
      /* n-gram size: the smaller of the remaining length and 2 */
      _maxN_=min(klength(_tmpStr_), 2);

      do _i_=_maxN_ to _maxN_;
         /* extract the first _i_ characters of the remaining text */
         _term_ = ksubstr(_tmpStr_, 1, _i_);
         output;
      end;

      /* after extracting from position 1, move on to the next position
         if more than one character remains; otherwise stop */
      if klength(_tmpStr_)>1 then _tmpStr_ = ksubstr(_tmpStr_, 2);
      else _tmpStr_ = '';
   end;

   keep iden class _term_ _i_;
run;

Tom · Posted 06-21-2022 12:49 PM

So you want to split the string into strings of exactly two characters (not two bytes)? Do you want the split to be overlapping? So a string like 'ABCD' becomes 'AB', 'BC', 'CD' and 'D '?

Use KSUBSTR() to limit the sub term to only two characters. Make sure to use KTRANSLATE() since you are dealing with potential multibyte characters.

data want;
  set have;
/* A unicode character can use up to 4 bytes */
  length term $8 ; 
  do index=1 to klength(string);
    term = ktranslate(ksubstr(string||' ',index,2),'_',' ');
    output;
  end;
run;
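
To try the step above on the "A bird" example from this thread, a small test data set can be fed through it. This is only a sketch: the data set names have and want and the variable name string come from the code above, while the $20 length and the PROC PRINT step are assumptions added for illustration.

```sas
/* Sketch: one-row test input; the length $20 is an arbitrary choice. */
data have;
  length string $20;
  string = 'A bird';
run;

data want;
  set have;
  /* room for two characters of up to 4 bytes each */
  length term $8;
  do index = 1 to klength(string);
    /* take exactly two characters, then turn blanks into underscores */
    term = ktranslate(ksubstr(string || ' ', index, 2), '_', ' ');
    output;
  end;
run;

proc print data=want;
  var index term;
run;
```

The point of the design is that the blank-to-underscore translation is applied to the two-character substring before it is stored in term, so the blanks that pad term to its defined length are never translated.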

dcortell · Posted 06-21-2022 01:35 PM

Terrific. This perfectly solves the problem. Thanks!

dcortell · Posted 06-21-2022 12:29 PM

I can't use STRIP(). Take the text "A bird" as an example: the set of bigrams should be "A_" - "_b" - "bi" - "ir" - "rd". If I use STRIP(), the "A_" bigram will come out as "A", because the blank is removed in the strip phase.

PaigeMiller · Posted 06-21-2022 12:43 PM

@dcortell wrote:

I can't use STRIP(). Take the text "A bird" as an example: the set of bigrams should be "A_" - "_b" - "bi" - "ir" - "rd". If I use STRIP(), the "A_" bigram will come out as "A", because the blank is removed in the strip phase.

Then use TRIM()

--
Paige Miller

dcortell · Posted 06-21-2022 12:50 PM

Sorry, but it is the same: with TRIM() or STRIP(), the trailing blank is removed, so the bigram "A ", which should become "A_", ends up as "A" after the intermediate trim step. The problem is not removing trailing or leading blanks in the bigram, but understanding why they seem to get "multiplied" when we then use the Translate(_term_,"_","") function.

Kurt_Bremser · Posted 06-21-2022 12:52 PM

From the documentation of the TRANWRD Function:

Length of Returned Variable

In a DATA step, if the TRANWRD function returns a value to a variable that has not previously been assigned a length, that variable is given a length of 200 bytes. You can use the LENGTH statement, before calling TRANWRD, to change the length of the value.
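
The effect described in that documentation note can be seen by comparing the defined lengths of two result variables, one declared with LENGTH beforehand and one not. This is a minimal sketch: the data set name demo_tranwrd and the variables no_len and with_len are made up for illustration.

```sas
/* Sketch: defined length of a TRANWRD result with and without LENGTH. */
data demo_tranwrd;
  length _term_ $2;
  _term_ = 'A ';

  /* no LENGTH statement for no_len: TRANWRD decides its length */
  no_len = tranwrd(_term_, ' ', '_');

  /* LENGTH declared before the first use: with_len is stored in 2 bytes */
  length with_len $2;
  with_len = tranwrd(_term_, ' ', '_');

  len_no_len   = vlength(no_len);    /* defined length, should be 200 */
  len_with_len = vlength(with_len);  /* defined length, should be 2 */
  put len_no_len= len_with_len=;
run;
```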

dcortell · Posted 06-21-2022 12:36 PM

Also, I have tried the following:

data test;
set move.language_det_NGRAMS  ( obs=20 drop=_term_2);
_term_2=translate(compbl(_term_),'_',' ');
run;

proc sql;
create table test01 as
select *, 
length(_term_) as length_bigr,
length(_term_2) as length_bigr2
from test
where iden=1;
quit;

But for some reason the result is not what I expected:

In bigrams like the one in row 4 of the output, which consist only of a blank and "h", I would expect the above to output "_h", but instead an underscore is also added to the right of the character, even though the length of _term_ is only 2.

dcortell · Posted 06-21-2022 01:05 PM

@Kurt_Bremser , ok, but also switching as suggested to the use of translate() in combination with compbl() as per the example above, does not explain why still blanks are added where it should be not the case. As per documentation:

"In a DATA step, if the TRANSLATE function returns a value to a variable that has not previously been assigned a length, then that variable is given the length of the first argument." So, trying this direction as well and looking at the row 4 example mentioned above, where we have a " h" bigram, translate(compbl(),"_"," ") should output "_h", but we get "_h_" instead, with another underscore added to the right of the character.

ballardw · Posted 06-21-2022 01:22 PM

Perhaps you will find this example a bit enlightening as to what is happening:

data junk;
   length _term_ $ 10.;
   _term_='ha';
   x = compbl(_term_);
   _term_1=translate(x,'_',' ');
   _term_2=translate(strip(x),'_',' ');
   _term_3=translate(strip(compbl(_term_)),'_',' ');
run;

_term_1 is basically what you are doing.
