BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
dcortell
Pyrite | Level 9

Hi all

 

I have a dataset which is the results of

  1. web scraping from html pages text and
  2. then split each text into bigrams (set of two characters)

So each line is a combination of a term (example: "ba", "ca","de" etc...) and the length of the term (which is max 2 and min 0 if we are extracting the last characters from a word)

 

Example dataset

15.png

 

Now, in order to avoid to have spaces in the bigrams, but to still keep track of them, I'm substituting them for underscores

proc sql;
create table test01 as
select *, length(_term_) as length_bigr,
length(_term_2) as length_bigr2
from move.language_det_NGRAMS
where iden=1;
quit;

The problem is, I'm getting weird results as output, as you can see in the image above

 

The _term_2 is filled with underscores till reaching the max length for a text var (200)

 

I'm not sure what is causing this. Could be some html hidden format/character? Someone can share knowledge about if some pre-processing is still to be made on the text derived from the html to get rid of hidden spaces or something related?

 

Bests

D

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So you want to split the string into strings of exactly two characters (not two bytes)?  You want to split to be overlapping? So a string like 'ABCD' becomes 'AB', 'BC', 'CD' and 'C ' ?

Use KSUBSTR() to limit the sub term to only two characters. Make sure to use KTRANSLATE() since you are dealing with potential multibyte characters.

data want;
  set have;
/* A unicode character can use up to 4 bytes */
  length term $8 ; 
  do index=1 to klength(string);
    term = ktranslate(ksubstr(string||' ',index,2),'_',' ');
    output;
  end;
run;

 

View solution in original post

14 REPLIES 14
PaigeMiller
Diamond | Level 26

How is _TERM_2 created in data set move.language_det_NGRAMS? You don't show us the code where it is created.

 

 

--
Paige Miller
Kurt_Bremser
Super User

Somewhere along the way, _term_2 was defined with a length of 200. SAS character variables are always padded with blanks up to their defined length (if they only contain blanks, it is a missing value per definition).

dcortell
Pyrite | Level 9

Sorry, _term_2 is created as follows:


data move.language_det_NGRAMS; set move.language_det_NGRAMS (drop=_term_2); _term_2=tranwrd(compress(_term_,,'kw')," ","_"); run;

 

Tom
Super User Tom
Super User

So you never set a length for _TERM_2.  Instead you forced SAS to GUESS how you wanted it defined.

It should define it to match how _TERM_ is defined.  So if you are seeing _TERM_2 be defined as length $200 then _TERM_ was probably defined as length $200 also.

 

What is the purpose of the COMPRESS() function call?  If you want to just remove leading/trailing spaces you could use STRIP().

Why use TRANWRD() when just converting single characters.  You can use TRANSLATE() for that.

 

What if you use something like this instead:

data move.language_det_NGRAMS;
  set move.language_det_NGRAMS (drop=_term_2);
  length _term_2 $2 ;
  _term_2=translate(strip(_term_),'_',' ');
run;
dcortell
Pyrite | Level 9

So this is how _term_ is generated from the text:

data move.language_det_NGRAMS (where=(_i_=2));
   set viyadrop.language_detect_train2;
   _tmpStr_ = cleaned;

   do while (klength(_tmpStr_)>0); 
   /**max is min betweeb length or 2: size of grams****/ 
      _maxN_=min(klength(_tmpStr_), 2); 
    
      do _i_=_maxN_ to _maxN_;
    /**extract from pos1 n of words _i_ (3): 1 to 3 trigram **/
         _term_ = ksubstr(_tmpStr_, 1, _i_);
         output;  
      end;  
   
  /** after extracting trigrams from pos 1,
  if length >1 then temp=move to next pos **/
   if klength(_tmpStr_)>1 then _tmpStr_ = ksubstr(_tmpStr_, 2);  
      else _tmpStr_ = '';
   end;
  
   keep iden class _term_ _i_;
run;
Tom
Super User Tom
Super User

So you want to split the string into strings of exactly two characters (not two bytes)?  You want to split to be overlapping? So a string like 'ABCD' becomes 'AB', 'BC', 'CD' and 'C ' ?

Use KSUBSTR() to limit the sub term to only two characters. Make sure to use KTRANSLATE() since you are dealing with potential multibyte characters.

data want;
  set have;
/* A unicode character can use up to 4 bytes */
  length term $8 ; 
  do index=1 to klength(string);
    term = ktranslate(ksubstr(string||' ',index,2),'_',' ');
    output;
  end;
run;

 

dcortell
Pyrite | Level 9

Terrific. This perfectly solve the topic. Thanks!

dcortell
Pyrite | Level 9

I can't use strip as, example of text "A bird". The set of bigrams should be: "A_" - "_b" - "bi" - "ir" -"rd". If I employ strip, the "A_" bigram will come up as "A" as the blank will be removed in the strip phase

PaigeMiller
Diamond | Level 26

@dcortell wrote:

I can't use strip as, example of text "A bird". The set of bigrams should be: "A_" - "_b" - "bi" - "ir" -"rd". If I employ strip, the "A_" bigram will come up as "A" as the blank will be removed in the strip phase


Then use TRIM()

--
Paige Miller
dcortell
Pyrite | Level 9

Sorry but it is the same, if I use Trim() or Strip(), the trailing blanks will be removed, and the bigram "A " which should transform in "A_", would just end up with the trim intermediate step to output as "A" as the leading space will be removed. The problem is not in removing trailing or leading blanks in the bigram, but understand why they seems get "multiplied" if we use then Translate(_term_,"_","") function

Kurt_Bremser
Super User

From the documentation of the TRANWRD Function:

Length of Returned Variable

In a DATA step, if the TRANWRD function returns a value to a variable that has not previously been assigned a length, that variable is given a length of 200 bytes. You can use the LENGTH statement, before calling TRANWRD, to change the length of the value.

 

dcortell
Pyrite | Level 9

Also, I have tried the following:

 

data test;
set move.language_det_NGRAMS  ( obs=20 drop=_term_2);
_term_2=translate(compbl(_term_),'_',' ');
run;

proc sql;
create table test01 as
select *, 
length(_term_) as length_bigr,
length(_term_2) as length_bigr2
from test
where iden=1;
quit;

But for same reason the result is not the expected: Example

 

15.png

 

In bigrams as the one in line 4, where there is only a combination of "blank" and "h", I would expect the above should output "_h", but then instead an underscore is also filled on the right side of the character, despite the length being only 2 of _term_

dcortell
Pyrite | Level 9

@Kurt_Bremser , ok, but also switching as suggested to the use of translate() in combination with compbl() as per the example above, does not explain why still blanks are added where it should be not the case. As per documentation:

 

  • "In a DATA step, if the TRANSLATE function returns a value to a variable that has not previously been assigned a length, then that variable is given the length of the first argument."So  also trying this direction, and looking at the example mentioned of line 4, where we have a " h" bigram, the above translate(compbl(),"_"," ") should output "_h", but then we get a "_h_" output, adding another spaces at the left of the character
ballardw
Super User

Perhaps you will find this example a bit enlightening as to what is happening:

data junk;
   length _term_ $ 10.;
   _term_='ha';
   x = compbl(_term_);
   _term_1=translate(x,'_',' ');
   _term_2=translate(strip(x),'_',' ');
   _term_3=translate(strip(compbl(_term_)),'_',' ');
run;

_term_1 is basically what you are doing.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 3093 views
  • 0 likes
  • 5 in conversation