Re: cutting of data up to 200 characters

pdhokriya · Posted 07-14-2021 08:56 AM

If dataset has larger than 200 characters then 200 cutting off the setting at whole words with the first 200 characters going into text, the second set of 200 characters going into text1 and so fourth. Continue until all characters have been processed.

Query :

data have;
str = "Subject 1013 at v6 26AUG2020 was not reconsented with Main ICF (c ) v4.0 14AUG2020, IRB approval 19AUG2020, xxxxx Approval 20AUG2020. Subject 1013 was reconsented at V9 18NOV2020 with Main ICF (c ) v4.0 14AUG2020, IRB approval 19AUG2020, xxxxx Approval 20AUG2020.";output;
str= " a b c" ; output;
run;

So the output would be :

text	text1	text2
Subject 1013 at v6 26AUG2020 was not reconsented with Main ICF (c ) v4.0 14AUG2020, IRB approval 19AUG2020, xxxxx Approval 20AUG2020. Subject 1013 was reconsented at V9 18NOV2020 with Main ICF (c )	v4.0 14AUG2020, IRB approval 19AUG2020, Lilly Approval 20AUG2020.
a b c

Kurt_Bremser · Posted 07-14-2021 09:14 AM

I recommend to not limit yourself by a fixed number of text pieces, use a vertical layout instead:

data want;
length text $200;
set have;
do until (lengthn(str) = 0);
  text = substr(str,1,200);
  call scan(text,-1,pos,len," ");
  if pos + len lt 200 then pos = pos + len;
  str = substr(str,pos);
  text = substr(text,1,pos);
  output;
end;
keep text;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

pdhokriya · Posted 07-14-2021 09:54 AM

Observation 2 should be in another variable ( text1).

Reeza · Posted 07-14-2021 10:12 AM

You need to predefine the variables ahead of time so you that means before you start your step you need to know the maximum number of variables you'll need for each string. Instead of that, a more dynamic method is to do it in a long format, so each observation to each line and then transpose it afterwards to a wide format if necessary. It's usually more efficient to work in the long format generally but it takes some training and experience to get there.

@pdhokriya wrote:

Observation 2 should be in another variable ( text1).

Transposing data tutorials:
Long to Wide:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

Wide to Long:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-wide-to-long-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-wide-to-long-using-a-data-step/

And sometimes a double transpose is needed for extra wide data sets:
https://gist.github.com/statgeek/2321b6f62ab78d5bf2b0a5a8626bd7cd

Kurt_Bremser · Posted 07-14-2021 10:23 AM

@pdhokriya wrote:

Observation 2 should be in another variable ( text1).

I VERY STRONGLY advise against such coding techniques. They only make your further data handling complicated and serve no purpose in about 99.95 % of cases, especially as observed here on the communities. Cleanse your brain of the poisons of Excel thinking.

Please give me a valid reason why you want a limited set of text variables that you might run out of if str becomes longer than expected.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 07-14-2021 10:55 AM

It is not hard to adapt @Kurt_Bremser solution to generating multiple variables instead.

Calculate the number of variables you will need by dividing the max length of your source variable by 200 and adding at least one more to handle the inefficiency caused by not splitting words in the middle. So for example if STR is length 800 you might use 5 as the number of new TEXT variables to create.

data want;
  set have;
  array text [5] $200;
  do nwords=1 to dim(text) until (lengthn(str) = 0);
    text[nwords] = str;
    call scan(text[nwords],-1,pos,len," ");
    if pos + len lt 200 then pos = pos + len;
    str = substr(str,pos);
    text[nwords] = substr(text[nwords],1,pos);
  end;
run;

If you end up with some observations where STR is not empty then re-run with a larger array.

ballardw · Posted 07-14-2021 10:26 AM

Can you describe what you will do with the resulting data, and why 200 characters as a magic number is desired?

If you have an arbitrary 200 character split then you are extremely likely to have words split in the middle with bits in two separate variables. Which means that finding values or actual use of the reduced variables seems hardly worth it.

cutting of data up to 200 characters