BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
js5
Pyrite | Level 9 js5
Pyrite | Level 9

Hello,

 

I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

I think the problem is actually before that.  It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.

 

I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.

 

From a quick read of the docs, I don't see any.

 

1    data _null_;
2      infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000;
3      length var1 varchar(70000) var2 $10;
4      input var1 var2;
5
6      l=length(var1) ;
7      put l= ;
8    run;

NOTE: The infile "Q:\junk\input.txt" is:
      Filename=Q:\junk\input.txt,
      RECFM=V,LRECL=100000,File Size (bytes)=60012,
      Last Modified=14Nov2022:11:41:23,
      Create Time=14Nov2022:11:41:23

l=32767
NOTE: 1 record was read from the infile "Q:\junk\input.txt".
      The minimum record length was 60010.
      The maximum record length was 60010.

If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.

 

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.

View solution in original post

11 REPLIES 11
LinusH
Tourmaline | Level 20

I think it should work, but don't have SAS Viya at my fingertips.

Can you show the log from the import and scan operations?

Data never sleeps
Tom
Super User Tom
Super User

@js5 wrote:

Hello,

 

I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?

 

 


Please show the code you are using.

A SAS dataset only has two types of variables, floating point numbers and fixed length character strings.  The maximum length of a character string is 32,767 bytes.

 

Are you using something other than a SAS data step to read the file.

 

What is a TSV file?  Do you mean a tab delimited text file?

LinusH
Tourmaline | Level 20
Since OP refers to varchar I guess this is executed in CAS.
Data never sleeps
js5
Pyrite | Level 9 js5
Pyrite | Level 9

.tsv is a tab-delimited file, yes. According to https://luminaredata.com/wp-content/uploads/2018/04/Using_SAS_94M5_to_Manage_Strings_Exceeding_32kb.... there is limited support for varchar in the data step. I am trying to split the string into pieces shorter than 32k before saving it, but it appears not to work.

I cannot share my code 1:1 as the data are proprietary but I will see if I can come up with an example.

Tom
Super User Tom
Super User

You can read a text file with strings that are longer than 32K, you just have to read them into multiple variables (or multiple observations).  You can use the COLUMN= option of the INFILE statement to help you understand the length of the fields on the line.

 

Try something like this in normal SAS code:

data want;
  infile TSV dsd dlm='09'x truncover firstobs=2 column=cc ;
  length id slen 8 str1-str5 $32767 ;
  array str str1-str5;
  input id @ ;
  start=cc;
  input str1 @ ;
  end=cc;
  slen=end-start+1;
  input @start ;
  do index=1 to ceil(slen/32767);
    input str[index] $char32767. ;
    if 32767*(index-1) > slen then do;
      str[index]=substrn(str[index],1,32767*(index-1)-slen);
    end;
  end;
run;
  
Quentin
Super User

I think it should work.

 

I played  bit in Base SAS 9.4M7.  It looks like SCAN worked okay for this test.  But I did have a problem with REPEAT not wanting to write a string longer than 32,767.  Since VARCHAR is new to DATA step (M5 I think), it wouldn't be that surprising if some functions don't handle it correctly yet.

 

1    data _null_ ;
2      length longstr varchar(40000) longstr2 varchar(80000);
3      longstr=repeat('abc ',9999) ;
4      l=length(longstr) ; *I think this should be 40000, but its only 32767 ;
5
6
7      longstr2=trim(longstr)||longstr ;
8
9      longstr2=trim(longstr2)||' def' ;
10
11     l2=length(longstr2) ;
12     lastword=scan(longstr2,-1) ;
13     put l= l2= lastword= ;
14   run;

l=32767 l2=65538 lastword=def
The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
LinusH
Tourmaline | Level 20

CAS i required if you want to persist VARCHAR variables in your output dataset:

 

Requirement The CAS engine is required if you want to preserve a variable as a VARCHAR data type when reading it in or writing it out using the DATA step.
Data never sleeps
js5
Pyrite | Level 9 js5
Pyrite | Level 9

I do not need the varchar variables in my output dataset, I am dropping them. I only need to read the long strings into varchars so that I can split them into <32k pieces with scan before outputting them to an output dataset.

Quentin
Super User

Can you make a small example, like the data step I posted, that shows the problem?  It looked like scan worked for me.

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
js5
Pyrite | Level 9 js5
Pyrite | Level 9

This seems to be sufficient to reproduce the problem:

data varchar;
	infile "C:\users\&sysuserid\Work Folders\Desktop\SAS\varchar\input.txt" truncover dsd delimiter='09'x lrecl=100000;
	length var1 varchar(70000) var2 $10;
	input var1 var2;

	do i = 1 to countw(var1, " ");
		short = scan(var1, i, " ");
		output;
	end;
run;

Neither countw nor scan seem to work beyond 32727 characters.

Quentin
Super User

I think the problem is actually before that.  It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.

 

I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.

 

From a quick read of the docs, I don't see any.

 

1    data _null_;
2      infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000;
3      length var1 varchar(70000) var2 $10;
4      input var1 var2;
5
6      l=length(var1) ;
7      put l= ;
8    run;

NOTE: The infile "Q:\junk\input.txt" is:
      Filename=Q:\junk\input.txt,
      RECFM=V,LRECL=100000,File Size (bytes)=60012,
      Last Modified=14Nov2022:11:41:23,
      Create Time=14Nov2022:11:41:23

l=32767
NOTE: 1 record was read from the infile "Q:\junk\input.txt".
      The minimum record length was 60010.
      The maximum record length was 60010.

If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.

 

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 1918 views
  • 0 likes
  • 4 in conversation