Hello,
I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?
I think the problem is actually before that. It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.
I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.
From a quick read of the docs, I don't see any.
1 data _null_; 2 infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000; 3 length var1 varchar(70000) var2 $10; 4 input var1 var2; 5 6 l=length(var1) ; 7 put l= ; 8 run; NOTE: The infile "Q:\junk\input.txt" is: Filename=Q:\junk\input.txt, RECFM=V,LRECL=100000,File Size (bytes)=60012, Last Modified=14Nov2022:11:41:23, Create Time=14Nov2022:11:41:23 l=32767 NOTE: 1 record was read from the infile "Q:\junk\input.txt". The minimum record length was 60010. The maximum record length was 60010.
If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.
I think it should work, but don't have SAS Viya at my fingertips.
Can you show the log from the import and scan operations?
@js5 wrote:
Hello,
I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?
Please show the code you are using.
A SAS dataset only has two types of variables, floating point numbers and fixed length character strings. The maximum length of a character string is 32,767 bytes.
Are you using something other than a SAS data step to read the file.
What is a TSV file? Do you mean a tab delimited text file?
.tsv is a tab-delimited file, yes. According to https://luminaredata.com/wp-content/uploads/2018/04/Using_SAS_94M5_to_Manage_Strings_Exceeding_32kb.... there is limited support for varchar in the data step. I am trying to split the string into pieces shorter than 32k before saving it, but it appears not to work.
I cannot share my code 1:1 as the data are proprietary but I will see if I can come up with an example.
You can read a text file with strings that are longer than 32K, you just have to read them into multiple variables (or multiple observations). You can use the COLUMN= option of the INFILE statement to help you understand the length of the fields on the line.
Try something like this in normal SAS code:
data want;
infile TSV dsd dlm='09'x truncover firstobs=2 column=cc ;
length id slen 8 str1-str5 $32767 ;
array str str1-str5;
input id @ ;
start=cc;
input str1 @ ;
end=cc;
slen=end-start+1;
input @start ;
do index=1 to ceil(slen/32767);
input str[index] $char32767. ;
if 32767*(index-1) > slen then do;
str[index]=substrn(str[index],1,32767*(index-1)-slen);
end;
end;
run;
I think it should work.
I played bit in Base SAS 9.4M7. It looks like SCAN worked okay for this test. But I did have a problem with REPEAT not wanting to write a string longer than 32,767. Since VARCHAR is new to DATA step (M5 I think), it wouldn't be that surprising if some functions don't handle it correctly yet.
1 data _null_ ; 2 length longstr varchar(40000) longstr2 varchar(80000); 3 longstr=repeat('abc ',9999) ; 4 l=length(longstr) ; *I think this should be 40000, but its only 32767 ; 5 6 7 longstr2=trim(longstr)||longstr ; 8 9 longstr2=trim(longstr2)||' def' ; 10 11 l2=length(longstr2) ; 12 lastword=scan(longstr2,-1) ; 13 put l= l2= lastword= ; 14 run; l=32767 l2=65538 lastword=def
CAS i required if you want to persist VARCHAR variables in your output dataset:
Requirement | The CAS engine is required if you want to preserve a variable as a VARCHAR data type when reading it in or writing it out using the DATA step. |
---|
I do not need the varchar variables in my output dataset, I am dropping them. I only need to read the long strings into varchars so that I can split them into <32k pieces with scan before outputting them to an output dataset.
Can you make a small example, like the data step I posted, that shows the problem? It looked like scan worked for me.
This seems to be sufficient to reproduce the problem:
data varchar;
infile "C:\users\&sysuserid\Work Folders\Desktop\SAS\varchar\input.txt" truncover dsd delimiter='09'x lrecl=100000;
length var1 varchar(70000) var2 $10;
input var1 var2;
do i = 1 to countw(var1, " ");
short = scan(var1, i, " ");
output;
end;
run;
Neither countw nor scan seem to work beyond 32727 characters.
I think the problem is actually before that. It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.
I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.
From a quick read of the docs, I don't see any.
1 data _null_; 2 infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000; 3 length var1 varchar(70000) var2 $10; 4 input var1 var2; 5 6 l=length(var1) ; 7 put l= ; 8 run; NOTE: The infile "Q:\junk\input.txt" is: Filename=Q:\junk\input.txt, RECFM=V,LRECL=100000,File Size (bytes)=60012, Last Modified=14Nov2022:11:41:23, Create Time=14Nov2022:11:41:23 l=32767 NOTE: 1 record was read from the infile "Q:\junk\input.txt". The minimum record length was 60010. The maximum record length was 60010.
If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.