Hello,
I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?
I think the problem is actually before that. It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.
I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.
From a quick read of the docs, I don't see any.
1 data _null_; 2 infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000; 3 length var1 varchar(70000) var2 $10; 4 input var1 var2; 5 6 l=length(var1) ; 7 put l= ; 8 run; NOTE: The infile "Q:\junk\input.txt" is: Filename=Q:\junk\input.txt, RECFM=V,LRECL=100000,File Size (bytes)=60012, Last Modified=14Nov2022:11:41:23, Create Time=14Nov2022:11:41:23 l=32767 NOTE: 1 record was read from the infile "Q:\junk\input.txt". The minimum record length was 60010. The maximum record length was 60010.
If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.
I think it should work, but don't have SAS Viya at my fingertips.
Can you show the log from the import and scan operations?
@js5 wrote:
Hello,
I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?
Please show the code you are using.
A SAS dataset only has two types of variables, floating point numbers and fixed length character strings. The maximum length of a character string is 32,767 bytes.
Are you using something other than a SAS data step to read the file.
What is a TSV file? Do you mean a tab delimited text file?
.tsv is a tab-delimited file, yes. According to https://luminaredata.com/wp-content/uploads/2018/04/Using_SAS_94M5_to_Manage_Strings_Exceeding_32kb.... there is limited support for varchar in the data step. I am trying to split the string into pieces shorter than 32k before saving it, but it appears not to work.
I cannot share my code 1:1 as the data are proprietary but I will see if I can come up with an example.
You can read a text file with strings that are longer than 32K, you just have to read them into multiple variables (or multiple observations). You can use the COLUMN= option of the INFILE statement to help you understand the length of the fields on the line.
Try something like this in normal SAS code:
data want;
infile TSV dsd dlm='09'x truncover firstobs=2 column=cc ;
length id slen 8 str1-str5 $32767 ;
array str str1-str5;
input id @ ;
start=cc;
input str1 @ ;
end=cc;
slen=end-start+1;
input @start ;
do index=1 to ceil(slen/32767);
input str[index] $char32767. ;
if 32767*(index-1) > slen then do;
str[index]=substrn(str[index],1,32767*(index-1)-slen);
end;
end;
run;
I think it should work.
I played bit in Base SAS 9.4M7. It looks like SCAN worked okay for this test. But I did have a problem with REPEAT not wanting to write a string longer than 32,767. Since VARCHAR is new to DATA step (M5 I think), it wouldn't be that surprising if some functions don't handle it correctly yet.
1 data _null_ ; 2 length longstr varchar(40000) longstr2 varchar(80000); 3 longstr=repeat('abc ',9999) ; 4 l=length(longstr) ; *I think this should be 40000, but its only 32767 ; 5 6 7 longstr2=trim(longstr)||longstr ; 8 9 longstr2=trim(longstr2)||' def' ; 10 11 l2=length(longstr2) ; 12 lastword=scan(longstr2,-1) ; 13 put l= l2= lastword= ; 14 run; l=32767 l2=65538 lastword=def
CAS i required if you want to persist VARCHAR variables in your output dataset:
Requirement | The CAS engine is required if you want to preserve a variable as a VARCHAR data type when reading it in or writing it out using the DATA step. |
---|
I do not need the varchar variables in my output dataset, I am dropping them. I only need to read the long strings into varchars so that I can split them into <32k pieces with scan before outputting them to an output dataset.
Can you make a small example, like the data step I posted, that shows the problem? It looked like scan worked for me.
This seems to be sufficient to reproduce the problem:
data varchar;
infile "C:\users\&sysuserid\Work Folders\Desktop\SAS\varchar\input.txt" truncover dsd delimiter='09'x lrecl=100000;
length var1 varchar(70000) var2 $10;
input var1 var2;
do i = 1 to countw(var1, " ");
short = scan(var1, i, " ");
output;
end;
run;
Neither countw nor scan seem to work beyond 32727 characters.
I think the problem is actually before that. It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.
I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.
From a quick read of the docs, I don't see any.
1 data _null_; 2 infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000; 3 length var1 varchar(70000) var2 $10; 4 input var1 var2; 5 6 l=length(var1) ; 7 put l= ; 8 run; NOTE: The infile "Q:\junk\input.txt" is: Filename=Q:\junk\input.txt, RECFM=V,LRECL=100000,File Size (bytes)=60012, Last Modified=14Nov2022:11:41:23, Create Time=14Nov2022:11:41:23 l=32767 NOTE: 1 record was read from the infile "Q:\junk\input.txt". The minimum record length was 60010. The maximum record length was 60010.
If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.