Hello,
I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?
I think the problem is actually before that. It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.
I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.
From a quick read of the docs, I don't see any.
1 data _null_;
2 infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000;
3 length var1 varchar(70000) var2 $10;
4 input var1 var2;
5
6 l=length(var1) ;
7 put l= ;
8 run;
NOTE: The infile "Q:\junk\input.txt" is:
Filename=Q:\junk\input.txt,
RECFM=V,LRECL=100000,File Size (bytes)=60012,
Last Modified=14Nov2022:11:41:23,
Create Time=14Nov2022:11:41:23
l=32767
NOTE: 1 record was read from the infile "Q:\junk\input.txt".
The minimum record length was 60010.
The maximum record length was 60010.
If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.
I think it should work, but don't have SAS Viya at my fingertips.
Can you show the log from the import and scan operations?
@js5 wrote:
Hello,
I have a .tsv file with variables exceeding 32767 characters. I have managed to read these using input statement and varchar(131071) data type and lrecl=262143, but when I try to split the long variables into several pieces using scan, processing ends at character 32767. Is this supposed to be working?
Please show the code you are using.
A SAS dataset only has two types of variables, floating point numbers and fixed length character strings. The maximum length of a character string is 32,767 bytes.
Are you using something other than a SAS data step to read the file.
What is a TSV file? Do you mean a tab delimited text file?
.tsv is a tab-delimited file, yes. According to https://luminaredata.com/wp-content/uploads/2018/04/Using_SAS_94M5_to_Manage_Strings_Exceeding_32kb.... there is limited support for varchar in the data step. I am trying to split the string into pieces shorter than 32k before saving it, but it appears not to work.
I cannot share my code 1:1 as the data are proprietary but I will see if I can come up with an example.
You can read a text file with strings that are longer than 32K, you just have to read them into multiple variables (or multiple observations). You can use the COLUMN= option of the INFILE statement to help you understand the length of the fields on the line.
Try something like this in normal SAS code:
data want;
infile TSV dsd dlm='09'x truncover firstobs=2 column=cc ;
length id slen 8 str1-str5 $32767 ;
array str str1-str5;
input id @ ;
start=cc;
input str1 @ ;
end=cc;
slen=end-start+1;
input @start ;
do index=1 to ceil(slen/32767);
input str[index] $char32767. ;
if 32767*(index-1) > slen then do;
str[index]=substrn(str[index],1,32767*(index-1)-slen);
end;
end;
run;
I think it should work.
I played bit in Base SAS 9.4M7. It looks like SCAN worked okay for this test. But I did have a problem with REPEAT not wanting to write a string longer than 32,767. Since VARCHAR is new to DATA step (M5 I think), it wouldn't be that surprising if some functions don't handle it correctly yet.
1 data _null_ ;
2 length longstr varchar(40000) longstr2 varchar(80000);
3 longstr=repeat('abc ',9999) ;
4 l=length(longstr) ; *I think this should be 40000, but its only 32767 ;
5
6
7 longstr2=trim(longstr)||longstr ;
8
9 longstr2=trim(longstr2)||' def' ;
10
11 l2=length(longstr2) ;
12 lastword=scan(longstr2,-1) ;
13 put l= l2= lastword= ;
14 run;
l=32767 l2=65538 lastword=def
CAS i required if you want to persist VARCHAR variables in your output dataset:
| Requirement | The CAS engine is required if you want to preserve a variable as a VARCHAR data type when reading it in or writing it out using the DATA step. |
|---|
I do not need the varchar variables in my output dataset, I am dropping them. I only need to read the long strings into varchars so that I can split them into <32k pieces with scan before outputting them to an output dataset.
Can you make a small example, like the data step I posted, that shows the problem? It looked like scan worked for me.
This seems to be sufficient to reproduce the problem:
data varchar;
infile "C:\users\&sysuserid\Work Folders\Desktop\SAS\varchar\input.txt" truncover dsd delimiter='09'x lrecl=100000;
length var1 varchar(70000) var2 $10;
input var1 var2;
do i = 1 to countw(var1, " ");
short = scan(var1, i, " ");
output;
end;
run;Neither countw nor scan seem to work beyond 32727 characters.
I think the problem is actually before that. It looks like when you INPUT the data, it's only grabbing the first 32,767 characters.
I think to make this work, you would need to find an informat to use on the INPUT statement that allows a length >32767.
From a quick read of the docs, I don't see any.
1 data _null_;
2 infile "Q:\junk\input.txt" truncover dsd delimiter='09'x lrecl=100000;
3 length var1 varchar(70000) var2 $10;
4 input var1 var2;
5
6 l=length(var1) ;
7 put l= ;
8 run;
NOTE: The infile "Q:\junk\input.txt" is:
Filename=Q:\junk\input.txt,
RECFM=V,LRECL=100000,File Size (bytes)=60012,
Last Modified=14Nov2022:11:41:23,
Create Time=14Nov2022:11:41:23
l=32767
NOTE: 1 record was read from the infile "Q:\junk\input.txt".
The minimum record length was 60010.
The maximum record length was 60010.
If you can't find an informat that will read > 32,767 characters, you might have to read it in chunks, as Tom suggested.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.