Solved: Re: Importing string larger than 32767

davidsmarch · Posted 11-14-2019 05:12 PM

I'm trying to import multiple variables are each strings and can be more than 32767 a piece. I know the limit for a string variable is 32767, but I'm hoping to import such that if a variable reaches the max it spills over into another variable. I saw a partial fix in another post, but I can't get it to work. For trying it out, I limited the dataset to only 1 variable. To make it easier I also took the header row out o the dataset. This variable is just called Y. The current form is tab-delimited, but I've also unsuccessfully tried other formats.

Below I tried to import the Y variable as a possible array with a first variable Ypart1 up to 32767 and let it spill over into a Ypart2 if needed. As it is, it just fills up the first Ypart1 to the limit, makes a Ypart2, but no data appears there. I've attached the limited datafile. I can do this per variable, if necessary. Obviously it'd be best to do it all at once. Thanks for any guidance!

filename y_var temp;
data y_test;
   infile 'C:\Users\dsm6713\Google Drive\Academic\y_var.tsv' truncover;
   array Ypart[2] $32767.;
   input (Ypart[*])($char32767.);
run;

Tom · Posted 11-15-2019 01:59 PM

Looks like an interesting solution! I tried to adapt it and and while it's parsing the first column it's not moving past that. The delimiter in my data is actually the letter "a". So a string would look like :

947a946a945a944a943a942a941a940a939a938a937a936a933a1029a926a1119a897a875a850a828a811a798a791a786a786a785

But out to hundreds of thousands of characters. So the string between the a's is never longer than 4.

That is a totally different problem than the way you originally presented it. So you no longer have a length issue at all since the maximum length is only 4!

Let's make some simple example data:

filename example temp;
data _null_;
  file example;
  infile cards truncover;
  input line $100. ;
  len=lengthn(line);
  put line $varying200. len ;
cards;
947a946a945a944a943a942a941a
947a946a945a944a943a942a941a940a939a938a937
936a933a1029a926a1119
;

Now just read it into a tall skinny structure with line number, column number and value.

data want ;
  infile example dsd dlm='a' truncover length=line_len column=current_col ;
  row+1;
  do col=1 by 1 until(current_col > line_len);
    input word :$4. @;
    output;
  end;
run;

Results

Obs    row    col    word

  1     1       1    947
  2     1       2    946
  3     1       3    945
  4     1       4    944
  5     1       5    943
  6     1       6    942
  7     1       7    941
  8     2       1    947
  9     2       2    946
 10     2       3    945
 11     2       4    944
 12     2       5    943
 13     2       6    942
 14     2       7    941
 15     2       8    940
 16     2       9    939
 17     2      10    938
 18     2      11    937
 19     3       1    936
 20     3       2    933
 21     3       3    1029
 22     3       4    926
 23     3       5    1119

View solution in original post

Astounding · Posted 11-14-2019 06:30 PM

Just to simplify slightly and cut down the possibilities for error, get rid of the array and try:

input Ypart1 $char32767. Ypart2 $char32767.;

davidsmarch · Posted 11-14-2019 06:41 PM

Astounding, I just tried that, and same result. It cuts off Ypart1 at 32767 and includes a blank Ypart2 variable. It's not spilling over. The original solution below works, but only as they included the data as parmcards:

filename FT15F001 temp;
data lines;
   infile FT15F001 truncover;
   array part[3] $40.;
   input (part[*])($char40.);
   parmcards;
this is the data line this is the line this is the data line 
this is the data line this is the data line this is the data line this is the data line 
this is the data line this is the data line this is the data line 
;;;;
   run;

I'm including this as perhaps it will make more sense to you why theirs works.

ballardw · Posted 11-14-2019 07:18 PM

Like this:

data example;
   infile datalines truncover;
   input part1 $40. part2 $40. part3 $40.;
   datalines;
this is the data line this is the line this is the data line 
this is the data line this is the data line this is the data line this is the data line 
this is the data line this is the data line this is the data line 
;

If your actual data step has anything other than that varying data you need to show the entire data step.

INFILE options have some interesting interactions with the INPUT statement sometimes and the entire syntax is likely needed.

davidsmarch · Posted 11-14-2019 07:34 PM

Sorry, but I'm not sure what you mean. I only included that example as it worked for the strings of data that example used. My data is much too large to paste into SAS, so I am trying to import it from a delimited file. My first post is the closest I've gotten, and I included a smaller version of the input data file on that post.

ballardw · Posted 11-14-2019 07:55 PM

@davidsmarch wrote:

Sorry, but I'm not sure what you mean. I only included that example as it worked for the strings of data that example used. My data is much too large to paste into SAS, so I am trying to import it from a delimited file. My first post is the closest I've gotten, and I included a smaller version of the input data file on that post.

NOT any data that long.

The entire code of the data step. One would hope you are not attempting to do this with datalines or parmcards.

Something that would look similar to

Data mydata;

infile "pathtofile.name" Lrecl=64000 <any otherinfile options> .

input ...

As a minimum your INFILE statement will need an LRECL option to tell SAS that you are attempting to read more than 32K characters on a single line. Otherwise SAS will likely not read past the 32K characters at all.

Tom · Posted 11-14-2019 08:07 PM

@davidsmarch wrote:

Sorry, but I'm not sure what you mean. I only included that example as it worked for the strings of data that example used. My data is much too large to paste into SAS, so I am trying to import it from a delimited file. My first post is the closest I've gotten, and I included a smaller version of the input data file on that post.

Two things you need to fix. How many bytes per line you are reading from the file and how you are parsing the "delimited" file.

First your INFILE statement (or the FILENAME statement) needs to set a LRECL that is longer than 32K. Without that nothing will make it to the data step to be read. You could instead try reading it just a stream of characters and figure out where the end of line characters are by your self. There is a max for the LRECL option so if you really have a lot of data per line you might need to use the second option.

Second you cannot really take advantage of the DSD option to process the delimited file since it seems that some of the fields in the file are longer than the max that can fit into a single variable. So you will need make your own algorithm for figuring out where one field ends and the next one begins.

Here is an example where the input line has three variables. I will a use length of just 3 to demonstrate how the "long" variable can be split into multiple variables.

data test;
  infile cards dsd dlm='|' truncover ;
  length x 8 v1-v5 $3 y 8;
  input x @ ;
  array v[5] ;
  do i=1 to dim(v) until(loc) ;
    if i=dim(v) then input v[i] :$char3. @;
    else input v[i] $char3. @;
    loc = indexc(v[i],'|');
  end;
  if loc then do;
    v[i]=substrn(v[i],1,loc-1);
    input +(loc-3) @ ;
  end;
  input y;
cards;
1|1234567890|9
2|123456|8
;

Results:

Obs    x    v1     v2     v3     v4    v5    y    i    loc

 1     1    123    456    789    0           9    4     2
 2     2    123    456                       8    3     1

davidsmarch · Posted 11-15-2019 01:42 PM

Looks like an interesting solution! I tried to adapt it and and while it's parsing the first column it's not moving past that. The delimiter in my data is actually the letter "a". So a string would look like :

947a946a945a944a943a942a941a940a939a938a937a936a933a1029a926a1119a897a875a850a828a811a798a791a786a786a785

But out to hundreds of thousands of characters. So the string between the a's is never longer than 4. I like the idea of creating a new variable for each value between the a's. I'd then look to transpose that, but that's a later step.

The way I altered your code is below:


data test;
infile 'C:\Users\dsm6713\Google Drive\Academic\y_var.tsv' dsd dlm='a' truncover ;
  length y1-y200 $4;
  array y[200];
  do i=1 to dim(y) until(loc) ;
    if i=dim(y) then input y[i] :$char32767. @;
    else input y[i] $char32767. @;
    loc = indexc(y[i],'a');
  end;
  if loc then do;
    y[i]=substrn(y[i],1,loc-1);
    input +(loc-32767) @ ;
  end;
run;

Tom · Posted 11-15-2019 01:59 PM

Looks like an interesting solution! I tried to adapt it and and while it's parsing the first column it's not moving past that. The delimiter in my data is actually the letter "a". So a string would look like :

947a946a945a944a943a942a941a940a939a938a937a936a933a1029a926a1119a897a875a850a828a811a798a791a786a786a785

But out to hundreds of thousands of characters. So the string between the a's is never longer than 4.

That is a totally different problem than the way you originally presented it. So you no longer have a length issue at all since the maximum length is only 4!

Let's make some simple example data:

filename example temp;
data _null_;
  file example;
  infile cards truncover;
  input line $100. ;
  len=lengthn(line);
  put line $varying200. len ;
cards;
947a946a945a944a943a942a941a
947a946a945a944a943a942a941a940a939a938a937
936a933a1029a926a1119
;

Now just read it into a tall skinny structure with line number, column number and value.

data want ;
  infile example dsd dlm='a' truncover length=line_len column=current_col ;
  row+1;
  do col=1 by 1 until(current_col > line_len);
    input word :$4. @;
    output;
  end;
run;

Results

Obs    row    col    word

  1     1       1    947
  2     1       2    946
  3     1       3    945
  4     1       4    944
  5     1       5    943
  6     1       6    942
  7     1       7    941
  8     2       1    947
  9     2       2    946
 10     2       3    945
 11     2       4    944
 12     2       5    943
 13     2       6    942
 14     2       7    941
 15     2       8    940
 16     2       9    939
 17     2      10    938
 18     2      11    937
 19     3       1    936
 20     3       2    933
 21     3       3    1029
 22     3       4    926
 23     3       5    1119

davidsmarch · Posted 11-17-2019 05:01 PM

Apologies. Correct, though the length in each cell is shorter, I saw the process as needing to get the entire string split wide, then transpose to long. But your solution works perfectly!

Classroom Training Available!