BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
davidsmarch
Calcite | Level 5

I'm trying to import multiple variables are each strings and can be more than 32767 a piece. I know the limit for a string variable is 32767, but I'm hoping to import such that if a variable reaches the max it spills over into another variable. I saw a partial fix in another post, but I can't get it to work. For trying it out, I limited the dataset to only 1 variable. To make it easier I also took the header row out o the dataset. This variable is just called Y. The current form is tab-delimited, but I've also unsuccessfully tried other formats.

 

Below I tried to import the Y variable as a possible array with a first variable Ypart1 up to 32767 and let it spill over into a Ypart2 if needed. As it is, it just fills up the first Ypart1 to the limit, makes a Ypart2, but no data appears there. I've attached the limited datafile. I can do this per variable, if necessary. Obviously it'd be best to do it all at once. Thanks for any guidance!

 

filename y_var temp;
data y_test;
   infile 'C:\Users\dsm6713\Google Drive\Academic\y_var.tsv' truncover;
   array Ypart[2] $32767.;
   input (Ypart[*])($char32767.);
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Looks like an interesting solution! I tried to adapt it and and while it's parsing the first column it's not moving past that. The delimiter in my data is actually the letter "a". So a string would look like :

 

947a946a945a944a943a942a941a940a939a938a937a936a933a1029a926a1119a897a875a850a828a811a798a791a786a786a785

 

But out to hundreds of thousands of characters. So the string between the a's is never longer than 4. 

That is a totally different problem than the way you originally presented it.  So you no longer have a length issue at all since the maximum length is only 4!

 

Let's make some simple example data:

filename example temp;
data _null_;
  file example;
  infile cards truncover;
  input line $100. ;
  len=lengthn(line);
  put line $varying200. len ;
cards;
947a946a945a944a943a942a941a
947a946a945a944a943a942a941a940a939a938a937
936a933a1029a926a1119
;

Now just read it into a tall skinny structure with line number, column number and value.

data want ;
  infile example dsd dlm='a' truncover length=line_len column=current_col ;
  row+1;
  do col=1 by 1 until(current_col > line_len);
    input word :$4. @;
    output;
  end;
run;

Results

Obs    row    col    word

  1     1       1    947
  2     1       2    946
  3     1       3    945
  4     1       4    944
  5     1       5    943
  6     1       6    942
  7     1       7    941
  8     2       1    947
  9     2       2    946
 10     2       3    945
 11     2       4    944
 12     2       5    943
 13     2       6    942
 14     2       7    941
 15     2       8    940
 16     2       9    939
 17     2      10    938
 18     2      11    937
 19     3       1    936
 20     3       2    933
 21     3       3    1029
 22     3       4    926
 23     3       5    1119

 

 

View solution in original post

9 REPLIES 9
Astounding
PROC Star

Just to simplify slightly and cut down the possibilities for error, get rid of the array and try:

 

input Ypart1 $char32767. Ypart2 $char32767.;
davidsmarch
Calcite | Level 5

Astounding, I just tried that, and same result. It cuts off Ypart1 at 32767 and includes a blank Ypart2 variable. It's not spilling over. The original solution below works, but only as they included the data as parmcards:

 

filename FT15F001 temp;
data lines;
   infile FT15F001 truncover;
   array part[3] $40.;
   input (part[*])($char40.);
   parmcards;
this is the data line this is the line this is the data line 
this is the data line this is the data line this is the data line this is the data line 
this is the data line this is the data line this is the data line 
;;;;
   run;

I'm including this as perhaps it will make more sense to you why theirs works. 

ballardw
Super User

Like this:

data example;
   infile datalines truncover;
   input part1 $40. part2 $40. part3 $40.;
   datalines;
this is the data line this is the line this is the data line 
this is the data line this is the data line this is the data line this is the data line 
this is the data line this is the data line this is the data line 
;

If your actual data step has anything other than that varying data you need to show the entire data step.

INFILE options have some interesting interactions with the INPUT statement sometimes and the entire syntax is likely needed.

davidsmarch
Calcite | Level 5

Sorry, but I'm not sure what you mean. I only included that example as it worked for the strings of data that example used. My data is much too large to paste into SAS, so I am trying to import it from a delimited file. My first post is the closest I've gotten, and I included a smaller version of the input data file on that post. 

ballardw
Super User

@davidsmarch wrote:

Sorry, but I'm not sure what you mean. I only included that example as it worked for the strings of data that example used. My data is much too large to paste into SAS, so I am trying to import it from a delimited file. My first post is the closest I've gotten, and I included a smaller version of the input data file on that post. 


NOT any data that long.

 

The entire code of the data step. One would hope you are not attempting to do this with datalines or parmcards.

 

Something that would look similar to

 

Data mydata;

   infile "pathtofile.name"  Lrecl=64000 <any otherinfile options> .

   input ...

 

 

As a minimum your INFILE statement will need an LRECL option to tell SAS that you are attempting to read more than 32K characters on a single line. Otherwise SAS will likely not read past the 32K characters at all.

Tom
Super User Tom
Super User

@davidsmarch wrote:

Sorry, but I'm not sure what you mean. I only included that example as it worked for the strings of data that example used. My data is much too large to paste into SAS, so I am trying to import it from a delimited file. My first post is the closest I've gotten, and I included a smaller version of the input data file on that post. 


Two things you need to fix. How many bytes per line you are reading from the file and how you are parsing the "delimited" file.

 

First your INFILE statement (or the FILENAME statement) needs to set a LRECL that is longer than 32K. Without that nothing will make it to the data step to be read.  You could instead try reading it just a stream of characters and figure out where the end of line characters are by your self.  There is a max for the LRECL option so if you really have a lot of data per line you might need to use the second option.

 

Second you cannot really take advantage of the DSD option to process the delimited file since it seems that some of the fields in the file are longer than the max that can fit into a single variable.  So you will need make your own algorithm for figuring out where one field ends and the next one begins.

 

Here is an example where the input line has three variables. I will a use length of just 3 to demonstrate how the "long" variable can be split into multiple variables. 

data test;
  infile cards dsd dlm='|' truncover ;
  length x 8 v1-v5 $3 y 8;
  input x @ ;
  array v[5] ;
  do i=1 to dim(v) until(loc) ;
    if i=dim(v) then input v[i] :$char3. @;
    else input v[i] $char3. @;
    loc = indexc(v[i],'|');
  end;
  if loc then do;
    v[i]=substrn(v[i],1,loc-1);
    input +(loc-3) @ ;
  end;
  input y;
cards;
1|1234567890|9
2|123456|8
;

Results:

Obs    x    v1     v2     v3     v4    v5    y    i    loc

 1     1    123    456    789    0           9    4     2
 2     2    123    456                       8    3     1

 

davidsmarch
Calcite | Level 5

Looks like an interesting solution! I tried to adapt it and and while it's parsing the first column it's not moving past that. The delimiter in my data is actually the letter "a". So a string would look like :

 

947a946a945a944a943a942a941a940a939a938a937a936a933a1029a926a1119a897a875a850a828a811a798a791a786a786a785

 

But out to hundreds of thousands of characters. So the string between the a's is never longer than 4. I like the idea of creating a new variable for each value between the a's. I'd then look to transpose that, but that's a later step.

 

The way I altered your code is below:

 


data test;
infile 'C:\Users\dsm6713\Google Drive\Academic\y_var.tsv' dsd dlm='a' truncover ;
  length y1-y200 $4;
  array y[200];
  do i=1 to dim(y) until(loc) ;
    if i=dim(y) then input y[i] :$char32767. @;
    else input y[i] $char32767. @;
    loc = indexc(y[i],'a');
  end;
  if loc then do;
    y[i]=substrn(y[i],1,loc-1);
    input +(loc-32767) @ ;
  end;
run;

 

Tom
Super User Tom
Super User

Looks like an interesting solution! I tried to adapt it and and while it's parsing the first column it's not moving past that. The delimiter in my data is actually the letter "a". So a string would look like :

 

947a946a945a944a943a942a941a940a939a938a937a936a933a1029a926a1119a897a875a850a828a811a798a791a786a786a785

 

But out to hundreds of thousands of characters. So the string between the a's is never longer than 4. 

That is a totally different problem than the way you originally presented it.  So you no longer have a length issue at all since the maximum length is only 4!

 

Let's make some simple example data:

filename example temp;
data _null_;
  file example;
  infile cards truncover;
  input line $100. ;
  len=lengthn(line);
  put line $varying200. len ;
cards;
947a946a945a944a943a942a941a
947a946a945a944a943a942a941a940a939a938a937
936a933a1029a926a1119
;

Now just read it into a tall skinny structure with line number, column number and value.

data want ;
  infile example dsd dlm='a' truncover length=line_len column=current_col ;
  row+1;
  do col=1 by 1 until(current_col > line_len);
    input word :$4. @;
    output;
  end;
run;

Results

Obs    row    col    word

  1     1       1    947
  2     1       2    946
  3     1       3    945
  4     1       4    944
  5     1       5    943
  6     1       6    942
  7     1       7    941
  8     2       1    947
  9     2       2    946
 10     2       3    945
 11     2       4    944
 12     2       5    943
 13     2       6    942
 14     2       7    941
 15     2       8    940
 16     2       9    939
 17     2      10    938
 18     2      11    937
 19     3       1    936
 20     3       2    933
 21     3       3    1029
 22     3       4    926
 23     3       5    1119

 

 

davidsmarch
Calcite | Level 5

Apologies. Correct, though the length in each cell is shorter, I saw the process as needing to get the entire string split wide, then transpose to long. But your solution works perfectly! 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 3828 views
  • 1 like
  • 4 in conversation