🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ani7
Obsidian | Level 7

I have a pipe-delimited dataset that contains about 50 million observations. Unfortunately, not all of them contain the same number of variables, i.e., there are a couple thousand observations that contain "extra" pipes. So while most observations only contain values for 50 variables, these "right-shifted" observations may have values that go through var60 or var70 (with the first 10 or 20 variables being empty). Since I need to know the number of variables in order to write an infile statement to import this dataset, is there a quick way to find out how many there are?

 

I have tried a simple proc import with guessingrows = max and while that worked, it took about 12 hours to run. Does proc import have an option to force all variables to character before importing? I'm running SAS EG and am aware of manually setting attributes for Excel files but not for .txt files.
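One quick way to get that count directly (a minimal sketch, assuming a fileref IN pointing at the pipe-delimited file; the 75-character report text is made up) is a single DATA _NULL_ pass that tracks the largest field count seen on any line:

```sas
/* Sketch: scan the file once and report the largest number of
   pipe-delimited fields on any line. IN is an assumed fileref. */
data _null_;
  infile in end=done;
  input;                                   /* populate _INFILE_ */
  retain maxw 0;
  /* the M modifier makes COUNTW count empty fields as words,
     so right-shifted lines with leading empties count correctly */
  maxw = max(maxw, countw(_infile_, '|', 'm'));
  if done then put 'Maximum fields on any line: ' maxw;
run;
```

A single sequential pass like this is far cheaper than PROC IMPORT's type-guessing, since it never parses individual fields.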

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

If you know that the extra delimiters can only appear in one of the fields then you could read the file properly by counting how many words are on the line.  So read the fields before the one that might have extra pipes normally.  Then use a DO loop, driven by the number of extra words on the line, to read the pieces of that field and rebuild the string.  Then read the rest of the line normally.

 

I posted this solution recently on another thread.

 

But if the extra pipes can appear all over the line you will probably need to get a human to try to figure out how to fix them.

So you might start by filtering out the bad lines

filename in 'myfile.txt';
filename bad 'myfile_bad.txt';
data _null_;
  infile in;
  input;  /* populate _INFILE_ */
  /* the M modifier makes COUNTW count empty fields too */
  if countw(_infile_,'|','m') > 75 then do;
     file bad;
     put _infile_;
  end;
run;

and then manually fixing the bad lines into say a new file called GOOD.

filename both ('myfile.txt' 'myfile_good.txt');
data want;
  infile both dsd dlm='|' truncover;
  input @;
* ignore the bad lines ;
  if countw(_infile_,'|','m') > 75 then return;
* code to read the good lines goes here ;
  input (v1-v75) (:$100.);
run;

View solution in original post

7 REPLIES
ballardw
Super User

Were you actually given a data source that big that had no documentation? I would think anyone making such sized files for others to use would have something describing what you should have.

 

If the records are not the same structure on each line then Proc Import is suspect.

Do the values on those records with the "extra variables" have some pattern? It may be that you need to write a data step to read them properly. If you saved the log of the previous run then there was a data step written to the log that you could copy, save and modify. If you didn't save the log you could set the system option OBS to only process something like 50K records, which should complete a bit faster, to get a new version of the generated data step:

options obs=50k;

<your proc import code goes here>
/* reset to default*/
options obs=max;

Then copy that generated data step from the Log to the Editor, clean up, and rerun the new data step.

If the "extra variables" are always missing then remove them from the Informat/format/input statements generated.

 

Guessingrows will max out at 32K records examined, so with a file of 50 million records that is what "max" uses for guessing.

Ani7
Obsidian | Level 7
Don't get me started on the lack of documentation. But yes, the only information we know is how many variables there "should" be. I've currently used your method with an infile statement and arbitrarily picked 75 variables (and dropped all the empty ones). But on the off chance that there are more than 75, I imagine I'd be missing out on them with this method.

Guessingrows = max actually uses 2,147,483,647 obs to guess so this is still a viable alternative but not the most efficient.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p13kvtl8ezj13in17i6m99jypcwi.htm
Tom
Super User Tom
Super User

Are you sure the issue is extra variables on some rows?  Perhaps the issue is that some of the rows have the delimiter in the data and the creator of the file did not enclose those field values in quotes to protect the delimiters.  Some systems use "escape" characters instead of quotes to protect delimiters.  Perhaps your source file was created using that method.

 

Please go upvote this request to have SAS improve the handling of such files:

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

 

If the lines do have the embedded delimiters prefixed with \ then you could possibly modify the infile line on the fly to fix them.  The lines would need to be 32k bytes or less for this to work.

data want (compress=yes);
  infile 'myfile.txt' dsd dlm='|' truncover firstobs=2;
  input @;
  * hide the escaped delimiters so DSD parsing does not split on them ;
  _infile_=tranwrd(_infile_,'\|','FF'x);
  input (v1-v75) (:$100.);
  * restore the delimiters inside the values ;
  array v v1-v75;
  do i=1 to dim(v); v{i}=translate(v{i},'|','FF'x); end;
  drop i;
run;

 

Ani7
Obsidian | Level 7
The problem is actually arising from one of the variables containing text which also includes the delimiter. For example, var20 may have the value "NAME 1 NAME2 | TEST" and that extra pipe moves TEST to the next variable. We unfortunately can't go back to the client for this so have to make do.
Tom
Super User Tom
Super User

If you know that the extra delimiters can only appear in one of the fields then you could read the file properly by counting how many words are on the line.  So read the fields before the one that might have extra pipes normally.  Then use a DO loop, driven by the number of extra words on the line, to read the pieces of that field and rebuild the string.  Then read the rest of the line normally.

 

I posted this solution recently on another thread.
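A minimal sketch of that idea, assuming the extra pipes can only land in VAR20 and that a clean line has 75 fields (the file name and $100 field widths are made up; note CATX also trims blanks around the re-inserted pipe, so exact spacing is not preserved):

```sas
/* Sketch: rebuild a field that was split by unquoted embedded pipes */
data want;
  infile 'myfile.txt' dsd dlm='|' truncover;
  length v1-v75 piece $100;
  input (v1-v19) (:$100.) @;                 /* read up to the risky field */
  /* M modifier counts empty fields; extra = fields beyond the expected 75 */
  extra = countw(_infile_, '|', 'm') - 75;
  input v20 :$100. @;
  do i = 1 to extra;                          /* glue the split pieces back */
    input piece :$100. @;
    v20 = catx('|', v20, piece);
  end;
  input (v21-v75) (:$100.);                   /* read the rest normally */
  drop piece extra i;
run;
```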

 

But if the extra pipes can appear all over the line you will probably need to get a human to try to figure out how to fix them.

So you might start by filtering out the bad lines

filename in 'myfile.txt';
filename bad 'myfile_bad.txt';
data _null_;
  infile in;
  input;  /* populate _INFILE_ */
  /* the M modifier makes COUNTW count empty fields too */
  if countw(_infile_,'|','m') > 75 then do;
     file bad;
     put _infile_;
  end;
run;

and then manually fixing the bad lines into say a new file called GOOD.

filename both ('myfile.txt' 'myfile_good.txt');
data want;
  infile both dsd dlm='|' truncover;
  input @;
* ignore the bad lines ;
  if countw(_infile_,'|','m') > 75 then return;
* code to read the good lines goes here ;
  input (v1-v75) (:$100.);
run;
ballardw
Super User

@Ani7 wrote:
The problem is actually arising from one of the variables containing text which also includes the delimiter. For example, var20 may have the value "NAME 1 NAME2 | TEST" and that extra pipe moves TEST to the next variable. We unfortunately can't go back to the client for this so have to make do.

If that value is actually quoted in the source file, Proc Import generally wouldn't do that, as it uses the DSD option. If the value actually appears as valuex|Name1 Name2|Test|othervalue, with no quotes around the problem value, then yes, there is a problem. With whoever created the file.
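A small sketch with made-up values shows the difference: with DSD, a quoted value keeps its embedded pipe, while the unquoted version splits.

```sas
/* Sketch: DSD honors quotes around a value containing the delimiter */
data test;
  infile datalines dsd dlm='|' truncover;
  input (v1-v3) (:$40.);
  put v1= v2= v3=;   /* quoted line keeps the pipe in v2; unquoted splits */
datalines;
a|"Name 1 Name2 | TEST"|b
c|Name 1 Name2 | TEST
;
run;
```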

 

Before spending a lot of time "fixing" I would suggest approaching the source and see if they could create a file with either or both of 1) quotes around text values containing the | and 2) with a different delimiter that does not appear in the data text anywhere such as maybe a Tab or ^ character.

Ksharp
Super User
You could declare a very large number of variables to import this txt file, like Tom's code does, and drop the all-missing variables later.

data want (compress=yes);
infile 'myfile.txt' dsd dlm='|' truncover firstobs=2;
input (v1-v600) (:$200.);
run;

