I have a pipe-delimited dataset that contains about 50 million observations. Unfortunately, not all of them contain the same number of variables; there are a couple thousand observations that contain "extra" pipes. So while most observations only contain values for 50 variables, these "right-shifted" observations may have values that go through var60 or var70 (with the first 10 or 20 variables being empty). Since I need to know the number of variables in order to write an infile statement to import this dataset, is there a quick way to find out how many there are?
I have tried a simple proc import with guessingrows = max and while that worked, it took about 12 hours to run. Does proc import have an option to force all variables to character before importing? I'm running SAS EG and am aware of manually setting attributes for Excel files but not for .txt files.
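If the goal is only to learn the maximum number of fields on any line, you don't need PROC IMPORT at all; a single pass with COUNTW will do it. A minimal sketch, assuming the file is named myfile.txt (the 'm' modifier makes COUNTW count empty fields too, which matters here because the shifted records begin with empty values):

```sas
data _null_;
  infile 'myfile.txt' lrecl=32767 truncover end=done;
  input;
  retain maxfields 0;
  /* fields per line = delimiters + 1; the 'm' modifier counts empty fields */
  maxfields = max(maxfields, countw(_infile_, '|', 'm'));
  if done then put 'Maximum number of fields: ' maxfields;
run;
```

This reads every line once but does no parsing, so it should run far faster than a guessing import over the full file.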
If you know that the extra delimiters can only appear in one of the fields then you could read the file properly by counting how many words are on the line. So read the fields before the problem field normally. Then use a DO loop, driven by the number of words on the line, to read the extra pieces and rebuild that field's value. Then read the rest of the line normally.
I posted this solution recently on another thread.
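As an illustration of that approach, here is a sketch that assumes the expected layout is 50 fields and that only var20 (v20 below) can contain the extra pipes; the field position, field count, and length are all assumptions you would adjust for your file:

```sas
data want;
  infile 'myfile.txt' dsd dlm='|' truncover;
  length v1-v50 piece $100;
  input @;
  /* how many delimiters beyond the expected 50 fields */
  extra = countw(_infile_, '|', 'm') - 50;
  /* read the fields before the problem field normally */
  input (v1-v19) (:$100.) @;
  /* rebuild v20 from 1 + extra pieces */
  v20 = '';
  do i = 1 to 1 + extra;
    input piece :$100. @;
    v20 = catx('|', v20, piece);
  end;
  /* read the rest of the line normally */
  input (v21-v50) (:$100.);
  drop piece extra i;
run;
```

Note that CATX skips blank arguments, so a genuinely empty piece inside the problem field would lose its pipe; this is a sketch, not a complete solution.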
But if the extra pipes can appear all over the line you will probably need to get a human to try to figure out how to fix them.
So you might start by filtering out the bad lines
filename in 'myfile.txt';
filename bad 'myfile_bad.txt';
data _null_;
  infile in;
  input;
  /* more fields than the expected 50 (the 'm' modifier counts empty fields) */
  if countw(_infile_,'|','m') > 50 then do;
    file bad;
    put _infile_;
  end;
run;
and then manually fixing the bad lines into say a new file called GOOD.
filename both ('myfile.txt' 'myfile_good.txt');
data want;
  infile both dsd dlm='|' truncover;
  input @;
  * skip the bad lines in the original file; their fixed versions come from GOOD ;
  if countw(_infile_,'|','m') > 50 then delete;
  * code to read the good lines goes here ;
  input (v1-v50) (:$100.);
run;
Were you actually given a data source that big that had no documentation? I would think anyone making such sized files for others to use would have something describing what you should have.
If the records are not the same structure on each line then Proc Import is suspect.
Do the values on those records with the "extra variables" have some pattern? It may be that you need to write a data step to read them properly. If you saved the log of the previous run then there was a data step written to the log that you could copy, save and modify. If you didn't save the log you could set the system option OBS to only process something like 50K records, which should complete a bit faster, to get a new version of the generated data step:
options obs=50k;
<your proc import code goes here>
options obs=max; /* reset to default */
Then copy that generated data step from the Log to the Editor, clean up, and rerun the new data step.
If the "extra variables" are always missing then remove them from the Informat/format/input statements generated.
One note on GUESSINGROWS: in older SAS releases it was capped at 32K records examined, but in current releases GUESSINGROWS=MAX scans the entire file, which would explain why your import of 50 million records took 12 hours.
Are you sure the issue is extra variables on some rows? Perhaps the issue is that some of the rows have the delimiter in the data and the creator of the file did not enclose those field values in quotes to protect the delimiters. Some systems use "escape" characters instead of quotes to protect delimiters. Perhaps your source file was created using that method.
Please go upvote this request to have SAS improve the handling of such files:
If the lines do have the embedded delimiters prefixed with \ then you could possibly modify the _infile_ buffer on the fly to fix them. The lines would need to be 32K bytes or less for this to work.
data want (compress=yes);
  infile 'myfile.txt' dsd dlm='|' truncover firstobs=2;
  input @;
  /* replace each escaped delimiter with a placeholder byte */
  _infile_ = tranwrd(_infile_, '\|', 'FF'x);
  input (v1-v75) (:$100.);
  /* put the literal pipes back into the values */
  array v v1-v75;
  do i = 1 to dim(v);
    v[i] = translate(v[i], '|', 'FF'x);
  end;
  drop i;
run;
@Ani7 wrote:
The problem is actually arising from one of the variables containing text which also includes the delimiter. For example, var20 may have the value "NAME 1 NAME2 | TEST" and that extra pipe moves TEST to the next variable. We unfortunately can't go back to the client for this so have to make do.
If that value is actually quoted in the source file, Proc Import generally wouldn't split it, since it uses the DSD option. If the value appears as valuex|Name1 Name2|Test|othervalue, with no quotes around the problem value, then yes, there is a problem, and it lies with whoever created the file.
Before spending a lot of time "fixing", I would suggest approaching the source to see whether they could create a file with either or both of: 1) quotes around text values that contain the |, and 2) a different delimiter that does not appear anywhere in the data text, such as a Tab or ^ character.