I've tried running the following on my dataset:
data split;
  set test;
  length var1-var700 $10;
  array var(700) $;
  do i = 1 to dim(var);
    var[i] = scan(row, i, '|', 'm');
  end;
run;
But this takes a very long time.
Is there a more efficient way of handling this?
I was so interested in learning this _infile_ trick that I stayed up researching and trying to resolve it myself. Here is what I found:
/* Dummy File */
filename tmpf temp;
/* Ensure it gets created */
data _null_;
file tmpf;
put;
run;
/* Use External file with the _infile_ trick */
data split;
  set test;
  array var{700} $10;
  infile tmpf dsd dlm='|' truncover ls=7699 lrecl=7699;
  input @1 @@;
  _infile_ = strip(row);
  input @1 var[*] :$10. @@;
  drop row;
run;
1. Is the file already a SAS data set?
2. Can you go back to the source and read it correctly in the first place? That would probably be a bit more efficient.
3. That's a wide data set; can you partition it in any manner? Do you have enough space to store and manage it? It's going to be in the tens, if not hundreds, of gigabytes. I had a data set with ~30 million rows and 100 variables that was typically around 30 GB.
4. What is your SAS installation like? Are you working on a desktop or a server installation? If this is SAS UE, it likely will not handle data this large.
Processing any file that large will take time, so what's your expected amount of time to read and process this file? It likely won't happen in seconds (unless you have an amazing server), but I would expect it to take less than 30 minutes.
I'm not an efficiency expert, but I suspect there's no way to do this much more efficiently once the data is in SAS. Ideally, you would go back a step to when the data is loaded and load it correctly from the start. You should be able to specify the delimiter easily if this is a well-formatted text file, which it sounds like it is.
If you are using SAS 9.4, you may want to use PROC DS2 to process your ~28 million row data set in a multi-threaded way.
Here is how you can do it:
proc ds2;
/* Declare a thread that splits the long delimited string into 700 variables */
thread splitter / overwrite=yes;
  dcl double i count;
  dcl bigint thisThread;
  dcl vararray varchar(10) var[1:700] var1-var700;
  drop count;
  method run();
    set test;
    do i = lbound(var) to hbound(var);
      var[i] = scan(row, i, '|', 'm');
    end;
    thisThread = _threadid_;
    count+1;
  end;
  method term();
    put '**Thread' _threadid_ 'processed' count 'rows:';
  end;
endthread;
run;

/* Use 6 threads to process the large ~28 million row data set */
data split / overwrite=yes;
  dcl thread splitter st;
  method run();
    set from st threads=6;
  end;
enddata;
run;
quit;
Credits to @SASJedi for posting the blog entry "Jedi SAS Tricks: Warp Speed DATA Steps with DS2".
Hope this helps speed up your processing,
Ahmed
Some characteristics of your data may make this possible. Here are a few related questions. Among VAR1-VAR700:
Do any variables have a missing value?
Do any variables contain embedded blanks?
What I'm getting at is that each variable forces SAS to scan past 350 delimiters on average. There might be a way to convert the values using either arrays or a hash table (or maybe even parsing functions) while searching for just a single delimiter each time. So let's start with those questions.
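To illustrate the cost difference outside SAS (a Python sketch modeling only the algorithmic work; all function names here are hypothetical, not SAS or library calls): the repeated-SCAN approach re-searches the whole string for every variable, while a single-pass split touches each delimiter exactly once.

```python
# Model of SAS scan(row, i, '|', 'm'): every call starts over at the
# beginning of the string to find the i-th value ('m' keeps empty values).
def scan_m(row, i, dlm="|"):
    return row.split(dlm)[i - 1]          # re-splits the whole row on each call

def split_rescan(row, k=700):
    # One full pass over ROW per variable: ~k * 350 delimiter hits in total
    return [scan_m(row, i) for i in range(1, k + 1)]

def split_single_pass(row, k=700):
    # One pass over ROW: each delimiter is examined exactly once
    vals = row.split("|")
    vals += [""] * (k - len(vals))        # pad short records, as TRUNCOVER would
    return vals[:k]

row = "A|452435234|4353663412|JOHN|DOE|25" + "|" * 694   # 699 delimiters, 700 values
assert split_rescan(row) == split_single_pass(row)
```

Both produce the same 700 values; only the number of times the delimiters are visited differs, which is where the quadratic slowdown comes from.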
Did you try using the "_infile_ trick"?
data split;
  infile cards dsd dlm='|' truncover;
  set test;
  input @1 @@;
  _infile_ = row;
  input (var1-var700) (:$10.) @@;
  drop row;
cards;
***dummy line***
;
Hi @Tom & @FreelanceReinh
Thanks for your replies. I tried both of your approaches on the generated dataset and they work great. However, mine seems to run into some issues, which I think is due to the many missing values.
Here's an example of what the data looks like; as you can see, there will be a lot of missing values:
A|452435234|4353663412|JOHN|DOE|25||||||MI|00001||||||||||||||||||
When I run @Tom's data step on the dataset, the output ends up as follows (the rest are blank). It looks like an issue with TRUNCOVER; I'm not sure how to rectify it:
var1 | var2 | var3 | var4 | var5 | var6 | var7 | var8 | var9 | var10 | var11 | var... |
A | 452435234 | 4353663412 | JOHN | DOE |
When I run @FreelanceReinh's data step, the output ends up blank.
Hi @promo_at_work,
Thanks for testing the suggested solutions. Did you check the resulting SPLIT datasets, e.g. by comparing them to the dataset obtained with your correct but slow SCAN approach?
The reason I am asking is that I hadn't been able to get correct results from my generated test data using the "INFILE CARDS" approach (which was used in the 2016 post where I had seen the "_infile_ trick" for the first time). I don't think that TRUNCOVER is the issue here, but rather the limitation to a record length of 80 characters (which I wasn't able to overcome, regardless of the CARDIMAGE system option setting or the LINESIZE= option of the INFILE statement, LRECL= being inapplicable).
The length of variable _INFILE_ can be written to the log:
701  options nocardimage;
702  data _null_;
703  infile cards dsd dlm='|' truncover ;
704  set test(obs=1);
705  input @1 @@;
706  _infile_=row;
707  len=length(_infile_);
708  put len=;
709  cards;

len=80
NOTE: There were 1 observations read from the data set WORK.TEST.
So, the subsequent INPUT statement (populating the 700 character variables) would operate on a truncated record, resulting in missing values for all but the first few variables. This is the effect you describe.
Therefore I resorted to reading from an external (dummy) file where the default logical record length is 32767 (unless the LRECL= system option has been set to a smaller value), hence sufficient for your data. The external file can be an existing file, as @s_lassen has suggested, or a newly created temporary file as in my approach.
When you apply my approach to your data and the "output ends up blank", do you get any log messages (including NOTEs) which are different from those obtained with the test data (where "it works great")?
I don't believe that "many missing values" can be the reason. The test data contain missing values as well, up to seven in a row, and it's easy to increase the proportion of missing values further, e.g. by assigning
l=floor(11*rand('uniform')**4);
(not to speak of l=0).
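For illustration (a standalone Python simulation, not the SAS test-data generator): raising the uniform draw to the fourth power makes a zero length, and hence a missing value, much more likely, since P(floor(11*u**4) = 0) = (1/11)**0.25 ≈ 0.55, compared with 1/11 ≈ 0.09 for floor(11*u).

```python
import random

random.seed(42)
draws = [random.random() for _ in range(100_000)]

# Proportion of zero lengths under each formula (int() truncates,
# matching floor for non-negative values)
zeros_pow4 = sum(int(11 * u**4) == 0 for u in draws) / len(draws)
zeros_flat = sum(int(11 * u) == 0 for u in draws) / len(draws)

# Analytically: P(11*u**4 < 1) = (1/11)**0.25 ~ 0.55, P(11*u < 1) = 1/11 ~ 0.09
print(zeros_pow4, zeros_flat)
```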
Are different operating systems involved, so that the TERMSTR= option of the INFILE statement should be used? (I'm not familiar with the "SAS Grid Platform".) Or do your data contain control characters, multi-byte characters etc.?
Hey @FreelanceReinh,
Thanks for looking into it. Yes, it does appear to be the record length. Here's an extract from the log:
NOTE: LOST CARD.
_ERROR_=1 _INFILE_= _N_=1
NOTE: 1 record was read from the infile FT15F001.
The minimum record length was 4.
The maximum record length was 4.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: There were 1 observations read from the data set WORK.TEST.
NOTE: The data set WORK.SPLIT has 0 observations and 700 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
The record lengths of 4 mentioned in the log are correct. They just reflect the length of the dummy string "blah". Much more important is the record length that is normally mentioned in the log message starting with "NOTE: The infile FT15F001 is:". On my computer it says:
RECFM=V,LRECL=32767,File Size (bytes)=6,
and 32767 should be enough for 700 variables of length 10 (plus the delimiters between them).
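The arithmetic behind "should be enough": 700 values of up to 10 characters plus 699 delimiters give a maximum record length of 7,699 bytes, which matches the lrecl=7699 used earlier, fits comfortably within the 32,767-byte default, and is far beyond the 80-byte CARDS buffer. In Python terms:

```python
n_vars, width = 700, 10
max_record = n_vars * width + (n_vars - 1)   # values plus one '|' between each pair

assert max_record == 7699    # matches the lrecl=7699 seen earlier
assert max_record <= 32767   # fits the default LRECL of an external file
assert max_record > 80       # ...but not the 80-byte CARDS record
```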
Do you see a different value in your log?
The notes
NOTE: LOST CARD.
and
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
together with
NOTE: There were 1 observations read from the data set WORK.TEST.
NOTE: The data set WORK.SPLIT has 0 observations and 700 variables.
indicate that variable ROW contains fewer than 700 pipe-delimited values in the first observation of dataset TEST. You can add the TRUNCOVER option to the INFILE statement to avoid these notes and probably get some (or possibly all) observations in dataset SPLIT. However, the discrepancy between the anticipated and the actual number of values would still be alarming.
Edit: You can run the following step to determine the number of delimiters in the first observation (or in the first few observations if you increase the OBS= option value):
data _null_;
set test(obs=1);
numdlm=countc(row,'|');
put numdlm=;
run;
The result numdlm=699 (in the log) would be good, but a smaller number is more likely, given the symptoms you described.
I was so interested in learning this _infile_ trick that I stayed up researching and trying to resolve it myself. Here is what I found:
/* Dummy File */
filename tmpf temp;
/* Ensure it gets created */
data _null_;
file tmpf;
put;
run;
/* Use External file with the _infile_ trick */
data split;
  set test;
  array var{700} $10;
  infile tmpf dsd dlm='|' truncover ls=7699 lrecl=7699;
  input @1 @@;
  _infile_ = strip(row);
  input @1 var[*] :$10. @@;
  drop row;
run;
Using the INPUT statement in this way is pretty neat.
To speed this DATA step just a little more, you could add a "retain var;" statement after the ARRAY statement.
Each time the DATA step iterates, all the variables in the array VAR are initialized to blanks; then the INPUT statement assigns each of them a value. The RETAIN statement suppresses that initialization, which isn't needed here.
There might be a slight speed-up from using the $CHAR10. informat instead of $10.: $CHARw. amounts to a memory move, while $w. does a little more work to remove leading blanks and to check for the value ".".
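The two behaviors can be modeled in Python (a simplified sketch of the documented informat semantics, not SAS internals): $CHARw. keeps the field verbatim, while $w. left-aligns it and reads a lone period as a missing (blank) value.

```python
def informat_char(field):
    # $CHARw.-style: copy the field as-is (essentially a memory move)
    return field

def informat_std(field):
    # $w.-style: discard leading blanks; a single '.' means a missing value
    value = field.lstrip()
    return "" if value.rstrip() == "." else value

assert informat_char("  JOHN") == "  JOHN"   # leading blanks preserved
assert informat_std("  JOHN") == "JOHN"      # leading blanks removed
assert informat_std(" . ") == ""             # lone period -> missing
```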
If you're using a Unix system, you can use "/dev/zero" instead of tmpf on the INFILE statement. This makes the program a little simpler because you don't need to create a temp file.
If you're using Windows or z/OS, I'm not aware of an equivalent to /dev/zero on those operating systems.
data split2;
  set test;
  array var{700} $10;
  retain var;
  infile '/dev/zero' dsd dlm='|' truncover ls=7699 lrecl=7699;
  input @1 @@;
  _infile_ = strip(row);
  input @1 var[*] :$CHAR10. @@;
  drop row;
run;