deleted_user
Not applicable
Hi Expert,

Is there any function in SAS that offers random (parallel) access to a flat file?

Functionality exactly like the POINT=variable option of the SET statement.
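For reference, the SET statement behavior being asked about looks like this; a minimal sketch with a hypothetical data set name:

```sas
/* Direct (random) access to a SAS data set via POINT= */
data subset;
  do obsnum = 100 to 150;            /* jump straight to these rows */
    set work.bigtable point=obsnum;  /* no sequential read of rows 1-99 */
    output;
  end;
  stop;  /* required: POINT= reads never hit end-of-file on their own */
run;
```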

Please Help!


Regards,
Abhishek
14 REPLIES
LinusH
Tourmaline | Level 20
Not to my knowledge.
But why do you need it?
Describe your case in more detail, there may be an alternative solution that could be acceptable for you.
/Linus
Data never sleeps
DanielSantos
Barite | Level 11
I believe random access is achievable with some files on the z/OS system; besides that, it's not to my knowledge either.

Although it is possible to travel "backward" through the file, access to external files is always done sequentially.

Check the info on FPOINT, FNOTE, DROPNOTE and FREWIND:
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a000209714.htm
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a000209721.htm
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a000211377.htm
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a000211061.htm
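A minimal sketch of those file I/O functions (the fileref path is hypothetical); note that this positions within a file that is still read sequentially, rather than giving true random access:

```sas
filename src '/path/to/flatfile.txt';  /* hypothetical external file */

data _null_;
  fid  = fopen('src');        /* open the external file */
  rc   = fread(fid);          /* read record 1 into the file data buffer */
  note = fnote(fid);          /* bookmark the current position */
  rc   = fread(fid);          /* read ahead (record 2)... */
  rc   = fpoint(fid, note);   /* ...then point back to the bookmark */
  rc   = fread(fid);          /* re-read the bookmarked record */
  rc   = fclose(fid);
run;
```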

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
deleted_user
Not applicable
The size of the flat file is in the GBs. Reading the file sequentially will definitely hurt performance. Instead, if I could virtually split the file into multiple parts and read each part in parallel, as if there were multiple files whose cumulative record count equals that of the actual flat file, I suppose performance would improve considerably.

Please help!


Abhishek
deleted_user
Not applicable
Code similar to:

data test;
  infile filename lrecl=x firstobs=100 obs=150;  /* fileref and lrecl are placeholders */
  input;
run;

In the above code, though the output contains only records 100 to 150, during execution SAS still reads the first 99 records into the buffer. That is unlike the POINT= option of the SET statement, which jumps directly to the requested observation.
LinusH
Tourmaline | Level 20
I assume that you will search this multiple times?
Then I think it's better to load the data into an SPD Engine (SPDE) library and then use random access. It's probably worth the small overhead of importing.
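A sketch of that approach, with a hypothetical library path and file layout:

```sas
libname speedy spde '/data/spde';  /* SPD Engine library (hypothetical path) */

/* One-time import of the flat file */
data speedy.big;
  infile '/data/big.txt' recfm=f lrecl=200;  /* fixed-width; illustrative layout */
  input key $ 1-10 value 11-20;
run;

/* From then on, POINT= random access works as on any SAS data set */
data sample;
  do p = 100 to 150;
    set speedy.big point=p;
    output;
  end;
  stop;
run;
```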
/Linus
Data never sleeps
Patrick
Opal | Level 21
I believe a flat file is, as the name says, 'flat', so you don't have the metadata (number of observations, variables, ...) that would allow random access.

As you don't have this information, how could SAS possibly read record 100 without first reading records 1-99? The only definition of what record 100 is, is a count of end-of-line indicators.
deleted_user
Not applicable
I take your point.

Again, moving back to the root cause: please suggest best practices for loading a large file, more than 10 GB in size. The file is a fixed-width file.

Shall I first split the file and then load all the parts in parallel?

Also, what are the INFILE options I can harness for reading a large flat file?

Thanks,
Abhishek
Peter_C
Rhodochrosite | Level 12
I understand some languages allow you to read from an address within the file - I think the terminology is an "offset within the file" on Windows.
The nearest SAS appears to offer is unbuffered data (LRECL greater than 32K, or RECFM=N), read with something like INPUT @10000000 @;
That should be able to address the data starting at the ten-millionth byte.
When all the data must be read, this approach may not be best, because "unbuffered" usually means slower. However, it might be worth trying.
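A sketch of that suggestion (path, offset, and record width are hypothetical, and untested):

```sas
/* Treat the file as a byte stream and address an absolute position */
data slice;
  infile '/data/big.txt' recfm=n;  /* unbuffered, stream-mode access */
  input @10000000 rec $char200.;   /* data starting near the ten-millionth byte */
run;
```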

good luck
PeterC
LinusH
Tourmaline | Level 20
Splitting the file probably won't help you, since SAS would probably read the data as fast as your splitting program would.
If you plan to do this just once, I would say don't bother about performance.
If you plan to import this file on a regular basis, you might need to design a (simple) ETL flow, where you might optimize by importing only changed data in some way.
I don't think there are any options that will affect performance that much.
SAS is considered very fast at importing flat files, even compared with the various bulk loaders available from the competition.

/Linus
Data never sleeps
DanielSantos
Barite | Level 11
Agree with Linus.

Parallel processing can be very hardware dependent. It may depend on everything from how the file is fragmented across the disks to the system load at execution time.

There are still some system/SAS parameters that you could fine-tune, but I probably wouldn't bother with that unless you are seeing suspicious I/O performance while accessing the file.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
sbb
Lapis Lazuli | Level 10
What OS platform is running SAS for this application? There are also parallel-processing considerations about threading your batch processes/jobs, and whether there is a comprehensive job-scheduling facility. What other pre-processing (sort package) utilities are available to you? What challenges do you have with intermediate data-storage resources that may influence when you run your parallel processes?

It's quite possible that your SAS application may not need to reference all of your flat file contents, so, yes, it is possible that you could determine what specific "unique values" are needed and filter that file as it is being loaded. Your needs could be date-related, so, again, an input-side data filter process may be suitable.

Certainly your SAS application design process will want to take these factors into consideration.

Scott Barry
SBBWorks, Inc.
deleted_user
Not applicable
Data storage is not a concern; we have enough space and memory for parallel execution (64 GB RAM).

Also, the complete data is required as part of the analysis, so none of the column values or records can be skipped, and since the values are fixed length, missing or empty values are least expected.
DanielSantos
Barite | Level 11
Storage space is not the trouble, but the infrastructure setup may be, and there are so many things to consider there (local/SAN? RAID type? Striping? Direct I/O? etc.)

Can you tell us the typical transfer speed (MB/s) you are getting when importing data from this file?

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
Peter_C
Rhodochrosite | Level 12
The "fixed-width" picture helps with performance - it requires no variable-length search for end-of-field or end-of-line.
Assuming your multi-GB platform can cache the whole file and share its memory cache between applications, you could have parallel processes pull the content, each starting from a different "address" within the memory image. That would allow multiple processors on your platform to act independently, each reading its own part. When all parts are read, the remaining steps start with an SQL UNION or a DATA step SET (interleaved, if the flat file has a useful order).
Of course, sharing memory in that kind of way seems very OS-dependent.
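Under those assumptions, one slice of such a scheme might look like this as a single SAS job (widths, offsets, and the path are hypothetical); each parallel job would get its own FIRST/NRECS values:

```sas
/* One of N parallel jobs, each reading its own slice of a fixed-width file.
   Assumes records of &lrecl bytes plus a 1-byte line terminator. */
%let lrecl = 200;      /* record width (hypothetical)      */
%let first = 1000001;  /* first record for this job        */
%let nrecs = 500000;   /* number of records this job reads */

data slice;
  infile '/data/big.txt' recfm=n;              /* byte-stream access */
  startbyte = (&first - 1) * (&lrecl + 1) + 1; /* offset of the first record */
  input @startbyte @;                          /* position at the slice start */
  do i = 1 to &nrecs;
    input rec $char&lrecl.. +1 @;              /* one record, then skip line end */
    output;
  end;
  stop;
run;
```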
Good luck
PeterC


Discussion stats
  • 14 replies
  • 1177 views
  • 0 likes
  • 6 in conversation