@Season wrote:
So the selection process still takes place after the entire importation is done, right?
A text file is linear. There is no way to read it without actually reading it.
Thank you! I consulted DeepSeek on resolving this issue in R, and it suggested a "flowing decompression" method for dealing with the problem. In short, a batch of observations is decompressed, imported, selected, and stored. When one cycle finishes, a second batch is decompressed while the decompressed data from the first batch is deleted, and so on. The stored observations, which are what we ultimately want but are scattered across multiple small datasets for the time being, are then stacked to form a single large one. Can SAS do something like this?
Why would you want to? SAS does NOT load the whole dataset into memory to work with it, the way base R does with its variables (objects, as R calls them). So no tricks to reduce memory use are typically needed when working in SAS.
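For illustration, here is a minimal sketch of how a SAS DATA step can stream a gzipped CSV and keep only the rows of interest while reading, without ever decompressing the whole file to disk. It assumes SAS 9.4M5 or later (for the GZIP option of the ZIP access method) and uses a hypothetical file path, column names, and filter condition:
filename gzcsv zip "/data/bigfile.csv.gz" gzip;   /* hypothetical path; GZIP option needs SAS 9.4M5+ */
data work.subset;
    infile gzcsv dsd dlm=',' firstobs=2 truncover; /* skip the header line */
    input id age score;                            /* hypothetical columns */
    if age >= 65;                                  /* keep only the observations you need */
run;
Because the DATA step reads one record at a time and the subsetting IF discards unwanted rows before they are written out, only the kept observations ever reach disk, which is effectively what the R "flowing decompression" scheme accomplishes in batches.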
Because the entire dataset is too large to load. I understand that the importation process might not need the whole file to be loaded into memory in SAS, but the problem is that the resulting imported dataset is also too large to hold in memory.
SAS stores datasets on disk, not in memory, so large amounts of memory are not needed to work with datasets, especially one that has only 40 variables. The only place you will run into memory issues is an analysis that creates matrices too large to fit in memory, for example using a CLASS variable with millions of distinct levels.
Saving such a large dataset on disk might be an issue, however. The SAS dataset structure is not that efficient, but using the COMPRESS=YES option can make datasets take up somewhat less disk space.
Thank you for your patient explanation! Could you please tell me where to specify the COMPRESS=YES option?
You set the system option using the OPTIONS statement.
options compress=yes;
You set it at the LIBREF level using the COMPRESS= option of the LIBNAME statement.
libname mylib 'myfolder_name' compress=yes;
You can set it at the DATASET level using the COMPRESS= dataset option.
data mylib.myds(compress=yes);
infile .....
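Putting the pieces together, a fuller sketch might look like the following (the gzipped CSV path and column names are made up for illustration; the COMPRESS=YES dataset option only affects how the output dataset is stored on disk):
filename gzcsv zip "/data/bigfile.csv.gz" gzip;   /* hypothetical path; needs SAS 9.4M5+ */
libname mylib 'myfolder_name';
data mylib.myds(compress=yes);
    infile gzcsv dsd dlm=',' firstobs=2 truncover;
    input id name :$32. value;                    /* made-up columns */
run;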
Is it possible, then, to specify the starting and ending row of the .csv.gz file and let SAS read in the designated subset of data only?
Yes. You have already seen how to set the starting observation number (really the starting LINE number) in the example INFILE statements posted above. To tell SAS where to stop, use the OBS= option of the INFILE statement.
So to read the first 100 lines of actual data you would use FIRSTOBS=2 and OBS=101 (skipping the header line).
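As a sketch using the same hypothetical file and column names as above, reading only data lines 2 through 101:
filename gzcsv zip "/data/bigfile.csv.gz" gzip;
data work.first100;
    infile gzcsv dsd dlm=',' firstobs=2 obs=101 truncover; /* lines 2-101: the first 100 data lines after the header */
    input id name :$32. value;
run;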