Hi Experts,
I have a big data text file of a customer with 20 billion of records. In this file, I want to read it and select only two columns and filter only lines by a filter condition.
I´m looking to a best strategy to do that, because a data step with infile statement, if I remember, it does not allow to filtering lines when SAS is reading files to pdv.
Is there any other way to a optmized reading proccess of this file ?
Does anyone have a insight to the best way of loading reading large files ?
I´m also considering to slice it in multiple files in OS by a shell script command
What type of text file? If the source is fixed width data then the Input statement will read directly only the columns specified.
If the file is CSV of some flavor you read the unwanted columns into a variable that you drop. Dummy code as an example: This assumes that the first variable that you want to keep is the 4th variable (or column if you prefer) and the second that you want is the 7th variable. There is no need to read past the second variable that you want. Note that all the other values get read as a single character then dropped from the output. The character informat is important because it won't choke on anything. Set the informat as needed for your variables.
Data want; infile "<path>\somefilename.csv" dlm=',' dsd lrecl=32000; informat dropvar $1. firstvar $10. /*or appropriate informat for the variable you want*/ secondvar f12. ; input dropvar dropvar dropvar firstvar dropvar dropvar secondvar ; drop dropvar; <your conditions to keep records go here> run;
I use this with a data source that provides dozens of files and in most of the files I only use 5 or 10 out of as many as 200 variables.
If the file is not a simple character separated (character includes Tab) file, then post an example file with 10 or 15 rows of text use the text box opened with the </> icon to preserve formatting and clearly describe how to identify the values you need to read.
Not sure how you are going to filter the records without reading them to see what they contain.
Why do think you need something other than a data step to do that?
SAS reads text file pretty efficiently.
For more detailed help provide an actual example. Describe the records in the text file. Describe the layout of the text file. At least enough to show how to read the variables you want to keep and/or use to filter on.
Let me explain better, Obviously that I have to read the file and each record, but I´m considering that SAS reading processing is record by record, só if the line I´m reading does not correspond to my filter criteria, I just discard it, validate it to move to next record.
File is a csv file by semicolon with more than one hundred columns. The question is SAS is reading line by line and applying filter, discarding the records, and write the dataset only with records and selected columns.
customerdId;Adress;Gender;Segment
1; Rua 18 norte lt 1/3 Brasilia, Brasil; Masculino ; A
2; Rua 19 norte lt 1/3 Brasilia, Brasil; Masculino ; B
I want to create a subssting of customers of segment = B;
myResult:
customerdId;segment
2;B
That is very simple.
With a delimited file you will have to read the fields up to at least the last one you need for your purposes. In this case that is all of the fields. (You could do something to peal off the last field in a line, but not sure if it saves any real time).
First let's convert your sample text into a physical file we can work with.
options parmcards=csv;
filename csv temp;
parmcards4;
customerdId;Adress;Gender;Segment
1; Rua 18 norte lt 1/3 Brasilia, Brasil; Masculino ; A
2; Rua 19 norte lt 1/3 Brasilia, Brasil; Masculino ; B
;;;;
Now you can read and subset the data easily.
data want;
infile csv dsd dlm=';' firstobs=2 truncover;
length customerId 8 Segment $8 dummy $1;
input customerID 2*dummy Segment;
if segment='B';
drop dummy;
run;
Results:
customer Obs Id Segment 1 2 B
You could then write back out a new SCSV (Semi-Colon Separated Values) file.
filename new temp;
data _null_;
file new dsd dlm=';';
length name $32 ;
do name='customerId','Segment';
put name @;
end;
run;
data _null_;
file new dsd dlm=';' mod ;
set want ;
put customerId Segment;
run;
Or you could skip making the SAS dataset.
filename new temp;
data _null_;
file new dsd dlm=';';
length name $32 ;
do name='customerId','Segment';
put name @;
end;
run;
data _null_;
infile csv dsd dlm=';' firstobs=2 truncover;
length customerId 8 Segment $8 dummy $1;
input customerID 2*dummy Segment;
if segment='B';
file new dsd dlm=';' mod ;
put customerId Segment;
run;
What is the field format of the file. If fixed width fields you can use something like this.
INPUT @column-number field1 @column-number field2.; If condition;
For CSV or other delimited field format you will could use.
length dummy $1;
input 20*dummy wanted-field1 30*dummy wanted-field2;
if condition;
As others said not sure how you expect to subset the observations without passing through all of them with something, be that SAS or some other language. Then you are still going to need to read in the subsetted data, assuming the goal is to get it into SAS.
What I used to do, is read the 2 columns with a trailing @ input statement. Then apply an IF statement. Finally, read in all the other variables that you want in the resulting SAS dataset.
This way you only read 2 variables into the PDV until you know you want that observation, then apply the IF statement and read in the remaining variables.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.