Hello,
For each day there is a CSV file called tYYYYMMDD.
Each CSV file contains around 2 million rows.
I want to import the CSV files from the last year and put them in one data set.
The problem is that this exceeds the maximum number of rows. (Question 1: What is the maximum number of rows possible in a data set?)
I also have a data set called List_customers with the list of customers that purchased in the shop in the last year.
I need to select rows from the CSV files only for the customers in the List_customers data set.
Question 2:
What is the way to import multiple CSV files only for the customers in the List_customers data set, and then put them in one data set?
Is it possible to import rows of the CSV files only for these specific customers?
Or is the only way to import the full CSV files and then perform an inner join with the List_customers data set?
700 million observations is way below any SAS limit I know of.
What variables do you have in the CSV files, and what are their attributes (take them from the documentation you received along with the files)?
Hello,
In the CSV files there are 25 variables and all are numeric.
I still ask: is it more efficient to import the CSV files, select only the rows that are relevant to my customers, and then stack the rows into one data set?
(Instead of importing millions of rows, most of which are not relevant)?
Run the import for all data in one data step, as described in your other thread.
Load the customers into a hash object, and use the CHECK method to filter the relevant observations.
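A minimal sketch of that approach, assuming the key variable is named CUSTOMER in both the CSV files and LIST_CUSTOMERS, and using a made-up file path and variable names:

```sas
data want;
    /* Load the customer list into a hash object once, on the first iteration */
    if _n_ = 1 then do;
        declare hash h(dataset: 'list_customers');
        h.definekey('customer');
        h.definedone();
    end;
    /* FIRSTOBS=2 skips a single header row; path and variables are illustrative */
    infile '/mypath/t20210101.csv' dsd truncover firstobs=2;
    input customer var1 var2;
    /* CHECK returns 0 when the key is found, so this keeps only listed customers */
    if h.check() = 0;
run;
```

The subsetting IF at the end writes an observation only when the customer is present in the hash.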
How can you know what's in a row until you read it? Unless you have some external logic or additional information you can't know. Therefore you kind of have to process the entire file and then parse out what you want. So you read the file but only write the records of interest.
@Ronein wrote:
Hello,
In the CSV files there are 25 variables and all are numeric.
I still ask: is it more efficient to import the CSV files, select only the rows that are relevant to my customers, and then stack the rows into one data set?
(Instead of importing millions of rows, most of which are not relevant)?
How many "relevant customers" are there?
If the number is "small" it might be as simple as adding something like:
if customer_id in ('id1' 'id2' 'id3') ;
to the data step reading the file.
You would read all of the records but only keep the desired ones.
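A minimal sketch of that "small list" approach, with a hypothetical file path, variable names, and (numeric) IDs:

```sas
data want;
    /* Path, layout, and IDs are made up for illustration; FIRSTOBS=2 skips the header */
    infile '/mypath/t20210101.csv' dsd truncover firstobs=2;
    input customer_id var1 var2;
    /* Subsetting IF: every record is read, but only these IDs are written out */
    if customer_id in (101, 102, 103);
run;
```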
If you have "many", then read the whole file and filter later, such as by merging with a data set that contains the relevant customer ID values.
Are the same customers "relevant" in all the source files?
Why does reading 2 million records "exceed the maximum number of rows"? Are you using one of the SAS editions that has file size limits because it is intended as a learning tool and not a production data system?
CSV files are text files, and text files have variable-length lines. So you have to read the whole file.
That does not mean you have to write back out all of the lines read.
Similarly, the location of the variables on a line is not fixed. So even if you only want some of the variables, you have to read them all (or at least up to the last one on the line that you want).
To filter observations based on a list of values, a HASH object is useful. (If the list of customers is too large to load into a hash, then use an indexed dataset instead.)
With your naming convention you can use a wildcard in the physical filename to read all of the CSV files in one step. If the CSV files have header rows then you will need to include logic to skip the header row.
Let's assume the key variable is named CUSTOMER in both your LIST_CUSTOMERS dataset and in the dataset you want to create from the CSV files.
Example:
data want ;
    * Load the customer list into a hash object ;
    if _n_ = 1 then do;
        declare hash h(dataset:'list_customers');
        rc = h.definekey('customer');
        rc = h.definedone();
    end;
    * Read all of the files for a particular year ;
    * Skip the header line when starting a new file ;
    length fname $256 ;
    infile '/mypath/t2021*.csv' dsd truncover filename=fname ;
    input @;
    if fname ne lag(fname) then delete;
    * Logic for reading the data from a line in the CSV file here ;
    input customer var1 var2 .... ;
    * Delete lines for customers that are not in the hash object ;
    if h.find() then delete;
run;
The * in the filename is a wildcard, so the example is for 2021. If you also have other files, like "t2021GEORGE.csv", that match that pattern, you will have a problem (not a hard problem, since you can use the FNAME variable to test the actual name of the file that was found).
You need the extra INPUT with the trailing @ to get the FNAME variable updated so you can test whether you are at the start of a new file or not. If the files do not contain header rows, then you do not need this. But see the discussion below about whether the DATE value appears in the data or not.
Yes. It makes no sense to make 365 separate SAS datasets. If the fields in the file do not contain the date, then pull the date from the FNAME variable. Note that the FNAME variable will not appear in the output dataset, so use a new variable for this.
date_string = scan(fname,-2,'.t');
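As a hypothetical continuation inside the same data step, the extracted digits (e.g. "20210101") could be converted into a real SAS date value:

```sas
/* Assumes date_string holds eight digits like "20210101" */
file_date = input(date_string, yymmdd8.);
format file_date date9.;
```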