Hello,
For each day there is a CSV file called tYYYYMMDD.
Each CSV file contains around 2 million rows.
I want to import the CSV files from the last year and put them in one data set.
The problem is that this exceeds the maximum number of rows. (Question 1: What is the maximum number of rows possible in a data set?)
I also have a data set called List_customers with the list of customers that purchased in the shop in the last year.
I need to select rows from the CSV files only for the customers in the List_customers data set.
Question 2:
What is the way to import multiple CSV files only for the customers in the List_customers data set, and then put them in one data set?
Is it possible to import rows of the CSV files only for these specific customers?
Or is the only way to import the full CSV files and then perform an inner join with the List_customers data set?
700 million observations is way below any SAS limit I know of.
What variables do you have in the CSV files, and what are their attributes (take them from the documentation you received along with the files)?
Hello,
In the CSV files there are 25 variables and all are numeric.
I still ask: is it more efficient to import the CSV files, select only the rows that are relevant to my customers, and then stack the rows into one data set?
(Instead of importing millions of rows, most of which are not relevant)?
Run the import for all data in one data step, as described in your other thread.
Load the customers into a hash object, and use the CHECK method to filter the relevant observations.
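A minimal sketch of that approach, assuming the key variable is named CUSTOMER in both the CSV files and LIST_CUSTOMERS, and using a made-up file path and variable names:

```sas
data want;
    /* Load the customer list into a hash object once, on the first iteration */
    if _n_ = 1 then do;
        declare hash h(dataset: 'list_customers');
        h.definekey('customer');
        h.definedone();
    end;
    /* FIRSTOBS=2 skips a single header row; path and variables are illustrative */
    infile '/mypath/t20210101.csv' dsd truncover firstobs=2;
    input customer var1 var2;
    /* CHECK returns 0 when the key is found, so this keeps only listed customers */
    if h.check() = 0;
run;
```

The subsetting IF at the end writes an observation only when the customer is present in the hash.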
How can you know what's in a row until you read it? Unless you have some external logic or additional information you can't know. Therefore you kind of have to process the entire file and then parse out what you want. So you read the file but only write the records of interest.
@Ronein wrote:
Hello,
In the CSV files there are 25 variables and all are numeric.
I still ask: is it more efficient to import the CSV files, select only the rows that are relevant to my customers, and then stack the rows into one data set?
(Instead of importing millions of rows, most of which are not relevant)?
How many "relevant customers" are there?
If the number is "small" it might be as simple as adding something like:
if customer_id in ('id1' 'id2' 'id3') ;
to the data step reading the file.
You would read all of the records but only keep the desired ones.
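A minimal sketch of that "small list" approach, with a hypothetical file path, variable names, and (numeric) IDs:

```sas
data want;
    /* Path, layout, and IDs are made up for illustration; FIRSTOBS=2 skips the header */
    infile '/mypath/t20210101.csv' dsd truncover firstobs=2;
    input customer_id var1 var2;
    /* Subsetting IF: every record is read, but only these IDs are written out */
    if customer_id in (101, 102, 103);
run;
```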
If you have "many", then read the whole file and filter later, such as by merging with a data set that contains the relevant customer ID values.
Are the same customers "relevant" in all the source files?
Why does reading 2 million records "exceed the maximum number of rows"? Are you using one of the SAS editions that has file size limits because it is intended as a learning tool and not a production data system?
CSV files are text files, and text files have variable-length lines. So you have to read the whole file.
That does not mean you have to write back out all of the lines read.
Similarly, the location of the variables on a line is not fixed. So even if you only want some of the variables, you have to read them all (or at least up to the last one on the line that you want).
To filter observations based on a list of values, a HASH object is useful. (If the list of customers is too large to load into a hash, then use an indexed dataset instead.)
With your naming convention you can use a wildcard in the physical filename to read all of the CSV files in one step. If the CSV files have header rows then you will need to include logic to skip the header row.
Let's assume the key variable is named CUSTOMER in both your LIST_CUSTOMERS dataset and in the dataset you want to create from the CSV files.
Example:
data want ;
    * Load the customer list into a hash object ;
    if _n_ = 1 then do;
        declare hash h(dataset:'list_customers');
        rc = h.definekey('customer');
        rc = h.definedone();
    end;
    * Read all of the files for a particular year ;
    * Skip the header line when starting a new file ;
    length fname $256 ;
    infile '/mypath/t2021*.csv' dsd truncover filename=fname ;
    input @;
    if fname ne lag(fname) then delete;
    * Logic for reading the data from a line in the CSV file here ;
    input customer var1 var2 .... ;
    * Delete lines for customers that are not in the hash object ;
    if h.find() then delete;
run;
The * in the filename is a wildcard, so the example is for 2021. If you also have other files, like "t2021GEORGE.csv", that match that pattern, you will have a problem (not a hard problem, since you can use the FNAME variable to test the actual name of the file that was found).
You need the extra INPUT with the trailing @ to get the FNAME variable updated so you can test whether you are at the start of a new file or not. If the files do not contain header rows, then you do not need this. But see the discussion below about whether the DATE value appears in the data or not.
Yes. It makes no sense to make 365 separate SAS datasets. If the fields in the file do not contain the date, then pull the date from the FNAME variable. Note that the FNAME variable will not appear in the output dataset, so use a new variable for this.
date_string = scan(fname,-2,'.t');
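As a hypothetical continuation inside the same data step, the extracted digits (e.g. "20210101") could be converted into a real SAS date value:

```sas
/* Assumes date_string holds eight digits like "20210101" */
file_date = input(date_string, yymmdd8.);
format file_date date9.;
```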