BookmarkSubscribeRSS Feed
Ronein
Onyx | Level 15

Hello,

For each day there is CSV file called tYYYYMMDD.

Each CSV file contain around 2 millions of rows.

I want to import CSV file from last 1 year and put them in one data set.

The problem is that it exceeds the maxmimum number of rows. (Question1-What is max number of rows possible in data set?)

I have also a data set called List_customers with list of customers that purcahsed in the shop in last 1 year .

I need to select rows from CSV files only for the list of customers in data set List_customers.

Question2:

What is the way to import multiple CSV files only for the customers in List_customers  dataset and then put them in one data set?

Is it possible to perform import rows of CSV files only for these specific customers?

Or the only way is to import the fully CSV files and then perform inner join with  List_customers  dataset?

 

8 REPLIES 8
Kurt_Bremser
Super User

700 million observations is way below any SAS limit I know of.

What variables do you have in the CSV files, and what are their attributes (take them from the documentation you received along with the files)?

Ronein
Onyx | Level 15

Hello,

In CSV file there are 25 varaibles and all are numeric.

I still ask if it is more efficient to import CSV files and select only the rows that are relevent to my customers and then stack the rows to one data set?

(Instead of import millions of rows that most of them are not relevent)?

Kurt_Bremser
Super User

Run the import for all data in one data step, as described in your other thread.

Load the customers into a hash object, and use the CHECK method to filter the relevant observations.

Reeza
Super User

How can you know what's in a row until you read it? Unless you have some external logic or additional information you can't know. Therefore you kind of have to process the entire file and then parse out what you want. So you read the file but only write the records of interest. 

 

 

ballardw
Super User

@Ronein wrote:

Hello,

In CSV file there are 25 varaibles and all are numeric.

I still ask if it is more efficient to import CSV files and select only the rows that are relevent to my customers and then stack the rows to one data set?

(Instead of import millions of rows that most of them are not relevent)?


How many "relevent customers" are their?

If the number is "small" it might be as simple as adding something like:

 

if customer id in ('id1' 'id2' 'id3') ;

to the data step reading the file.

 

 

You would read all of the records but only keep the desired ones.

If you have "many" then read the whole and filter later such as by merging with a data set that contains the relevant customer id values.

 

Are the same customers "relevant" in all the source files?

 

Why do you reading 2 million records "exceeds the maximum number of rows"? Are you using one of the SAS editions that has file size limits because it is intended as a learning tool and not a production data system?

 

Tom
Super User Tom
Super User

CSV files are text files.  And text files with variable length lines.  So you have to read the whole file.

 

That does not mean you have to write back out all of the lines read.

 

Similarly the location of the variables on the line is not fixed.  So if you only want some of the variables on the line you have to read them all (or at least up to the last one on the line you want).

 

To filter observations based on a list of values a HASH object is useful.  (If the list of customers it too large to load into a hash then use an indexed dataset instead.)

 

With your naming convention you can use a wildcard in the physical filename to read all of the CSV files in one step.  If the CSV files have header rows then you will need to include logic to skip the header row.

 

Let's assume the key variable is named CUSTOMER in both your LIST_CUSTOMERS dataset and in the dataset you want to create from the CSV files.

Example:

data want ;
* Load the customer list into a hash object ;
  if _n_= 1 then do;
    declare hash h(dataset:'list_customers');
    rc=h.definekey('customer');
    rc=h.definedone();
  end;
* Read all of the files for a particular year ;
* Skip the header line when starting a new file ;
  length fname $256 ;
  infile '/mypath/t2021*.csv' dsd truncover filename=fname ;
  input @;
  if fname ne lag(fname) then delete;
  * Logic for reading the data from a line in the CSV file here ;
  input customer var1 var2 .... ;
* Delete lines for customers that are not in the hash object ;
  if h.find() then delete;
run;

 

Ronein
Onyx | Level 15
Thank you, some questions:
1- I need to import CSV files from last year ( so there will be 365 CSV files to import).
In the code you wrote, where can I see that you import CSV files from last 1 year?
Remember that each CSV file is called : tYYYYMMDD (I cannot see it in your program)
2-I see that you wrote two times input
input @;
input customer var1 var2 .... ;
Can you explain why?
3-Did you put all CSV files that you imported into one DATA set or into multiple data sets?
Tom
Super User Tom
Super User

The * in the filename is a wildcard.  So the example was for 2021.  If you also other files, like "t2021GEORGE.csv" that match that pattern you will have a problem (not a hard problem since you can use the FNAME variable to test the actual name of the file that was found.

 

You need the extra INPUT with the trailing @ to get the FNAME variable to be updated so you can test if it is the start of a new file or not.  If the files do not contain header rows then you do not need this.  But see below discussion about whether the DATE value appears in the data or not.

 

Yes. It makes no sense to make 365 separate SAS datasets.  If the fields in the file do not contain the date then pull the date from the FNAME variable.  Note the FNAME variable will not appear in the output dataset so use a new variable for this.

date_string = scan(fname,-2,'.t');

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 3223 views
  • 0 likes
  • 5 in conversation