
Using DI Studio to read multiple CSV files

New Contributor
Posts: 2

Please help: I'm somewhat skilled with DI Studio, but not necessarily with SAS programming. I need some good details on how to read multiple CSV files into a single SAS table using DI Studio.

 

The files will all have the same fields, but the filenames will vary. For example, the filenames could be "Detail_12345.csv", "Detail_45678.csv", and "Detail_98765.csv".

 

I've been reading that I should create user-written code to read in all the files at once. Another suggestion was to create an iterative job that passes each filename to a main job that reads that file.

 

What do you suggest I do - and what are the specific transformations I should utilize?

 

thank you!


Accepted Solutions
Solution
02-21-2017 12:37 PM
Respected Advisor
Posts: 4,173

Re: Using DI Studio to read multiple CSV files

[ Edited ]

To use DIS efficiently you need to reach at least an intermediate SAS coding level (Base SAS, SQL, and the macro language).

 

Without those coding skills, debugging (reading logs) will be difficult, and you'll also be challenged to implement processes that are not only logically correct but also performant.

 

As for your question:

If your .csv's all have the same structure and don't have a header row with column names, then you could use a wildcard: Detail_*.csv

Create a job with an external file metadata object for one of the .csv's, a File Reader, and a target SAS table, and test whether this works. Once tested, change the name of the external file in the external file metadata to the wildcard (asterisk) syntax.
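As an illustration, the code DIS generates behind such a File Reader boils down to a DATA step like the one below. The column names, informats, and path are my assumptions, not something from your post - substitute the real layout of your files:

```sas
/* Sketch only: the columns (id, trans_date, amount) and the path are */
/* assumed placeholders - replace them with your actual file layout.  */
data work.detail_all;
   infile '/data/in/Detail_*.csv' dlm=',' dsd truncover;
   input id :8. trans_date :yymmdd10. amount :best12.;
   format trans_date yymmdd10.;
run;
```

The wildcard in the INFILE path makes SAS read all matching files as one concatenated stream.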

 

Should your .csv's have a header row, then things become a bit more involved and you need to give us as much detail as you can (i.e. what the folder containing the .csv's looks like: read everything in it or just files following a naming pattern; how often do you need to run the process; is this about a fully production-worthy implementation; etc.).

 

One way to go would be to first create a listing of .csv names (user-written code), load these names into a temporary SAS Work table, and then use this work table as the control table for a Loop transformation. Inside the Loop transformation (the inner job) are the external file metadata (this time with a macro variable in double quotes as the file name, i.e. the parameter passed to the inner job), a File Reader, and an Append node which collects all the data read via the iterations of the inner job.
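The "listing of .csv names" step could be a user-written sketch like this. The directory and the shell command are assumptions (on Windows you would use dir /b instead of ls):

```sas
/* Build a control table of file paths for the Loop transformation.  */
/* Directory and pipe command are assumptions - adjust them to your  */
/* platform (e.g. "dir /b ..." on Windows).                          */
filename dirlist pipe "ls /data/in/Detail_*.csv";

data work.csv_list;
   infile dirlist truncover;
   input filepath $256.;
run;
```

Each row of work.csv_list then feeds one iteration of the loop, with the value passed to the inner job as its parameter.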

 

If you've got header rows then there is possibly also a "hack" option which would still allow you to use the simple wildcard approach, but it's something I would first need to test myself before I can propose it.
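For reference only, and untested for your case as said above: the usual shape of that hack uses the INFILE options FIRSTOBS= and EOV= so the header row of each file matched by the wildcard can be skipped. Columns and path are again assumed placeholders:

```sas
/* Untested sketch: FIRSTOBS=2 skips the header of the first file,   */
/* EOV= flags the first record of every following file so its header */
/* can be deleted. Columns and path are placeholders.                */
data work.detail_all;
   infile '/data/in/Detail_*.csv' dlm=',' dsd truncover
          firstobs=2 eov=new_file;
   input @;                 /* hold the record                  */
   if new_file then do;     /* header row of files 2..n         */
      new_file = 0;         /* EOV= is not reset automatically  */
      delete;
   end;
   input id :8. trans_date :yymmdd10. amount :best12.;
run;
```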

 

I suggest that as a first step you implement a DIS job which reads a single .csv into your target table and we go from there.

 

 



All Replies
Super User
Posts: 5,427

Re: Using DI Studio to read multiple CSV files

Short version of @Patrick's answer: your second option is definitely the prime choice.
An alternative, if you want to avoid the complexity of inner/outer job loops, is to use an aggregate file location. This would be simpler DIS-wise but makes it a bit harder to add log/audit functionality.
Either way, you may want to add functionality to avoid reading a file twice (look it up in an audit table, or physically move the file after reading it).
Data never sleeps
☑ This topic is solved.
