jinendergulati
Calcite | Level 5

Hi,

I'm trying to import multiple CSV files in SAS DI Studio 4.9. Each CSV file has a different layout, and each file ends with a trailer record. The trailer layout is common across all files and contains information such as the number of records in the file, the time of generation, etc. My preference is to create only one job for this purpose rather than a separate branch of transformations for each type of CSV file. I believe it is feasible to read a CSV with an unknown column layout using PROC IMPORT, but the trailer record is causing issues.

 

Is it feasible to write generic code that reads multiple CSV files with trailer records?

 

Thanks,

Jinender

8 REPLIES
Tom
Super User

It is possible to create a program that handles that, though I'm not sure how to implement such a program in SAS DI Studio.

 

Say your input filename is in the macro variable FILE and the name you want to give the resulting dataset is in the macro variable OUT. Then code like this will make a copy of the file without the trailer and then use PROC IMPORT to read from that copy. The copy step can be as simple as this if the trailer is just one line; if not, you will need to make the step smart enough to recognize the trailer.

/* temporary file to hold the trailer-free copy */
filename copy temp;

/* copy every line except the last one (the trailer) */
data _null_;
  infile "&file" end=eof;
  file copy ;
  input;
  if not eof then put _infile_;
run;

/* import the trailer-free copy */
proc import datafile=copy out=&out replace dbms=csv;
run;
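For example (the path and dataset name here are just placeholders), you would set the two macro variables before running the steps:

%let file=/data/in/sales_2020.csv;  /* placeholder source file path */
%let out=work.sales_2020;           /* placeholder output dataset */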

 

jinendergulati
Calcite | Level 5

Hi @Tom 

Thank you for the quick turnaround. The code works perfectly.

 

One additional request: can we modify the code a bit more to read the trailer record as well and write it to a separate dataset? That is, if a record is the last one in the CSV file, write it to a new dataset. The trailer information is also comma delimited. The column names for the trailer dataset can be system generated, e.g. A, B, C, D; it doesn't matter, as I know what each trailer column contains.

Tom
Super User

Sure.

/* temporary file to hold the trailer-free copy */
filename copy temp;

/* copy the data records to COPY and capture the trailer in its own dataset */
data trailer;
  infile "&file" dsd truncover end=eof;
  file copy ;
  input @;
  if not eof then put _infile_;
  else do;
     /* last line = trailer: read its comma-delimited fields */
     input (a b c d) (:$50.);
     output;
  end;
run;

proc import datafile=copy out=&out replace dbms=csv;
run;
Patrick
Opal | Level 21

You'll find a lot of opinions in the forums here on why a SAS data step is superior to Proc Import.

Proc Import derives variables and attributes from the data in the external file, and thus variable attributes like lengths can change every single time you process new data. That's already an issue for hand-coded SAS programs. It's certainly a big NO for implementing metadata-driven ETL processes using DIS.
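To make that concrete, here is a minimal sketch (file names and contents are made up) showing how the same layout can come back with different variable lengths depending on the data in each file:

filename f1 temp;
filename f2 temp;

/* two files with identical layout but different value lengths */
data _null_;
  file f1;
  put 'name,city';
  put 'Jo,Rome';
run;
data _null_;
  file f2;
  put 'name,city';
  put 'Josephine,Johannesburg';
run;

proc import datafile=f1 out=one replace dbms=csv; run;
proc import datafile=f2 out=two replace dbms=csv; run;

/* compare the variable attributes: the lengths differ between ONE and TWO */
proc contents data=one; run;
proc contents data=two; run;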

 

I fully understand that there are things where implementation using DIS makes things harder, but it's a one-time effort for something that should then run on a regular basis for a long time. So robustness and ease of maintenance are ultimately more important than a few hours saved during implementation.

 

If you don't create external file metadata then you're losing the source-to-target lineage and impact analysis capabilities supported by the DIS client (as those are SAS metadata based).

I'm also of the opinion that you should implement a separate DIS job per data extract. Not only will this allow you to schedule the extraction processes in parallel, it also makes it easier to investigate issues (like a single source extract file with a changed layout).

 

I understand the challenge with the trailer record. There are multiple ways you could address this.

One option: have a pre-process which copies the source files to another folder location, removes the trailer record, and writes it to another file. Then read both files in a single DIS job. The pre-process could also be implemented as a reusable custom transformation which you just add as the first step of each job, with a parameter for the source file name.
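A minimal sketch of such a pre-process as a reusable, parameterized step (the macro name and parameters are made up; it splits the source in a single pass):

%macro split_trailer(in=, data_out=, trailer_out=);
  /* copy every line of &in to &data_out, except the last line,
     which goes to &trailer_out - one pass over the source */
  data _null_;
    infile "&in" end=eof;
    input;
    if eof then file "&trailer_out";
    else file "&data_out";
    put _infile_;
  run;
%mend split_trailer;

/* hypothetical usage */
%split_trailer(in=/data/in/source.csv,
               data_out=/data/stage/source_data.csv,
               trailer_out=/data/stage/source_trailer.csv)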

Given that the trailer record always has the same layout, you could define external file metadata for it using a macro variable in the name, so you only need to define a single object for all the trailer record files.

Another option: External File Metadata definitions allow you to add code or even use fully custom code. That should also allow you to read the data the way you need while still using external file metadata for it.

 

And for implementation: just make one job with one external file fully working and unit tested. Once you've got that, you should be able to copy the job, replace the file and table objects, remap things, and then test.

jinendergulati
Calcite | Level 5

Hi @Patrick 

Thank you for your feedback. This is something I wanted to test for feasibility. And I completely agree that a data step is much more robust than Proc Import. Proc Import assigns the attribute lengths based on the first few observations in the source file, so there is a high risk of truncation if the length increases in subsequent records. There are also the other benefits that you highlighted.

 

Regarding your suggested options, I'm more inclined to go for Option 1. I have attached a document describing what I'm trying to achieve. I'm looking for SAS code that can separate the actual information from the trailer record into two separate files; then I could use DI transformations like the File Reader to read them separately. One important point: my source file can have 1 million+ records, so it would be good to split the records in a single step to avoid reading a large source file twice.

@Tom @Patrick : Can you help me write this code?

jinendergulati
Calcite | Level 5

Hi Again,

 

@Tom @Patrick , I can think of another solution: we feed the same CSV file to two separate File Readers in DI Studio. Are there any options in the File Reader that can start/stop reading records at certain record numbers? E.g. the first File Reader reads the file from the 2nd record to the (n-1)th record, and the second File Reader reads only the last record. I'm not sure though: does this mean the DI job reads the source file twice and doubles the total processing time?
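For reference, the INFILE statement does have FIRSTOBS= and OBS= options, so in plain code the two-reader idea could look like the sketch below (variable names are made up). Note that OBS= needs the record count, so this costs an extra counting pass on top of the two reads:

data _null_;                      /* pass 1: count the records */
  infile "&file" end=eof;
  input;
  if eof then call symputx('nrecs', _n_);
run;

data detail;                      /* pass 2: records 2 through n-1 */
  infile "&file" dsd truncover firstobs=2 obs=%eval(&nrecs-1);
  input (a b c d) (:$50.);
run;

data trailer;                     /* pass 3: the last record only */
  infile "&file" dsd truncover firstobs=&nrecs;
  input (t1 t2) (:$50.);
run;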

JackHamilton
Lapis Lazuli | Level 10

I don't know what's available in DI Studio vs. what's available in SAS Foundation, but in a data step you might be able to read each record twice - once to peek at it, then again for real - and take conditional action:

 

data _null_;
    infile cards dsd;
    length var1 $8 var2 4 var3 $8;
    input @ ;   /* peek at the record without releasing it */
    if _infile_ =: 'WHATEVER IDENTIFIES THE TRAILER RECORD' then 
        putlog "NOTE: Trailer record found";
    else 
        do;
        input var1 var2 var3;
        putlog 'NOTE: Values found in record ' _n_ ' are ' (_all_)(=);
        end;
    call execute('%put ;'); /* Is there a better way to write a blank line to the log?  */
cards4;
Joe, 12, Bill 
Steph, 33, Stefan
"Jane", 1, "Jill"   
WHATEVER IDENTIFIES THE TRAILER RECORD
Andre, 99, Joan
;;;;

There's only one physical read of each record.

Patrick
Opal | Level 21

I don't have DIS available right now, but I believe you could implement the below in DIS using External File Metadata and File Readers. The suggested approach reads the source data only once.

/* create sample source file */
data _null_;
  file '~/source.csv';
  infile datalines truncover;
  input;
  put _infile_;
  datalines;
1,2,3,4
5,6,7,8
trailer1, trailer2
;

/* 
  read sample source file 
  - write last source record to file trailer_&sysjobid..csv
*/
data information;
  infile '~/source.csv' dsd truncover end=eof;
  file "%sysfunc(pathname(work))/trailer_&sysjobid..csv";
  input @;
  if eof then
    do;
      call symputx('trailerFile',
                    "%sysfunc(pathname(work))/trailer_&sysjobid..csv",
                    'g'
                    );
      put _infile_;
      delete;
    end;
  input a b c d;
run;

/*
  read trailer file created in previous step
*/
data trailer;
  infile "&trailerFile" dsd truncover;
  input (var1 var2) (:$50.);
run;

proc print data=information;
run;
proc print data=trailer;
run;

External File Metadata allows you to define your own INFILE statement, and I believe nothing prohibits adding additional statements there. So what you could do is add the code below (possibly DIS will add the INFILE on its own; check the generated code once connected to a File Reader; also check in the generated code whether DIS generates a macro variable for the path/filename so you could use that instead of something hard-coded). The INPUT statement with the variable mapping should then still get generated based on the metadata definition and mapping.

  infile '~/source.csv' dsd truncover end=eof;
  file "%sysfunc(pathname(work))/trailer_&sysjobid..csv";
  input @;
  if eof then
    do;
      call symputx('trailerFile',
                    "%sysfunc(pathname(work))/trailer_&sysjobid..csv",
                    'g'
                    );
      put _infile_;
      delete;
    end;

The above code will not read the last record into the target SAS table but instead writes it to a file in SAS Work. The full path and name are stored in the macro variable &trailerFile.

So now you can implement a second External File definition that uses the macro variable &trailerFile as the name (in double quotes). Because the layout of the trailer is the same for all your source data, you only need to define this once and can then use it in all your jobs.

 

N.B.: If you can't be 100% sure that the very last line in your source is always the trailer (i.e. there is never an additional empty line after the trailer line), then instead of using end=eof implement a content check like: if substrn(_infile_,1,5) = "some string" then ...
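A sketch of that variation, with the end=eof test swapped for a content check ('TRL' is a made-up trailer marker here):

data information;
  infile '~/source.csv' dsd truncover;
  file "%sysfunc(pathname(work))/trailer_&sysjobid..csv";
  input @;
  /* 'TRL' is a hypothetical marker - use whatever identifies your trailer */
  if substrn(_infile_, 1, 3) = 'TRL' then
    do;
      put _infile_;
      delete;
    end;
  input a b c d;
run;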

 

