BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LAtwood
Calcite | Level 5

Hello,

I have text files where the filename changes according to the date and its record number.  For example, "unit1-20141021-0001.txt" and "unit1-2014-1020-0003.txt"

I want to create a SAS dataset with all of these files, however I'm not sure the easiest way to pick up all the files without typing out each filename as the "datafile" in the proc import statement.

This is what I have:

proc import out=WORK.MYDATA

datafile= "C:\unit_info\unit1-20141021-0000.txt"

DBMS = CSV Replace;

Getnames= No;

Datarow=20;

run;

Any help with this would be appreciated.

Thanks

Lori

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

I don't have a problem with firstobs=2 and I use this to read 9 or 10 files frequently.

One trouble shooting approach.

Since we are using a wildcard in the file name you may not read in an expected order and might be getting a file you don't expect. Try explicitly stating two files in the filename statement

FILENAME NORM1 ("C:\unit_info\unit1-2014-1020-0003.txt" "C:\unit_info\unit1-2014-1020-0004.txt");

for example and see if you get the same behavior. If not then possibly one of the files that matched the wildcard listing has a different file structure.

Also, verify that none of the input files have multiple blank lines at the end.

Or

completely untested:

change

if skip then skip=0;

else do;

to

if skip then do;

   skip=0;

   input #19; /* this will require adding N=19 to the infile statement*/

end;

else do;

View solution in original post

12 REPLIES 12
ballardw
Super User

Do you want to read these multiple CSV files into a single dataset?

Do they all have the same structure?

Do the files all reside in the same folder?

Do all of  the files have content you want to read starting at row 20?

Will you want to know which source file the records come from?

LAtwood
Calcite | Level 5

Yes I want to read them into a single dataset.

Yes they all have the same structure.

Yes they all reside in the same folder.

Yes they all need to start reading at row 20.

No I do not need to know the source file.

ballardw
Super User

First step:

Use Proc Import to read one file.

Use either F4 or copy the generated code from the log into the editor.

Since you are not getting names from the file you are going to have variables named Var1 to VarN, where N is the number of columns as delimited by commas.

You will want to look at the generated informat / format statements and see if the variable types match your expectations. Many values that consist of only digits will likely be created as numeric but you may want them to be character to preserve such things as leading zeroes. Also ID, account, telephone and other "numbers" you are not going to do arithmetic with should be examined and pick whether they should be numeric or character. Note that more than 12 digits are likely not to be kept accurately due to computer precision.

Also consider the width of character variables which may contain names or descriptions. Your example file may not have the longest likely value. Consider making them wider.

Since VAR1 is not very useful consider changing the name to something meaningful. If you know what these fields should be then add labels to provide more text to describe them.

Save the program!

After that is working you have a couple of ways to reference multiple files either with a FILENAME reference such as:

FILENAME infiles "C:\unit_info\unit1*.txt"; and using the reference in place of the file name in the INFILE statement or

use the name with the wildcard directly. I prefer the FILENAME approach as there are some things that can be done with filerefs but it is your choice.

On the INFILE statement add the EOV option with a name that makes sense to you. I use EOV=SKIP.

Before and after the generated input statement add some code so that you get something similar to:

   input @;                                                                                                                            

   if skip then skip=0;                                                                                                                

   else do;                                                                                                                           

      input                                                                                                                            

         Client            $                                                                                                           

         SITE                                                                                                                          

         CLINICIAN         $                                                                                                           

         SEX               $  

     ;

     output;

     end;/* this ends the do statement for the ELSE before INPUT. NOTE: If you don't have the OUTPUT there won't be anything in the dataset*/

run;

good luck

LAtwood
Calcite | Level 5

This has been tremendously helpful!!  However, after the first text file is imported into the dataset, it is inserting 19 null rows and is not starting at observation 20.  Below is the code:

FILENAME NORM1 "C:\unit_info\unit1*.txt";

data WORK.test

     %let _EFIERR_ = 0;

     infile NORM1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=20 EOV=SKIP;

          informat yok_dttm ANYDTDTM23.;

          informat fpm best32.;

          format yok_dttm DATETIME16.;

          format fpm best12.;

input @;

if skip then skip=0;

else do;

     input

          yok_dttm

          fpm

;

output;

end;

run;

ballardw
Super User

I don't have a problem with firstobs=2 and I use this to read 9 or 10 files frequently.

One trouble shooting approach.

Since we are using a wildcard in the file name you may not read in an expected order and might be getting a file you don't expect. Try explicitly stating two files in the filename statement

FILENAME NORM1 ("C:\unit_info\unit1-2014-1020-0003.txt" "C:\unit_info\unit1-2014-1020-0004.txt");

for example and see if you get the same behavior. If not then possibly one of the files that matched the wildcard listing has a different file structure.

Also, verify that none of the input files have multiple blank lines at the end.

Or

completely untested:

change

if skip then skip=0;

else do;

to

if skip then do;

   skip=0;

   input #19; /* this will require adding N=19 to the infile statement*/

end;

else do;

LAtwood
Calcite | Level 5

Thank you everyone for your help and input,it is much appreciated!

RW9
Diamond | Level 26 RW9
Diamond | Level 26

One other note, if they are all text files in one directory and you want them as one, you could just write a small batch file to append them all before importing:

copy c:\temp\*.txt c:\temp\total.txt

Then you would only need to know the name of the complete file to import.  Of course if you have column headers in each file then you would need to put an if statement around that to drop first records.  And this assumes all the files are of the same structure.

AnandSahu
Calcite | Level 5

what if I get an error "The submitted line exceeds maximum line length - 32768 bytes.".

how can I remove this error to eliminate this error in the process of reading a data.

jakarman
Barite | Level 11

The max for lrecl is 32767 SAS(R) 9.4 Statements: Reference, Third Edition. The infamous 2byte-1bit value  (64k bit machines). It is also the default with 9.4.

Mixing up cr/lf (windows) lf (unix) with filetransfers (binary/ascii) can be a cause for being trapped into this.

Normally you will not get into records as one string of this size, it is not human readable anymore.

Are you processing html/xml than those records are meaningless, just for humans. Reading those kind of files in streaming mode (recfm=n) and splitting shifting trunks of 32k with data you can always solve that 

---->-- ja karman --<-----
ballardw
Super User

This looks like you may have a code syntax error. it would be better to create a new thread and post example code.

jakarman
Barite | Level 11

Do not use proc import but the sas datastep. More reliable and more options.

I believe you can copy from the generated import te is run.

With the datastep use wildcarding that is * in the filename and retrieve the real filename being read as an option in the in file statement.

As long as the data structure is recognizable you can read an incredible amount of files.

---->-- ja karman --<-----
jakarman
Barite | Level 11

use the filename option of the input statement is you want to verify which one is read/processed. In the sas log these message are also coming. SAS(R) 9.4 Statements: Reference, Third Edition

I forgot not all options are working on each single file but some are for the whole of all (firstobs).   Did this before and input is not your input processing.

With a header and tailer section that can be solved.

   In the header section those  19 (20 skips) are to be done (using \\ ) . These are working for each file.

   In the tailer section something can be done for each file or just archiving a collection log process dataset

   the allfil section is a section when all files are processed

 

The source was:

filename impcsv '/folders/myfolders' ;


filename impcsv ' /folders/myshortcuts/ikke';

data orderlst (keep= x y z q order part eval) ;
  length filename filenmcr $250 ;    retain _nfl -1 eovfl  ;
  length z $3 q $1 order part eval $16   ; retain order part eval ; /* the targeted lays out */
  infile impcsv(12345*.csv) filename=filenmcr eov=eovfl eof=tailer end=endfl missover dsd dlm=";"  ;
  input @ ;
   if ( eovfl or _nfl <0 ) then link header;
   _nfl=_nfl+1;
   input x y z q ;            /* <<<<<<<< input for each csv file as record */ 
   output ;
   if ( endfl ) then link allfil ;
  
return;

header:
    filename=filenmcr;
    put "-- " filename " --"; _nfl=1 ; eovfl=0;
    input  ; /* advance one line, ignoring the header add as many as you need  */
    part=scan(filename,-6,"_");     /* some convention coming as part of a filename */
    order=scan(filename,-4,"_"); 
    eval=scan(filename,-3,"_"); 
return;

tailer: 
    put "total records" part _nfl / ;
return; 
allfil: 
    put "processing done" / ;
return; 
run;

proc print;
run;

---->-- ja karman --<-----

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 4124 views
  • 3 likes
  • 5 in conversation