Solved: Re: How to import data when filename changes?

LAtwood · Posted 10-21-2014 01:16 PM

Hello,

I have text files where the filename changes according to the date and its record number. For example, "unit1-20141021-0001.txt" and "unit1-2014-1020-0003.txt"

I want to create a SAS dataset with all of these files, however I'm not sure the easiest way to pick up all the files without typing out each filename as the "datafile" in the proc import statement.

This is what I have:

proc import out=WORK.MYDATA

datafile= "C:\unit_info\unit1-20141021-0000.txt"

DBMS = CSV Replace;

Getnames= No;

Datarow=20;

run;

Any help with this would be appreciated.

Thanks

Lori

ballardw · Posted 10-22-2014 04:48 PM

I don't have a problem with firstobs=2 and I use this to read 9 or 10 files frequently.

One trouble shooting approach.

Since we are using a wildcard in the file name you may not read in an expected order and might be getting a file you don't expect. Try explicitly stating two files in the filename statement

FILENAME NORM1 ("C:\unit_info\unit1-2014-1020-0003.txt" "C:\unit_info\unit1-2014-1020-0004.txt");

for example and see if you get the same behavior. If not then possibly one of the files that matched the wildcard listing has a different file structure.

Also, verify that none of the input files have multiple blank lines at the end.

Or

completely untested:

change

if skip then skip=0;

else do;

to

if skip then do;

skip=0;

input #19; /* this will require adding N=19 to the infile statement*/

end;

else do;

View solution in original post

ballardw · Posted 10-21-2014 01:23 PM

Do you want to read these multiple CSV files into a single dataset?

Do they all have the same structure?

Do the files all reside in the same folder?

Do all of the files have content you want to read starting at row 20?

Will you want to know which source file the records come from?

LAtwood · Posted 10-21-2014 01:26 PM

Yes I want to read them into a single dataset.

Yes they all have the same structure.

Yes they all reside in the same folder.

Yes they all need to start reading at row 20.

No I do not need to know the source file.

ballardw · Posted 10-21-2014 02:08 PM

First step:

Use Proc Import to read one file.

Use either F4 or copy the generated code from the log into the editor.

Since you are not getting names from the file you are going to have variables named Var1 to VarN, where N is the number of columns as delimited by commas.

You will want to look at the generated informat / format statements and see if the variable types match your expectations. Many values that consist of only digits will likely be created as numeric but you may want them to be character to preserve such things as leading zeroes. Also ID, account, telephone and other "numbers" you are not going to do arithmetic with should be examined and pick whether they should be numeric or character. Note that more than 12 digits are likely not to be kept accurately due to computer precision.

Also consider the width of character variables which may contain names or descriptions. Your example file may not have the longest likely value. Consider making them wider.

Since VAR1 is not very useful consider changing the name to something meaningful. If you know what these fields should be then add labels to provide more text to describe them.

Save the program!

After that is working you have a couple of ways to reference multiple files either with a FILENAME reference such as:

FILENAME infiles "C:\unit_info\unit1*.txt"; and using the reference in place of the file name in the INFILE statement or

use the name with the wildcard directly. I prefer the FILENAME approach as there are some things that can be done with filerefs but it is your choice.

On the INFILE statement add the EOV option with a name that makes sense to you. I use EOV=SKIP.

Before and after the generated input statement add some code so that you get something similar to:

input @;

if skip then skip=0;

else do;

input

Client $

SITE

CLINICIAN $

SEX $

;

output;

end;/* this ends the do statement for the ELSE before INPUT. NOTE: If you don't have the OUTPUT there won't be anything in the dataset*/

run;

good luck

LAtwood · Posted 10-22-2014 03:17 PM

This has been tremendously helpful!! However, after the first text file is imported into the dataset, it is inserting 19 null rows and is not starting at observation 20. Below is the code:

FILENAME NORM1 "C:\unit_info\unit1*.txt";

data WORK.test

%let _EFIERR_ = 0;

infile NORM1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=20 EOV=SKIP;

informat yok_dttm ANYDTDTM23.;

informat fpm best32.;

format yok_dttm DATETIME16.;

format fpm best12.;

input @;

if skip then skip=0;

else do;

input

yok_dttm

fpm

;

output;

end;

run;

ballardw · Posted 10-22-2014 04:48 PM

I don't have a problem with firstobs=2 and I use this to read 9 or 10 files frequently.

One trouble shooting approach.

Since we are using a wildcard in the file name you may not read in an expected order and might be getting a file you don't expect. Try explicitly stating two files in the filename statement

FILENAME NORM1 ("C:\unit_info\unit1-2014-1020-0003.txt" "C:\unit_info\unit1-2014-1020-0004.txt");

for example and see if you get the same behavior. If not then possibly one of the files that matched the wildcard listing has a different file structure.

Also, verify that none of the input files have multiple blank lines at the end.

Or

completely untested:

change

if skip then skip=0;

else do;

to

if skip then do;

skip=0;

input #19; /* this will require adding N=19 to the infile statement*/

end;

else do;

LAtwood · Posted 10-23-2014 11:33 AM

Thank you everyone for your help and input,it is much appreciated!

RW9 · Posted 10-23-2014 04:37 AM

One other note, if they are all text files in one directory and you want them as one, you could just write a small batch file to append them all before importing:

copy c:\temp\*.txt c:\temp\total.txt

Then you would only need to know the name of the complete file to import. Of course if you have column headers in each file then you would need to put an if statement around that to drop first records. And this assumes all the files are of the same structure.

AnandSahu · Posted 10-23-2014 12:06 PM

what if I get an error "The submitted line exceeds maximum line length - 32768 bytes.".

how can I remove this error to eliminate this error in the process of reading a data.

jakarman · Posted 10-23-2014 12:50 PM

The max for lrecl is 32767 SAS(R) 9.4 Statements: Reference, Third Edition. The infamous 2byte-1bit value (64k bit machines). It is also the default with 9.4.

Mixing up cr/lf (windows) lf (unix) with filetransfers (binary/ascii) can be a cause for being trapped into this.

Normally you will not get into records as one string of this size, it is not human readable anymore.

Are you processing html/xml than those records are meaningless, just for humans. Reading those kind of files in streaming mode (recfm=n) and splitting shifting trunks of 32k with data you can always solve that

---->-- ja karman --<-----

ballardw · Posted 10-23-2014 12:51 PM

This looks like you may have a code syntax error. it would be better to create a new thread and post example code.

jakarman · Posted 10-21-2014 01:53 PM

Do not use proc import but the sas datastep. More reliable and more options.

I believe you can copy from the generated import te is run.

With the datastep use wildcarding that is * in the filename and retrieve the real filename being read as an option in the in file statement.

As long as the data structure is recognizable you can read an incredible amount of files.

---->-- ja karman --<-----

jakarman · Posted 10-23-2014 01:47 AM

use the filename option of the input statement is you want to verify which one is read/processed. In the sas log these message are also coming. SAS(R) 9.4 Statements: Reference, Third Edition

I forgot not all options are working on each single file but some are for the whole of all (firstobs). Did this before and input is not your input processing.

With a header and tailer section that can be solved.

In the header section those 19 (20 skips) are to be done (using \\ ) . These are working for each file.

In the tailer section something can be done for each file or just archiving a collection log process dataset

the allfil section is a section when all files are processed

The source was:

filename impcsv '/folders/myfolders' ;

filename impcsv ' /folders/myshortcuts/ikke';

data orderlst (keep= x y z q order part eval) ;
length filename filenmcr $250 ;    retain _nfl -1 eovfl ;
length z $3 q $1 order part eval $16   ; retain order part eval ; /* the targeted lays out */
infile impcsv(12345*.csv) filename=filenmcr eov=eovfl eof=tailer end=endfl missover dsd dlm=";" ;
input @ ;
   if ( eovfl or _nfl <0 ) then link header;
   _nfl=_nfl+1;
   input x y z q ;            /* <<<<<<<< input for each csv file as record */
   output ;
   if ( endfl ) then link allfil ;

return;

header:
    filename=filenmcr;
    put "-- " filename " --"; _nfl=1 ; eovfl=0;
    input ; /* advance one line, ignoring the header add as many as you need */
    part=scan(filename,-6,"_");     /* some convention coming as part of a filename */
    order=scan(filename,-4,"_");
    eval=scan(filename,-3,"_");
return;

tailer:
put "total records" part _nfl / ;
return;
allfil:
put "processing done" / ;
return;
run;

proc print;
run;

---->-- ja karman --<-----

SAS Innovate 2025: Call for Content

Classroom Training Available!