BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fengyuwuzu
Pyrite | Level 9

I have 364 files, all have the same data structure, and the file names are like

commontext-2015-01-01.csv 

.......

commontext-2015-12-30.csv

 

I want to import and merge them into a single file. I guess I need to import all of them first, and then I can merge them, right?

can anyone give me some hints or direct me to an example ?

 

Thank you very much!

 

I found some code by searching:

data import_all;
 
*make sure variables to store file name are long enough;
length filename txt_file_name $256;
 
*keep file name from record to record;
retain txt_file_name;
 
*Use wildcard in input;
infile "Path\*.txt" eov=eov filename=filename truncover;
 
*Input first record and hold line;
input@;
 
*Check if this is the first record or the first record in a new file;
*If it is, replace the filename with the new file name and move to next line;
if _n_ eq 1 or eov then do;
txt_file_name = scan(filename, -1, "\");
eov=0;
end;
 
*Otherwise  go to the import step and read the files;
else input
 
*Place input code here;
 
;
run;

I will start from here.

1 ACCEPTED SOLUTION

Accepted Solutions
16 REPLIES 16
ballardw
Super User

You are basically there except for your informat, format (if needed) and input statements.

I would suggest a minor from:

else input
 
*Place input code here;
 
;
run;

to

 

else  do;
   input
 
*Place input code here;
 
   ;
   /* other code, labels, and such*/
end;
run;

One thing to flesh this out would be to run proc import for ONE of the files. The log will have a complete data step and you can copy the INFORMAT, FORMAT and Input statements though you may want to change some of the generated variable names. Or modify if something comes in as numeric that should be text such as zipcodes or ID "numbers". I generally round character lengths up, if the Import generates something with a length of 20 then other values in other files may be different so I guess 25. If I have documentation that provides additional information then use that to set lengths and such.

 

I strongly recommend adding LABELS early in the process.

 

fengyuwuzu
Pyrite | Level 9
Thank you. Actually I have a number of questions about the code:

I am confused by the "filename=filename" in:
infile "Path\*.txt" eov=eov filename=filename truncover;

also, "txt_file_name = scan(filename, -1, "\");" is to get the next file? where do I give the first file name?
data_null__
Jade | Level 19

@fengyuwuzu wrote:
Thank you. Actually I have a number of questions about the code:

I am confused by the "filename=filename" in:
infile "Path\*.txt" eov=eov filename=filename truncover;

also, "txt_file_name = scan(filename, -1, "\");" is to get the next file? where do I give the first file name?

FILENAME is the INFILE statement option name, it specifies a variable name.  In this case they are the same word.  Same for EOV it specifies a variable name.   Actually the assignment TXT_FILE_NAME is getting the current FILENAME.  EOV is set to 1 when a new file in the concatenation is opened.  It is not set to 1 for the first record of the first file hence the need for _N_ = 1 or EOV.

Reeza
Super User

The infile method here uses a wildcard method, so it assumes that all .txt files in the folder will be read. If you have extra txt files in the same folder that won't work. If you're using CSV you may need to change that to *.csv instead.

 

As mentioned, get it working for one file first to get your format/informat/labels and then move towards automation. 

 

You've stated "MERGE" but the process above actually appends files. So do you want the files to end up side by side by some common merged variable or is stacking the files what you actually want? It's a very important difference in the process.

fengyuwuzu
Pyrite | Level 9
They have the same data structure. I want to append (more data rows, keep the same columns).
ballardw
Super User

@fengyuwuzu wrote:

also, "txt_file_name = scan(filename, -1, "\");" is to get the next file? where do I give the first file name?

SCAN searches for "words", the number would be the number of a specific word in a string but since it is negative it counts backwards from the end, the "\" tells SCAN what to consider a boundary between words. In the code you borrowed this from the person wanted a value that was at the end of his files assigned to Txt_file_name.

 

With your files named like: commontext-2015-01-01.csv 

You could generate a text variable with the string for the date as:

Txt_file_name= substr(filename,12,10) ; /* replace 12 with the position of first digit after the "commontext"*/

Or get an actual SAS date value by

FileDate = input(substr(filename,12,10), yymmdd10.);

format FileDate yymmdd10.;

fengyuwuzu
Pyrite | Level 9

I used the code like

data import_all;
 
*make sure variables to store file name are long enough;
length filename txt_file_name $256;
 
*keep file name from record to record;
retain txt_file_name;
 
*Use wildcard in input;
infile "D:\data\Transaction2014*.csv" eov=eov filename=filename truncover delimiter = ',' MISSOVER DSD lrecl=32767;
       informat VAR1 yymmdd10. ;
       informat VAR2 $40. ;
       informat VAR3 best32. ;
       informat VAR4 best32. ;
       informat VAR5 best32. ;
       informat VAR6 best32. ;
       informat VAR7 $19. ;
       informat VAR8 $19. ;
       informat VAR9 $19. ;
       informat VAR10 $8. ;
       informat VAR11 $80. ;
       informat VAR12 $19. ;
       informat VAR13 $9. ;
       informat VAR14 $13. ;
       informat VAR15 $19. ;
       informat VAR16 $7. ;
       informat VAR17 $19. ;
       informat VAR18 $7. ;
       informat VAR19 $19. ;
       informat VAR20 $8. ;
       informat VAR21 $19. ;
       informat VAR22 $8. ;
       informat VAR23 $19. ;
       informat VAR24 $10. ;
       informat VAR25 $13. ;
       informat VAR26 $4. ;
       informat VAR27 $19. ;
       informat VAR28 anydtdtm40. ;
       format VAR1 yymmdd10. ;
       format VAR2 $40. ;
       format VAR3 best12. ;
       format VAR4 best12. ;
       format VAR5 best12. ;
       format VAR6 best12. ;
       format VAR7 $19. ;
       format VAR8 $19. ;
       format VAR9 $19. ;
       format VAR10 $8. ;
       format VAR11 $80. ;
       format VAR12 $19. ;
       format VAR13 $9. ;
       format VAR14 $13. ;
       format VAR15 $19. ;
       format VAR16 $7. ;
       format VAR17 $19. ;
       format VAR18 $7. ;
       format VAR19 $19. ;
       format VAR20 $8. ;
       format VAR21 $19. ;
       format VAR22 $8. ;
       format VAR23 $19. ;
       format VAR24 $10. ;
       format VAR25 $13. ;
       format VAR26 $4. ;
       format VAR27 $19. ;
       format VAR28 datetime. ;
 
*Input first record and hold line;
input@;
 
*Check if this is the first record or the first record in a new file;
*If it is, replace the filename with the new file name and move to next line;
if _n_ eq 1 or eov then do;
txt_file_name = scan(filename, -1, "\");
eov=0;
end;
 
*Otherwise  go to the import step and read the files;
else do;
input
 
/*Place input code here;*/
input
                VAR1
                VAR2 $
                VAR3
                VAR4
                VAR5
                VAR6
                VAR7 $
                VAR8 $
                VAR9 $
                VAR10 $
                VAR11 $
                VAR12 $
                VAR13 $
                VAR14 $
                VAR15 $
                VAR16 $
                VAR17 $
                VAR18 $
                VAR19 $
                VAR20 $
                VAR21 $
                VAR22 $
                VAR23 $
                VAR24 $
                VAR25 $
                VAR26 $
                VAR27 $
                VAR28 
;
end;
run;

However, in the import_all file, the first column is the file names with column name "txt_file_name", and this messed up other columns because format changed for other columns. How did this happen?

Tom
Super User Tom
Super User

You have the INPUT keyword twice.  That will cause it create variable named INPUT and shift everything over by one column meaning that you will be trying to read text in numeric variables.

fengyuwuzu
Pyrite | Level 9

Thank you. I removed the extra input and still got the first column as file names. But this time there is no column shift problem.

 

However, the first row of the first file are all missing except first column, file name. Other rows were read in correctly

data_null__
Jade | Level 19

I think you need to add DELETE in this DO block.  If I understand correctly you want to collect the file name when the file changes but skip reading the record.

if _n_ eq 1 or eov then do;
   txt_file_name = scan(filename, -1, "\");
   eov=0;
   delete;
   end;

 

fengyuwuzu
Pyrite | Level 9

without the "delete;", the first row of every file is blank -- all are missing values.

With "delete;" in the do loop, the first row of every file is skipped, so in the final merged file, the number rows is 364 rows less compared to the merged file without "delete;".

 

data import_all;
 
*make sure variables to store file name are long enough;
length filename txt_file_name $256;
 
*keep file name from record to record;
retain txt_file_name;
 
*Use wildcard in input;
infile "D:\data\Transaction2014*.csv" eov=eov filename=filename truncover delimiter = ',' MISSOVER DSD lrecl=32767;
       informat VAR1 yymmdd10. ;
       informat VAR2 $40. ;

      .... /* to save space */

       informat VAR26 $4. ;
       informat VAR27 $19. ;
       informat VAR28 anydtdtm40. ;
       format VAR1 yymmdd10. ;
       format VAR2 $40. ;
   
   .... /* to save space */

       format VAR26 $4. ;
       format VAR27 $19. ;
       format VAR28 datetime. ;
 
*Input first record and hold line;
input@;
 
*Check if this is the first record or the first record in a new file;
*If it is, replace the filename with the new file name and move to next line;
if _n_ eq 1 or eov then do;
txt_file_name = scan(filename, -1, "\");
eov=0;
delete; /* with or without it, the first row will be either missing values or skipped */ end; *Otherwise go to the import step and read the files; else do; input /*Place input code here;*/ VAR1 VAR2 $ .... /* to save space */ VAR26 $ VAR27 $ VAR28 ; end; run;

 

 

I found out the problem: the author says somewhere else, that

if _n_ eq 1 or eov then do;
txt_file_name = scan(filename, -1, "\");
eov=0;
end;

 assumes that each file has column headers ans uses the EOV option to account for it.

 

In my case, each file has no column headers. But, how should I change the code accordingly?

data_null__
Jade | Level 19

Did you want to indicate if using DELETE produced the desired result.  

 

If the first row of your CSV files are field names I think you want to skip that row.

 


@fengyuwuzu wrote:

without the "delete;", the first row of every file is blank -- all are missing values.

With "delete;" in the do loop, the first row of every file is skipped, so in the final merged file, the number rows is 364 rows less compared to the merged file without "delete;".

 

fengyuwuzu
Pyrite | Level 9

all of my data files has no headers. the first row is data directly.

 

when I include delete in the if condition block, the first row of each file is deleted in the merged file.

 

 

fengyuwuzu
Pyrite | Level 9

Dear all,

 

now I solve the problem. because there is no header in the csv files I changed

 

if _n_ eq 1 or eov then do;

 

to 

 

if _n_ eq 0 or eov then do;

 and the first row of each file can be correctly read in. 

 

Thank you for all your help. Every time I ask question here I learn extra stuff.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1375 views
  • 0 likes
  • 5 in conversation