Lin_Clare
Calcite | Level 5

Hi,

I need to import ~80 txt files (delimiter=tab) and concatenate them into one file, but I don't know how to do it because of the raw data structure:

1) The first 3 rows are useless header information; the column names are in row 4 and the data start in row 5.

2) Each txt file contains ~126 columns, both character and numeric, with different informats.

3) Some files have 126 columns, some have 128 columns (the common 126 + 2 extra).

4) The column order differs between files; the column names are the same, but some are uppercase while others are lowercase.

Does anyone know how to efficiently import every file with the correct column names and formats, concatenate them by column name (keeping the 2 extra columns), and get everything into one file?

Thanks


18 REPLIES
MaikH_Schutze
Quartz | Level 8

Hi,

You can definitely design a macro to deal with the various file layouts, but you will still need to use the basic INFILE/INPUT statement syntax.

For instance, you will want to use the FIRSTOBS= option on the INFILE statement to address the starting-row issue you mentioned.

data want;
    infile file-specification firstobs=5;
    input var1 $ 1-5 var2 6-10 var3 $ 11-15;
run;

The example above uses column input, which is quite flexible and allows you to "input" variables in whichever order you like.
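Since your files are tab-delimited, list input with the DLM= option is the more natural variant of the same idea; a minimal sketch (the path and variables are placeholders standing in for your ~126 columns):

data want;
    infile "D:\Temp\EP\onefile.txt" dlm='09'x firstobs=5 truncover lrecl=32767;
    /* list input reads the delimited fields in the order they appear */
    input Lot :$8. Event :$1. Time :best32.;
run;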

Lin_Clare
Calcite | Level 5

The problem here is that I have ~80 files, each with ~126 columns in a different order; it's not realistic to use INPUT and specify each column's information (type, name, length) one by one.

Actually, this is only the data for one operation step; I need to process ~200 operations like that, so I need to find a more systematic way.

RW9
Diamond | Level 26

Is there any metadata associated with the files? If not, how will you know that some columns are the same as other columns? E.g., if file 1 has 120 columns and file 2 has 80 columns, does column 4 mean the same as column 4 from the previous file? If you have some metadata, i.e. information about the structure of each file, then you can use that to create your import programs.
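If such metadata existed as a SAS data set, it could drive the code generation; a rough sketch, assuming a hypothetical table META with character columns VARNAME and INFORMAT and a fileref INDATA:

data _null_;
    set meta end=last;
    /* open the generated DATA step on the first metadata record */
    if _n_ = 1 then call execute('data want; infile indata firstobs=5 dlm="09"x truncover; input');
    /* add one "name : informat" pair per metadata record */
    call execute(catx(' ', varname, ':', informat));
    if last then call execute('; run;');
run;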

Lin_Clare
Calcite | Level 5

First I tried removing the 3 header rows of one txt file and using PROC IMPORT with DATAROW=2 to get it in as a first table with the correct column names and formats, then reading all the other files in with INPUT and FIRSTOBS=5 and appending them to the first table with a PROC SQL union. I got an error indicating a data type mismatch for some columns.

Then I went back to check the data structure by manually removing the 3 header rows of another txt file and importing it as the first table above. From the log I got the column names and formats of the variables in these 2 files; I compared their column names, order, and formats in Excel, and found that the only differences are the column order and the case of the column names.
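A quicker way to pinpoint the mismatching columns is to compare the PROC CONTENTS output of the two tables; a sketch, assuming the two imported tables are called FIRST and SECOND:

proc contents data=first  out=c1(keep=name type length) noprint; run;
proc contents data=second out=c2(keep=name type length) noprint; run;

proc sql;
    /* list columns whose types disagree (or that exist in only one table) */
    select coalesce(a.name, b.name) as name, a.type as type1, b.type as type2
    from c1 a full join c2 b on upcase(a.name) = upcase(b.name)
    where a.type ne b.type;
quit;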

data_null__
Jade | Level 19

Download the macro in this thread https://communities.sas.com/message/210551#210551

You will need to define an empty data set as described in the header of the macro as the model for what fields are to be read from the 80 or so files.  Like named fields, ignoring case, will be read using the same INFORMAT data type etc.  The order of the columns does not matter.

This is pretty much exactly what you need.
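A minimal model data set, for illustration only (these variable names and attributes are placeholders; the real list comes from your file layout):

data model;
    attrib Lot   length=$8
           Event length=$1
           Date  length=8 informat=anydtdtm40. format=datetime.
           Time  length=8;
    stop;                    /* keep zero observations                 */
    call missing(of _all_);  /* reference the variables so none are dropped */
run;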

Lin_Clare
Calcite | Level 5

Thanks. I will try it tomorrow when I'm back at work; I cannot log in to my desktop right now.

Lin_Clare
Calcite | Level 5

Hi data_null_,

Thanks for your code.

But I am kind of stuck preparing the input files. The macro needs all the input files to have the variable names in the first row, and I cannot get that because my files have some header rows above the variable-name row.

Tom
Super User

Just set the macro parameter NAMEROW to 4 when you call the macro so that it will skip the first 3 lines and read the headers from line 4.
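For example, with the parameter names used elsewhere in this thread (assuming MODEL is your empty template data set and INDATA the fileref), the call would look something like:

%csvReader(data=model, out=final, fileref=indata, namerow=4);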

Lin_Clare
Calcite | Level 5

Hi Tom,

Thank you.

Below is my code. I got the column and format list of the variables from the PROC IMPORT log; I only defined the first 100 columns, as the last 26~28 are useless to me.

--------------------------------------------------------------------------------------------------------------

filename indata "D:\Temp\EP\*.txt" lrecl=32767;

data first;
   attrib Lot     length=$8;
   attrib Event   length=$1;
   attrib Product length=$10;
   attrib Recipe  length=$10;
   attrib Route   length=$10;
   attrib Entity  length=$6;
   attrib LotList length=$8;
   attrib Date    length=8 informat=ANYDTDTM40. format=DATETIME.;
   attrib Time B_oHe_Leak_Rate A_oHe_Leak_Rate B_iHe_Leak_Rate A_iHe_Leak_Rate B_sRF_ref B_sRF B_tRF_ref B_tRF
          A_sRF_ref A_sRF A_tRF_ref A_tRF robot_rtn robot_xtn HX_rho_dome HX_rho_wall A_wfr_prsnt HX_T_wall HX_T_dome
          A_wfr_T A_rcp_stp A_CNT_T A_TOP_T A_GSH_T A_SID_T A_iIHC_P A_iIHC_flow A_oIHC_P A_oIHC_flow A_bRF A_bRF_ref
          A_sO2 A_sAr A_sHe A_sSiH4 A_tHe A_tSiH4 A_tAr A_tO2 A_NF3 A_P A_rTV_stp A_TTV_stp A_echuck_V A_echuck_I B_wfr_prsnt
          B_wfr_T B_rcp_stp B_CNT_T B_TOP_T B_GSH_T B_SID_T B_iIHC_P B_iIHC_flow B_oIHC_P B_oIHC_flow B_bRF B_bRF_ref
          B_sO2 B_sAr B_sHe B_sSiH4 B_tHe B_tSiH4 B_tAr B_tO2 B_NF3 B_P B_rTV_stp B_TTV_stp B_echuck_V B_echuck_I HX_flow_wall
          HX_flow_dome CF_xtn_corr CF_rot_corr A_foreline_P B_foreline_P A_slot B_slot A_elapsed_stp_time B_elapsed_stp_time
          buffer_P Buffer_N2_flow Buffer_N20_Setpoint ChamberA ChamberB ChamberC ChamberD Batch Operation length=8;
   stop;
   call missing(of _all_);
run;

(csvReader macro definition lines)

%csvReader(data=first,out=final,fileref=indata,fileNameLength=$500,namerow=4);

------------------------------------------------------------------------------------------------------------------------------------------

and I get an error like the one below for every raw file:

NOTE: The infile INDATA is:
      Filename=D:\Temp\EP\141012_040223_0511_K4411740.txt,
      File List=D:\Temp\EP\*.txt,RECFM=V,LRECL=32767

NOTE: The data set WORK.CSVREADER_VINFO has 102 observations and 3 variables.
ERROR: Invalid data set name at line 9 column 218.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".

Actually the 218th column in the raw file is empty, with the column name "SubEntityId___5"; I'm not sure why it gets an error at line 9 specifically.

data_null__
Jade | Level 19

This statement is causing the error:

finfo.output(dataset:cats("%unquote(&finfodatasetrootname)",filename));

but it is not essential, so you could comment it out and see if that fixes it.

My program assumes that the FILENAME(s) can be used as SAS names, and that doesn't seem to work here.
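As a sketch of a possible workaround (not part of the macro itself), the file name could be cleaned into a valid SAS name before it is used as a data set name; the file name here is the one from your log:

data _null_;
    fname = '141012_040223_0511_K4411740.txt';
    /* drop the extension; SAS names also cannot start with a digit */
    dsname = scan(fname, 1, '.');
    if anydigit(dsname) = 1 then dsname = cats('F', dsname);
    put dsname=;   /* F141012_040223_0511_K4411740 */
run;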

Lin_Clare
Calcite | Level 5

Yes, after commenting it out, there is no error.

Thank you!

RW9
Diamond | Level 26

Yes, see, you're hitting the issues I noted above. You need to identify a "logical" method of removing the header rows/getting the column types so that you can then define a "logical" program to import them. Then you need to know the content and intent of each file to ensure that it matches the other tables, for instance:

File 1

VAR1     VAR2
001      12DEC12

File 2

VAR1     VAR2
1        19JAN2014

Now, looking at it manually, we could probably assume that 001 and 1 are the same (although maybe not; maybe there is a file identifier to be applied?). We could probably also guess that VAR2 is a date in both cases, but does the data reference the same thing? I.e., in file 1 it could be date of birth, in file 2 the date of joining. So a clear understanding of each file needs to be in place, hence why there should be some sort of documentation (specs or metadata), plus I would advise a mapping document showing which columns go where in the final output. Once you have that, the actual import programs should write themselves.
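As a sketch of those two guesses, SAS can confirm that the values line up once they are read with suitable informats (the values are taken from the example above):

data _null_;
    /* '001' and '1' compare equal once read as numbers */
    if input('001', best32.) = input('1', best32.) then put 'VAR1 values match';
    /* ANYDTDTE copes with both date layouts */
    d1 = input('12DEC12',   anydtdte32.);
    d2 = input('19JAN2014', anydtdte32.);
    put d1= date9. d2= date9.;
run;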

Lin_Clare
Calcite | Level 5


Hi RW9

Here is what the raw txt files look like

------------------------------------------------------------------------------------------------------------

#
# Date:10/12/2014 12:19:00 AM
# Batch:1413044241 Route:FL65.8P1CB Event:N/A
Time    A_Rate    A_sRF    B_Rate    B_sRF    CHA    CHB ...
3111    26        16       71        44       TI     FI  ...
4210    33        18       68        42       AI     EI  ...
...

-------------------------------------------------------------------------------------------------------------

And, for example, another file would look like this:

-------------------------------------------------------------------------------------------------------------

#
# Date:10/11/2014 10:19:00 AM
# Batch:1213043241 Route:FL65.8P1CB Event:N/A
Time    A_rate    B_rate    A_srf    B_srf    NR_flow    CHA    CHB ...
3111    26        71        16       44       G          TI     FI  ...
4210    33        68        15       42       F          AI     EI  ...
...

-----------------------------------------------------------------------------------------------------------------

I am stuck because I cannot import all the files with the column names and formats automatically, since there are 3 header rows.
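One way around the header rows (a sketch, using the file name from the log above as a placeholder) is a first pass that reads only row 4 and splits it on tabs to recover the column names and their positions:

data header;
    infile "D:\Temp\EP\141012_040223_0511_K4411740.txt" firstobs=4 obs=4 lrecl=32767;
    input;                                   /* load row 4 into _infile_  */
    length pos 8 name $32;
    do pos = 1 to countw(_infile_, '09'x);
        name = scan(_infile_, pos, '09'x);   /* one record per column name */
        output;
    end;
    keep pos name;
run;

The resulting name/position table could then drive a generated INPUT statement, along the lines of the CALL EXECUTE sketch earlier in this thread.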

Ksharp
Super User

Why not use PROC IMPORT?

proc import datafile='c:\temp\x.txt' out=x dbms=tab replace;
    datarow=4;
run;
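One caveat (my assumption, not tested against these files): PROC IMPORT takes the variable names from the first row by default, so with three junk header lines it may be safer to copy the file without them first; a sketch with placeholder paths:

filename raw   'c:\temp\x.txt';
filename clean temp;

data _null_;                       /* copy the file, skipping rows 1-3 */
    infile raw firstobs=4 lrecl=32767;
    file clean lrecl=32767;
    input;
    put _infile_;
run;

proc import datafile=clean out=x dbms=tab replace;
    getnames=yes;                  /* names are now on the first row   */
run;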

Xia Keshan

