Hi,
I need to import ~80 txt files (tab-delimited) and concatenate them into one file, but I don't know how to do it because of the raw data layout:
1) the first 3 rows are useless header information; the column names are in row 4 and the data start in row 5
2) each txt file contains ~126 columns, a mix of character and numeric variables with different informats
3) some files have 126 columns and some have 128 (the common 126 plus 2 extra)
4) the column sequence differs between txt files; the column names are the same, but some are uppercase while some are lowercase
Does anyone know how to efficiently import every file with the correct column names and formats, concatenate the files by column name (still keeping the 2 extra columns), and get everything into one file?
Thanks
Download the macro in this thread https://communities.sas.com/message/210551#210551
You will need to define an empty data set as described in the header of the macro as the model for what fields are to be read from the 80 or so files. Like named fields, ignoring case, will be read using the same INFORMAT data type etc. The order of the columns does not matter.
This is pretty much exactly what you need.
Hi,
You can definitely design a macro to deal with the various file layouts, but you will still need to use the basic INFILE/INPUT statement syntax.
For instance, you will want to use the FIRSTOBS= option on the INFILE statement to address the starting-row issue you mentioned.
data want;
infile file-specification firstobs=5;
input var1 $ 1-5 var2 6-10 var3 $ 11-15;
run;
The example above uses column input which is quite flexible and will allow you to "input" variables in whichever order you like.
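Since the files described here are tab-delimited, list input with a DLM= option may be a better fit than fixed column positions. A minimal sketch, assuming the layout described in the question (the file path and variable names below are hypothetical examples):

```sas
/* Sketch only: read a tab-delimited file, skipping the 3 header rows
   plus the name row. The path and variable names are made-up examples. */
data want;
   infile 'D:\Temp\EP\example.txt' dlm='09'x dsd truncover firstobs=5;
   input Time A_Rate A_sRF CHA :$8. CHB :$8.;
run;
```

TRUNCOVER keeps a short last field from pulling the INPUT statement onto the next record, and DSD treats consecutive tabs as missing values.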
The problem here is that I have ~80 files, each with ~126 columns in different orders; it's not realistic to use INPUT and specify each column's information (type, name, length) one by one.
Actually this is only the data for one operation step; I need to process ~200 operations like that, so I need to find a more systematic way.
Is there any metadata associated with the files? If not, how will you know some columns are the same as other columns? E.g., if file 1 has 120 columns and file 2 has 80 columns, does column 4 mean the same thing as column 4 from the previous file? If you have some metadata, i.e. information about the structure of each file, then you can use that to create your import programs.
First I tried removing the 3 header rows from one txt file and using PROC IMPORT with DATAROW=2 to read it in as a first table with correct column names and formats. Then I used INFILE/INPUT with FIRSTOBS=5 to read all the other files in and appended them to the first table with a PROC SQL UNION. I got errors indicating a data type mismatch for some columns.
Then I went back to check the data structure by manually removing the 3 header rows from another txt file and importing it as a first table, as above. From the PROC IMPORT log I got the column names and formats of the variables in these 2 files; I compared their column names, order, and formats in Excel and found that the only differences were the column order and the case of the column names.
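For what it's worth, that comparison can also be done programmatically instead of in Excel. A sketch, assuming the two imported tables are named FIRST and SECOND (hypothetical names):

```sas
/* Sketch: list columns whose types differ, or that exist in only
   one of the two data sets. Table names FIRST/SECOND are assumed. */
proc contents data=first  out=meta1(keep=name type) noprint; run;
proc contents data=second out=meta2(keep=name type) noprint; run;
proc sql;
   select coalesce(a.name, b.name) as name,
          a.type as type_first, b.type as type_second
   from meta1 a full join meta2 b
        on upcase(a.name) = upcase(b.name)
   where a.type ne b.type or a.name is null or b.name is null;
quit;
```

Joining on UPCASE(name) handles the mixed-case column names between files.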
Thanks. I will try it tomorrow when I'm back at work; I cannot log in to my desktop right now.
Hi data_null_,
Thanks for your code.
But I am kind of stuck preparing the input files. The macro needs all the input files to have the variable names in the first row, and mine don't, because my files have header rows above the variable-name row.
Just set the macro parameter NAMEROW to 4 when you call the macro, so that it will skip the first 3 lines and read the headers from line 4.
Hi Tom,
Thank you.
Below is my code. I got the column and format list of the variables from the PROC IMPORT log; I only defined the first 100 because the last 26-28 are useless to me.
--------------------------------------------------------------------------------------------------------------
filename indata "D:\Temp\EP\*.txt" lrecl=32767;
data first;
attrib Lot length=$8;
attrib Event length=$1;
attrib Product length=$10;
attrib Recipe length=$10;
attrib Route length=$10;
attrib Entity length=$6;
attrib LotList length=$8;
attrib Date length=8 informat=ANYDTDTM40. format=DATETIME.;
attrib Time B_oHe_Leak_Rate A_oHe_Leak_Rate B_iHe_Leak_Rate A_iHe_Leak_Rate B_sRF_ref B_sRF B_tRF_ref B_tRF
A_sRF_ref A_sRF A_tRF_ref A_tRF robot_rtn robot_xtn HX_rho_dome HX_rho_wall A_wfr_prsnt HX_T_wall HX_T_dome
A_wfr_T A_rcp_stp A_CNT_T A_TOP_T A_GSH_T A_SID_T A_iIHC_P A_iIHC_flow A_oIHC_P A_oIHC_flow A_bRF A_bRF_ref
A_sO2 A_sAr A_sHe A_sSiH4 A_tHe A_tSiH4 A_tAr A_tO2 A_NF3 A_P A_rTV_stp A_TTV_stp A_echuck_V A_echuck_I B_wfr_prsnt
B_wfr_T B_rcp_stp B_CNT_T B_TOP_T B_GSH_T B_SID_T B_iIHC_P B_iIHC_flow B_oIHC_P B_oIHC_flow B_bRF B_bRF_ref
B_sO2 B_sAr B_sHe B_sSiH4 B_tHe B_tSiH4 B_tAr B_tO2 B_NF3 B_P B_rTV_stp B_TTV_stp B_echuck_V B_echuck_I HX_flow_wall
HX_flow_dome CF_xtn_corr CF_rot_corr A_foreline_P B_foreline_P A_slot B_slot A_elapsed_stp_time B_elapsed_stp_time
buffer_P Buffer_N2_flow Buffer_N20_Setpoint ChamberA ChamberB ChamberC ChamberD Batch Operation length=8;
stop;
call missing(of _all_);
run;
csvReader macro definition lines
%csvReader(data=first,out=final,fileref=indata,fileNameLength=$500,namerow=4);
------------------------------------------------------------------------------------------------------------------------------------------
and I get an error like the one below for every raw file:
NOTE: The infile INDATA is:
Filename=D:\Temp\EP\141012_040223_0511_K4411740.txt,
File List=D:\Temp\EP\*.txt,RECFM=V,LRECL=32767
NOTE: The data set WORK.CSVREADER_VINFO has 102 observations and 3 variables.
ERROR: Invalid data set name at line 9 column 218.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
Actually the 218th column in the raw file is empty, with the column name "SubEntityId___5"; I'm not sure why the error occurs at line 9 specifically.
This statement is causing the error:
finfo.output(dataset:cats("%unquote(&finfodatasetrootname)",filename));
but it is not essential, so you could comment it out and see if that fixes it.
My program assumes that the FILENAME(s) can be used as SAS names and that doesn't seem to work here.
Yes, after commenting it out, there is no error.
Thank you!
Yes, you're hitting the issues I noted above. You need to identify a "logical" method of removing the header rows and getting the column types, so that you can then define a "logical" program to import the files. You would also need to know the content and intent of each file to ensure that it matches the other tables. For instance:
File 1
VAR1 VAR2
001 12DEC12
File 2
VAR1 VAR2
1 19JAN2014
Now, looking at it manually, we could probably assume that 001 and 1 are the same (although maybe not; maybe there is a file identifier to be applied?). We could probably also guess that VAR2 is a date in both cases, but does the data reference the same thing? In file 1 it could be a date of birth, in file 2 a date of joining. So a clear understanding of each file needs to be in place, hence why there should be some sort of documentation (specs or metadata). I would also advise a mapping document showing which columns go where in the final output. Once you have that, the actual import programs should write themselves.
Hi RW9
Here is what the raw txt files look like:
------------------------------------------------------------------------------------------------------------
#
# Date:10/12/2014 12:19:00 AM
# Batch:1413044241 Route:FL65.8P1CB Event:N/A
Time A_Rate A_sRF B_Rate B_sRF CHA CHB.....
3111 26 16 71 44 TI FI .....
4210 33 18 68 42 AI EI ......
......
-------------------------------------------------------------------------------------------------------------
And, for example, another file would look like this:
-------------------------------------------------------------------------------------------------------------
#
# Date:10/11/2014 10:19:00 AM
# Batch:1213043241 Route:FL65.8P1CB Event:N/A
Time A_rate B_rate A_srf B_srf NR_flow CHA CHB.....
3111 26 71 16 44 G TI FI .....
4210 33 68 15 42 F AI EI ......
......
-----------------------------------------------------------------------------------------------------------------
I am stuck because I cannot import all the files with their column names and formats automatically, since there are 3 header rows.
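One workaround for the 3 header rows, if sticking with PROC IMPORT: copy each file to a temporary file while skipping the header lines, so the column names land on row 1. A sketch for a single file (the path below is just the example filename from the log; wrapping this in a macro loop over all 80 files is left out):

```sas
/* Sketch: strip the 3 "#" header rows into a temp file, then let
   PROC IMPORT read the names from what is now the first row. */
filename raw   'D:\Temp\EP\141012_040223_0511_K4411740.txt';
filename clean temp;

data _null_;
   infile raw firstobs=4;   /* skip the 3 header rows */
   file clean;
   input;
   put _infile_;            /* copy each remaining record verbatim */
run;

proc import datafile=clean out=one dbms=tab replace;
   getnames=yes;            /* names are now on row 1 */
run;
```

PROC IMPORT then guesses types per file, so the type-mismatch issue between files can still occur; the macro approach with a model data set avoids that.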
Why not use PROC IMPORT?
proc import datafile='c:\temp\x.txt' out=x dbms=tab replace;
datarow=4;
run;
Xia Keshan