read in multiple text files with different structures

Reply
Occasional Contributor
Posts: 5

read in multiple text files with different structures

Hi,

 

I am trying to read multiple tab delimited text files simultaneously into SAS datasets. I am using filename pipe, but have trouble in writing the input statement in data step. Since the number of variables and variable names are not the same in the text files, I'm wondering how can I input these variables? Also, proc import did not work in my case. 

 

Thanks,

Hera

Grand Advisor
Posts: 10,196

Re: read in multiple text files with different structures

Please post code you have tried and describe why Proc Import did not work.

 

I suspect that with multiple structures that you have some work on your hands as there is not likely to be a 4 or 5 line bit of code that works.

Approaching a project like this I have started by seeing if any of the data sets have a common structure so I could write code that only had to change input filename and output dataset name for common code.

Then do anything needed to read "singleton" layout files.

Additional things to consider, and issue with guessing approaches like Proc Import, are variables that appear in different files that may require linkage with similar fields in other source files. You want to ensure that the data types and characteristics are uniform: all numeric or character, the same size (especially if character), likely even labels and formats.

 

Occasional Contributor
Posts: 5

Re: read in multiple text files with different structures

filename baby pipe 'dir "D:\Original_Data\Claims\Baby\*.txt" /b';
data babyclaims;
length fname file text $200.;
infile baby truncover;
input fname $;
file='SAS_'||tranwrd(tranwrd(scan(fname,1,'.'),'-','_'), '', '_');
text='D:\Original_Data\Claims\Baby\\'||fname;
call symput('num_files', put(_n_,best.));
run;

data _null_;
set babyclaims;
call symputx('text'||strip(put(_n_,best.)), text, 'G');
call symputx('file'||strip(put(_n_,best.)), file, 'G');
run;

 

%macro import;
%do i=1 %to &num_files.;

proc import file="&&text&i" out=babysas.&&file&i dbms=TAB replace;*/
getnames=yes;
run;
%end;
%mend;
%import;

 

When I'm using proc import, some files can be imported correctly, but other can only import parial information of the file, say, only the first column of the file in SAS datasets.

 

Some files do have common variables, but since there are over 30 variables in one file and I have over 100 files and each file is 2-3MB, I'm wondering if there is a way I can do without opening all the files and manually imput all the variable names?

 

Many thanks!

 

Hera

Grand Advisor
Posts: 10,196

Re: read in multiple text files with different structures

At this point I would suggest that you modify the macro you are using to read the data to only read a few records, maybe system option OBS=10. Add

 

proc printo log="a different file specification for each data set read.LOG";

run;

 

before each PROC IMPORT. The file specification should have enough of the read file to match the data and the log.

You are likely asking why do this. There is some method to the madness of generating "100's" of log files.

 

The LOG will have the Data step code generated to read each data set. You can then edit the LOG files into SAS program files by removing performance metrics and such. Modify the variable names, informats and formats for consistency. Check for the odd column heading differences such as : Client_Id or Id_client that may pop up depending on your particular needs for "these should be the same but didn't come in the same". Dates, times or datetimes may be entertaining but learn about the INFORMATS Anydtdte, Andydtdtm and Anydttme could be very helpful.

 

You may want to request some guidance from the users if other than yourself about which dataset have higher priority and work on those first.

 

Or if you have documentation build datasets to generate code.

 

Also, look for files that have changing structure within them. Proc Import will not work at all though you may get a headstart on the first encountered layout.

 

After you have read the data, or while reading, you might also want to consider if the data needs to be normalized. For instance, if you have values that each column represents a month you might want to add a month variable and single value and then generate 12 records from one row of data.

 

Good luck.

Respected Advisor
Posts: 3,775

Re: read in multiple text files with different structures

That's a tall task with sparse info.  Different structures can mean anything.  Why did PROC IMPORT not work?  

 

Give examples and goals.

Occasional Contributor
Posts: 5

Re: read in multiple text files with different structures

Right now, I have files with different variable names and different number of variables. For the common variables, they are of the same type. 

If I use proc import for just one file, the below error is coming out:

10146 7528697355.331331310353648000.07/20/2013.09/21/2013..07/20/2013 08:00:00 AM.2273252.142736
ZONE 333333333303333333333333333330332332333303323323333003323323333233333333244033333330333333
NUMR 75286973559331331310353648000907F20F2013909F21F20139907F20F2013008A00A0001D922732529142736

91 0700.282N00000X.A060.6.83.2.R...07/20/2013.M.I..V3101.D9......11/10/2013.0111.Enc.1,502.00
ZONE 333303334333335043330303303050003323323333040400533330430000003323323333033330466032333233
NUMR 07009282E0000089106096983929299907F20F20139D99996310194999999911F10F20139011195E391C502E00

181 ..188,756.78.07/20/2013.1 205
ZONE 0033323332330332332333303
NUMR 99188C756E78907F20F201391
Encrypted_Beneficiary_ID=7528697355 TCN=331331310353648000 From_Date=07/20/2013 To_Date=09/21/2013
Adjustment_Source_Type= Admission_Date=20JUL13:08:00:00 Billing_Provider_ID=2273252
Billing_Provider_NPI=1427360700 Billing_Provider_Taxonomy=282N00000X Billing_Provider_Type=A060
Billing_Provider_Type_CID=6 Claim_Status_CID=83 Claim_Type_CID=2 Claim_Type_Code=R CMH_Hospital_ID=
CMH_Hospital_Type= Date_Of_Birth=07/20/2013 Gender=M Invoice_Type=I Place_Of_Service=.
Principal_Diagnosis_Code=V3101 Diagnosis_Category_Code=D9 Rendering_Provider_ID=.
Rendering_Provider_NPI=. Rendering_Provider_ID_Type_CID=. Rendering_Provider_Taxonomy=
Special_Program= Status_Effective_Date=11/10/2013 Type_Of_Bill=111 Transaction_Type=Enc
Total_Billed_Units=. Submitted_DRG_Code= Submitted_Charges=188,757 Birth_Date=07/20/2013
COUNT_Medicaid_ID=1 _ERROR_=1 _N_=10145

 

I'm thinking maybe it's the file's problem? The TAB is not recognized by SAS?

Respected Advisor
Posts: 3,775

Re: read in multiple text files with different structures

SAS does reconize the tab as the record delimiter.  I don't see the invalid data messages that goes with the recordsdump, but it looks like PROC IMPORT is working pretty well.  At least for the very small bit of information you have provide.

 

Is the coal to combine the files?

 

You mention different variables and names.  Does the same data have different names in different files?  You have a bit of mess it sounds like and you don't really know the extent of it. 

 

I might start by writting a program to just read the name row from each file and make report of common names etc.

Trusted Advisor
Posts: 1,114

Re: read in multiple text files with different structures

Your log excerpt shows: Numeric variable Total_Billed_Units has got a missing value from raw data value 1,502.00.

 

Possibly, the first 20 records of the file had Total_Billed_Units<1000, so that SAS did not recognize that it should use informat COMMA. to read this column. Therefore, I would suggest to add the statement

guessingrows=20000;

(or with an even higher value than 20000; guessingrows=MAX would be the last resort) to your PROC IMPORT step. The default value 20 is probably insufficient, at least for this particular file.

 

You should also keep an eye on variable Submitted_Charges, whose value for obs. 10145 is displayed as 188,757 in the log, but raw data value was 188,756.78. Maybe it has been (inappropriately) formatted with something like the COMMA7. format. But the stored value is probably (hopefully) correct in this case.

Grand Advisor
Posts: 17,311

Re: read in multiple text files with different structures

So you have no consistent structure at all between the files? Ie are there X many file structures or do you genuinely have no idea of the structure?

Do you have documentation that defines the file structures?  

 

You should include a larger portion of the log to help determine the error. 

 

If proc import won't work, you'll have to write the code for each and there's no way around that. 

Esteemed Advisor
Esteemed Advisor
Posts: 7,190

Re: read in multiple text files with different structures

What does the Import Agreement (or Import Specs) say for each file - you do have that don't you?  If not then how will you know what the file contains, should look like, prevent from changing.  Documentation is more important that programming.  If you have the above information, each import step should write itself.

Ask a Question
Discussion stats
  • 9 replies
  • 508 views
  • 0 likes
  • 6 in conversation