02-03-2016 03:10 PM
I am trying to read multiple tab-delimited text files simultaneously into SAS datasets. I am using a filename pipe, but I'm having trouble writing the INPUT statement in the data step. Since the number of variables and the variable names are not the same across the text files, how can I input these variables? Also, PROC IMPORT did not work in my case.
02-03-2016 03:20 PM
Please post code you have tried and describe why Proc Import did not work.
I suspect that with multiple structures you have some work on your hands, as there is not likely to be a 4- or 5-line bit of code that works.
Approaching a project like this, I have started by checking whether any of the data sets share a common structure, so that common code could be reused with only the input filename and output dataset name changed.
Then do anything needed to read "singleton" layout files.
An additional thing to consider, and an issue with guessing approaches like Proc Import, is variables that appear in different files and may require linkage with similar fields in other source files. You want to ensure that the data types and characteristics are uniform: all numeric or all character, the same length (especially for character variables), and likely even the same labels and formats.
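As a minimal sketch of enforcing uniform attributes (the variable names and lengths here are hypothetical, not from the original files):

```sas
/* Hedged sketch: standardize attributes before combining files.     */
/* Client_ID and From_Date are hypothetical example variables.       */
data combined;
    /* ATTRIB fixes length, format, and informat up front, so every  */
    /* contributing dataset is held to the same layout. Note that a  */
    /* character/numeric type conflict must still be resolved in the */
    /* individual file-reading steps before the SET.                 */
    attrib Client_ID  length=$12
           From_Date  length=8 format=mmddyy10. informat=anydtdte.;
    set file1 file2 file3;
run;
```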
02-03-2016 03:32 PM
filename baby pipe 'dir "D:\Original_Data\Claims\Baby\*.txt" /b';
data _null_;
    length fname file text $200;
    infile baby truncover end=last;
    input fname $char200.;
    text = 'D:\Original_Data\Claims\Baby\'||strip(fname);
    file = 'SAS_'||tranwrd(tranwrd(scan(fname,1,'.'),'-','_'),' ','_');
    call symputx('text'||strip(put(_n_,best.)), text, 'G');
    call symputx('file'||strip(put(_n_,best.)), file, 'G');
    if last then call symputx('num_files', put(_n_,best.), 'G');
run;
%macro import_all;
    %do i=1 %to &num_files.;
        proc import datafile="&&text&i" out=babysas.&&file&i dbms=TAB replace;
        run;
    %end;
%mend import_all;
%import_all
When I'm using PROC IMPORT, some files import correctly, but others import only partial information, say, only the first column of the file ends up in the SAS dataset.
Some files do have common variables, but since there are over 30 variables in one file, I have over 100 files, and each file is 2-3 MB, is there a way to do this without opening all the files and manually typing in all the variable names?
02-03-2016 06:41 PM
At this point I would suggest that you modify the macro you are using to read the data so that it only reads a few records, maybe with the system option OBS=10. Add
proc printto log="a different file specification for each data set read.LOG";
before each PROC IMPORT. The log file specification should contain enough of the read file's name to match the data to the log.
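A minimal sketch of that pattern (the paths, file names, and dataset names below are placeholders, not the poster's actual files):

```sas
options obs=10;                      /* read only a few records while testing */

proc printto log="D:\logs\baby_file1.log" new;   /* separate log per file */
run;

proc import datafile="D:\Original_Data\Claims\Baby\file1.txt"
    out=work.baby_file1 dbms=TAB replace;
run;

proc printto;                        /* restore the default log destination */
run;

options obs=max;                     /* full reads once the layouts are settled */
```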
You are likely asking why you would do this. There is some method to the madness of generating hundreds of log files.
The log will contain the data step code PROC IMPORT generated to read each data set. You can then edit the log files into SAS program files by removing the performance metrics and such, and modify the variable names, informats, and formats for consistency. Check for the odd column-heading differences, such as Client_Id versus Id_client, that may pop up depending on your particular needs for "these should be the same but didn't come in the same". Dates, times, and datetimes may be entertaining, but learning about the informats ANYDTDTE, ANYDTDTM, and ANYDTTME could be very helpful.
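For example, the ANYDT informats read several common layouts with a single informat (the variable names and values below are illustrative only):

```sas
data dates;
    /* ANYDTDTE. accepts 07/20/2013, 20JUL2013, 2013-07-20, ...  */
    /* ANYDTDTM. does the same for datetime values.              */
    input From_Date :anydtdte12. Admission_DT :anydtdtm22.;
    format From_Date mmddyy10. Admission_DT datetime20.;
datalines;
07/20/2013 20JUL2013:08:00:00
2013-09-21 21SEP2013:14:30:00
;
run;
```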
You may want to request some guidance from the users if other than yourself about which dataset have higher priority and work on those first.
Or, if you have documentation, build datasets from it to generate the code.
Also, look for files whose structure changes within the file. Proc Import will not work at all on those, though you may get a head start on the first layout it encounters.
After you have read the data, or while reading it, you might also want to consider whether the data needs to be normalized. For instance, if each column's values represent a month, you might want to add a month variable with a single value and generate 12 records from one row of data.
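A minimal sketch of that wide-to-long step (the dataset and the variable names m1-m12 and client_id are hypothetical):

```sas
data long;
    set wide;                 /* wide: one row per client, columns m1-m12 */
    array m{12} m1-m12;
    do month = 1 to 12;
        value = m{month};
        output;               /* write 12 rows for each incoming row */
    end;
    keep client_id month value;
run;
```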
02-03-2016 04:26 PM
Right now, I have files with different variable names and different numbers of variables. The common variables are of the same type.
If I use PROC IMPORT for just one file, the error below comes out:
10146 7528697355.331331310353648000.07/20/2013.09/21/2013..07/20/2013 08:00:00 AM.2273252.142736
181 ..188,756.78.07/20/2013.1 205
Encrypted_Beneficiary_ID=7528697355 TCN=331331310353648000 From_Date=07/20/2013 To_Date=09/21/2013
Adjustment_Source_Type= Admission_Date=20JUL13:08:00:00 Billing_Provider_ID=2273252
Billing_Provider_NPI=1427360700 Billing_Provider_Taxonomy=282N00000X Billing_Provider_Type=A060
Billing_Provider_Type_CID=6 Claim_Status_CID=83 Claim_Type_CID=2 Claim_Type_Code=R CMH_Hospital_ID=
CMH_Hospital_Type= Date_Of_Birth=07/20/2013 Gender=M Invoice_Type=I Place_Of_Service=.
Principal_Diagnosis_Code=V3101 Diagnosis_Category_Code=D9 Rendering_Provider_ID=.
Rendering_Provider_NPI=. Rendering_Provider_ID_Type_CID=. Rendering_Provider_Taxonomy=
Special_Program= Status_Effective_Date=11/10/2013 Type_Of_Bill=111 Transaction_Type=Enc
Total_Billed_Units=. Submitted_DRG_Code= Submitted_Charges=188,757 Birth_Date=07/20/2013
COUNT_Medicaid_ID=1 _ERROR_=1 _N_=10145
I'm thinking maybe it's the file's problem? Is the tab not recognized by SAS?
02-03-2016 04:54 PM
SAS does recognize the tab as the field delimiter. I don't see the invalid-data messages that go with the record dump, but it looks like PROC IMPORT is working pretty well, at least for the very small bit of information you have provided.
Is the goal to combine the files?
You mention different variables and names. Does the same data have different names in different files? It sounds like you have a bit of a mess, and you don't really know the extent of it.
I might start by writing a program to just read the name row from each file and make a report of common names, etc.
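A hedged sketch of that idea, reusing the directory listing from the earlier post (the path comes from the poster's code; the rest is illustrative):

```sas
filename baby pipe 'dir "D:\Original_Data\Claims\Baby\*.txt" /b';

data header_names;
    length fname path $200 varname $64;
    infile baby truncover;                 /* the directory listing    */
    input fname $char200.;
    path = 'D:\Original_Data\Claims\Baby\'||strip(fname);
    infile hdr filevar=path truncover;     /* open each file in turn   */
    input;                                 /* read only the header row */
    do i = 1 by 1 while (scan(_infile_, i, '09'x) ne ' ');
        varname = scan(_infile_, i, '09'x);
        output;                            /* one row per column name  */
    end;
    keep fname varname;
run;

proc freq data=header_names;
    tables varname;                        /* which names are shared?  */
run;
```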
02-03-2016 06:11 PM
Your log excerpt shows: Numeric variable Total_Billed_Units has got a missing value from raw data value 1,502.00.
Possibly, the first 20 records of the file had Total_Billed_Units < 1,000, so SAS did not recognize that it should use the COMMA. informat to read this column. Therefore, I would suggest adding the statement
guessingrows=20000;
(or an even higher value than 20000; GUESSINGROWS=MAX would be the last resort) to your PROC IMPORT step. The default value of 20 is probably insufficient, at least for this particular file.
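In context, the step might look like this (the path and dataset name are placeholders):

```sas
proc import datafile="D:\Original_Data\Claims\Baby\claims.txt"
    out=work.claims dbms=TAB replace;
    guessingrows=20000;   /* scan more rows before choosing informats */
run;
```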
You should also keep an eye on variable Submitted_Charges, whose value for obs. 10145 is displayed as 188,757 in the log, but raw data value was 188,756.78. Maybe it has been (inappropriately) formatted with something like the COMMA7. format. But the stored value is probably (hopefully) correct in this case.
02-03-2016 04:35 PM
So you have no consistent structure at all between the files? I.e., are there X many file structures, or do you genuinely have no idea of the structure?
Do you have documentation that defines the file structures?
You should include a larger portion of the log to help determine the error.
If proc import won't work, you'll have to write the code for each and there's no way around that.
02-04-2016 04:30 AM
What does the Import Agreement (or Import Specs) say for each file - you do have that, don't you? If not, then how will you know what the file contains, what it should look like, and how to prevent it from changing? Documentation is more important than programming. If you have the above information, each import step should write itself.