05-13-2012 06:06 PM
I was given a text file named have_2.txt with 37 variables and over million records (a sample is attached).
I need to read it into SAS.
Some variable names of have_2.txt file were having two or more standing alone words. So, I have manually changed those names by putting an “underscore” in between like below.
Days Star Onn (oldest Amb)
The revised file is named as have_3.txt (attached).
Q1: Does this manual name changes in text file affect my final SAS data file
Q2: To read the data into SAS, I have tried below code but honestly not having full knowledge about what is happening internally.
This code reads the data into SAS but values of some variables are truncated.
I would really appreciate it if anyone of you could help me to develop the most accurate code.
libname a "C:\MNS\";
Categor $ 18 /*length of this variable was set at $18 by trial and error, otherwise values under this variable are truncated*/
ABBRV_name $ 41; /*length of this variable was set at $41 by trial and error, otherwise values under this variable are truncated*/
Infile "C:\MNS\have_3.txt" firstobs=2 dlm="," dsd LRECL=4096 truncover TERMSTR=CRLF;
Informat Account_date ddmmyy10. Date_cancel ddmmyy10. Date_Of_Firs_Deposit ddmmyy10.;
Format Account_date date9. Date_cancel date9. Date_Of_Firs_Deposit date9.;
05-13-2012 11:27 PM
Oh, It is CSV file.You can use proc import to get it conveniently.
Then change some variables into DATE .
proc import datafile='c:\temp\have_2.txt' out=have dbms=csv replace; getnames=yes;run;
05-14-2012 12:19 AM
You can directly import the csv file using SAS Import wizard,and you can also see the code what sas is using internally in the log file or once import is done if you press F4 you will get the complete code what SAS has used internally in your editor.
Hope it helps...
data WORK.ss ;
infile 'F:\Community_SAS\have_3.txt' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Loan_number best32. ;
informat Account_date $19. ;
informat ABCDEF_number best32. ;
informat Current_number $12. ;
informat ABBRV_name $16. ;
informat Notifcation_type $4. ;
informat Regulati_code best32. ;
informat Office_code $4. ;
informat XXXX_code $5. ;
informat Seconda_Nameser_code $5. ;
informat Country_Name $19. ;
informat AXZREST_code best32. ;
informat Product_Code $5. ;
informat Localnal_nominclacation_code $3. ;
informat Date_cancel anydtdtm40. ;
informat Date_Of_Firs_Deposit anydtdtm40. ;
informat Days_Star_Onn_oldest_Amb best32. ;
informat Officerv_longe_name $5. ;
informat Closing_balance comma32. ;
informat Locatio $10. ;
informat Station $8. ;
informat XYZ_Limit best32. ;
informat XYZ_Balance best32. ;
informat Exitt best32. ;
informat Outstan best32. ;
informat Categor $16. ;
informat Rurald $33. ;
informat Dividee_DD_Names $11. ;
informat Opennq best32. ;
informat Regional_Grade_Numb $10. ;
informat Delinquency_Band $9. ;
informat Fiscal_Quarter $9. ;
informat Fiscal_Year best32. ;
informat Krrears_Sand $9. ;
informat BpplicationNumber $1. ;
informat SkkkkUsed $1. ;
informat Scccc_Band $1. ;
format Loan_number best12. ;
format Account_date $19. ;
format ABCDEF_number best12. ;
format Current_number $12. ;
format ABBRV_name $16. ;
format Notifcation_type $4. ;
format Regulati_code best12. ;
format Office_code $4. ;
format XXXX_code $5. ;
format Seconda_Nameser_code $5. ;
format Country_Name $19. ;
format AXZREST_code best12. ;
format Product_Code $5. ;
format Localnal_nominclacation_code $3. ;
format Date_cancel datetime. ;
format Date_Of_Firs_Deposit datetime. ;
format Days_Star_Onn_oldest_Amb best12. ;
format Officerv_longe_name $5. ;
format Closing_balance comma12. ;
format Locatio $10. ;
format Station $8. ;
format XYZ_Limit best12. ;
format XYZ_Balance best12. ;
format Exitt best12. ;
format Outstan best12. ;
format Categor $16. ;
format Rurald $33. ;
format Dividee_DD_Names $11. ;
format Opennq best12. ;
format Regional_Grade_Numb $10. ;
format Delinquency_Band $9. ;
format Fiscal_Quarter $9. ;
format Fiscal_Year best12. ;
format Krrears_Sand $9. ;
format BpplicationNumber $1. ;
format SkkkkUsed $1. ;
format Scccc_Band $1. ;
if _ERROR_ then call symputx('_EFIERR_',1);
05-14-2012 12:09 PM
The onlly affect your row header changes would have would be on the names of variables if using PROC IMPORT to generate code to read the data.
You get truncated values because PROC IMPORT, or the code generated there by or that you write is either not allowing enough characters $ xx or BESTXX., or in some cases they shouldn't be numeric variables but strings as they exceed the length of numeric precision for your OS.
I would read the data and then do a proc freq (possibly limiting the the number of observations to some extent) excluding unique variables such as IDs if such exist to see what the I have. In the case of truncated fields make the informat larger and in some cases change from numeric to string.
At which point I go through the code created by proc import an look for things that may not work the way I want or expect.
Such as should loan_number, ABCDEF_number, current_number and such actually be numeric (am i going to do arithmatic with them? probably not numeric, are they longer than the number of significant digits my system supports- probably not numeric).
If this read code is going to be used more than once I would look to get varaible names that are somewhat more meaningful and assign labels. For instance you have a variable "Locatio" which makes me think that it would likely be referring to a Location which probably makes a better variable name as whole words do when possible. But what "location" does it refer to: Mortgage property, branch office, business address? Use a label to make the use of these variables easier in the future.
You also have a variable Localnal_nominclacation_code which makes me think a search and replace was done on the row of variable names and you got some changes you did not expect or want.
Typically at this stage I remove FORMATS except for date, datetime or time variables as they will default to the format equivalent of the informat mostly.
05-14-2012 03:27 PM
Hi Ksharp, Shivas and Ballardw,
Thank all of you for these excellent help.
I have used both Ksharp’s and Shivas’s approaches and they produce same results except the fact Shivas’s appraoch for importing using proc import wizard produce variable names as standing alone words for variables having more than one word.
Could you also help in these 3 specific issues:
1). I wanted to convert two date variables named Date_cancel and Date_Of_Firs_Deposit into date9. format. I created the code very below but it doesn’t do the job
2). The variables named Account_date, Dividee_DD_Names were imported as character variables but I need to convert them also into a date9. format. Could help me to code this?
3). Isn’t there any way to avoid a further data step and incorporate the date conversions into Ksharp’s proc import code?
Informat Date_cancel ddmmyy10. Date_Of_Firs_Deposit ddmmyy10.;
Format Date_cancel date9. Date_Of_Firs_Deposit date9.;
I really appreciate your help.
05-14-2012 05:57 PM
For 1) You would be better off modifying the FORMAT statement in the original program, if using a datastep and not basic proc import. When you say what you tried didn't do the job how did it not do the job? Failure, syntax warning, unexpected date values when printing?
For 2) try changing the INFORMAT to ANYDTDTE10. and remove the $ in the INPUT statement following the variables. Assign your format DATE9. as well.
For 3) Not in proc import. That's why I usually run proc import and then pull the generated code from the LOG to modify.