Hi all,
I have a huge text file that I need to import into SAS. I can't open it in Notepad (too big), and when I open it in WordPad the rows get messed up, so there is no way I can determine the starting position of each field.
So I wrote my code based on the layout I got from the business side:
Column Name | DATA TYPE |
CONTACT_ID | VARCHAR2 (17 Byte) |
COMPANY_NAME | VARCHAR2 (39 Byte) |
FIRST_NAME | VARCHAR2 (30 Byte) |
LAST_NAME | VARCHAR2 (33 Byte) |
GENDER | VARCHAR2 (3 Byte) |
PREFERED_LANGUAGE | VARCHAR2 (4 Byte) |
MARITAL_STATUS | VARCHAR2 (12 Byte) |
HOME_PHONE | VARCHAR2 (19 Byte) |
CELL_PHONE | VARCHAR2 (19 Byte) |
BUISINESS_PHONE | VARCHAR2 (21 Byte) |
EMAIL | VARCHAR2 (50 Byte) |
DT_BIRTH | VARCHAR2 (8 Byte) |
VIP_INDICATOR | VARCHAR2 (3 Byte) |
STAFF_INDICATOR | VARCHAR2 (3 Byte) |
CLIENT_SINCE_DATE | VARCHAR2 (8 Byte) |
DNSH | VARCHAR2 (3 Byte) |
HOME_DNCL | VARCHAR2 (3 Byte) |
CELL_DNCL | VARCHAR2 (3 Byte) |
BUSINESS_DNCL | VARCHAR2 (3 Byte) |
DNSO | VARCHAR2 (3 Byte) |
DT_TO_DATE | VARCHAR2 (8 Byte) |
data clientinfo;
infile "&rootname\clientinfo.txt" truncover lrecl=32767 pad obs=5;
input
@1 CONTACT_ID $17.
@18 COMPANY_NAME $39.
@57 FIRST_NAME $30.
@87 LAST_NAME $33.
@110 GENDER $3.
@113 PREFERED_LANGUAGE $4.
@117 MARITAL_STATUS $12.
@129 HOME_PHONE $19.
@148 CELL_PHONE $19.
@167 BUISINESS_PHONE $21.
@188 EMAIL $50.
@238 DT_BIRTH $8.
@246 VIP_INDICATOR $3.
@249 STAFF_INDICATOR $3.
@252 CLIENT_SINCE_DATE $8.
@260 DNSH $3.
@263 HOME_DNCL $3.
@266 CELL_DNCL $3.
@269 BUSINESS_DNCL $3.
@272 DNSO $3.
@275 DT_TO_DATE $8.;
run;
But when I run this I don't get accurate output (some of the values get truncated). Is there any way to find out the starting positions of the fields, or any other way to import the file?
I even tried the PROC IMPORT wizard in PC SAS, and it did not work.
I was able to import 3 other small text files which I could open, and I know this big one is in the same format as those 3 (the value of the first field of the next row starts right after the end of the value of the last field of the previous row).
Any suggestions please? 🙂
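One way to cross-check hand-computed @ pointers is to derive them from the layout widths. A minimal sketch (field names and byte widths copied from the first few rows of the layout above; the dataset name is arbitrary) that prints the computed start column for each field, which you can compare against the @ pointers in the INPUT statement:

```
data layout;
   retain start 1;          /* running start column, begins at 1   */
   input name :$20. width;  /* field name and its byte width       */
   position = start;        /* computed @ pointer for this field   */
   start + width;           /* advance past this field's width     */
   drop start;
   datalines;
CONTACT_ID 17
COMPANY_NAME 39
FIRST_NAME 30
LAST_NAME 33
GENDER 3
;
run;

proc print data=layout noobs;
run;
```

Any field where the computed position disagrees with the @ pointer in the data step is a likely source of truncated or shifted values.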
I'm afraid you may have to post examples of the lines that the truncation occurs on.
I am going to guess, without example data, that your issue mostly comes from values that have embedded blanks.
This line:
@18 COMPANY_NAME $39.
says to read starting at column 18. But since you don't give any further instructions, it will stop at the first blank if there is one.
You might try fixed-column input, which would look like:
company_name $ 18-56
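For example, a sketch of the first few fields using column input (the start/end columns are taken from the @ pointers and widths in the original step; the file path macro variable is assumed to resolve as in the original post):

```
data clientinfo;
   infile "&rootname\clientinfo.txt" truncover lrecl=32767 obs=5;
   input contact_id   $ 1-17     /* 17 bytes starting at column 1  */
         company_name $ 18-56    /* 39 bytes starting at column 18 */
         first_name   $ 57-86;   /* 30 bytes starting at column 57 */
run;
```

Column input reads exactly the specified columns, embedded blanks included, so it behaves the same as formatted input here but makes the field boundaries explicit.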
I just tried fixed-column input and got the same result.
The first 4 variables import properly; truncation starts after that, so I guess I need to keep guessing about the layout...
You haven't specified a delimiter; is there no delimiter at all?
If you have smaller files that are in the exact same format, use the same code but point it at your big file.
If you can show the code that works for the smaller files that may help.
My approach for these types of issues is to use PowerShell commands to first get a small subset of the file. Perfect your program on the small file and then run it on the large file. You already have small files, so this isn't necessary.
Sorry, I was not clear enough.
When I said same format, I meant all of them have blanks between the fields, and the next row starts right after the previous one. For example, if one of the smaller files had 3 variables (code1, age, and code2), the text file would look like:
AAA  45  BBBAAA  55  BBBAAA  66  BBBAAA....... so when I use code like:
data clientinfo;
infile "&rootname\clientinfo.txt" truncover lrecl=32767 pad obs=5;
input
@1 code1 $3.
@6 age $2.
@10 code2 $3.;
run;
I'd get a perfect output.
All these files have different layouts, though (they are completely different files); it's just that I cannot open the huge one.
Can you please share a link to some docs about the PowerShell approach?
P.S. I have 2 blank spaces between each field.
To see what is in your file I suggest to use the following code:
data _null_;
infile "<yourFile>" obs=10;
input;
list;
run;
If the input record contains non-printable characters, the record will also be displayed in hex, so you will see the delimiter between the data values. Maybe it is '09'x, the tab character.
I just ran this:
data _null_;
infile "&rootname\clientinfo.txt";
input;
put _infile_;
if _n_ > 20 then
stop;
run;
and in the log I got exactly the same thing that I'd get in my output file from the above code, so it looks like the text file I was provided is all messed up.
@Tal wrote:
I just ran this:
data _null_;
infile "&rootname\clientinfo.txt";
input;
put _infile_;
if _n_ > 20 then
stop;
run;
and in the log I got exactly the same thing that I'd get in my output file from the above code, so it looks like the text file I was provided is all messed up.
Modify your code:
data _null_;
infile "&rootname\clientinfo.txt";
file "&rootname\testclient.txt";
input;
put _infile_;
if _n_ > 20 then
stop;
run;
So you can post the resulting file (testclient.txt) as an attachment here, if there's no sensitive data in it. This might give us more clues.
Thanks, all, for your input.
It turns out that the text file I was provided had a lot of bad data.
I was only reading in 5 records, but when I pulled 1,000 I realized everything is fine.
It looks like the first records in the file were clients from countries outside Canada, with missing gender, weird postal codes, and so on, which I thought were truncated values.
But all good.
Thanks again!