02-07-2012 02:43 PM
I am trying to extract data from a text file using SAS 9.2. The problem with the data is there are some unwanted non alphabetic characters (Copyright Symbol, Pie Sign, Function (f) sign, etc..) in the data which is not allowing me to extract all rows (only 14 top rows out of 100000 records). When I tried to open the data in excel, I found out that there are similar "unwanted" characters on line 10, 11 but still SAS is able to read all those value and load it into the dataset. Can you think of any reason why I am not able to extract data?
I am thinking of using the following alternatives:
1. Using Encoding with my filename statement (less confident that it will solve my issue)
2. Removing few rows from my dataset and trying to extract
Also, Is there a way in SAS to instruct SAS to read values for only few columns (ex: first three ) (I have 60 columns and I am extracting only three columns but still it will read all columns for left to right).
Can you suggest any better approach for this sort of ETL problem?
filename in "path\fn_nm.txt";
infile in dlm = '09'X lrecl = 360 missover;
attrib Cst_Id length = 7. informat = 7. format = 7.;
attrib Cst_Nm length = $40. informat = $30. format = $30.;
attrib Cst_Add length = $20 information = $20. format = $20.;
Thanks for your time and help!
02-07-2012 03:15 PM
While there is probably an easier solution using an encoding option, if you don't want/need those characters to remain in the data, you could always use a pre-processing brute force approach and eliminate any of the undesired characters. An example of how you could first read the file one character at a time, and eliminate any undesired characters, can be found at: http://communities.sas.com/message/113806#113806.
If you decide to try such an approach, you might want to only keep ASCII characters between 1 and 127.
02-07-2012 03:32 PM
I do not envy you for having to deal with such poorly prepared files. It will still read the full lines, if you want to avoid that you could read the file through a pipe with something like the cut command. If you are using windows and have files with lots of nonprintable ascii values add ignoredoseof option if you are terminating reading the file before you would normally expect it to end. You can use utf-8 encoding to read in the other values you mentioned and you could strip them out later using a variety of methods.
infile 'path/to_file.txt' encoding='utf-8' ignoredoseof lrecl=360 truncover;
attrib cst_id length=7 informat=7. format=7.
cst_nm length=$40 informat=$30. format=7.
cst_add length=$20 informat=$20. format=$20.;
02-08-2012 02:10 PM
I have been getting hit by this a lot lately. Ususally I get a TRANSCODING error in SAS. This is because SAS is trying to convert the strange hexcodes into the equivalent hexcodes for the character set that you are using. For example what hexcode represents a copyright symbol could be different for english or japanese.
I have been seeing this where users have attempted to translate to English. They have left in symbols or other characters that appear fine in their local language version of SAS, but when I try to read then using my English version of SAS it cannot transcode those symbols and it throws errors.
02-09-2012 05:11 PM
I am still working on the above file. I have a related question and would appreciate your help:
I know that in SAS after the pointer locator '@' we can provide a keyword/occurence of word and read the data from there. For ex:
filename aa '\path\test.txt';
@1 ID $6.
My question is instead of reading a character, can i read number/find numbers which begins from either 1-9? (1,2,3,...9)
I tried reading in the following manner:
filename aa '\path\test.txt';
@1 ID $6.
But failed. ..
Let me know if you need further information.
02-09-2012 07:00 PM
You may be able to parse what you are looking for from the SAS automatic variable _infile_ which represents the entire current input "line" read from your source file.
02-09-2012 07:23 PM
We have a couple of people on the forum who are good with regular expressions. I'm not, but I would think they would provide a mechanism for what you want to do.
02-09-2012 08:18 PM
Like I said, regular expressions are still not my forte. However, that said, will something like the following suffice?
format want $8.;
re = prxparse('/[0-9]/');
input stuff $50. @;
INPUT @x want;
THIS LINE HAS XYZ 7 NUMBERS
NONE 6 ON THIS LINE