Help using Base SAS procedures

Error extracting data from txt file

Reply
Contributor
Posts: 65

Error extracting data from txt file

I am trying to extract data from a text file using SAS 9.2. The problem with the data is there are some unwanted non alphabetic characters (Copyright Symbol, Pie Sign, Function (f) sign, etc..)  in the data which is not allowing me to extract all rows (only 14 top rows out of 100000 records). When I tried to open the data in excel, I found out that there are similar "unwanted" characters on line 10, 11 but still SAS is able to read all those value and load it into the dataset. Can you think of any reason why I am not able to extract data?

I am thinking of using the following alternatives:

1. Using Encoding with my filename statement (less confident that it will solve my issue)

2. Removing few rows from my dataset and trying to extract

Also, Is there a way in SAS to instruct SAS to read values for only few columns (ex: first three ) (I have 60 columns and I am extracting only three columns but still it will read all columns for left to right).

Can you suggest any better approach for this sort of ETL problem?

Code:

filename in "path\fn_nm.txt";

data new_data;

infile in dlm = '09'X lrecl = 360 missover;

attrib Cst_Id length = 7. informat = 7. format = 7.;

attrib Cst_Nm length = $40. informat = $30. format = $30.;

attrib Cst_Add length = $20 information = $20. format = $20.;

input

@1 Cst_Id

@8 Cst_Nm

@Cst_Add

;

run;

Thanks for your time and help!

PROC Star
Posts: 7,363

Error extracting data from txt file

While there is probably an easier solution using an encoding option, if you don't want/need those characters to remain in the data, you could always use a pre-processing brute force approach and eliminate any of the undesired characters.  An example of how you could first read the file one character at a time, and eliminate any undesired characters, can be found at: http://communities.sas.com/message/113806#113806.

If you decide to try such an approach, you might want to only keep ASCII characters between 1 and 127.

Trusted Advisor
Posts: 1,300

Error extracting data from txt file

I do not envy you for having to deal with such poorly prepared files.  It will still read the full lines, if you want to avoid that you could read the file through a pipe with something like the cut command.  If you are using windows and have files with lots of nonprintable ascii values add ignoredoseof option if you are terminating reading the file before you would normally expect it to end.  You can use utf-8 encoding to read in the other values you mentioned and you could strip them out later using a variety of methods.

data foo;

infile 'path/to_file.txt' encoding='utf-8' ignoredoseof lrecl=360 truncover;

attrib cst_id length=7 informat=7. format=7.

        cst_nm length=$40 informat=$30. format=7.

         cst_add length=$20 informat=$20. format=$20.;

input;

cst_id=compress(scan(_infile_,1,'09'x),,'kw');

cst_nm=compress(scan(_infile_,2,'09'x),,'kw');

cst_add=compress(scan(_infile_,3,'09'x),,'kw');

run;

Contributor
Posts: 65

Error extracting data from txt file

Thanks for the reply guys!

I am still working on it and will update you once I have got some results to share. Thanks again.

Super User
Super User
Posts: 6,500

Error extracting data from txt file

I have been getting hit by this a lot lately.  Ususally I get a TRANSCODING error in SAS.  This is because SAS is trying to convert the strange hexcodes into the equivalent hexcodes for the character set that you are using.  For example what hexcode represents a copyright symbol could be different for english or japanese.

I have been seeing this where users have attempted to translate to English.  They have left in symbols or other characters that appear fine in their local language version of SAS, but when I try to read then using my English version of SAS it cannot transcode those symbols and it throws errors.

Contributor
Posts: 65

Error extracting data from txt file

I am still working on the above file. I have a related question and would appreciate your help:

I know that in SAS after the pointer locator '@' we can provide a keyword/occurence of word and read the data from there. For ex:

filename aa '\path\test.txt';

data file;

infile aa;

input

     @1 ID $6.

     @'Monday' .....

My question is instead of reading a character, can i read number/find numbers which begins from either 1-9? (1,2,3,...9)

I tried reading in the following manner:

filename aa '\path\test.txt';

data file;

infile aa;

input

     @1 ID $6.

     @[1-9] .....

But failed. ..

Let me know if you need further information.

Super User
Posts: 10,500

Error extracting data from txt file

You may be able to parse what you are looking for from the SAS automatic variable _infile_ which represents the entire current input "line" read from your source file.

PROC Star
Posts: 7,363

Error extracting data from txt file

We have a couple of people on the forum who are good with regular expressions.  I'm not, but I would think they would provide a mechanism for what you want to do.

Trusted Advisor
Posts: 1,300

Error extracting data from txt file

What you are saying reminds me of this post:

http://communities.sas.com/message/110917#110917

PROC Star
Posts: 7,363

Error extracting data from txt file

Like I said, regular expressions are still not my forte.  However, that said, will something like the following suffice?

data want;

  format want $8.;

  retain re;

  re = prxparse('/[0-9]/');

  input stuff $50. @;

  x=prxmatch(re,stuff)+1;

  INPUT @x want;

  DATALINES;

THIS LINE HAS XYZ 7 NUMBERS

NONE 6 ON THIS LINE

3 xyz

;

Ask a Question
Discussion stats
  • 9 replies
  • 220 views
  • 6 likes
  • 5 in conversation