BookmarkSubscribeRSS Feed
Pritish
Quartz | Level 8

I am trying to extract data from a text file using SAS 9.2. The problem with the data is there are some unwanted non alphabetic characters (Copyright Symbol, Pie Sign, Function (f) sign, etc..)  in the data which is not allowing me to extract all rows (only 14 top rows out of 100000 records). When I tried to open the data in excel, I found out that there are similar "unwanted" characters on line 10, 11 but still SAS is able to read all those value and load it into the dataset. Can you think of any reason why I am not able to extract data?

I am thinking of using the following alternatives:

1. Using Encoding with my filename statement (less confident that it will solve my issue)

2. Removing few rows from my dataset and trying to extract

Also, Is there a way in SAS to instruct SAS to read values for only few columns (ex: first three ) (I have 60 columns and I am extracting only three columns but still it will read all columns for left to right).

Can you suggest any better approach for this sort of ETL problem?

Code:

filename in "path\fn_nm.txt";

data new_data;

infile in dlm = '09'X lrecl = 360 missover;

attrib Cst_Id length = 7. informat = 7. format = 7.;

attrib Cst_Nm length = $40. informat = $30. format = $30.;

attrib Cst_Add length = $20 information = $20. format = $20.;

input

@1 Cst_Id

@8 Cst_Nm

@Cst_Add

;

run;

Thanks for your time and help!

9 REPLIES 9
art297
Opal | Level 21

While there is probably an easier solution using an encoding option, if you don't want/need those characters to remain in the data, you could always use a pre-processing brute force approach and eliminate any of the undesired characters.  An example of how you could first read the file one character at a time, and eliminate any undesired characters, can be found at: http://communities.sas.com/message/113806#113806.

If you decide to try such an approach, you might want to only keep ASCII characters between 1 and 127.

FriedEgg
SAS Employee

I do not envy you for having to deal with such poorly prepared files.  It will still read the full lines, if you want to avoid that you could read the file through a pipe with something like the cut command.  If you are using windows and have files with lots of nonprintable ascii values add ignoredoseof option if you are terminating reading the file before you would normally expect it to end.  You can use utf-8 encoding to read in the other values you mentioned and you could strip them out later using a variety of methods.

data foo;

infile 'path/to_file.txt' encoding='utf-8' ignoredoseof lrecl=360 truncover;

attrib cst_id length=7 informat=7. format=7.

        cst_nm length=$40 informat=$30. format=7.

         cst_add length=$20 informat=$20. format=$20.;

input;

cst_id=compress(scan(_infile_,1,'09'x),,'kw');

cst_nm=compress(scan(_infile_,2,'09'x),,'kw');

cst_add=compress(scan(_infile_,3,'09'x),,'kw');

run;

Pritish
Quartz | Level 8

Thanks for the reply guys!

I am still working on it and will update you once I have got some results to share. Thanks again.

Tom
Super User Tom
Super User

I have been getting hit by this a lot lately.  Ususally I get a TRANSCODING error in SAS.  This is because SAS is trying to convert the strange hexcodes into the equivalent hexcodes for the character set that you are using.  For example what hexcode represents a copyright symbol could be different for english or japanese.

I have been seeing this where users have attempted to translate to English.  They have left in symbols or other characters that appear fine in their local language version of SAS, but when I try to read then using my English version of SAS it cannot transcode those symbols and it throws errors.

Pritish
Quartz | Level 8

I am still working on the above file. I have a related question and would appreciate your help:

I know that in SAS after the pointer locator '@' we can provide a keyword/occurence of word and read the data from there. For ex:

filename aa '\path\test.txt';

data file;

infile aa;

input

     @1 ID $6.

     @'Monday' .....

My question is instead of reading a character, can i read number/find numbers which begins from either 1-9? (1,2,3,...9)

I tried reading in the following manner:

filename aa '\path\test.txt';

data file;

infile aa;

input

     @1 ID $6.

     @[1-9] .....

But failed. ..

Let me know if you need further information.

ballardw
Super User

You may be able to parse what you are looking for from the SAS automatic variable _infile_ which represents the entire current input "line" read from your source file.

art297
Opal | Level 21

We have a couple of people on the forum who are good with regular expressions.  I'm not, but I would think they would provide a mechanism for what you want to do.

FriedEgg
SAS Employee

What you are saying reminds me of this post:

http://communities.sas.com/message/110917#110917

art297
Opal | Level 21

Like I said, regular expressions are still not my forte.  However, that said, will something like the following suffice?

data want;

  format want $8.;

  retain re;

  re = prxparse('/[0-9]/');

  input stuff $50. @;

  x=prxmatch(re,stuff)+1;

  INPUT @x want;

  DATALINES;

THIS LINE HAS XYZ 7 NUMBERS

NONE 6 ON THIS LINE

3 xyz

;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 926 views
  • 6 likes
  • 5 in conversation