deleted_user
Not applicable
I have raw data like the following:

'aaaa' = 'aaaaabbbbbccccddd'
'bbbb' = 'aaabbbcccdddd'

I have over 2000 such lines of data. I need to store each line in a variable so that I can remove the duplicates with a PROC SQL SELECT DISTINCT statement. Or is there a better way to remove duplicates?

thanks
DanielSantos
Barite | Level 11
Two questions here.

First, to read raw data, just use the standard file-reading features of the DATA step.

See the online documentation:

INFILE statement: http://support.sas.com/documentation/cdl/en/lrdict/61724/HTML/default/a000146932.htm

INPUT statement: http://support.sas.com/documentation/cdl/en/imlug/59656/HTML/default/langref_sect141.htm

And yes, from my point of view, SELECT DISTINCT or PROC SORT NODUPKEY will be the best way to remove duplicates. For both, you'll have to sort out question 1 first, i.e. get the raw lines into a dataset.
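
A minimal sketch of both approaches, assuming the lines already sit in a character variable. The dataset names RAW, DEDUP_SQL and DEDUP_SORT, the variable name LINE and its length of 50 are placeholders chosen for illustration, not anything from the original post:

```sas
/* Hypothetical stand-in for the real data: one record per raw line */
data raw;
    length line $50;                              /* assumed length */
    line = "'aaaa' = 'aaaaabbbbbccccddd'"; output;
    line = "'bbbb' = 'aaabbbcccdddd'";     output;
    line = "'aaaa' = 'aaaaabbbbbccccddd'"; output; /* duplicate of the first */
run;

/* Option 1: PROC SQL keeps one copy of each distinct value of LINE */
proc sql;
    create table dedup_sql as
    select distinct line
    from raw;
quit;

/* Option 2: PROC SORT with NODUPKEY drops records that repeat the BY key */
proc sort data=raw out=dedup_sort nodupkey;
    by line;
run;
```

With the whole line as the only variable and the only BY key, the two options should produce the same set of rows.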

Cheers from Portugal.

Daniel Santos @ www.cgd.pt.
deleted_user
Not applicable
Hi,

Thanks for the input. The problem I am having right now is storing the entire line of data in one single variable. With the raw data in the above format, I can only store whatever comes before the first space.

thanks
DanielSantos
Barite | Level 11
Have you tried using the infile buffer variable?

Something like this:

data _null_;
infile myfile.

input; * read one line;
put _infile_; * display _infile_ buffer;

run;
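
To keep each record instead of only printing it, the buffer can be copied into an ordinary variable. This is a sketch under assumptions: the dataset name LINES, the variable name LINE and the length of 200 are made up, and in-stream DATALINES stand in for the real file so the step runs on its own:

```sas
data lines;
    infile datalines truncover;  /* stand-in for the real INFILE statement */
    length line $200;            /* assumed upper bound on record length */
    input;                       /* load the next record into _INFILE_ */
    line = _infile_;             /* copy the whole record, embedded blanks included */
    datalines;
'aaaa' = 'aaaaabbbbbccccddd'
'bbbb' = 'aaabbbcccdddd'
;
run;
```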

Cheers from Portugal.

Daniel Santos @ www.cgd.pt.
deleted_user
Not applicable
Hi,

Thanks again for the input. I read through some documentation about the infile buffer variable, but I am still not quite sure how it works. In your example, is there supposed to be a period after the word myfile? Does put _infile_ display the _infile_ buffer and store it in the variable myfile?

Also, I defined the length of the variable like this:

input line $ 150.;

This actually reads each line and stores the entire line in the variable line. This method works too, right?

thanks
DanielSantos
Barite | Level 11
Hi cosmid.

You are right, there is not supposed to be a period after myfile. 🙂
I am sorry, that was a typo.

When dealing with complex text parsing, I always find it better to access the automatic _INFILE_ buffer variable. Being a buffer, there's no need to pre-allocate its maximum size (as you should do with a variable), and it will hold exactly the record that was retrieved from the file.
The benefit? Say I just want to parse the line and retrieve some 4-char code placed somewhere in the middle. Using _INFILE_, there's no need to pre-allocate a "large-enough" variable to hold the line; I just have to process the _INFILE_ automatic variable and extract what I need from it.
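
As a rough illustration of that idea, the step below pulls the 4-char key out of each sample line straight from the buffer. The dataset name CODES and variable name CODE are hypothetical, in-stream DATALINES stand in for the real file, and the use of SCAN with a single quote as delimiter is just one possible way to do the extraction:

```sas
data codes;
    infile datalines;
    length code $4;                  /* only the small target variable is declared */
    input;                           /* record is now available in _INFILE_ */
    code = scan(_infile_, 1, "'");   /* first token between single quotes */
    datalines;
'aaaa' = 'aaaaabbbbbccccddd'
'bbbb' = 'aaabbbcccdddd'
;
run;
```

Only CODE is stored in the output dataset; the full record is never copied into a long variable.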

Check the paper of Howard Schreier about the _INFILE_ var:
http://www.nesug.org/Proceedings/nesug01/cc/cc4018bw.pdf

Now, unless each line has exactly 150 chars OR you specified the TRUNCOVER option in the INFILE statement, I wouldn't use:

input line $ 150.;

Instead, try this:

length LINE $150;
input LINE;

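Here, up to 150 chars are read into the LINE variable. Be aware that this is list input, so reading stops at the first blank; for lines with embedded blanks, copy _INFILE_ into LINE after a bare INPUT, or use formatted input together with TRUNCOVER.
Of course, if no line has more than 150 chars, no truncation will occur and every line will be processed entirely.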
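
For reference, a sketch of the TRUNCOVER variant mentioned above, again with in-stream DATALINES standing in for the real file and LINES / LINE as placeholder names. The $CHAR150. informat is my choice for the example; it keeps leading blanks, whereas $150. would trim them:

```sas
data lines;
    infile datalines truncover;  /* short records are not continued on the next line */
    input line $char150.;        /* formatted input: blanks do not end the value */
    datalines;
'aaaa' = 'aaaaabbbbbccccddd'
'bbbb' = 'aaabbbcccdddd'
;
run;
```

Without TRUNCOVER, the default behaviour (FLOWOVER) makes INPUT move on to the next record when the current one is shorter than the informat width, which is why the option matters here.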

Hope this helps.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt
