DATA Step, Macro, Functions and more

read underscore

Reply
Super Contributor
Posts: 647

read underscore


there are text files of varying lengths. I would like to get the first part of files before processing them.
like if there are fles like this:

credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
I only want credit_scores_201501 and mortgage_scores_201501.
There is either one underscore or two underscores before '.' that will give me the file name I'm interested in.

how to read number of underscores in a text file

Super Contributor
Posts: 647

read underscores in text file


there are text files of varying lengths. I would like to get the first part of files before processing them.
like if there are fles like this:

credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
I only want credit_scores_201501 and mortgage_scores_201501.
There is either one underscore or two underscores before '.' that will give me the file name I'm interested in.

 

 And how to read number of underscores in a text file?

Super User
Posts: 5,079

Re: read underscore

[ Edited ]

Counting the number of underscores is relatively straightforward:

 

n_underscores = countc(filename, '_');

 

That doesn't finish the job though.  Are you comfortable with the rest of it, or do you need further help?

 

This might be easier:

 

new_filename = catx('_', scan(filename, 1, '_'), scan(filename, 2, '_'), scan(filename, 3, '_') );

 

Super Contributor
Posts: 647

Re: read underscore

There are not just three underscores in the file. 

   i wan the file names will have either 8 or 9 underscores. if it has 8 i want the the file name before the 7th underscore. if it has 9 I want  file name before the 8th underscore.

Super User
Posts: 5,079

Re: read underscore

OK, working with that idea ...

 

n_underscores = countc(filename, '_');

 

new_filename = scan(filename, 1, '_');

 

do i=2 to n_underscores-1;

   new_filename = catx('_', new_filename, scan(filename, i, '_') );

end;

 

If I'm understanding you correctly, this does it.  But there is a possiblity you may need to change n_underscores-1 to n_underscores-2.

Super Contributor
Posts: 647

Re: read underscore

Thanks!

but the idea is to output the file name that is either one  underscores or two underscores before '.' based on the length of file name.

 

Super User
Posts: 5,079

Re: read underscore

Now I'm confused.  Are you saying that you want only the date piece, or all the pieces up to and including the date piece?

 

Super Contributor
Posts: 647

Re: read underscore

all the pieces up to and including the date piece

Super User
Posts: 5,079

Re: read underscore

Doesn't my code in message #4 do exactly that?

Super Contributor
Posts: 647

Re: read underscore

Its reading the first three words. 

Super User
Posts: 5,079

Re: read underscore

Yes, my original response reads three words.  But my second response in message #4 should read more than that.

Super User
Posts: 6,928

Re: read underscore

countw(string,'_') - 1

will give you the number of underscores in string.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Valued Guide
Posts: 797

Re: read underscore

If sometimes you remove two underscore-preceded "words" and sometimes you remove one, then getting a count of underscores would not seem to be the main issue.  It's coming up with a criterion to stop removing words.

 

It appears that you really want to remove trailing words until the last word is all numeric.  If that's the case then you want to see if the NOTDIGIT function applied to that last word returns a zero,  i.e.

 

Keep removing last word until    ( NOTDIGIT(last word of phrase)=0).

 

Getting the last underscore-separated word (trimmed of trailing blanks) is just the function   trim(SCAN(phrase,-1,'_')).  So the stopping criterion is:

     until (NOTDIGIT(trim(scan(phrase,-1,'_')))=0)

 

This program uses that approach:

 

data have;
  input phrase $40. ;
datalines;
credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
run;

data want;
  set have;
  length rslt $40;
  rslt=phrase;
  do until (notdigit(trim(scan(rslt,-1,'_')))=0);
    rslt=tranwrd(rslt,cats('_',scan(rslt,-1,'_')),'');
  end;
  put rslt=;
run;

 

The tranwrd function takes the full phrase in RSLT and "translates" the last word (including leading underscore) to a blank - i.e. it removes the last word.

 

This needs no further modification unless you expecte a trailing word you wish to remove to be exactly duplicated in some earlier word (i.e. if you had a phrase like "mortgage_scores_usa_201501_usa_st.txt").  If that were the case, then - inside the TRANWRD function - temporarily make the last word absolutely unique by appending something like an exclamation to it, and remove it in the TRANWRD function, as in

         rslt=tranwrd(cats(rslt,'!'),cats('_',scan(rslt,-1,'_'),'!'),'');

 

PROC Star
Posts: 1,561

Re: read underscore

[ Edited ]

2 possible one-liners:

 

data WANT;
  input NAME : $30.;
  SHORT=substr(NAME,1,anydigit(NAME)+5);
  putlog SHORT=;
  SHORT=prxchange('s/(.*\d{6}).*/$1/',1,NAME);
  putlog SHORT=;
cards;
credit_scores_201501_usa.txt
mortgage_score_s_201501_usa_st.txt
run;

SHORT=credit_scores_201501
SHORT=credit_scores_201501
SHORT=mortgage_score_s_201501
SHORT=mortgage_score_s_201501

 

PROC Star
Posts: 1,561

Re: read underscore

If you want to ensure that the date is surrounded by underscores:

 

  SHORT=prxchange('s/(.*_\d{6})_.*/$1/',1,NAME);
Ask a Question
Discussion stats
  • 14 replies
  • 254 views
  • 1 like
  • 5 in conversation