BookmarkSubscribeRSS Feed
SASPhile
Quartz | Level 8


there are text files of varying lengths. I would like to get the first part of files before processing them.
like if there are fles like this:

credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
I only want credit_scores_201501 and mortgage_scores_201501.
There is either one underscore or two underscores before '.' that will give me the file name I'm interested in.

how to read number of underscores in a text file

14 REPLIES 14
SASPhile
Quartz | Level 8


there are text files of varying lengths. I would like to get the first part of files before processing them.
like if there are fles like this:

credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
I only want credit_scores_201501 and mortgage_scores_201501.
There is either one underscore or two underscores before '.' that will give me the file name I'm interested in.

 

 And how to read number of underscores in a text file?

Astounding
PROC Star

Counting the number of underscores is relatively straightforward:

 

n_underscores = countc(filename, '_');

 

That doesn't finish the job though.  Are you comfortable with the rest of it, or do you need further help?

 

This might be easier:

 

new_filename = catx('_', scan(filename, 1, '_'), scan(filename, 2, '_'), scan(filename, 3, '_') );

 

SASPhile
Quartz | Level 8

There are not just three underscores in the file. 

   i wan the file names will have either 8 or 9 underscores. if it has 8 i want the the file name before the 7th underscore. if it has 9 I want  file name before the 8th underscore.

Astounding
PROC Star

OK, working with that idea ...

 

n_underscores = countc(filename, '_');

 

new_filename = scan(filename, 1, '_');

 

do i=2 to n_underscores-1;

   new_filename = catx('_', new_filename, scan(filename, i, '_') );

end;

 

If I'm understanding you correctly, this does it.  But there is a possiblity you may need to change n_underscores-1 to n_underscores-2.

SASPhile
Quartz | Level 8

Thanks!

but the idea is to output the file name that is either one  underscores or two underscores before '.' based on the length of file name.

 

Astounding
PROC Star

Now I'm confused.  Are you saying that you want only the date piece, or all the pieces up to and including the date piece?

 

SASPhile
Quartz | Level 8

all the pieces up to and including the date piece

Astounding
PROC Star

Doesn't my code in message #4 do exactly that?

SASPhile
Quartz | Level 8

Its reading the first three words. 

Astounding
PROC Star

Yes, my original response reads three words.  But my second response in message #4 should read more than that.

mkeintz
PROC Star

If sometimes you remove two underscore-preceded "words" and sometimes you remove one, then getting a count of underscores would not seem to be the main issue.  It's coming up with a criterion to stop removing words.

 

It appears that you really want to remove trailing words until the last word is all numeric.  If that's the case then you want to see if the NOTDIGIT function applied to that last word returns a zero,  i.e.

 

Keep removing last word until    ( NOTDIGIT(last word of phrase)=0).

 

Getting the last underscore-separated word (trimmed of trailing blanks) is just the function   trim(SCAN(phrase,-1,'_')).  So the stopping criterion is:

     until (NOTDIGIT(trim(scan(phrase,-1,'_')))=0)

 

This program uses that approach:

 

data have;
  input phrase $40. ;
datalines;
credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
run;

data want;
  set have;
  length rslt $40;
  rslt=phrase;
  do until (notdigit(trim(scan(rslt,-1,'_')))=0);
    rslt=tranwrd(rslt,cats('_',scan(rslt,-1,'_')),'');
  end;
  put rslt=;
run;

 

The tranwrd function takes the full phrase in RSLT and "translates" the last word (including leading underscore) to a blank - i.e. it removes the last word.

 

This needs no further modification unless you expecte a trailing word you wish to remove to be exactly duplicated in some earlier word (i.e. if you had a phrase like "mortgage_scores_usa_201501_usa_st.txt").  If that were the case, then - inside the TRANWRD function - temporarily make the last word absolutely unique by appending something like an exclamation to it, and remove it in the TRANWRD function, as in

         rslt=tranwrd(cats(rslt,'!'),cats('_',scan(rslt,-1,'_'),'!'),'');

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20

2 possible one-liners:

 

data WANT;
  input NAME : $30.;
  SHORT=substr(NAME,1,anydigit(NAME)+5);
  putlog SHORT=;
  SHORT=prxchange('s/(.*\d{6}).*/$1/',1,NAME);
  putlog SHORT=;
cards;
credit_scores_201501_usa.txt
mortgage_score_s_201501_usa_st.txt
run;

SHORT=credit_scores_201501
SHORT=credit_scores_201501
SHORT=mortgage_score_s_201501
SHORT=mortgage_score_s_201501

 

ChrisNZ
Tourmaline | Level 20

If you want to ensure that the date is surrounded by underscores:

 

  SHORT=prxchange('s/(.*_\d{6})_.*/$1/',1,NAME);

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 4191 views
  • 1 like
  • 5 in conversation