there are text files of varying lengths. I would like to get the first part of files before processing them.
like if there are fles like this:
credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
I only want credit_scores_201501 and mortgage_scores_201501.
There is either one underscore or two underscores before '.' that will give me the file name I'm interested in.
how to read number of underscores in a text file
there are text files of varying lengths. I would like to get the first part of files before processing them.
like if there are fles like this:
credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
I only want credit_scores_201501 and mortgage_scores_201501.
There is either one underscore or two underscores before '.' that will give me the file name I'm interested in.
And how to read number of underscores in a text file?
Counting the number of underscores is relatively straightforward:
n_underscores = countc(filename, '_');
That doesn't finish the job though. Are you comfortable with the rest of it, or do you need further help?
This might be easier:
new_filename = catx('_', scan(filename, 1, '_'), scan(filename, 2, '_'), scan(filename, 3, '_') );
There are not just three underscores in the file.
i wan the file names will have either 8 or 9 underscores. if it has 8 i want the the file name before the 7th underscore. if it has 9 I want file name before the 8th underscore.
OK, working with that idea ...
n_underscores = countc(filename, '_');
new_filename = scan(filename, 1, '_');
do i=2 to n_underscores-1;
new_filename = catx('_', new_filename, scan(filename, i, '_') );
end;
If I'm understanding you correctly, this does it. But there is a possiblity you may need to change n_underscores-1 to n_underscores-2.
Thanks!
but the idea is to output the file name that is either one underscores or two underscores before '.' based on the length of file name.
Now I'm confused. Are you saying that you want only the date piece, or all the pieces up to and including the date piece?
all the pieces up to and including the date piece
Doesn't my code in message #4 do exactly that?
Its reading the first three words.
Yes, my original response reads three words. But my second response in message #4 should read more than that.
countw(string,'_') - 1
will give you the number of underscores in string.
If sometimes you remove two underscore-preceded "words" and sometimes you remove one, then getting a count of underscores would not seem to be the main issue. It's coming up with a criterion to stop removing words.
It appears that you really want to remove trailing words until the last word is all numeric. If that's the case then you want to see if the NOTDIGIT function applied to that last word returns a zero, i.e.
Keep removing last word until ( NOTDIGIT(last word of phrase)=0).
Getting the last underscore-separated word (trimmed of trailing blanks) is just the function trim(SCAN(phrase,-1,'_')). So the stopping criterion is:
until (NOTDIGIT(trim(scan(phrase,-1,'_')))=0)
This program uses that approach:
data have;
input phrase $40. ;
datalines;
credit_scores_201501_usa.txt
mortgage_scores_201501_usa_st.txt
run;
data want;
set have;
length rslt $40;
rslt=phrase;
do until (notdigit(trim(scan(rslt,-1,'_')))=0);
rslt=tranwrd(rslt,cats('_',scan(rslt,-1,'_')),'');
end;
put rslt=;
run;
The tranwrd function takes the full phrase in RSLT and "translates" the last word (including leading underscore) to a blank - i.e. it removes the last word.
This needs no further modification unless you expecte a trailing word you wish to remove to be exactly duplicated in some earlier word (i.e. if you had a phrase like "mortgage_scores_usa_201501_usa_st.txt"). If that were the case, then - inside the TRANWRD function - temporarily make the last word absolutely unique by appending something like an exclamation to it, and remove it in the TRANWRD function, as in
rslt=tranwrd(cats(rslt,'!'),cats('_',scan(rslt,-1,'_'),'!'),'');
2 possible one-liners:
data WANT;
input NAME : $30.;
SHORT=substr(NAME,1,anydigit(NAME)+5);
putlog SHORT=;
SHORT=prxchange('s/(.*\d{6}).*/$1/',1,NAME);
putlog SHORT=;
cards;
credit_scores_201501_usa.txt
mortgage_score_s_201501_usa_st.txt
run;
SHORT=credit_scores_201501
SHORT=credit_scores_201501
SHORT=mortgage_score_s_201501
SHORT=mortgage_score_s_201501
If you want to ensure that the date is surrounded by underscores:
SHORT=prxchange('s/(.*_\d{6})_.*/$1/',1,NAME);
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.