Hello,
I am getting the path and file name using a unix command.
Then I would like to get the month (see two string examples below)
/.../prod1010.saa1fm.data/polices_sep2013.sas7bdat |
/.../pa.pxbtxax.sabtxfaa.dec2013.data/pa.pxbtxax.sabtxfaa.data/police.sas7bdat |
In the first example the month is equal to sep while into the second example the month is dec.
I could use the find function as below
data test1;
set text1;
if find(text, 'jan') gt 0 then month='jan'
...
if find(text, 'sep') gt 0 then month='sep'
...
if find(text, 'dec') gt 0 then month='dec'
run;
I wonder if there is a more efficient way to do that if I have the following month:
jan, fev, mar, avr, mai, jun, jui, aou, sep, oct , nov, dec.
Do yourself a favor, represent months as numeric variables, such a 1 2 3 and not as character strings, like 'jan' 'feb', etc. Numbers will sort numerically, while character strings will not sort properly, they will sort alphabetically, and so the first month of the year will be 'apr', which is not what most people want.
Do yourself a double favor, represent months as month and year in a numeric variable, so instead of 1 2 3, you have a representation such as January 2023 as a numeric variable. Make your programming easier by using the built-in SAS handling of dates.
data want;
set have;
datestr1=scan(scan(string,-3,'/'),-2,'.');
datestr2=scan(string,-2,'._');
monthyear=coalesce(input(datestr1,monyy7.),input(datestr2,monyy7.));
format monthyear monyy7.;
run;
Now anything you do with these MONTHYEAR values will sort properly, and still work if the years go to something other than 2013.
General rule you should use from now on: calendar information, and clock information, should ALWAYS be numeric, not character.
What do you mean by BEST?
If you know the PATTERN of the filenames then just pull out the part of the pattern that should contain the date.
In your first example it is the part before the last period.
datestr=scan(filename,-2,'._');
In your second example it appears to be the next to last part of the next to last directory name.
datestr=scan(scan(filename,-3,'/'),-2,'.');
Do yourself a favor, represent months as numeric variables, such a 1 2 3 and not as character strings, like 'jan' 'feb', etc. Numbers will sort numerically, while character strings will not sort properly, they will sort alphabetically, and so the first month of the year will be 'apr', which is not what most people want.
Do yourself a double favor, represent months as month and year in a numeric variable, so instead of 1 2 3, you have a representation such as January 2023 as a numeric variable. Make your programming easier by using the built-in SAS handling of dates.
data want;
set have;
datestr1=scan(scan(string,-3,'/'),-2,'.');
datestr2=scan(string,-2,'._');
monthyear=coalesce(input(datestr1,monyy7.),input(datestr2,monyy7.));
format monthyear monyy7.;
run;
Now anything you do with these MONTHYEAR values will sort properly, and still work if the years go to something other than 2013.
General rule you should use from now on: calendar information, and clock information, should ALWAYS be numeric, not character.
I agree that using real date values is better, but we cannot always control the inputs we get.
Here's another approach that uses regular expressions. I always use https://regex101.com/ as a tool to build and validate a regex before trying to use it at scale in a SAS program with the PRXMATCH function.
data have;
length file $ 50;
infile datalines dsd;
input file;
datalines;
/.../prod1010.saa1fm.data/polices_sep2013.sas7bdat
/.../pa.pxbtxax.sabtxfaa.dec2013.data/pa.pxbtxax.sabtxfaa.data/police.sas7bdat
/.../pa.pxbtxax.sabtxfaa.JAN2013.data/pa.pxbtxax.sabtxfaa.data/police.sas7bdat
no dates here
;
data dates (drop=pos);
set have;
length date 8 month $ 3;
format date monyy7.;
pos = prxmatch('/((jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)+(\d\d\d\d))/i',file);
if pos then do;
date = input(substr(file,pos,7),monyy7.);
month=put(date,monname3.);
end;
run;
Output:
@alepage wrote:
what's the purpose of +(\d\d\d\d))/i into the prxmatch function ?
Since \d is the "search for a digit" instruction, why might one look for 4 digits in a row?
The + is "the preceding subexpression"
i.e. any of the 3-letter abbreviation of month names followed by 4 digits
The / is the close of the pattern started with the previous /
i is case insensitive
As I look at this again a better regex would be:
/((jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\d{4}))/i
Which means:
The regex I supplied in my original answer works, but the above is more concise and limits the match for month name to "exactly one" instead of "one or more" (which is what the + modifier does).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.