I'm a C# programmer by trade and am adept at coding, but I know very little about SAS.
I need to edit a simple text file using SAS. The file is not very large (~2,000 lines), and I think I need to: (1) inspect every line, (2) update/alter about 30% of the lines, (3) overwrite the original file with the new content.
My [values.sas] file looks like this:
VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = '1 HLA-identical sibling'
2 = '2 Syngeneic'
3 = '3 Related Donor, match status unknown'
4 = '4 Matched Unrelated Donor (MUD)'
31 = '31 HLA-matched other relative'
32 = '32 HLA-mismatched relative'
41 = '41 HLA matched unrelated'
42 = '42 HLA mismatched unrelated'
;
But I need the [values.sas] file to look like this:
VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = 'HLA-identical sibling'
2 = 'Syngeneic'
3 = 'Related Donor, match status unknown'
4 = 'Matched Unrelated Donor (MUD)'
31 = 'HLA-matched other relative'
32 = 'HLA-mismatched relative'
41 = 'HLA matched unrelated'
42 = 'HLA mismatched unrelated'
;
{If the leading text portion matches the value to the left of the '=', then remove the leading portion and any leading/trailing spaces between the ' '}
Any assistance you can provide would be greatly appreciated.
filename in 'c:\temp\values.sas'; /*original sas code*/
filename out 'c:\temp\new_values.sas'; /*processed sas code*/
data _null_;
infile in;
file out;
input;
if first(left(_infile_)) in ('.' '0' '1' '2' '3' '4' '5' '6' '7' '8' '9')
and findc(_infile_,"=") and findc(_infile_,"'") then
_infile_=prxchange("s/'\d+\s+/'/",1,_infile_);
put _infile_;
run;
If this is a one-time operation you can use regular expressions in your favorite code editor. SAS Enterprise Guide supports this, as does Notepad++ or VS Code.
In SAS Enterprise Guide, load the file and click Ctrl+H (find and replace). Select Regular expression support.
The pattern I discerned from your sample is:
'\d+\s+
Replace with just single quote
'
I'd still be tempted to just insert it and it'll overwrite theirs.
How is the data you have stored to update the file? Because there could be multiple values of 31/32 for example? How do you know to replace the right format value? I suspect doing line by line could be dangerous as well....
Ultimately I would redesign the process to generate a data set instead of text file for the format and update that.
All of Reeza's cautions are well-informed -- definitely potential pitfalls in the whole scheme. However, if the pattern of change is in fact as you describe -- in C# I'd be doing RegEx.Replace, so in SAS I'd be looking at prxchange() function.
You could use the SCAN function to grab the "before =" sign bit and then use transwrd to trim that bit out of the right side of the assignment. A little more klunky but could work.
The actual data in the dataset is good and the formats values have their quirks, but it's clean with no redundancies.
Here are some quirks: the values appear in the text label wrapped in parenthesis. There is a <tab> in the #4 label as well.
VALUE PTTXTYPE
1 = '(1) Allo only'
2 = '(2) Auto only'
3 = '(3) Allo transplant(s) then Auto(s)'
4 = '(4) Auto transplant(s) then by Allo(s)'
5 = '(5) Allo transplant(s) then DCI'
6 = '(6) Auto transplant(s) then Allo(s) then DCI'
;
VALUE TEDVSRES
1 = '1 TED'
2 = '2 CRF (RES)'
3 = '3 moved CRF (RES) to TED'
4 = '4 moved TED to CRF (RES)'
5 = '5 moved CTRM to TED'
6 = '6 Cellular Therapy Regenerative Medicine (CTRM, CTR)'
7 = '7 Auto no Consent'
8 = '8 moved TED to CTRM'
;
Ultimately, I won't be able to clean up _every_ text label - but if I can handle 80-90% with a simple script I have more clout to go back to the data team and insist they manually clean up the remaining values upstream.
Ok. Here's what I would do.
Take your code, add the PROC FORMAT at the start if it's not already there.
Give that to them. It will run and overwrite any older formats and the log will say such. Then the correct formats are in the data.
Doing it every day will work as well.
You cannot do line by line replacement here, you'd have to do it by each VALUE/INVALUE statement so PRX isn't workable either.
Nothing else required. You should modify the process but that seems like a battle for another day.
I love it - but what happens when they add new variables and values to their formats catalog?
Wouldn't my [clean] file just completely overwrite theirs - implying I would lose any new formats they've submitted/created?
I have no problem creating a process to "clean" their formats file periodically, but I cannot justify a daily task comparing their file to my file to see if anything new has been added.
@tdegen wrote:
I love it - but what happens when they add new variables and values to their formats catalog?
Wouldn't my [clean] file just completely overwrite theirs - implying I would lose any new formats they've submitted/created?
I have no problem creating a process to "clean" their formats file periodically, but I cannot justify a daily task comparing their file to my file to see if anything new has been added.
You don't lose any formats, only ones that are duplicate would be overwritten. If they update with new ones that your process does not have, theirs will remain. If yours is out of date, that will be true regardless of what approach you use, writing over the text file or running the format.
68 69 *their code runs; 70 proc format; 71 72 VALUE TED_DONOR 73 .B = 'Unknown' 74 .C = 'Not answ/Inconsist' 75 1 = '1 HLA-identical sibling' 76 2 = '2 Syngeneic' 77 3 = '3 Related Donor, match status unknown' 78 4 = '4 Matched Unrelated Donor (MUD)' 79 31 = '31 HLA-matched other relative' 80 32 = '32 HLA-mismatched relative' 81 41 = '41 HLA matched unrelated' 82 42 = '42 HLA mismatched unrelated' 83 ; NOTE: Format TED_DONOR has been output. 84 85 *your code runs; NOTE: PROCEDURE FORMAT used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 157.15k OS Memory 23456.00k Timestamp 08/25/2022 08:40:03 PM Step Count 24 Switch Count 2 Page Faults 0 Page Reclaims 79 Page Swaps 0 Voluntary Context Switches 14 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 56 86 proc format; 87 VALUE TED_DONOR 88 .B = 'Unknown' 89 .C = 'Not answ/Inconsist' 90 1 = 'HLA-identical sibling' 91 2 = 'Syngeneic' 92 3 = 'Related Donor, match status unknown' 93 4 = 'Matched Unrelated Donor (MUD)' 94 31 = 'HLA-matched other relative' 95 32 = 'HLA-mismatched relative' 96 41 = 'HLA matched unrelated' 97 42 = 'HLA mismatched unrelated' 98 ; NOTE: Format TED_DONOR is already on the library WORK.FORMATS. NOTE: Format TED_DONOR has been output. 99 run; NOTE: PROCEDURE FORMAT used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 102.50k OS Memory 23456.00k Timestamp 08/25/2022 08:40:03 PM Step Count 25 Switch Count 0 Page Faults 0 Page Reclaims 16 Page Swaps 0 Voluntary Context Switches 0 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 0
Perhaps you might be interested in a slightly different approach.
You can create a data set of the format definitions using Proc format with the Cntlout=<your dataset name goes here>.
Then use a data step to modify the LABEL values then recreate the formats using that set as a CNTLIN data set with Proc Format.
But the real question would why "values.sas file is regenerated every time the dataset is refreshed". Actually refreshed isn't even of concern, at least personally. If the desired contents of the formats as applied don't change you could keep the CNTLIN data set created above and update the Format catalogs used without having to check anything.
...and how would you do that? Do you have a sample code you can share?
As I stated in the intro, I'm a c# developer and I know _very_ little about SAS scripts.
@tdegen wrote:
...and how would you do that? Do you have a sample code you can share?
As I stated in the intro, I'm a c# developer and I know _very_ little about SAS scripts.
Here is a likely too simple example of creating the formats that SAS would use, placing them into a data set and then manipulating them.
This assumes that you place the formats into a library on your FMTSEARCH path, a setting that tells SAS where to look for formats to use, in this case the very original YOURLIB. IF your Proc Format code has executed you could just use the Proc Format code without the definitions to create the data set from the already executed format code.
Then manipulate the data set to clean up the labels as desired. I picked a simple example for proof of concept. My code below uses COMPBL to remove multiple blanks because when pasted into my editor that "tab" character is no longer there and there are multiple spaces so COMPBL is appropriate to leave one blank instead of multiple. Then the last proc format uses the modified data set to recreate the formats to use.
Some caveats: You may need to look at other variables in the Cntlset created if you have numeric and character formats of the same name, Type=N is numeric, C is character to select the approriate one. If you have INFORMATS then Type is I and J.
After you have the labels looking as desired you would run the the last proc format (suggestion: use cntlout to place in the project library a result of the newly applied format definitions).
The fun part of this is if the actual formats do not /should not change but that file just getting updated and "corrupting" the desired appearance you just add the script with the proc format using the modified (correct) cntlset. That would apply the current corrected definitions to the formats when used. NEW formats would not be modified, so you could check for those.
I included one way to provide conditional rules for what gets done to each example format. Partially to show one way to address values conditionally. The Start variable in the control data set is always text and numeric values may require some manipulation to select the correct one.
Strong suggestion if it is not done elsewhere, each project really should have its own library and a format catalog in that library. That way you don't have someone accidentally changing a format during use.
proc format ; VALUE TED_DONOR .B = 'Unknown' .C = 'Not answ/Inconsist' 1 = '1 HLA-identical sibling' 2 = '2 Syngeneic' 3 = '3 Related Donor, match status unknown' 4 = '4 Matched Unrelated Donor (MUD)' 31 = '31 HLA-matched other relative' 32 = '32 HLA-mismatched relative' 41 = '41 HLA matched unrelated' 42 = '42 HLA mismatched unrelated' ; VALUE PTTXTYPE 1 = '(1) Allo only' 2 = '(2) Auto only' 3 = '(3) Allo transplant(s) then Auto(s)' 4 = '(4) Auto transplant(s) then by Allo(s)' 5 = '(5) Allo transplant(s) then DCI' 6 = '(6) Auto transplant(s) then Allo(s) then DCI' ; VALUE tedvsres 1 = '1 TED' 2 = '2 CRF (RES)' 3 = '3 moved CRF (RES) to TED' 4 = '4 moved TED to CRF (RES)' 5 = '5 moved CTRM to TED' 6 = '6 Cellular Therapy Regenerative Medicine (CTRM, CTR)' 7 = '7 Auto no Consent' 8 = '8 moved TED to CTRM' ; run;
/* this is the bit that just creates the output set*/ proc format cntlout=cntlset; run; /* REAL crude modification code*/ data newcntlset; set cntlset; select (fmtname); when ('PTTXTYPE') label=substr(label,5); when ('TEDVSRES') label=substr(label,3); when ('TED_DONOR') select (strip(start)); when ('.B','.C') ; when ('1','2','3','4') label=substr(label,3); otherwise label=substr(label,4); end; otherwise;/* something that might be done to ALL non- listed Fmtnames OR avoid any change*/ end; label=compbl(label); run; proc format libname=yourlib cntlin=newcntlset; run;
Based on Reeza's response I think I misunderstood the requirement -- you're using a SAS program to rewrite the text of SAS program?
If so, then RegEx is still a good option (prx* functions in SAS). I use https://regexr.com/ to build and test the expressions before deploying in a SAS function call.
So the goal is to convert a useful format (one that actually shows the underlying value) into a less useful format?
Why not just leave the code alone and modify the format it creates? Whether or not that works in your case depends on what other code is in the program file. If all it is doing is creating the format, but not actually producing any printouts using the format then that would work fine.
Also why not modify the program that is generating that program to not include the value in the label?
To read a file and modify it use a simple data step. For this problem I would hunt for the beginning of the VALUE statement and then modify lines until the end of the value statement is found.
First let's make an example file from your given input:
filename in temp;
options parmcards=in;
parmcards4;
header stuff
VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = '1 HLA-identical sibling'
2 = '2 Syngeneic'
3 = '3 Related Donor, match status unknown'
4 = '4 Matched Unrelated Donor (MUD)'
31 = '31 HLA-matched other relative'
32 = '32 HLA-mismatched relative'
41 = '41 HLA matched unrelated'
42 = '42 HLA mismatched unrelated'
;
trailer stuff
;;;;
Now we can read that IN file and write to an OUT file.
For testing let's write to a temporary file also.
filename out temp;
data _null_;
infile in truncover dsd dlm=' ' end=eof;
file out ;
length x1-x3 $200 ;
input x1-x3 ;
if found then do;
if x1=';' and x2=' ' then found=0;
else do;
x4 = scan(x3,1,' ');
if not missing(input(x4,??32.)) then do;
x3=left(substrn(x3,length(x4)+1));
_infile_=catx(' ',x1,x2,quote(trim(x3),"'"));
end;
end;
end;
else if x1='VALUE' and x2='TED_DONOR' then found=1;
retain found;
put _infile_;
run;
So running this we get a log like this
NOTE: The infile IN is: Filename=...\#LN00060, RECFM=V,LRECL=32767,File Size (bytes)=380, Last Modified=25Aug2022:16:27:35, Create Time=25Aug2022:16:27:35 NOTE: The file OUT is: Filename=...\#LN00061, RECFM=V,LRECL=32767,File Size (bytes)=0, Last Modified=25Aug2022:16:27:35, Create Time=25Aug2022:16:27:35 NOTE: 18 records were read from the infile IN. The minimum record length was 0. The maximum record length was 43. NOTE: 18 records were written to the file OUT. The minimum record length was 0. The maximum record length was 41. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
Let's look at the results:
12 data _null_; 513 infile out; 514 input; 515 put _infile_; 516 run; NOTE: The infile OUT is: Filename=...\#LN00061, RECFM=V,LRECL=32767,File Size (bytes)=356, Last Modified=25Aug2022:16:27:35, Create Time=25Aug2022:16:27:35 header stuff VALUE TED_DONOR .B = 'Unknown' .C = 'Not answ/Inconsist' 1 = 'HLA-identical sibling' 2 = 'Syngeneic' 3 = 'Related Donor, match status unknown' 4 = 'Matched Unrelated Donor (MUD)' 31 = 'HLA-matched other relative' 32 = 'HLA-mismatched relative' 41 = 'HLA matched unrelated' 42 = 'HLA mismatched unrelated' ; trailer stuff NOTE: 18 records were read from the infile OUT. The minimum record length was 0. The maximum record length was 41. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.