Solved: Editing a simple text file (used for a Formats Catalog)

tdegen · Posted 08-25-2022 03:40 PM

I'm a C# programmer by trade and am adept at coding, but I know very little about SAS.

I need to edit a simple text file using SAS. The file is not very large (~2,000 lines), and I think I need to: (1) inspect every line, (2) update/alter about 30% of the lines, (3) overwrite the original file with the new content.

My [values.sas] file looks like this:

VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = '1 HLA-identical sibling'
2 = '2 Syngeneic'
3 = '3 Related Donor, match status unknown'
4 = '4     Matched Unrelated Donor (MUD)'
31 = '31 HLA-matched other relative'
32 = '32 HLA-mismatched relative'
41 = '41 HLA matched unrelated'
42 = '42 HLA mismatched unrelated'
;

But I need the [values.sas] file to look like this:

VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = 'HLA-identical sibling'
2 = 'Syngeneic'
3 = 'Related Donor, match status unknown'
4 = 'Matched Unrelated Donor (MUD)'
31 = 'HLA-matched other relative'
32 = 'HLA-mismatched relative'
41 = 'HLA matched unrelated'
42 = 'HLA mismatched unrelated'
;

{If the leading text portion matches the value to the left of the '=', then remove the leading portion and any leading/trailing spaces between the ' '}

Any assistance you can provide would be greatly appreciated.

Ksharp · Posted 08-26-2022 07:51 AM

filename in 'c:\temp\values.sas';      /*original sas code*/  
filename out 'c:\temp\new_values.sas'; /*processed sas code*/

data _null_;
infile in;
file out;
input;
if first(left(_infile_)) in ('.' '0' '1' '2' '3' '4' '5' '6' '7' '8' '9')
   and findc(_infile_,"=") and findc(_infile_,"'") then 
_infile_=prxchange("s/'\d+\s+/'/",1,_infile_);
put _infile_;
run;

View solution in original post

ChrisHemedinger · Posted 08-25-2022 03:53 PM

If this is a one-time operation you can use regular expressions in your favorite code editor. SAS Enterprise Guide supports this, as does Notepad++ or VS Code.

In SAS Enterprise Guide, load the file and click Ctrl+H (find and replace). Select Regular expression support.

The pattern I discerned from your sample is:

'\d+\s+

Replace with just single quote

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!

tdegen · Posted 08-25-2022 04:00 PM

I *wish* it were a one-time event. I'm confident I could do this in C# in 20 minutes.

I'm afraid this values.sas file is regenerated every time the dataset is refreshed. The team that refreshes the data will run any sas script I give them as part of the file rename/cleanup process - but it has to be a sas script.

On the surface this is just simple text manipulation and should be easy - but I know so woefully little about SAS I'm not even sure where to begin.

Reeza · Posted 08-25-2022 04:04 PM

I'd still be tempted to just insert it and it'll overwrite theirs.

How is the data you have stored to update the file? Because there could be multiple values of 31/32 for example? How do you know to replace the right format value? I suspect doing line by line could be dangerous as well....

Ultimately I would redesign the process to generate a data set instead of text file for the format and update that.

ChrisHemedinger · Posted 08-25-2022 04:11 PM

All of Reeza's cautions are well-informed -- definitely potential pitfalls in the whole scheme. However, if the pattern of change is in fact as you describe -- in C# I'd be doing RegEx.Replace, so in SAS I'd be looking at prxchange() function.

You could use the SCAN function to grab the "before =" sign bit and then use transwrd to trim that bit out of the right side of the assignment. A little more klunky but could work.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!

tdegen · Posted 08-25-2022 04:19 PM

The actual data in the dataset is good and the formats values have their quirks, but it's clean with no redundancies.

Here are some quirks: the values appear in the text label wrapped in parenthesis. There is a <tab> in the #4 label as well.

VALUE PTTXTYPE
1 = '(1) Allo only'
2 = '(2) Auto only'
3 = '(3) Allo transplant(s) then Auto(s)'
4 = '(4) Auto transplant(s) then by Allo(s)'
5 = '(5) Allo transplant(s) then DCI'
6 = '(6) Auto transplant(s) then Allo(s) then DCI'
;
VALUE TEDVSRES
1 = '1 TED'
2 = '2 CRF (RES)'
3 = '3 moved CRF (RES) to TED'
4 = '4 		moved TED to CRF (RES)'
5 = '5 moved CTRM to TED'
6 = '6 Cellular Therapy Regenerative Medicine (CTRM, CTR)'
7 = '7 Auto no Consent'
8 = '8 moved TED to CTRM'
;

Ultimately, I won't be able to clean up _every_ text label - but if I can handle 80-90% with a simple script I have more clout to go back to the data team and insist they manually clean up the remaining values upstream.

Reeza · Posted 08-25-2022 04:23 PM

Ok. Here's what I would do.

Take your code, add the PROC FORMAT at the start if it's not already there.

Give that to them. It will run and overwrite any older formats and the log will say such. Then the correct formats are in the data.

Doing it every day will work as well.

You cannot do line by line replacement here, you'd have to do it by each VALUE/INVALUE statement so PRX isn't workable either.

Nothing else required. You should modify the process but that seems like a battle for another day.

tdegen · Posted 08-25-2022 04:29 PM

I love it - but what happens when they add new variables and values to their formats catalog?

Wouldn't my [clean] file just completely overwrite theirs - implying I would lose any new formats they've submitted/created?

I have no problem creating a process to "clean" their formats file periodically, but I cannot justify a daily task comparing their file to my file to see if anything new has been added.

Reeza · Posted 08-25-2022 04:47 PM

@tdegen wrote:

I love it - but what happens when they add new variables and values to their formats catalog?

Wouldn't my [clean] file just completely overwrite theirs - implying I would lose any new formats they've submitted/created?

I have no problem creating a process to "clean" their formats file periodically, but I cannot justify a daily task comparing their file to my file to see if anything new has been added.

You don't lose any formats, only ones that are duplicate would be overwritten. If they update with new ones that your process does not have, theirs will remain. If yours is out of date, that will be true regardless of what approach you use, writing over the text file or running the format.

 68         
 69         *their code runs;
 70         proc format;
 71         
 72         VALUE TED_DONOR
 73         .B = 'Unknown'
 74         .C = 'Not answ/Inconsist'
 75         1 = '1 HLA-identical sibling'
 76         2 = '2 Syngeneic'
 77         3 = '3 Related Donor, match status unknown'
 78         4 = '4     Matched Unrelated Donor (MUD)'
 79         31 = '31 HLA-matched other relative'
 80         32 = '32 HLA-mismatched relative'
 81         41 = '41 HLA matched unrelated'
 82         42 = '42 HLA mismatched unrelated'
 83         ;
 NOTE: Format TED_DONOR has been output.
 84         
 85         *your code runs;
 
 NOTE: PROCEDURE FORMAT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              157.15k
       OS Memory           23456.00k
       Timestamp           08/25/2022 08:40:03 PM
       Step Count                        24  Switch Count  2
       Page Faults                       0
       Page Reclaims                     79
       Page Swaps                        0
       Voluntary Context Switches        14
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           56
       
 86         proc format;
 
 87         VALUE TED_DONOR
 88         .B = 'Unknown'
 89         .C = 'Not answ/Inconsist'
 90         1 = 'HLA-identical sibling'
 91         2 = 'Syngeneic'
 92         3 = 'Related Donor, match status unknown'
 93         4 = 'Matched Unrelated Donor (MUD)'
 94         31 = 'HLA-matched other relative'
 95         32 = 'HLA-mismatched relative'
 96         41 = 'HLA matched unrelated'
 97         42 = 'HLA mismatched unrelated'
 98         ;
 NOTE: Format TED_DONOR is already on the library WORK.FORMATS.
 NOTE: Format TED_DONOR has been output.
 99         run;
 
 NOTE: PROCEDURE FORMAT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              102.50k
       OS Memory           23456.00k
       Timestamp           08/25/2022 08:40:03 PM
       Step Count                        25  Switch Count  0
       Page Faults                       0
       Page Reclaims                     16
       Page Swaps                        0
       Voluntary Context Switches        0
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0

ballardw · Posted 08-25-2022 04:07 PM

Perhaps you might be interested in a slightly different approach.

You can create a data set of the format definitions using Proc format with the Cntlout=<your dataset name goes here>.

Then use a data step to modify the LABEL values then recreate the formats using that set as a CNTLIN data set with Proc Format.

But the real question would why "values.sas file is regenerated every time the dataset is refreshed". Actually refreshed isn't even of concern, at least personally. If the desired contents of the formats as applied don't change you could keep the CNTLIN data set created above and update the Format catalogs used without having to check anything.

tdegen · Posted 08-25-2022 04:13 PM

...and how would you do that? Do you have a sample code you can share?

As I stated in the intro, I'm a c# developer and I know _very_ little about SAS scripts.

ballardw · Posted 08-25-2022 05:19 PM

@tdegen wrote:

...and how would you do that? Do you have a sample code you can share?

As I stated in the intro, I'm a c# developer and I know _very_ little about SAS scripts.

Here is a likely too simple example of creating the formats that SAS would use, placing them into a data set and then manipulating them.

This assumes that you place the formats into a library on your FMTSEARCH path, a setting that tells SAS where to look for formats to use, in this case the very original YOURLIB. IF your Proc Format code has executed you could just use the Proc Format code without the definitions to create the data set from the already executed format code.

Then manipulate the data set to clean up the labels as desired. I picked a simple example for proof of concept. My code below uses COMPBL to remove multiple blanks because when pasted into my editor that "tab" character is no longer there and there are multiple spaces so COMPBL is appropriate to leave one blank instead of multiple. Then the last proc format uses the modified data set to recreate the formats to use.

Some caveats: You may need to look at other variables in the Cntlset created if you have numeric and character formats of the same name, Type=N is numeric, C is character to select the approriate one. If you have INFORMATS then Type is I and J.

After you have the labels looking as desired you would run the the last proc format (suggestion: use cntlout to place in the project library a result of the newly applied format definitions).

The fun part of this is if the actual formats do not /should not change but that file just getting updated and "corrupting" the desired appearance you just add the script with the proc format using the modified (correct) cntlset. That would apply the current corrected definitions to the formats when used. NEW formats would not be modified, so you could check for those.

I included one way to provide conditional rules for what gets done to each example format. Partially to show one way to address values conditionally. The Start variable in the control data set is always text and numeric values may require some manipulation to select the correct one.

Strong suggestion if it is not done elsewhere, each project really should have its own library and a format catalog in that library. That way you don't have someone accidentally changing a format during use.

proc format ;
VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = '1 HLA-identical sibling'
2 = '2 Syngeneic'
3 = '3 Related Donor, match status unknown'
4 = '4     Matched Unrelated Donor (MUD)'
31 = '31 HLA-matched other relative'
32 = '32 HLA-mismatched relative'
41 = '41 HLA matched unrelated'
42 = '42 HLA mismatched unrelated'
;
VALUE PTTXTYPE
1 = '(1) Allo only'
2 = '(2) Auto only'
3 = '(3) Allo transplant(s) then Auto(s)'
4 = '(4) Auto transplant(s) then by Allo(s)'
5 = '(5) Allo transplant(s) then DCI'
6 = '(6) Auto transplant(s) then Allo(s) then DCI'
;
VALUE tedvsres
1 = '1 TED'
2 = '2 CRF (RES)'
3 = '3 moved CRF (RES) to TED'
4 = '4 		moved TED to CRF (RES)'
5 = '5 moved CTRM to TED'
6 = '6 Cellular Therapy Regenerative Medicine (CTRM, CTR)'
7 = '7 Auto no Consent'
8 = '8 moved TED to CTRM'
;
run;
/* this is the bit that just creates the output set*/
proc format  cntlout=cntlset;
run;

/* REAL crude modification code*/
data newcntlset;
   set cntlset;
   select (fmtname);
      when ('PTTXTYPE')  label=substr(label,5);
      when ('TEDVSRES')  label=substr(label,3);
      when ('TED_DONOR')  select (strip(start));
                            when ('.B','.C') ;
                            when ('1','2','3','4') label=substr(label,3);
                            otherwise label=substr(label,4);
                         end;
      otherwise;/* something that might be done to ALL non-
                 listed Fmtnames OR avoid any change*/ 
   end;
   label=compbl(label);
run;

proc format libname=yourlib cntlin=newcntlset;
run;

Reeza · Posted 08-25-2022 03:54 PM

Not good practice at all but if you put it at the end of the file, SAS will create the format again and overwrite the previous values. If you do this, add a comment to the code as to why this is happening.

ChrisHemedinger · Posted 08-25-2022 03:58 PM

Based on Reeza's response I think I misunderstood the requirement -- you're using a SAS program to rewrite the text of SAS program?

If so, then RegEx is still a good option (prx* functions in SAS). I use https://regexr.com/ to build and test the expressions before deploying in a SAS function call.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!

Tom · Posted 08-25-2022 04:32 PM

So the goal is to convert a useful format (one that actually shows the underlying value) into a less useful format?

Why not just leave the code alone and modify the format it creates? Whether or not that works in your case depends on what other code is in the program file. If all it is doing is creating the format, but not actually producing any printouts using the format then that would work fine.

Also why not modify the program that is generating that program to not include the value in the label?

To read a file and modify it use a simple data step. For this problem I would hunt for the beginning of the VALUE statement and then modify lines until the end of the value statement is found.

First let's make an example file from your given input:

filename in temp;
options parmcards=in;
parmcards4;

header stuff 

VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = '1 HLA-identical sibling'
2 = '2 Syngeneic'
3 = '3 Related Donor, match status unknown'
4 = '4     Matched Unrelated Donor (MUD)'
31 = '31 HLA-matched other relative'
32 = '32 HLA-mismatched relative'
41 = '41 HLA matched unrelated'
42 = '42 HLA mismatched unrelated'
;

trailer stuff 

;;;;

Now we can read that IN file and write to an OUT file.

For testing let's write to a temporary file also.

filename out temp;
data _null_;
  infile in truncover dsd dlm=' ' end=eof;
  file out  ;
  length x1-x3 $200 ;
  input x1-x3 ;
  if found then do;
    if x1=';' and x2=' ' then found=0;
    else do;
      x4 = scan(x3,1,' ');
      if not missing(input(x4,??32.)) then do;
        x3=left(substrn(x3,length(x4)+1));
        _infile_=catx(' ',x1,x2,quote(trim(x3),"'"));
      end; 
    end; 
  end;
  else if x1='VALUE' and x2='TED_DONOR' then found=1;
  retain found;
  put _infile_;
run;

So running this we get a log like this

NOTE: The infile IN is:
      Filename=...\#LN00060,
      RECFM=V,LRECL=32767,File Size (bytes)=380,
      Last Modified=25Aug2022:16:27:35,
      Create Time=25Aug2022:16:27:35

NOTE: The file OUT is:
      Filename=...\#LN00061,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=25Aug2022:16:27:35,
      Create Time=25Aug2022:16:27:35

NOTE: 18 records were read from the infile IN.
      The minimum record length was 0.
      The maximum record length was 43.
NOTE: 18 records were written to the file OUT.
      The minimum record length was 0.
      The maximum record length was 41.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Let's look at the results:

12  data _null_;
513    infile out;
514    input;
515    put _infile_;
516  run;

NOTE: The infile OUT is:
      Filename=...\#LN00061,
      RECFM=V,LRECL=32767,File Size (bytes)=356,
      Last Modified=25Aug2022:16:27:35,
      Create Time=25Aug2022:16:27:35


header stuff

VALUE TED_DONOR
.B = 'Unknown'
.C = 'Not answ/Inconsist'
1 = 'HLA-identical sibling'
2 = 'Syngeneic'
3 = 'Related Donor, match status unknown'
4 = 'Matched Unrelated Donor (MUD)'
31 = 'HLA-matched other relative'
32 = 'HLA-mismatched relative'
41 = 'HLA matched unrelated'
42 = 'HLA mismatched unrelated'
;

trailer stuff
NOTE: 18 records were read from the infile OUT.
      The minimum record length was 0.
      The maximum record length was 41.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Re: Editing a simple text file (used for a Formats Catalog)

Registration is open

SAS Training: Just a Click Away