Re: how to add a space before a repeating prefix in long string and ex...

appleorange

data sample;
 y='pn415pn418 pn415pn418 pn414 pn415pn417pn413p n415pn418 pn417pn417 pn415'; run;

  data want2x; set sample;
y=tranwrd(y, " pn ", "pn");
y=tranwrd(y, " pnp ", "pnp");
y=tranwrd(y, " 1icc ", "1icc");
y=tranwrd(y, " 2icc ", "2icc"); run;

data want3x; set want2x;
y2=lowcase(y); run;

data tryx; set want3x;
y2=tranwrd(trim(y2), " ", ",");
call symput('y', y2);
run;

data middle2x;
set tryx;
type="&codes"; length term $20;
do i=1 by 1 until (term=' ');
term=strip(scan(y2,i,','));
if term ne ' ' then output;
end;
run;

Because of the random lack of space between one pnxxx code, or an extra space such as after pn413, I can't extract each distinct pnxxx code. There are also codes with the prefix pnp, 1icc, 2icc, etc. and up to 4-5 letters so SAS code on how to address a variety of situations would help. Currently because of the wonky spacing, I get these separated out in term:

pn415pn418

pn414

pn415pn417pn413p

n415pn418

pn417pn417

pn415

Is there a way to format the original string better with the right spacing? Or a different way to get at this. I've been adding commas, and then trying to get the sequence but the spaces mess them up.

ballardw

It would help a great deal if you provided what you actually want for a result.

Are you looking for something like this?

data want2x; 
   set sample;
   x=compress(y,'');
   z=tranwrd(x,'pn',' pn');
run;

The first step removes all the spaces then replace the 'pn' with ' pn'. Scan should extract any code from here.

Note that this should fix codes that start with PNP as well as PN. You didn't provide any of the other "etc" codes. You may well have to have a very complicated solution if you have more overlapping sequences

Note: Since this is inserting spaces you must provide for a target variable at some point that is longer than the existing variable.

Where did this stuff come from? It has some appearance of things that can happen when reading a source file incorrectly.

SASJedi

To kind of do it all at once:

data have;
	y='pn415pn418 pn415pn418 pn414 pn415pn417pn413p n415pn418 pn417pn417 pn415'; 
run;

data want;
	set have;
	/* Remove all spaces. Put a | delimiter in front of each 'pn'*/
	y=tranwrd(compress(y),'pn','|pn');
	/* countw will count the number of values in y */
	do word=1 to countw(y);
		/* Extract each value */
		Text=scan(y,word);
		/* Do whatver you want with the value. */
		put word= Text=;
	end;
run;

Check out my Jedi SAS Tricks for SAS Users

appleorange

It's from data that's collected by a laptop or ipad, where people have to type the codes. I believe people are typing them in, and either adding in spaces or not adding in spaces at their whim.

ballardw

@appleorange wrote:

It's from data that's collected by a laptop or ipad, where people have to type the codes. I believe people are typing them in, and either adding in spaces or not adding in spaces at their whim.

What did they "type it into", text file, spreadsheet, some data base?

And how did the data make it into SAS? This step might have a chance of correcting it. Maybe. Depends on more than one factor.

appleorange

I think it's a database they login into. The database admin people then turn whatever forms/data get collected in that section into .csvs. These are always extremely messy that I cannot import into SAS and have had to use STATA to deal with the quote stripping and quote binding feature it has. Thanks for the suggestion, maybe the step between the database data being entered and turned into csv can eliminate some of this?

ballardw

@appleorange wrote:

I think it's a database they login into. The database admin people then turn whatever forms/data get collected in that section into .csvs. These are always extremely messy that I cannot import into SAS and have had to use STATA to deal with the quote stripping and quote binding feature it has. Thanks for the suggestion, maybe the step between the database data being entered and turned into csv can eliminate some of this?

I might suggest providing an example of part of one of these "extremely messy" csv files before stripping quotes.

Quotes serve a very significant purpose in most csv files and perhaps what STATA is doing is contributing to the problem. Copy a few rows of the raw csv that have some of these problems, open a text box on the forum using the </> icon that appears above the message window and paste the text. The text box is critical because the main windows here will reformat text significantly. If there are any sensitive fields like names, account numbers and such replace such with XXXX or similar nonsense values. Or if practical just the "code" fields.

Note: If my Import you mean Proc Import or one of the widgets that calls import in the background it is designed for pretty basic nice data layouts. Unfortunately there is much data that does not fit that model. Which is why we have programmers. We learn to use the tools available to read data in messy forms.

appleorange

pnp14h6
pnp64h29
pnp68h31
pnp66h30
pnp68h31
pnp208h95","False","False","False","","","","True","2","1","","True","False","False","False","False","False","False","False","","","","","","pnp46h22
pnp96h45
pnp94h44
pnp232h107","False","","","","","","","","","","","True",""
"00000","00000-0000","Name2","Name1","Encounter","Agent","Place A","","False","False","False","False","False","","","","","","False","False","False","False","","","","","","","False","False","","","False","False","False","","","","","","","","","","                                                                                                                                                                                                        ","False","False","False","False","False","False","False","False","False","False","False","False","","","","","","","","","","","","","","True","","data data","","","","","","1iccp38h17 1iccp40h18 1iccp42h19 1iccp44h20 1iccp154h67 1iccp156h68 1icc1318 1iccp211h94 1iccp214h95 1icc1319

","True","","","","",": pn811 pn812 pn814 pn815 pn816 pn817 pn818 pn819
pn814 pn815 pn816 pn817 pn818 pn819
","True","False","False","","","False","","","","","","","","","","True","False","","","","","pn1111 pnp78h36 pnp128h59 pn1122 pnp170h78 pnp216h99 pnp260h119 pnp298h135 pnp296h136 pn1152","False","False","False","","","","","","","","","","","","True","","","pnp28h13
pnp30h14 pnp32h15 pnp82h38 pnp84h39 pnp86h40 pnp132h61 pnp134h62

","False","False","","","True","False","","","","pnp140h65","True","","","","pnp62h28 pnp8h3 pnp160h73 
pnp206h94 pnp250h114 pn1752
","True","","","","","","pnp38h18 pnp34h16 pnp36h17 pnp40h19 pnp42h20 pnp130h60
pnp108h51 1811 pnp88h41 pnp90h42 pnp92h43 pn1818 pnp136h63 pnp138h64 pnp179h82 pnp184h85 pnp186h86 pnp220h101 pnp222h101 pnp224h103 pnp226h104 pnp228h105 pnp230h106 pnp264h121 pnp266h122 pnp268h123 pnp270h124 pnp300h137 pnp302h138 pp304h139
","False","","","","","","","","","False","","","","","","","","True","","","","pnp12h5 pnp64h29 pnp66h30 pnp68h31 pnp118h54 pnp120h55 pnp14h6 pnp162h74 pnp164h75 pnp208h95 pnp14h6 pnp16h7 pnp70h32 pnp122h56 pnp166h76 pnp252h115 pnp254h116 pnp290h132 pnp290h133 ","False","False","False","","","","True","8","9","","False","False","False","False","False","False","False","False","","","","","","pnp46h22 pnp48h23 pnp50h24 pnp52h25 pn224 pn2242 pnp94h44 pnp98h46 pnp100h47 pnp102h48 pnp104h49 pn2221 pnp142h66 pnp144h67 pnp146h68 pnp148h69 pn2231 pnp188h87 pnp190h88 pnp192h89 pnp194h90 pnp196h91 pnp232h107 pnp236h109 pnp238h110 pnp240h111 pnp272h125 pnp274h126 pnp276h128 pnp306h140 ","False","","","","","","","","","","","True","1iccp38h17 1iccp40h18 1iccp42h19 1iccp44h20 1iccp154h67 1iccp156h68 1icc1318 1iccp211h94 1iccp214h95 1icc1319

"
"00000","00000-0000","Name2","Name1","Encounter","Agent","Place A",

By import, I use infile, but haven't found a solution to deal with raw data that looks like this as it hasn't come up that much (until now) and I'm always being pressed to run results. Not everyone understands just how messy raw files can be, and how long it could take to correct the various issues but definitely, STATA might be causing more issues. I pasted a random portion of the raw file that shows the spaces and quotes that get pushed into breaks.

ballardw

I would strongly suggest contacting who ever does the data extract from the data base that line feeds and/or carriage returns entered in the data base be replaced with spaces (not deleted) before creating the CSV files. That will solve many of your problems. This is an example of "bad database admin, no cookie" letting such junk persist in the data. (And yes I've dealt with this before as a paid contractor. When I explained how much $$$ the were paying us in programming time to fix this issue with provided data files when it would be easier to fix before sending to us and paying for that it was fixed quickly as we were processing three files per week.)

The things like ,"", indicate columns in the data base with null values and SAS reads those fine once the line feed/carriage returns are straight.

appleorange

LOL, it's validating to see that this is indeed bad database administrating going on. I can also think of another data warehouse that provides data with these issues (we just asked the data provider to delete the entire variable)...so it's a bit surprising to me that they seem to either be incompetent, or inexperienced at understanding how to correctly CSV files. Thank you so much for your help and insight!

Tom

If those lines are supposed to be part of a CSV file then you definitely have issues as you cannot have a line that starts:

pnp208h95",

in a valid CSV file.

If they are just the values of the actual field that the user typed into then, yes that is what users might do.

One source of issues you might be able to get more control of is the conversion of the data from the database into a CSV file. A lot of databases do not have a good tools for making CSV files (it was basically impossible with older versions of Oracle). So first make sure that the database programmers are using the right instructions for writing the CSV file. Make sure that the generated file has quotes around any value that includes either the delimiter (normally a comma) or the quote character itself. And that any quote character in the data is double up (not "escaped" with a \ like you would do in a Unix shell.).

Another problem is that SAS does not handle embedded end-of-line characters in the value of variable in a CSV file. Which can be a problem when you are letting users type in free text. So make sure the database programmers converts any CR ('0D'x) or LF('0A'x) characters into something else, like a space or comma or vertical bar, before generating the CSV file.

If you do have a CSV file that is otherwise fine but has embedded end-of-line characters only inside of a quoted value then it is possible to convert such a file into one that SAS can read as a normal delimited file by replacing any CR or LF that is inside of quotes. You could try using this macro on such a file to make a new file that SAS will have an easier time reading.

https://github.com/sasutils/macros/blob/master/replace_crlf.sas

appleorange

I copied a section that had less sensitive data for me to replace but I assume it would have been:

"MORE DATA
pnp14h6
pnp64h29
pnp68h31
pnp66h30
pnp68h31
pnp208h95"

As in the beginning quote started before that with more line breaks, blank spaces, etc.

But otherwise, thanks so much for the advice on getting back to the database people. I am not familiar with this but will copy and paste what you said to the relevant people!

Thanks for linking the macro - that would definitely help with variations of this type of data.

A_Kh

With some modification to your code - compressing blanks, inserting a comma before 'pn' and counting the number of 'commas' to parse each term into a new observation.

data want;
	set have;
	y1=tranwrd(compress(y, ' '), 'pn', ',pn');
	num=countc(y1, ',');
	do i=1 to num;
		pn=scan(y1,i,',');
		output;
	end; 
run;

Ksharp

So is the following output what you are looking for ?

data sample;
 y='pn415pn418 pn415pn418 pn414 pn415pn417pn413p n415pn418 pn417pn417 pn415'; 
run;

data want;
 set sample;
 y=compress(y,,'s');
 pid=prxparse('/[a-z]+\d+/i');
 s=1;e=length(y);
 call prxnext(pid,s,e,y,p,l);
 do while(p>0);
   want=substr(y,p,l);output;
   call prxnext(pid,s,e,y,p,l);
 end;
 keep y want;
run;

Tom

There are also codes with the prefix pnp, 1icc, 2icc, etc. and up to 4-5 letters

It would help if you could provide more details on the rules for the "codes".

Do the digits after the prefix vary? How? Your examples seem to all have pn followed by a three digit string. In fact all of your examples have pn41 followed by one digit.

Also since some of the prefixes might start with a digit (1icc and 2icc) you could have trouble if the number of digits after a prefix varies. It might be impossible to figure out what was intended from some run-on strings.

Do you have a full list of the prefixes? Perhaps in a dataset? You might want to use the list to generate some code that can be used to convert the string into something that is easier to parse.

For example you might want to use regular expressions. So perhaps something like:

  y_fixed=prxchange('s/(pn\d\d\d)/,$1/',-1,y_fixed);
  y_fixed=prxchange('s/(1icc\d\d)/,$1/',-1,y_fixed);

Example:

data sample;
  row+1;
  input y $80.;
cards;
pn415pn418 pn415pn418 pn414 pn415pn417pn413p n415pn418 pn417pn417 pn415
pn4151icc25
;

data sample_fixed;
  set sample;
  length y_fixed $200;
  y_fixed=compress(lowcase(y));
  y_fixed=prxchange('s/(pn\d\d\d)/,$1/',-1,y_fixed);
  y_fixed=prxchange('s/(1icc\d\d)/,$1/',-1,y_fixed);
run;

proc print;
run;

data want;
  set sample_fixed;
  do i=1 to max(1,countw(y_fixed,','));
    length code $10;
    code = scan(y_fixed,i,',');
    output;
  end;
  keep row i code ;
run;

proc print;
run;

Result:

how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Re: how to add a space before a repeating prefix in long string and extract strings

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!