Solved: Parse a String containing Repeated Term

Jeff_DOC · Posted 02-18-2022 04:52 PM

Good afternoon.

I have a list of telephone numbers and types all contained within the same string. What I'd like to do is to parse the data out at each instance of 'TEL:' into separate columns. Each column should contain 'TEL;' and all the characters to the beginning of the next "TEL;"

I've been doing some reading on scan but can't seem to get it to work correctly. I cannot get the array to work correctly and keep getting empty columns.

I'd sure appreciate any help anyone could give me.

Thank you.

TEL;TYPE=home,voice,pref:(XXX) XXX-XXXX

TEL;TYPE=cell,voice,pref:+1 (XXX) XXX-XXXX

TEL;TYPE=work,voice,pref:XXX-XXX-XXXX,TEL;TYPE=work,voice:(XXX) XXX-XXXX,TEL;TYPE=home,voice:(XXX) XXX-XXXX,TEL;TYPE=cell,voice:(XXX) XXX-XXXX,TEL;TYPE=pager:(XXX) XXX-XXXX

TEL;TYPE=work,voice,pref:(XXX) XXX-XXXX

TEL;TYPE=cell,voice,pref:+XXXXXXXXXXX

TEL;TYPE=work,voice,pref:(XXX) XXX-XXXX,TEL;TYPE=work,voice:XXXXXXXXXX

TEL;TYPE=work,voice,pref:(XXX) XXX-XXXX

Tom · Posted 02-22-2022 06:22 PM

First convert it to multiple observations. Converting it to multiple variables is a job for PROC TRANSPOSE.

data tall;
  set have;
  length num 8 string $50 ;
  do num=1 by 1 until(not loc);
     loc = find(col1,'TEL;',5);
     if loc then do;
       string=substrn(col1,1,loc-2);
       col1=substrn(col1,loc);
     end;
     else string=col1;
     output;
  end;
  drop col1 loc;
run;

So if you have this input:

data have;
  row+1;
  input;
  col1=_infile_;
cards4;
TEL;TYPE=work,voice,pref:xxx-xxx-xxxx,TEL;TYPE=work,voice:(xxx) xxx-xxxx,TEL;TYPE=home,voice:(xxx) xxx-xxxx
TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
TEL;TYPE=work,voice,pref:xxx-xxx-xxxx,TEL;TYPE=work,voice:(xxx) xxx-xxxx
;;;;

You get this output:

Obs    row    num                   string

 1      1      1     TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
 2      1      2     TEL;TYPE=work,voice:(xxx) xxx-xxxx
 3      1      3     TEL;TYPE=home,voice:(xxx) xxx-xxxx
 4      2      1     TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
 5      3      1     TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
 6      3      2     TEL;TYPE=work,voice:(xxx) xxx-xxxx

Which you can transpose into multiple variables.

You could use PROC TRANSPOSE.

proc transpose data=tall out=wide(drop=_name_) prefix=tel;
  by row;
  id num;
  var string;
run;

Or a data step:

proc sql noprint;
  select max(num) into :&ncols from tall;
quit;
data want;
  do _n_=1 to &ncols until(last.row);
    set tall;
    by row;
    array tel [&ncols] $50 ;
    tel[_n_]=string;
  end;
run;

If you know the upperbound on the number of telephone numbers you will have might even be able to use the ARRAY in the step that splits the strings and avoid making the TALL dataset.

View solution in original post

ballardw · Posted 02-18-2022 05:51 PM

Pick a couple of these and then manually create what you expect as a result.

If your data came from a text file it might be a good idea to provide 10 or so lines of the text as it may be easier to read than parse existing data.

HB · Posted 02-18-2022 06:18 PM

I was reading it in like this:

data tel_line_nums;
	infile datalines truncover;
	input telco_record $200.;
datalines;
TEL,TYPE=home,voice,pref:(XXX) 555-1212
TEL,TYPE=home,voice,pref:(XXX) XXX-2222
TEL,TYPE=cell,voice,pref:+1 (XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:XXX-555-1213,TEL,TYPE=work,voice:(XXX) 555-1214,TEL,TYPE=home,voice:(XXX) XXX-XXXX,TEL,TYPE=cell,voice:(XXX) 555-1215,TEL,TYPE=mrbig:(XXX) XXX-mrbig
TEL,TYPE=work,voice,pref:XXX-XXX-XXXX,TEL,TYPE=work,voice:(XXX) XXX-XXXX,TEL,TYPE=home,voice:(XXX) XXX-XXXX,TEL,TYPE=cell,voice:(XXX) XXX-XXXX,TEL,TYPE=pager:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=cell,voice,pref:+XXXXXXXXXXX
TEL,TYPE=cell,voice,pref:+XXXXXXXXXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX,TEL,TYPE=work,voice:XXXXXXXXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX,TEL,TYPE=work,voice:XXXXXXXXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
;

but I couldn't figure out how to make it work quite right. "TEL" as the delimiter in the SCAN function affects the T in TYPE for reasons I don't understand and I don't like the trailing commas on the broken out phone numbers.

I changed a few X's into numbers so could track things better.

Tom · Posted 02-18-2022 06:39 PM

Each of the characters you list as delimiters are treated as delimiters. So if you tell that the letter T is a delimiter than any T it sees will be treated as a delimiter.

If you really have lines of text you could try using the DLMSTR= option on the INFILE statement.

data want;
	infile datalines dlmstr='TEL,TYPE=' truncover;
  row+1;
  length order 8 value $100;
  do order=1 by 1 until(value=' ');
    input value @;
    if value ne ' ' or order=1 then output;
  end;
datalines4;
TEL,TYPE=home,voice,pref:(XXX) 555-1212
TEL,TYPE=home,voice,pref:(XXX) XXX-2222
TEL,TYPE=cell,voice,pref:+1 (XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:XXX-555-1213,TEL,TYPE=work,voice:(XXX) 555-1214,TEL,TYPE=home,voice:(XXX) XXX-XXXX,TEL,TYPE=cell,voice:(XXX) 555-1215,TEL,TYPE=mrbig:(XXX) XXX-mrbig
TEL,TYPE=work,voice,pref:XXX-XXX-XXXX,TEL,TYPE=work,voice:(XXX) XXX-XXXX,TEL,TYPE=home,voice:(XXX) XXX-XXXX,TEL,TYPE=cell,voice:(XXX) XXX-XXXX,TEL,TYPE=pager:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=cell,voice,pref:+XXXXXXXXXXX
TEL,TYPE=cell,voice,pref:+XXXXXXXXXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX,TEL,TYPE=work,voice:XXXXXXXXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX,TEL,TYPE=work,voice:XXXXXXXXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
TEL,TYPE=work,voice,pref:(XXX) XXX-XXXX
;;;;

proc print;
run;

Results:

Obs    row    order    value

  1      1      1      home,voice,pref:(XXX) 555-1212
  2      2      1      home,voice,pref:(XXX) XXX-2222
  3      3      1      cell,voice,pref:+1 (XXX) XXX-XXXX
  4      4      1      work,voice,pref:XXX-555-1213,
  5      4      2      work,voice:(XXX) 555-1214,
  6      4      3      home,voice:(XXX) XXX-XXXX,
  7      4      4      cell,voice:(XXX) 555-1215,
  8      4      5      mrbig:(XXX) XXX-mrbig
  9      5      1      work,voice,pref:XXX-XXX-XXXX,
 10      5      2      work,voice:(XXX) XXX-XXXX,
 11      5      3      home,voice:(XXX) XXX-XXXX,
 12      5      4      cell,voice:(XXX) XXX-XXXX,
 13      5      5      pager:(XXX) XXX-XXXX
 14      6      1      work,voice,pref:(XXX) XXX-XXXX
 15      7      1      work,voice,pref:(XXX) XXX-XXXX
 16      8      1      work,voice,pref:(XXX) XXX-XXXX
 17      9      1      work,voice,pref:(XXX) XXX-XXXX
 18     10      1      cell,voice,pref:+XXXXXXXXXXX
 19     11      1      cell,voice,pref:+XXXXXXXXXXX
 20     12      1      work,voice,pref:(XXX) XXX-XXXX,
 21     12      2      work,voice:XXXXXXXXXX
 22     13      1      work,voice,pref:(XXX) XXX-XXXX,
 23     13      2      work,voice:XXXXXXXXXX
 24     14      1      work,voice,pref:(XXX) XXX-XXXX
 25     15      1      work,voice,pref:(XXX) XXX-XXXX

If you have it as a character variable the perhaps use TRANWRD to convert the TEL or TEL,TYPE= into a single character like | or ^ that you could then use with SCAN(). Or try figuring out regular expressions.

ballardw · Posted 02-18-2022 07:12 PM

You do not say how you will use the data if you have these values in different variables on a single observation. Often that is not the best approach for many tasks. However using @Tom 's solution you could transpose the data by the Row variable to get the values on one observation.

Thank you Tom for providing the read solution. I thought something along this line would be very possible and yours is cleaner than my thoughts.

Jeff_DOC · Posted 02-22-2022 05:46 PM

Hi Tom.

Thanks for taking the time to reply. I should have been more clear about a few things. The data is an imported Excel sheet and I'm hoping to get each TEL to it's own column (see the example attached.

	COL 1	COL 2	COL3
HAVE	TEL;TYPE=work,voice,pref:xxx-xxx-xxxx,TEL;TYPE=work,voice:(xxx) xxx-xxxx,TEL;TYPE=home,voice:(xxx) xxx-xxxx
WANT	TEL;TYPE=work,voice,pref:xxx-xxx-xxxx	TEL;TYPE=work,voice:(xxx) xxx-xxxx	TEL;TYPE=home,voice:(xxx) xxx-xxxx

Tom · Posted 02-22-2022 06:22 PM

First convert it to multiple observations. Converting it to multiple variables is a job for PROC TRANSPOSE.

data tall;
  set have;
  length num 8 string $50 ;
  do num=1 by 1 until(not loc);
     loc = find(col1,'TEL;',5);
     if loc then do;
       string=substrn(col1,1,loc-2);
       col1=substrn(col1,loc);
     end;
     else string=col1;
     output;
  end;
  drop col1 loc;
run;

So if you have this input:

data have;
  row+1;
  input;
  col1=_infile_;
cards4;
TEL;TYPE=work,voice,pref:xxx-xxx-xxxx,TEL;TYPE=work,voice:(xxx) xxx-xxxx,TEL;TYPE=home,voice:(xxx) xxx-xxxx
TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
TEL;TYPE=work,voice,pref:xxx-xxx-xxxx,TEL;TYPE=work,voice:(xxx) xxx-xxxx
;;;;

You get this output:

Obs    row    num                   string

 1      1      1     TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
 2      1      2     TEL;TYPE=work,voice:(xxx) xxx-xxxx
 3      1      3     TEL;TYPE=home,voice:(xxx) xxx-xxxx
 4      2      1     TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
 5      3      1     TEL;TYPE=work,voice,pref:xxx-xxx-xxxx
 6      3      2     TEL;TYPE=work,voice:(xxx) xxx-xxxx

Which you can transpose into multiple variables.

You could use PROC TRANSPOSE.

proc transpose data=tall out=wide(drop=_name_) prefix=tel;
  by row;
  id num;
  var string;
run;

Or a data step:

proc sql noprint;
  select max(num) into :&ncols from tall;
quit;
data want;
  do _n_=1 to &ncols until(last.row);
    set tall;
    by row;
    array tel [&ncols] $50 ;
    tel[_n_]=string;
  end;
run;

If you know the upperbound on the number of telephone numbers you will have might even be able to use the ARRAY in the step that splits the strings and avoid making the TALL dataset.

Jeff_DOC · Posted 02-23-2022 12:01 PM

This worked perfectly! Thank you very much for taking your time to help.

Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Re: Parse a String containing Repeated Term

Classroom Training Available!