Find and replace multiple values

Anuz · Posted 04-09-2024 11:10 AM

Hi All,

What is the best way to find and replace values in a field separated by , from values in a mapping table

DATA HAVE;
	length R_TYP $250. ;
	infile datalines ;
	input R_TYP;
	datalines;
son,daughter
uncle
aunty
son,uncle
;
run;


DATA TEST_MAPPING;
	length K_TYP $100. S_TYP $100.;
	infile datalines delimiter=',' dsd truncover;
	input K_TYP  S_TYP;
	datalines;
son,family
daughter,family
uncle,extended family
aunty,extended family
;
run;

DATA WANT;
	length R_TYP $250. ;
	infile datalines ;
	input R_TYP;
	datalines;
family,family
extended family
extended family
family,extended family
;
run;

The output in WANT table is what I need as my final output by checking values in the MAPPING table - TEST_MAPPING

ballardw · Posted 04-09-2024 11:41 AM

Can you state some rules on what to find where? And the replacement rules?

An example without any description of the rules involves leads to lots of guessing and back-and-forth of "that isn't correct" and "why is it not correct".

Most of the time any variable that has multiple values in it for a single observation points to sub-optimal data structure which can cause lots of extra work to get anything done. Why are there multiple values?

Anuz · Posted 04-09-2024 12:06 PM

The HAVE table is what I receive.
WANT table is my desired output.

For example a value of son should be replaced by family from the TEST_MAPPING table.
If the column has multiple values like son, uncle. My output should say family, extended family as replaced values

ballardw · Posted 04-09-2024 12:15 PM

@Anuz wrote:
The HAVE table is what I receive.
WANT table is my desired output.

For example a value of son should be replaced by family from the TEST_MAPPING table.
If the column has multiple values like son, uncle. My output should say family, extended family as replaced values

Just because someone gives you poorly structured data is no reason to keep such a structure.

If you provide accurate analysis or reporting then what happens in between receipt of data and the report shouldn't matter.

Anuz · Posted 04-09-2024 01:27 PM

agree.. unfortunately that is the format it is required in to feed another process. I have to align to the requirement.

Tom · Posted 04-09-2024 01:35 PM

If the data is smallish just do the code I posted before as it is much clearer what is going on.

If the data is large you might get better performance by doing the transformation in one step. Essentially something like:

data want;
  set have;
  length temp $250 ;
  do col=1 to countw(r_typ,',');
     temp=catx(',',temp,put(left(scan(r_type,col,',')),$group.));
  end;
  r_typ=temp;
  drop temp col;
run;

Where $GROUP is a format you generated from your LOOKUP table that would map 'son' to 'family' and 'uncle' to 'extended family'.

Tom · Posted 04-09-2024 11:43 AM

It will be MUCH easier if you convert you delimited list of values into multiple observations instead.

So given your two input datasets (let's call the second one LOOKUP instead of TEST).

data have;
  infile datalines truncover;
  input R_TYP $250.;
datalines;
son,daughter
uncle
aunty
son,uncle
;

data lookup ;
  infile datalines dsd dsd truncover;
  input K_TYP :$100.  S_TYP :$100.;
datalines;
son,family
daughter,family
uncle,extended family
aunty,extended family
;

First fix the WANT dataset to have one observation per value.

data fixed;
  row+1;
  set have;
  do col=1 to max(1,countw(r_typ,','));
    length k_typ $100;
    k_typ = left(scan(r_typ,col,','));
    output;
  end;
  drop r_typ;
run;

Then you can combine on the common variable K_TYP.

proc sql noprint;
create table matched as
  select a.*,b.s_typ
  from fixed a 
    left join lookup b
    on a.k_typ = b.k_typ
  order by row,col
;
quit;

Resuls

Obs    row    col    k_typ       S_TYP

 1      1      1     son         family
 2      1      2     daughter    family
 3      2      1     uncle       extended family
 4      3      1     aunty       extended family
 5      4      1     son         family
 6      4      2     uncle       extended family

And if you did want to rebuild that delimited list it can be simply done with another data step.

data want;
  do until(last.row);
    set matched;
    by row col;
    length r_typ $250;
    r_typ=catx(',',r_typ,s_typ);
  end;
  keep row r_typ;
run;

Result

Obs    row    r_typ

 1      1     family,family
 2      2     extended family
 3      3     extended family
 4      4     family,extended family

Anuz · Posted 04-09-2024 12:08 PM

Thank you Tom. Let me try that/

Reeza · Posted 04-09-2024 01:53 PM

A bit more of a verbose solution but easy to follow and modify.

data have;
    infile datalines truncover;
    input R_TYP $250.;
    datalines;
son,daughter
uncle
aunty
son,uncle
;

data lookup;
    infile datalines dsd dsd truncover;
    input K_TYP :$100. S_TYP :$100.;
    datalines;
son,family
daughter,family
uncle,extended family
aunty,extended family
;
    
    
    
*find number of values per row for array dimensions;
data _null_;
    set have end=eof;
    retain max_num 0;
    n_words=countc(r_typ, ',')+1;
    max_num=max(n_words, max_num);

    if eof then
        call symputx('num_words', n_words);
run;

%put &num_words;

*create format for mapping;
data lookup_fmt;
    set lookup;
    start=k_typ;
    label=s_typ;
    type='C';
    fmtname='family_fmt';
run;

proc format cntlin=lookup_fmt;
run;

*map codes to new values;
data mapped_summarized;
    set have;
    array _original(&num_words.) $100.;
    array _mapped(&num_words.) $100.;

    do i=1 to countc(r_typ, ",")+1;
        _original(i)=trim(scan(r_typ, i, ','));
        _mapped(i)=put(_original(i), $family_fmt.);
    end;
    
    *concatenate output as desired;
    r_typ_new=catx(", ", of _mapped(*));
run;

@Anuz wrote:

Hi All,

What is the best way to find and replace values in a field separated by , from values in a mapping table
DATA HAVE;
	length R_TYP $250. ;
	infile datalines ;
	input R_TYP;
	datalines;
son,daughter
uncle
aunty
son,uncle
;
run;


DATA TEST_MAPPING;
	length K_TYP $100. S_TYP $100.;
	infile datalines delimiter=',' dsd truncover;
	input K_TYP  S_TYP;
	datalines;
son,family
daughter,family
uncle,extended family
aunty,extended family
;
run;

DATA WANT;
	length R_TYP $250. ;
	infile datalines ;
	input R_TYP;
	datalines;
family,family
extended family
extended family
family,extended family
;
run;
The output in WANT table is what I need as my final output by checking values in the MAPPING table - TEST_MAPPING

Ksharp · Posted 04-09-2024 10:34 PM

DATA HAVE;
	length R_TYP $250. ;
	infile datalines ;
	input R_TYP;
	datalines;
son,daughter
uncle
aunty
son,uncle
;
run;


DATA TEST_MAPPING;
	length K_TYP $100. S_TYP $100.;
	infile datalines delimiter=',' dsd truncover;
	input K_TYP  S_TYP;
	datalines;
son,family
daughter,family
uncle,extended family
aunty,extended family
;
run;

data want;
 if _n_=1 then do;
   if 0 then set TEST_MAPPING;
   declare hash h(dataset:'TEST_MAPPING');
   h.definekey('K_TYP');
   h.definedata('S_TYP');
   h.definedone();
 end;
set have;
length want temp $ 200;
do i=1 to countw(R_TYP,',');
 temp=scan(R_TYP,i,',');
 if h.find(key:temp)=0 then want=catx(',',want,S_TYP);
  else want=catx(',',want,temp);
end;
keep want;
run;

Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Re: Find and replace multiple values

Classroom Training Available!