Solved: how to extract a string (by order) from a string then count

AndyZ · Posted 01-26-2022 10:54 AM

hi guys,

I have a large dataset which contains reject reasons. rejection reasons are order, kind like the waterfall.

I want to count how many ppl are rejected by each rejection code. For example, customer b is rejected by "low+bro", because in the order, bro is the first reason, so b will be ONLY counted to bro, not low.

raw			what I am looking for is

name	rejection	rejection_by_order		rejection_reason	count
a	low	bro		bro	3
b	low+bro	eas		eas	1
c	bro	low		low	1
d	eas+bro
e	eas, low

Tom · Posted 01-26-2022 11:55 AM

Sounds like you have data like this:

data have;
  input name :$10. rejection $30. ;
cards;
a low
b low+bro
c bro
d eas+bro          
e eas, low 
;

And metadata like this:

data reasons;
  order+1;
  input reason $10.;
cards; 
bro
eas
low
;

So first combine them and sort by NAME and ORDER. Then take the first one per name.

proc sql;
create table step1 as
select a.*,b.*
from have a 
left join reasons b
on indexw(a.rejection,b.reason,' ,+')
order by name, order 
;
quit;

data want;
  set step1;
  by name;
  if first.name;
run;

proc freq data=want;
  tables reason;
run;

Results:

The FREQ Procedure

                                   Cumulative    Cumulative
reason    Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
bro              3       60.00             3        60.00
eas              1       20.00             4        80.00
low              1       20.00             5       100.00

PS That is an extremely strange way to type in data. Did you enter it into a spreadsheet? Why not just enter it as plain text?

View solution in original post

Reeza · Posted 01-26-2022 11:08 AM

Since bro is before eas shouldn't it be bro=4, eas=1, low=0?

AndyZ · Posted 01-26-2022 11:19 AM

bro contains b, c , d

Reeza · Posted 01-26-2022 11:49 AM

data have;
	infile cards dlm='|' truncover;
	input name	$ rejection	 $50.;
	cards;
a |low	
b |low+bro 	
c |bro	
d |eas+bro	 	 	 	 	 
e |eas,low
;
run;


*create informat for ordering rejections;
proc format;
	invalue $ rej_infmt 'bro'=1 'eas'=2 'low'=3;
run;

*split data out for analysis;
data long;
	set have;
	*count number of terms in rejection;
	nWords=countw(compress(rejection, , 's'));

	do rej=1 to nwords;
	    *capture one rejection reason at a time;
		rejection_reason=compress(scan(rejection, rej, "+,"), , 's');
		*convert rejection_reseason to a number for ordering;
		rejection_code=input(rejection_reason, $rej_infmt.);
		*output to data set;
		output;
	end;
run;

*sort so that lowest is first;
proc sort data=long;
     by name rejection_code;
run;

*keep only first using nodupkey;
proc sort data=long out=lowest nodupkey;
by name;
run;

*summarize results;
proc freq data=lowest;
	table rejection_code * rejection_reason / list;
run;

Longer solution but dynamic, easy to fix and modify if necessary.

Tom · Posted 01-26-2022 11:55 AM

Sounds like you have data like this:

data have;
  input name :$10. rejection $30. ;
cards;
a low
b low+bro
c bro
d eas+bro          
e eas, low 
;

And metadata like this:

data reasons;
  order+1;
  input reason $10.;
cards; 
bro
eas
low
;

So first combine them and sort by NAME and ORDER. Then take the first one per name.

proc sql;
create table step1 as
select a.*,b.*
from have a 
left join reasons b
on indexw(a.rejection,b.reason,' ,+')
order by name, order 
;
quit;

data want;
  set step1;
  by name;
  if first.name;
run;

proc freq data=want;
  tables reason;
run;

Results:

The FREQ Procedure

                                   Cumulative    Cumulative
reason    Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
bro              3       60.00             3        60.00
eas              1       20.00             4        80.00
low              1       20.00             5       100.00

PS That is an extremely strange way to type in data. Did you enter it into a spreadsheet? Why not just enter it as plain text?

AndyZ · Posted 01-26-2022 01:51 PM

is it possible to use scan or find function?

Tom · Posted 01-26-2022 02:19 PM

@AndyZ wrote:

is it possible to use scan or find function?

SCAN() would not work very well as it will just pick a single reason from the list. You could possible do some complex looping to try to get it to help but you would need to need to keep track of which is maximum order of the reasons found. The simplest way would be to use SCAN() to convert the HAVE dataset into one observation per NAME/REASON instead of having multiple reasons stuffed into a single variable. Then joining it with the metadata that sets the order would be easier. You might even use a INFORMAT to convert the text of the reason pulled out into an order number to make sorting easier.

FIND() has the problem that in the general case some text of a reason might be a subset of another reason. That is why INDEXW() or FINDW() is better since it only searches for full words, not just substrings.

andreas_lds · Posted 01-26-2022 02:33 PM

Why? The code will be much more complicated than the already posted solution.

how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Re: how to extract a string (by order) from a string then count

Classroom Training Available!