hi guys,
I have a large dataset which contains reject reasons. rejection reasons are order, kind like the waterfall.
I want to count how many ppl are rejected by each rejection code. For example, customer b is rejected by "low+bro", because in the order, bro is the first reason, so b will be ONLY counted to bro, not low.
raw | what I am looking for is | |||||
name | rejection | rejection_by_order | rejection_reason | count | ||
a | low | bro | bro | 3 | ||
b | low+bro | eas | eas | 1 | ||
c | bro | low | low | 1 | ||
d | eas+bro | |||||
e | eas, low |
Sounds like you have data like this:
data have;
input name :$10. rejection $30. ;
cards;
a low
b low+bro
c bro
d eas+bro
e eas, low
;
And metadata like this:
data reasons;
order+1;
input reason $10.;
cards;
bro
eas
low
;
So first combine them and sort by NAME and ORDER. Then take the first one per name.
proc sql;
create table step1 as
select a.*,b.*
from have a
left join reasons b
on indexw(a.rejection,b.reason,' ,+')
order by name, order
;
quit;
data want;
set step1;
by name;
if first.name;
run;
proc freq data=want;
tables reason;
run;
Results:
The FREQ Procedure Cumulative Cumulative reason Frequency Percent Frequency Percent ----------------------------------------------------------- bro 3 60.00 3 60.00 eas 1 20.00 4 80.00 low 1 20.00 5 100.00
PS That is an extremely strange way to type in data. Did you enter it into a spreadsheet? Why not just enter it as plain text?
bro contains b, c , d
data have;
infile cards dlm='|' truncover;
input name $ rejection $50.;
cards;
a |low
b |low+bro
c |bro
d |eas+bro
e |eas,low
;
run;
*create informat for ordering rejections;
proc format;
invalue $ rej_infmt 'bro'=1 'eas'=2 'low'=3;
run;
*split data out for analysis;
data long;
set have;
*count number of terms in rejection;
nWords=countw(compress(rejection, , 's'));
do rej=1 to nwords;
*capture one rejection reason at a time;
rejection_reason=compress(scan(rejection, rej, "+,"), , 's');
*convert rejection_reseason to a number for ordering;
rejection_code=input(rejection_reason, $rej_infmt.);
*output to data set;
output;
end;
run;
*sort so that lowest is first;
proc sort data=long;
by name rejection_code;
run;
*keep only first using nodupkey;
proc sort data=long out=lowest nodupkey;
by name;
run;
*summarize results;
proc freq data=lowest;
table rejection_code * rejection_reason / list;
run;
Longer solution but dynamic, easy to fix and modify if necessary.
Sounds like you have data like this:
data have;
input name :$10. rejection $30. ;
cards;
a low
b low+bro
c bro
d eas+bro
e eas, low
;
And metadata like this:
data reasons;
order+1;
input reason $10.;
cards;
bro
eas
low
;
So first combine them and sort by NAME and ORDER. Then take the first one per name.
proc sql;
create table step1 as
select a.*,b.*
from have a
left join reasons b
on indexw(a.rejection,b.reason,' ,+')
order by name, order
;
quit;
data want;
set step1;
by name;
if first.name;
run;
proc freq data=want;
tables reason;
run;
Results:
The FREQ Procedure Cumulative Cumulative reason Frequency Percent Frequency Percent ----------------------------------------------------------- bro 3 60.00 3 60.00 eas 1 20.00 4 80.00 low 1 20.00 5 100.00
PS That is an extremely strange way to type in data. Did you enter it into a spreadsheet? Why not just enter it as plain text?
is it possible to use scan or find function?
@AndyZ wrote:
is it possible to use scan or find function?
SCAN() would not work very well as it will just pick a single reason from the list. You could possible do some complex looping to try to get it to help but you would need to need to keep track of which is maximum order of the reasons found. The simplest way would be to use SCAN() to convert the HAVE dataset into one observation per NAME/REASON instead of having multiple reasons stuffed into a single variable. Then joining it with the metadata that sets the order would be easier. You might even use a INFORMAT to convert the text of the reason pulled out into an order number to make sorting easier.
FIND() has the problem that in the general case some text of a reason might be a subset of another reason. That is why INDEXW() or FINDW() is better since it only searches for full words, not just substrings.
Why? The code will be much more complicated than the already posted solution.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.