How do i create a new function/rules for prioritizing number/character?
I have the following dataset.
Data mock;
trx_date='01Jan2024';
amount=100;
c_type='1';
output;
trx_date='01Jan2024';
amount=200;
c_type='2';
output;
trx_date='01Jan2024';
amount=300;
c_type='3';
output;
trx_date='01Jan2024';
amount=200;
c_type='1 or 3';
output;
trx_date='03Feb2024';
amount=1;
c_type='2';
output;
trx_date='03Feb2024';
amount=1.5;
c_type='3';
output;
trx_date='03Feb2024';
amount=2;
c_type='1 or 2 or 3';
output;
trx_date='31Mar2024';
amount=888;
c_type='1';
output;
trx_date='31Mar2024';
amount=1088;
c_type='2';
output;
run;
In the mockup data, there will be data belonging to 2 different date, 1st Jan 2024 and 3rd Feb 2024.
What i wish to achieve is to look at the data by trx_date as one group.
For trx_date = 1st Jan 2024, c_type will have value of '1' and '2' and '3' and '1 or 3'.
For trx_date = 3rd Feb 2024, c_type will have value of '2' and '3' and '1 or 2 or 3'.
For trx_date = 31st Mar 2024, c_type will have value of '1' and '2' only.
Whenever the value of c_type consist of 'X or X', there will be another level of checking for the "ranking" or "priority" for the c_type as follows.
1 > 3 > 2
In one date period, if there is "X or X" whereby '1' is in the "X Or X" condition, it will check if c_type in the same group of date period has any row of record that consist of '1' (1st in the ranking/priority). If yes, the final value will be '1'. If no, it will check if '3' is in the c_type for the same group of date period (assuming 3 is also in the 'X or X' condition). If yes, the final value would be '3'.
However, if the condition is '1 or 2', it will skip checking for c_type in the same period whether it consist of '3' or not as '3' is not in the 'X or X' condition. Afterall, the function that i wish to create should refer to 'X or X' first.
Worth mentioning here is that it is possible for a date period to not consist of c_type with the value of 'X or X'. Therefore, the function must be dynamic enough to work even if there is no value of 'X or X'.
Based on the example and explanation given here, the final output should be like this:
Data mock;
trx_date='01Jan2024';
amount=100;
c_type='1';
output;
trx_date='01Jan2024';
amount=200;
c_type='2';
output;
trx_date='01Jan2024';
amount=300;
c_type='3';
output;
trx_date='01Jan2024';
amount=200;
c_type='1';
output;
trx_date='03Feb2024';
amount=1;
c_type='2';
output;
trx_date='03Feb2024';
amount=1.5;
c_type='3';
output;
trx_date='03Feb2024';
amount=2;
c_type='3';
output;
trx_date='31Mar2024';
amount=888;
c_type='1';
output;
trx_date='31Mar2024';
amount=1088;
c_type='2';
output;
run;
Currently, i can only think of the following steps to achieve this which i think is not efficient and may not be enough to cover the whole logic.
Step 1: Check if each date has 'X or X'. Flag it with 'Y'.
If yes, proceed with Step 2: Set the ranking of
1 -> 1,
3 -> 2,
2 -> 3
Step 3, for data without 'X or X', direct populate without any logic required.
I am stuck at the part after Step 2 as to how do i define the value for the ranking and at the end, how do i assign the value after that?
I would start by making your input data easier to follow by writing it something like this:
data mock;
infile cards dsd truncover firstobs=1 dlm=',';
length trx_date 4 amount 8 c_type $20;
informat trx_date date9.;
format trx_date date9.;
input trx_date amount c_type;
cards;
01Jan2024,100,1
01Jan2024,200,2
01Jan2024,300,3
01Jan2024,200,1 or 3
;
run;
Do not use any quotes in the above.
Next, I would dispense with the idea of trying to keep whatever "c_type" is as a character string that might be "X" or "X or X" or "X or X or X". That's just going to be a giant pain. Create 3 variables (e.g., X, Y, Z). If c_type only contains 1 number, only X has a value and Y and Z are missing. Your life will be so much easier and I suspect the solution will become very simple and obvious.
I am not sure I understand the rule (or how it could be extended to other data).
It would help a lot if you provided the data in a clearer format. Instead of the wallpaper code just post a simple data step with in-line data records.
I think that perhaps you mean you have this input HAVE dataset
data have;
input trx_date :date. amount c_type $ 15-30 ;
format trx_date date9.;
cards;
01Jan2024 100 1
01Jan2024 200 2
01Jan2024 300 3
01Jan2024 200 1 or 3
03Feb2024 1 2
03Feb2024 1.5 3
03Feb2024 2 1 or 2 or 3
31Mar2024 888 1
31Mar2024 1088 2
;
and desire to create this WANT dataset.
data want;
input trx_date :date. amount c_type $ 15-30 ;
format trx_date date9.;
cards;
01Jan2024 100 1
01Jan2024 200 2
01Jan2024 300 3
01Jan2024 200 1
03Feb2024 1 2
03Feb2024 1.5 3
03Feb2024 2 3
31Mar2024 888 1
31Mar2024 1088 2
;
I am not sure what you mean by 1 < 3 < 2??? Is the issue that your real values are words instead of nunbers and you want to use something other than alphabetical order? So perhaps you have FRED, SALLY and JOE instead of 1, 3 and 2?
Perhaps you could make an INFORMAT that converts the strings into distinct integers in the proper order?
proc format ;
invalue order
'1' = 1
'3' = 2
'2' = 3
;
run;
Or do you mean that the string with OR keyword are setting ordering?
So that 1 or 3 means check if there are any 1's then check if there are any 3s?
But 3 or 1 would check for 3's first?
Which is it?
As to how to deal with those datasets I would suggest processing all of the observations for a single date in one step (using DOW loop) so that when you get to the last observation you could KNOW whether or not there were any 1's or 3's seen already. That should work assuming the OR records appear at the end.
Something like:
data try1 ;
do until(last.trx_date);
set have;
by trx_date ;
array found[3] ;
if not index(c_type,'or') then do;
n_type = input(c_type,32.);
found[n_type]=1;
end;
else do;
match=0;
do index=1 to countw(c_type,' ') by 2 until(match);
n_type=input(scan(c_type,index,' '),32.);
match=found[n_type];
end;
end;
output;
end;
run;
Hey @Tom , your sample do loop is pretty much what i want to achieve except on the part where the ordering is incorrect.
For Rom 5 to 7, the trx_date 03Feb2024, I would expect the "n_type" resolves to '3' as the ordering is as follows:
1
3
2
Hence, if '1' is not found in the same date period, it will check if '3' exist or not. If yes, the n_value is '3'. '2' is the last in the ordering. Are we able to tweak the loop to cater to this?
Perhaps you could make an INFORMAT that converts the strings into distinct integers in the proper order?
proc format ;
invalue order
'1' = 1
'3' = 2
'2' = 3
;
run;
You nailed it Tom. Initially i thought of this but without the use of proc format...i was thinking from the point of setting
'1' = 'A'
'3' = 'B'
'2' = 'C'
And i sort base on the alphabet to get the ordering right. However, ur loop has all the things that i wanted except for the ordering (1 to 3 to 2) rather than (1 to 2 to 3).
OR keyword here means that it is possible to have a value with the following string
"1 or 2 or 3" <- this is a valid value for the whole string wrapped with double quotation.
Whenever we see "X or Y", we need to check for the same date period if there is other value consisting on either
'1'
'3'
'2'
If in the same date period where it has value of "1 or 2 or 3", it will set the value based on whatever is found for the same period with the following order:
1
3
2
I hope this is clearer. Sorry for the confusion.
So the order of the digits in the strings with OR does not matter? "1 or 3" is the same thing as "3 or 1"?
If so you might just build the flag variables and then use a DO loop to check for the first # that is both flagged and in the OR string. So if you have FOUND1 to FOUND3 to indicate if there was every a value of 1, 2 or 3 your loop might look like:
do order=1,3,2 ;
if found[order] and indexw(c_type,cats(order)) then do;
n_type=order;
leave;
end;
end;
But for just 3 possible flags the loop is probably overkill.
if found1 and indexw(c_type,'1') then n_type=1;
else if found3 and indexw(c_type,'3') then n_type=3;
else if found2 and indexw(c_type,'2') then n_type=2;
else n_type=.;
Data mock; length c_type $20; trx_date='01Jan2024'; amount=100; c_type='1'; output; trx_date='01Jan2024'; amount=200; c_type='2'; output; trx_date='01Jan2024'; amount=300; c_type='3'; output; trx_date='01Jan2024'; amount=200; c_type='1 or 3'; output; trx_date='03Feb2024'; amount=1; c_type='2'; output; trx_date='03Feb2024'; amount=1.5; c_type='3'; output; trx_date='03Feb2024'; amount=2; c_type='1 or 2 or 3'; output; trx_date='31Mar2024'; amount=888; c_type='1'; output; trx_date='31Mar2024'; amount=1088; c_type='2'; output; run; data want; array _c{9999} $ 40 _temporary_; call missing(of _c{*}); do i=1 by 1 until(last.trx_date); set mock; by trx_date; _c{i}=c_type; end; do until(last.trx_date); set mock; by trx_date; if find(c_type,'or') then do; if findw(c_type,'1',,'kd') and '1' in _c then c_type='1'; else if findw(c_type,'3',,'kd') and '3' in _c then c_type='3'; else if findw(c_type,'2',,'kd') and '2' in _c then c_type='2'; end; output; end; drop i; run;
Here's my take:
data Have;
length trx_date amount 8 c_type $15;
format trx_date date9.;
trx_date='01Jan2024'd; amount=100; c_type='1'; output;
trx_date='01Jan2024'd; amount=200; c_type='2'; output;
trx_date='01Jan2024'd; amount=300; c_type='3'; output;
trx_date='01Jan2024'd; amount=200; c_type='1 or 3'; output;
trx_date='03Feb2024'd; amount=1; c_type='2'; output;
trx_date='03Feb2024'd; amount=1.5; c_type='3'; output;
trx_date='03Feb2024'd; amount=2; c_type='1 or 2 or 3'; output;
trx_date='31Mar2024'd; amount=888; c_type='1'; output;
trx_date='31Mar2024'd; amount=1088; c_type='2'; output;
run;
/* Find the max number of entries for any date, write to macro variable maxCount for sizing our arrays */
proc sql noprint;
select max(count) into :maxCount
from (select count(*) as Count from have group by trx_date)
;
quit;
data final;
/* All temp variable names start with _ - drop them all */
drop _: ;
length _cs $15;
/* Set up arrays to hold all amount and c_type values for a date group */
array a [&maxCount] _temporary_;
array c [&maxCount] $15 _temporary_;
/* read in the data you have by trx_date (must be sorted in that order) */
set have;
by trx_date;
/* Reset retained values if this is the first in a new date group */
if first.trx_date then do;
call missing (of a[*], of c[*], _count);
end;
/* Keep track of which record we're working with */
_count+1;
/* Stash the amout and c_type values in the arrays */
a[_count]=amount;
c[_count]=c_type;
/* When I've got all records for this date group, process the decisioning */
if last.trx_date then do;
/* Work through each entry in the arrays */
do _i=1 to dim(c);
/* If this c_type includes 'or' then do the priority checks */
if find(c[_i],'or') then do;
/* concatenate all previous c_type values into a list name _cs */
do _j=1 to _i-1;
_cs=cats(_cs,c[_j]);
end;
/* If there is a 1 on the list, set c_type to 1 */
if find(c[_i],'1') and find(_cs,'1') then c_type=1;
/* otherwise, it there is a 3 on the list, set c_type to 3 */
else if find(_cs,'3') then c_type=3;
/* The only other possibility is 2... */
else c_type=2;
end;
/* If there is no 'or' in c_type, get the appropriate value for this record from the array */
else c_type=c[_i];
/* Set amount to the appropriate value for this record from the array */
amount=a[_i];
/* Sometimes, there were fewer then &maxCount records. If this array value was missing, don't output */
if not missing(amount) then output;
end;
end;
run;
title "Have";
proc print data=have; run;
title "Final";
proc print data=final; run;
Have |
Obs | trx_date | amount | c_type |
---|---|---|---|
1 | 01JAN2024 | 100.0 | 1 |
2 | 01JAN2024 | 200.0 | 2 |
3 | 01JAN2024 | 300.0 | 3 |
4 | 01JAN2024 | 200.0 | 1 or 3 |
5 | 03FEB2024 | 1.0 | 2 |
6 | 03FEB2024 | 1.5 | 3 |
7 | 03FEB2024 | 2.0 | 1 or 2 or 3 |
8 | 31MAR2024 | 888.0 | 1 |
9 | 31MAR2024 | 1088.0 | 2 |
Final |
Obs | trx_date | amount | c_type |
---|---|---|---|
1 | 01JAN2024 | 100.0 | 1 |
2 | 01JAN2024 | 200.0 | 2 |
3 | 01JAN2024 | 300.0 | 3 |
4 | 01JAN2024 | 200.0 | 1 |
5 | 03FEB2024 | 1.0 | 2 |
6 | 03FEB2024 | 1.5 | 3 |
7 | 03FEB2024 | 2.0 | 3 |
8 | 31MAR2024 | 888.0 | 1 |
9 | 31MAR2024 | 1088.0 | 2 |
Thanks for your excellent codes for reference. For the time being, i have a slight changes to the existing requirements whereby the c_type "1 or 2 or 3" no longer just look at that specific day anymore. Instead, it will look at that specific day AND all day before that day.
Simply put, whenever there is value "1 or 2 or 3" in c_type appears on 03Feb2024, i will have to scan through all days on and before 03Feb2024.
Have
Obs trx_date amount c_type123456789
01JAN2024 | 100.0 | 1 |
01JAN2024 | 200.0 | 2 |
01JAN2024 | 300.0 | 3 |
01JAN2024 | 200.0 | 1 or 2 |
03FEB2024 | 1.0 | 1 |
03FEB2024 | 1.5 | 3 |
03FEB2024 | 2.0 | 1 or 2 or 3 |
31MAR2024 | 888.0 | 1 |
31MAR2024 | 1088.0 | 2 |
Final |
01JAN2024 | 100.0 | 1 |
01JAN2024 | 200.0 | 2 |
01JAN2024 | 300.0 | 3 |
01JAN2024 | 200.0 | 1 |
03FEB2024 | 1.0 | 2 |
03FEB2024 | 1.5 | 3 |
03FEB2024 | 2.0 | 1 |
31MAR2024 | 888.0 | 1 |
31MAR2024 | 1088.0 | 2 |
Can we slightly tune the code to cater to this?
OK.
Assuming "1 or 2 or 3" is always appeared at the end of a trx_date.
Otherwise, you need a RPOC SORT to order the data firstly.
Data mock;
length c_type $20;
trx_date='01Jan2024';
amount=100;
c_type='1';
output;
trx_date='01Jan2024';
amount=200;
c_type='2';
output;
trx_date='01Jan2024';
amount=300;
c_type='3';
output;
trx_date='01Jan2024';
amount=200;
c_type='1 or 3';
output;
trx_date='03Feb2024';
amount=1;
c_type='2';
output;
trx_date='03Feb2024';
amount=1.5;
c_type='3';
output;
trx_date='03Feb2024';
amount=2;
c_type='1 or 2 or 3';
output;
trx_date='31Mar2024';
amount=888;
c_type='1';
output;
trx_date='31Mar2024';
amount=1088;
c_type='2';
output;
run;
data want;
array _c{999999} $ 40 _temporary_;
set mock;
_c{_n_}=c_type;
if find(c_type,'or') then do;
if findw(c_type,'1',,'kd') and '1' in _c then c_type='1';
else if findw(c_type,'3',,'kd') and '3' in _c then c_type='3';
else if findw(c_type,'2',,'kd') and '2' in _c then c_type='2';
end;
run;
So just remove the BY group processing.
data have;
input trx_date :date. amount c_type $ 15-30 ;
format trx_date date9.;
cards;
01Jan2024 100 1
01Jan2024 200 2
01Jan2024 300 3
01Jan2024 200 1 or 3
03Feb2024 1 2
03Feb2024 1.5 3
03Feb2024 2 1 or 2 or 3
31Mar2024 888 1
31Mar2024 1088 2
;
You can just treat any value that cannot be converted to a number as one to trigger looking back for the "highest" previous value. You could use my previous array method.
data want ;
set have;
array found found1-found3 ;
retain found1-found3;
n_type = input(c_type,??32.);
if n_type in (1:3) then found[n_type]=1;
else do n_type=1,3,2;
if found[n_type] then leave;
end;
if n_type > 3 then n_type=.;
run;
Or just keep a single "highest" value variable to use.
data want2 ;
set have;
retain highest ;
array order [3] _temporary_ (2 3 1);
n_type = input(c_type,??32.);
if n_type=1 then highest=1;
else if n_type=3 and (highest ^= 1) then highest=3;
else if n_type=2 and (highest=.) then highest=2;
else n_type=coalesce(n_type,highest);
run;
Your example is probably not the best for testing. Since the first real value is 1 that means any missing values are going to be replaced by 1. To see if the strange ordering works you need different test data.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.