Hi,
I've been scratching my head for the past few hours and thought an expert can help me. I have a list of names that belong to the same common entity. I want to create unique 2 letter short codes using the names. I've written a program that should work (see result on row 2, where Harriet would create HA but because it was a duplicate of Ham, it removed A from the name to derive HR instead) but does not work for the subsequent record. Harry should have had A and R removed to derive HY
| short_code | nam_1 | iso_a2 | tmp_cd | last_tmp_cd | nam_1_s | in_prev | do_cnt | to_rm |
| HA | Ham | AL | HA | HAM | 0 | 1 | ||
| HR | Harriet | AL | HA,HR | HA | HRRIET | 0 | 2 | A |
| HA | Harry | AL | HA,HR,HA | HA,HR | HARRY | 0 | 1 | |
| HE | Hello | AL | HA,HR,HA,HE | HA,HR,HA | HELLO | 0 | 1 | |
| HI | Hitachi | AL | HA,HR,HA,HE,HI | HA,HR,HA,HE | HITACHI | 0 | 1 | |
| HO | Hog | AL | HA,HR,HA,HE,HI,HO | HA,HR,HA,HE,HI | HOG | 0 | 1 | |
| HG | Hogwarth | AL | HA,HR,HA,HE,HI,HO,HG | HA,HR,HA,HE,HI,HO | HGWARTH | 0 | 2 | O |
| HO | Holler | AL | HA,HR,HA,HE,HI,HO,HG,HO | HA,HR,HA,HE,HI,HO,HG | HOLLER | 0 | 1 | |
| HU | Hummer | AL | HA,HR,HA,HE,HI,HO,HG,HO,HU | HA,HR,HA,HE,HI,HO,HG,HO | HUMMER | 0 | 1 |
Here is a full working example that produced the results above. Apologies in advance as I had to put a lot of comments to make sure my head is working correctly! I also added a few failsafes and bits and bobs to troubleshoot.
/*list of names*/
data tmp;
length nam_1 $50;
input nam_1;
datalines;
Harry
Hummer
Harriet
Hog
Ham
Hello
Hogwarth
Hitachi
Holler
;
run;
/*add a common field*/
data tmp;
length short_code $5;
set tmp;
iso_a2="AL";
run;
/*objective: for the same common field (e.g. AL) short_code should be unique. short_code is derived from nam_1 (2 char)*/
/*short_code is all empty*/
proc sort data=tmp;
by iso_a2 descending short_code nam_1;
run;
data out;
set tmp;
by iso_a2 descending short_code nam_1;
length tmp_cd last_tmp_cd $500;
retain tmp_cd; /*i want to retain a list of short_codes that has been assigned for the same common field so i should not assign it again*/
if first.iso_a2 then do; /*for every unique common fieldm, reset my short_code list*/
tmp_cd=short_code;
end;
/*begin creating short_code*/
last_tmp_cd=tmp_cd; /*assign short_code from previous row to a new field to debug*/
nam_1_s=upcase(nam_1); /*assign name to temporary var to be manipulated if required*/
if short_code="" then do; /*assign short_code*/
in_prev=1; /*do at least once*/
do_cnt=0; /*failsafe for infinite loop*/
do until (in_prev=0 or do_cnt=6);
do_cnt=do_cnt+1; /*failsafe for infinite loop*/
if do_cnt=1 then short_code=upcase(substr(nam_1_s,1,2)); /*if first loop, then simple assignment. take first 2 char of nam_1_s as short code. this will introduce dups*/
else do; /*if not first loop it means there was a duplicate. now attempt to remove the char from nam_1_s*/
to_rm=substr(strip(reverse(short_code)),1,1); /*extract the 2nd char of short_code and use this to remove this char from nam_1_s*/
nam_1_s=cats(substr(nam_1_s,1,1),compress(substr(nam_1_s,2), to_rm, 'i')); /*keep first char and attempt to remove the next char*/
short_code=upcase(substr(nam_1_s,1,2));
put last_tmp_cd=;
put short_code=;
put nam_1_s=;
end;
/*as in you see in the logs, it works for the first duplicate short_code. but subsequent short_codes fail to get index value>0 even though th short_code value is in last_tmp_cd*/
in_prev=index(last_tmp_cd, short_code);
end;
end;
tmp_cd=catx(",",tmp_cd,short_code); /*add assigned short_code to the list so next short_code will not use the same value*/
run;
Most likely the issue is using INDEX() when you actually want INDEXW() or better still the FINDW() function because of the TRIME modifier it allows.
data have;
input nam_1 :$50. iso_a2 :$2. ;
cards;
Ham AL
Harriet AL
Harry AL
Hello AL
Hitachi AL
Hog AL
Hogwarth AL
Holler AL
Hummer AL
;
data want;
set have;
by iso_a2 ;
length short_code $2 list $100 ;
retain list;
if first.iso_a2 then call missing(list);
do index=2 to length(nam_1) until(not found);
short_code=upcase(cats(char(nam_1,1),char(nam_1,index)));
found = findw(list,short_code,',','it');
end;
list=catx(',',list,short_code);
run;
What is it you want to do when you get to something where there is no unique second letter to use?
What should happen if the same name appears multiple times?
And what should happen if you run out of characters from the name while building the short code? Would you try to use any character from the alphabet?
Instead of using a RETAINed character variable, which is limited in length (with longer codes and source "names" you could crack the 32K limit), I recommend to use a hash object, which makes finding an existing entry that much easier (and faster, as a binary tree search will always outperform a sequential scan) and which is limited in size only by available memory.
data have;
length nam_1 $50;
input nam_1;
iso_a2 = "AL";
datalines;
Harry
Hummer
Harriet
Hog
Ham
Hello
Hogwarth
Hitachi
Holler
;
proc sort data=have;
by iso_a2 nam_1;
run;
data want;
set have;
by iso_a2;
length short_code $2;
if _n_ = 1
then do;
declare hash code ();
code.definekey("short_code");
code.definedata("short_code");
code.definedone();
end;
if first.iso_a2 then code.clear();
nam_u = upcase(nam_1);
short_code = substr(nam_u,1,2);
if code.find() ne 0
then code.add();
else do;
i1 = 2;
i2 = 64;
do until (code.find() ne 0 or i2 gt 90);
if i1 lt length(nam_u)
then do;
i1 + 1;
short_code = substr(nam_u,1,1) !! substr(nam_u,i1,1);
end;
else do;
i2 + 1;
short_code = substr(nam_u,1,1) !! byte(i2);
end;
end;
if i2 gt 90
then putlog "run out of characters";
else code.add();
end;
drop nam_u i1 i2;
run;
Most likely the issue is using INDEX() when you actually want INDEXW() or better still the FINDW() function because of the TRIME modifier it allows.
data have;
input nam_1 :$50. iso_a2 :$2. ;
cards;
Ham AL
Harriet AL
Harry AL
Hello AL
Hitachi AL
Hog AL
Hogwarth AL
Holler AL
Hummer AL
;
data want;
set have;
by iso_a2 ;
length short_code $2 list $100 ;
retain list;
if first.iso_a2 then call missing(list);
do index=2 to length(nam_1) until(not found);
short_code=upcase(cats(char(nam_1,1),char(nam_1,index)));
found = findw(list,short_code,',','it');
end;
list=catx(',',list,short_code);
run;
What is it you want to do when you get to something where there is no unique second letter to use?
@dylanleong78 wrote:
thanks a lot for this! sorry it took me awhile to respond. i wanted to make sure applying this logic in my full implementation works.
Note that there is a lot of simplification in the code I posted.
And a comment on comments. If you place the comments that explain the code BEFORE the code instead of after it then it is much easier to understand the code. Plus it will keep you from making lines that are too long for humans to read. With long lines in addition to scanning down the page you have to keep sweeping the field of focus left and right.
BTW your issue with INDEX() comes from defining short_code with a length of 5.
in_prev=index(last_tmp_cd, short_code);
will therefore only work for the first instance, because then you have 2 characters and 3 blanks in the first 5 bytes of your variable.
See Maxim 46!! Always define variables only for the needed length, and/or use TRIM() to get rid of the blanks at the end.
@dylanleong78 wrote:
thanks. i tried putting short_code=strip(upcase(substr(nam_1_s,1,2)));
but it did not work. then i placed the strip in in_prev=index(last_tmp_cd, short_code); and that worked!
The reason that putting the STRIP() function call in an ASSIGNMENT statement did not help is because the trailing spaces were added back when the value was stored into the fixed length character variable.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.