BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
shyunk
Fluorite | Level 6

Hi, I'd appreciate your help regarding following:

My data

PersonidCharCode 
111188888886666
222292000000001
33392000000002

 

Mylist: ('8888888', '92000000001')  -> this is a very long list in actual dataset consisting of character variable

 

If Charcode (character variable) in my data contains anything in Mylist (even partially, not exact match), then I want to code a new variable Myfind, where Myfind=1. If Charcode does not match anything in mylist, then Myfind will be 0. 

Desired Output

PersonidCharCodeMyfind
1111888888866661
2222920000000011
333920000000020

 

 

What I've tried:

data mydata;
ATTRIB Myfind Length=3 Label='Yes/no indicator';
set pharmdat2;
If index ((CharCode), '8888888')>0 Or If index ((CharCode), '92000000001')>0 
THEN Myfind=1;
ELSE Myfind=0;
run;

This works, but very tedious. The fourth line is where I need to loop through without me going in and specifying every item in Mylist. Is there a way to modify just this part?

 

I've also tried:

%Let Mylist= ('8888888', '92000000001')

data mydata;
ATTRIB Myfind Length=3 Label='Yes/no indicator';
set pharmdat2;
If CharCode in: &Mylist
THEN Myfind=1;
ELSE Myfind=0;
run;

But this code returns Myfind=0 for all rows, although there is a Charcode to Mylist match that I know exists in this dataset dataset.

 

I also couldn't use like statement combined with '%Mylistitem%' because this is not a where statement.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

hi @shyunk 

 

This could be bone using several techniques besides the tedious writing of all check values.

 

Using a loop as you suggest is complicated. All check values should be read into the data step as a text array and the the actual values in each observation should be matched against all array values. 

 

Using a hash object would probably possible also, but it never got into my personal toolbox, so I will leave that to one of the hash gurus in this forum.

 

My preferred solution in cases like this is a SQL join. A SAS ifn-function is used to provide the value of MyFind based on a match or not. In this example the match is made on the shortest string from start, but the EQT operator could be switched to a CONTAINS to cover partial matched within the string as well.

 

data mydata;
	PersonID = 1111; CharCode = '88888886666'; output;
	PersonID = 1122; CharCode = '88888886789'; output;
	PersonID = 1133; CharCode = '88888856780'; output;
	PersonID = 2222; CharCode = '92000000001'; output;
	PersonID = 333;  CharCode = '92000000002'; output;
run;

data mylist;
	length CheckCode $20;
	CheckCode = '8888888'; output;
	CheckCode = '92000000001'; output;
run;

proc sql;
	create table want as
		select distinct
			a.PersonID,
			a.CharCode,
			ifn(b.CheckCode ne '',1,0) as MyFind
		from mydata as a
		left join mylist as b
		on a.CharCode eqt b.CheckCode;
quit;

Result:

2021-07-02 20_52_14-Window.gif

 

 

View solution in original post

7 REPLIES 7
ErikLund_Jensen
Rhodochrosite | Level 12

hi @shyunk 

 

This could be bone using several techniques besides the tedious writing of all check values.

 

Using a loop as you suggest is complicated. All check values should be read into the data step as a text array and the the actual values in each observation should be matched against all array values. 

 

Using a hash object would probably possible also, but it never got into my personal toolbox, so I will leave that to one of the hash gurus in this forum.

 

My preferred solution in cases like this is a SQL join. A SAS ifn-function is used to provide the value of MyFind based on a match or not. In this example the match is made on the shortest string from start, but the EQT operator could be switched to a CONTAINS to cover partial matched within the string as well.

 

data mydata;
	PersonID = 1111; CharCode = '88888886666'; output;
	PersonID = 1122; CharCode = '88888886789'; output;
	PersonID = 1133; CharCode = '88888856780'; output;
	PersonID = 2222; CharCode = '92000000001'; output;
	PersonID = 333;  CharCode = '92000000002'; output;
run;

data mylist;
	length CheckCode $20;
	CheckCode = '8888888'; output;
	CheckCode = '92000000001'; output;
run;

proc sql;
	create table want as
		select distinct
			a.PersonID,
			a.CharCode,
			ifn(b.CheckCode ne '',1,0) as MyFind
		from mydata as a
		left join mylist as b
		on a.CharCode eqt b.CheckCode;
quit;

Result:

2021-07-02 20_52_14-Window.gif

 

 

shyunk
Fluorite | Level 6

Hi @ErikLund_Jensen.

 

Thank you again for your help. I realized the code isn't doing exactly what I intended, when I ran into this issue. Do you have any thoughts on why this may be happening?

 

 

 

 

data mydata;
	PersonID = 11; CharCode = '68000000000'; output;
	PersonID = 22; CharCode = '70710120901'; output;
	PersonID = 33; CharCode = '69238131906'; output;
	PersonID = 44; CharCode = '31722065832'; output;
	PersonID = 55;  CharCode = '43598043060'; output;
run;


data mylist;
INPUT CheckCode $;
	length CheckCode $20;
	Datalines;
        68462030868
        70710120407
        69238131906
        31722065832
        43598043060
;; 
run;

proc sql;
	create table want as
		select distinct
			a.PersonID,
			a.CharCode,
			ifn(b.CheckCode ne '',1,0) as MyFind
		from mydata as a
		left join mylist as b
		on a.CharCode contains b.CheckCode;
quit;

 

 

 

 

This returns below, where CharCode is trimmed to 8 letters, and it cuts out the last three digits of everything. Therefore, although PersonID 22, 33 in mydata is not a match because of its last three digits, it has 1 for Myfind. I checked the length of Charcode, and it's set to 30 now. This is the same for my real (longer) dataset and my mock dataset. I would appreciate some guidance why SAS (or SQL) is automatically trimming it to 8 letters here. I want to have full 11 letters.

PersonIDCharCodeMyfind
11680000000
22707101201
33692381311
44317220651
55435980431

Output Table 1

 

When I use eqt or = instead of contains, using my real dataset, the output is 0 for all myfind where it is unable to find any match (Output table2). Interestingly, when I use my mock dataset I get the same 8 letter truncated table as Output table 1

 on a.CharCode eqt b.CheckCode;
on a.CharCode = b.CheckCode;

 

PersonIDCharCodeMyfind
11680000000
22707101200
33692381310
44317220650
55435980430

Output Table 2

 

Your help is much appreciated!

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @shyunk 

 

I made two small changes. The first is to get the full check value by using a $20 informat. The other reverses the comparison so checkcode contains charcodee instead of the reverse. I Hope that helps.

 

data mydata;
	PersonID = 11; CharCode = '68000000000'; output;
	PersonID = 22; CharCode = '70710120901'; output;
	PersonID = 33; CharCode = '69238131906'; output;
	PersonID = 44; CharCode = '31722065832'; output;
	PersonID = 55;  CharCode = '43598043060'; output;
run;


data mylist;
	INPUT CheckCode $20.;
	/*length CheckCode $20;*/
	Datalines;
        68462030868
        70710120407
        69238131906
        31722065832
        43598043060
;; 
run;

proc sql;
	create table want as
		select distinct
			a.PersonID,
			a.CharCode,
			ifn(b.CheckCode ne '',1,0) as MyFind
		from mydata as a
		left join mylist as b
		/*on a.CharCode contains b.CheckCode;*/
		on b.CheckCode contains a.CharCode;
quit;

charcode.gif

 

shyunk
Fluorite | Level 6

@ErikLund_Jensen Thank you, I realized I was setting the informat incorrectly for mydata (it was too long), which was why SAS was trying to truncate it. Now it's doing what I intended it to do. I learned a lot from your answers and cannot thank you enough!

Tom
Super User Tom
Super User

You have these two statements backwards:

INPUT CheckCode $;
length CheckCode $20;

By referencing CHECKCODE in the INPUT statement before you have defined it SAS will have to use the default length of $8 for the variable.

 

Also these lines should NOT be indented. It will just confuse you and make it harder for you to notice how long the values actual are to intend those lines.

datalines;
68462030868
70710120407
69238131906
31722065832
43598043060
; 

The extra RUN: statement between the end of the data step and the start of the PROC SQL is also not useful, but it does not really do much harm other than it might confuse some programmers into thinking they can include statements for the data step in between the end of the in-line data and that superfluous RUN statement. 

shyunk
Fluorite | Level 6
Thank you! Those are great suggestions.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2652 views
  • 3 likes
  • 3 in conversation