Re: Between two groups of variables, how many have the same value

TerryC · Posted 01-20-2020 04:24 PM

I am working on a dataset, which includes data both from the current year and from the previous year. The following is an example:

I currently have a dataset like this:

# CY_1 CY_2 CY_3 PY_1 PY_2 PY_3

1 c e g a b e

2 b c f a d g

3 b d e c d e

4 a c d a b e

I want to find out how many variables in CY_1, CY_2, and CY_3 equals valuables in PY_1, PY_2, and PY_3.

I hope to get the following output.

# CY_1 CY_2 CY_3 PY_1 PY_2 PY_3 output

1 c e g a b e 1

2 b c f a d g 0

3 b d e c d e 2

4 a c d a b e 1

Thank you!

novinosrin · Posted 01-20-2020 04:53 PM

HI @TerryC


data have;
input N    (CY_1 CY_2 CY_3         PY_1 PY_2 PY_3) ($);
cards;
1       c        e          g               a        b        e
2       b        c          f                a        d        g
3       b        d          e               c        d        e
4       a        c          d               a        b        e
;

data want;
 set have;
 want=countc(cats(of py_:),cats(of cy_:));
run;

The above assumes your values are letters just like your example

hashman · Posted 01-20-2020 05:06 PM

@TerryC:

Try this (under the assumption that none of the CY and PY contains a low-value "00"x as part of it):

data have ;                                                                                                                             
  input (CY_1-CY_3 PY_1-PY_3) ($) ;                                                                                                     
  cards ;                                                                                                                               
c  e  g    a  b  e                                                                                                                      
b  c  f    a  d  g                                                                                                                      
b  d  e    c  d  e                                                                                                                      
a  c  d    a  b  e                                                                                                                      
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  set have ;                                                                                                                            
  array cy cy: ;                                                                                                                        
  output = 0 ;                                                                                                                          
  do over cy ;                                                                                                                          
    output + ^^ find (catx ("00"x, of py:), trim (cy)) ;                                                                                
  end ;                                                                                                                                 
run ;

Kind regards

Paul D.

novinosrin · Posted 01-20-2020 05:38 PM

Guru @hashman Plagiarizing your idea, forgive me. This is something similar posted by you in another thread about a year and half ago :). Basically, it seems the 'K' modifier in COUNTW makes everything else a delimiter except the excerpt in 2nd argument. So copying your loop construct here, the following may work for "Words"

data have;
input N    (CY_1 CY_2 CY_3         PY_1 PY_2 PY_3) ($);
cards;
1       c        e          g               a        b        e
2       b        c          f                a        d        g
3       b        d          e               c        d        e
4       a        c          d               a        b        e
;

data want;
 set have;
 v1=cats(of py_:);
 array t cy_:;
  do over t;
    want=sum(countw(v1,strip(t),'k'),want);
  end;
 drop v1;
run;

k or K

causes all characters that are not in the list of characters to be treated as delimiters. If K is not specified, all characters that are in the list of characters are treated as delimiters.

EDIT: Please ignore. Sorry that I didn't realize it's all single chars

The above COUNTW is replace with COUNT


data want;
 set have;
 v1=cats(of py_:);
 array t cy_:;
  do over t;
    want=sum(count(v1,strip(t)),want);
  end;
 drop v1;
run;

TerryC · Posted 01-20-2020 10:01 PM

Thanks! It works.

Can I ask an additional question, which is similar to this one?

For example, I have the following strings. I want to get how many numbers these two strings have in common.

# String_1 String_2

1 |5|12|37|44| |8|12|44|

2 |45|67| |23|44|78|82|

3 |12|22|34|67|69|72|78| |34|37|72|78|

4 |2|14|33| |14|79|

I hope to get the following results.

# String_1 String_2 Output

1 |5|12|37|44| |8|12|44| 2

2 |45|67| |23|44|78|82| 0

3 |12|22|34|67|69|72|78| |34|37|72|78| 3

4 |2|14|33| |14|79| 1

Thank you!

unison · Posted 01-20-2020 05:13 PM

Another idea:

data have;
	input (cy1-cy3) ($) (py1-py3) ($);
	datalines;
c e g a b e
b c f a d g
b d e c d e
a c d a b e
;
run;

data want;
	set have;
	array cy[*] cy:;
	array py[*] py:;
	do _i=1 to dim(py);

		do _j=1 to dim(cy);
			flag=sum(flag, (py[_i]=cy[_j]));
		end;
	end;
	drop _:;
run;

-unison

mkeintz · Posted 01-21-2020 04:03 AM

Can a letter appear more than once in the CY's, or in the PY's?

If so, then if "a" appears twice in the CY's and once in the PY's, does that count as 1 match, or 2?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 01-22-2020 08:04 AM

data have;
	input (cy1-cy3) ($) (py1-py3) ($);
	datalines;
c e g a b e
b c f a d g
b d e c d e
a c d a b e
;
run;

data want;
	set have;
	array cy{*} cy:;
	array py{*} py:;
	n=0;
	do i=1 to dim(py);
	 if cy{i} in py then n+1;
	end;
run;

proc print;run;

hashman · Posted 01-22-2020 09:48 AM

@Ksharp: But indeed. I keep forgetting the convenience of IN against an array's name reference for searching the array!

Kind regards

Paul D.

Ksharp · Posted 01-23-2020 07:22 AM

So am I .Human brain offen fade away something as time goes by .

Classroom Training Available!