BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Xusheng
Obsidian | Level 7

Hi community members,

I have a question regarding comparing the length of variables from two data sets but need help with the coding. Here are the examples of my two sets. I just create them randomly.

Data A;
	input ID var1 $ var2 $;
	datalines;
1 go am
2 between sas
3 my going
4 the xyz
;
run;
Data B;
	input Obs Var3 $ Var4 $;
	datalines;
1 thy html
2 is lol
3 bedween pm
4 my preview
5 gone yxz
;
run;

Say I want to compare the length of variables between var1 from data A and var3 from data B with the following code which would allow me to create a table of outcomes:

 IF LENGTH(VAR1)>LENGTH(VAR3) THEN Z3=INDEX(VAR1,VAR3);                                                                                            

  ELSE Z3=INDEX(VAR3,VAR1);                                                                                                                   

  IF Z3>0 THEN POSSIBLE_MATCH='Y';                                                                                                      

  ELSE POSSIBLE_MATCH='N';                                                                                                              

OUTPUT;   
RUN;

My questions are: 1. does SAS allow me to compare the variables from two different data set?

2. The data A does not have the same number of variables as data B, does that affect the result of comparing?

3. How to do this in a macro manner if I have a larger dataset?

I appreciate any comment or suggestion.

Thank you.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

If you're trying to do fuzzy matching/probabilistic linkages I would recommend the post here:

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780

 

I find the solution from FriedEgg to be a useful one.

View solution in original post

8 REPLIES 8
Astounding
PROC Star

You need a plan ... a detailed plan ... before you start to program anything.  For example ...

 

Your sample code contains an OUTPUT statement.  OUTPUT to where?  What's the name of the data set that will hold whatever it is you want to output?  What variables will it contain?  What observations?

 

It looks like you want to compare VAR1 to VAR3.  Does that mean every VAR1 gets compared to every VAR3?

 

Your code ignores some basic concepts.  All your character variables have a length of 8.  So if a variable contains "my" it actually contains 6 blanks as well.  What will the INDEX function do with those extra blanks?  What should happen with a partial match such as "my" vs. "grammy" ?

Xusheng
Obsidian | Level 7

Thank you for replying Astounding.

My plan is to compare VAR 1 and VAR3, yes, every VAR1 compare with every VAR3. I could say match VAR1 and VAR3. To see if they are matched based on their length.

I'm not confident with my sample code in terms of the comparing. So any suggestion I will appreciate. 

Here is the whole story, the idea is based on my another post where I want to improve the matching outcome but hasn't found a solution yet. Here is the link to the post: https://communities.sas.com/t5/SAS-Data-Management/Improving-fuzzy-match-quality/m-p/373823#M11329

Therefore I want to use another method to compare the name such as "TOYS R US PROPERTY CO I" and "TOYS R U". Here, if I use the length to compare, it would give me a more reasonable outcome than compged function (results are posted in my fuzzy matching post). 

However, I don't know if sas could do the comparison between two data sets and I lack the knowledge of macro so try to find a way to do it using macro. Therefore I post this question to see if there is any suggestion.

DATA A;                                                                                                                                 

  X='TOYS R US PROPERTY CO I';                                                                                                          

  Y='TOYS R U';                                                                                                                         

  IF LENGTH(X)>LENGTH(Y) THEN Z3=INDEX(X,Y);                                                                                            

  ELSE Z3=INDEX(Y,X);                                                                                                                   

  IF Z3>0 THEN POSSIBLE_MATCH='Y';                                                                                                      

  ELSE POSSIBLE_MATCH='N';                                                                                                              

OUTPUT;                                                                                                                                 

  X='CHEMOURS CO';                                                                                                                      

  Y='CARMIKE';                                                                                                                          

  IF LENGTH(X)>LENGTH(Y) THEN Z3=INDEX(X,Y);                                                                                            

  ELSE Z3=INDEX(Y,X);                                                                                                                   

  IF Z3>0 THEN POSSIBLE_MATCH='Y';                                                                                                      

  ELSE POSSIBLE_MATCH='N';                                                                                                              

OUTPUT;                                                                                                                                 

RUN;                                                                                                                                    

                                                                                                                                        

PROC PRINT;RUN;
Reeza
Super User

If you're trying to do fuzzy matching/probabilistic linkages I would recommend the post here:

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780

 

I find the solution from FriedEgg to be a useful one.

Astounding
PROC Star

As @Reeza mentioned, there is a lot of work in the literature about fuzzy matching.  You don't have to start from scratch.

 

As you begin, first consider that the number of observations balloons if you compare every observation in A to every observation in B.  If each starts with 10,000 observations, you end up with 100,000,000 observations.  So consider whether you would like to perform some sort of subsetting.  The easiest way to get all 100,000,000 would be:

 

proc sql;

create table huge as select * from a, b;

quit;

 

To reduce the number to matches based on the examples you mentioned, you could use:

 

proc sql;

create table medium as select * from a, b

where index(a.var1, trim(b.var3)) > 0 or index(b.var3, trim(a.var1)) > 0;

quit;

 

However, this is a very limited type of match and won't account for many spelling variations.  Finding better matching techniques ... that's where the literature search will come in handy.

Xusheng
Obsidian | Level 7

Hi Astounding, as I applied the code that Reeze recommended, the outcome did improve. I will continue to try different methods to see which one would yield the most acceptable result.

I think another reason is the data is not cleaned enough to apply the fuzzying matching method. I will look at this kind of literature as well. 

Xusheng
Obsidian | Level 7

Hi Reeza, thank you for your advice. I applied the code to my data and it did improve the outcome of matching. However, since it's  the first time I've heard sql, I'm not able to modify the code. I'm wondering if you have any idea about how to include the variables I want to the match table under sql? Here is my example.

The result from original code:

"A                                B                  Match_type

 TOYS R US       TOYS R US          exact          "

The result I want to achieve:

"A                        gvkeyA                  B             gvkeyB       Match_type

 TOYS R US       1000            TOYS R US       1000          exact              "

where gvkeyA is from data A, 1000 is the gvkey related with TOYS R US; gvkeyB is from data B and 1000 is the data related to TOYS R US in data B.

Reeza
Super User

@Xusheng wrote:

Hi Reeza, thank you for your advice. I applied the code to my data and it did improve the outcome of matching. However, since it's  the first time I've heard sql, I'm not able to modify the code.

  

If you're going to implement, use and/or recommend a solution in either your work or homework, you need to understand what it's doing and how to change it. I would suggest some SQL tutorials by searching LexJansen.com. 

 

I think you need to change the SELECT items, to include what you want, as well as the JOIN or WHERE condition to ensure that GVKEY matches as well as the 'fuzzy' match.

 

Xusheng
Obsidian | Level 7

Thank you Reeza, I will check the Lex Jansen's web and your advice.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1577 views
  • 0 likes
  • 3 in conversation