BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lydiawawa
Lapis Lazuli | Level 10

Hi All,

 

I'm merging very large datasets that have many variables. Is there an efficient way to find the variable with the most unique values for merging? Assuming there are still duplicates, I'm just trying to find the most unique value to prevent the merged dataset to be too large. I'm thinking maybe a program that prints out counts of unique values for each variable will be helpful so that I can compare and pick the variable with the most unique count. 

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.

View solution in original post

7 REPLIES 7
ballardw
Super User

Do you expect to have the same variables in both data sets? If they have the same names and variable types you might try:

 

Proc sql;
   create table want as 
   select *
   from have1 
        natural join
        have2    ;
quit;   

From the documentation:

A natural join selects rows from two tables that have equal values in columns that share the same name and the same type. A natural join is requested with the syntax NATURAL JOIN. If like columns are not found, then a cross join is performed. Do not use an ON clause with a natural join. When using a natural join, an ON clause is implied, matching all like columns.

 

 

 

ChrisNZ
Tourmaline | Level 20

Wow. You seems to be sliding down a very slippery slope.

 

You should know your data. There should a be a key. Or maybe keys.

A combination of keys to build a unique merging key may be your answer.

 

High-cardinality variables may be something as trivial as invoice amount. 

Would you really want to merge by that?

lydiawawa
Lapis Lazuli | Level 10

My problem is that we are still developing a method to de-dup the datasets, but I want to extract variables from different datasets for analysis coding that lead to merge. It is only a temporary solution, I'm trying to create a "bad dataset" that contains rough information I need just for analysis coding. Once we figure out the de-dup method we will update the dataset. Let me know if you have better suggestions to better my approach to the problem.

ChrisNZ
Tourmaline | Level 20

I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.

lydiawawa
Lapis Lazuli | Level 10
What I was trying to say is that, I'm aware there is no unique variable, but I'm in desperate need to extract variables from other datasets. Just trying to find a way to produce the least explosive merged dataset...
lydiawawa
Lapis Lazuli | Level 10
I will do that when I get back to work. Thank you for checking up on me.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2982 views
  • 5 likes
  • 3 in conversation