Solved: Re: How to find the most unique/distinct variable?

lydiawawa · Posted 04-25-2019 06:35 PM

Hi All,

I'm merging very large datasets that have many variables. Is there an efficient way to find the variable with the most unique values for merging? Assuming there are still duplicates, I'm just trying to find the most unique value to prevent the merged dataset to be too large. I'm thinking maybe a program that prints out counts of unique values for each variable will be helpful so that I can compare and pick the variable with the most unique count.

Thank you!

ChrisNZ · Posted 04-25-2019 08:04 PM

I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.

High-Performance SAS Coding - Third Edition

View solution in original post

ballardw · Posted 04-25-2019 06:57 PM

Do you expect to have the same variables in both data sets? If they have the same names and variable types you might try:

Proc sql;
   create table want as 
   select *
   from have1 
        natural join
        have2    ;
quit;

From the documentation:

A natural join selects rows from two tables that have equal values in columns that share the same name and the same type. A natural join is requested with the syntax NATURAL JOIN. If like columns are not found, then a cross join is performed. Do not use an ON clause with a natural join. When using a natural join, an ON clause is implied, matching all like columns.

ChrisNZ · Posted 04-25-2019 07:02 PM

Wow. You seems to be sliding down a very slippery slope.

You should know your data. There should a be a key. Or maybe keys.

A combination of keys to build a unique merging key may be your answer.

High-cardinality variables may be something as trivial as invoice amount.

Would you really want to merge by that?

High-Performance SAS Coding - Third Edition

lydiawawa · Posted 04-25-2019 07:59 PM

My problem is that we are still developing a method to de-dup the datasets, but I want to extract variables from different datasets for analysis coding that lead to merge. It is only a temporary solution, I'm trying to create a "bad dataset" that contains rough information I need just for analysis coding. Once we figure out the de-dup method we will update the dataset. Let me know if you have better suggestions to better my approach to the problem.

ChrisNZ · Posted 04-25-2019 08:04 PM

I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.

High-Performance SAS Coding - Third Edition

lydiawawa · Posted 04-25-2019 08:12 PM

What I was trying to say is that, I'm aware there is no unique variable, but I'm in desperate need to extract variables from other datasets. Just trying to find a way to produce the least explosive merged dataset...

ChrisNZ · Posted 04-25-2019 09:32 PM

Have you looked at NLEVELS?

High-Performance SAS Coding - Third Edition

lydiawawa · Posted 04-25-2019 10:43 PM

I will do that when I get back to work. Thank you for checking up on me.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away