Hi All,
I'm merging very large datasets that have many variables. Is there an efficient way to find the variable with the most unique values for merging? Assuming there are still duplicates, I'm just trying to find the most unique value to prevent the merged dataset to be too large. I'm thinking maybe a program that prints out counts of unique values for each variable will be helpful so that I can compare and pick the variable with the most unique count.
Thank you!
I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.
Do you expect to have the same variables in both data sets? If they have the same names and variable types you might try:
Proc sql; create table want as select * from have1 natural join have2 ; quit;
From the documentation:
A natural join selects rows from two tables that have equal values in columns that share the same name and the same type. A natural join is requested with the syntax NATURAL JOIN. If like columns are not found, then a cross join is performed. Do not use an ON clause with a natural join. When using a natural join, an ON clause is implied, matching all like columns.
Wow. You seems to be sliding down a very slippery slope.
You should know your data. There should a be a key. Or maybe keys.
A combination of keys to build a unique merging key may be your answer.
High-cardinality variables may be something as trivial as invoice amount.
Would you really want to merge by that?
My problem is that we are still developing a method to de-dup the datasets, but I want to extract variables from different datasets for analysis coding that lead to merge. It is only a temporary solution, I'm trying to create a "bad dataset" that contains rough information I need just for analysis coding. Once we figure out the de-dup method we will update the dataset. Let me know if you have better suggestions to better my approach to the problem.
I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.
Have you looked at NLEVELS?
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.