BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lydiawawa
Lapis Lazuli | Level 10

Hi All,

 

I'm merging very large datasets that have many variables. Is there an efficient way to find the variable with the most unique values for merging? Assuming there are still duplicates, I'm just trying to find the most unique value to prevent the merged dataset to be too large. I'm thinking maybe a program that prints out counts of unique values for each variable will be helpful so that I can compare and pick the variable with the most unique count. 

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.

View solution in original post

7 REPLIES 7
ballardw
Super User

Do you expect to have the same variables in both data sets? If they have the same names and variable types you might try:

 

Proc sql;
   create table want as 
   select *
   from have1 
        natural join
        have2    ;
quit;   

From the documentation:

A natural join selects rows from two tables that have equal values in columns that share the same name and the same type. A natural join is requested with the syntax NATURAL JOIN. If like columns are not found, then a cross join is performed. Do not use an ON clause with a natural join. When using a natural join, an ON clause is implied, matching all like columns.

 

 

 

ChrisNZ
Tourmaline | Level 20

Wow. You seems to be sliding down a very slippery slope.

 

You should know your data. There should a be a key. Or maybe keys.

A combination of keys to build a unique merging key may be your answer.

 

High-cardinality variables may be something as trivial as invoice amount. 

Would you really want to merge by that?

lydiawawa
Lapis Lazuli | Level 10

My problem is that we are still developing a method to de-dup the datasets, but I want to extract variables from different datasets for analysis coding that lead to merge. It is only a temporary solution, I'm trying to create a "bad dataset" that contains rough information I need just for analysis coding. Once we figure out the de-dup method we will update the dataset. Let me know if you have better suggestions to better my approach to the problem.

ChrisNZ
Tourmaline | Level 20

I am unsure I understand your goal, but to answer your original question, have a look at the NLEVELS option for PROC FREQ to compute the variable cardinality.

lydiawawa
Lapis Lazuli | Level 10
What I was trying to say is that, I'm aware there is no unique variable, but I'm in desperate need to extract variables from other datasets. Just trying to find a way to produce the least explosive merged dataset...
lydiawawa
Lapis Lazuli | Level 10
I will do that when I get back to work. Thank you for checking up on me.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1551 views
  • 5 likes
  • 3 in conversation