BookmarkSubscribeRSS Feed
sasmaverick
Obsidian | Level 7

I have two datasets, one with 3000 records and other with 30000 records. I need to join the datasets so that the variables in the datasets are side by side. However, I do not have a primary key to join on.

The result should have the number of observations of the larger dataset.

Thanks,

6 REPLIES 6
Reeza
Super User

So how are you repeating the data? Or past 3000 the observations are missing?

Some sample data would be great.

You can merge side by side directly with a MERGE and no BY statement. Make sure you have no variables with the same names, or that you understand how they'll be handled.

data test;

merge sashelp.cars(obs=30 keep=make model mpg_city) sashelp.class;

run;

jwillis
Quartz | Level 8

sasmaverick,

An example of your data would help.  Potentially you could have 30,000 rows with all the columns from the 3000 row table.  In the potential table you could have 27,000 rows with missing values in the columns from the 3000 row table.  If you create a mockup of what you have and what you want to have, it will be easier to help you.

Astounding
PROC Star

How about this as an approach:

data want;

   set large_file;

   random_obs = ceil(_nobs_ * ranuni(12345));

   set small_file point=random_obs nobs=_nobs_;

   drop random_obs;

run;

For each record in the larger file, it randomly selects one of the records in the smaller file and outputs.  Naturally, each record in the smaller file can appear multiple times in the final result.  And since the selection is random, it is possible that one of the records in the smaller file might not be randomly selected and thus might not appear in the final result.

sasmaverick
Obsidian | Level 7

Thanks all for your replies.

@jwillis Following is the sample of datasets I have:

Larger dataset:

Name      Site                     City                Amount

Arthur     Madrid-JMC          Madrid          $40000

Simon     San Diego-Main     San Diego     $20000

......30,000 rows

Smaller Dataset:

Person_Name     Site_Name     City          Amount

Arthur               Madrid-JMC     Madrid        $40000

James               Paris-ANC     Paris          $15000

.....3000 rows

Expected Output: (Need to create a new variable called Match_Type--based on matching Names, but don't want to join based on Name)

Name      Person_Name    Site                     Site_Name      City                  Amount           Match_Type

Arthur        Arthur          Madrid-JMC          Madrid-JMC     Madrid                  $40000          Match

Simon                           San Diego-Main                           San Diego            $20000          No Match

                James                                       Paris-ANC       Paris                   $15000         No Match

Note: Simon and James are only present in their respective datasets.

Astounding
PROC Star

So what do you want to do on the 3,001 observation, when your smaller data set has run out of observations?

rajan_sharma
Calcite | Level 5

proc sort data = large;
by name;
run;

 

proc sort data = small;
by person_name;
run;

 

data merged;
retain name person_name;
merge large(in= l) small(rename = (person_name=name) in = s);
by name;
length status$ 8;
if s then do;
person_name = name;
if l then status = "match";
else status = "no match";
end;
else do;
person_name = "";
status = "no match";
end;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 9375 views
  • 0 likes
  • 5 in conversation