I have a problem with a Proc SQL join.
I have two data sets with 124,100 rows each. Data set A has two columns:
Contact_ID Place_of_Employment
There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.
The second data set Data set B has the following columns. These are sparse also.
The Place_of_employment in B matches to the Place_of-Employment in A.
Here are the columns for Dataset B:
Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor
Disabled More_than_one_Job Retired Unemployed__Not_working
Work_from_home_remotely
I used the following code
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field AS A Left Join SASCDC_2.Employment_Classify AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I would expect the query to take a few seconds but this is the result from the log
Proc sql noprint;
39 CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40 Select A.*, B.*
41 From SASCDC_2.Contact_Person_Employ_field AS A
42 Left Join SASCDC_2.Employment_Classify AS B
43 ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete. When this step
was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
percent.
Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
stopped.
NOTE: PROCEDURE SQL used (Total process time):
real time 25:19.26
cpu time 22:39.62
44 quit;
When I stopped the query 3,036,491,921 observations had been processed.
Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.
There is some problem here if over 3 billion obs were processed and the query was no where near to completion.
What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.
Thank you.
wklierman
Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
I have a problem with a Proc SQL join.
I have two data sets with 124,100 rows each. Data set A has two columns:
Contact_ID Place_of_Employment
There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.
The second data set Data set B has the following columns. These are sparse also.
The Place_of_employment in B matches to the Place_of-Employment in A.
Here are the columns for Dataset B:
Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor
Disabled More_than_one_Job Retired Unemployed__Not_working
Work_from_home_remotely
I used the following code
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field AS A Left Join SASCDC_2.Employment_Classify AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I would expect the query to take a few seconds but this is the result from the log
Proc sql noprint;
39 CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40 Select A.*, B.*
41 From SASCDC_2.Contact_Person_Employ_field AS A
42 Left Join SASCDC_2.Employment_Classify AS B
43 ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete. When this step
was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
percent.
Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
stopped.
NOTE: PROCEDURE SQL used (Total process time):
real time 25:19.26
cpu time 22:39.62
44 quit;
When I stopped the query 3,036,491,921 observations had been processed.
Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.
There is some problem here if over 3 billion obs were processed and the query was no where near to completion.
What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.
Thank you.
wklierman
DUPLICATE THREAD
Do not reply in this thread
Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...
@PaigeMiller wrote:
DUPLICATE THREAD
Do not reply in this thread
Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...
Merged.
Sparse data can cause this. Each missing in data set A is matched with each missing in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this:
From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
Thank you for responding. This how I altered the code with your code.
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I still got over 11MM obs as can be seen in the log.
Select A.*, B.*
48 From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS
48 ! A
49 Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
50 ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73
percent.
Compressed is 5015 pages; un-compressed would require 40858 pages.
NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns.
51 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.89 seconds
cpu time 5.54 seconds
Need some additional modification.
wklierman
Bottom line: Know your data.
You need to figure out which values have a high cardinality (repeated values) and cause the large Cartesian product.
Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
Thanks. You are right. It can't be accomplished. So I did it the old fashion way.
Exported only the contact_person_id to an xlsx worksheet. Then I copy / pasted the employment data (which had been exported earlier to an xlsx worksheet) to the contact_person_id and I have the dataset that I needed.
I should have thought of that work-around before.
Thank you all for your help.
wklierman
If the order of both datasets is already correct, then a MERGE without BY can also do that side-by-side lineup.
You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.