I have a problem with a Proc SQL join.
I have two data sets with 124,100 rows each. Data set A has two columns:
Contact_ID Place_of_Employment
There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.
The second data set Data set B has the following columns. These are sparse also.
The Place_of_employment in B matches to the Place_of-Employment in A.
Here are the columns for Dataset B:
Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor
Disabled More_than_one_Job Retired Unemployed__Not_working
Work_from_home_remotely
I used the following code
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field AS A Left Join SASCDC_2.Employment_Classify AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I would expect the query to take a few seconds but this is the result from the log
Proc sql noprint; 39 CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS 40 Select A.*, B.* 41 From SASCDC_2.Contact_Person_Employ_field AS A 42 Left Join SASCDC_2.Employment_Classify AS B 43 ON A.Place_of_Employment = B.Place_of_Employment; WARNING: Variable Place_of_Employment already exists on file SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS. NOTE: SAS threaded sort was used. NOTE: There were 124803 observations read from the data set SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD. NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY. WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete. When this step was stopped there were 3036491921 observations and 11 variables. NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52 percent. Compressed is 1061686 pages; un-compressed would require 11204768 pages. WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was stopped. NOTE: PROCEDURE SQL used (Total process time): real time 25:19.26 cpu time 22:39.62 44 quit;
When I stopped the query 3,036,491,921 observations had been processed.
Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.
There is some problem here if over 3 billion obs were processed and the query was no where near to completion.
What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.
Thank you.
wklierman
Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
I have a problem with a Proc SQL join.
I have two data sets with 124,100 rows each. Data set A has two columns:
Contact_ID Place_of_Employment
There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.
The second data set Data set B has the following columns. These are sparse also.
The Place_of_employment in B matches to the Place_of-Employment in A.
Here are the columns for Dataset B:
Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor
Disabled More_than_one_Job Retired Unemployed__Not_working
Work_from_home_remotely
I used the following code
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field AS A Left Join SASCDC_2.Employment_Classify AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I would expect the query to take a few seconds but this is the result from the log
Proc sql noprint; 39 CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS 40 Select A.*, B.* 41 From SASCDC_2.Contact_Person_Employ_field AS A 42 Left Join SASCDC_2.Employment_Classify AS B 43 ON A.Place_of_Employment = B.Place_of_Employment; WARNING: Variable Place_of_Employment already exists on file SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS. NOTE: SAS threaded sort was used. NOTE: There were 124803 observations read from the data set SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD. NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY. WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete. When this step was stopped there were 3036491921 observations and 11 variables. NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52 percent. Compressed is 1061686 pages; un-compressed would require 11204768 pages. WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was stopped. NOTE: PROCEDURE SQL used (Total process time): real time 25:19.26 cpu time 22:39.62 44 quit;
When I stopped the query 3,036,491,921 observations had been processed.
Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.
There is some problem here if over 3 billion obs were processed and the query was no where near to completion.
What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.
Thank you.
wklierman
DUPLICATE THREAD
Do not reply in this thread
Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...
@PaigeMiller wrote:
DUPLICATE THREAD
Do not reply in this thread
Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...
Merged.
Sparse data can cause this. Each missing in data set A is matched with each missing in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this:
From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
Thank you for responding. This how I altered the code with your code.
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I still got over 11MM obs as can be seen in the log.
Select A.*, B.* 48 From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS 48 ! A 49 Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B 50 ON A.Place_of_Employment = B.Place_of_Employment; WARNING: Variable Place_of_Employment already exists on file SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS. NOTE: SAS threaded sort was used. NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73 percent. Compressed is 5015 pages; un-compressed would require 40858 pages. NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns. 51 quit; NOTE: PROCEDURE SQL used (Total process time): real time 5.89 seconds cpu time 5.54 seconds
Need some additional modification.
wklierman
Bottom line: Know your data.
You need to figure out which values have a high cardinality (repeated values) and cause the large Cartesian product.
Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
Thanks. You are right. It can't be accomplished. So I did it the old fashion way.
Exported only the contact_person_id to an xlsx worksheet. Then I copy / pasted the employment data (which had been exported earlier to an xlsx worksheet) to the contact_person_id and I have the dataset that I needed.
I should have thought of that work-around before.
Thank you all for your help.
wklierman
If the order of both datasets is already correct, then a MERGE without BY can also do that side-by-side lineup.
You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.