I have a problem with a Proc SQL join.
I have two data sets with 124,100 rows each. Data set A has two columns:
Contact_ID Place_of_Employment
There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.
The second data set Data set B has the following columns. These are sparse also.
The Place_of_employment in B matches to the Place_of-Employment in A.
Here are the columns for Dataset B:
Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor
Disabled More_than_one_Job Retired Unemployed__Not_working
Work_from_home_remotely
I used the following code
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field AS A Left Join SASCDC_2.Employment_Classify AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I would expect the query to take a few seconds but this is the result from the log
Proc sql noprint; 39 CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS 40 Select A.*, B.* 41 From SASCDC_2.Contact_Person_Employ_field AS A 42 Left Join SASCDC_2.Employment_Classify AS B 43 ON A.Place_of_Employment = B.Place_of_Employment; WARNING: Variable Place_of_Employment already exists on file SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS. NOTE: SAS threaded sort was used. NOTE: There were 124803 observations read from the data set SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD. NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY. WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete. When this step was stopped there were 3036491921 observations and 11 variables. NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52 percent. Compressed is 1061686 pages; un-compressed would require 11204768 pages. WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was stopped. NOTE: PROCEDURE SQL used (Total process time): real time 25:19.26 cpu time 22:39.62 44 quit;
When I stopped the query 3,036,491,921 observations had been processed.
Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.
There is some problem here if over 3 billion obs were processed and the query was no where near to completion.
What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.
Thank you.
wklierman
Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
I have a problem with a Proc SQL join.
I have two data sets with 124,100 rows each. Data set A has two columns:
Contact_ID Place_of_Employment
There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.
The second data set Data set B has the following columns. These are sparse also.
The Place_of_employment in B matches to the Place_of-Employment in A.
Here are the columns for Dataset B:
Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor
Disabled More_than_one_Job Retired Unemployed__Not_working
Work_from_home_remotely
I used the following code
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field AS A Left Join SASCDC_2.Employment_Classify AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I would expect the query to take a few seconds but this is the result from the log
Proc sql noprint; 39 CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS 40 Select A.*, B.* 41 From SASCDC_2.Contact_Person_Employ_field AS A 42 Left Join SASCDC_2.Employment_Classify AS B 43 ON A.Place_of_Employment = B.Place_of_Employment; WARNING: Variable Place_of_Employment already exists on file SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS. NOTE: SAS threaded sort was used. NOTE: There were 124803 observations read from the data set SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD. NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY. WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete. When this step was stopped there were 3036491921 observations and 11 variables. NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52 percent. Compressed is 1061686 pages; un-compressed would require 11204768 pages. WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was stopped. NOTE: PROCEDURE SQL used (Total process time): real time 25:19.26 cpu time 22:39.62 44 quit;
When I stopped the query 3,036,491,921 observations had been processed.
Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.
There is some problem here if over 3 billion obs were processed and the query was no where near to completion.
What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.
Thank you.
wklierman
DUPLICATE THREAD
Do not reply in this thread
Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...
@PaigeMiller wrote:
DUPLICATE THREAD
Do not reply in this thread
Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...
Merged.
Sparse data can cause this. Each missing in data set A is matched with each missing in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this:
From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
Thank you for responding. This how I altered the code with your code.
Proc sql noprint; CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS Select A.*, B.* From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B ON A.Place_of_Employment = B.Place_of_Employment; quit;
I still got over 11MM obs as can be seen in the log.
Select A.*, B.* 48 From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS 48 ! A 49 Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B 50 ON A.Place_of_Employment = B.Place_of_Employment; WARNING: Variable Place_of_Employment already exists on file SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS. NOTE: SAS threaded sort was used. NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73 percent. Compressed is 5015 pages; un-compressed would require 40858 pages. NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns. 51 quit; NOTE: PROCEDURE SQL used (Total process time): real time 5.89 seconds cpu time 5.54 seconds
Need some additional modification.
wklierman
Bottom line: Know your data.
You need to figure out which values have a high cardinality (repeated values) and cause the large Cartesian product.
Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
Thanks. You are right. It can't be accomplished. So I did it the old fashion way.
Exported only the contact_person_id to an xlsx worksheet. Then I copy / pasted the employment data (which had been exported earlier to an xlsx worksheet) to the contact_person_id and I have the dataset that I needed.
I should have thought of that work-around before.
Thank you all for your help.
wklierman
If the order of both datasets is already correct, then a MERGE without BY can also do that side-by-side lineup.
You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.