🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wlierman
Lapis Lazuli | Level 10

I have a problem with a Proc SQL join.

 

I have two data sets with 124,100 rows each.  Data set A has two columns:

 

Contact_ID      Place_of_Employment   

 

There are no missing Contact_IDs, but there are sparse inputs for Place_of_Employment because not every contact provided an answer.

 

The second data set, Data set B, has the following columns; these are sparse also.

Place_of_Employment in B matches Place_of_Employment in A.

 

Here are the columns for Dataset B:

 

Place_of_Employment    NAICS_Sector     Sector_Type   Type_Firm   Child_minor 

 

Disabled   More_than_one_Job   Retired    Unemployed__Not_working

 

Work_from_home_remotely

 

I used the following code:

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field AS A
   Left Join SASCDC_2.Employment_Classify AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I would expect the query to take a few seconds, but this is the result from the log:

 Proc sql noprint;
39   CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40      Select A.*, B.*
41      From SASCDC_2.Contact_Person_Employ_field AS A
42      Left Join SASCDC_2.Employment_Classify AS B
43      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
      SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete.  When this step
         was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
      percent.
      Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
         stopped.
NOTE: PROCEDURE SQL used (Total process time):
      real time           25:19.26
      cpu time            22:39.62


44   quit;

When I stopped the query 3,036,491,921 observations had been processed.

 

Essentially, all I want to do is append the Contact_ID field from data set A to the front of data set B; that is why I chose the left join.

 

There is some problem here if over 3 billion obs were processed and the query was nowhere near completion.

 

What changes do I have to make to just append the Contact_ID field from data set A to data set B? I am using Place_of_Employment as the common field.
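One quick check for this kind of blow-up (a sketch using the data set names above) is to count how often each nonmissing Place_of_Employment value repeats in one of the tables; any value with a large count on both sides multiplies into the join output:

```sas
/* Count how often each nonmissing Place_of_Employment occurs in A.   */
/* Values repeated heavily here AND in B multiply into the result:    */
/* n occurrences in A times m in B produce n*m output rows.           */
proc sql;
   select Place_of_Employment, count(*) as n_occurrences
      from SASCDC_2.Contact_Person_Employ_field
      where not missing(Place_of_Employment)
      group by Place_of_Employment
      having count(*) > 1
      order by 2 desc;
quit;
```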

 

Thank you.

 

wklierman

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, I think this is an impossible task: you cannot determine which Contact_ID should be matched to a given observation in Employment_Classify. If a place of employment has 1,000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.


10 REPLIES
wlierman
Lapis Lazuli | Level 10
Proc Sql -- Join problem

PaigeMiller
Diamond | Level 26

DUPLICATE THREAD

Do not reply in this thread

Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...

--
Paige Miller
PaigeMiller
Diamond | Level 26

Sparse data can cause this. Each missing in data set A is matched with each missing in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this: 

 

From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B

 

--
Paige Miller
wlierman
Lapis Lazuli | Level 10

Thank you for responding.  This is how I altered the code with your suggestion.

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I still got over 11 million obs, as can be seen in the log.

 Select A.*, B.*
48      From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS
48 ! A
49      Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
50      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73
      percent.
      Compressed is 5015 pages; un-compressed would require 40858 pages.
NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns.

51   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.89 seconds
      cpu time            5.54 seconds

Some additional modification is needed.

 

wklierman

ChrisNZ
Tourmaline | Level 20

Bottom line: Know your data.

You need to figure out which key values are heavily repeated in both tables and cause the large Cartesian product.
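One way to "know your data" here (a sketch using the table names from the post) is to list the most frequent key values in each table:

```sas
/* List the 20 most frequent Place_of_Employment values to spot the   */
/* heavy repeaters that drive the row explosion. Run the same step    */
/* against Employment_Classify as well to see both sides of the join. */
proc freq data=SASCDC_2.Contact_Person_Employ_field order=freq;
   tables Place_of_Employment / maxlevels=20;
run;
```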

Kurt_Bremser
Super User

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, I think this is an impossible task: you cannot determine which Contact_ID should be matched to a given observation in Employment_Classify. If a place of employment has 1,000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
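The multiplication described here can even be sized up without running the join. A sketch (this counts only the matched, inner-join portion, so unmatched rows from A are not included):

```sas
/* Predict the join size: for each key value, the join emits           */
/* (count in A) * (count in B) rows; summing over keys gives the total.*/
proc sql;
   select sum(a.n * b.n) as predicted_rows
      from (select Place_of_Employment, count(*) as n
              from SASCDC_2.Contact_Person_Employ_field
              group by Place_of_Employment) as a
      inner join
           (select Place_of_Employment, count(*) as n
              from SASCDC_2.Employment_Classify
              group by Place_of_Employment) as b
      on a.Place_of_Employment = b.Place_of_Employment;
quit;
```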

wlierman
Lapis Lazuli | Level 10

Thanks.  You are right; it can't be accomplished.  So I did it the old-fashioned way.

I exported only the Contact_Person_ID column to an .xlsx worksheet, then copied and pasted the employment data (which had been exported earlier to its own .xlsx worksheet) alongside it, and I have the dataset that I needed.
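For the record, that copy/paste pairs rows purely by position. The SAS equivalent is a one-to-one merge without a BY statement; this is only valid if both tables are already in matching row order, and WANT is a hypothetical output name:

```sas
/* Positional one-to-one merge: row 1 of each table is paired with     */
/* row 1 of the other, and so on. With no BY statement, no key         */
/* matching is done, so the row orders must already agree.             */
data want;
   merge SASCDC_2.Contact_Person_Employ_field(keep=Contact_ID)
         SASCDC_2.Employment_Classify;
run;
```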

 

I should have thought of that work-around before.

 

Thank you all for your help.

 

wklierman

PGStats
Opal | Level 21

You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.

PG


Discussion stats
  • 10 replies
  • 2809 views
  • 3 likes
  • 5 in conversation