wlierman
Lapis Lazuli | Level 10

I have a problem with a Proc SQL join.

 

I have two data sets with 124,100 rows each.  Data set A has two columns:

 

Contact_ID      Place_of_Employment   

 

There are no missing contact_ids, but there are sparse inputs for Place_of_Employment because not every contact provided an answer.

 

The second data set, Data Set B, has the following columns.  These are sparse also.

The Place_of_Employment in B matches the Place_of_Employment in A.

 

Here are the columns for Dataset B:

 

Place_of_Employment    NAICS_Sector     Sector_Type   Type_Firm   Child_minor 

 

Disabled   More_than_one_Job   Retired    Unemployed__Not_working

 

Work_from_home_remotely

 

I used the following code:

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field AS A
   Left Join SASCDC_2.Employment_Classify AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I would expect the query to take a few seconds, but this is the result from the log:

 Proc sql noprint;
39   CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40      Select A.*, B.*
41      From SASCDC_2.Contact_Person_Employ_field AS A
42      Left Join SASCDC_2.Employment_Classify AS B
43      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
      SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete.  When this step
         was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
      percent.
      Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
         stopped.
NOTE: PROCEDURE SQL used (Total process time):
      real time           25:19.26
      cpu time            22:39.62


44   quit;

When I stopped the query 3,036,491,921 observations had been processed.

 

Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B.  That is why I chose the Left Join.

 

There is some problem here if over 3 billion obs were processed and the query was nowhere near completion.

 

What changes do I have to make to just append the Contact_ID field from data set A to data set B?  I am using Place_of_Employment as the common field.

 

Thank you.

 

wklierman

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, I think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1,000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.
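The arithmetic behind that blow-up can be sketched in a few lines of Python. The key lists below are toy stand-ins, not the poster's actual tables; the point is only that an equi-join emits count_A(k) × count_B(k) rows per key, and that SAS (unlike standard SQL, where NULL never equals NULL) joins missing values to each other:

```python
from collections import Counter

# Hypothetical key lists standing in for Place_of_Employment values
# in the two datasets; real data would come from the SAS tables.
keys_a = ["acme"] * 3 + ["globex"] * 2 + [None] * 4
keys_b = ["acme"] * 5 + ["globex"] * 1 + [None] * 6

counts_a = Counter(keys_a)
counts_b = Counter(keys_b)

# An equi-join emits counts_a[k] * counts_b[k] rows for each key k.
# None models a missing value here, and it multiplies out like any
# other key, which is how SAS treats missings in a join.
join_rows = sum(n * counts_b[k] for k, n in counts_a.items())
print(join_rows)  # 3*5 + 2*1 + 4*6 = 41
```

With 124,803 rows per side and heavily repeated (or missing) keys, the same multiplication easily reaches billions of rows.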


10 REPLIES
 
PaigeMiller
Diamond | Level 26

DUPLICATE THREAD

Do not reply in this thread

Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...

--
Paige Miller
PaigeMiller
Diamond | Level 26

Sparse data can cause this. Each missing value in data set A is matched with each missing value in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this:

 

From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
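A toy sketch of why the WHERE= filters help, under the assumption (true in SAS, unlike standard SQL) that missing keys match each other. The counts are made up for illustration:

```python
from collections import Counter

# Hypothetical data: many contacts left Place_of_Employment blank ("").
keys_a = [""] * 1000 + ["acme"]
keys_b = [""] * 1000 + ["acme"]

def join_size(a, b):
    """Rows an equi-join would produce: sum over keys of count_a * count_b."""
    ca, cb = Counter(a), Counter(b)
    return sum(n * cb[k] for k, n in ca.items())

# The blanks alone contribute 1000 * 1000 rows to the join.
print(join_size(keys_a, keys_b))  # 1000001

# Dropping the missing keys first (the where= trick) removes that block.
print(join_size([k for k in keys_a if k],
                [k for k in keys_b if k]))  # 1
```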

 

--
Paige Miller
wlierman
Lapis Lazuli | Level 10

Thank you for responding.  This is how I altered the code with your suggestion.

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I still got over 11 million obs, as can be seen in the log.

 Select A.*, B.*
48      From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS
48 ! A
49      Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
50      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73
      percent.
      Compressed is 5015 pages; un-compressed would require 40858 pages.
NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns.

51   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.89 seconds
      cpu time            5.54 seconds

Need some additional modification.

 

wklierman

ChrisNZ
Tourmaline | Level 20

Bottom line: Know your data.

You need to figure out which values have a high cardinality (repeated values) and cause the large Cartesian product.
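A quick way to spot those dominant keys, sketched in Python on made-up values (in SAS itself, PROC FREQ on Place_of_Employment would give the same counts):

```python
from collections import Counter

# Hypothetical stand-in for the Place_of_Employment column.
places = ["walmart"] * 400 + ["amazon"] * 250 + ["self"] * 90 + ["acme"]

# The most frequent keys dominate the Cartesian blow-up, since each
# contributes count_A * count_B rows to the join output.
for value, n in Counter(places).most_common(3):
    print(value, n)
```

Any value appearing hundreds of times on both sides of the join is a candidate cause of the billions of output rows.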

Kurt_Bremser
Super User

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, I think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1,000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.

wlierman
Lapis Lazuli | Level 10

Thanks.  You are right, it can't be accomplished.  So I did it the old-fashioned way.

Exported only the contact_person_id to an xlsx worksheet. Then I copied/pasted the employment data (which had been exported earlier to an xlsx worksheet) next to the contact_person_id, and I have the dataset that I needed.

 

I should have thought of that work-around before.

 

Thank you all for your help.

 

wklierman

PGStats
Opal | Level 21

You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.

PG
