topic Proc Sql -- Join problem in SAS Programming

Proc Sql -- Join problem

wlierman — Thu, 01 Jul 2021 22:01:26 GMT

I have a problem with a Proc SQL join.

I have two data sets with 124,100 rows each. Data set A has two columns:

Contact_ID Place_of_Employment

There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.

The second data set Data set B has the following columns. These are sparse also.

The Place_of_employment in B matches to the Place_of-Employment in A.

Here are the columns for Dataset B:

Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor

Disabled More_than_one_Job Retired Unemployed__Not_working

Work_from_home_remotely

I used the following code

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field AS A
   Left Join SASCDC_2.Employment_Classify AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I would expect the query to take a few seconds but this is the result from the log

 Proc sql noprint;
39   CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40      Select A.*, B.*
41      From SASCDC_2.Contact_Person_Employ_field AS A
42      Left Join SASCDC_2.Employment_Classify AS B
43      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
      SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete.  When this step
         was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
      percent.
      Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
         stopped.
NOTE: PROCEDURE SQL used (Total process time):
      real time           25:19.26
      cpu time            22:39.62


44   quit;

When I stopped the query 3,036,491,921 observations had been processed.

Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.

There is some problem here if over 3 billion obs were processed and the query was no where near to completion.

What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.

Thank you.

wklierman

Proc Sql--Join Problem

wlierman — Thu, 01 Jul 2021 22:05:42 GMT

Posted 2 seconds ago (0 views)

I have a problem with a Proc SQL join.

I have two data sets with 124,100 rows each. Data set A has two columns:

Contact_ID Place_of_Employment

There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.

The second data set Data set B has the following columns. These are sparse also.

The Place_of_employment in B matches to the Place_of-Employment in A.

Here are the columns for Dataset B:

Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor

Disabled More_than_one_Job Retired Unemployed__Not_working

Work_from_home_remotely

I used the following code

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field AS A
   Left Join SASCDC_2.Employment_Classify AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I would expect the query to take a few seconds but this is the result from the log

 Proc sql noprint;
39   CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40      Select A.*, B.*
41      From SASCDC_2.Contact_Person_Employ_field AS A
42      Left Join SASCDC_2.Employment_Classify AS B
43      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
      SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete.  When this step
         was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
      percent.
      Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
         stopped.
NOTE: PROCEDURE SQL used (Total process time):
      real time           25:19.26
      cpu time            22:39.62


44   quit;

When I stopped the query 3,036,491,921 observations had been processed.

Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.

There is some problem here if over 3 billion obs were processed and the query was no where near to completion.

What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.

Thank you.

wklierman

Re: Proc Sql -- Join problem

PaigeMiller — Thu, 01 Jul 2021 22:08:19 GMT

Sparse data can cause this. Each missing in data set A is matched with each missing in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this:

From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B

Re: Proc Sql--Join Problem

PaigeMiller — Thu, 01 Jul 2021 22:11:41 GMT

DUPLICATE THREAD

Do not reply in this thread

Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unread-message

Re: Proc Sql -- Join problem

wlierman — Thu, 01 Jul 2021 23:27:32 GMT

Thank you for responding. This how I altered the code with your code.

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I still got over 11MM obs as can be seen in the log.

 Select A.*, B.*
48      From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS
48 ! A
49      Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
50      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73
      percent.
      Compressed is 5015 pages; un-compressed would require 40858 pages.
NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns.

51   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.89 seconds
      cpu time            5.54 seconds

Need some additional modification.

wklierman

Re: Proc Sql -- Join problem

ChrisNZ — Fri, 02 Jul 2021 04:53:36 GMT

Bottom line: Know your data.

You need to figure out which values have a high cardinality (repeated values) and cause the large Cartesian product.

Re: Proc Sql -- Join problem

PGStats — Fri, 02 Jul 2021 01:25:44 GMT

You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.

Re: Proc Sql--Join Problem

Kurt_Bremser — Fri, 02 Jul 2021 06:47:11 GMT

@PaigeMiller wrote:

DUPLICATE THREAD

Do not reply in this thread

Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unread-message

Merged.

Re: Proc Sql -- Join problem

Kurt_Bremser — Fri, 02 Jul 2021 06:58:48 GMT

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.

Re: Proc Sql -- Join problem

wlierman — Fri, 02 Jul 2021 19:40:18 GMT

Thanks. You are right. It can't be accomplished. So I did it the old fashion way.

Exported only the contact_person_id to an xlsx worksheet. Then I copy / pasted the employment data (which had been exported earlier to an xlsx worksheet) to the contact_person_id and I have the dataset that I needed.

I should have thought of that work-around before.

Thank you all for your help.

wklierman

Re: Proc Sql -- Join problem

Kurt_Bremser — Fri, 02 Jul 2021 20:00:43 GMT

If the order of both datasets is already correct, then a MERGE without BY can also do that side-by-side lineup.