BookmarkSubscribeRSS Feed
SFDonovan
Calcite | Level 5

The SAS was written years ago by a previous developer.

Right now I am getting a file 77 MB in size from my SQL statement, when before the file was only 1.68 MB in size from SAS. 

What I need is an explanation as to what ends up in CLS_CDUP_ENRL.xls

CDUP, CLASS_TBL, and ENRLMNTS are large data extracts from the enterprise system.

After every KEEP statement there is an OUTPUT statement.

Example

KEEP OWNER STRM SESSION_CODE CLASS_SECTION SUBJECT CATALOG_NBR CLASS_NBR EMPLID DUP_CODE;

RUN;

PROC EXPORT DATA=CDUP_NO_ENRLMNT

            OUTFILE= "C:\Processes\C_DUP\for_1058\CDUP_NO_ENRLMNT.xls"

            DBMS=EXCEL2000 REPLACE;

The KEEP statement causes a DATA step to write only the variables that you specify to one or more SAS data sets.

My problem is I don't see a specific KEEP statement before CLS_CDUP_ENRL is exported.  I don't know what is retained in CLS_CDUP_ENRL.  Maybe I'm getting confused where the previous developer who made this SAS program used CLS_CDUP_ENRL in both a DATA statement and a SET statement.

Here is the SAS code.

DATA CLASS_NOT_FOUND;

MERGE CLASS_TBL (IN=IS_A) CDUP (IN=IS_B);

  BY CLS_KEY;

  IF IS_B AND NOT IS_A;

  KEEP OWNER STRM SESSION_CODE CLASS_SECTION SUBJECT CATALOG_NBR EMPLID DUP_CODE;

RUN;

PROC EXPORT DATA=CLASS_NOT_FOUND

            OUTFILE= "C:\Processes\C_DUP\for_1058\CLASS_NOT_FOUND_CDUP_ER1.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

DATA CLS_CDUP_VLD;

MERGE CLASS_TBL (IN=IS_A) CDUP (IN=IS_B);

  BY CLS_KEY;

  IF IS_B AND IS_A;

  STDNT_KEY = EMPLID || STRM || CLASS_NBR;

  STDNT_KEY2 = EMPLID || CRSE_ID;

RUN;

PROC SORT DATA=CLS_CDUP_VLD;

BY STDNT_KEY;

RUN;

DATA CLS_CDUP_ENRL;

MERGE CLS_CDUP_VLD (IN=IS_A) ENRLMNTS (IN=IS_B);

  BY STDNT_KEY;

  IF IS_B AND IS_A;

RUN;

DATA CDUP_NO_ENRLMNT;

MERGE CLS_CDUP_VLD (IN=IS_A) ENRLMNTS (IN=IS_B);

  BY STDNT_KEY;

  IF IS_A AND NOT IS_B;

  KEEP OWNER STRM SESSION_CODE CLASS_SECTION SUBJECT CATALOG_NBR CLASS_NBR EMPLID DUP_CODE;

RUN;

PROC EXPORT DATA=CDUP_NO_ENRLMNT

            OUTFILE= "C:\Processes\C_DUP\for_1058\CDUP_NO_ENRLMNT.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

DATA DUPS_CDUP_ENRL;

SET CLS_CDUP_ENRL;

BY STDNT_KEY;

IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1);

   KEEP OWNER EMPLID STRM SESSION_CODE SUBJECT CATALOG_NBR CLASS_SECTION DUP_CODE;

RUN;

PROC EXPORT DATA=DUPS_CDUP_ENRL

            OUTFILE= "C:\Processes\C_DUP\for_1058\DUPLICATE_CDUP_ENTRIES.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

PROC SORT DATA=CLS_CDUP_ENRL;

BY STDNT_KEY2;

RUN;

DATA UNMATCH_CDUP_ENRL;

SET CLS_CDUP_ENRL;

BY STDNT_KEY2;

IF (FIRST.STDNT_KEY2 = 1 & LAST.STDNT_KEY2 = 1);

   KEEP OWNER EMPLID STRM SESSION_CODE SUBJECT CATALOG_NBR CLASS_SECTION DUP_CODE STATUS;

RUN;

PROC EXPORT DATA=UNMATCH_CDUP_ENRL

            OUTFILE= "C:\Processes\C_DUP\for_1058\UNMATCHED_CDUP_ENTRIES.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

PROC EXPORT DATA=CLS_CDUP_ENRL

            OUTFILE= "C:\Processes\C_DUP\for_1058\CLS_CDUP_ENRL.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

5 REPLIES 5
Tom
Super User Tom
Super User

Code looks pretty simple.  If you are seeing a large increase in size then check the input data sets.

CLASS_TBL 

CDUP

ENRLMNTS

Reeza
Super User

What are the # of observations in the old and new output? If the size has significantly changed the file sizes will change according. I'm also assuming you're creating XLSX files now rather than XLS.

You also haven't posted your new SQL code so a comparison can't be made - check the length of the variables in your SQL code would be my suggestion.

LinusH
Tourmaline | Level 20

Why are you converting to SQL?

Check the SAS log, after each step the number of observations and variables are printed.

Data never sleeps
Howles
Quartz | Level 8

The lengths of the variables (especially character variables) are also an issue. So run PROC CONTENTS for each data set created.

Kurt_Bremser
Super User

It may be that one or more variables in the input datasets has had its attributes changed (eg because of adaption to UTF) and consumes more space.

We had similar things happen when the generic attribute datasets (name/value tables) had their "value" fields changed from char(100) to char(3600) by the DB designers.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 974 views
  • 0 likes
  • 6 in conversation