Hi all. This is a quite complicated task I'm hoping to get some help with. If you have any suggestions or code that would help me to accomplish this task I'd really appreciated it. If it's too big of an ask I certainly understand that too.
I have a physical file on the server that I need to read in with the goal of outputting two columns of data to a SAS dataset. Each record output would contain 'Column Name' and 'Table Name' from any Proc SQL steps in a SAS program it reads in:
1. Each table name in the 'FROM' (the real table name so ignoring aliases)
2. Each Column name in the 'SELECT' and 'WHERE' conditions (the real column names so again ignoring aliases)
This is the example SAS program I've been playing with. Notice it's intentionally messy because I anticipate people writing code in a variety of ways:
/* MACROS */
%LET DATE_START = '01JUN2016'D;
%LET DATE_END = '31MAY2018'D;
/* ================================================
SECTION 1:
================================================ */
PROC SQL;
CREATE TABLE &USERN._LIB.ORTHOTICS_BASE_PULL_V1 AS
SELECT DISTINCT
CLM.CLM_HIC_NUM AS BENE_HIC_NUM
,CLM.BENE_SK
,CLM.CLM_THRU_DT ,CLM.CLM_PMT_AMT
,LINE.CLM_LINE_FROM_DT
,LINE.CLM_LINE_THRU_DT,
LINE.CLM_LINE_HCPCS_CD AS HCPCS_CD , BENE.BENE_LINE_2_ADR
,BENE.BENE_LINE_3_ADR,
BENE.BENE_LINE_4_ADR
FROM OPI_AAL.CLM AS CLM
INNER JOIN OPI_AAL.BENE AS BENE
ON (CLM.BENE_SK = BENE.BENE_SK)
INNER JOIN OPI_AAL.CLM_LINE AS LINE
ON (CLM.GEO_BENE_SK = LINE.GEO_BENE_SK
AND CLM.CLM_DT_SGNTR_SK = LINE.CLM_DT_SGNTR_SK
AND CLM.CLM_TYPE_CD = LINE.CLM_TYPE_CD
AND CLM.CLM_NUM_SK = LINE.CLM_NUM_SK)
WHERE CLM.CLM_TYPE_CD = 2800
AND LINE.CLM_LINE_PRVDR_PMT_AMT > 0 AND CLM.CLM_FINL_ACTN_IND = 'Y'
;QUIT;
So the desired output would be a dataset containing:
TABLE NAME: COLUMN NAME:
OPI_AAL.CLM CLM_HIC_NUM
OPI_AAL.CLM BENE_SK
OPI_AAL.CLM CLM_THRU_DT
OPI_AAL.CLM CLM_PMT_AMT
OPI_AAL.CLM CLM_TYPE_CD
OPI_AAL.CLM CLM_FINL_ACTN_IND
OPI_AAL.BENE BENE_LINE_2_ADR
OPI_AAL.BENE BENE_LINE_3_ADR
OPI_AAL.BENE BENE_LINE_4_ADR
OPI_AAL.CLM_LINE CLM_LINE_FROM_DT
OPI_AAL.CLM_LINE CLM_LINE_THRU_DT
OPI_AAL.CLM_LINE CLM_LINE_HCPCS_CD
OPI_AAL.CLM_LINE CLM_LINE_PRVDR_PMT_AMT
So far I've been able to read in the SAS program with the following code, but I'm stuck on how to parse thru the lines of the file to create my desired output based on those requirements:
filename file_in "/sasdata/users/steve/DME/DMECode.sas";
data pgm;
infile file_in end=template_eof;
length line $ 2000;
do while (not template_eof);
input;
line = _infile_;
output;
end;
run;
Below code extracts the SQL from your source code and stores it in a single observation.
data pgm;
infile datalines truncover;
input;
length sql_code $32767;
retain sql_code sql_flg;
if sql_flg ne '1' and find(compbl(_infile_),'proc sql','i')>0 then
sql_flg='1';
else if sql_flg='1' then
do;
if find(compbl(_infile_),'quit','i')>0 then
do;
output;
sql_flg='0';
call missing(sql_code);
end;
else
do;
sql_code=catx(' ',sql_code,_infile_);
end;
end;
datalines4;
/* MACROS */
%LET DATE_START = '01JUN2016'D;
%LET DATE_END = '31MAY2018'D;
/* ================================================
SECTION 1:
================================================ */
PROC SQL;
CREATE TABLE &USERN._LIB.ORTHOTICS_BASE_PULL_V1 AS
SELECT DISTINCT
CLM.CLM_HIC_NUM AS BENE_HIC_NUM
,CLM.BENE_SK
,CLM.CLM_THRU_DT ,CLM.CLM_PMT_AMT
,LINE.CLM_LINE_FROM_DT
,LINE.CLM_LINE_THRU_DT,
LINE.CLM_LINE_HCPCS_CD AS HCPCS_CD , BENE.BENE_LINE_2_ADR
,BENE.BENE_LINE_3_ADR,
BENE.BENE_LINE_4_ADR
FROM OPI_AAL.CLM AS CLM
INNER JOIN OPI_AAL.BENE AS BENE
ON (CLM.BENE_SK = BENE.BENE_SK)
INNER JOIN OPI_AAL.CLM_LINE AS LINE
ON (CLM.GEO_BENE_SK = LINE.GEO_BENE_SK
AND CLM.CLM_DT_SGNTR_SK = LINE.CLM_DT_SGNTR_SK
AND CLM.CLM_TYPE_CD = LINE.CLM_TYPE_CD
AND CLM.CLM_NUM_SK = LINE.CLM_NUM_SK)
WHERE CLM.CLM_TYPE_CD = 2800
AND LINE.CLM_LINE_PRVDR_PMT_AMT > 0 AND CLM.CLM_FINL_ACTN_IND = 'Y'
;QUIT;
;;;;
run;
You will have to extend the code logic to deal with more use cases, i.e. multiple SQL statements within a single Proc SQL/QUIT block or explicit pass-through SQL.
I still believe that implementing something generic will be very hard but.... even if you don't succeed you could actually learn a lot along the way just by trying.
It would certainly be possible to write some code for the sample program you've posted but to come up with something generic feels like a huge endeavor. That's basically about writing a SQL parser.
If you need that only for a single program then consider a fully manual approach. If the end goal is to write some sort of data lineage reporting tool then I'd spend time searching the Internet for existing tools. Eventually search with words like "SQL Tokenizer" and the like.
Below code extracts the SQL from your source code and stores it in a single observation.
data pgm;
infile datalines truncover;
input;
length sql_code $32767;
retain sql_code sql_flg;
if sql_flg ne '1' and find(compbl(_infile_),'proc sql','i')>0 then
sql_flg='1';
else if sql_flg='1' then
do;
if find(compbl(_infile_),'quit','i')>0 then
do;
output;
sql_flg='0';
call missing(sql_code);
end;
else
do;
sql_code=catx(' ',sql_code,_infile_);
end;
end;
datalines4;
/* MACROS */
%LET DATE_START = '01JUN2016'D;
%LET DATE_END = '31MAY2018'D;
/* ================================================
SECTION 1:
================================================ */
PROC SQL;
CREATE TABLE &USERN._LIB.ORTHOTICS_BASE_PULL_V1 AS
SELECT DISTINCT
CLM.CLM_HIC_NUM AS BENE_HIC_NUM
,CLM.BENE_SK
,CLM.CLM_THRU_DT ,CLM.CLM_PMT_AMT
,LINE.CLM_LINE_FROM_DT
,LINE.CLM_LINE_THRU_DT,
LINE.CLM_LINE_HCPCS_CD AS HCPCS_CD , BENE.BENE_LINE_2_ADR
,BENE.BENE_LINE_3_ADR,
BENE.BENE_LINE_4_ADR
FROM OPI_AAL.CLM AS CLM
INNER JOIN OPI_AAL.BENE AS BENE
ON (CLM.BENE_SK = BENE.BENE_SK)
INNER JOIN OPI_AAL.CLM_LINE AS LINE
ON (CLM.GEO_BENE_SK = LINE.GEO_BENE_SK
AND CLM.CLM_DT_SGNTR_SK = LINE.CLM_DT_SGNTR_SK
AND CLM.CLM_TYPE_CD = LINE.CLM_TYPE_CD
AND CLM.CLM_NUM_SK = LINE.CLM_NUM_SK)
WHERE CLM.CLM_TYPE_CD = 2800
AND LINE.CLM_LINE_PRVDR_PMT_AMT > 0 AND CLM.CLM_FINL_ACTN_IND = 'Y'
;QUIT;
;;;;
run;
You will have to extend the code logic to deal with more use cases, i.e. multiple SQL statements within a single Proc SQL/QUIT block or explicit pass-through SQL.
I still believe that implementing something generic will be very hard but.... even if you don't succeed you could actually learn a lot along the way just by trying.
@buechler66 wrote:
I know based on SQL rules that all the column names follow the 'Select'
statement
Below some code samples just to give you a feel of all the challenges you're going to encounter if trying to deal with any SQL (and that's so far only the SAS SQL flavor).
proc sql;
select
*
from sashelp.class
;
quit;
proc sql;
select
(select count(*) from sashelp.class where sex=o.sex) as group_cnt,
*
from sashelp.class o
;
quit;
Just to finish this off: Adding the FEEDBACK option to proc SQL and then parse the SQL code out of the SAS log will make it possible to deal with select *, will always give you column names with an alias or table name added to variables and also to give you access to the actual SQL executed when code is wrapped into SAS macro syntax;
28 proc sql feedback; 29 select 30 * 31 from sashelp.class 32 ; NOTE: Statement transforms to: select CLASS.Name, CLASS.Sex, CLASS.Age, CLASS.Height, CLASS.Weight from SASHELP.CLASS; 33 quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.