The attached code processes multiple file types (S, E and D) and corrects bad reporting of SSN data.
Our ETL process requires file types S, E and D to be run through together. My problem is that S and E will have changes, while D may not. How do I "track" what has happened in preceding data steps for other file types? Like having a variable "outside" that is updated within these data steps. The attached code just covers one year and semester. We will be correcting SSN data to the present (where needed). I need to generate a corresponding D "change" file even though nothing has been changed. I am trying to generate a list of changed files that my data tech can process through our edit application to stage data for the ETL. Hope I have explained this well enough.
Thank you for any assistance you give to me.
Jackie O.
Can you post sample data and what you expect as a result?
This is pretty vague, so the only answers I can give are also vague.
How do I "track" what has happened in preceding data steps for other file types?
PROC PRINT will show you what has happened. Or you could store the before and after in a data set for comparison.
Like having a variable "outside" that is updated within these data steps.
Same answer as before.
But really, we need a small data set, where you can show us by example exactly what you want.
Good morning all,
I want to thank those who replied to my post. I thought I had attached code to better explain my problem.
Oh well, I've been reading all weekend. I think I have found my answer...%GLOBAL. Will test out today.
Closing this post.
Happy New Year to all. Maybe will "see" ya next year.
Jackie O.
Can you simplify the question to just the thing that you don't understand?
I need to generate a corresponding D "change" file even though nothing has been changed.
How are you doing this now? Are you making a dataset with one observation per change? Perhaps you could just add a step like this that will make sure there is at least one observation in the dataset.
data want;
  if _n_=1 and final then output;
  set have end=final;
  output;
run;
I thought I was on to my answer, but have struggled all day and gotten no further with resolution to my problem. I will simplify in hopes that you will have an answer.
1. My first data step reads in records from a student demographic file and sorts it with another file that marks SSNs needing to be corrected. A flag is set to 'Y' to indicate that the file actually was modified, and all records (changed/unchanged) are written out to a new file containing "changed" in the name.
2. My second data step reads in records from a student course file and again sorts it with the file that marks SSNs needing to be corrected. The flag from step #1 is initialized to 'N'. It is set to 'Y' again per the verbiage in step #1.
3. My third data step reads in records from a student degree awarded file and again sorts it with the file that marks SSNs needing to be corrected. This file may or may not contain the students' SSNs corrected in steps #1 and #2. Not every student in the demographic/course files receives a degree.
The problem is that I need to know what the flag has been set to in steps #1 and/or #2. This will be used in step #3 to generate a degree file to be processed with the demographic and course files. Our ETL will have problems if all three files are not processed together.
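One common way to carry a status like this across data steps is a macro variable: CALL SYMPUTX writes it from inside one step, and any later step can read it back. The sketch below uses hypothetical dataset and variable names (demo, ssn_fixes, degree, change_flag), not your actual code:

```sas
/* Steps 1 and 2: raise a session-level flag when any SSN is changed */
%let any_change = N;

data demo_changed;
  merge demo(in=have_demo) ssn_fixes(in=have_fix);  /* hypothetical datasets */
  by ssn09;
  if have_demo;
  if have_fix then do;
    ssn09 = new_ssn;
    call symputx('any_change', 'Y');  /* visible after this step ends */
  end;
run;

/* Step 3: the degree step can test what the earlier steps did */
data degree_changed;
  set degree;
  if "&any_change" = "Y" then change_flag = 'Y';  /* force a D "change" file */
run;
```

Because "&any_change" resolves when the third step is compiled, which happens after the first two steps have finished running, its value reflects whatever those steps set.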
I hope this is clear enough. If not, I can expand on my explanation.
Many thanks in advance!
Jackie O.
Can you clarify what you mean by FLAG? Are you setting a flag variable on the individual records that need to be modified?
What do you mean by combining the first two steps? Do you mean you want all SSNs that were flagged in either of the first two steps, or only those flagged in both?
Are the datasets in each step one observation per SSN? Or are they more complex?
Your program was a little convoluted and hard to follow. It seemed to be both reading and writing text files.
It would probably be easier to separate out the parts that read and write the text files and just concentrate on processing the data as SAS datasets.
Also why did the code you posted appear to be using GOTO statements to mimic a DO loop? Or possibly to mimic the normal data step iteration over all observations in the input data? Is there some hidden complexity in the data structure? Or is it just written by someone that wasn't familiar with how SAS data steps work?
Tom,
Sorry to not have responded sooner. Happy New Year BTW! And...thank you!
You are correct. My code is convoluted. I learned SAS the way my former supervisor taught me way back in 1985. I am the only programmer doing SAS in my office and have not taken time to update my "skills". Have been trying new things today. Thank you for the "prodding". Started a new version of my program. Trying to use loops (i,j,k) to iterate through years, semesters, and institutions (in that order).
How would you increment a loop index if the numbering is not consecutive? Examples of my institution codes are '110','111','120','340'. Is there a way to use an IN('110','111') to update my index? I am attaching my new attempt.
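For what it's worth, a SAS iterative DO statement can walk a list of character values directly, so no numeric index or IN() test is needed. A minimal sketch:

```sas
data _null_;
  do inst_code = '110', '111', '120', '340';
    put inst_code=;   /* one iteration per institution code */
  end;
run;
```

The loop variable takes each listed value in turn, which sidesteps the non-consecutive-numbering problem entirely.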
Jackie O
Anyone?
Is it possible within a DATA STEP to read both an external file and results from a SQL query? I have attached my code. I have two SQL queries. One determines which institutions have to be processed. The second contains SSN changes. The goal:
1. Out of 46 institutions, process only those with changes found in the first query.
2. Only process semesters and years impacted by the changes (SUM, FALL, SPRING and 20102011 to 20182019).
3. Output the external files in entirety, but with the SSN changes made.
I am certain with all your experience that you know of a better way. Sometimes I equate posting a question to walking before a firing squad. I am a bit sensitive. However, Tom opened my eyes to SAS 9.4 and I so thank him. I first learned SAS in 1985 and my coding is antiquated at best. Trying new things with version 2 of this project.
Thank you in advance,
Jackie O.
1) So the first query gets a list of SSNs? Which variable contains the SSN? Because the variable you are calling SSN09 is set to NULL on every record. Also, what type of variable is SSN09 supposed to be? Normally you would store an SSN in a character variable, but your Oracle query is not setting a type for the variable, so I am not sure what type SAS will create for it when you pull it into SAS. Will it be a number or a character string?
Note no need for a separate PROC SORT step as you can add an ORDER BY clause to your SQL query.
CREATE TABLE inst_list AS
SELECT * FROM CONNECTION TO myconn
(SELECT DISTINCT
institution_code AS inst_code
, NULL AS ssn09
, 'A' AS record_type
FROM renaeb.ssn_update
WHERE (COMMENTS <> 'CORRECTED' OR COMMENTS IS NULL)
AND NEW_SSN IS NOT NULL
)
order by inst_code, ssn09, record_type
;
Perhaps you meant the second query?
That one appears to be creating two variables with names that look like they should contain SSN values, SSN09 and NEW_SSN.
CREATE TABLE ssn_updates AS
SELECT * FROM CONNECTION TO myconn
(SELECT
institution_code AS inst_code
, original_ssn AS ssn09
, new_ssn
, last_name
, first_name
, middle_name
, birthdate
, gender
, 'B' AS record_type
FROM renaeb.ssn_update
WHERE (COMMENTS <> 'CORRECTED' OR COMMENTS IS NULL)
AND NEW_SSN IS NOT NULL
)
order by inst_code, ssn09, record_type
;
(2)
What do you mean by PROCESS? What are you trying to do? Are you just changing SSNs and nothing else?
How do you know what semesters? Is that what the DO Loops in the data step are trying to do? Or is that information that you could have pulled from your Oracle table?
Your DATA step looks like it is going to read in multiple text files by building the file name up from a number of other fields.
in_file_name = catx('/'
               , "&parent_path_I"
               , catt('UDS', record_S, acad_year_2)
               , sem
               , catt('I', inst_code, '_S')
               );
Your process should be (1) Generate list of files to read. (2) Read the files into SAS dataset. (3) apply updates (4) write updated files.
Is there any reason to try to combine those four steps into one step?
Perhaps you meant to do something like this, assuming that the included file just contains a subset of a data step to read the data from a file. So just an INPUT statement. With perhaps some LENGTH, LABEL, FORMAT and/or INFORMAT statements as needed to define the variables.
data file_list;
  set inst_list;
  length yr 8 sem $6 year $8 acad_year_2 $4 in_file_name $200;
  do yr = 2010 to 2018;
    do sem = 'SUM', 'FALL', 'SPRING';
      acad_year_2 = cats(substr(put(yr,4.),3), substr(put(yr+1,4.),3));
      record_S = 'S';
      in_file_name = catx('/', "&parent_path_I", catt('UDS', record_S, acad_year_2), sem, catt('I', inst_code, '_S'));
      output;
    end;
  end;
run;
data current;
  set file_list;
  INFILE IN FILEVAR = in_file_name LRECL = 250 truncover end=eof;
  do while (not eof);
    %INCLUDE '/uds/sas/copylib/UDS_RECORD_S_LAYOUT';
    output;
  end;
run;
Then update the data.
proc sort data=current;
  by inst_code ssn09;
run;

data new;
  merge current (in=in1) ssn_updates (in=in2);
  by inst_code ssn09;
  if in2 then ssn09 = new_ssn;
run;
Now if you want to re-create the text files then add a step that does the reverse of the step that read the files.
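The reverse step could look something like this sketch, which assumes each observation kept its source file name in in_file_name and the raw 250-byte record in a character variable named line (both names are hypothetical; with your layout you would PUT the individual fields instead):

```sas
data _null_;
  set new;
  length out_file_name $200;
  out_file_name = catx('.', in_file_name, 'new');  /* hypothetical naming scheme */
  file out filevar=out_file_name lrecl=250;        /* opens a new file whenever the name changes */
  put line $char250.;
run;
```

FILE with FILEVAR= mirrors the INFILE pattern above: the statement opens a new output file each time the value of the file-name variable changes.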
@JackieO wrote:
Tom,
Let me digest your suggested code. Got some new things in there as far as I am concerned. Need to research the SUBSTR(PUT...) combination. Also I have concerns bringing all those records across at the same time. Your data step file_list could be as large as 7 million records depending on how many institutions are making SSN changes. Not so much the S demographic records, but the enrollment records which will be another group of files to be corrected later in the program.
Questions? Your third line of code is just establishing variables and their respective lengths? You create in_file_name in the first data step, but reference it again in the data step current. How is scope not a player here?
Thank you, Tom
Jackie O
SUBSTR() operates on character strings. PUT() converts values to strings by applying a format. So PUT(2018,4.) will generate the string '2018', So SUBSTR('2018',3) will pull out the string '18'.
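Putting the two together, this is how the two-digit academic-year pair used in the file names can be built (a standalone illustration, not part of the program above):

```sas
data _null_;
  yr = 2018;
  /* put() makes '2018' and '2019'; substr(...,3) keeps '18' and '19' */
  acad_year_2 = cats(substr(put(yr,4.),3), substr(put(yr+1,4.),3));
  put acad_year_2=;   /* 1819 */
run;
```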
7 Million might have been large in 1985 but is not really a large number of observations today. It is a large number of FILES. Do you really have your data stored in over 7 million little files? Why is the data not in your ORACLE database?
It is always a good idea to define your variables instead of letting SAS guess how to define by how you first use them. In addition to a LENGTH (or ATTRIB) statement a good way to define a variable is to SET a dataset that already has that variable in it. The IN_FILE_NAME variable is being read from the FILE_LIST dataset.
Tom,
So in SUBSTR(PUT(yr,4.),3) the 3 means start at that position and go to the end? I used to have to specify how many places to include. I get it. Thank you again!
At most the number of files for a record type (S is demographic) would be #years X #semesters X #institutions (9 X 3 X 46 = 1,242). The table containing the SSN corrections has no academic year or semester within. Institutions often enroll students under assigned numbers if they fail to have a valid SSN. Later on those students do obtain valid SSNs and to report persistence, we need to have those SSNs consistent throughout admission to the institution. So to make things easier on the schools, we search through all our data and update the SSNs for consistency.
We do have most of our data stored in our Oracle database. That is...the data that has passed our edit checks. The in_file_name files contain both the data that passed and the data that failed. So...I go there to make my corrections. I hope to generate a list of the actual files that have been changed so that they can be reprocessed through our ETL and broadcast into our Oracle tables. BTW...the reference to 7 million are the possible number of records and not the possible number of files.
I'm going to play around with this tonight. I so do appreciate all your help!!!
Thank you,
Jackie O.
Sounds like you probably need to sit down and think about what your data flow is and how best to handle this.
I would want to know answer to a number of things before deciding the right course of action.
Why can't you just make the corrections in the Oracle database and not modify the original files? Do they resend the same file multiple times, still with the made-up SSN?
Are the institutions re-using the made-up SSNs for different people over time? That would make fixing them much harder.
In general it is probably best to structure your database to use your own unique id for each person and treat the SSN as an attribute of the person.
Back to your original question.
So does your list of corrected SSNs include INST or not? That is, does your update data say that for INST='111' the SSN '111-22-3333' should really be '999-88-7777'? Or does it just say change all '111-22-3333' to '999-88-7777'? If the latter, that might be easier.
Once you figure out how to make the correction, make a SAS dataset that has the corrected data and an observation-level FLAG to indicate that a change was made. Something like:
FILE|INST|SSN|CHANGED
2018F111_S|111|999-88-7777|Y
2018F111_S|112|999-99-7777|N
....
Then you can pull out the list of files that have changed.
create table updated_files as
select distinct FILE
from new
where CHANGED='Y'
;
And use that to know which files need to be re-written and/or re-run through the ETL process.
Tom,
All I've done is think about how best to accomplish this. I'll try to explain further.
1) Some, but not all of our data makes it through to our Oracle tables. We have an edit routine that validates data reported and generates element errors found. Those with the highest level "FATAL" are restricted from staging for our Oracle ETL process that populates all of our data tables. These data are held back until these types of errors are corrected. So to answer your question, no. I cannot just correct bad SSNs reported in our Oracle database.
2) We require total replacement files when institutions submit data to us. Yes, we have multiple files submitted by every institution during reporting cycle deadlines. This is why I read in files called "I" files to ensure the most recent data sent. Often the SSN changes go back many years. After a while on Unix, the time stamp just has a year and not month, day and time.
3) Institutions should not be re-assigning numbers. The first 3 digits of assigned SSNs must contain the institution code (such as "110"). The other 6 digits are assigned consecutively. If enough time has transpired, I guess they could re-use. All the more reason to free up an assigned SSN when it gets changed to a valid SSN.
4) The update table does include institution code. I have found that some of these assigned SSNs can actually be a valid SSN if searched system wide. This is why I want to make changes only to data submitted by the given institution. We have an institution code of "146". At another institution say "110" there could well be a valid SSN 146XXXXXX. I have to check both institution code and old SSN prior to changing to the new SSN.
5) A flag on the changed rows would be a good thing.
OK. I tried your code. Getting errors on the:
INFILE IN FILEVAR = in_file_name LRECL = 250 truncover end=eof;
Expecting something other than in_file_name after the first = sign. Reading to find out why.
Sorry for just now getting back with you. Winter storm came through. Work was closed and power was out at my house.
Thank you for hanging in there with me.
Jackie O.