I have a large dataset (600k records), with possible date info in multiple of four distinct columns (admin_date1, admin_date2, admin_date3, admin_date4). I need to find the first and last of the four options.
firstdate = min (admin_date1, admin_date2, admin_date3, admin_date4);
lastdate = max (admin_date1, admin_date2, admin_date3, admin_date4);
then I create a new column referencing contents of firstdate and lastdate:
if firstdate+21>lastdate then (.) else Completedate=lastdate+14;
The process works for 99+% of the records, but a small percentage return completedate of 01/01/1960. These invariably have valid firstdate and lastdate data, but somehow screw up he completedate. I can find no common denominator for why it sometimes works and sometimes doesn't.
I've tried to strong-arm a solution via Excel, using an "if" statement, and placing a blank in the cell when there is no appropriate value. However, when importing data back into SAS, it won't recognize the new column in date format. I have copied and pasted contents within Excel using "keep data only," thus eliminating the if statement (even though I'm pretty sure I've imported dates previously generated through 'if' statements), to no avail- SAS still reads it as character data and won't reformat as date. How can I fix one or both of these issues?
Provide a working example of the data that causes the problem. (Note You will probably solve the issue yourself in the process of identifying the value that cause the issue).
For example you might start with looking at some of the values that have resulted in COMPLETEDATE of zero.
proc print data=have (obs=10);
where completedate=0;
var id firstdate lastdate completdate;
run;
If you are running SAS using a tool like SAS/Studio that makes producing listing files difficult then just write the values to the SAS log instead and copy the text from there.
data _null_;
set have ;
where completedate=0;
put (id firstdate lastdate completedate) (=);
if _n_>5 then stop;
run;
Show us a portion of the data set that contains examples of this problem, so we can see the values of firstdate and lastdate and completedate.
In preparing data for showing to you, I found the problem, but still don't know the solution. My code instructs that if the interval is <21 days, then put (.) in completedate. SAS is putting "01/01/1960" for each of those instances. How can I get a blank (i.e., no value) in completedate when the criteria for creating a date are not met?
As you can see, the instances where completedate has value of 01/01/1960, the interval is <21 days. In those instances, I want completedate to be blank (i.e., have no date).
Thanks so much, as your request led me to the root of the problem, at least!
Zero is the number used to represent 01JAN1960.
Are you sure there is not some other step that is converting the missing values to zeros?
PS Why post a photograph of text wrapped up into a PDF file? Just copy some of the text and paste it into a pop-insert code box. Or if you really have to post a picture click on the camera icon up load the image file. Then the image will show in your post instead of appearing as an attachment.
Just post data that gives you trouble like this:
data have;
input (firstdate lastdate) (:yymmdd10.);
format firstdate lastdate yymmdd10.;
datalines;
2021-10-01 2021-10-30
;
Add problematic date values in the DATALINES block. Similarly, post the complete code of your IF statement.
Using datalines is the proper way to present data here on the communities, as it takes just a copy/paste and submit to recreate the dataset without having to make guesses about attributes and values. This speeds up the solution process mightily.
This button must be used to post logs and textual data:
The "little running man" right next to it is what I used for the above code. It mimics the coloring of the SAS Enhanced Editor.
Thank you for the pointers!
PS it might be that you actually create missing values, but have this statement somewhere in your code
options missing = "0";
which changes what is used to display a missing value.
I think your variable completedate is formatted as datetime9. rather than as date9.
Provide a working example of the data that causes the problem. (Note You will probably solve the issue yourself in the process of identifying the value that cause the issue).
For example you might start with looking at some of the values that have resulted in COMPLETEDATE of zero.
proc print data=have (obs=10);
where completedate=0;
var id firstdate lastdate completdate;
run;
If you are running SAS using a tool like SAS/Studio that makes producing listing files difficult then just write the values to the SAS log instead and copy the text from there.
data _null_;
set have ;
where completedate=0;
put (id firstdate lastdate completedate) (=);
if _n_>5 then stop;
run;
You were correct that I found the problem when doing this. I appreciate the value of posting the whole code, but the code is very lengthy and would require explaining multiple other variables that were included in the IF statements. Moreover, posting would make public some information that I don't want disclosed. Ultimately, the issue lay in the complexity, with an out-of-place AND portion tacked on. It was intended to apply to a different column but was incorrectly attributed. The suggestion of printing the data assisted in finding the problem.
Thanks to all for your assistance.
That is functionally the code but I had some typos.There are additional 'if' criteria related to the process that complicates the code and precluded me from pasting directly. In truncating it, I left out the 'then completedate=' portion and also mis-typed the greater than symbol- it should be:
if firstdate +21<=lastdate then completedate=(.) else completedate=lastdate+14;
But again, the process works for hundreds of thousands of observations and fails for a few hundred. Because of that, I'm pretty confident in the code. I'm guessing it is a data formatting issue.
If your data file is CSV or other text, copy the first 4 or 5 rows using a TEXT editor, NOT EXCEL, and paste the example text into a text box opened on the forum using the </> icon.
Excel will format values so that the values from the CSV are changed. Sometimes for the worse. Values that are like 5-15 in a CSV file can become a date value. If you have opened your text file in Excel and then saved it you want to find a version from before you touched it with Excel to work with as what you have may very well be corrupt in many places.
If your "lastdate" value is 18Dec1959 then +14 is 01Jan1960.
I notice that you do not provide any examples of your "valid" firstdate and lastdate that demonstrate this behavior. Perhaps you could be so kind as to do so.
And an incomplete data step can hide all sorts of other problems. So perhaps you could share the entire code for that data step. An likely it should be from the log as this statement
if firstdate+21>lastdate then (.) else Completedate=lastdate+14;
would throw an error because 1) the (.) is syntactically incorrect and 2) there should be a semicolon between the clause after "then" and the "else".
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.