
I noticed this for the first time the other day, although it must have been happening for ages.  I tried importing a dataset that is an extract from our database, but I ran into an issue (one row of corrupt data threw the whole thing off).  I had selected "Replace file if it already exists" and realised that when the import process starts, it deletes the existing dataset before checking the new data.  This seems backwards to me - shouldn't the new file be checked and verified first, and the previous SAS dataset deleted only if everything is OK?  I wanted to see if the corrupt row had been imported previously or if the issue was new, and I couldn't do that.

 

I will now be keeping archived copies of my extracts so I can go back in and check, and I will also rename my previous SAS dataset before replacing it.
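For anyone wanting to do that rename-then-replace step programmatically rather than by hand, a sketch (the library, dataset, and file names here are placeholders, not from the original post):

```sas
/* Keep a backup of the existing dataset before re-importing.       */
/* MYLIB, EXTRACT, and the file path are illustrative names only.   */
proc datasets library=mylib nolist;
   change extract = extract_backup;   /* rename the current copy */
quit;

proc import datafile="C:\data\extract.csv"
            out=mylib.extract
            dbms=csv
            replace;    /* same option as the wizard's checkbox */
run;
```

If the new import fails, the backup copy is still in the library under the new name.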

 (Tagged with Importing and Data Management as it doesn't really fall under the other tags...)

Thanks

Chris

11 Comments
ballardw
Super User

Interesting. I had never used that option in the import wizard. I had thought it applied to the generated Proc Import code because of the dialog box where the option is presented. Since I generally don't save the proc import code, I didn't see a use. The option apparently sets the Proc Import Replace option. I did notice that a popup dialog asking to confirm the deletion does occur, so you do have an option to stop.

 

In my install it is not the default to have the replace box checked. Is that not the case in yours? Or are we looking at a different import wizard than in Base SAS?

DarthPathos
Lapis Lazuli | Level 10

Hi @ballardw - Yes, you're correct, you do have the option to stop, and no, it's not my default either.  The dataset I have covers the full fiscal year, from April 1 to the last day of the previous month.  Every month I reimport it so it replaces the previous one, and I never watched SAS while it ran (it takes a few seconds as it's a large file).  I didn't realise this was happening until I saw the error message, went into my library to look at the previous month's data, and found the dataset wasn't there.  I did more testing to confirm this was the sequence. 

 

Here's the screen I'm seeing.  I'm running SAS 9.4 TS Level M3 on Win 7 32 bit.

Capture.PNG

paulkaefer
Lapis Lazuli | Level 10

I wonder if this is because of the implications as file size expands. If you are attempting to import a multiple gigabyte file (or larger), you may run into space issues if you first import the new copy, then remove the old one.

DarthPathos
Lapis Lazuli | Level 10

@paulkaefer I totally agree - but it doesn't look like the file gets imported, just verified, at least from what I'm seeing.  If I have time I'll do some playing around with PROC IMPORT and see if there are any differences versus the Import Wizard; I don't imagine there are, as the Wizard generates the code, but who knows 🙂

ballardw
Super User

Is your file structure changing every month? Unstable data like Excel? If not, use the code Proc Import generates as the base for a data step that reads the file (hopefully delimited); copy it from the log, or use F4 to recall the code into the editor immediately after the wizard executes.

 

Then save that data step with nifty options like data labels and some error checking. Next month, change the infile statement and possibly the output dataset name.

 

I do this routinely. Some of my code has been slightly modified over the past 11 years as the incoming file formats have changed incrementally. This keeps my variable types, labels, and custom informats consistent. Also, I always read these into a temporary file and verify some elements before overwriting or storing in a permanent library.
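As a sketch of that workflow, a hand-edited data step based on what Proc Import generates might look like this (the file path, variable names, informats, and labels are all illustrative, not from the thread):

```sas
/* Read the monthly CSV into WORK first; verify before copying to   */
/* the permanent library. Everything named here is an example.      */
data work.extract_check;
   infile "C:\data\extract.csv" dsd firstobs=2 truncover;
   informat visit_date mmddyy10. amount best12.;
   format visit_date mmddyy10.;
   length dept $ 30;               /* consistent length month to month */
   input visit_date dept $ amount;
   label visit_date = "Date of Visit"
         dept       = "Department"
         amount     = "Billed Amount";
run;
```

Because the informats, lengths, and labels are fixed in code, every month's dataset comes out structurally identical, which Proc Import's guessing cannot guarantee.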

DarthPathos
Lapis Lazuli | Level 10

@ballardw 

1) Nope, the file structure (columns, formats, etc.) is identical (at least as long as I've been doing it, about 16 months)

2) It's a CSV file (which I don't open - I do two extracts, one that I open and check and the other I use to import...don't want Excel getting its grubby little formatting hands on my dates!)

3) PROC IMPORT - I've started writing a script to incorporate the import, data checking, etc., and I like your idea of pulling in one result and going from there.  I'll also take your advice and read into a temp file first, run some PROC SUMMARYs or something, and then move it to my final library.  (Just thinking there may be value in outputting the PROC SUMMARY to an ODS output, having SAS check the values against pre-defined "acceptable levels", and then stopping if there's a problem and continuing to move the data if there isn't...)  It means someone else with SAS can run my code and I can take a vacation!
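A rough sketch of that kind of automated check - the dataset names, the thresholds, and the target library are all made up for illustration:

```sas
/* Summarise the staging dataset, then stop the program if the       */
/* values fall outside pre-defined "acceptable levels".              */
proc means data=work.extract_check noprint;
   var amount;
   output out=work.chk min=min_amt max=max_amt nmiss=n_missing;
run;

data _null_;
   set work.chk;
   if min_amt < 0 or n_missing > 0 then do;
      put "ERROR: extract failed validation - stopping before replace.";
      abort cancel;       /* cancel the rest of the submitted program */
   end;
run;

/* Only reached if validation passed */
proc copy in=work out=mylib;
   select extract_check;
run;
```

With ABORT CANCEL, the permanent copy is never touched unless the new extract passes, which addresses the delete-before-verify problem from the original post.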

 

(But I still think the Import Wizard should be reviewed as a piece of software - I see new users getting very confused and frustrated by this.)

ballardw
Super User

The data step read is really the way to go. You can use custom informats to check variable ranges or values and generate errors at the read.

For instance, this is one custom informat I use:

proc format;
   invalue $STD_ProgramArea (just upcase default=1)
      '1','FAMILY PLANNING'      ='1'
      '2','PRENATAL'             ='2'
      '3','PRIMARY CARE'         ='3'
      '4','STD'                  ='4'
      '5','OTHER'                ='5'
      '6','JUVENILE CORRECTIONS' ='6'
      '7','ADULT CORRECTIONS'    ='7'
      other=_error_
   ;
run;

 

 

There are two valid data entries for each value because I have multiple data sources and some can't follow coding instructions, so I read the variable ProgramArea with this informat. If I get anything other than the explicit values above, I get an error in the log, just like reading with a date informat when your data supplier puts something like "UNKNOWN" in the data. That is fixable with an informat that assigns missing (possibly a special missing such as .U) and reads the rest with the appropriate date informat. Ranges can be checked similarly:

 

proc format;
   invalue range
      0 - 100 = _same_
      other   = _error_
   ;
run;

This will throw an error for any numeric value outside 0 to 100.
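The "UNKNOWN"-in-a-date-column situation mentioned above could be handled the same way; here is one possible sketch (the informat name and the date layout are assumptions):

```sas
/* Hypothetical informat: "UNKNOWN" becomes special missing .U,     */
/* anything else is read with a standard date informat.             */
proc format;
   invalue flexdate (upcase)
      'UNKNOWN' = .U
      other     = [mmddyy10.];   /* nest the regular date informat */
run;
```

Reading the column with `flexdate.` then keeps genuinely unknown dates distinguishable (.U) from ordinary missing values, and still flags anything that is neither "UNKNOWN" nor a valid date.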

 

If you have something that changes fairly often, like added values that management forgets to mention, this is one way to find them.

I will say that most of my repeated files are relatively small - none over 40,000 or so records, and most less than 1,000 per month - so occasional errors are relatively easy to run down.

DarthPathos
Lapis Lazuli | Level 10

@ballardw I ashamedly admit I have never figured out the data step.  This may motivate me to finally get a handle on it.  We average about 4,000 new rows a month, so in April I'm looking at a fairly small file, but by January/February it's grown to 10,000-15,000 rows (I am extracting cumulative data for the fiscal year).  Shall definitely have to play around with your suggestions!

ballardw
Super User

This would be a good place to learn some basic data step. Proc Import generates all the basic ugly bits you need: an infile statement, informats, formats, and an input statement, plus a very basic error-reporting bit.

 

Depending on my actual data sources, I may modify the variable names. Long column headings lead to long variable names, so I change things like REFERRAL_SOURCE_PRIMARY_1_LOCATI to RefSourceLoc, or rename one source's 15 "Total" columns to something more meaningful such as TotalWorkHours or TotalWeeklyWage instead of the Total, Total2, Total3 that import tends to generate, using search and replace in the editor. Then I add a label for the variable, like "Referral Source Primary Location". I generally apply labels to all variables that I need unless they are very easy to understand, such as Year.

 

The code proc import generates has one informat and one format for each variable, so it is easy to make one change and see if that works.

I look for indicators like an informat of $1. or $3. - often that indicates a column that actually had no values, or an empty quoted field ("" from one of my data sources). If you have a proper data exchange document and know that column should be a numeric with up to 4 digits, replace the informat with best4. or F4. or whatever is appropriate for the values when populated. Another common issue with proc import is character variables such as names, addresses, product descriptions, and account/product codes changing lengths from file to file. Take a look at your exchange document (if available, it should note a maximum length) or apply a guesstimate based on the type of text, and set a consistent length with the informat. Generally, unless I have a custom format, I only keep the format statements in the generated code for date, time, or datetime variables.
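As an illustration of those edits, a generated informat statement might be adjusted like so - every name, width, and comment here is invented for the example:

```sas
/* Example hand-edits to PROC IMPORT's generated informats.         */
data work.extract_check;
   infile "C:\data\extract.csv" dsd firstobs=2 truncover;
   informat account_code $12.     /* was $3. - early rows were short   */
            comments     $200.    /* keep length stable month to month */
            units        best4.;  /* was $1. - the column was empty    */
   input account_code $ comments $ units;
run;
```

Setting the widths once in code means a short file in April and a long file in January produce the same variable types and lengths.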

 

Code to parse variables, cross-reference checks (males with a female medical diagnosis, for instance), and recoding of values like 9999 for "missing" data (if a custom informat doesn't do it) can all be incorporated early in the data process.
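A minimal sketch of such early checks - the variable names and sentinel codes are hypothetical:

```sas
/* Early-in-the-flow cleanup and cross-reference checks.            */
data work.clean;
   set work.extract_check;
   if age = 9999 then age = .;               /* sentinel to missing */
   if sex = 'M' and diagnosis = 'PREGNANCY'  /* cross-reference check */
      then put "WARNING: inconsistent record at obs " _n_=;
run;
```

Writing the problems to the log with PUT keeps the run going while still leaving a trail to chase down, in the same spirit as the _error_ informats above.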

 

Have fun.

 

 

paulkaefer
Lapis Lazuli | Level 10

@DarthPathos: As far as learning more about data steps, there are many pages on the sasCommunity wiki, including conference papers introducing the data step.