11-04-2014 02:39 PM
I am working on NIS (Nationwide Inpatient Sample) database and I am going to have to combine/stack data for 10 years.
There are numerous variables in this data but a few years do not have certain variables/columns in them so the information for those variables is missing from them.
My question is what would be the best SAS code/procedure to perform this operation of stacking these 10 yrs of data.
is the Data/Set procedure is fine or proc append is approrpiate?
I had used the data/set procedure in the past to stack 3 years of data and all the columns from all 3 years were stacked properly, and the columns were also created in the final stacked version of the data, if any of the years didnot include that particular column/variable which was present in some other year.
I thougtht of asking this question before proceeding with this as I just want to make sure I am heading in a right way with this as it involves a huge amount of data.
Any suggestions, ideas, tips will behighly appreciated.
11-04-2014 02:45 PM
proc append with the force option. Data steps read each observation, proc append doesn't.
More important than what variables are present are the variable formats/lengths as you risk truncating data if the lengths of the variables have changed over time.
11-04-2014 02:58 PM
Proc append will create a new dataset without over and over replacing the original base dataset.
The datastep set (concat) in one step will create a new dataset in one run. It will be as effective as the proc append.
Both are creating one new big dataset you are going to work on.
Other possible options:
- use a dataetset set concat as view. It will not create a new dataset but at run time read all subsets.
- Use the SPD libname when having more dasd units available. Spreading load over them may increase speed.
All depends on your limitations and the involved sizes on your hardware system. (none of them is specified)
11-10-2014 08:27 PM
I tried using Proc append with force option and the log yield errors for appending the first it year's data itself.
I am attaching the log to theoriginal post for your reference. It did not add the variables that were missing in the base databse.
How can I have it add the new variables to the base data and also in the cases if some are missing from the new dataset?
Should I just use data and set step to avoid all these errors?
11-04-2014 03:31 PM
Before combining this data I would verify:
Variables with the same names are the same type, numeric or character in all data sets. If not, you'll need to address something...
Character variables of the same name have the same length or prepare to deal with that. Both Proc Append and the Data step may yield truncated data if the later appended datasets have longer versions.
That variables of the same name actually represent the same data and coding scheme. Nothing like finding out after doing a bunch of analysis that variable X had a code inserted so that after year Z the value 4 corresponds to a previous value of 3... Or code 6 changed meanings...
That the same topic doesn't have a different variable name from year to year.
What to do about precision if the numbers of significant digits stored has changed between sets, this mostly applies to numbers of decimals.
The longer the time frame the more the data needs to be examined as there are more opportunities for change.
11-11-2014 01:08 AM
Fort the base file with force all intended variables must be there. Adding variables by proc append is not possible only rows.
This is a result of the physical optimization of append doing it that way.
Combining datasets and changing the no of variables will require a complete rewrite of all data (set combines all datasets)
11-11-2014 03:28 AM
If you do have a dataset for each year (like data01, data02, .... , data14), then
will also do it. Variables only present in some datasets will have missing values for the other datasets.
Problems may arise if variables have different types or lengths.
11-12-2014 02:23 PM
You can probably get the best of both worlds by combining Kurt's suggestion with PROC APPEND. First, create an empty data set that is guaranteed to contain all the variables:
set data00 ... data14;
Reading 0 observations, that will run instantaneously. Then append to the empty data set.
11-12-2014 07:35 PM
Thanks for suggesting this way. It has worked.
I have another question. As far as further analyses is concerned; analyses such as multivariate regression and running some macros on the data, does it matter how the data is stacked, using data/set step or using proc append? Should I be anticipating any warning or errors based on how the data is appended?
The reason I am asking this question because the data /set step really long to append the data compared to the proc append procedure. The proc append took hardly a few minutes to append all 10 yrs of data and the data/set step took almost over an hour to append all the data.
11-14-2014 02:28 PM
Under the conditions you described, the analysis data would be identical whether produced by a DATA step or by PROC APPEND. If there were additional statements that were part of the DATA step that would be a different story. But the data will be the same, so the analysis results will be the same as well.
11-14-2014 04:41 PM
I am trying to copy the file from one library to another and I am getting folowing error message:
ERROR: A lock is not available for LIBRARY.NISALL11.DATA.
ERROR: File LIBRARY.NISALL11.DATA has not been saved because copy could not be completed.
This is the file that was created using proc append.
The file created using data/set statement was copied successfully.
I am not sure what the above errros means and how to fix it so that the file is copied successfully.