BookmarkSubscribeRSS Feed
Ashwini_uci
Obsidian | Level 7

Hi,

I am working on NIS (Nationwide Inpatient Sample) database and I am going to have to combine/stack data for 10 years.

There are numerous variables in this data but a few years do not have certain variables/columns in them so the information for those variables is missing from them.

My question is what would be the best SAS code/procedure to perform this operation of stacking these 10 yrs of data.

is the Data/Set procedure is fine or proc append is approrpiate?

I had used the data/set procedure in the past to stack 3 years of data and all the columns from all 3 years were stacked properly, and the columns were also created in the final stacked version of the data, if any of the years didnot include that particular column/variable which was present in some other year.

I thougtht of asking this question before proceeding with this as I just want to make sure I am heading in a right way with this as it involves a huge amount of data.

Any suggestions, ideas, tips will behighly appreciated.

Thanks!

Ashwini

18 REPLIES 18
Reeza
Super User

proc append with the force option.  Data steps read each observation, proc append doesn't.

More important than what variables are present are the variable formats/lengths as you risk truncating data if the lengths of the variables have changed over time.

jakarman
Barite | Level 11

Proc append will create a new dataset without over and over replacing the original base dataset.

The datastep set (concat) in one step will create a new dataset in one run.  It will be as effective as the proc append.

Both are creating one new big dataset you are going to work on.

Other possible options:

- use a dataetset set concat as view. It will not create a new dataset but at run time read all subsets.

- Use the SPD libname when having more dasd units available. Spreading load over them may increase speed.
All depends on your limitations and the involved sizes on your hardware system.  (none of them is specified)

---->-- ja karman --<-----
Ashwini_uci
Obsidian | Level 7

Hi,

I tried using Proc append with force option and the log yield errors for appending the first it year's data itself.

I am attaching the log to theoriginal post for your reference. It did not add the variables that were missing in the base databse.

How can I have it add the new variables to the base data and also in the cases if some are missing from the new dataset?

Should I just use data and set step to avoid all these errors?

Thanks!

LearnByMistk
Obsidian | Level 7

Better to use set statment to concat data coz in case of append it would remove unmatched column

ballardw
Super User

Before combining this data I would verify:

Variables with the same names are the same type, numeric or character in all data sets. If not, you'll need to address something...

Character variables of the same name have the same length or prepare to deal with that. Both Proc Append  and the Data step may yield truncated data if the later appended datasets have longer versions.

That variables of the same name actually represent the same data and coding scheme. Nothing like finding out after doing a bunch of analysis that variable X had a code inserted so that after year Z the value 4 corresponds to a previous value of 3... Or code 6 changed meanings...

That the same topic doesn't have a different variable name from year to year.

What to do about precision if the numbers of significant digits stored has changed between sets, this mostly applies to numbers of decimals.

The longer the time frame the more the data needs to be examined as there are more opportunities for change.

jakarman
Barite | Level 11

Fort the base file with force all intended variables must be there. Adding variables by proc append is not possible only rows.

This is a result of the physical optimization of append doing it that way.

Combining datasets and changing the no of variables will require a complete rewrite of all data (set combines all datasets) 

---->-- ja karman --<-----
Kurt_Bremser
Super User

If you do have a dataset for each year (like data01, data02, .... , data14), then

data all;

set

  data00

  data01

  .

  .

  .

  data14

;

run;

will also do it. Variables only present in some datasets will have missing values for the other datasets.

Problems may arise if variables have different types or lengths.

Ashwini_uci
Obsidian | Level 7

I do have data for each year with 90% variables that are same..some have a few additonal ones and some don't /.I will try what you have suggested.

Thanks!

Astounding
PROC Star

You can probably get the best of both worlds by combining Kurt's suggestion with PROC APPEND.  First, create an empty data set that is guaranteed to contain all the variables:

options obs=0;

data empty;

   set data00 ... data14;

run;

options obs=max;

Reading 0 observations, that will run instantaneously.  Then append to the empty data set.

Good luck.

Ashwini_uci
Obsidian | Level 7

Thanks for suggesting this way. It has worked.

I have another question. As far as further analyses is concerned; analyses such as multivariate regression and running some macros on the data, does it matter how the data is stacked, using data/set step or using proc append? Should I be anticipating any warning or errors based on how the data is appended?

The reason I am asking this question because the data /set step really long to append the data compared to the proc append procedure. The proc append took hardly a few minutes to append all 10 yrs of data and the data/set step took almost over an hour to append all the data.

Astounding
PROC Star

Under the conditions you described, the analysis data would be identical whether produced by a DATA step or by PROC APPEND.  If there were additional statements that were part of the DATA step that would be a different story.  But the data will be the same, so the analysis results will be the same as well.

Good luck.

Ashwini_uci
Obsidian | Level 7

I am trying to copy the file from one library to another and I am getting folowing error message:

ERROR: A lock is not available for LIBRARY.NISALL11.DATA.

ERROR: File LIBRARY.NISALL11.DATA has not been saved because copy could not be completed.

This is the file that was created using proc append.

The file created using data/set statement was copied successfully.

I am not sure what the above errros means and how to fix it so that the file is copied successfully.

Ashwini_uci
Obsidian | Level 7

Well, I guess it has nothing to do with how the data is created. I am getting the same error for copying the data created by data/set statement as well.

Reeza
Super User

It means you have the data open somewhere or it's still being used by some process.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 1925 views
  • 15 likes
  • 7 in conversation