ANWZimmerman
Obsidian | Level 7

I'm struggling a bit, I really want the best of both worlds and I'm frustrated that I can't get it as easily as I think I should.

 

I really want to use Proc Append because the first time I hit this point in my code the base dataset won't exist.

 

BUT the next time I hit this point there could be a new field, and I don't want to drop it, I want to add it.

 

AND if there is a difference in lengths, I don't want the BASE to overrule, I want the biggest one to win so I don't truncate any data.

 

Force will make them get concatenated even though there are differences, but it drops the new columns I want to keep, and if my new data is wider than the previous data it'll get truncated.

 

Right now I suspect my best option is to use a SET statement, with LENGTH defined for all fields a little bigger than I think I need, and then use macro code to see if BASE already exists and skip its inclusion on the SET statement if it doesn't.
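Something like this rough sketch is what I have in mind (the macro, data set, and variable names here are just placeholders, and the lengths are guesses):

%macro add_run(new=, base=all_runs);
    data &base;
        /* set lengths generously so later runs never truncate */
        length run_id 8 var_list $1000 note $200;
        set
            %if %sysfunc(exist(&base)) %then %do; &base %end;
            &new;
    run;
%mend add_run;

%add_run(new=this_run);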

Any more elegant solutions I'm missing?

 

Probably doesn't have an impact, but this is 9.4 SP3.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

I think SQL would be your best bet.  Consider something along these lines:

 

proc sql;
create table combined as select * from source1 union select * from source2;
quit;

 

You lose the APPEND advantage of skipping reading both data sets.  But SQL can automatically select the longest length needed (for a common field) from the various sources.

 

Good luck.


12 REPLIES
LinusH
Tourmaline | Level 20

I think you summed it up quite nicely.

Bottom line is, you can't change the structure of a data set without rewriting it back to disk.

There is ALTER TABLE ... ADD in SQL, but behind the scenes it's still a rewrite of the whole data set.
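For example (the table and column names here are hypothetical):

proc sql;
    alter table work.base add new_flag char(1);
quit;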

 

But what kind of data operations do you do?

For me, being a data architect, I always try to predict the future, or at least have a change management process. So:

  • Agree with data suppliers on a data format
  • Accordingly, design a target data model that can endure for some time without major changes.
Data never sleeps
Reeza
Super User

Is it possible to define a master table structure and shove everything into that?

ANWZimmerman
Obsidian | Level 7
That's basically what I'm doing but what I had hoped to avoid.
Astounding
PROC Star

I think SQL would be your best bet.  Consider something along these lines:

 

proc sql;
create table combined as select * from source1 union select * from source2;
quit;

 

You lose the APPEND advantage of skipping reading both data sets.  But SQL can automatically select the longest length needed (for a common field) from the various sources.
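For instance, with two made-up data sets where the shared column has different lengths:

data source1;
    length city $5;
    city = 'Tulsa';
run;

data source2;
    length city $12;
    city = 'Philadelphia';
run;

proc sql;
    create table combined as
        select * from source1
        union
        select * from source2;
quit;

/* CITY in COMBINED ends up with length 12, so nothing is truncated */
proc contents data=combined;
run;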

 

Good luck.

Reeza
Super User

Note that if you use a naming convention with a prefix, MY_DATA:, you can append all tables using the shortcut reference to the data sets:

 

data want;
set my_data:;
run;

 

RW9
Diamond | Level 26

Well, there are a few things going on here.  First, you can of course avoid length problems by using @Astounding's suggestion of SQL UNION ALL.  But your second point is that you may have new variables, which wouldn't be "appendable" using the normal method; you would need to re-write the dataset.

 

However, that being said, the real underlying problem has nothing to do with the structure or the code; it is that you don't have a good understanding of your data.  How is it that you do not know, or cannot predict, what length is needed?  If you are unsure, set it to a long length.  How is it that you do not know what columns you will be getting?  This is the real question to be answered; until it is, any solution can be broken by each and every run.

In my field, if data comes from a third party then we create and agree a data transfer spec which details what lengths/variables will be present in the data.  If this changes then the whole process changes - this is called Lifecycle Management.  This is far more important than the data step which imports the data.  As always: Document, Document, Write, Test, Document, Production Lifecycle, Document.
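As an aside, if you do want one SQL step that copes with both issues at once, OUTER UNION CORR is worth a look: it matches columns by name, keeps columns that exist in only one of the tables, and should take the longer length for character columns shared by both.  A sketch, with OLD and NEW as placeholder names:

proc sql;
    create table combined as
        select * from old
        outer union corr
        select * from new;
quit;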

ANWZimmerman
Obsidian | Level 7

It's not that I don't know what the fields will be, it's just that in each iteration there are different fields that might be added.

 

What I love about proc append is that you don't have to pre define the base table.

 

And the reason I don't know for sure how wide each field could be is because I'm preserving the values in macro parameters so we can replicate certain runs that produce interesting results. Some of the fields are limited in length, but there is at least one field that is normally a long list of variable names and really could be as much as 1000 or more characters.  I really don't want to set it that wide unless I really need to.

 

For now I'm using length statements to generate the data sets I'm appending and making sure to include all fields that might be coming later from other data sets, but it just seems so clunky. Just wanted to confirm that I wasn't missing some new option or proc that would be more elegant.

 

Eventually I'll write a macro that will check the lengths of matching vars and create a new dataset with all columns from both and with the length of common fields set to the max of the two. Just don't have the time right now.
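When I do get to it, I expect it to look something like this rough sketch (BASE, NEW, and OUT are placeholder names; it assumes both data sets are in WORK and share at least one character variable):

%macro combine_max(base=, new=, out=);
    %if not %sysfunc(exist(&base)) %then %do;
        /* first pass: nothing to combine yet */
        data &out;
            set &new;
        run;
    %end;
    %else %do;
        /* build a LENGTH statement using the larger length wherever the
           same character variable appears in either data set */
        proc sql noprint;
            select catx(' ', name, cats('$', max(length)))
                into :lenstmt separated by ' '
                from dictionary.columns
                where libname = 'WORK'
                  and memname in (%upcase("&base"), %upcase("&new"))
                  and type = 'char'
                group by name;
        quit;

        data &out;
            length &lenstmt;
            set &base &new;  /* SET keeps the union of columns from both */
        run;
    %end;
%mend combine_max;

Numeric variables keep their default length of 8, and columns that appear in only one of the two data sets come along automatically because SET takes the union of the variables.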

RW9
Diamond | Level 26

Hi,

 

"It's not that I don't know what the fields will be, it's just that in each iteration there are different fields that might be added."

In what sense?  Is it your code which is adding variables, or is this a data transfer?  If it is a transfer, then look to nail down the process: create a data import agreement, have the vendor sign it and adhere to it - otherwise you're just making a rod for your own back.  If they think there might be additional variables needed, then consider a parameter + result approach, i.e. these would go down the page rather than across - this makes your coding a lot easier, as the structure would not change, only the data items.  If it's your code, why do you not know?  Are you transposing?  If so, keep the data normalised and only transpose when absolutely necessary. 

 

Thinking about your data and arranging it so that it's easy to work with is around 20% of any programming task, and there should never be an instance where you do not know the structure or content of your data.  If for some reason it is hard, then you have sashelp.vcolumn and vtable to help.
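For example, a quick look at every column's name, type, and length in a library (WORK here is just an example):

proc sql;
    select memname, name, type, length
        from sashelp.vcolumn
        where libname = 'WORK'
        order by memname, name;
quit;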

 

"What I love about proc append is that you don't have to pre define the base table."

This ties in with the above: you like proc append because you don't need to know what your data is going to look like, but that will just cause you problems further down the line.

 

"And the reason I don't know for sure how wide each field could be is because I'm preserving the values in macro parameters so we can replicate certain runs that produce interesting results"

I have no real clue what you're doing.  If you need to re-run something, then take a snapshot of your data; if you have certain parameters, store them as a parameter/result dataset.  Macro is not a storage facility.

 

If you need examples of the parameter/result approach, have a look at the CDISC LAB domain; you can see there is a parameter/code and results in character/numeric.  The dataset never changes structure, but it can add new parameters etc.
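A minimal sketch of that idea, with made-up names: each run adds rows rather than columns, so the table structure never changes.

data run_params;
    length run_id 8 param $32 value_c $200 value_n 8;
    run_id = 1; param = 'VARLIST'; value_c = 'age height weight'; value_n = .;     output;
    run_id = 1; param = 'CUTOFF';  value_c = ' ';                 value_n = 0.05;  output;
    run_id = 2; param = 'SEED';    value_c = ' ';                 value_n = 12345; output;
run;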

ballardw
Super User

It looks like you get to write a program that examines the two data sets, determines whether they are "compatible" for Proc Append, and appends them if so. Otherwise, pull out the information needed to use the "maximum" length for same-named variables and add a LENGTH statement BEFORE a SET statement to combine the data.

 

I haven't time to go into details but information from Dictionary.Columns or sashelp.Vcolumn will let you do the first part.
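For example, something along these lines lists any same-named variables whose type or length differ, plus variables that exist in only one of the two data sets (OLD and NEW are placeholder names for two WORK data sets):

proc sql;
    select coalescec(a.name, b.name) as varname,
           a.type as old_type, b.type as new_type,
           a.length as old_length, b.length as new_length
        from (select name, type, length from dictionary.columns
              where libname = 'WORK' and memname = 'OLD') as a
        full join
             (select name, type, length from dictionary.columns
              where libname = 'WORK' and memname = 'NEW') as b
        on a.name = b.name
        where a.name is missing or b.name is missing
           or a.type ne b.type or a.length ne b.length;
quit;

If that query returns no rows, PROC APPEND should be safe; otherwise fall back to the LENGTH-plus-SET approach.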

ANWZimmerman
Obsidian | Level 7
Yep, that will be a fun macro to write 🙂
Something I can share at a future SGF
Alankar
Fluorite | Level 6

Your problem should be solved with the answer below. The EXIST function checks whether the dataset exists or not.

 

/*proc sql;*/
/*drop table appenddata;quit;*/
data subdata;
length name1 $3.;
set sashelp.class;
name1 = substr(name,1,3);
keep name1;
run;

%macro a();
%if %sysfunc(exist(appenddata)) %then %do;
	proc sql;
		create table appenddata as 
			select * from appenddata union all select * from subdata;
	quit;
%end;
%else %do;
	proc sql;
		create table appenddata as select * from subdata;
	quit;
%end;
%mend a;
%a();

data subdata;
length name1 $10.;
set sashelp.class;
name1 = name;
/*keep name1;*/
run;
%a();

 

Ron_MacroMaven
Lapis Lazuli | Level 10

My solution is to remember that data structure is a set of compiled statements, which come before the algorithm, the executable statements.

 

DATA common_structure;
attrib varA length = $1
       varB length = $2
       varC length = 8;
stop;
*note: data set has 3 variables, but no rows;
run;

DATA step_one;
   *ensure all data sets have the same structure;
   if 0 then set common_structure;
   do until(endofile);
      set lib.what_1 end = endofile;
      output;
      end;
   stop;
run;

proc append data = &syslast base = The_Big_One;
run;

 

Ron Fehd xor maven

xor(compiled,executable)

xor(data structure,algorithm)

 

