ANWZimmerman
Obsidian | Level 7

I'm struggling a bit, I really want the best of both worlds and I'm frustrated that I can't get it as easily as I think I should.

 

I really want to use Proc Append because the first time I hit this point in my code the base dataset won't exist.

 

BUT the next time I hit this point there could be a new field, and I don't want to drop it, I want to add it.

 

AND if there is a difference in lengths, I don't want the BASE to overrule, I want the biggest one to win so I don't truncate any data.

 

Force will make them get concatenated even though there are differences, but it drops the new columns I want to keep, and if my new data is wider than the previous data it'll get truncated.

 

Right now I suspect my best option is to use a SET statement, with LENGTH defined for all fields a little bigger than I think I need, and then use macro code to see if BASE already exists and skip its inclusion on the SET statement if it doesn't.
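Something like this rough sketch is what I have in mind (the macro, data set, and variable names here are just placeholders, and the lengths are guesses):

%macro add_run(new=, base=all_runs);
    data &base;
        /* set lengths generously so later runs never truncate */
        length run_id 8 var_list $1000 note $200;
        set
            %if %sysfunc(exist(&base)) %then %do; &base %end;
            &new;
    run;
%mend add_run;

%add_run(new=this_run);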

Any more elegant solutions I'm missing?

 

Probably doesn't have an impact, but this is 9.4 SP3.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

I think SQL would be your best bet.  Consider something along these lines:

 

proc sql;
create table combined as select * from source1 union select * from source2;
quit;

 

You lose the APPEND advantage of skipping reading both data sets.  But SQL can automatically select the longest length needed (for a common field) from the various sources.

 

Good luck.


12 REPLIES
LinusH
Tourmaline | Level 20

I think you summed it up quite nicely.

Bottom line is, you can't change the structure of a data set without rewriting it back to disk.

There is ALTER TABLE ... ADD in SQL, but behind the scenes it's still a rewrite of the whole data set.
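For example (the table and column names here are hypothetical):

proc sql;
    alter table work.base add new_flag char(1);
quit;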

 

But what kind of data operations do you do?

For me, being a data architect, I always try to predict the future, or at least have a change management process. So:

  • Agree with data suppliers on a data format
  • Accordingly, design a target data model that can endure for some time without major changes.
Data never sleeps
Reeza
Super User

Is it possible to define a master table structure and shove everything into that?

ANWZimmerman
Obsidian | Level 7
That's basically what I'm doing but what I had hoped to avoid.
Astounding
PROC Star

I think SQL would be your best bet.  Consider something along these lines:

 

proc sql;
create table combined as select * from source1 union select * from source2;
quit;

 

You lose the APPEND advantage of skipping reading both data sets.  But SQL can automatically select the longest length needed (for a common field) from the various sources.
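For instance, with two made-up data sets where the shared column has different lengths:

data source1;
    length city $5;
    city = 'Tulsa';
run;

data source2;
    length city $12;
    city = 'Philadelphia';
run;

proc sql;
    create table combined as
        select * from source1
        union
        select * from source2;
quit;

/* CITY in COMBINED ends up with length 12, so nothing is truncated */
proc contents data=combined;
run;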

 

Good luck.

Reeza
Super User

Note that if you use a naming convention with a prefix, MY_DATA:, you can append all tables using the shortcut reference to the data sets:

 

data want;
set my_data:;
run;

 

RW9
Diamond | Level 26

Well, there are a few things going on here.  First, you can of course avoid length problems by using @Astounding's suggestion of SQL UNION ALL.  But your second point is that you may have new variables, which wouldn't be "appendable" using the normal method; you would need to re-write the dataset.

 

However, that being said, the real underlying problem has nothing to do with the structure or the code; it is that you don't have a good understanding of your data.  How is it that you do not know, or cannot predict, what length is needed?  If you are unsure, set it to a long length.  How is it that you do not know what columns you will be getting?  This is the real question to be answered; until it is, any solution can be broken by each and every run.

In my field, if data comes from a third party then we create and agree a data transfer spec which details what lengths/variables will be present in the data.  If this changes then the whole process changes - this is called Lifecycle Management.  This is far more important than the data step which imports the data.  As always: Document, Document, Write, Test, Document, Production Lifecycle, Document.
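As an aside, if you do want one SQL step that copes with both issues at once, OUTER UNION CORR is worth a look: it matches columns by name, keeps columns that exist in only one of the tables, and should take the longer length for character columns shared by both.  A sketch, with OLD and NEW as placeholder names:

proc sql;
    create table combined as
        select * from old
        outer union corr
        select * from new;
quit;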

ANWZimmerman
Obsidian | Level 7

It's not that I don't know what the fields will be, it's just that in each iteration there are different fields that might be added.

 

What I love about proc append is that you don't have to pre define the base table.

 

And the reason I don't know for sure how wide each field could be is because I'm preserving the values in macro parameters so we can replicate certain runs that produce interesting results. Some of the fields are limited in length, but there is at least one field that is normally a long list of variable names and really could be as much as 1000 or more characters.  I really don't want to set it that wide unless I really need to.

 

For now I'm using length statements to generate the data sets I'm appending and making sure to include all fields that might be coming later from other data sets, but it just seems so clunky. Just wanted to confirm that I wasn't missing some new option or proc that would be more elegant.

 

Eventually I'll write a macro that will check the lengths of matching vars and create a new dataset with all columns from both and with the length of common fields set to the max of the two. Just don't have the time right now.
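When I do get to it, I expect it to look something like this rough sketch (BASE, NEW, and OUT are placeholder names; it assumes both data sets are in WORK and share at least one character variable):

%macro combine_max(base=, new=, out=);
    %if not %sysfunc(exist(&base)) %then %do;
        /* first pass: nothing to combine yet */
        data &out;
            set &new;
        run;
    %end;
    %else %do;
        /* build a LENGTH statement using the larger length wherever the
           same character variable appears in either data set */
        proc sql noprint;
            select catx(' ', name, cats('$', max(length)))
                into :lenstmt separated by ' '
                from dictionary.columns
                where libname = 'WORK'
                  and memname in (%upcase("&base"), %upcase("&new"))
                  and type = 'char'
                group by name;
        quit;

        data &out;
            length &lenstmt;
            set &base &new;  /* SET keeps the union of columns from both */
        run;
    %end;
%mend combine_max;

Numeric variables keep their default length of 8, and columns that appear in only one of the two data sets come along automatically because SET takes the union of the variables.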

RW9
Diamond | Level 26

Hi,

 

"It's not that I don't know what the fields will be, it's just that in each iteration there are different fields that might be added."

In what sense?  Is it your code which is adding variables, or is this a data transfer?  If it is a transfer, then look to nail down the process: create a data import agreement, have the vendor sign it and adhere to it - otherwise you're just making a rod for your own back.  If they think there might be additional variables needed, then consider a parameter + result approach, i.e. these would go down the page rather than across - this makes your coding a lot easier, as the structure would not change, only the data items.  If it's your code, why do you not know?  Are you transposing?  If so, keep the data normalised and only transpose when absolutely necessary. 

 

Thinking about your data and arranging it so that it's easy to work with is around 20% of any programming task, and there should never be an instance where you do not know the structure or content of your data.  If for some reason it is hard, then you have sashelp.vcolumn and vtable to help.
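For example, a quick look at every column's name, type, and length in a library (WORK here is just an example):

proc sql;
    select memname, name, type, length
        from sashelp.vcolumn
        where libname = 'WORK'
        order by memname, name;
quit;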

 

"What I love about proc append is that you don't have to pre define the base table."

This ties in with the above: you like proc append because you don't need to know what your data is going to look like, but that will just cause you problems further down the line.

 

"And the reason I don't know for sure how wide each field could be is because I'm preserving the values in macro parameters so we can replicate certain runs that produce interesting results"

I have no real clue what you're doing.  If you need to re-run something, then take a snapshot of your data; if you have certain parameters, store them as a parameter/result dataset.  Macro is not a storage facility.

 

If you need examples of the parameter/result approach, have a look at the CDISC LAB domain; you can see there is a parameter/code and results in character/numeric.  The dataset never changes structure, but it can add new parameters etc.
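A minimal sketch of that idea, with made-up names: each run adds rows rather than columns, so the table structure never changes.

data run_params;
    length run_id 8 param $32 value_c $200 value_n 8;
    run_id = 1; param = 'VARLIST'; value_c = 'age height weight'; value_n = .;     output;
    run_id = 1; param = 'CUTOFF';  value_c = ' ';                 value_n = 0.05;  output;
    run_id = 2; param = 'SEED';    value_c = ' ';                 value_n = 12345; output;
run;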

ballardw
Super User

It looks like you get to write a program that examines the two data sets, determines whether they are "compatible" for Proc Append, and appends them if so. Otherwise, pull out the information needed to use the "maximum" length for same-named variables and add a LENGTH statement BEFORE a SET statement to combine the data.

 

I haven't time to go into details but information from Dictionary.Columns or sashelp.Vcolumn will let you do the first part.
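For example, something along these lines lists any same-named variables whose type or length differ, plus variables that exist in only one of the two data sets (OLD and NEW are placeholder names for two WORK data sets):

proc sql;
    select coalescec(a.name, b.name) as varname,
           a.type as old_type, b.type as new_type,
           a.length as old_length, b.length as new_length
        from (select name, type, length from dictionary.columns
              where libname = 'WORK' and memname = 'OLD') as a
        full join
             (select name, type, length from dictionary.columns
              where libname = 'WORK' and memname = 'NEW') as b
        on a.name = b.name
        where a.name is missing or b.name is missing
           or a.type ne b.type or a.length ne b.length;
quit;

If that query returns no rows, PROC APPEND should be safe; otherwise fall back to the LENGTH-plus-SET approach.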

ANWZimmerman
Obsidian | Level 7
Yep, that will be a fun macro to write 🙂
Something I can share at a future SGF
Alankar
Fluorite | Level 6

Your problem should be solved with the answer below. The EXIST function checks whether the dataset exists or not.

 

/*proc sql;*/
/*drop table appenddata;quit;*/
data subdata;
length name1 $3.;
set sashelp.class;
name1 = substr(name,1,3);
keep name1;
run;

%macro a();
%if %sysfunc(exist(appenddata)) %then %do;
	proc sql;
		create table appenddata as 
			select * from appenddata union all select * from subdata;
	quit;
%end;
%else %do;
	proc sql;
		create table appenddata as select * from subdata;
	quit;
%end;
%mend a;
%a();

data subdata;
length name1 $10.;
set sashelp.class;
name1 = name;
/*keep name1;*/
run;
%a();

 

Ron_MacroMaven
Lapis Lazuli | Level 10

My solution is to remember that data structure is a set of compiled statements, which come before the algorithm, the executable statements.

 

DATA common_structure;
attrib varA length = $1
       varB length = $2
       varC length = 8;
stop;
*note: data set has 3 variables, but no rows;
run;

DATA step_one;
   *ensure all data sets have the same structure;
   if 0 then set common_structure;
   do until(endofile);
      set lib.what_1 end = endofile;
      output;
      end;
   stop;
run;

proc append data = &syslast base = The_Big_One;
run;

 

Ron Fehd xor maven

xor(compiled,executable)

xor(data structure,algorithm)

 

