- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a problem with merging my data.
If have the following two tables
Table 1 | ||||||
x | y | z | 2019-01 | 2019-02 | 2020-06 | 2021-01 |
a | b | c | 1 | 2 | 1 | 10 |
a | b | d | 0 | 3 | 1 | 11 |
b | a | d | 2 | 1 | 3 | 8 |
a | c | b | 1 | 4 | 5 | 2 |
Table 2 | |||||||
x | y | z | 2019-01 | 2019-05 | 2020-07 | 2021-01 | 2021-02 |
a | b | c | 0 | 0 | 1 | 8 | 11 |
a | b | d | 1 | 4 | 3 | 2 | 20 |
b | a | d | 0 | 1 | 0 | 0 | 2 |
a | e | e | 0 | 0 | 0 | 2 | 1 |
When I merge them i get
x | y | z | 2019-01 | 2019-02 | 2020-06 | 2021-01 | 2019-05 | 2020-07 | 2021-02 |
a | b | c | 0 | 2 | 1 | 8 | 0 | 1 | 11 |
a | b | d | 1 | 3 | 1 | 2 | 4 | 3 | 20 |
a | c | b | 1 | 4 | 5 | 2 | . | . | . |
a | e | e | 0 | . | . | 2 | 0 | 0 | 1 |
b | a | d | 0 | 1 | 3 | 0 | 1 | 0 | 2 |
However i need to get this
x | y | z | 2019-01 | 2019-02 | 2019-05 | 2020-07 | 2020-06 | 2021-01 | 2021-02 |
a | b | c | 1 | 2 | 0 | 1 | 1 | 18 | 11 |
a | b | d | 1 | 3 | 4 | 3 | 1 | 13 | 20 |
b | a | d | 2 | 1 | 1 | 0 | 3 | 8 | 2 |
a | c | b | 1 | 4 | 0 | 0 | 5 | 2 | 0 |
a | e | e | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
I found some posts that came close to the awnser i needed but there is a second problem that i don't now the year-month upfront and the differ every time i get a new file. (origanal files contain over 40000 records with up to 80 colums) .
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Easy enough if X,Y,Z are character and the others are numeric.
data both;
set table1 table2;
by x y z;
run;
proc summary data=both ;
by x y z;
var _numeric_;
output out=want(drop=_type_ _freq_) sum= ;
run;
You can even make the first data step create a view so that you don't have to make a physical copy of the combined data.
If some of the groups and some of the variables only appear on one of the datasets you might get some observations with missing values. You could use PROC STDIZE to replace the missing with zeros if you need.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please provide data in the form of a working data step for each data set.
Also, provide the code for how you are "merging" the data.
Almost certainly to get the desired sums you show a data step merge is not the approach you want. You would use SET to combine the data and then do the sums.
But we do have an issue in your data as the names you show are not default legal SAS data names AND you are hiding data in the variable name. Typically that means the data is structured poorly for almost any activity.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can make the process entirely dynamic by flipping the data to a long format and then appending the results, sorting and merging.
Or you can generate a dynamic select statement.
Rough idea for solution 1:
proc transpose data=table1 out=table1_long;
by x y z;
var _numeric_;
run;
proc transpose data=table2 out=table2_long;
by x y z;
var _numeric_;
run;
data combined;
set table1_long table2_long;
run;
proc sort data=combined;
by x y z _name_;
run;
proc transpose data=combined out=final prefix= D_;
by x y z;
var col1;
id _name_;
run;
And for solution 2:
*get dates in order;
proc sql noprint;
select nliteral(name) into :name_list separated by ", "
from sashelp.vcolumns
where libname='WORK' and upcase(memname) = 'MERGED'
and name not in ('X' , 'Y', 'Z')
order by name;
quit;
*recreate table with variables in order;
proc sql;
create table want as
select x, y, z, &name_list
from merged;
quit;
Untested code as I didn't want to make a data step to generate sample data.
@rgjpot wrote:
I have a problem with merging my data.
If have the following two tables
Table 1 x y z 2019-01 2019-02 2020-06 2021-01 a b c 1 2 1 10 a b d 0 3 1 11 b a d 2 1 3 8 a c b 1 4 5 2
Table 2 x y z 2019-01 2019-05 2020-07 2021-01 2021-02 a b c 0 0 1 8 11 a b d 1 4 3 2 20 b a d 0 1 0 0 2 a e e 0 0 0 2 1
When I merge them i get
x y z 2019-01 2019-02 2020-06 2021-01 2019-05 2020-07 2021-02 a b c 0 2 1 8 0 1 11 a b d 1 3 1 2 4 3 20 a c b 1 4 5 2 . . . a e e 0 . . 2 0 0 1 b a d 0 1 3 0 1 0 2
However i need to get this
x y z 2019-01 2019-02 2019-05 2020-07 2020-06 2021-01 2021-02 a b c 1 2 0 1 1 18 11 a b d 1 3 4 3 1 13 20 b a d 2 1 1 0 3 8 2 a c b 1 4 0 0 5 2 0 a e e 0 0 0 0 0 2 1
I found some posts that came close to the awnser i needed but there is a second problem that i don't now the year-month upfront and the differ every time i get a new file. (origanal files contain over 40000 records with up to 80 colums) .
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Easy enough if X,Y,Z are character and the others are numeric.
data both;
set table1 table2;
by x y z;
run;
proc summary data=both ;
by x y z;
var _numeric_;
output out=want(drop=_type_ _freq_) sum= ;
run;
You can even make the first data step create a view so that you don't have to make a physical copy of the combined data.
If some of the groups and some of the variables only appear on one of the datasets you might get some observations with missing values. You could use PROC STDIZE to replace the missing with zeros if you need.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Firdt of all, transpose the tables so that the date values become data instead of structure.
Then, stack those tables into one, and finally, build the sum with PROC SUMMARY, CLASS X Y Z DATE.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just to illustrate the usefulness of a long layout:
data table1;
infile datalines dlm='09'x;
input (x y z) ($) '2019-01'n '2019-02'n '2020-06'n '2021-01'n;
datalines;
a b c 1 2 1 10
a b d 0 3 1 11
b a d 2 1 3 8
a c b 1 4 5 2
;
data table2;
infile datalines dlm='09'x;
input (x y z) ($) '2019-01'n '2019-05'n '2020-07'n '2021-01'n '2021-02'n;
datalines;
a b c 0 0 1 8 11
a b d 1 4 3 2 20
b a d 0 1 0 0 2
a e e 0 0 0 2 1
;
proc transpose data=table1 out=t1;
by x y z notsorted;
var _numeric_;
run;
data t1_dt;
set t1;
date = input(cats(_name_,"-01"),yymmdd10.);
format date yymmd7.;
drop _name_;
rename col1=value;
run;
proc transpose data=table2 out=t2;
by x y z notsorted;
var _numeric_;
run;
data t2_dt;
set t2;
date = input(cats(_name_,"-01"),yymmdd10.);
format date yymmd7.;
drop _name_;
rename col1=value;
run;
data all;
set
t1_dt
t2_dt
;
run;
Once your data is in proper shape, it's simple:
proc summary data=all nway;
class x y z date;
var value;
output out=want (drop=_:) sum()=;
run;
And if you need a report out of the raw data, PROC REPORT does the sum on the fly:
proc report data=all;
column x y z value,date;
define x / group;
define y / group;
define z / group;
define value / "" sum;
define date / "" across;
run;
If you wrap that PROC REPORT into ODS EXCEL, you get an Excel file with the wide structure that even your boss can read. But as long as you need to do analysis, avoid the wide structure. See Maxim 19.