Hi! For my thesis I have to harmonise data from two different databases, but I don't need all data from both databases. My question is what is the easiest way to do this? I don't know how to pick one variable from the database X and the same variable from database Y and put it in new database Z? How can I do that, without putting all the variables from X and Y in database Z? The databases have different names for the same variables (e.g. database X = height and database Y = length). Here are some syntaxes I used, but these give errors..
*database X same as database Y*;
DATA database X;
SET database X;
LABEL person_id = respno;
RUN;
*database Y same as database X*;
DATA database Y;
SET database Y;
LABEL length = height;
RUN;
*Interleaving*;
DATA database Z;
SET database X database Y;
by height respno;
RUN;
*Concatenating*;
DATA database Z;
SET database X database Y;
RUN;
*One-to-one reading and one-to-one merging*;
DATA database Z;
SET database X;
SET database Y;
RUN;
DATA database Z;
MERGE database X database Y;
BY respno;
RUN;
And I need to put a lot of variables from X and Y in database Z. Is there an easy way to change the names of the variables which are the same in databases X and Y (but have a different name, as already mentioned with height) and to put these variables in database Z? I hope my question is clear and someone can help me.
Thanks! Oh and i use SAS version 9.3
What do you mean by database? Is the data stored in a database like Oracle or Postgre or something like that, or is the data in SAS?
How much SAS knowledge do you have, as the code you give is wrong on a few levels, might be worth running through the SAS video help to get some basics. As you haven't given me anything to work with here (i.e. example data or what you want to do with it) I will just show some general examples - assumes the data is in SAS:
libname mylib "c:\path_to_data";
The above creates a library reference pointing to a physical location on your computer/network. This is how SAS knows where to look for data. We can then use this too set (put one under the other) datasets together (say there is dataset a and b):
data want; set mylib.a (keep=vara varb) mylib.b (keep=vara varb); run;
You will see I only keep the variables I want. This requires both sets of variables are the same type and length.
To merge data you first need to ensure both are sorted, then merge by key variables (take the above example, but merge on id and keep only varc from b):
data want; merge mylib.a (keep=id vara varb) mylib.b (keep=id varc);
by id; run;
These are quite basic and fundamental things which is why I suggest looking at the video tutorials and learning Base SAS a bit more.
Thanks for your reply! I understand what you're saying. The databases I talk about are databases in SAS format. And I have (hopefully) basic knowledge of SAS. That's why I think I'm missing something here.. When I do this:
DATA m.combined;
SET m.database X (keep=height respo)
m.database Y (keep=height respo);
RUN;
I get these errors:
ERROR: The variable height in the DROP, KEEP, or RENAME list has never been referenced.
ERROR: The variable respo in the DROP, KEEP, or RENAME list has never been referenced.
ERROR: The variable height in the DROP, KEEP, or RENAME list has never been referenced.
ERROR: The variable respo in the DROP, KEEP, or RENAME list has never been referenced.
What does this mean? I look for it on the internet, but can't solve it for my own data. Database X and Y both have the variables height and respo in them.
If "m" is the name of the library assigned to the directory in which the datasets "X" and "Y" are stored, than what is "database"?
Right now you try to combine the datasets m.database, work.x, m.database and work.y - maybe not what you want to do.
Yes, I'm sorry, I mean dataset! And m is the location where SAS needs to look for both datasets X and Y.
Let's make it more clear, I want to have a dataset Z in SAS like this:
Respno | Height | Length | Othervariable1 | Othervariable2 | Othervariable3 |
All respno from datasetx | All height from datasetx | All length from datasetx | etc | etc | etc |
All respno from datasety | All height from datasety | All length from datasety | etc | etc | etc |
When I do this:
DATA m.Z;
SET m.X (keep=respno height)
m.Y (keep=respno height);
RUN;
I only get the respno from dataset X and no respno from dataset Y or height from both datasets..
Please use the training facilities available, you are missing some fundamental knowledge. This code:
SET m.dataset X (keep=respno height)
Is invalid. To reference a dataset you first create a library reference - refer to my post. This library reference is the first part preceeding the .
The second part is the physical name of the dataset. So, if on your network/computer you have a sas dataset called X, which will look something like: x.sas7bdat, then you reference this by first creating a library reference to the network/computer location and then using the code:
set m.x (keep=respno height)
You will see there is not dataset or database word in there, simply <library reference>.<datasetname>
This is a fundamental concept which is learnt in episode 1 of SAS training.
Yes, I understand what you are saying. I already made a library m:
Libname m 'M:\Thesis data';
RUN;
Dataset X and dataset Y are fictitious, because I can't give too much information about my thesis. But the datasets are in the map thesis data. In real these datasets have a different name. So I hope it makes it more clear and someone can help with my questions.
Right, so this code:
Libname m 'M:\Thesis data';
Is fine. Then we add a datastep which sets two datasets within that library together:
data want; set m.dataset1 (keep=vara varb) m.dataset2 (keep=vara varb); run;
This will set dataset2 under dataset1 in the output dataset called want. It will select only vara (made up as you wont provide any test data) and varb from each dataset. It assumes dataset1 and 2 exist in that library, both have the same two variables which are the same. This will work. You could also do:
data want (keep=vara varb); set m.dataset1 m.dataset2; run;
So what is the question regarding this?
You second question was variables have different names, if so then you would use a rename:
data want; set m.dataset1 (keep=vara height) m.dataset2 (keep=vara height rename=(length=height)); run;
To add, this code:
data m.datasetZ;
set m.datasetX (keep=respno height)
m.datasetY (keep=respno height);
RUN;
Will take all the data from datasetX keeping only respno and height. Then it will take all the data from datasetY keeping only respno and height. Then it will put the data from the datasetY under datasetX. So you will see that data further down the table.
HI Hovliza. As an aside, like RW9 I would recommend some of the SAS videos, e-Learning (PG1) and popular post “Maxims of Maximally Efficient SAS Programmers” by KurtBremser – “Learning to read the documentation will enhance your problem-solving skills by orders of magnitude.” From the ERROR: The variable respo in the DROP, KEEP, or RENAME list has never been referenced, your issue may be with variables height in X and length in Y if you want to concatenate these as one variable in the combined dataset you need to use the RENAME data set option (http://documentation.sas.com/?docsetId=allprodslang&docsetTarget=syntaxByType-datasetOption.htm&docs... ) i.e. SET m.y (RENAME=(length=height));
Yes, my formulation is not that good..
Libname m 'M:\Thesis data';
RUN;
*X same as Y*;
DATA m.X;
SET m.X;
RENAME person_id = respno;
RUN;
*Y same as X*;
DATA m.Y;
SET m.Y;
RENAME length = height;
RUN;
DATA m.Z;
SET m.Y (keep=respno height)
m.X (keep=respno height);
RUN;
I get a table with respno for both dataset X and Y and height only for Y. I get the error;
ERROR: The variable height in the DROP, KEEP, or RENAME list has never been referenced.
But height is in dataset X..
Read the log. Look at the output data. Look at proc contents data=m.X; Look at the RENAME documentation – an option in the set statement i.e. in parentheses (RENAME=(old-name-1=new-name-1));
Don’t code in the style you’re currently using, where the data set name in the SET is the same as the DATA statement. It makes mistakes really hard to find and you run the risk of destroying your data.
Data have;
Set have;
I suspect thats already happened. I suggest recreating your original data sets, then retrying this SET statement with the RENAME.
And read the chapter in the documentation on Combining Data sets.
Read the log, the notes contain the number of observations read from each dataset.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.