- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
SAS amateur here! I am trying to figure out how to create 2 new variables and assign values to each observation based on the data contained in a reference file. I have created a mock example below. The first dataset is the main one I need to keep. It contains data on individual stores and several stores will have the same corporate ID. The second dataset is a reference that contains approximately 500 Corporate IDs and addresses, including the State and City. I would like to create 2 new variables for CorpState and CorpCity in Dataset 1 based on the CorpID contained in both files. Is this even possible to do and where do I start? Any help would be greatly appreciated.
Dataset 1 (Have): Main dataset with individual store information
Obs# | CorpID | StoreID | StoreState |
1 | 4002 | 40021234 | AL |
2 | 4002 | 40021237 | MS |
3 | 4003 | 40035678 | TN |
4 | 4100 | 41007890 | WV |
5 | 4110 | 41103456 | NC |
6 | 4110 | 41103478 | NC |
7 | 4597 | 45972345 | WA |
8 | 4975 | 49750023 | WY |
Dataset 2 (Have): Reference dataset with corporate information
Obs# | CorpID | CorpState | CorpCity |
1 | 4002 | AL | Birmingham |
2 | 4003 | GA | Atlanta |
3 | 4004 | TN | Nashville |
4 | 4100 | SC | Charleston |
5 | 4110 | NC | Charlotte |
6 | 4111 | FL | Tampa |
7 | 4597 | CA | San Francisco |
8 | 4975 | WY | Jackson |
Want: Dataset 1 file containing store and corporate information
Obs# | CorpID | CorpState | CorpCity | StoreID | StoreState |
1 | 4002 | AL | Birmingham | 40021234 | AL |
2 | 4002 | AL | Birmingham | 40021237 | MS |
3 | 4003 | GA | Atlanta | 40035678 | TN |
4 | 4100 | SC | Charleston | 41007890 | WV |
5 | 4110 | NC | Charlotte | 41103456 | NC |
6 | 4110 | NC | Charlotte | 41103478 | NC |
7 | 4597 | CA | San Francisco | 45972345 | WA |
8 | 4975 | WY | Jackson | 49750023 | WY |
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Proc sort data=data1; by corpid; run; Proc sort data=data2; by corpid; run; data want; merge data1 data2; by corpid; run;
Basic match merge. To merge sets on the value of one or more variables generally sort by them and then merge as shown.
Proc Sql can also do a join but can require a bit more typing.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
NOTE: UNTESTED CODE as the data is not in a usable form. This is a MERGE in a DATA step:
data want;
merge have1(in=in1) have2;
by corpid;
if in1;
run;
This assumes both data sets are sorted by CORPID
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Proc sort data=data1; by corpid; run; Proc sort data=data2; by corpid; run; data want; merge data1 data2; by corpid; run;
Basic match merge. To merge sets on the value of one or more variables generally sort by them and then merge as shown.
Proc Sql can also do a join but can require a bit more typing.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So basically I was overthinking by trying to create a new variable and assign values when all I needed to do was merge and keep the variables I want or drop those I don't. Thank you!
Unfortunately, I am still having an issue with the merge. It keeps saying that the CorpID is being treated as both character and numeric. I used the following code to convert it to character to match the main dataset, sort the data, and try to merge. The proc contents results show it as character following the first code, but when I tried to merge, I am still getting the same error.
/* Change the CorpID variable in CorpAddresses from Numeric to character and end up with the original variable name */
data N.CorpAdresses;
set N.CorpAddresses;
CorpID_char = put(CorpID, 4.);
drop CorpID;
rename CorpID_char = CorpID;
run;
proc contents data=N.CorpAdresses;
run;
Proc sort data=N.StoreInformation;
by CorpID;
run;
Proc sort data=N.CorpAddresses;
by CorpID;
run;
data StoreInfoWithCorp;
merge N.StoreInformation N.CorpAddresses;
by CorpID;
run;
ERROR: Variable CorpID has been defined as both character and numeric.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@BW18 wrote:
It keeps saying that the CorpID is being treated as both character and numeric.
This is not allowed, as you have discovered. We can't really debug your code as we don't have the data, but still one of the two data sets has numeric CorpID and one has character CorpID.
We cannot use data provided as you did in your initial post. Data must be provided as working SAS data step code, following these instructions and examples. Data provided via any other method (like in your first post) is unreliable, and may fail to represent your actual SAS data sets. So I am going to have to insist on you providing data using the instructions at the link.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why did you make it numeric in one dataset and character in the other?
I would suggest you fix that first.
In general ID values should be character as you will not be doing arithmetic with the values.
But in converting from numeric to character you need take care that you generate the digit strings in the same way so they match. Values like '123', ' 123', and '0123' might look like the same number to a human but will be three different values as a character string.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@BW18 wrote:
I did not purposely make it numeric in one dataset and character in the other. As I said previously, I tried to convert the one that SAS was treating as numeric to character to match the main dataset. This data was imported from a text file, so I am not sure if that is why it was reading it as numeric instead of character.
If you are reading in a TEXT file then you have control over how the variables are created.
If you use PROC IMPORT to read the file then it will have to GUESS how to read the file and the only information is has is what it sees in that particular file.
If you are not sure how to read in a text file you could use the code that PROC IMPORT generated as a starting point. (But note that PROC IMPORT generates poorly designed code that attaches INFORMATs and FORMATs to many variables that do not need them.)
In general to read a delimited file (like a CSV file) the code just needs to look like:
data want;
infile 'myfile.txt' dsd truncover firstobs=2;
length varA $20 varB 8 varC $5 ;
input varA -- varC ;
run;
The DATA statement names the dataset to create.
The INFILE statement names the file to read and sets some processing options. If the delimiter is something other than a comma add the DLM= option to specify the delimiter. If there is no header (or the header is mote than one line) then remove the FIRSTOBS= option or change the number. If the lines are longer than 32K bytes then add the LRECL= option to something that is longer than the longest possible line.
The LENGTH statement defines the variables type and storage length. If you define the variable is the order they appear then the INPUT statement can just use the simple positional variable list as shown.
If any of the variables require special instructions for converting the text in the file into the value to be stored then add an INFORMAT statement and attach the needed informat. Most variables do not need this. The main exceptions are DATE, TIME and DATETIME values. You might also see some text files that have thousands separators or $ added for numbers which can be read using the COMMA informat.
Similarly for FORMAT statement. Most variables will not need to have a format attached. The main exception is DATE, TIME and DATETIME values.
You could also add LABEL statement to attach a descriptive label to one or more of the variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So for example if you have this CSV file. Let's call it main.csv.
CorpID,StoreID,StoreState 4002,40021234,AL 4002,40021237,MS 4003,40035678,TN 4100,41007890,WV 4110,41103456,NC 4110,41103478,NC 4597,45972345,WA 4975,49750023,WY
Then the code to make your MAIN dataset would simply be:
data main;
infile 'main.csv' dsd truncover firstobs=2;
length CorpID $20 StoreID $20 StoreState $2 ;
input CorpID -- StoreState;
run;
Which is almost less code that it would take to call PROC IMPORT to make guesses about what data the file has.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please share the output running below code:
proc contents data=N.CorpAdresses(keep=CorpID);
run;
proc contents data=N.StoreInformation(keep=CorpID);
run;
Do your CorpId values in the data set where the variable is of type character have any leading zeros?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There are no leading zeros in the CorpIDs. They range from 4001-4975. However, I closed out SAS and reimported the text file containing CorpAddresses. I performed the transformation to a character variable for CorpIDs and it appeared to work this time. Following this, I was able to perform the merge successfully. Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@BW18 wrote:
There are no leading zeros in the CorpIDs. They range from 4001-4975. However, I closed out SAS and reimported the text file containing CorpAddresses. I performed the transformation to a character variable for CorpIDs and it appeared to work this time. Following this, I was able to perform the merge successfully. Thank you!
Good that you've got something working.
Just as a comment: Ideally you read the source string already into a character variable as part of the import step.