Hello
Could anybody guide me on this merge task?
Dataset 1(WB)
Country newyear Var1
A 1995 x1
A 1996 x2
A 1997 x3
B 1995 x4
B 1996 x5
Dataset 2(countries)
Country newyear Firm
A 1995 y1
A 1995 y2
A 1995 y3
A 1996 y1
A 1996 y2
B 1995 z1
B 1995 z2
B 1996 z1
B 1996 z2
Now I need to merge these two datasets: basically assign each country's value for each year to each of the firms in that country for the corresponding year. It is a one-to-many merge.
I used these codes.
"DM 'log;clear;output;clear';
proc import out= work.WB
datafile= "C:\Users\saikia\Desktop\WB.csv"
DBMS = csv replace;
Getnames= yes;
run;
data WB1;
set WB;
newyear=year*1; /*Initially, an error popped up that v46 that represented the year was both numeric and alphanumeric and so I created the variable 'newyear' for both datasets'*/
run;
proc sort data = work.WB1;
by newyear year;
run;
proc import out= work.countries
datafile= "C:\Users\n\Desktop\country.csv"
DBMS = csv replace;
Getnames= yes;
run;
data countries1;
set countries;
newyear=v46*1;
run;
proc sort data = work.countries1;
by country newyear;
run;
data work.combined;
merge work.WB1(in = in1) work.countries1(in = in2);
by country newyear;
if in1& in2;
run;"
The error that appears now is:
ERROR: BY variables are not properly sorted on data set WORK.WB1.
Could anybody please help me?
Thanks!
while sorting the data in the data set WORK.WBI
you have given to sort it by two variables "newyear" and ''year"
but the variable "year" is not at all there in your data WORK.WBI
you have only COUNTRY ,NEWYEAR,VAR1.
Hi Namrata.
While merging two datasets in data step,you have to sort both dataset with the same BY variables.
So You need to sort work.WB1 dataset based on country newyear variables.but in your code you are sorting WB1 dataset based on newyear year.
Please check.That might be causing Error.
Thanks & Regards.
Sanjeev.K
Hi Namrata,
to avoid that error note as suggest by few of the community members you need to sort the wb1 data on country and newyear. This will avoid the error.
However there is another point which i would to inform you is that, in your post you mentioned that when you tried to import the csv file an error popped up that the year was both numeric and character. It some time happens that in a particular column in excel or csv file we find that there are numeric and character values. While importing sas will check the first 8 rows and will consider the data type based on that, for example if we have first 8 records with numeric and next character, sas will take the remaining records are numeric and display them as missing as character values are there. to avoid this we have a statement in proc import, which is MIXED=YES; if you use this you can avoid that message and there will be no need for you create another variable.
Please try it and let me know your feedback
Good Luck
Jagadish
Thanks everyone.Yes,I had overlooked that sorting variable.
I have corrected the same;still the output table 'combined' has zero observations.Merging does not take place.
The 'MIXED' suggestion is a useful one Thanks!
My friend asked me to use Proc SQL since it is a one-to-many merge...apparently that seems more efficient compared to 'merge' when the 2 datasets do not contain one-to-one correspondence of the ID variables(eg. 1995 of country A in dataset 1 corresponds to several 1995s of country A in dataset 2).
One to Many merge should work fine with the MERGE statement. It is Many to Many that will work strangely.
Make sure that the variable that you want to copy from the "ONE" dataset is not already present on the "MANY" dataset. (in your example that would be VAR1) If it does appear on both datasets then only the first row of the "MANY" will be updated with the value for the "ONE".
I am not a big fan of using PROC IMPORT for CSV files. Just write a data step to read the file. That way you can control the way that COUNTRY and YEAR is defined in both datasets to avoid issues of type mismatches or length mismatches. You might also want to UPCASE() the COUNTRY variable to make sure that the value will match between the two sources.
data wb;
infile "C:\Users\saikia\Desktop\WB.csv" dsd dlm=',' truncover lrecl=10000 firstobs=2;
length country $20 year 8 var1 $8 ;
input country year var1 ;
country=upcase(country);
run;
proc sort ;
by country year;
run;
data countries ;
infile "C:\Users\n\Desktop\country.csv" dsd dlm=',' truncover lrecl=10000 firstobs=2;
length country $20 year 8 firm $20 ;
input country year firm;
country=upcase(country);
run;
proc sort ;
by country year;
run;
data want ;
merge wb (in=in1) countries(in=in2);
by country year;
if not in1 then put 'NOT in WB ' country= year= firm= ;
else if not in2 then put 'NOT in COUNTRIES ' country= year= var1=;
else output;
run;
Hi Tom
Your code works . I used it for a small trial dataset to understand.
However, could you please explain what it means: put 'NOT in 1' country= year= var1= var2= ; '?
Should the merged dataset contain 'Not in 1' for country & year observations in file 2 but not corresponding to that in file 1? I find that the merged dataset is one bearing perfect correspondence with both datasets.Any country/year id not contained in both are eliminated.
Also, I find that even if I do not add 'country= year=..',it gives the same output. I was curious since one of my datasets contains a lot of variables.
Thanks a lot,Tom.
I would really appreciate if you could explain the above mentioned parts of the code.
Regards
namrata
I added those PUT statements because you said that you were getting zero observations when you limited the output to country/year combinations that appeared in both datasets. By looking at the values that were appearing in only one of the two datasets you could possibly debug why the merge was not finding any matching records. If you have fixed whatever the issue was and are not worried about the mismatches then you can eliminate one or both of the put statements.
Thanks Tom.
Now,I need to use the statements in lieu of Proc Import...
I had a question:if I have a long list of variables in my dataset, is there any option with the INPUT command like using an aseterisk that would account for all variables rather than writing them all out?
Namrata
You can deal with it in a couple of ways.
1) Use PROC IMPORT interactively and recall the data step that it generated and clean it up.
2) Pull in the header line of the CSV file so that you have the variable names. What I normally do this is convert it to a LENGTH statement where I assign the type and length of each variable. I then add and INFORMAT statement for any date or time variables that require special informats to be read properly. Then my input statement can using the xx -- yy syntax to input all of the variables.
data want ;
infile 'my.csv' dsd dlm=',' firstobs=2 truncover lrecl=1000;
length country $20 year 8 var1 $40 age 8 sex $6 .... lastvar 8 ;
input country -- lastvar ;
run;
3) If all of the variables are basically the same then use numeric suffixes on the names.
data want ;
infile 'my.csv' dsd dlm=',' firstobs=2 truncover lrecl=1000;
length country $20 year 8 var1-var30 $20 ;
input country year var1-var30 ;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.