DATA Step, Macro, Functions and more

Merging question

Reply
Contributor
Posts: 52

Merging question

Hello

Could anybody guide me on this merge task?

Dataset 1(WB)

Country         newyear                   Var1

A                     1995                   x1

A                     1996                   x2

A                     1997                   x3

B                     1995                   x4

B                     1996                   x5

Dataset 2(countries)

Country          newyear                  Firm

A                    1995                  y1

A                    1995                  y2

A                    1995                  y3

A                    1996                  y1

A                    1996                  y2

B                   1995                   z1

B                   1995                   z2

B                   1996                   z1   

B                   1996                   z2

Now I need to merge these two datasets: basically assign each country's value for each year to each of the firms in that country for the corresponding year. It is a one-to-many merge.

I used these codes.

"DM 'log;clear;output;clear';

proc import out= work.WB

datafile= "C:\Users\saikia\Desktop\WB.csv"

DBMS = csv replace;

Getnames= yes;

run;

data WB1;

set WB;

newyear=year*1;        /*Initially, an error popped up that v46 that represented the year was both numeric and alphanumeric and so I created the variable 'newyear' for both datasets'*/

run;

proc sort data = work.WB1;

by  newyear year;

run;

proc import out= work.countries

datafile= "C:\Users\n\Desktop\country.csv"

DBMS = csv replace;

Getnames= yes;

run;

data countries1;

set countries;

newyear=v46*1;

run;

proc sort data = work.countries1;

by  country newyear;

run;

data work.combined;

merge work.WB1(in = in1) work.countries1(in = in2);

by country newyear;

if in1& in2;

run;"

The error that appears now is:

ERROR: BY variables are not properly sorted on data set WORK.WB1.

Could anybody please help me?

Thanks!

N/A
Posts: 1

Re: Merging question

while sorting the data in the data set WORK.WBI

you have given to sort it by two variables "newyear" and ''year"

but the variable "year" is not at all there in your data WORK.WBI

you have only COUNTRY ,NEWYEAR,VAR1.

Super Contributor
Posts: 276

Re: Merging question

Hi Namrata.

While merging two datasets in data step,you have to sort both dataset with the  same BY variables.

So You need to sort work.WB1 dataset based on country newyear variables.but in your code you are sorting WB1 dataset based on newyear year.

Please check.That might be causing Error.

Thanks & Regards.

Sanjeev.K

Trusted Advisor
Posts: 1,129

Re: Merging question

Hi Namrata,

to avoid that error note as suggest by few of the community members you need to sort the wb1 data on country and newyear.  This will avoid the error.

However there is another point which i would to inform you is that, in your post you mentioned that when you tried to import the csv file an error popped up that the year was both numeric and character. It some time happens that in a particular column in excel or csv file we find that there are numeric and character values. While importing sas will check the first 8 rows and will consider the data type based on that, for example if we have first 8 records with numeric and next character, sas will take the remaining records are numeric and display them as missing as character values are there. to avoid this we have a statement in proc import, which is MIXED=YES; if you use this you can avoid that message and there will be no need for you create another variable.

Please try it and let me know your feedback

Good Luck

Jagadish

Thanks,
Jag
Contributor
Posts: 52

Re: Merging question

Thanks everyone.Yes,I had overlooked that sorting variable.

I have corrected the same;still the output table 'combined' has zero observations.Merging does not take place.

The 'MIXED' suggestion is a useful one Smiley Happy Thanks!

My friend asked me to use Proc SQL since it is a one-to-many merge...apparently that seems more efficient compared to 'merge' when the 2 datasets do not contain one-to-one correspondence of the ID variables(eg. 1995 of country A in dataset 1 corresponds to several 1995s of country A in dataset 2).

Super User
Super User
Posts: 6,500

Re: Merging question

One to Many merge should work fine with the MERGE statement.  It is Many to Many that will work strangely.

Make sure that the variable that you want to copy from the "ONE" dataset is not already present on the "MANY" dataset. (in your example that would be VAR1)  If it does appear on both datasets then only the first row of the "MANY" will be updated with the value for the "ONE".

I am not a big fan of using PROC IMPORT for CSV files. Just write a data step to read the file. That way you can control the way that COUNTRY and YEAR is defined in both datasets to avoid issues of type mismatches or length mismatches.  You might also want to UPCASE() the COUNTRY variable to make sure that the value will match between the two sources.

data wb;

  infile "C:\Users\saikia\Desktop\WB.csv" dsd dlm=',' truncover lrecl=10000 firstobs=2;

  length country $20 year 8 var1 $8 ;

  input country year var1 ;

  country=upcase(country);

run;

proc sort ;

  by country year;

run;

data countries ;

  infile "C:\Users\n\Desktop\country.csv" dsd dlm=',' truncover lrecl=10000 firstobs=2;

  length country $20 year 8 firm $20 ;

  input country year firm;

  country=upcase(country);

run;

proc sort ;

  by country year;

run;

data want ;

  merge wb (in=in1) countries(in=in2);

  by country year;

  if not in1 then put 'NOT in WB ' country= year= firm= ;

  else if not in2 then put 'NOT in COUNTRIES ' country= year= var1=;

  else output;

run;

Contributor
Posts: 52

Re: Merging question

Hi Tom

Your code works Smiley Happy. I used it for a small trial dataset to understand.

However, could you please explain what it means: put 'NOT in 1' country= year= var1= var2= ; '?

Should the merged dataset contain 'Not in 1' for country & year observations in file 2 but not corresponding to that in file 1? I find that the merged dataset is one bearing perfect correspondence with both datasets.Any country/year id not contained in both are eliminated.

Also, I find that even if I do not add 'country= year=..',it gives the same output. I was curious since one of my datasets contains a lot of variables.

Thanks a lot,Tom.

I would really appreciate if you could explain the above mentioned  parts of the code.

Regards

namrata

Super User
Super User
Posts: 6,500

Re: Merging question

I added those PUT statements because you said that you were getting zero observations when you limited the output to country/year combinations that appeared in both datasets.   By looking at the values that were appearing in only one of the two datasets you could possibly debug why the merge was not finding any matching records.  If you have fixed whatever the issue was and are not worried about the mismatches then you can eliminate one or both of the put statements.

Contributor
Posts: 52

Re: Merging question

Thanks Tom.

Now,I need to use the statements in lieu of Proc Import...

I had a question:if I have a long list of variables in my dataset, is there any option with the INPUT command like using an aseterisk that would account for all variables rather than writing them all out?

Namrata    

Super User
Super User
Posts: 6,500

Re: Merging question

You can deal with it in a couple of ways.

1) Use PROC IMPORT interactively and recall the data step that it generated and clean it up.

2) Pull in the header line of the CSV file so that you have the variable names.  What I normally do this is convert it to a LENGTH statement where I assign the type and length of each variable.  I then add and INFORMAT statement for any date or time variables that require special informats to be read properly.  Then my input statement can using the xx -- yy syntax to input all of the variables.

data want ;

   infile 'my.csv' dsd dlm=',' firstobs=2 truncover lrecl=1000;

   length country $20 year 8 var1 $40 age 8 sex $6 .... lastvar 8 ;

   input country -- lastvar ;

run;

3) If all of the variables are basically the same then use numeric suffixes on the names.

data want ;

   infile 'my.csv' dsd dlm=',' firstobs=2 truncover lrecl=1000;

   length country $20 year 8 var1-var30 $20 ;

   input country year var1-var30  ;

run;

Ask a Question
Discussion stats
  • 9 replies
  • 408 views
  • 1 like
  • 5 in conversation