Data is lost during merge....

Reply
New Contributor
Posts: 3

Data is lost during merge....

Hello All,

I am struggling to merge 7 datasets. I am trying to "import" 6 of the files, which all follow the same formats and columns, into a "master" file, which has more headers.  Between the two different "types" of files (the master and the imports), two variables (VAR1 and VAR3) can be the same. In the code below, I have only used VAR1 to identify the observations, and it does not matter if VAR3 is replaced by other files, as it should be the same. Below illustrates the two different files:

Master file:  VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 VAR9

Import files: VAR1           VAR3                                                        VAR10 VAR11 VAR12

I want to maintain all observations (there are more than already noted in the master file), and not all of the observations in the master file will be filled with imported variables.

The problem that I am facing is it seems that the merge is occurring, but the data that existed in the master file is deleted.

If someone could help find the error in my code? I know it is more complicated than it needs to be; I am still rather new to SAS and am teaching myself.

Thank you,

Briana

      proc import out = Database datafile = "X:\My SAS Files\9.3\Database"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

      proc import out = Data1 datafile = "X:\My SAS Files\9.3\Data1"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

proc import out = Data2 datafile = "X:\My SAS Files\9.3\Data2"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

      proc import out = Data3 datafile = "X:\My SAS Files\9.3\Data3"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

            proc import out = Data4 datafile = "X:\My SAS Files\9.3\Data4"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

            proc import out = Data5 datafile = "X:\My SAS Files\9.3\Data5"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

            proc import out = Data6 datafile = "X:\My SAS Files\9.3\Data6"

                  dbms = xls replace;

            sheet = "Sheet1";

            Getnames = yes;

      run;

data Database;

      set Database;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Database;

      set Database;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

data Data1;

      set Data1;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Data1;

      set Data1;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

data Data2;

      set Data2;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Data2;

      set Data2;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

data Data3;

      set Data3;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Data3;

      set Data3;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

data Data4;

      set Data4;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Data4;

      set Data4;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

data Data5;

      set Data5;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Data5;

      set Data5;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

data Data6;

      set Data6;

            VAR2char = put(VAR2_,20.);

            drop VAR2_;

   rename VAR2char = VAR2;

run;

data Data6;

      set Data6;

            VAR1char = put(VAR1,20.);

            drop VAR1;

   rename VAR1char = VAR1;

run;

proc sort data=Database; by VAR1; run;

proc sort data=Data1; by VAR1; run;

proc sort data=Data2; by VAR1; run;

proc sort data=Data3; by VAR1; run;

proc sort data=Data4; by VAR1; run;

proc sort data=Data5; by VAR1; run;

proc sort data=Data6; by VAR1; run;

options mergenoby=error;

data MergeAll;

      merge Database (keep=VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 VAR9)

            Data1 Data2 Data3 Data4 Data5 Data6;

      by VAR1;

run;

ods tagsets.ExcelXP FILE="X:\My SAS Files\9.3\Data List.xls";

proc print data=MergeAll;

run;

ods tagsets.ExcelXP CLOSE;

Trusted Advisor
Posts: 1,229

Re: Data is lost during merge....

Try this if that works.

data MergeAll;

      merge Data1 Data2 Data3 Data4 Data5 Data6 Database;

      by VAR1;

run;

Super User
Super User
Posts: 7,076

Re: Data is lost during merge....

I do not understand what you mean by the data being replaced.

If two datasets are merged that have common variable names for variables not used in the BY statement then the values from the dataset specified later in the merge statement will "win" when there are contributions from both datasets.

Is your actual problem that you want to rename some of the variables?

It would help if you could construct a simple example with just two or three datasets of just a few observations each having a minimum number of variables that demonstrate where you think that you are losing data.  So input tables, the desired output and if possible what you are getting now that you consider wrong.

New Contributor
Posts: 3

Re: Data is lost during merge....

Below is a sample of the merged file. stat@sas's advice seemed to help a bit. Prior to adjusting the order of file names in the merge statement, the output was showing blanks for VAR1-9 where it should have appeared; now that is resolved, but there are other problems. In the merged file there are 65,711 observations of VAR9 (and corresponding VAR10-12; I forgot to include VAR13 in my original example, as a part of the lower sets), but there should be 186,728 observations from that set alone.

Another problem I'm seeing is that for VAR1, SAS is A) not recognizing the difference between variables such as "8002" and "00008002" (which does not appear in the merged file), nor do VAR9-12 appear for that observation anyway, or B) it is also dropping variables from the other sets.

It does not matter that VAR3 overlaps, or one wins over the other, because it should be the same for VAR1 across all datasets. My concern is about missing observations, or incomplete information. Would it help if I merged one file at a time?

Example of Master File (Database)

VAR1VAR3VAR2VAR4VAR5VAR6VAR7VAR8VAR9
60425LOOKOklahomaEABanana59E07ID:041340
5224SKEWOklahomaEABanana42R05ID:047907
68-32104ANGLEWashingtonFTMelon90S02ID:01921
20-08110DECALWashingtonEAMelon21K02ID:01304

Example of Data1-6 Files

VAR1VAR3VAR12VAR10VAR11VAR13
020-1026FILTER84212300GAChelsea
020-1120BONDED87089910GAChelsea
060-134EXTEND73202081GAChelsea
090-92BONDED87089997GAChelsea
1-71HEB*170116BLUE84294010GAChelsea

Example of Merged File:

ObsVAR3VAR10VAR11VAR12VAR1VAR13G (appeared after merge)VAR2VAR4VAR5VAR6VAR7VAR8VAR9
18185COLORTPSteven83021030000000-050.
18186REGULARDARachelle.91.
18187CATCH.00003212.CADTexasEAPumpkin42R12ID:7917-93690
18188CIRCLEDARachelle73182100000036272BCH.
18189LOOK.8002VirginiaEAApple42R05ID:1661-2487
Super User
Super User
Posts: 7,988

Re: Data is lost during merge....

Could you provide a smaller example of what you have and what you want out.  In your example, VAR1 does not match in either table, and in the output you have Obs rather than Var1.  Are you attempting to merge on row number?  Also, as KurtBremser said, you should not merge datasets which have the same variable names as you will get unusual results.

One final thing: the text string "8002" will not merge to "00008002" as they are not the same.  There are several methods you could use, if this is all numeric then convert that column to numbers and then the would match as both would be 8002.

Its really not easy to see what you are trying to achieve as none of the above data seems to match in anyway.

Super User
Super User
Posts: 7,076

Re: Data is lost during merge....

I suspect that most of your problems are from using EXCEL as you source of the data.  When you use PROC IMPORT to read from a XLS file it has to guess (or some windows utility that it calls guesses) what the variable names, types and lengths are by looking a the first few lines of the file.

Try saving the files as CSV files and then writing DATA steps to read them.  That way you can control the variable types (and names).  That should let you solve the '8002' vs '00008002' issue.  If you want those values to be the same then read them into a numeric variable.  If you want them to be different then read them into a character variable.

If you are unsure how to create such a data step you can use PROC IMPORT to read the CSV file and recall the data step code that it generates and use that as the basis for you data step. 

Your examples do not appear demonstrate the issue you are saying that you have. In particular there is no overlap in the values of your key variable (VAR1) between MASTER and DATA1 example that you posted.  So the results or MERGE or SET would be the same for that sample data.

For example here is how I read in your example data. 

data master ;

  length var1 $20 var3 $10 var2 $20 var4 $2 var5 $20 var6 8 var7 $3 var8 $10 var9 $10;

  input var1 -- var9 ;

cards ;

60425 LOOK Oklahoma EA Banana 59 E07 . ID:041340

5224 SKEW Oklahoma EA Banana 42 R05 . ID:047907

68-32104 ANGLE Washington FT Melon 90 S02 . ID:01921

20-08110 DECAL Washington EA Melon 21 K02 . ID:01304

run;

data data1 ;

  length var1 $20 var3 $10 var12 $20 var11 $2 var13 $20 ;

  input var1 -- var13 ;

cards ;

020-1026 FILTER 84212300 GA Chelsea

020-1120 BONDED 87089910 GA Chelsea

060-134 EXTEND 73202081 GA Chelsea

090-92 BONDED 87089997 GA Chelsea

1-71HEB*170116 BLUE 84294010 GA Chelsea

run;

All of the observations from MASTER will have missing values for all the variables from DATA1  and the reverse.

Obs|var1          |var3  |var2      |var4|var5  |var6|var7|var8|var9     |var12   |var11|var13

1  |020-1026      |FILTER|          |    |      |   .|    |    |         |84212300|GA   |Chelsea

2  |020-1120      |BONDED|          |    |      |   .|    |    |         |87089910|GA   |Chelsea

3  |060-134       |EXTEND|          |    |      |   .|    |    |         |73202081|GA   |Chelsea

4  |090-92        |BONDED|          |    |      |   .|    |    |         |87089997|GA   |Chelsea

5  |1-71HEB*170116|BLUE  |          |    |      |   .|    |    |         |84294010|GA   |Chelsea

6  |20-08110      |DECAL |Washington|EA  |Melon |  21|K02 |    |ID:01304 |        |     |

7  |5224          |SKEW  |Oklahoma  |EA  |Banana|  42|R05 |    |ID:047907|        |     |

8  |60425         |LOOK  |Oklahoma  |EA  |Banana|  59|E07 |    |ID:041340|        |     |

9  |68-32104      |ANGLE |Washington|FT  |Melon |  90|S02 |    |ID:01921 |        |     |


New Contributor
Posts: 3

Re: Data is lost during merge....

Good Afternoon,

I hope everyone had a nice holiday if they were in the United States. I apologize for the delay in my response. I had to divert my attention to work the data manually in Excel in order to have it prepared in a timely fashion. I have returned to the files through SAS now, so that I am able to more efficiently work other files in the future.

I took a close look at the files I was working with. The "missing" data was missing from the SAS output because it was nonexistent in the files I was importing from, a problem that arose when I saved the files as an .xls (the '97-'03 version, versus the .xlsx), which limited the length. This problem was resolved when I followed Tom's advice and saved the original files as .csv. The loss of certain characters (such as zeros preceding numbers) was resolved when I pasted the log files respective to each import and modified the 'informat's and 'format's with '$' signs and made the lengths the same across all files.

The final output is much more accurate now to what I want to see. There are a few more glitches that need to be worked out. There are two variables that were created upon import, both being cases where the original variable title is followed by a blank column; all of the information that was meant to go in those columns has been shifted to a newly created variable. One observation is repeated at the beginning of the file 946 times. I suspect that there are other cases, as there are 188,576 rows of information only available in the master dataset, and the import file has only 186,728 Another thing I have noticed is one row starting with " in the first cell, with information that should be shifted to the right to line up with the columns correctly. There are commas in the files for the variable I've been calling VAR3; could this be confusing the reading of the .csv files?

I've learned thusfar that formatting has been my biggest problem, so if anyone has any thoughts I'd be happy to hear them.

Thank you so much for all of your help! It's much appreciated.

Best,

Briana

Super User
Super User
Posts: 7,076

Re: Data is lost during merge....

Another common source of trouble when reading data from Excel files (or CSV files generated from them) is the presence of embedded end-of-line characters in the values inside of cells with the sheet. One solution to try is to use the TERMSTR=CRLF option on the INFILE statement that reads the CSV file. You will still need to watch out where the embedded CR or LF has gotten into the value of your character variable in your SAS dataset, but you can normally clean that by using the COMPRESS() or TRANSLATE() function.

Super User
Posts: 7,854

Re: Data is lost during merge....

You absolutely need to avoid duplicate varnames in a data step w/ merge if you want to preserve the contents. The only vars that should be common across the input data sets are those in the by statement.

So I suggest that you rename those variables that exist in more than one dataset to something that identifies the originating dataset, like

merge

  ....

  dataX (rename=(var3=var3X))

  dataY(rename=(var3=var3Y))

  ...

;

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,988

Re: Data is lost during merge....

Just to add, you could also use SQL to explicitly select variables:

proc sql;

     create table WANT as

     select     COALESCE(A.VAR1,B.VAR1,C.VAR1) as VAR1,

                    A.VAR2 as DATASETA_VARIABLE2,

                    A.VAR3 as DATASETA_VARIABLE3,

                    ...

                    B.VAR5 as DATASETB_VARIABLE5

     from        DATA1 A,

                    DATA2 B,

                    ...

     where     A.VAR1=B.VAR1=...

Ask a Question
Discussion stats
  • 9 replies
  • 475 views
  • 6 likes
  • 5 in conversation