BookmarkSubscribeRSS Feed
rajeshm
Quartz | Level 8

 

found few same records are in  "onlyfirst" and  "onlysecond". but it is missing  "both".can somebody explain me how it is possible?

data both onlyfirst onlysecond;
merge xx(in=first) yy(in=second);
by _all_;
if first and second then output both;
else if first and not second then output onlyfirst;
else if second and not first then output onlysecond;
run;

 

11 REPLIES 11
mkeintz
PROC Star

I don't see anything wrong with your logic, or syntax.

 

Show the log please, and perhaps you can provide a sample of the offending records in XX and YY, in a data step. 

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
rajeshm
Quartz | Level 8

 

data both onlyA onlyB
merge datasetA(in=first) datasetB(in=second);
by sumtotal dT SYS PR AGT PID DATE;
if first and second then output both;
else if first and not second then output onlyA 
else if second and not first then output onlyB
run;

WARNING: Multiple lengths were specified for the BY variable sys by input data sets. This might cause
unexpected results.
WARNING: Multiple lengths were specified for the BY variable pr by input data sets. This might
cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable agt by input data sets. This might
cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable pid by input data sets. This might
cause unexpected results.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 423581 observations read from the data set WORK.datasetA.
NOTE: There were 420182 observations read from the data set WORK.datasetB.
NOTE: The data set WORK.BOTH has 376201 observations and 7 variables.
NOTE: The data set WORK.onlyA has 47380 observations and 7 variables.
NOTE: The data set WORK.ONLYP5 has 43981 observations and 7 variables.

------------log--------------------------------------------------------------------------------------

 

merge conflicts confirmation

I exported datasetA and datasetB seperately and then pasted these two in a new different excel(47380+43981 rows),Again imported in sas and used the proc sql to find the unique records and found only 3500+ unique records(means record may be in WORK.onlyA or may be in WORK.onlyB)

 

if any issues with log , how "both" contains that many number of records? I am totally confused can you help me.

 

 

rajeshm
Quartz | Level 8
amount=198.67 date=05JUL2020 sys=1234 prin=0000 agent=1000 promoid=DT25DAY date_prod=05JUL2020
first=1 second=0
amount=198.67 date=05JUL2020 sys=1234 prin=0000 agent=1000 promoid=DT25DAY date_prod=05JUL2020
first=0 second=1

same values in first and second dataset and log showing the same  but ....it killing me. please help me.

first and second are datasets in varibles.

Tom
Super User Tom
Super User

@rajeshm wrote:
amount=198.67 date=05JUL2020 sys=1234 prin=0000 agent=1000 promoid=DT25DAY date_prod=05JUL2020
first=1 second=0
amount=198.67 date=05JUL2020 sys=1234 prin=0000 agent=1000 promoid=DT25DAY date_prod=05JUL2020
first=0 second=1

same values in first and second dataset and log showing the same  but ....it killing me. please help me.

first and second are datasets in varibles.


What are the BY variables here?  All of them?  Let's look for sources of mismatches.

For the variable AMOUNT remember that decimal fractions (0.67) cannot be exactly represented in floating point numbers. Make sure to round them both to the same level of precision before compare.

 

For the DATE_PROD variable, if that is a date variable remember that the DATE9 format (any any date format) will ignore the decimal part of the number of days.  SAS stores datetime as number of seconds but Excel stores datetime values as number of days and percent of 24 hours.  So if you imported datetime values as dates from EXCEL you might have a value like a value like 22,101.5 for noon on 05JUL2020 instead of just 22,101.

If any of the variables are character then look for leading spaces, those normally are not displayed in listings and error messages.  You could use the $QUOTE format to have it print the values with quotes around it so that leading spaces would be more visible.  

 

Also look for invisible characters in character variables, like TAB, CR, LF, non-breaking space, nulls, etc.

Kurt_Bremser
Super User

So you took nicely perfumed data (SAS) and pulled it through a stinking pile of rotten manure (Excel), and now you expect that it still smells nicely?

 

What for are you taking the Excel detour that renders your data unusable?

rajeshm
Quartz | Level 8
I have imported data from 2 dwh and created 2 datasets separately. Now my task is to know common records in 2 dwh and only fird dwh record count and only second dwh record count. So used above logic , but only first dwh is showing 47k and only second dwh showing as 45k and then used put option to know why this much variation and logged the put values and attached here already. For testing, exported both (only first and only second) and merged both as Excel and again imported and with my code, found only 3k unique records .
Q1 log/put values are same but finally it is showing as record in only first dwh and not in second and vice-versa.
Kurt_Bremser
Super User

If you need to compare datasets between data warehouses that do not reside on the same server, the easiest method is to copy the .sas7bdat file from one server to the other. If one (or both) of the DWH's is not running SAS, export to a text file and import that, so you can make sure that the variable attributes match before you start to compare.

Excel files are the WORST option for this, by lightyears.

Reeza
Super User
It means you don't have identical records in both data sets.
ballardw
Super User

@rajeshm wrote:

 

found few same records are in  "onlyfirst" and  "onlysecond". but it is missing  "both".can somebody explain me how it is possible?


The code is examining which data sets each record comes from. The In= option creates a variable that is true when the current observation contains a record from that data set and false otherwise.

So when the current observation only comes from one set in these cases those records only go to the indicated output dataset.

else if first and not second then output onlyfirst;
else if second and not first then output onlysecond;

The above ONLY gets executed when observations do not contain records from both data sets.

AMSAS
SAS Super FREQ

A simple example should help.

Code below creates 2 datasets first and second, then merges them.

Any observation in both first and second will be output to both (A, C & D)
Any observation in first only will be output to first (B)

Any observation in second only will be output to second (E)

 

Note I simplified the logic, you don't need the "and not second" or "and not first", after the first IF condition you would be left with only observations that are in one or the other dataset. 

 

data first ;
	input id $ ;
cards ;
A
B
C
D
;
data second ;
	input id $ ;
cards ;
A
C
D
E
;

data both  first second ;
	merge first (in=first) second (in=second) ;
	by id ;
	if first and second then output both ;
	else if first then output first ;
	else output second ;
run ;
AMSAS
SAS Super FREQ
There's a good lesson here to be learned, and that is to always simplify the problem, reducing the number of observations to something you can manage in your head.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 991 views
  • 0 likes
  • 7 in conversation