DATA Step, Macro, Functions and more

Overwriting of a row in a data set by another row in another data set (different values)

Accepted Solution Solved
Reply
Super Contributor
Posts: 395
Accepted Solution

Overwriting of a row in a data set by another row in another data set (different values)

Assuming that columns with underscore has passed several conditions. I want the rows in the second data set overwrite the one in the first data set. The biggest problem here is that some values don't match because of variable columns (_) in the first and second data set. Another problem is the multiplying rows. I don’t want rows to multiply. The proc sql that i have created below combines the data sets however multiplies the rows. Can you help me about this problem? I have attached the code and the image so it might help you. 

 

data Main_SQL;
length InstanceID $ 20 Default 8 Q0001_ 8 Q0002_ 8 Q0003 8 Q0004_ 8 Q0005 8 Q0006 8;
infile datalines missover dlm=",";
input InstanceID Default Q0001_ Q0002_ Q0003 Q0004_ Q0005 Q0006 ;
datalines;
InstanceID1,0,0.90,0.80,0.00,0.90,0.00,0.70
InstanceID2,0,0.100,0.100,0.00,0.70,0.00,60
InstanceID3,1,0.40,0.80,0.00,0.90,0.00,0.50
InstanceID4,0,0.55,0.80,0.05,0.90,0.00,0.69
InstanceID5,0,0.00,0.80,0.60,0.90,0.20,0.90
InstanceID6,1,0.96,0.00,0.40,0.90,0.00,0.95
InstanceID7,0,0.00,0.80,0.90,0.90,0.00,0.99
InstanceID8,1,0.56,0.80,0.55,0.90,0.00,0.93
InstanceID9,0,0.99,0.80,0.99,0.90,0.00,0.70
InstanceID10,0,0.89,0.88,0.56,0.90,0.00,0.00
;
run;
Proc means data=Main_SQL STACKODS noprint
FW=12
PRINTALLTYPES
CHARTYPE
QMETHOD=OS
VARDEF=DF
	MEAN
	STD
	MODE
	P10
	P90 ;
VAR Q:;
ODS OUTPUT Summary=backTestGroup2;
run;
data backTestGroup2;
  set backTestGroup2;
  UPPER =((P90-MODE)/(P90-P10))*STDDEV+MODE;
  LOWER =((MODE-P10)/(P90-P10))*STDDEV+MODE;
  CENTER=(UPPER+LOWER)/2;
  SLOPE =2.95/(UPPER-CENTER);
  keep Variable Mean Stddev Center Slope;
run;
data ozMeans;
length Variable $ 12 Mean 8 Stddev 8 Center 8 Slope 8;
infile datalines missover dlm=",";
input Variable Mean Stddev Center Slope;
datalines;
Q0002_,0.700,0.12227662376012,0.56113831188006,-5.3700431505101
Q0004,0.535,0.398281028644069,0.199140514322035,14.8136606458166
Q0006_,0.658,0.32227662376012,0.96113831188006,-20.3700431505101
;
PROC SQL;
Create table backTestGroup2 as
Select* from(Select * from ozMeans 
union all
select * from backTestGroup2
where left(trim(Variable)||"_") not in(Select Variable from ozMeans)) AA
order by aa.variable;
QUIT;

Accepted Solutions
Solution
‎10-11-2015 06:09 PM
Contributor hbi
Contributor
Posts: 66

Re: Overwriting of a row in a data set by another row in another data set (different values)

If you use "upcase()", your results should be consistent with the one I provided.

 

change:  TRANSTRN(Variable, "_", "")

    to:  UPCASE(TRANSTRN(Variable, "_", "")) 

 

See below:

 

 

PROC SQL;
 /* CREATE TABLE backTestGroup2 AS */
    CREATE TABLE backTestGroup3 AS 
    SELECT Variable  length=20
         , COUNT(DISTINCT CATX('|', Mean, Stddev, Center, Slope, Mean)) AS num_distinct_values
         , AVG(Mean)   AS Mean_Mean
         , AVG(Stddev) AS Mean_Stddev
         , AVG(Center) AS Mean_Center
         , AVG(Slope)  AS Mean_Slope
         , CASE WHEN calculated num_distinct_values > 1 
                THEN '<== Note: summary statistic is an average (more than one unique row was returned)'  
           ELSE '' END AS remark length=100
    FROM (SELECT DISTINCT UPCASE(TRANSTRN(Variable, "_", "")) AS Variable, Mean, Stddev, Center, Slope
          FROM ozMeans
          UNION
          SELECT DISTINCT UPCASE(TRANSTRN(Variable, "_", "")) AS Variable, Mean, Stddev, Center, Slope
          FROM backTestGroup2
          WHERE UPCASE(TRANSTRN(Variable, "_", "")) 
                   NOT IN (SELECT UPCASE(TRANSTRN(Variable, "_", "")) FROM ozMeans)
         ) As InnerQry
    GROUP BY Variable
    ORDER BY 1;
QUIT;

 

View solution in original post


All Replies
Super User
Posts: 5,431

Re: Overwriting of a row in a data set by another row in another data set (different values)

Well, I gues that you need to make your Q columns names match somehow. Compressing "_" whould be simple, but perhaps not on your real data?

Then I would use data step - modify by to update the master data set, should not generate duplicates.

Data never sleeps
Super User
Super User
Posts: 7,970

Re: Overwriting of a row in a data set by another row in another data set (different values)

Hi,

 

You mean you want to use the values from OZMEANS when they exist or use the other table.  If so the below code will do that, the trick is putting the update first in the COALESCE() order, so it will be taken if it exists.  Do note, consisten formatting, casing, indetation, makes code far more readable.

data main_sql;
  length instanceid $ 20 default 8 q0001_ 8 q0002_ 8 q0003 8 q0004_ 8 q0005 8 q0006 8;
  infile datalines missover dlm=",";
  input instanceid default q0001_ q0002_ q0003 q0004_ q0005 q0006 ;
datalines;
InstanceID1,0,0.90,0.80,0.00,0.90,0.00,0.70
InstanceID2,0,0.100,0.100,0.00,0.70,0.00,60
InstanceID3,1,0.40,0.80,0.00,0.90,0.00,0.50
InstanceID4,0,0.55,0.80,0.05,0.90,0.00,0.69
InstanceID5,0,0.00,0.80,0.60,0.90,0.20,0.90
InstanceID6,1,0.96,0.00,0.40,0.90,0.00,0.95
InstanceID7,0,0.00,0.80,0.90,0.90,0.00,0.99
InstanceID8,1,0.56,0.80,0.55,0.90,0.00,0.93
InstanceID9,0,0.99,0.80,0.99,0.90,0.00,0.70
InstanceID10,0,0.89,0.88,0.56,0.90,0.00,0.00
;
run;
proc means data=main_sql stackods noprint fw=12 printalltypes chartype qmethod=os vardef=df mean std mode p10 p90;
  var q:;
  ods output summary=backtestgroup2;
run;
data backtestgroup2;
  set backtestgroup2;
  upper =((p90-mode)/(p90-p10))*stddev+mode;
  lower =((mode-p10)/(p90-p10))*stddev+mode;
  center=(upper+lower)/2;
  slope =2.95/(upper-center);
  keep variable mean stddev center slope;
run;
data ozmeans;
  length variable $ 12 mean 8 stddev 8 center 8 slope 8;
  infile datalines missover dlm=",";
  input variable mean stddev center slope;
datalines;
Q0002_,0.700,0.12227662376012,0.56113831188006,-5.3700431505101
Q0004,0.535,0.398281028644069,0.199140514322035,14.8136606458166
Q0006_,0.658,0.32227662376012,0.96113831188006,-20.3700431505101
;
run;

proc sql;
  create table RESULT as
  select  COALESCE(A.VARIABLE,B.VARIABLE) as VARIABLE,
          COALESCE(A.MEAN,B.MEAN) as MEAN,
          COALESCE(A.STDDEV,B.STDDEV) as STDDEV,
          COALESCE(A.CENTER,B.CENTER) as CENTER,
          COALESCE(A.SLOPE,B.SLOPE) as SLOPE
  from    OZMEANS A
  full join BACKTESTGROUP2 B
  on      strip(tranwrd(A.VARIABLE,"_",""))=strip(tranwrd(B.VARIABLE,"_",""));
quit;
Super Contributor
Posts: 395

Re: Overwriting of a row in a data set by another row in another data set (different values)

Thank you very much for your support. You have got me right but i couldnt do overwriting on the code you have sent. I think youe code multiplied too. I want it to be 6 rows totally.

 

Thanks a lot Smiley Happy

 

Super User
Super User
Posts: 7,970

Re: Overwriting of a row in a data set by another row in another data set (different values)

Did you run the code I posted, as when I run it I got 6 records.  Post your output if it doesn't work and the code you use.

Super Contributor
Posts: 395

Re: Overwriting of a row in a data set by another row in another data set (different values)

Hi RW9,

 

Am i doing something wrong. The result of table is as follows:

 

Try.png

Contributor hbi
Contributor
Posts: 66

Re: Overwriting of a row in a data set by another row in another data set (different values)

[ Edited ]

Hi there,

 

This should be pretty close to what you are looking for. Note that for "Q0002" and "Q0006", multiple summary statistics are returned, so you will need to use another summary function such as MIN(), MAX() or AVG(). In this example, I used AVG, but you can use MIN or MAX if that better suits your needs. 

 

PROC SQL;
 /* CREATE TABLE backTestGroup2 AS */
    CREATE TABLE backTestGroup3 AS 
    SELECT Variable  length=20
         , COUNT(DISTINCT CATX('|', Mean, Stddev, Center, Slope, Mean)) AS num_distinct_values
         , AVG(Mean)   AS Mean_Mean
         , AVG(Stddev) AS Mean_Stddev
         , AVG(Center) AS Mean_Center
         , AVG(Slope)  AS Mean_Slope
         , CASE WHEN calculated num_distinct_values > 1 
                THEN '<== Note: summary statistic is an average (more than one unique row was returned)'  
           ELSE '' END AS remark length=100
    FROM (SELECT DISTINCT TRANSTRN(Variable, "_", "") AS Variable, Mean, Stddev, Center, Slope
          FROM ozMeans
          UNION
          SELECT DISTINCT TRANSTRN(Variable, "_", "") AS Variable, Mean, Stddev, Center, Slope
          FROM backTestGroup2
          WHERE TRANSTRN(Variable, "_", "") NOT IN (SELECT Variable FROM ozMeans)) As InnerQry
    GROUP BY Variable
    ORDER BY 1;
QUIT;

 

Here is the output: 

(edited: updated image; old output was incorrect)

 

proc_sql_union_mean_of_aggregates_(fixed).gif 

Enjoy. 

 

-- hbi

Super Contributor
Posts: 395

Re: Overwriting of a row in a data set by another row in another data set (different values)

Thank you very much. But i'm not sure we receive the same results. Because when i run your code i got the data set as below.But your codes give some idea to me to write result which i want it. Thanks a lot @hbi Smiley Happy

 

hbi.png

 

Your advices seems to be fine. Can you give some sample data set or share with me a link ? @LinusH

Solution
‎10-11-2015 06:09 PM
Contributor hbi
Contributor
Posts: 66

Re: Overwriting of a row in a data set by another row in another data set (different values)

If you use "upcase()", your results should be consistent with the one I provided.

 

change:  TRANSTRN(Variable, "_", "")

    to:  UPCASE(TRANSTRN(Variable, "_", "")) 

 

See below:

 

 

PROC SQL;
 /* CREATE TABLE backTestGroup2 AS */
    CREATE TABLE backTestGroup3 AS 
    SELECT Variable  length=20
         , COUNT(DISTINCT CATX('|', Mean, Stddev, Center, Slope, Mean)) AS num_distinct_values
         , AVG(Mean)   AS Mean_Mean
         , AVG(Stddev) AS Mean_Stddev
         , AVG(Center) AS Mean_Center
         , AVG(Slope)  AS Mean_Slope
         , CASE WHEN calculated num_distinct_values > 1 
                THEN '<== Note: summary statistic is an average (more than one unique row was returned)'  
           ELSE '' END AS remark length=100
    FROM (SELECT DISTINCT UPCASE(TRANSTRN(Variable, "_", "")) AS Variable, Mean, Stddev, Center, Slope
          FROM ozMeans
          UNION
          SELECT DISTINCT UPCASE(TRANSTRN(Variable, "_", "")) AS Variable, Mean, Stddev, Center, Slope
          FROM backTestGroup2
          WHERE UPCASE(TRANSTRN(Variable, "_", "")) 
                   NOT IN (SELECT UPCASE(TRANSTRN(Variable, "_", "")) FROM ozMeans)
         ) As InnerQry
    GROUP BY Variable
    ORDER BY 1;
QUIT;

 

Super Contributor
Posts: 395

Re: Overwriting of a row in a data set by another row in another data set (different values)

Thanks a lot. It worked.I'll examine whole of the code in detail. Thanks for leaded me.

 

Can.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 324 views
  • 1 like
  • 4 in conversation