BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
alepage
Barite | Level 11

Hello,

we have two servers (development and production) on a Unix platform.  I need to move many datasets hosted on the development server to the production server. Thereafter, we need to carry out some quality control, just to make sure that the transfered datasets are identical.

 

I have looked on SAS community how to do that but the paper are older than 10 years.  So I wonder if someone has worked on this task recently and if he or she tell me how to make an efficient comparison.

Please note that the file transfer is done via winscp.  We don't have SAS connect

 

1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

If your SAS session cannot see both servers, then I agree PROC COMPARE is not an option.

 

If you google a bit, there are plenty of blog posts etc about using hash functions to calculate checksum on linux, e.g.:

https://www.redhat.com/en/blog/hashing-checksums

 

You could do it from SAS, but not necessary to use SAS.  And there may be easier OS tools for.

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

I have looked on SAS community how to do that but the paper are older than 10 years.  So I wonder if someone has worked on this task recently and if he or she tell me how to make an efficient comparison.


A 10 year old paper should be more than good enough and still fully valid.

 

Do you need to compare a SAS file moved from Dev with an already existing SAS file on Prod OR do you need some sort of check to ensure that the transfer of the SAS file from Dev to Prod didn't alter the file?

 

Proc Compare is what you can use for comparing two SAS files.

To ensure that a transfer didn't alter a file you can create a checksum over the file in both the source and target environment. If the checksums are the same then the files are identical. Make sure you transfer the files in binary mode.

In Linux I'd be using commands like sha512sum for creating such checksums.

 

 

alepage
Barite | Level 11

Hello Patrick,

I need to check that the SAS dataset from the development server that are transfered on the production server are identical and that we did not lost any observation or alter the file during transfer.  The files transfer were not hosted on the production server before.

 

Could you please provide an example how to use sha512sum linux command to compare let's say file A on lib1 and file A on lib2.

Please remerber that we are on a Linux platform and that the development server is not the same as the production server so I am not sure that a proc compare will works.

Quentin
Super User

If your SAS session cannot see both servers, then I agree PROC COMPARE is not an option.

 

If you google a bit, there are plenty of blog posts etc about using hash functions to calculate checksum on linux, e.g.:

https://www.redhat.com/en/blog/hashing-checksums

 

You could do it from SAS, but not necessary to use SAS.  And there may be easier OS tools for.

Patrick
Opal | Level 21

@alepage wrote:

Hello Patrick,

I need to check that the SAS dataset from the development server that are transfered on the production server are identical and that we did not lost any observation or alter the file during transfer.  The files transfer were not hosted on the production server before.

 

Could you please provide an example how to use sha512sum linux command to compare let's say file A on lib1 and file A on lib2.

Please remerber that we are on a Linux platform and that the development server is not the same as the production server so I am not sure that a proc compare will works.


@alepage 

Proc Compare would of course require that you transfer the source files first to some location on your target server. 

 

Using a Linux command like md5sum you could just create checksums both in the source and target environment and then compare these checksums. 

 

Creating and collecting checksums of all *.sas7b* files in a folder.

md5sum /mnt/c/temp/*.sas7b*>/mnt/c/temp/checksums_source_env.txt

Creating and collecting checksums of all *.sas7b* files in a folder and all sub-folders.

find /mnt/c/temp/_test -type f -name "*.sas7b*" -exec md5sum {} \; > /mnt/c/temp/checksums_source_env.txt

Running either of above commands will create a text file that's easy to use for comparing checksums between two environments.

Patrick_0-1729722541443.png

 

Please note:

Environmental differences like SAS files created with different encodings (i.e. wlatin1 and utf8) will result in different checksums even though the data is "the same".

This means that you would still need to execute Proc Compare for files with different checksums to determine what the differences actually are.

ChrisNZ
Tourmaline | Level 20
If you want to transfer the files, you can compare the data inside using proc compare. If you want to compare before transferring, or if you want to just use the OS on exactly identical files, you can compare the files' checksum (the SHA-256 hash values).
Ksharp
Super User

You could use PROC COMPARE to compare two datasets from different database.

If there are some XXXX character in output dataset of proc compare ,that means these two datasets is different.

The following code is I wrote for this purpose for the small data,And it maybe is not what you are looking for.

%let old_data=D:\XiaKeShan\LKM-2023-BCG01_study_data_sas_20240117134719\dataset   ;   *前一次数据 的路径;
%let new_data=D:\XiaKeShan\LKM-2023-BCG01_study_data_sas_20240129111311\dataset   ;   *新数据 的路径;

%let out= D:\XiaKeShan\listing编程需求 ;   /*指定 输出Excel路径 . 注意:不需要写 want.xlsx */


%let subjid=subject_code;   *受试者编号的变量名. 注意:所有的数据集中是一样的,且值是唯一的;
%let drop= SITE_NAME VERSION_NAME PI_NAME USERNAME SUBMIT_TIME UPDATE_USER UPDATE_TIME ;   *需要去掉的变量名;





















options validvarname=any validmemname=extend mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn;

/*******比较 前一次数据 和 新数据********/
libname old v9 "&old_data." access=readonly;
libname new v9 "&new_data." access=readonly;

%macro type_one(idx=,dsn=,title=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by  &subjid.  ;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by  &subjid.  ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid.;
run;
data call_define;
 set diff_&dsn.(drop=_TYPE_ _OBS_);
 retain _dummy_ .;
 array n{*} _numeric_;
 array c{*} _character_;
 length vname $ 40;
 do i=1 to dim(n);
   if n{i} not in (. 0) then do;vname=vname(n{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
 end;
 do i=1 to dim(c);
   if findc(c{i},'X') then do;vname=vname(c{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
 end;
keep &subjid. vname;
run;
filename t temp;
data _null_;
 set call_define;
 file t;
/* put 'if &subjid.="' &subjid. '" then call define("' vname '","style","style={background=yellow}");';*/
 put "if &subjid.='" &subjid. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
 length new_change $ 40;
 if 0 then set old.&dsn.;
 declare hash h(dataset:"old.&dsn.");
 h.definekey("&subjid.");
 h.definedone();

 declare hash h1(dataset:'call_define');
 h1.definekey("&subjid.");
 h1.definedone();
end;
 set new.&dsn.;
 new_change='old' ;
 if h.check() ne 0 then new_change='new' ;
 if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
ods excel options(sheet_name="&title."  Embedded_Titles='yes' autofilter='all' ) ;
title j=l link="#目录!A1"  "(*ESC*)S={textdecoration=underline color=blue }返回目录";
proc report data=&dsn. split='*'  nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;

%macro type_two(idx=,dsn=,key=,title=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by  &subjid. &key. ;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by  &subjid. &key. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. ;
run;
data call_define;
 set diff_&dsn.(drop=_TYPE_ _OBS_);
 retain _dummy_ .;
 array n{*} _numeric_;
 array c{*} _character_;
 length vname $ 40;
 do i=1 to dim(n);
   if n{i} not in (. 0) then do;vname=vname(n{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)") then output;end;
 end;
 do i=1 to dim(c);
   if findc(c{i},'X') then do;vname=vname(c{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)") then output;end;
 end;
keep &subjid. &key. vname;
run;
filename t temp;
data _null_;
 set call_define;
 file t;
 put "if &subjid.='" &subjid. "' and &key.='" &key. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
 length new_change $ 40;
 if 0 then set new.&dsn.;
 if 0 then set old.&dsn.;
 declare hash h(dataset:"old.&dsn.");
 h.definekey("&subjid.","&key.");
 h.definedone();

 declare hash h1(dataset:'call_define');
 h1.definekey("&subjid.","&key.");
 h1.definedone();
end;
 set new.&dsn.;
 new_change='old' ;
 if h.check() ne 0 then new_change='new' ;
 if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
ods excel options(sheet_name="&title."  Embedded_Titles='yes' autofilter='all' ) ;
title j=l link="#目录!A1"  "(*ESC*)S={textdecoration=underline color=blue }返回目录";
proc report data=&dsn. split='*'  nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;


%macro type_three(idx=,dsn=,key=,key2=,title=);
%local  value lower  upper ;
%if %upcase(%substr(&dsn.,1,2))=LB %then %do;
proc sql noprint;
select name into :value trimmed from dictionary.columns where libname='OLD' and memname="%upcase(&dsn.)" and label='检查结果';
select name into :lower trimmed from dictionary.columns where libname='OLD' and memname="%upcase(&dsn.)" and label='正常值下限';
select name into :upper trimmed from dictionary.columns where libname='OLD' and memname="%upcase(&dsn.)" and label='正常值上限';
quit;
%put &=value.  &=lower. &=upper.;
%end;

proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by  &subjid. &key.  &key2.;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by  &subjid. &key.  &key2.;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key.  &key2.;
run;
data call_define;
 set diff_&dsn.(drop=_TYPE_ _OBS_);
 retain _dummy_ .;
 array n{*} _numeric_;
 array c{*} _character_;
 length vname $ 40;
 do i=1 to dim(n);
   if n{i} not in (. 0) then do;vname=vname(n{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" "%upcase(&key2.)") then output;end;
 end;
 do i=1 to dim(c);
   if findc(c{i},'X') then do;vname=vname(c{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" "%upcase(&key2.)") then  output;end;
 end;
keep &subjid. &key. &key2. vname;
run;
filename t temp;
data _null_;
 set call_define;
 file t;
 put "if &subjid.='" &subjid. "' and &key.='" &key. "' and  &key2.='" &key2. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
 length new_change $ 40;
 if 0 then set old.&dsn.;
 declare hash h(dataset:"old.&dsn.");
 h.definekey("&subjid.","&key.","&key2.");
 h.definedone();

 declare hash h1(dataset:'call_define');
 h1.definekey("&subjid.","&key.","&key2.");
 h1.definedone();
end;
 set new.&dsn.;
 new_change='old' ;
 if h.check() ne 0 then new_change='new' ;
 if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';


%if %upcase(%substr(&dsn.,1,2))=LB %then %do;
_value=input(&value.,?? best.);
_lower=input(&lower.,?? best.);
_upper=input(&upper.,?? best.);
if not missing(_value) and not missing(_lower) then do; if _value<_lower then &value.=catx(' ',&value.,'-');  end;
if not missing(_value) and not missing(_upper) then do; if _value>_upper then &value.=catx(' ',&value.,'+');  end;
drop _value _lower _upper;
%end;
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
ods excel options(sheet_name="&title."  Embedded_Titles='yes' autofilter='all' ) ;
title j=l link="#目录!A1"  "(*ESC*)S={textdecoration=underline color=blue }返回目录";
proc report data=&dsn. split='*'  nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;


 




/********  
注意: 
1)目录的值和宏参数(title=)的值保持一致 。
2)目录的值不能包括 /\*~ 等非法字符,因为Sheet名无法显示这些字符。
********/
data 目录;
infile cards truncover expandtabs;
input 序号	目录 $80.;
cards;
1	标题
2	签署知情同意书
3	访视日期
4	人口学资料
5	非膀胱癌既往病史-手术史问询
6	非膀胱癌既往病史-手术史
7	过敏史
8	一般问询
9	膀胱癌病史
10	膀胱癌既往治疗史问询
11	膀胱癌既往治疗史
12	体格检查
13	生命体征问询
14	生命体征
15	ECOG评分
16	胸部X线片
17	超声心动图
18	十二导联心电图
19	C-TST试验
20	血妊娠试验
21	血常规
22	尿常规
23	血生化
24	肌酐
25	凝血功能
26	传染病筛查
27	泌尿系统彩超
28	影像学检查
29	入选排除标准问询
30	入选排除标准
31	筛选结果
32	给药记录表(1周1次)
33	PK采血
34	细胞因子血样采集
35	细胞因子尿样采集
36	药物脱落情况尿样采集
37	不良事件问询
38	不良事件
39	既往或合并用药问询
40	既往或合并用药
41	既往或合并非药物治疗问询
;

ods noresults;
ods _all_ close;
ods excel file="&out.\want.xlsx" ;
ods excel options(sheet_name="目录"  Embedded_Titles='yes' ) ;
title "医学监查数据审核表";
proc report data=目录 split='*' nowd;
compute 目录;
 call define(_col_,'url',cats('#''',目录,'''!A1') );
 call define(_col_,'style','style={textdecoration=underline color=blue}');
endcomp;
run;

%type_one(idx=1,dsn=PRIMARY,title=标题)
%type_one(idx=2,dsn=DS,title=签署知情同意书)
%type_two(idx=3,dsn=SV1,key=folder_name,title=访视日期)
%type_one(idx=4,dsn=DM,title=人口学资料)
%type_one(idx=5,dsn=MH1CO ,title=非膀胱癌既往病史-手术史问询)
%type_two(idx=6,dsn=MH1,key=spid,title=非膀胱癌既往病史-手术史)
%type_one(idx=7,dsn=MH2,title=过敏史)
%type_one(idx=8,dsn=SU,title=一般问询)
%type_one(idx=9,dsn=MH3,title=膀胱癌病史)
%type_one(idx=10,dsn=MH4CO,title=膀胱癌既往治疗史问询)
%type_two(idx=11,dsn=MH4,key=spid,title=膀胱癌既往治疗史)
%type_three(idx=12,dsn=PE,key=folder_name,key2=OID,title=体格检查)
%type_two(idx=13,dsn=VSCO,key=folder_name ,title=生命体征问询)
%type_three(idx=14,dsn=VS,key=spid,key2=OID,title=生命体征)
%type_two(idx=15,dsn=QS,key=folder_name,title=ECOG评分)
%type_two(idx=16,dsn=FA1,key=folder_name,title= 胸部X线片)
%type_two(idx=17,dsn=FA2,key=folder_name,title=超声心动图)
%type_two(idx=18,dsn=EG,key=folder_name,title=十二导联心电图)
%type_two(idx=19,dsn=TST,key=folder_name,title=C-TST试验)
%type_three(idx=20,dsn=LBHCG,key=folder_name,key2=OID ,title=血妊娠试验)
%type_three(idx=21,dsn=LBBR,key=folder_name,key2=OID,title=血常规)
%type_three(idx=22,dsn=LBUR,key=folder_name,key2=OID,title=尿常规)
%type_three(idx=23,dsn=LBBB,key=folder_name,key2=OID,title=血生化)
%type_three(idx=24,dsn=LBSCR,key=folder_name,key2=OID,title=肌酐)
%type_three(idx=25,dsn=LBCG,key=folder_name,key2=OID,title=凝血功能)
%type_two(idx=26,dsn=IFD,key=folder_name,title=传染病筛查)
%type_three(idx=27,dsn=FA3,key=folder_name,key2=OID,title=泌尿系统彩超)
%type_two(idx=28,dsn=FA4,key=folder_name,title=影像学检查)
%type_two(idx=29,dsn=IECO,key=folder_name,title=入选排除标准问询)
%type_two(idx=30,dsn=IE,key=spid,title=入选排除标准)
%type_one(idx=31,dsn=DS1 ,title=筛选结果)
%type_two(idx=32,dsn=EX1,key=OID,title=给药记录表(1周1次))
%type_two(idx=33,dsn=PC,key=OID ,title=PK采血)
%type_two(idx=34,dsn=CB,key=OID,title=细胞因子血样采集)
%type_two(idx=35,dsn=CU,key=OID,title=细胞因子尿样采集)
%type_two(idx=36,dsn=MU,key=OID,title=药物脱落情况尿样采集)
%type_one(idx=37,dsn=AECO,title=不良事件问询)
%type_two(idx=38,dsn=AE,key=spid,title=不良事件)
%type_one(idx=39,dsn=CMCO,title=既往或合并用药问询)
%type_two(idx=40,dsn=CM,key=spid,title=既往或合并用药)
%type_one(idx=41,dsn=CM1CO,title=既往或合并非药物治疗问询)

ods excel close;


 

sajjad1122
Fluorite | Level 6
To compare datasets stored in two different libraries, follow these steps:

Import Libraries: Load the necessary libraries for data manipulation and comparison, such as pandas in Python.

Load Datasets: Read the datasets from their respective libraries into dataframes. For example:

python
Copy code
import pandas as pd
df1 = pd.read_csv('path/to/dataset1.csv')
df2 = pd.read_csv('path/to/dataset2.csv')
Identify Common Keys: Determine the columns that can be used as keys to align the datasets, such as IDs or timestamps.

Merge Datasets: Use a merge function to combine the datasets based on the identified keys. For instance:

python
Copy code
merged_df = pd.merge(df1, df2, on='common_key', how='outer')
Compare Data: Analyze the merged dataset to identify differences. You can use functions like isnull() to find missing values or compare specific columns to identify discrepancies.

Visualize Differences: Optionally, use visualization libraries like matplotlib or seaborn to graphically represent the differences between the datasets.

By following these steps, you can effectively compare datasets from two different libraries and gain insights into their similarities and differences.
https://tglinkhub.com/
Patrick
Opal | Level 21

@sajjad1122 I really like Python but I can't see the point to use it for comparing SAS datasets given there is Proc Compare which does the job. And unlike with Python Proc Compare will compare the actual original SAS data sets and not data that first got converted to data frames.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 2561 views
  • 3 likes
  • 6 in conversation