Hello,
we have two servers (development and production) on a Unix platform. I need to move many datasets hosted on the development server to the production server. Thereafter, we need to carry out some quality control, just to make sure that the transfered datasets are identical.
I have looked on SAS community how to do that but the paper are older than 10 years. So I wonder if someone has worked on this task recently and if he or she tell me how to make an efficient comparison.
Please note that the file transfer is done via winscp. We don't have SAS connect
If your SAS session cannot see both servers, then I agree PROC COMPARE is not an option.
If you google a bit, there are plenty of blog posts etc about using hash functions to calculate checksum on linux, e.g.:
https://www.redhat.com/en/blog/hashing-checksums
You could do it from SAS, but not necessary to use SAS. And there may be easier OS tools for.
I have looked on SAS community how to do that but the paper are older than 10 years. So I wonder if someone has worked on this task recently and if he or she tell me how to make an efficient comparison.
A 10 year old paper should be more than good enough and still fully valid.
Do you need to compare a SAS file moved from Dev with an already existing SAS file on Prod OR do you need some sort of check to ensure that the transfer of the SAS file from Dev to Prod didn't alter the file?
Proc Compare is what you can use for comparing two SAS files.
To ensure that a transfer didn't alter a file you can create a checksum over the file in both the source and target environment. If the checksums are the same then the files are identical. Make sure you transfer the files in binary mode.
In Linux I'd be using commands like sha512sum for creating such checksums.
Hello Patrick,
I need to check that the SAS dataset from the development server that are transfered on the production server are identical and that we did not lost any observation or alter the file during transfer. The files transfer were not hosted on the production server before.
Could you please provide an example how to use sha512sum linux command to compare let's say file A on lib1 and file A on lib2.
Please remerber that we are on a Linux platform and that the development server is not the same as the production server so I am not sure that a proc compare will works.
If your SAS session cannot see both servers, then I agree PROC COMPARE is not an option.
If you google a bit, there are plenty of blog posts etc about using hash functions to calculate checksum on linux, e.g.:
https://www.redhat.com/en/blog/hashing-checksums
You could do it from SAS, but not necessary to use SAS. And there may be easier OS tools for.
@alepage wrote:
Hello Patrick,
I need to check that the SAS dataset from the development server that are transfered on the production server are identical and that we did not lost any observation or alter the file during transfer. The files transfer were not hosted on the production server before.
Could you please provide an example how to use sha512sum linux command to compare let's say file A on lib1 and file A on lib2.
Please remerber that we are on a Linux platform and that the development server is not the same as the production server so I am not sure that a proc compare will works.
Proc Compare would of course require that you transfer the source files first to some location on your target server.
Using a Linux command like md5sum you could just create checksums both in the source and target environment and then compare these checksums.
Creating and collecting checksums of all *.sas7b* files in a folder.
md5sum /mnt/c/temp/*.sas7b*>/mnt/c/temp/checksums_source_env.txt
Creating and collecting checksums of all *.sas7b* files in a folder and all sub-folders.
find /mnt/c/temp/_test -type f -name "*.sas7b*" -exec md5sum {} \; > /mnt/c/temp/checksums_source_env.txt
Running either of above commands will create a text file that's easy to use for comparing checksums between two environments.
Please note:
Environmental differences like SAS files created with different encodings (i.e. wlatin1 and utf8) will result in different checksums even though the data is "the same".
This means that you would still need to execute Proc Compare for files with different checksums to determine what the differences actually are.
You could use PROC COMPARE to compare two datasets from different database.
If there are some XXXX character in output dataset of proc compare ,that means these two datasets is different.
The following code is I wrote for this purpose for the small data,And it maybe is not what you are looking for.
%let old_data=D:\XiaKeShan\LKM-2023-BCG01_study_data_sas_20240117134719\dataset ; *前一次数据 的路径;
%let new_data=D:\XiaKeShan\LKM-2023-BCG01_study_data_sas_20240129111311\dataset ; *新数据 的路径;
%let out= D:\XiaKeShan\listing编程需求 ; /*指定 输出Excel路径 . 注意:不需要写 want.xlsx */
%let subjid=subject_code; *受试者编号的变量名. 注意:所有的数据集中是一样的,且值是唯一的;
%let drop= SITE_NAME VERSION_NAME PI_NAME USERNAME SUBMIT_TIME UPDATE_USER UPDATE_TIME ; *需要去掉的变量名;
options validvarname=any validmemname=extend mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn;
/*******比较 前一次数据 和 新数据********/
libname old v9 "&old_data." access=readonly;
libname new v9 "&new_data." access=readonly;
%macro type_one(idx=,dsn=,title=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by &subjid. ;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by &subjid. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid.;
run;
data call_define;
set diff_&dsn.(drop=_TYPE_ _OBS_);
retain _dummy_ .;
array n{*} _numeric_;
array c{*} _character_;
length vname $ 40;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i}); if upcase(vname) not in ("%upcase(&subjid.)") then output;end;
end;
keep &subjid. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
/* put 'if &subjid.="' &subjid. '" then call define("' vname '","style","style={background=yellow}");';*/
put "if &subjid.='" &subjid. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
length new_change $ 40;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.");
h1.definedone();
end;
set new.&dsn.;
new_change='old' ;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
ods excel options(sheet_name="&title." Embedded_Titles='yes' autofilter='all' ) ;
title j=l link="#目录!A1" "(*ESC*)S={textdecoration=underline color=blue }返回目录";
proc report data=&dsn. split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;
%macro type_two(idx=,dsn=,key=,title=);
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by &subjid. &key. ;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by &subjid. &key. ;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. ;
run;
data call_define;
set diff_&dsn.(drop=_TYPE_ _OBS_);
retain _dummy_ .;
array n{*} _numeric_;
array c{*} _character_;
length vname $ 40;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)") then output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)") then output;end;
end;
keep &subjid. &key. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
put "if &subjid.='" &subjid. "' and &key.='" &key. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
length new_change $ 40;
if 0 then set new.&dsn.;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.","&key.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.","&key.");
h1.definedone();
end;
set new.&dsn.;
new_change='old' ;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
ods excel options(sheet_name="&title." Embedded_Titles='yes' autofilter='all' ) ;
title j=l link="#目录!A1" "(*ESC*)S={textdecoration=underline color=blue }返回目录";
proc report data=&dsn. split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;
%macro type_three(idx=,dsn=,key=,key2=,title=);
%local value lower upper ;
%if %upcase(%substr(&dsn.,1,2))=LB %then %do;
proc sql noprint;
select name into :value trimmed from dictionary.columns where libname='OLD' and memname="%upcase(&dsn.)" and label='检查结果';
select name into :lower trimmed from dictionary.columns where libname='OLD' and memname="%upcase(&dsn.)" and label='正常值下限';
select name into :upper trimmed from dictionary.columns where libname='OLD' and memname="%upcase(&dsn.)" and label='正常值上限';
quit;
%put &=value. &=lower. &=upper.;
%end;
proc datasets library=work kill nolist nodetails;quit;
proc sort data=old.&dsn.(drop=&drop.) out=old;by &subjid. &key. &key2.;run;
proc sort data=new.&dsn.(drop=&drop.) out=new;by &subjid. &key. &key2.;run;
proc compare data=old compare=new noprint out=diff_&dsn.;
id &subjid. &key. &key2.;
run;
data call_define;
set diff_&dsn.(drop=_TYPE_ _OBS_);
retain _dummy_ .;
array n{*} _numeric_;
array c{*} _character_;
length vname $ 40;
do i=1 to dim(n);
if n{i} not in (. 0) then do;vname=vname(n{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" "%upcase(&key2.)") then output;end;
end;
do i=1 to dim(c);
if findc(c{i},'X') then do;vname=vname(c{i});if upcase(vname) not in ("%upcase(&subjid.)" "%upcase(&key.)" "%upcase(&key2.)") then output;end;
end;
keep &subjid. &key. &key2. vname;
run;
filename t temp;
data _null_;
set call_define;
file t;
put "if &subjid.='" &subjid. "' and &key.='" &key. "' and &key2.='" &key2. "' then call define('" vname "','style','style={background=yellow}');";
run;
data &dsn.;
if _n_=1 then do;
length new_change $ 40;
if 0 then set old.&dsn.;
declare hash h(dataset:"old.&dsn.");
h.definekey("&subjid.","&key.","&key2.");
h.definedone();
declare hash h1(dataset:'call_define');
h1.definekey("&subjid.","&key.","&key2.");
h1.definedone();
end;
set new.&dsn.;
new_change='old' ;
if h.check() ne 0 then new_change='new' ;
if h1.check() = 0 then new_change='change' ;
label new_change='数据状态';
%if %upcase(%substr(&dsn.,1,2))=LB %then %do;
_value=input(&value.,?? best.);
_lower=input(&lower.,?? best.);
_upper=input(&upper.,?? best.);
if not missing(_value) and not missing(_lower) then do; if _value<_lower then &value.=catx(' ',&value.,'-'); end;
if not missing(_value) and not missing(_upper) then do; if _value>_upper then &value.=catx(' ',&value.,'+'); end;
drop _value _lower _upper;
%end;
run;
/*ods excel file="&out.\&idx..&dsn..xlsx" options(sheet_name="&dsn." autofilter='all' ) ;*/
ods excel options(sheet_name="&title." Embedded_Titles='yes' autofilter='all' ) ;
title j=l link="#目录!A1" "(*ESC*)S={textdecoration=underline color=blue }返回目录";
proc report data=&dsn. split='*' nowd;
column _all_ dummy;
define _all_/display;
define dummy/computed noprint;
compute dummy;
%include t;
endcomp;
run;
/*ods excel close;*/
%mend;
/********
注意:
1)目录的值和宏参数(title=)的值保持一致 。
2)目录的值不能包括 /\*~ 等非法字符,因为Sheet名无法显示这些字符。
********/
data 目录;
infile cards truncover expandtabs;
input 序号 目录 $80.;
cards;
1 标题
2 签署知情同意书
3 访视日期
4 人口学资料
5 非膀胱癌既往病史-手术史问询
6 非膀胱癌既往病史-手术史
7 过敏史
8 一般问询
9 膀胱癌病史
10 膀胱癌既往治疗史问询
11 膀胱癌既往治疗史
12 体格检查
13 生命体征问询
14 生命体征
15 ECOG评分
16 胸部X线片
17 超声心动图
18 十二导联心电图
19 C-TST试验
20 血妊娠试验
21 血常规
22 尿常规
23 血生化
24 肌酐
25 凝血功能
26 传染病筛查
27 泌尿系统彩超
28 影像学检查
29 入选排除标准问询
30 入选排除标准
31 筛选结果
32 给药记录表(1周1次)
33 PK采血
34 细胞因子血样采集
35 细胞因子尿样采集
36 药物脱落情况尿样采集
37 不良事件问询
38 不良事件
39 既往或合并用药问询
40 既往或合并用药
41 既往或合并非药物治疗问询
;
ods noresults;
ods _all_ close;
ods excel file="&out.\want.xlsx" ;
ods excel options(sheet_name="目录" Embedded_Titles='yes' ) ;
title "医学监查数据审核表";
proc report data=目录 split='*' nowd;
compute 目录;
call define(_col_,'url',cats('#''',目录,'''!A1') );
call define(_col_,'style','style={textdecoration=underline color=blue}');
endcomp;
run;
%type_one(idx=1,dsn=PRIMARY,title=标题)
%type_one(idx=2,dsn=DS,title=签署知情同意书)
%type_two(idx=3,dsn=SV1,key=folder_name,title=访视日期)
%type_one(idx=4,dsn=DM,title=人口学资料)
%type_one(idx=5,dsn=MH1CO ,title=非膀胱癌既往病史-手术史问询)
%type_two(idx=6,dsn=MH1,key=spid,title=非膀胱癌既往病史-手术史)
%type_one(idx=7,dsn=MH2,title=过敏史)
%type_one(idx=8,dsn=SU,title=一般问询)
%type_one(idx=9,dsn=MH3,title=膀胱癌病史)
%type_one(idx=10,dsn=MH4CO,title=膀胱癌既往治疗史问询)
%type_two(idx=11,dsn=MH4,key=spid,title=膀胱癌既往治疗史)
%type_three(idx=12,dsn=PE,key=folder_name,key2=OID,title=体格检查)
%type_two(idx=13,dsn=VSCO,key=folder_name ,title=生命体征问询)
%type_three(idx=14,dsn=VS,key=spid,key2=OID,title=生命体征)
%type_two(idx=15,dsn=QS,key=folder_name,title=ECOG评分)
%type_two(idx=16,dsn=FA1,key=folder_name,title= 胸部X线片)
%type_two(idx=17,dsn=FA2,key=folder_name,title=超声心动图)
%type_two(idx=18,dsn=EG,key=folder_name,title=十二导联心电图)
%type_two(idx=19,dsn=TST,key=folder_name,title=C-TST试验)
%type_three(idx=20,dsn=LBHCG,key=folder_name,key2=OID ,title=血妊娠试验)
%type_three(idx=21,dsn=LBBR,key=folder_name,key2=OID,title=血常规)
%type_three(idx=22,dsn=LBUR,key=folder_name,key2=OID,title=尿常规)
%type_three(idx=23,dsn=LBBB,key=folder_name,key2=OID,title=血生化)
%type_three(idx=24,dsn=LBSCR,key=folder_name,key2=OID,title=肌酐)
%type_three(idx=25,dsn=LBCG,key=folder_name,key2=OID,title=凝血功能)
%type_two(idx=26,dsn=IFD,key=folder_name,title=传染病筛查)
%type_three(idx=27,dsn=FA3,key=folder_name,key2=OID,title=泌尿系统彩超)
%type_two(idx=28,dsn=FA4,key=folder_name,title=影像学检查)
%type_two(idx=29,dsn=IECO,key=folder_name,title=入选排除标准问询)
%type_two(idx=30,dsn=IE,key=spid,title=入选排除标准)
%type_one(idx=31,dsn=DS1 ,title=筛选结果)
%type_two(idx=32,dsn=EX1,key=OID,title=给药记录表(1周1次))
%type_two(idx=33,dsn=PC,key=OID ,title=PK采血)
%type_two(idx=34,dsn=CB,key=OID,title=细胞因子血样采集)
%type_two(idx=35,dsn=CU,key=OID,title=细胞因子尿样采集)
%type_two(idx=36,dsn=MU,key=OID,title=药物脱落情况尿样采集)
%type_one(idx=37,dsn=AECO,title=不良事件问询)
%type_two(idx=38,dsn=AE,key=spid,title=不良事件)
%type_one(idx=39,dsn=CMCO,title=既往或合并用药问询)
%type_two(idx=40,dsn=CM,key=spid,title=既往或合并用药)
%type_one(idx=41,dsn=CM1CO,title=既往或合并非药物治疗问询)
ods excel close;
@sajjad1122 I really like Python but I can't see the point to use it for comparing SAS datasets given there is Proc Compare which does the job. And unlike with Python Proc Compare will compare the actual original SAS data sets and not data that first got converted to data frames.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.