About Junyong

Junyong · ‎09-10-2019

Thanks, but I'm not picking the observations above the 10th percentile. Instead, I'm picking the largest observations that together make up 90% of the total sum. Here's what you recommended.

    data _1;
    input size @@;
    i=1;
    cards;
    1 2 3 4 5 6 7 8 9 10 395
    ;
    run;
    proc summary nway;
    class i;
    var size;
    output out=_2(drop=_:) p10=/autoname;
    run;
    data _1;
    merge _1 _2;
    by i;
    if size>=size_p10;
    run;

There are 11 observations. PROC SUMMARY in your code picks "2" as the 10th percentile. The code then removes the first observation because its SIZE is lower than the 10th percentile. The resulting subset includes the 10 observations from 395 down to 2. This is different from what I'm asking. In this case, the sum of all SIZEs is 450, and 90% of this sum is 405. That is, I can cover 90% of the total SIZE by including only the two largest observations (395 and 10); the remaining observations are negligible because they account for only the bottom 10% of the total. The resulting subset must include exactly those 2 observations, 395 and 10.
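The rule described here can be sketched as follows. This is only a minimal illustration on the same 11 values; the names SAMPLE, SORTED, TOTAL, RUNNING, and WANT are made up for the sketch. An observation is kept whenever the running sum of the larger observations before it is still below 90% of the total.

```sas
/* Sample values from the post */
data sample;
  input size @@;
  cards;
1 2 3 4 5 6 7 8 9 10 395
;
run;

/* Largest observation first */
proc sort data=sample out=sorted;
  by descending size;
run;

/* Total of SIZE in a macro variable (hypothetical name TOTAL) */
proc sql noprint;
  select sum(size) into :total trimmed from sorted;
quit;

data want(drop=running);
  set sorted;
  /* RUNNING holds the sum of the sizes read before this row */
  if running < 0.9*&total then output;
  running + size;
run;
```

With these values the total is 450 and the hurdle is 405, so the step should keep 395 and 10 and stop at 9, because the running sum has already reached 405 by then.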

Junyong · ‎09-10-2019

I'm a bit unsure if the title describes the problem correctly, but I have the following data set.

    data have;
    input year firm size indep depen;
    cards;
    1 1 4907.98 514.14 515.08
    1 2 5045.94 509.85 509.03
    1 3 4819.58 505.49 506.34
    1 4 4967.02 482.39 482.02
    1 5 5094.18 506.08 505.62
    1 6 5040.42 501.66 499.79
    1 7 4987.62 491.48 489.79
    2 1 5086.52 512.95 510.39
    2 2 4895.88 503.15 503.28
    2 3 5002.13 494.76 495.36
    3 1 5078.44 506.06 508.51
    3 2 4980.23 520.75 521.21
    3 3 4948.08 484.90 485.79
    3 4 5050.82 497.25 497.74
    3 5 4964.70 498.54 497.97
    4 1 4952.31 517.95 517.92
    4 2 5027.91 501.90 502.15
    4 3 4979.13 512.67 512.07
    4 4 5005.56 499.99 501.16
    4 5 4905.29 517.62 517.56
    4 6 4848.28 479.76 481.32
    5 1 1969.43 501.68 502.21
    5 2 1931.39 497.89 497.67
    5 3 9935.65 505.15 504.44
    5 4 9995.49 510.52 511.26
    5 5 9008.28 525.02 524.49
    6 1 9981.95 518.30 517.78
    6 2 15.49 506.13 506.81
    7 1 5095.48 517.15 516.85
    ;
    run;

Here YEAR, FIRM, SIZE, INDEP, and DEPEN are the time index, the individual index, the market capitalization, the independent variable, and the dependent variable, respectively. Instead of using all the observations in the regression, I need to pick only the top 90% in terms of SIZE each year, so I'm currently making the subset as follows.

    proc sort;
    by year descending size;
    run;
    data have;
    set have;
    by year;
    if first.year then accumulate=.;
    accumulate+size;
    run;
    proc sql;
    create table have as
    select *, 0.9*max(accumulate) as hurdle, accumulate<=calculated hurdle as pick
    from have
    group by year
    order by year,size desc;
    quit;
    proc reg;
    where pick=1;
    model depen=indep;
    run;

In other words, I (1) sort by descending SIZE within each year, (2) ACCUMULATE the SIZE within each year, (3) set 90% of the yearly sum as HURDLE, (4) PICK an observation if ACCUMULATE up to that observation doesn't exceed HURDLE, and (5) regress DEPEN on INDEP using the observations where PICK=1. Here's the outcome.

1. Is there a simpler or more idiomatic approach than this? I can follow these steps, but I still wonder whether SAS provides a PROC designed for this kind of subsetting.

2. I also want to include the observation that "touches" the HURDLE, because (a) this method picks no observation in Year 6 since the first observation already exceeds the HURDLE, and (b) it picks no observation in Year 7 since there's only one observation. According to the result, I want to pick (i) the last observations in Years 1, 2, 3, and 4 as they touch the HURDLEs, (ii) the second-to-last observation in Year 5 for the same reason, (iii) the first observation in Year 6, which alone exceeds the HURDLE, and (iv) the only observation in Year 7.

I'd much appreciate any comment from your experience.
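One possible way to also pick the observation that touches the HURDLE is to test the running sum before the current observation instead of the running sum including it. The following is only a sketch under that assumption: it expects the HAVE data set from this post, and SORTED, RUNNING, and WANT are names invented for the illustration.

```sas
/* Assumes HAVE from the post; SORTED, RUNNING and WANT are made-up names */
proc sort data=have out=sorted;
  by year descending size;
run;

/* Running sum of SIZE within each year */
data running;
  set sorted;
  by year;
  if first.year then accumulate=0;
  accumulate+size;
run;

proc sql;
  create table want as
  select *,
         0.9*max(accumulate) as hurdle,
         /* sum of the larger observations BEFORE this one is below HURDLE */
         (accumulate-size < calculated hurdle) as pick
  from running
  group by year
  order by year, size desc;
quit;
```

Because ACCUMULATE-SIZE is 0 for the largest observation of each year, the first observation is always picked, which covers the Year 6 and Year 7 cases. In Year 5, the fourth-largest observation (1969.43) is picked because the sum before it, 28939.42, is still below the hurdle of about 29556.22.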

Junyong · ‎09-07-2019

The following code creates a variable m5 with one observation consisting of five Ms.

    data m5;
    m5="MMMMM";
    run;

SAS formats the variable as $5. since it contains five characters, but I cannot see the last M when I open the data set. How should I choose the format so that all the characters are displayed on the screen? In this case, by the way, it seems $6. displays the last M correctly. Thanks.

Junyong · ‎09-06-2019

I have the following data and want to create macro variables using each observation.

    data have;
    input variable $ abbreviation $;
    cards;
    apple AP
    orange OR
    ;
    run;

The _NULL_ and SYMPUTX combination can pass each observation without leading and trailing blanks as follows.

    data _null_;
    set have(obs=1);
    call symputx("variable",variable);
    call symputx("abbreviation",abbreviation);
    run;
    %put &variable.&abbreviation.;

And the output is

    1 %put &variable.&abbreviation.;
    appleAP

I tried something similar in SQL with STRIP, but the blanks were still there.

    proc sql noprint;
    select strip(variable),strip(abbreviation) into :variable,:abbreviation
    from have(firstobs=1);
    quit;
    %put &variable.&abbreviation.;

but

    1 %put &variable.&abbreviation.;
    apple AP

It seems SQL respects the length of each variable and pads the values accordingly. I wonder whether there's something similar to SYMPUTX in SQL, as the length of VARIABLE varies observation by observation; for instance, I cannot add LENGTH=5 after STRIP(VARIABLE). Many thanks.
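A minimal sketch of one option: the INTO clause of PROC SQL accepts the TRIMMED keyword (SAS 9.3 and later), which removes leading and trailing blanks from the stored value. The sketch re-creates the HAVE data set from this post so that it runs on its own.

```sas
data have;
  input variable $ abbreviation $;
  cards;
apple AP
orange OR
;
run;

proc sql noprint;
  /* TRIMMED strips leading and trailing blanks from each macro variable */
  select variable, abbreviation
    into :variable trimmed, :abbreviation trimmed
    from have(obs=1);
quit;

%put &variable.&abbreviation.;
```

With TRIMMED in place, the %PUT line should print appleAP, matching the SYMPUTX result. Writing SEPARATED BY ' ' after the macro variable name is another commonly used way to get trimmed values.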

Junyong · ‎08-21-2019

Much appreciated. This is what I came up with.

    %macro repeat;
    %do pagenumber=1 %to 5;
    filename tempfile temp;
    proc http method="get" out=tempfile
    url="https://www.walmart.com/search/?cat_id=0%str(&)page=&pagenumber.%str(&)ps=40%str(&)query=chobani#searchProductResult";
    run;
    data output&pagenumber.;
    infile tempfile length=length lrecl=32767;
    input line $varying32767. length;
    run;
    %end;
    %mend;
    %repeat;

Thanks.

Junyong · ‎08-21-2019

Thanks. I used something similar with %LET and %NRSTR, but wondered whether one can shorten this by 'https://www.walmart.com/search/?cat_id=0&page='||"&pagenumber"||'&ps=40&query=chobani#searchProductResult' or something like \&.

Junyong · ‎08-21-2019

When there are multiple ampersands in one double-quoted string, can one resolve only some of them and leave the others unresolved? For example,

    %macro repeat;
    %do pagenumber=1 %to 5;
    filename tempfile temp;
    proc http method="get" out=tempfile
    url="https://www.walmart.com/search/?cat_id=0&page=&pagenumber.&ps=40&query=chobani#searchProductResult";
    run;
    data output&pagenumber.;
    infile tempfile length=length lrecl=32767;
    input line $varying32767. length;
    run;
    %end;
    %mend;
    %repeat;

Since PROC HTTP requires a string for URL, I put a double-quoted string, in which SAS tries to resolve every ampersand. If there is a macro variable such as QUERY, then SAS resolves it because the string contains &QUERY. I need to resolve only PAGENUMBER since it is the loop index, and I don't want to resolve anything else. How can I distinguish those different ampersands? I tried to combine a single-quoted string and a double-quoted one with ||, but it didn't work well. Thanks.
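One way to keep some ampersands literal is to mask them with %NRSTR while leaving &PAGENUMBER outside the masked text. The following is only a sketch: SHOW_URLS and URL are made-up names, and the loop merely writes the assembled text to the log instead of calling PROC HTTP.

```sas
%macro show_urls;
  %do pagenumber=1 %to 2;
    /* %NRSTR masks & so that &page, &ps and &query are not treated as macro references */
    %let url=%nrstr(https://www.walmart.com/search/?cat_id=0&page=)&pagenumber.%nrstr(&ps=40&query=chobani#searchProductResult);
    %put &url;
  %end;
%mend;
%show_urls;
```

The masking stays attached to the value of URL, so in principle the same macro variable can be placed inside a double-quoted string such as url="&url" without the masked ampersands being resolved again.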

Junyong · ‎08-07-2019

Thanks for the considerate details. I didn't know whether WHERE uses SQL to subset.

Junyong · ‎08-07-2019

Is it impossible to use SUM OF in WHERE? For example, suppose there are six dummy variables as follows.

    data have;
    do i=1 to 30;
    a1=ranbin(1,1,0.5);
    a2=ranbin(1,1,0.5);
    a3=ranbin(1,1,0.5);
    b1=ranbin(1,1,0.5);
    b2=ranbin(1,1,0.5);
    b3=ranbin(1,1,0.5);
    output;
    end;
    run;

The following successfully subsets by IF with SUM OF.

    data usual;
    set have;
    if sum(of a: b:)=3;
    run;

However, it seems WHERE does not allow SUM OF, although it allows SUM itself.

    data want;
    *set have(where=(sum(of a: b:)=3));
    set have;
    *where sum(of a: b:)=3;
    *where sum(of a1--b3)=3;
    where sum(a1,a2,a3,b1,b2,b3)=3;
    run;

Do I always have to list all the variables to use WHERE? There are too many variables to specify by hand. Thanks.
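If the variables have to be listed in WHERE, the list itself can be generated rather than typed. A sketch, assuming the variables live in WORK.HAVE and start with A or B; VARLIST is a made-up macro variable name, and the data step is a compact re-creation of the example data.

```sas
/* Compact re-creation of the example data */
data have;
  array v(*) a1-a3 b1-b3;
  do i=1 to 30;
    do j=1 to dim(v);
      v(j)=ranbin(1,1,0.5);
    end;
    output;
  end;
  drop j;
run;

/* Build a comma-separated list of the A: and B: variables */
proc sql noprint;
  select name into :varlist separated by ','
  from dictionary.columns
  where libname='WORK' and memname='HAVE'
    and (upcase(name) like 'A%' or upcase(name) like 'B%');
quit;

data want;
  set have;
  where sum(&varlist)=3;
run;
```

Here &VARLIST expands to a1,a2,a3,b1,b2,b3, so the WHERE statement is the same as the hand-written one in the post.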

Junyong · ‎07-03-2019

I misunderstood the function of BW and BWM so far. Thanks for the notice.

Junyong · ‎07-03-2019

The following HAVE contains X and Y, which are normally distributed.

    data have;
    do i=1 to 5000;
    x=rannor(1);
    y=rannor(1);
    output;
    end;
    run;

KDE estimates the kernel densities.

    ods listing gpath="!userprofile\desktop\";
    ods graphics on;
    proc kde;
    univar x(bwm=0.05) y(bwm=0.05)/plots=(density densityoverlay);
    run;
    ods graphics off;

The code produces the following three plots: (1) the kernel density of X, (2) that of Y, and (3) the overlay of the two. The problem is the third one, which does not match the first two. It works well without the BWMs above, but in practice I need to use them. What is the problem here? Thanks a lot.

Junyong · ‎06-28-2019

Thanks for the helpful post, but I have one additional question about that. I just tried the following.

    proc iml;
    i=1;
    submit i;
    %put NOTE: &i;
    endsubmit;
    quit;

Sadly, the %PUT inside prints nothing because of the colon right after NOTE. I want to use NOTE: if possible as it highlights the line in the log. Do you have any suggestion in this respect?

Junyong · ‎06-28-2019

I often monitor macro iterations as follows.

    %macro repeat;
    %do i=1 %to 100;
    %if %sysfunc(mod(&i.,10))=0 %then %put NOTE: &i.th iteration now;
    %end;
    %mend;
    %repeat;

I can do the same thing in a DATA step as follows.

    data _null_;
    do i=1 to 100;
    if mod(i,10)=0 then put i;
    end;
    run;

I use this code as something similar to a loading bar: it reports at the 10th, 20th iterations, etc. I tried to do the same thing in IML but failed because PUT in IML works differently.

    proc iml;
    do i=1 to 100;
    if mod(i,10)=0 then put i;
    end;
    quit;

The log is here.

    1 proc iml;
    NOTE: IML Ready
    2 do i=1 to 100;
    3 if mod(i,10)=0 then put i;
    4 end;
    ERROR: No current file to write to.
    statement : PUT at line 3 column 21
    5 quit;
    NOTE: Exiting IML.
    NOTE: The SAS System stopped processing this step because of errors.
    NOTE: PROCEDURE IML used (Total process time):
    real time 0.01 seconds
    cpu time 0.01 seconds

Second, I tried CALL SYMPUTX and %PUT instead, but that failed as well.

    proc iml;
    do i=1 to 100;
    if mod(i,10)=0 then do;
    call symputx("i",i);
    %put &i.;
    end;
    end;
    quit;

The corresponding log is here.

    1 proc iml;
    NOTE: IML Ready
    2 do i=1 to 100;
    3 if mod(i,10)=0 then do;
    4 call symputx("i",i);
    5 %put &i.;
    100
    6 end;
    7 end;
    8 quit;
    NOTE: Exiting IML.
    NOTE: PROCEDURE IML used (Total process time):
    real time 0.01 seconds
    cpu time 0.01 seconds

Is there a reasonable alternative in IML? Much appreciated again.
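The first error message says that PUT has no current file, so one option that may work is to open the log as the current output file with the IML FILE statement before the loop. A minimal sketch under that assumption:

```sas
proc iml;
  file log;                     /* make the SAS log the current output file */
  do i=1 to 100;
    /* write a progress line at every 10th iteration */
    if mod(i,10)=0 then put "Iteration" i;
  end;
quit;
```

As for the second attempt, %PUT is a macro statement, so it is executed once while the code is being read and not on each pass of the IML loop, which is why the log shows a single value there.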

Junyong · ‎06-25-2019

Can I insert a new variable in a specific position? For example, the following working data set is ordered already and contains some numbers.

    data _(drop=i);
    array x(*) us uk france germany poland spain italy ireland sweden denmark norway finland china japan korea iran iraq turkey india brazil singapore mexico;
    do i=1 to dim(x);
    x(i)=rannor(1);
    end;
    run;

If I use the following code, then the new variable greece will be located at the end of the data set.

    data _;
    set _;
    greece=rannor(1);
    run;

I wonder whether the position can be changed, for example to sit between the existing variables italy and ireland. Thanks in advance.
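A commonly used trick is to name the leading variables, followed by the new one, in a RETAIN statement placed before SET, because variables are ordered by their first appearance in the step. The sketch below uses a shortened, made-up data set COUNTRIES with four of the variables to show the idea.

```sas
/* Shortened stand-in for the data set in the post */
data countries;
  spain=1; italy=2; ireland=3; sweden=4;
run;

data want;
  /* Variables listed here come first, in this order; the rest follow as in COUNTRIES */
  retain spain italy greece;
  set countries;
  greece=rannor(1);
run;
```

The variables in WANT should then be ordered spain, italy, greece, ireland, sweden. For the full data set, the RETAIN list would name every variable from us through italy before greece.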

Junyong · ‎06-25-2019

Rick, I understand the difference between the sampling distribution and the data distribution, but sometimes one simulates the sampling distribution and ends up with data that contain the simulated sampling distribution. For example, one can check whether the sampling distribution of the sample variance from normal data is the chi-squared distribution (https://newonlinecourses.science.psu.edu/stat414/node/174/). According to the document, the sampling distribution of the properly scaled sample variance is the chi-squared distribution. In this case, we can simulate the sample variance 5,000 times, for example.

    data sample;
    do s=1 to 5000;
    do i=1 to 100;
    x=rannor(1);
    output;
    end;
    end;
    run;
    proc means noprint;
    var x;
    by s;
    output out=stat var=s2;
    run;
    data stat;
    set stat;
    s2_=99*s2/1;
    run;

In the code, x is the normal variable (100 observations for each of 5,000 simulations), s2 is the sample variance, and s2_ is the scaled one. To see whether s2_ follows the chi-squared distribution, one can use UNIVARIATE because it offers the more general gamma distribution instead, as you mentioned in your post. The chi-squared distribution with 99 degrees of freedom is the gamma distribution with threshold 0, scale 2, and shape 99/2.

    proc univariate noprint;
    var s2_;
    histogram/gamma(theta=0,sigma=2,alpha=49.5);
    run;

In this working code, as expected, none of the Kolmogorov–Smirnov, Cramer–von Mises, and Anderson–Darling tests rejects the null. If GAMMA is unavailable, then what I found instead is an indirect two-sample test using NPAR1WAY: generate some random numbers from the target distribution, and test whether the simulated distribution above and the newly generated distribution are reasonably close.

    data simulate;
    _type_=1;
    do s=1 to 5000;
    s2_=rand("chisq",99);
    output;
    end;
    run;
    proc append base=stat;
    run;

Then, the stat data set contains 5,000 simulated sample variances as _type_=0 and 5,000 random numbers from the chi-squared distribution as _type_=1.

    proc npar1way edf;
    var s2_;
    class _type_;
    run;

As expected again, neither the Kolmogorov–Smirnov nor the Kuiper test rejects the null hypothesis. As mentioned above, I hope that someday UNIVARIATE supports other distributions as well, since data often come from distributions such as the chi-squared or F distributions.


How to Capture Part of Log as Macro Variable?

How to Escape Line Break in Long Code Line?

Re: How to Prevent Resolution of Ampersand?

How to Prevent Resolution of Ampersand?

How to Italicize Just One Word in FOOTNOTE?

Re: Applying Arrow Tips to SGPLOT Lines and Axes

Applying Arrow Tips to SGPLOT Lines and Axes

Displaying Values for Histograms

SGPLOT VBAR XAXIS Label Interval?

Reading Tab-Delimited Data with Spaces

Re: In VIEWTABLE, How Can I Directly Go to Certain Observation?

In VIEWTABLE, How Can I Directly Go to Certain Observation?

DO Loop and INFILE FILEVAR Together

Re: How to Download a Folder from SAS OnDemand?

Re: Skipping Invalid Lines

Re: How to Subset with Only First Top 90% by Each Group?

How to Subset with Only First Top 90% by Each Group?

Length and Format of Character Variable

SYMPUTX in SQL—How to Delete the Trailing Blanks

Re: Resolving Ampersands Partially?

Re: Resolving Ampersands Partially?

Resolving Ampersands Partially?

Re: SUM OF in WHERE

SUM OF in WHERE

Re: DENSITYOVERLAY in KDE Works Incorrectly

DENSITYOVERLAY in KDE Works Incorrectly

Re: How to Monitor IML Iterations Using Log?

How to Monitor IML Iterations Using Log?

Pinpointing the Location of New Variable?

Re: Kolmogorov–Smirnov Tests for Various Distributions?