say I have a vector of observations which I wanna test against some distribution, say Poisson. I see there is a chisq.test(x, p) function, but it always uses df = k -1, where k is number of bins. Obviously I don't know the distribution parameters and estimate them using data, so losing df in the process. There isn't a df argument in the function. Hence the problem in using chisq.test is not able to properly tell it how many estimates are there in the p, so the chi-square uses wrong df. data = c(0, 0, 0, 1, 0, 1, 2, 2); lambda = mean(data); #0.75 bins = c(0, 1, 2, 3); #bins for grouping data x = c(4, 2, 2, 0); #number of observations for bins p = dpois(bins, lambda); chisq.test(x, p=p, rescale.p=TRUE); #the df should be number of bins - 1 - number of estimates, so 2, but R always uses df = k-1?
Code: #purpose: to test if some observed data can be said to come from a common distribution #signature: chisq_test_distribution(observed, distribution, bins) #arguments: observed is the observed data in vector. distribution is the octave distribution function name in string(poiss for Poisson, norm for normal etc). bins is a vector for grouping observations, use cell array {} when a bin is a range of values. #descriptions: the test is based on "sum over bins i (Observed i - Expected i).^2 / Expected i ~ chisquare(bins number-1-df lost through estimation) " #assumptions: the observed data are i.i.d. #output: chi-square, p-value, degree of freedom in octave structure format (equivalent to dictionary) #reference: Sheldon Ross's Introduction to Statistics for Engineers chapter 11 function output= chisq_test_distribution(observed, distribution, bins) #only below distributions are accepted for now, will add gradually distribution_supported = {"norm", "poiss"}; #checking arguments entered correctly and converting formats assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix"); assert(ismatrix(bins) && rows(bins) == 1, "bins should be a 1xn matrix or cell array"); assert(ischar(distribution), "distribution name should be entered as string"); assert(any(strcmpi(distribution, distribution_supported)),"distribution entered isn't recognized or not yet supported"); if !iscell(bins) #if bins is a vector, convert it to a cell format bins = num2cell(bins); #it could be a vector, just this function is coded to loop a cell endif #set some variables based on distribution entered, for use in calling its probability density function switch (distribution) case "norm" pdf = "normpdf"; mu = mean(observed); var = var(observed, 0); #compute variance with (n-1) in denominator pdf_args = [mu, var]; case "poiss" pdf = "poisspdf"; lambda = mean(observed); pdf_args = [lambda]; otherwise disp("something is fucked up"); return; endswitch #variables declaration k = length(bins); #number of bin groups pdf_args_number = length(pdf_args); #number of parameter estimated df = k - 1 - pdf_args_number; #chi-square degree of freedom n = length(observed); #number of observations probability = []; observations = []; #looping through bins to count realised observations in each and to calculate probability under null_hypothesis in each for i=1:k inner_bins = bins{1, i}; #get this bin entry inner_k = length(inner_bins); #number of values in this bin, more than 1 if a range inner_probability = []; inner_observations = []; for j=1:inner_k switch (pdf_args_number) #pdf function call differed by number of arguments to pass case 1 inner_probability(1,j) = feval(pdf, inner_bins(1, j), pdf_args(1)); case 2 inner_probability(1,j) = feval(pdf, inner_bins(1, j), pdf_args(1), pdf_args(2)); case 3 inner_probability(1,j) = feval(pdf, inner_bins(1, j), pdf_args(1), pdf_args(2), pdf_args(3)); case 4 inner_probability(1,j) = feval(pdf, inner_bins(1, j), pdf_args(1), pdf_args(2), pdf_args(3), pdf_args(4)); otherwise disp("Not aware of any distribution having more than 4 parameters."); return; endswitch inner_observations(1, j) = histc(observed, inner_bins(1, j)); #count how many observations fall into this bin endfor probability(1, i) = sum(inner_probability, 2); #assign over the sum of all inner probabilities observations(1, i) = sum(inner_observations, 2); #assign over the sum of all inner observation counts endfor #calculate the test statistics and return results expected = n .* probability; #expected observations if this distribution hypothesis is true chisq = sum((observations - expected).^2 ./ expected, 2); #Pearson chi-square statistics pvalue = 1 - chi2cdf(chisq, df); output = struct("actual", observations, "expected", expected, "chisq", chisq, "pvalue", pvalue, "df", df); return; endfunction