;+
; NAME:
;    gauss_fit
;
; PURPOSE:
;    This function estimates the parameters of the Gaussian distribution given 
;    a set of data.
;
; CATEGORY:
;    Statistics
;
; CALLING SEQUENCE:
;    result = gauss_fit( data_vec )
;
; INPUTS:
;    DATA_VEC:  A required float vector of length N_DATA containing the data to 
;        which to fit the Gaussian distribution.
;    ANALYTIC, BOOTSTRAP_SEED, CHARSIZE, CI_COLOR, COLOR, COVARIATE_DATA, 
;      COVARIATE_PARAM_ID, FONT, N_BOOTSTRAP, P_VALUE, PLOT_PROBABILITY, 
;      PLOT_QUANTILE, THICK, TITLE, XTHICK, XTITLE, YTHICK, YTITLE
;
; KEYWORD PARAMETERS:
;    ANALYTIC:  If set, then the parameter values are calculated analytically 
;        as the mean and standard deviation of the input data.  This is not 
;        possible if covariates are included.  The default is to use the 
;        maximum likelihood method.
;    BOOTSTRAP_CDF:  If CDF is set and N_BOOTSTRAP is input, then this returns 
;        an N_DATA,N_BOOTSTRAP float array containing the cumulative 
;        distribution function values for the bootstrap Gaussian distributions 
;        at the DATA_VEC locations.
;    BOOTSTRAP_SEED:  An optional scalar integer containing the seed for the 
;        random number generator used for the bootstrap sampling.  This is 
;        useful for reproducibility.
;    CHARSIZE:  The optional CHARSIZE keyword parameter for the plot function.
;    CDF:  If set, then this returns an N_DATA float vector containing the 
;        cumulative distribution function values of the Gaussian distribution 
;        at the DATA_VEC locations.
;    CI_COLOR:  An optional scalar integer specifying the color index for 
;        plotting confidence intervals.
;    COLOR:  The optional COLOR keyword parameter for the plot function.
;    COVARIATE_DATA:  An optional float array containing values of covariate 
;        functions at the location of each of the N_DATA input values.  Of size 
;        N_DATA,N_COVARIATE, where N_COVARIATE is the number of covariates.  
;        If input then COVARIATE_PARAM_ID must also be input.
;    COVARIATE_PARAM_ID:  An optional integer array of length N_COVARIATE 
;        specifying the Gauss parameter incorporating each of the N_COVARIATE 
;        covariates in COVARIATE_DATA.  0 specifies the location parameter 
;        (mean) and 1 the scale parameter (standard deviation).  Required if 
;        COVARIATE_DATA is input.
;    FONT:  The optional FONT keyword parameter for the plot function.
;    N_BOOTSTRAP:  An optional scalar integer defining the number of bootstrap 
;        samples to use in estimating confidence intervals on the parameters 
;        using a bootstrap approach.
;    P_VALUE:  An optional scalar float containing the p-value for any 
;        confidence interval estimates.  The default is 0.10.
;    PLOT_PROBABILITY:  If set then the function plots a probability plot.
;    PLOT_QUANTILE:  If set then the function plots a quantile-quantile plot.  
;        This is not possible if there are covariates defined.
;    PARAMS_CI:  Returns a 2,(2+N_COVARIATE) float array containing the 
;        estimated 1-P_VALUE confidence intervals on the Gaussian model 
;        parameters.  The first dimension returns the lower and upper bounds 
;        respectively, while the order of parameters in the second dimension 
;        corresponds to the order in RESULT.  If N_BOOTSTRAP is input, these 
;        confidence intervals are estimated using a bootstrap approach;  no 
;        other method is implemented yet.
;    THICK:  The optional THICK keyword parameter for the plot function.
;    [X,Y]THICK:  The optional XTHICK and YTHICK keyword parameters for the 
;        plot function.
;    TITLE:  The optional TITLE keyword parameter for the plot function.
;    [X,Y]TITLE:  The optional XTITLE and YTITLE keyword parameters for the 
;        plot function.
;
; OUTPUTS:
;    RESULT:  A float vector containing the estimated values of the Gaussian model 
;        parameters.  The first element contains the value for the location 
;        parameter (mean), the second element contains the value for the scale 
;        parameter (standard deviation), and any further elements contain the 
;        values for the regression parameters on the N_COVARIATE covariate 
;        functions.
;    BOOTSTRAP_CDF, CDF, PARAMS_CI
;
; USES:
;    quantile_threshold.pro
;    shuffle.pro
;    gauss_fit_eqn.pro (included in this file)
;
; PROCEDURE:
;    This function estimates the parameters of the Gaussian model using 
;    analytic or maximum likelihood methods.  If covariates are included, the 
;    total log likelihood function is the sum of the log likelihood function 
;    at each of the specific N_DATA locations.  The negative log likelihood 
;    function is contained within the gauss_fit_eqn.pro subfunction.
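;    A note on the bootstrap confidence intervals, as read from the code in 
;    this file (a summary, not an independent derivation, and assuming 
;    quantile_threshold.pro returns the quantiles in the order requested):  
;    for each parameter, with EST the estimate from the full data and Q_LO 
;    and Q_HI the P_VALUE/2 and 1-P_VALUE/2 quantiles of the bootstrap 
;    estimates, 
;      lower bound = EST - ( Q_HI - EST ) = 2 * EST - Q_HI
;      upper bound = EST + ( EST - Q_LO ) = 2 * EST - Q_LO
;    That is, the bootstrap quantiles are reflected about the estimate rather 
;    than used directly as the interval.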
;
; EXAMPLE:
;    ; Generate 1000 data values following a Gaussian distribution with mean 
;    ; 1.5 modified by a linear trend of slope 0.4, and standard deviation 2.0.
;    seed = 1
;    n_data = 1000l
;    covariate_data = findgen( n_data ) / ( n_data - 1. ) * 2. - 1.
;    data_vec = 1.5 + 2. * randomn( seed, n_data ) + 0.4 * covariate_data
;    ; Calculate the parameter values, including uncertainties, and plot the 
;    ; probability plot.
;    result = gauss_fit( data_vec, params_ci=params_ci, n_bootstrap=1000, $
;        covariate_data=covariate_data, covariate_param_id=0, $
;        plot_probability=1 )
;    for i = 0, 2 do print, result[i], params_ci[*,i]
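;
;    ; A further minimal sketch (an illustration added to this header, not 
;    ; part of the original example):  fit data without covariates using the 
;    ; ANALYTIC option and return the CDF values.  The variable names 
;    ; data_simple and cdf_simple are illustrative only.  CDF must be set on 
;    ; input for the CDF values to be returned.
;    seed = 1
;    data_simple = 1.5 + 2. * randomn( seed, 1000 )
;    cdf_simple = 1
;    result = gauss_fit( data_simple, analytic=1, cdf=cdf_simple )
;    ; RESULT holds the mean and standard deviation estimates
;    print, result
;    ; The CDF values should lie between 0 and 1
;    print, min( cdf_simple ), max( cdf_simple )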
;
; MODIFICATION HISTORY:
;    Written by:  Daithi A. Stone (dastone@runbox.com), 2020-01-17
;-

;***********************************************************************

FUNCTION GAUSS_FIT, $
    DATA_VEC, $
    BOOTSTRAP_SEED=bootstrap_seed, $
    CHARSIZE=charsize, FONT=font, $
    COLOR=color, CI_COLOR=ci_color, $
    COVARIATE_DATA=covariate_data, COVARIATE_PARAM_ID=covariate_param_id, $
    N_BOOTSTRAP=n_bootstrap, $
    P_VALUE=p_value, $
    THICK=thick, XTHICK=xthick, YTHICK=ythick, $
    TITLE=title, XTITLE=xtitle, YTITLE=ytitle, $
    ANALYTIC=analytic_opt, $
    PLOT_PROBABILITY=plot_probability_opt, PLOT_QUANTILE=plot_quantile_opt, $
    BOOTSTRAP_CDF=bootstrap_cdf, $
    CDF=cdf, $
    PARAMS_CI=params_ci

;***********************************************************************
; Constants and options

; Create common block for sharing with gauss_fit_eqn.pro
common common_gauss_fit_eqn, gauss_data, gauss_covariate_data, $
    gauss_covariate_param_id

; The tolerance value for the amoeba function
amoeba_ftol = 1.0e-5

; The default p-value
if not( keyword_set( p_value ) ) then p_value = 0.1

; The number of input data
n_data = n_elements( data_vec )
if n_data le 3 then stop

; Check covariate inputs
n_covariate = n_elements( covariate_param_id )
if n_covariate gt 0 then begin
  ; Confirm everything is there
  if not( keyword_set( covariate_data ) ) then stop
  if n_elements( covariate_data[*,0] ) ne n_data then stop
  if n_elements( covariate_data[0,*] ) ne n_covariate then stop
  ; Calculate covariate ranges (for use in setting up amoeba solver)
  covariate_range = max( covariate_data, dimension=1 ) $
      - min( covariate_data, dimension=1 )
  ; Confirm covariates vary
  if min( covariate_range ) eq 0 then stop
endif

; The option to solve analytically
analytic_opt = keyword_set( analytic_opt )
if ( analytic_opt eq 1 ) and ( n_covariate gt 0 ) then stop

; Copy data required by minimisation function to common block
; (Fitting is unreliable without double precision)
gauss_data = double( data_vec )
if n_covariate gt 0 then begin
  gauss_covariate_data = double( covariate_data )
  gauss_covariate_param_id = covariate_param_id
; Otherwise clear unneeded variables
endif else begin
  if n_elements( gauss_covariate_data ) gt 0 then begin
    temp = temporary( gauss_covariate_data )
  endif
  if n_elements( gauss_covariate_param_id ) gt 0 then begin
    temp = temporary( gauss_covariate_param_id )
  endif
endelse

; Ensure the CDF is calculated if plotting is requested
if keyword_set( plot_probability_opt ) or keyword_set( plot_quantile_opt ) $
    then begin
  cdf = 1
endif

;***********************************************************************
; Estimate parameters

; The analytic solution
if analytic_opt eq 1 then begin
  ; Initialise output
  params = dblarr( 2 )
  ; Calculate the parameters (mean and standard deviation)
  params[0] = mean( gauss_data )
  params[1] = stddev( gauss_data )
endif

; The maximum likelihood solution
if analytic_opt eq 0 then begin
  ; Make a first guess of parameters (assuming no covariates)
  params_0 = dblarr( 2 )
  params_0[0] = mean( gauss_data )
  params_0[1] = stddev( gauss_data )
  ; Define scale for first iteration of solver's search
  params_scale = [ 1., 0.5 * params_0[1] ]
  ; Add any covariates to the first guess
  if n_covariate gt 0 then begin
    ; Assume zero scaling on covariates
    params_0 = [ params_0, fltarr( n_covariate ) ]
    ; Define scale for first iteration of solver's search
    params_scale = [ params_scale, fltarr( n_covariate ) ]
    for i_covariate = 0, n_covariate - 1 do begin
      ; Scale by the first guess of the parameter this covariate modifies
      params_scale[2+i_covariate] $
          = params_0[covariate_param_id[i_covariate]] * 0.5 $
          / covariate_range[i_covariate]
    endfor
  endif
  ; Fit parameters
  params = amoeba( amoeba_ftol, function_name='gauss_fit_eqn', p0=params_0, $
      scale=params_scale )
endif

; Estimate confidence intervals using bootstrap
if keyword_set( n_bootstrap ) then begin
  ; Initialise array containing bootstrap parameter estimates
  boot_params = fltarr( 2 + n_covariate, n_bootstrap )
  ; Iterate through bootstrap samples
  for i_boot = 0, n_bootstrap - 1 do begin
    ; Select random sample of data
    gauss_data = shuffle( double( data_vec ), seed=bootstrap_seed, replace=1, $
        index=boot_index )
    if n_covariate gt 0 then begin
      gauss_covariate_data = double( covariate_data[boot_index,*] )
    endif
    ; Fit parameters analytically
    if analytic_opt eq 1 then begin
      boot_params[0,i_boot] = mean( gauss_data )
      boot_params[1,i_boot] = stddev( gauss_data )
    ; Fit parameters using maximum likelihood
    endif else begin
      boot_params[*,i_boot] = amoeba( amoeba_ftol, $
          function_name='gauss_fit_eqn', p0=params_0, scale=params_scale )
    endelse
  endfor
  ; Determine confidence interval
  params_ci = fltarr( 2, 2 + n_covariate )
  for i_param = 0, 2 + n_covariate - 1 do begin
    temp = quantile_threshold( reform( boot_params[i_param,*] ), $
        [ p_value / 2., 1. - p_value / 2. ] )
    params_ci[1,i_param] = params[i_param] + ( params[i_param] - temp[0] )
    params_ci[0,i_param] = params[i_param] - ( temp[1] - params[i_param] )
  endfor
endif

; Calculate the CDF for the PARAMS parameter values
if keyword_set( cdf ) then begin
  ; Assemble parameters
  mi = params[0]
  sigma = params[1]
  if n_covariate gt 0 then begin
    id = where( covariate_param_id eq 0, n_id )
    if n_id gt 0 then mi = mi + params[2+id] ## covariate_data[*,id]
    id = where( covariate_param_id eq 1, n_id )
    if n_id gt 0 then sigma = sigma + params[2+id] ## covariate_data[*,id]
  endif
  ; Calculate CDF
  cdf = gauss_pdf( ( data_vec - mi ) / sigma )
  ; If we have bootstrap samples
  if keyword_set( n_bootstrap ) then begin
    ; Initialise bootstrap CDF array
    bootstrap_cdf = fltarr( n_data, n_bootstrap )
    ; Iterate through bootstrap samples
    for i_boot = 0, n_bootstrap - 1 do begin
      ; Assemble parameters
      mi = boot_params[0,i_boot]
      sigma = boot_params[1,i_boot]
      if n_covariate gt 0 then begin
        id = where( covariate_param_id eq 0, n_id )
        if n_id gt 0 then begin
          mi = mi + boot_params[2+id,i_boot] ## covariate_data[*,id]
        endif
        id = where( covariate_param_id eq 1, n_id )
        if n_id gt 0 then begin
          sigma = sigma + boot_params[2+id,i_boot] ## covariate_data[*,id]
        endif
      endif
      ; Calculate the CDF
      bootstrap_cdf[*,i_boot] = gauss_pdf( ( data_vec - mi ) / sigma )
    endfor
  endif
endif

;***********************************************************************
; Plot output

; Produce a probability plot
if keyword_set( plot_probability_opt ) then begin
  ; Default axis titles
  if not( keyword_set( xtitle ) ) then xtitle = 'Empirical'
  if not( keyword_set( ytitle ) ) then ytitle = 'Model'
  ; Determine the empirical quantiles
  quantiles = ( findgen( n_data ) + 0.5 ) / n_data
  ; Set up plotting window, including diagonal
  plot, [0,1], [0,1], isotropic=1, linestyle=1, xtitle=xtitle, ytitle=ytitle, $
      xthick=xthick, ythick=ythick, thick=thick, charsize=charsize, font=font, $
      title=title
  ; If we have bootstrap samples
  if keyword_set( n_bootstrap ) then begin
    ; Sort bootstrap samples
    for i_boot = 0, n_bootstrap - 1 do begin
      id = sort( bootstrap_cdf[*,i_boot] )
      bootstrap_cdf[*,i_boot] = bootstrap_cdf[id,i_boot]
    endfor
    ; Plot confidence range at each quantile
    for i_data = 0, n_data - 1 do begin
      temp = quantile_threshold( bootstrap_cdf[i_data,*], $
          [ p_value / 2., 1. - p_value / 2. ] )
      oplot, quantiles[i_data]+[0,0], temp, thick=thick, color=ci_color
    endfor
  endif
  ; Plot the points
  id = sort( cdf )
  oplot, quantiles, cdf[id], psym=4, thick=thick, color=color
endif

; Produce a quantile-quantile plot
if keyword_set( plot_quantile_opt ) then begin
  ; A quantile-quantile plot is not supported if there are covariates
  if n_covariate gt 0 then stop
  ; Default axis titles
  if not( keyword_set( xtitle ) ) then xtitle = 'Data'
  if not( keyword_set( ytitle ) ) then ytitle = 'Model'
  ; Determine the empirical locations for the quantiles
  id_sort = sort( data_vec )
  temp_data_vec = data_vec[id_sort]
  ; Determine the Gaussian model locations for the empirical quantiles
  quantiles = ( findgen( n_data ) + 0.5 ) / n_data
  inv_quantiles = quantiles
  for i_data = 0, n_data - 1 do begin
    inv_quantiles[i_data] = gauss_cvf( 1. - quantiles[i_data] ) * params[1] $
        + params[0]
  endfor
  ; Determine plotting range
  xyrange = [ min( [ inv_quantiles, data_vec], max=temp ), temp ]
  xyrange = ( xyrange - mean( xyrange ) ) * 1.05 + mean( xyrange )
  ; Set up plotting window
  plot, xyrange, xyrange, xstyle=1, ystyle=1, isotropic=1, linestyle=1, $
      xtitle=xtitle, ytitle=ytitle, xthick=xthick, ythick=ythick, thick=thick, $
      charsize=charsize, font=font, title=title
  ; If we have bootstrap samples
  if keyword_set( n_bootstrap ) then begin
    ; Calculate the bootstrap Gaussian model locations
    boot_inv_quantiles = fltarr( n_data, n_bootstrap )
    for i_boot = 0, n_bootstrap - 1 do begin
      for i_data = 0, n_data - 1 do begin
        boot_inv_quantiles[i_data,i_boot] $
            = gauss_cvf( 1. - quantiles[i_data] ) * boot_params[1,i_boot] $
            + boot_params[0,i_boot]
      endfor
    endfor
    ; Sort bootstrap samples
    for i_boot = 0, n_bootstrap - 1 do begin
      id = sort( boot_inv_quantiles[*,i_boot] )
      boot_inv_quantiles[*,i_boot] = boot_inv_quantiles[id,i_boot]
    endfor
    ; Plot confidence range at each quantile
    for i_data = 0, n_data - 1 do begin
      temp = quantile_threshold( boot_inv_quantiles[i_data,*], $
          [ p_value / 2., 1. - p_value / 2. ] )
      oplot, temp_data_vec[[i_data,i_data]], temp, thick=thick, $
          color=ci_color
    endfor
  endif
  ; Plot quantile-quantile plot
  oplot, temp_data_vec, inv_quantiles, psym=4, thick=thick, color=color
endif

;***********************************************************************
; The end

return, params
END


;***********************************************************************
; NAME:
;    gauss_fit_eqn
; PURPOSE:
;    This function contains the negative log likelihood function of the 
;    Gaussian model, to be minimised by gauss_fit.pro.
; INPUTS:
;    PARAMS:  A required float vector containing proposed values of the 
;        Gaussian model parameters.  The length and order are the same as for 
;        RESULT from gauss_fit.pro.
; COMMON BLOCK common_gauss_fit_eqn inputs:
;    GAUSS_DATA:  Same as DATA_VEC input for gauss_fit.pro.  Required.
;    GAUSS_COVARIATE_DATA:  Same as COVARIATE_DATA keyword input for 
;        gauss_fit.pro.  Required if covariates are to be included.
;    GAUSS_COVARIATE_PARAM_ID:  Same as COVARIATE_PARAM_ID keyword input for 
;        gauss_fit.pro.  Required if covariates are to be included.
; USES:
;    -
; OUTPUTS:
;    RESULT:  The negative of the log likelihood function for the Gaussian 
;        model given the input set of parameters.  The negative is returned 
;        because IDL's amoeba function performs a minimisation, while we want a 
;        maximisation of the log likelihood.
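; PROCEDURE:
;    A summary of the calculation as read from the code below (notation here 
;    is illustrative):  with x_i the GAUSS_DATA values, the returned value is 
;      RESULT = sum_i [ 0.5 * ( ( x_i - mu_i ) / sigma_i ) ^ 2 
;                       + ln( sigma_i * sqrt( 2 * pi ) ) ]
;    where mu_i is PARAMS[0] plus the sum, over covariates with 
;    GAUSS_COVARIATE_PARAM_ID equal to 0, of the regression parameter times 
;    the covariate value at location i, and sigma_i is PARAMS[1] plus the 
;    equivalent sum over covariates with GAUSS_COVARIATE_PARAM_ID equal to 1.  
;    If the scale parameter is not legal at any location, then 100 * N_DATA 
;    is returned instead as a penalty value.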
;***********************************************************************

FUNCTION GAUSS_FIT_EQN, $
    PARAMS

;***********************************************************************
; Constants and inputs

; Get input values
common common_gauss_fit_eqn
; Determine the data length
n_data = n_elements( gauss_data )
; Determine the number of covariates
n_covariate = n_elements( gauss_covariate_param_id )
; Define parameters for use in calculations (including covariates)
mi = params[0]
sigma = params[1]
if n_covariate gt 0 then begin
  id = where( gauss_covariate_param_id eq 0, n_id )
  if n_id gt 0 then begin
    mi = mi + params[2+id] ## gauss_covariate_data[*,id]
  endif else begin
    mi = mi + fltarr( n_data )
  endelse
  id = where( gauss_covariate_param_id eq 1, n_id )
  if n_id gt 0 then begin
    sigma = sigma + params[2+id] ## gauss_covariate_data[*,id]
  endif else begin
    sigma = sigma + fltarr( n_data )
  endelse
endif

;***********************************************************************
; Calculate log likelihood

; Abort if illegal (non-positive) scale parameter
if min( sigma ) le 0 then return, double( n_data ) * 100

; Calculate the log likelihood
result = -0.5d * ( ( gauss_data - mi ) / sigma ) ^ 2 $
    - alog( sigma * sqrt( 2.d * !dpi ) )
result = total( result )

; Take negative (so amoeba is effectively maximising rather than minimising)
result = -result

;***********************************************************************
; The end

return, result
END
