Gareth Davies, Tsunami variability from uncalibrated stochastic earthquake models: tests against deep ocean observations 2006–2016, Geophysical Journal International, Volume 218, Issue 3, September 2019, Pages 1939–1960, https://doi.org/10.1093/gji/ggz260
SUMMARY
This study tests three models for generating stochastic earthquake-tsunami scenarios on subduction zones by comparison with deep ocean observations from 18 tsunamis during the period 2006–2016. It focusses on the capacity of uncalibrated models to generate a realistic distribution of hypothetical tsunamis, assuming the earthquake location, magnitude and subduction interface geometry are approximately known, while details of the rupture area and slip distribution are unknown. Modelling problems like this arise in tsunami hazard assessment, and when using historical and palaeo-tsunami observations to study pre-instrumental earthquakes. Tsunamis show significant variability depending on their parent earthquake’s properties, and it is important that this is realistically represented in stochastic tsunami scenarios. To clarify which aspects of earthquake variability should be represented, three scenario generation approaches with increasing complexity are tested: a simple fixed-area-uniform-slip (FAUS) model with earthquake area and slip deterministically related to moment magnitude; a variable-area-uniform-slip (VAUS) model which accounts for earthquake area variability; and a heterogeneous-slip (HS) model which accounts for both earthquake area variability and slip heterogeneity. The models are tested using deep-ocean tsunami time-series from 18 events (2006–2016) with moment magnitude Mw > 7.7. For each model and each observed event a ‘corresponding family of model scenarios’ is generated which includes stochastic scenarios with earthquake location and magnitude similar to the observation, with no additional calibration. For an ideal model (which perfectly characterizes the variability of tsunamis) the 18 observed events should appear like a random sample of size 18, created by taking one draw from each of the 18 ‘corresponding family of model scenarios’. This idea forms the basis of statistical techniques to test the models. First a goodness-of-fit criterion is developed to identify stochastic scenarios ‘most similar’ to the observed tsunamis, and assess the capacity of different models to produce good-fitting scenarios. Both the HS and VAUS models show similar capacity to generate tsunamis matching observations, while the FAUS model performs much more poorly in some cases. Secondly, the observed tsunami stage ranges are tested for consistency with the null hypothesis that they were randomly generated by the model. The null hypothesis cannot be rejected for the HS model, whereas both uniform-slip models exhibit a statistically significant tendency to produce small tsunamis too often. Finally, the statistical properties of slip for stochastic earthquake scenarios are compared against earthquake scenarios that best fit the observations. For the VAUS models the best-fitting model scenarios have higher slip on average than the stochastic scenarios, highlighting biases in this model. The techniques developed in this study can be applied to test stochastic tsunami scenario generation techniques, identify and partially correct their biases, and provide better justification for their use in applications.
1 INTRODUCTION
Subduction zone earthquakes are the most common cause of destructive tsunamis globally (Grezio et al. 2017). Accurately modelling the size and variability of the tsunamis they generate is important for tsunami hazard assessment (e.g. Li et al. 2016; Butler et al. 2017; De Risi & Goda 2017; Mori et al. 2017a), and for studies of pre-instrumental earthquakes which interpret historical or palaeo data using modelled scenarios (e.g. Satake et al. 2003; Okal et al. 2004; Goff et al. 2012; Liu & Harris 2013; Ioualalen et al. 2017). For these applications it is necessary to simulate a range of synthetic earthquake–tsunamis with specified magnitudes, and the results may vary substantially depending on the scenario generation method (Li et al. 2016; Mori et al. 2017a). Thus there is a need to test scenario generation methods against observations to confirm they adequately simulate observed tsunamis and their variability (Geist & Oglesby 2014).
In most tsunami modelling applications the study context constrains the scenario magnitudes of interest as well as their general location, strike, dip and rake. When these parameters are fixed, variations in the earthquake’s rupture area and the location, depth and concentration of high slip regions can have a first-order effect on modelled tsunami wave heights (Geist & Dmowska 1999; Gica et al. 2007; Geist 2013; Li et al. 2016; Butler et al. 2017; Mori et al. 2017a). The rigidity is also usually weakly constrained, with values of ≃30–50 GPa commonly applied to subduction interfaces and lower values (≃10 GPa) potentially occurring at shallow depths; this affects tsunamis because, for a fixed seismic moment, the spatially integrated slip is inversely related to rigidity (Bilek & Lay 1999; Geist & Bilek 2001; Fujii & Satake 2007; Lorito et al. 2010; Newman et al. 2011a,b). All these factors imply some tsunami variability that should be accurately represented in synthetic scenarios, with both over- and underestimation leading to biases in applications. For example, in tsunami hazard assessment the use of scenarios with greater variability increases the hazard at rare return periods, because rare events are more strongly affected by the ‘upper tail’ of the scenario distribution (Li et al. 2016; Davies et al. 2017).
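For context, the dependence of slip on rigidity follows directly from the definition of seismic moment together with the moment-magnitude relation; the identities below are standard background rather than equations quoted from this paper:

```latex
M_0 = \mu A \bar{D}, \qquad
M_w = \tfrac{2}{3}\left(\log_{10} M_0 - 9.05\right) \quad (M_0\ \mathrm{in\ N\,m})
\quad\Longrightarrow\quad
\bar{D} = \frac{M_0}{\mu A}
```

So for a fixed magnitude (hence fixed M0) and rupture area, halving the rigidity (e.g. from 30 GPa to 15 GPa near the trench) doubles the mean slip, which tends to enlarge the tsunami.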
Diverse scenario generation models have been applied in the literature. The simplest approach constructs uniform-slip earthquakes with area deterministically related to magnitude (using an empirical scaling relation), which are moved through a range of fault-plane locations with nominal rigidity (Sørensen et al. 2012; Horspool et al. 2014; Lorito et al. 2015; Hoechner et al. 2016; Davies et al. 2017). The use of a deterministic magnitude–area relation is a strong simplification: for example, several empirical subduction earthquake scaling relations represent log10(area) as a function of Mw with an additional prediction standard deviation σ ≃ 0.25–0.3 (Strasser et al. 2010; Allen & Hayes 2017). This implies the rupture area has a 95 per cent prediction interval (±2σ) spanning more than a factor of 10. One way to represent this is by varying the area of uniform-slip scenarios. For example, Butler et al. (2017) modelled seven Mw 8.6 uniform-slip thrust scenarios at an Aleutian Arc site, with area varying by a factor of 6, resulting in a ≃2× change in modelled runup in Hawaii. Gica et al. (2007) reported that a factor of 2 reduction in uniform-slip earthquake area (with a proportionate increase in slip) for a scenario near Chile could approximately double modelled wave heights near Hawaii.
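To make the ‘factor of 10’ statement concrete, the ±2σ prediction interval for the area implied by a mid-range σ is (illustrative arithmetic only):

```latex
\frac{A_{+2\sigma}}{A_{-2\sigma}} = 10^{4\sigma} \approx 10^{4\times 0.27} \approx 12
```

that is, roughly an order of magnitude between the lower and upper ends of the interval.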
The impact of earthquake slip heterogeneity on tsunamis has received much attention in recent literature (e.g. Mueller et al. 2015; Goda et al. 2015; Davies et al. 2015; Ruiz et al. 2015; LeVeque et al. 2016; Sepúlveda et al. 2017). Several studies find uniform-slip earthquake models produce statistically small tsunamis in comparison to heterogeneous-slip (HS) models with similar rupture dimensions (Geist & Dmowska 1999; Geist et al. 2007; Mueller et al. 2015; Ruiz et al. 2015; Li et al. 2016). Mueller et al. (2015) found some of their HS scenarios increased the modelled inundation area by the equivalent of 0.3–0.4 Mw units, in comparison to uniform-slip scenarios with the same rupture area. It is often suggested that slip heterogeneity is of limited importance at sites far from the earthquake source (e.g. Geist 2002; Okal & Synolakis 2008; Lauterjung et al. 2010); however, Li et al. (2016) found it substantially affected modelled tsunami hazard throughout the South China Sea. Similarly, Butler et al. (2017) found that earthquake scenarios with high near-trench slip (and lower deep slip) on the Aleutian Arc induced up to 75 per cent greater modelled inundation area at sites in Hawaii, as compared with uniform-slip scenarios with the same mean slip and rupture extent.
Given the slip heterogeneity observed in finite-fault inversions, HS models are increasingly advocated to enhance the robustness of tsunami hazard assessments (Li et al. 2016; Griffin et al. 2016; De Risi & Goda 2017). However, a plethora of algorithms for generating HS scenarios are used in the literature (e.g. Geist et al. 2007; Geist 2013; Davies et al. 2015; Goda et al. 2015; Geist 2016; LeVeque et al. 2016; Murphy et al. 2016; Sepúlveda et al. 2017), and the size and variability of their synthetic tsunamis vary substantially among algorithms, and even for different parametrizations of the same algorithm. For example, Mori et al. (2017a) investigated the sensitivity of a HS model to variations in its parameters, which were fit separately to 11 different inversions of the 2011 Tohoku earthquake. The median area of inundation with depth exceeding 1 m (based on 50 random scenarios) varied by more than a factor of 2 among the 11 parameter sets at multiple sites, even though all were based on inversions of the same earthquake. To reduce the sensitivity of HS models to parameter choices, several studies used finite-fault inversions from the SRCMOD database (Mai & Thingbaijam 2014) to relate model parameters to earthquake magnitude, including predictive uncertainties which can be integrated in stochastic simulations (Davies et al. 2015; Goda et al. 2016). Such models can be applied without reference to a particular finite-fault inversion, but this does not guarantee robust performance because the results are also affected by other algorithmic choices. Davies et al. (2015) evaluated eight different HS models by comparing their slip and tsunami inundation with the statistical properties of 66 finite-fault inversions, where the latter were used to develop statistical predictors for all eight models’ parameters. Despite being ‘fit to the inversions’ in this way, multiple models produced scenarios which statistically underestimated the inversions’ slip maxima and inundation, with the best models showing greater capacity to produce spatially localized slip. Mori et al. (2017b) used a HS model with parameter distributions derived by Goda et al. (2016), finding the ‘upper tail’ of the modelled tsunami wave height distribution became heavier (i.e. larger extremes) when rupture area variability was accounted for. HS models also vary in their capacity to generate near-trench rupture, and several studies find those which allow ‘rupture to the trench’ predict larger tsunami amplitudes on the overriding-plate side of subduction zones (Geist et al. 2007; Murphy et al. 2016).
It remains unclear which, if any, of the stochastic scenario generation approaches reviewed above best represents real tsunamis. The testing of models using data has received limited attention in the literature, with most studies focusing on algorithmic issues, sensitivity testing and implications for hazard assessment (e.g. Goda et al. 2014, 2015; Griffin et al. 2016; Murphy et al. 2016; Mori et al. 2017a; Sepúlveda et al. 2017). For example all the latter studies compared model predictions with at most one observed tsunami event, whereas multi-event testing is necessary to quantify stochastic model biases.
This study tests three tsunami scenario generation models by comparing stochastic scenarios with deep-ocean observations from 18 recent tsunamis (2006–2016). The scenarios are constructed to have similar location and magnitude as the actual earthquake, but are not otherwise calibrated to the observations, to better understand their performance in applications where calibration is impossible. Techniques to assess the model’s performance are developed, and these naturally generalize to a model-specific bias-adjustment. This study was conducted as part of the 2018 Australian Probabilistic Tsunami Hazard Assessment (Davies & Griffin 2018), with the primary objective being to understand the performance of alternative stochastic tsunami generation methodologies at regional and far-field distances, and guide their use in applications.
2 TEST EVENT DATA
Figure 1. Location of test events, DART buoys, and the unit-source database. The latter was developed for the 2018 Australian Probabilistic Tsunami Hazard Assessment and so excludes some globally significant source-zones with limited importance for Australia (Davies & Griffin 2018).
Deep ocean tsunameter (DART buoy) measurements were obtained from the ‘National Oceanic and Atmospheric Administration National Data Buoy Centre’ historical DART data website (NOAA/NDBC 2017). This included data for 64 sites beginning as early as 2003 (Fig. 1), although records begin in 2006 or later at 87 per cent of sites and so test events are restricted to this period. Nearshore tidal gauge measurements were excluded because they are strongly affected by subkilometre scale variations in nearshore coastal topography (Power & Tolkova 2013; Rabinovich et al. 2017) and a priori may not be simulated reliably with our global scale tsunami propagation model (1 arcmin resolution, Section 3.1).
For all earthquakes matching the above criteria, DART buoy measurements recording the tsunami with a sampling frequency of 2 min or better were selected for further analysis. Eighteen events produced tsunamis that could be clearly identified at one or more DART buoys, with the most widely measured event being the 2011 Tohoku tsunami (28 sites, Table 1). A tsunami signal could not be identified for the 2006/07/17 Mw 7.7 Java earthquake or the 2016/11/13 Mw 7.9 Kaikoura earthquake, so these events are not included in the test set.
Table 1. Events from the GCMT catalogue modelled in this study. The location (lon, lat) and depth parameters refer to the hypocentre. The ‘# DART’ column gives the number of DART buoys at which a tsunami signal was extracted.
| Source zone | Event ID | # DART | Date | Mw | Lon | Lat | Depth (km) |
|---|---|---|---|---|---|---|---|
| Kermadec-Tonga | KT1 | 1 | 2006/05/03 15:26:40.3 | 8.0 | −174.12 | −20.19 | 55.00 |
| Kurils-Japan | KJ1 | 12 | 2006/11/15 11:14:17.8 | 8.3 | 153.29 | 46.57 | 38.90 |
| Solomons | So1 | 2 | 2007/04/01 20:39:56.4 | 8.1 | 157.04 | −8.46 | 10.00 |
| South-America | SA1 | 3 | 2007/08/15 23:40:57.9 | 8.0 | −76.60 | −13.39 | 39.00 |
| Sunda | Su1 | 1 | 2007/09/12 11:10:26.8 | 8.5 | 101.37 | −4.44 | 34.00 |
| South-America | SA2 | 2 | 2007/11/14 15:40:50.5 | 7.8 | −69.89 | −22.25 | 40.00 |
| Puysegur | Pu1 | 2 | 2009/07/15 09:22:29.0 | 7.8 | 166.56 | −45.76 | 12.00 |
| Kermadec-Tonga | KT2 | 5 | 2009/09/29 17:48:11.0 | 8.1 | −172.10 | −15.49 | 18.00 |
| New-Hebrides | NH1 | 1 | 2009/10/07 22:18:51.2 | 7.8 | 166.38 | −12.52 | 35.00 |
| South-America | SA3 | 16 | 2010/02/27 06:34:15.6 | 8.8 | −72.71 | −35.85 | 44.80 |
| Sunda | Su2 | 1 | 2010/04/06 22:15:01.6 | 7.8 | 97.05 | 2.38 | 31.00 |
| Sunda | Su3 | 1 | 2010/10/25 14:42:22.5 | 7.9 | 100.08 | −3.49 | 20.10 |
| Kurils-Japan | KJ2 | 28 | 2011/03/11 05:46:23.0 | 9.1 | 142.37 | 38.32 | 24.40 |
| New-Hebrides | NH2 | 5 | 2013/02/06 01:12:25.8 | 7.9 | 165.11 | −10.80 | 24.00 |
| South-America | SA4 | 7 | 2014/04/01 23:46:47.3 | 8.2 | −70.77 | −19.61 | 25.00 |
| South-America | SA5 | 18 | 2015/09/16 22:54:32.9 | 8.3 | −71.67 | −31.57 | 22.40 |
| South-America | SA6 | 2 | 2016/04/16 23:58:36.9 | 7.8 | −79.93 | 0.35 | 21.00 |
| Solomons | So2 | 3 | 2016/12/08 17:38:46.3 | 7.8 | 161.32 | −10.68 | 40.00 |
The tsunami wave train was extracted from the DART time-series using a LOESS smoother to isolate the tidal component (Romano et al. 2015; Davies & Griffin 2018). Rayleigh waves were removed by truncating the time-series (Rabinovich & Eblé 2015). Additional data spikes were replaced by interpolation. For each event, if the tsunami could be clearly detected at four or more DART buoys, then other stations where the wave was very small (range ≲ 2 cm) were not used because small waves are of less interest for our target applications, and it is more difficult to unambiguously de-tide such small signals.
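The de-tiding step can be illustrated with a short sketch: fit a slowly varying curve to the raw DART record with a LOESS smoother and subtract it, leaving the tsunami signal. The implementation below uses statsmodels' `lowess`; the window choice and synthetic example are illustrative assumptions, not the study's actual code.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def detide_dart(t_sec, stage_m, smoothing_hours=3.0):
    """Estimate the slowly varying (tidal) component of a DART record with a
    LOESS smoother and subtract it, leaving the short-period tsunami signal.

    t_sec   : observation times in seconds (monotonic)
    stage_m : observed water level (m)
    smoothing_hours : approximate smoother window; the tide varies over ~12 hr,
        so a window of a few hours follows it while rejecting shorter-period
        tsunami waves.
    """
    # Fraction of the record used in each local fit.
    frac = min(1.0, smoothing_hours * 3600.0 / (t_sec[-1] - t_sec[0]))
    tide_estimate = lowess(stage_m, t_sec, frac=frac, return_sorted=False)
    return stage_m - tide_estimate

# Synthetic example: a semidiurnal tide plus a short-period "tsunami" onset.
t = np.arange(0.0, 24 * 3600.0, 60.0)
tide = 0.8 * np.sin(2 * np.pi * t / (12.42 * 3600.0))
tsunami = 0.05 * np.sin(2 * np.pi * t / 1800.0) * (t > 6 * 3600.0)
residual = detide_dart(t, tide + tsunami)
```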
3 SCENARIO MODELLING
The earthquake scenario database was developed as part of a tsunami hazard assessment for Australia (Davies & Griffin 2018). It includes scenarios on major earthquake sources in the Pacific and Indian Oceans (depicted as unit-sources in Fig. 1), including many sources that do not feature in the current study.
3.1 Source discretization and tsunami propagation
For source-zones used in this study, the fault-plane geometries were defined with either SLAB1.0 or SLAB2.0 (Hayes et al. 2012, 2018). It was assumed that earthquake slip could occur anywhere between the trench and a source-zone specific maximum downdip depth, with the latter based on the maximum of the three values suggested by Berryman et al. (2015).
Each fault-plane was discretized into a curvilinear grid of unit-sources, each approximately 50 km long (along-strike) by 50 km wide (downdip) (Fig. 2). All earthquakes are represented as linear combinations of these unit-sources. Depending on the source-zone width, there are between 2 and 4 unit-sources in the downdip direction. To permit the non-uniform source-zone geometry to be represented despite this relatively coarse discretization, all unit-sources were internally discretized into rectangular sub-unit-sources with depth and dip assigned from the original fault-plane geometry (Davies et al. 2017). The slip on the sub-unit-sources was initially set equal to the slip on the parent unit-source, but was subsequently tapered between neighbouring unit-sources to prevent slip discontinuities, which can otherwise artificially enhance short wavelengths in the tsunami initial conditions (Baba & Cummins 2005; Goda 2015; Davies & Griffin 2018).
Figure 2. Example earthquake scenarios generated by each of the three models on the Kurils-Japan source-zone. The depicted scenarios all have Mw 9.1. The GCMT hypocentre for the 2011/03/11 Tohoku event is shown in the top-left-hand panel. (a, b) Fixed-area-uniform-slip (FAUS) model; (c, d) variable-area-uniform-slip (VAUS) model; (e, f) heterogeneous-slip (HS) model.
The use of a 50 × 50 km² unit-source patch size is necessarily a compromise. Smaller patches can represent smaller earthquake scales and allow larger deep-water tsunamis (by concentrating more slip on the ‘most tsunamigenic’ unit-source), but also lead to rapid growth in computational and storage requirements (Li et al. 2016). Sufficiently fine patch sizes may also necessitate dispersive hydrodynamic models, further increasing the computational cost (Li et al. 2016). For this study the patch size was limited to be computationally tractable at the global scale, noting coarser unit-sources are typically employed in global-scale databases (Gica et al. 2008; Greenslade & Titov 2008; Tang et al. 2009; Rasmussen et al. 2015). Importantly, many prior studies used comparable unit-source sizes to reconstruct tsunami observations from Mw ⪆ 7.7 earthquakes reasonably well (e.g. Fujii & Satake 2006; Greenslade & Titov 2008; Tang et al. 2009; Greenslade et al. 2011; Uslu et al. 2011; Fujii & Satake 2013; Borrero et al. 2015; Dilmen et al. 2015; An et al. 2018). A priori this suggests that comparable accuracy could be obtained with our unit-sources, if the earthquake scenario models perform adequately.
For each unit-source, the vertical coseismic seabed deformation from 1 m of slip was computed by summing over sub-unit-sources using the Okada (1985) model with Poisson’s ratio equal to 0.25. The sea-surface deformation was derived by filtering the seabed deformation (Kajiura 1963; Saito & Furumura 2009; Glimsdal et al. 2013).
For each unit-source the resulting tsunami propagation was modelled with the linear shallow water equations including a Coriolis term, which were solved numerically on a 1-arcmin grid using the leap-frog method (Satake 1995; Imamura et al. 2006; Baba et al. 2015). The bathymetry was mostly derived from the GEBCO 2014 Grid (version 20150318), combined with the GA250 DEM (Whiteway 2009) at sites near Australia. The model domain spanned the globe between latitudes −72° and 65°, with periodic east–west and reflective north–south boundary conditions (for details see Davies & Griffin 2018). Model time-series were stored at all DART buoy locations in Fig. 1. The tsunami associated with any earthquake scenario was computed as a linear combination of the unit-source tsunamis, each multiplied by the associated earthquake slip. This is mathematically exact because the Okada (1985) model is linear in slip, while the linear shallow water equations are linear in the water surface elevation above sea level and the depth-integrated velocity (Thio et al. 2007; Burbidge et al. 2008).
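Because both the Okada (1985) model and the linear shallow water equations are linear, a scenario's DART time-series can be assembled by superposing the precomputed unit-source time-series. A minimal sketch of this superposition is given below; the array names and shapes are illustrative, not taken from the study's code.

```python
import numpy as np

def scenario_time_series(unit_source_db, slip):
    """Superpose precomputed unit-source tsunami time-series into one scenario.

    unit_source_db : array with shape (n_unit_sources, n_gauges, n_times),
        giving the modelled stage (m) at each DART buoy for 1 m of slip on
        each unit-source.
    slip : array with shape (n_unit_sources,), the scenario's slip (m) on
        each unit-source (zero outside the rupture).

    Returns an array with shape (n_gauges, n_times). Linearity of the
    governing equations makes this superposition exact.
    """
    return np.tensordot(slip, unit_source_db, axes=(0, 0))

# Example with 3 unit-sources, 2 gauges and 5 output times.
db = 0.01 * np.random.rand(3, 2, 5)
slip = np.array([0.0, 4.0, 2.5])
stage = scenario_time_series(db, slip)  # shape (2, 5)
```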
3.2 Three stochastic earthquake scenario models
The scenario database includes three earthquake scenario generation models with increasing complexity, termed fixed-area-uniform-slip (FAUS), variable-area-uniform-slip (VAUS) and heterogeneous-slip (HS) (Fig. 2). All earthquake scenarios were generated as described below, with magnitudes in [7.2, 7.3, ..., 9.8] (for details see Davies & Griffin 2018). The full database contains approximately 50 000 FAUS scenarios and 750 000 VAUS and HS scenarios, but only a subset of these will ultimately be compared with observations herein (Section 4.1).
3.2.1 FAUS
The FAUS model assumes earthquake slip is spatially uniform, and that rupture area is a deterministic function of magnitude (Figs 2a and b). The length and width are approximated with a collection of nx × ny unit-sources in the along-strike and downdip directions, based on the source-zone’s average unit-source dimensions and the subduction interface scaling relations of Strasser et al. (2010) (Figs 2a and b; full details in Davies et al. 2017). On any given source-zone there exist a finite number of possible locations for an nx × ny block of unit-sources, which together form the complete set of FAUS scenarios with the specified magnitude. The slip is derived from the seismic moment and area assuming a rigidity of 3 × 10¹⁰ Pa. The use of alternative scaling relations (e.g. Blaser et al. 2010; Allen & Hayes 2017; Thingbaijam et al. 2017) may alter the FAUS tsunami-size versus magnitude relationship, because a smaller rupture area implies greater slip which can promote larger tsunamis, but this is not investigated herein. The differences between widely used scaling relations are generally less than their prediction errors (Allen & Hayes 2017), so for simplicity this study focusses on the latter (Section 3.2.2).
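A minimal sketch of the FAUS construction is given below: the median scaling-relation length and width are rounded to a whole number of roughly 50 km unit-sources, and the uniform slip follows from the seismic moment, area and rigidity. The coefficients are approximately those of the Strasser et al. (2010) interface relations and are included for illustration only; the study's implementation (Davies et al. 2017) may differ in detail.

```python
import numpy as np

MU = 3.0e10  # rigidity (Pa) assumed for the uniform-slip scenarios

# Median subduction-interface scaling relations, log10(dimension in km) = a + b * Mw.
# Coefficients approximately follow Strasser et al. (2010); illustrative only.
LENGTH_AB = (-2.477, 0.585)
WIDTH_AB = (-0.882, 0.351)

def faus_geometry_and_slip(mw, unit_source_km=50.0):
    """Deterministic FAUS rupture dimensions (in unit-sources) and uniform slip."""
    m0 = 10.0 ** (1.5 * mw + 9.05)                         # seismic moment (N m)
    length_km = 10.0 ** (LENGTH_AB[0] + LENGTH_AB[1] * mw)
    width_km = 10.0 ** (WIDTH_AB[0] + WIDTH_AB[1] * mw)
    # Round to an integer number of ~50 km unit-sources along strike and downdip.
    nx = max(1, int(round(length_km / unit_source_km)))
    ny = max(1, int(round(width_km / unit_source_km)))
    area_m2 = (nx * unit_source_km) * (ny * unit_source_km) * 1.0e6
    slip_m = m0 / (MU * area_m2)
    return nx, ny, slip_m

print(faus_geometry_and_slip(9.0))  # roughly 12 x 4 unit-sources with ~10 m slip
```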
Scenarios with uniform-slip and deterministic size (magnitude-dependent) are often employed in tsunami hazard assessments (e.g. Løvholt et al. 2012; Davies et al. 2017; Kalligeris et al. 2017) and to infer the magnitudes of pre-instrumental earthquakes (Liu & Harris 2013; Ioualalen et al. 2017). Considering the popularity of this approach it is of interest to see whether FAUS scenarios adequately represent tsunamis in our test set (Table 1). The FAUS approach also provides a baseline for assessing the benefits of more complex models.
3.2.2 VAUS
The VAUS model retains the simplicity of uniform slip but accounts for stochasticity in the earthquake dimensions and mean slip using scaling relation prediction errors (Figs 2c and d). Fifteen VAUS scenarios are generated per FAUS scenario to sample the rupture size variability; this was sufficient to achieve convergence of tsunami maximum-stage versus return-period curves in an associated hazard study (Davies & Griffin 2018). By construction each VAUS scenario has a ‘parent’ FAUS scenario (i.e. with 15 ‘children’ for each of the latter). This relationship will be exploited later when associating VAUS scenarios with historical test events (Section 4.1).
Eq. (2) implies arbitrarily small rupture lengths and widths can occur at all magnitudes. To control the impact of such ‘small area, high slip’ scenarios on our results, random samples from the normal distribution in eq. (2) were clipped to ±2σ. For example this allows a Mw 9.0 event to have length within [267, 1406] km and width within [85, 420] km, although there is some additional variation caused by rounding to the nearest number of unit-sources.
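A minimal sketch of this clipped sampling is shown below, using coefficients approximately equal to the Strasser et al. (2010) interface relations; the prediction standard deviations and the ±2σ clipping reproduce the Mw 9.0 length and width ranges quoted above, but the study's implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# log10(dimension in km) = a + b * Mw, with prediction standard deviation sigma.
# Approximate Strasser et al. (2010) interface coefficients; illustrative only.
LENGTH = dict(a=-2.477, b=0.585, sigma=0.180)
WIDTH = dict(a=-0.882, b=0.351, sigma=0.173)

def sample_dimension(mw, rel, clip_sigma=2.0):
    """Sample one rupture dimension (km), clipping the normal deviate to
    +/- clip_sigma so that arbitrarily extreme dimensions cannot occur."""
    eps = float(np.clip(rng.normal(), -clip_sigma, clip_sigma))
    return 10.0 ** (rel["a"] + rel["b"] * mw + eps * rel["sigma"])

# For Mw 9.0 the clipped ranges are roughly [267, 1406] km (length) and
# [85, 420] km (width), matching the values quoted in the text.
lengths = [sample_dimension(9.0, LENGTH) for _ in range(5)]
widths = [sample_dimension(9.0, WIDTH) for _ in range(5)]
```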
The length and width are also limited by the source-zone size (Kalligeris et al. 2017). If the full source-zone width is less than the width from eq. (2), then the latter is truncated. A random decision is then made as to whether the scenario length should also be adjusted. In 50 per cent of cases the length is increased to preserve the originally desired area and mean slip (as suggested by Thingbaijam et al. 2017), while in the other 50 per cent of these cases the length is unchanged (as assumed in the scaling relations of Allen & Hayes 2017) and the mean slip increases to preserve the desired seismic moment. This reflects uncertainty in how scaling relations should be adapted for such extreme earthquakes.
3.2.3 HS
The HS model accounts for both rupture size variability and slip heterogeneity. One HS scenario was generated for every VAUS scenario, where the nx × ny unit-sources of the latter define the rupture extent. The slip for each HS scenario was created randomly using the SNCF model of Davies et al. (2015). This is a variant on the widely used k⁻² HS model (e.g. Herrero & Bernard 1994; Geist 2002; Gallovič & Brokešová 2004; Causse et al. 2009; Li et al. 2016), and showed the best performance among eight alternative HS models that Davies et al. (2015) compared with finite-fault inversions.
In addition to the rupture dimensions and magnitude, the SNCF model requires two corner-wavenumber parameters (kcx, kcy), a specified constant grid size, and a specified peak-slip location (Davies et al. 2015). The peak-slip location corresponds to the randomly chosen unit-source U defined above for the VAUS scenarios. U also defines the grid size for the SNCF algorithm (Davies et al. 2015). The corner-wavenumbers were randomly simulated using Mw-based regression relations in Davies et al. (2015, their table 2), including the random residual term and correlation between the kcx and kcy residuals.
Limits were imposed on the scenario’s maximum slip (Lavallée et al. 2011; Goda et al. 2014; Mori et al. 2017a). Scenarios were discarded if their maximum slip was more than 7.5 times the mean slip of a hypothetical subduction interface earthquake with the same magnitude and a median scaling-relation area (based on Strasser et al. (2010) with a rigidity of 3 × 10¹⁰ Pa). This is a fairly weak constraint, noting a factor ≥5.5 is required for uniform-slip earthquakes with −2σ prediction errors in length and width. The value of 7.5 was chosen to allow slip maxima that are consistent with a range of published finite-fault inversions. It permits a Mw 9.1 scenario to have maximum slip ≤81 m, consistent with values of 35–75 m among all inversions of the 2011 Tohoku earthquake reviewed by Goda et al. (2014). The approach also permits recent inversions of the 2004 Sumatra event (Bletery et al. 2016; Gopinathan et al. 2017), which have higher maximum slip than earlier inversions (Poisson et al. 2011), as well as recent inversions of the 1960 Chile earthquake (Fujii & Satake 2013; Ho et al. 2019).
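The quoted 81 m cap for a Mw 9.1 scenario can be reproduced approximately from the stated rigidity and a median scaling-relation area; the arithmetic below is illustrative and assumes an area coefficient close to Strasser et al. (2010):

```latex
M_0 = 10^{1.5\times 9.1 + 9.05} \approx 5.0\times 10^{22}\ \mathrm{N\,m}, \qquad
A \approx 1.54\times 10^{11}\ \mathrm{m^2},
\\
\bar{D} = \frac{M_0}{\mu A}
\approx \frac{5.0\times 10^{22}}{3\times 10^{10}\times 1.54\times 10^{11}}
\approx 10.8\ \mathrm{m}, \qquad
7.5\,\bar{D} \approx 81\ \mathrm{m}.
```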
4 TESTING SCENARIO GENERATION MODELS
4.1 Corresponding family of model scenarios
For each observed event e (Table 1) and each model type m ∈ {FAUS, VAUS, HS} (Section 3.2) a ‘corresponding family of model scenarios’ |$C_{e}^{m}$| was algorithmically defined as detailed below. The idea is that |$C_{e}^{m}$| includes all database scenarios of model type m for which the earthquake has ‘similar location and magnitude’ as the observed event e. Tsunami data is deliberately not used to define |$C_{e}^{m}$|. Each model’s performance will subsequently be evaluated with reference to an ‘ideal model’ for which the observed tsunami e behaves like a randomly selected scenario from its |$C_{e}^{m}$|. For an ideal model the 18 observed tsunamis (Table 1) would thus appear like a random sample of 18 modelled tsunamis, with one scenario drawn randomly from each of the 18 |$C_{e}^{m}$|. In applications it is often desirable to treat modelled tsunami scenarios as an unbiased representation of future tsunamis generated by earthquakes with similar location and magnitude, so the ideal model provides a pragmatic basis for model testing.
To assess biases in the distribution of stochastic tsunamis generated by any model m, it is necessary to compare multiple events e with their |$C_{e}^{m}$|. Comparison of a single event and its |$C_{e}^{m}$| is inadequate, even if one or more modelled scenarios are similar to observations, because the variability of the modelled scenarios cannot be tested. Multi-event testing solves this problem; for example if 18 tsunamis are observed and all produce ‘above median’ wave heights relative to random scenarios with similar location and magnitude (as defined by their |$C_{e}^{m}$|), then the model m likely produces small tsunamis too often. Note individual events e are only compared against scenarios in their own |$C_{e}^{m}$|; however, tests for model bias involve multiple such comparisons.
Although multi-event testing is essential to quantify stochastic model biases, the author is unaware of prior attempts to do this. Several studies compared a set of stochastic scenarios (equivalent to our ‘corresponding family of model scenarios’) with observations from one tsunami, often the 2011 Tohoku event (e.g. Goda et al. 2014, 2015; Murphy et al. 2016; Mori et al. 2017a), although Sepúlveda et al. (2017) used the 2014/04/01 Mw 8.2 Iquique tsunami. Other studies compared their models with a tsunami simulated from a finite-fault inversion (e.g. Li et al. 2016; Mori et al. 2017b), while Davies et al. (2015) compared the statistical properties of their models with 66 oceanic earthquake inversions. The latter approach is limited by possible bias in finite-fault inversions (Mai & Thingbaijam 2014), which the current study avoids by testing against tsunami observations. Although the sample size of 18 herein (Table 1) is not large, it is sufficient to demonstrate statistical biases in several of the models (Sections 5.3 and 5.4).
In this study models are deliberately tested without event-specific calibration of |$C_{e}^{m}$| (beyond filtering database scenarios by earthquake location and magnitude, defined precisely below). This is because event-specific calibration is impossible for some applications of interest. For example, tsunami hazard assessment often requires the design of hypothetical scenarios with prescribed earthquake location and magnitude, where the latter are specified separately from context-specific factors (e.g. Lynett et al. 2016; Kalligeris et al. 2017). In this situation the model is used to represent possible future tsunamis and event-specific tuning is impossible, because data for the ‘future event’ of interest does not exist. Some applications are not similarly constrained, such as far-field tsunami early warning where DART buoy observations can be assimilated (Tang et al. 2009, 2016). The design of tests for such applications is outside the scope of this study.
To construct |$C_{e}^{m}$| for each observed event e and model type m, first a reference magnitude and location for e are specified using the GCMT seismic moment and hypocentre (Table 1). Next |$C_{e}^{\textrm{FAUS}}$| is constructed (the VAUS and HS models will be treated subsequently). FAUS database scenarios are included if they occur on the same source-zone as the GCMT earthquake, and have:
magnitude within ±0.15 of the event magnitude (Table 1), and
at least one unit-source within |$\frac{1}{2}$| a scaling relation length and width of the unit-source nearest the hypocentre (based on the interface scaling relations of Strasser et al. (2010), rounding up to an integer number of unit-sources). This accounts for the fact that high magnitude earthquakes tend to have larger lengths and widths, so distances considered ‘near’ grow accordingly.
For example the 2011 Tohoku earthquake (Event ID ‘KJ2’ in Table 1) occurred on our Kurils-Japan source-zone and has GCMT Mw = 9.1. Figs 2(a) and (b) show the northern-most and southern-most scenarios in |$C_{\textrm{KJ2}}^{\textrm{FAUS}}$| with Mw 9.1 (other scenarios with Mw = 9.0 and 9.2 are also included in |$C_{\textrm{KJ2}}^{\textrm{FAUS}}$|). The southern-most scenario is limited by the southern extent of the source-zone (Fig. 2b), while the northern-most scenario (Fig. 2a) is north of the actual rupture (Satake et al. 2013). In general, model scenarios which best agree with observations are expected to occur within this algorithmically defined range.
For the VAUS and HS models, |$C_{e}^{\textrm{VAUS}}$| and |$C_{e}^{\textrm{HS}}$| are defined to include all scenarios which have a parent FAUS scenario in |$C_{e}^{\textrm{FAUS}}$| (examples in Figs 2c–f). This definition was used in preference to directly selecting scenarios based on their location and magnitude, because it avoids biases in scenario selection. For example, if scenarios were included on the basis of containing a unit-source near the event hypocentre (as for the FAUS scenarios), there would be a greater chance of selecting large-area scenarios than small-area scenarios. A biased representation of the model would result.
The central idea of this study is to assess stochastic tsunami model biases by comparing observed tsunamis with random model scenarios having ‘similar location and magnitude’. Thus the definition of |$C_{e}^{m}$| could affect the results. For instance if the magnitude range (±0.15) was much smaller than the uncertainty in 2006+ GCMT magnitudes, even an ideal model may never generate scenarios in |$C_{e}^{m}$| which resemble the observed event e. However if the observed location and magnitude are approximately unbiased, the magnitude and distance thresholds used to define |$C_{e}^{m}$| should not systematically bias the median tsunami size. Thus any evidence of consistent positive or negative bias in the modelled tsunami size should be robust to details of the |$C_{e}^{m}$| definition.
4.2 Scenario goodness-of-fit
For each scenario |$s \in C_{e}^{m}$| a ‘scenario goodness-of-fit statistic’ (|$G_{e}^{s}$|) summarizes agreement with DART buoy observations from event e, accounting for results at multiple DART buoys if available (Table 1). The goodness-of-fit statistic serves two roles in this study:
It provides an objective method for identifying scenarios in |$C_{e}^{m}$| which agree relatively well with observations (Fig. 3, Section 5.1).
Biases in model m’s representation of earthquakes can be uncovered by comparing the statistical properties of the ‘best-fitting’ earthquake scenarios with stochastic scenarios in their |$C_{e}^{m}$|, over many events e (Section 5.4). For an ‘ideal model’ the set of best-fitting scenarios should be statistically indistinguishable from a random sample containing one scenario from each |$C_{e}^{m}$| (aside from their goodness-of-fit). Deviations in other scenario properties would suggest model biases. For example, if over many events e the best-fitting earthquake scenario consistently had low area and high slip compared with scenarios in its |$C_{e}^{m}$|, it implies the model m produces high area, low slip earthquakes too often. This reasoning can be generalized to develop a bias adjustment for m (Section 5.4).
Figure 3. Stage time-series from five modelled scenarios at DART buoys, all drawn from the heterogeneous-slip ‘corresponding family of model scenarios’ for the 2016 Mw 7.8 Solomon earthquake-tsunami (ID ‘So2’, Table 1). Scenario A has the best scenario goodness-of-fit statistic based on eq. (4) (|$G_{\textrm{So2}}^{\textrm{A}} = 0.183$|), which by definition is the median of the individual gauge statistics. Scenarios B, C and D have better fits at one individual station, but their scenario goodness-of-fit statistics are worse than Scenario A’s (|$G_{\textrm{So2}}^{\textrm{B}}=0.221$|, |$G_{\textrm{So2}}^{\textrm{C}}=0.358$|, |$G_{\textrm{So2}}^{\textrm{D}}=0.257$|). Scenario E was chosen at random and has a poorer scenario goodness-of-fit (|$G_{\textrm{So2}}^{\textrm{E}} = 0.731$|).
The statistic satisfies |$G_{e,d}^{s} \in [0,2]$|, with lower values indicating a better fit (Fig. 3). In eq. (3):
t_i is a time sequence with 15 s spacing covering up to 3 hr of the tsunami signal at the DART buoy, but limited to between the beginning and end of high-frequency DART sampling (2 min or better). Interpolation of the data is used to evaluate h_e(t | d) and h_s(t | d) at any t_i;
δt is a small time-offset which accounts for unmodelled processes that lead to time-delays between the model and observations. This includes the finite rupture duration, wave dispersion, elastic loading, and seawater compressibility (Baba et al. 2017). The value of δt is numerically computed to optimize |$G_{e,d}^{s}$| (Lorito et al. 2008) but is restricted to a physically reasonable range, with absolute values ≤2 min near the source, extending up to 1/50 of the tsunami travel time for DART buoys far from the source, with an overall upper limit of 15 min (Watada et al. 2014).
w_i is a weight which allows the statistic to focus on larger waves. It is proportional to the absolute value of h_e(t_i | d), except it is never less than one-third of the maximum value of h_e(t_i | d) over all i.
Scenarios with low values of eq. (3) show a better visual fit to the station data than scenarios with high values (Fig. 3). Eq. (3) is similar to the misfit used by Lorito et al. (2008) and Romano et al. (2016) for finite-fault inversion, with the most important difference being that temporally variable weights w_i are used herein to focus on larger waves.
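Eq. (3) itself is not reproduced above. Based on the Lorito et al. (2008) style of normalized misfit, the stated range [0, 2] and the terms defined above, its form is presumably along the lines sketched here; this is a reconstruction consistent with the surrounding description, not a verbatim copy of the paper's equation:

```latex
G_{e,d}^{s} =
\frac{\sum_i w_i \left[ h_e(t_i \mid d) - h_s(t_i + \delta t \mid d) \right]^2}
     {\sum_i w_i \left[ h_e(t_i \mid d)^2 + h_s(t_i + \delta t \mid d)^2 \right]}
```

A statistic of this form equals 0 for a perfect match and approaches 2 when model and observation have equal weighted energy but opposite sign.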
Figure 4. Properties of the heterogeneous-slip ‘corresponding family of model scenarios’ for the 2016 Mw 7.8 Solomon earthquake-tsunami (ID ‘So2’ in Table 1). Plots above the main diagonal illustrate intergauge dependence in the scenario stage-ranges, with the observed values plotted in purple. Plots below the main diagonal illustrate intergauge dependence in the gauge-specific scenario goodness-of-fit statistics |$G_{\textrm{So2},d}^{s}$|. Plots on the main diagonal show the stage-range distribution for modelled scenarios at each DART buoy (boxplot), along with the observed value and the value of the |$F_{e,d}^{m}$| statistic (i.e. |$F_{\textrm{So2},d}^{\textrm{HS}}$| in this case, where d varies for each DART). In the boxplots, the box defines the interquartile range (25th–75th percentile), with the median plotted as a line. The ‘whiskers’ extend from the box over at most 1.5 interquartile ranges, but never exceed the range of the data. Any points not covered by the whiskers are plotted directly. Because the modelled stage-range distribution is right-skewed, such points mostly correspond to high stage-range values.
4.3 Analysis of the modelled tsunami ‘size’ distribution
To analyse biases in the representation of tsunami ‘size’ by the FAUS, VAUS and HS models, the statistical properties of the tsunami stage-range (i.e. difference between the maximum and minimum water-level) for scenarios in |$C_{e}^{m}$| are compared with the de-tided stage-range observed during the associated event e, for the time-period with high-frequency DART measurements (i.e. times ti used to define eq. 3). For each tsunami event e, model m, and DART buoy site d, define the ‘coverage-statistic’ |$F_{e,d}^{m}$| as the fraction of scenarios in |$C_{e}^{m}$| which have stage-range less than the observed value at d (diagonal panels of Fig. 4). Clearly |$F_{e,d}^{m}$| describes how ‘big’ the observed tsunami e is at site d relative to scenarios in |$C_{e}^{m}$|. Values near 1 indicate the observation is relatively large, values near 0 indicate the observation is relatively small, and values near 0.5 indicate the observation is mid-sized (Fig. 4).
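A minimal sketch of the coverage-statistic computation is given below (the function and variable names are illustrative):

```python
import numpy as np

def coverage_statistic(scenario_stage_ranges, observed_stage_range):
    """Fraction of scenarios in the corresponding family whose stage-range is
    below the observed value at one DART buoy (the F_{e,d}^m statistic).

    scenario_stage_ranges : 1-D array, one stage-range per scenario in C_e^m
    observed_stage_range  : de-tided observed stage-range at site d
    """
    scenario_stage_ranges = np.asarray(scenario_stage_ranges)
    return float(np.mean(scenario_stage_ranges < observed_stage_range))

# Values near 1 mean the observation is large relative to the model family,
# values near 0 mean it is small, and values near 0.5 mean it is mid-sized.
print(coverage_statistic([0.1, 0.2, 0.4, 0.8], 0.35))  # -> 0.5
```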
By analysing the coverage-statistics for many tsunami events, we can test for bias in the model’s representation of stochastic tsunami stage-ranges. Intuitively, suppose a model m generates too many small tsunamis given the earthquake location and magnitude. With enough observed events e we expect to see |$F_{e,d}^{m}$| close to 1 more often than for an ideal model. More generally, the statistical techniques developed below allow inferences about model bias by comparing the |$F_{e,d}^{m}$| values with expectations from an ideal model. The coverage-statistic provides a particularly useful basis for these tests because there is no need to make any parametric assumptions about the modelled stage-range distributions.
To construct these tests, a key idea is that if m is an ideal model and the location d is fixed, then for any observed event e the coverage statistic |$F_{e,d}^{m}$| is like a random sample from a uniform distribution in [0, 1] (albeit with some discretization due to the finite number of scenarios in |$C_{e}^{m}$|). This follows because by definition, for an ideal model m the tsunami associated with event e behaves like a randomly selected tsunami scenario from |$C_{e}^{m}$|. Furthermore, for such random tsunami scenarios the stage-range at d may have a complicated distribution, but the coverage-statistic transforms this to a uniform distribution in [0, 1] (because it is derived from normalized ranks of the scenario stage-ranges). Thus for an ideal model m, if we have a set of 18 observed events e and for each consider a single|$F_{e,d}^{m}$| at a single random location d (which may vary from event to event), the resulting set of 18 |$F_{e,d}^{m}$| values will behave like an independent random sample from a uniform distribution. However, if for each event we include the |$F_{e,d}^{m}$| values at all observation sites d (normally more than one, Table 1), then the |$F_{e,d}^{m}$| values for the same event would not be independent. This intra-event dependence is due to spatial relationships among the DART buoy locations (leading to positive or negative dependence in scenario stage-ranges, Fig. 4), and because some earthquakes are more efficient tsunami generators than others (e.g. depending on the earthquake depth, ocean depth at the source, slip distribution, etc).
Supposing the intra-event dependence is accounted for, any non-uniformity in the |$F_{e,d}^{m}$| distribution is indicative of model bias: for example a downward-biased model would show a preference for high |$F_{e,d}^{m}$|; while a model which underestimates the variability of tsunami size would produce both high and low |$F_{e,d}^{m}$| more often than expected; conversely a model which overestimates the variability of tsunami size would produce too few extreme |$F_{e,d}^{m}$|. For an ideal model |$F_{e,d}^{m}$| should also be independent of the earthquake magnitude, and of the time required for the tsunami to reach d. Otherwise the model would have different biases for small and large magnitude earthquakes, or near-field versus far-field sites, which is clearly undesirable.
To account for the intra-event dependence when analysing |$F_{e,d}^{m}$|, a simple approach is to construct a sample containing one randomly selected site d per event e. Obviously this removes the possibility of intra-event dependence, which requires more than one site per event. In practice the properties of many such samples should be investigated to integrate over the random site selection. While having the benefit of simplicity, this makes somewhat weak use of the multisite observations for each event.
The latter result facilitates statistical hypothesis testing, which is the primary motivation for defining the |$R_{e}^{m}$| statistic (otherwise |$F_{e,d}^{m}$| or |$F_{e}^{m}$| can be used directly, and have a more intuitive interpretation). Considering all 18 events e, the values of |$R_{e}^{m}$| for an ideal model m should behave like a random sample of size 18 from a uniform distribution. The latter result can be tested statistically using null-hypothesis significance tests for uniformity. Herein the Anderson–Darling test statistic is used (Marsaglia & Marsaglia 2004), which is more powerful and more sensitive to concentrations of observations near the distribution tails than the commonly used Kolmogorov–Smirnov test (Stephens 1979). As an additional test, the standard deviation of the 18 |$R_{e}^{m}$| values is compared with the standard deviation expected from a sample of size 18 from a uniform distribution. It is straightforward to show that such random standard deviations would have a median of 0.288, with 95 per cent of values between [0.219, 0.349] (derived by simulating 10 million random datasets). The standard deviation of |$R_{e}^{m}$| is considered in addition to the Anderson–Darling test because, although the latter is effective for detecting samples with median bias or unusual concentrations near extremes, it is less effective at identifying samples that are underdispersed yet have an approximately correct median. Note the use of two tests increases the likelihood that one or the other suggests the |$R_{e}^{m}$| are unusual; using simulation we found around 9 per cent of random uniform samples (n = 18) would have an Anderson–Darling p < 0.05 and/or a standard deviation outside the 95 per cent interval.
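The reference values quoted for the standard-deviation check can be reproduced with a short simulation; the sketch below covers only that check (it does not implement the Anderson–Darling test), and the sample size and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n_events, n_sims = 18, 100_000

# Reference distribution for the standard deviation of 18 independent
# Uniform(0, 1) draws, used to judge whether the R_e^m values are over- or
# under-dispersed relative to an ideal model.
samples = rng.uniform(size=(n_sims, n_events))
sds = samples.std(axis=1, ddof=1)

print(np.median(sds))                    # compare with the quoted median of ~0.288
print(np.quantile(sds, [0.025, 0.975]))  # compare with the quoted ~[0.219, 0.349]
```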
To double-check that the statistical tests are not adversely affected by our numerical approximations (e.g. the finite number of scenarios in |$C_{e}^{m}$|, which is limited to scenarios in our database), the approach was tested using simulation. Thousands of synthetic datasets satisfying the null hypothesis were simulated, each by drawing one scenario at random from |$C_{e}^{m}$| for all 18 events. It was confirmed empirically that the distribution of the resulting Anderson–Darling p-values closely approximates a uniform distribution (e.g. ≃5 per cent of the synthetic p-values had p ≤ 0.05), and furthermore, that the standard deviations of the synthetic |$R_{e}^{m}$| values behaved as expected.
5 RESULTS
5.1 Case studies
5.1.1 2011/03/11 Tohoku Tsunami, Mw 9.1 (Event ID ‘KJ2’)
For the Tohoku event the best-fitting members of the VAUS and HS ‘corresponding family of model scenarios’ show generally good agreement with DART buoy observations (Fig. 5). In contrast the best-fitting FAUS scenario produces distinctly different waves with smaller amplitude and longer period (Fig. 5). The scenarios in Fig. 5 represent a single earthquake for each model type and may not reflect the best scenario at each gauge. However even if the best scenario is selected on a per-gauge basis, the previous statements hold (Fig. 6).
Figure 5. De-tided stage time-series (m) observed at DART buoys during the 2011 Tohoku tsunami (Mw 9.1), with best-fitting model scenarios from |$C_{\textrm{KJ2}}^{\textrm{FAUS}}$|, |$C_{\textrm{KJ2}}^{\textrm{VAUS}}$| and |$C_{\textrm{KJ2}}^{\textrm{HS}}$| (i.e. those which minimize eq. 4). See Fig. 1 for DART buoy locations. Modelled time-series have been shifted by the optimal δt (eq. 3). The number in the legend is a scenario ID.
Clearly the FAUS model is unable to generate tsunami time-series similar to the Tohoku event, whereas the other models perform quite well. The Tohoku earthquake is known to have exhibited a small rupture length compared with median scaling relation predictions (Allen & Hayes 2017). Although previous studies have modelled the resulting tsunami observations using a uniform-slip earthquake source (MacInnes et al. 2013; Allen & Greenslade 2016; An et al. 2018), they used a small rupture area which cannot be reproduced with the FAUS approach.
The poor performance of the FAUS approach is also reflected in the distribution of scenario stage-ranges both near and far from the earthquake source (Fig. 7). At 18 of the 28 DART buoys, every scenario in |$C_{\textrm{KJ2}}^{\textrm{FAUS}}$| has a stage-range below the observed value, often by a significant margin (Fig. 7). The downward bias in the FAUS model is pronounced not only for stations near the source, but also for stations on the other side of the Pacific Ocean near South America (e.g. 32411, 32413) and North America (e.g. 46409, 46411; see Fig. 1 for DART locations). This suggests that even in the far field, the tsunami stage-range can be sensitive to variations in the earthquake area, consistent with previous results derived from idealized simulations of trans-Pacific tsunamis (Gica et al. 2007; Butler et al. 2017). It is often suggested that far-field tsunamis are insensitive to rupture details beyond the seismic moment (e.g. Okal 1988; Geist 2002; Okal & Synolakis 2008), but this is not supported by the results herein.
Figure 7. Distribution of modelled stage-ranges at each DART buoy site (boxplots) in each ‘corresponding family of model scenarios’ associated with the 2011 Tohoku tsunami (i.e. |$C_{\textrm{KJ2}}^{\textrm{FAUS}}$|, |$C_{\textrm{KJ2}}^{\textrm{VAUS}}$| and |$C_{\textrm{KJ2}}^{\textrm{HS}}$|). The observed stage-range is depicted as a vertical line.
In contrast to the FAUS scenarios, both the VAUS and HS scenarios exhibit greater stage-range variability and consistently envelope the observed Tohoku stage-ranges (Fig. 7). The HS model generates higher waves on average than the VAUS model, but both models have similar relative stage-range variability (emphasized by the logarithmic x-axis in Fig. 7). The variability may be quantified at any individual gauge using the standard-deviation of log10(stage-range) over all scenarios, which is approximately 0.18 for HS, 0.2 for VAUS and 0.1 for FAUS (using the average value over all DART buoys in Fig. 7). This represents each model’s aleatory variability, that is the irreducible stage-range variation when the earthquake’s location and magnitude are approximately specified, as embedded in the definition of |$C_{e}^{m}$|. The HS and VAUS values are broadly consistent with the log10 residual standard deviations reported elsewhere when predicting DART buoy amplitudes using only crude earthquake magnitude and location information [0.17 in Reymond et al. (2012); 0.2–0.25 in Okal et al. (2013); although these estimates do not account for intra-event dependence].
The results in Fig. 7 also permit an assessment of the ‘size’ of the observed tsunami, relative to the corresponding scenarios from each model. The Tohoku observations mostly fall within the interquartile range of the HS scenarios, with median-coverage-statistic |$F_{\textrm{KJ2}}^{\textrm{HS}}=0.42$| suggesting the event size is ‘intermediate’ relative to this model (Fig. 7). For the VAUS model most Tohoku observations fall in the upper quartile of scenarios, with median-coverage-statistic |$F_{\textrm{KJ2}}^{\textrm{VAUS}} = 0.76$|, both indicating the observations are ‘large’ but not ‘unprecedented’ relative to the VAUS model. The contrast with the FAUS model (|$F_{\textrm{KJ2}}^{\textrm{FAUS}} = 1.0$|) highlights the importance of rupture size variability for uniform-slip tsunami scenarios. Furthermore, the fact that HS scenarios tend to be larger than VAUS scenarios in both the near and far field supports the conclusions of Li et al. (2016) that slip heterogeneity affects tsunamis even far from the earthquake source.
5.1.2 2010 Mw 8.8 Maule tsunami (Event ID ‘SA3’)
In contrast to the Tohoku event, for the Maule event all models produced a scenario in |$C_{\textrm{SA3}}^{m}$| with reasonable agreement to observations (Fig. 8). Differences among the models show up more clearly in their scenario stage-range distributions (Fig. 9), which echo the results for the Tohoku tsunami. The FAUS model exhibits less stage-range variability than the other models (Fig. 9), and the relative variability is similar to the Tohoku case (average standard deviation of log10(stage-range) = 0.19 for HS, 0.2 for VAUS and 0.12 for FAUS). It is clear that the HS model has a higher median stage-range than either of the uniform-slip models (Fig. 9). DART observations of the Maule event appear intermediate relative to the HS model (median-coverage-statistic |$F_{\textrm{SA3}}^{\textrm{HS}} = 0.50$|), and moderately high compared with the VAUS model (|$F_{\textrm{SA3}}^{\textrm{VAUS}} = 0.71$|) and the FAUS model (|$F_{\textrm{SA3}}^{\textrm{FAUS}} = 0.76$|).

De-tided stage time-series (m) observed at DART buoys during the 2010 Maule tsunami (Mw 8.8), with best-fitting model scenarios from |$C_{\textrm{SA3}}^{\textrm{FAUS}}, C_{\textrm{SA3}}^{\textrm{VAUS}}$| and |$C_{\textrm{SA3}}^{\textrm{HS}}$| (i.e. those which minimize eq. 4). See Fig. 1 for DART buoy locations. Vertical lines denote 1 hr spacing, and modelled time-series have been shifted by the optimal δt (eq. 3). The number in the legend is a scenario ID.

Distribution of modelled stage-ranges at each DART buoy site (boxplots) in each ‘corresponding family of model scenarios’ associated with the 2010 Maule tsunami. The observed stage-range during the 2010 Maule tsunami is depicted as a vertical line.
5.2 Goodness-of-fit for the ‘best-fitting’ scenarios for all events
Fig. 10 shows the ‘best-fitting’ scenario goodness-of-fit statistic for all events and models (i.e. the minimum of eq. (4) over all scenarios in |$C_{e}^{m}$|). The ‘best-fitting’ HS and VAUS scenarios tend to show comparably good fit to the observed data, whereas the FAUS model achieves similar performance for some events but is significantly worse for others (Fig. 10). These results further emphasize the importance of rupture-area variability in allowing the generation of stochastic scenarios that ‘look like’ observations, as shown for the Tohoku event (Section 5.1). Slip-heterogeneity has limited additional benefit in terms of the quality of the ‘best-fitting’ scenarios, although it is stressed that these results refer to uncalibrated model scenarios. For applications involving source calibration, it is likely that the benefits of slip-heterogeneity would be more pronounced if enough data are used (e.g. Yamazaki et al. 2018). However, An et al. (2018) found that for a range of recent tsunamis, the leading wave of near-field DART buoy and tidal gauge data could be simulated almost equally well using calibrated source models with either heterogeneous-slip or variable-area uniform-slip. This is qualitatively consistent with the results herein, although our results are more focussed on regional and far-field distances (due to the distribution of DART buoys).
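For readers reproducing this kind of comparison, a minimal sketch of the ‘best-fitting scenario’ selection is given below. It uses a generic normalized least-squares misfit minimized over an integer time shift as a stand-in for the statistic of eq. (4) and the shift δt of eq. (3); all names are illustrative and the exact statistic used in this study may differ in detail.

```python
import numpy as np

def misfit(obs, model, max_shift):
    """Stand-in goodness-of-fit: normalized least-squares misfit between observed
    and modelled de-tided stage series, minimized over an integer sample shift.
    obs, model: dicts mapping DART id -> 1-D arrays on a common time step.
    Note: np.roll applies a circular shift, acceptable only as a crude sketch."""
    best = np.inf
    for shift in range(-max_shift, max_shift + 1):
        num = sum(np.sum((obs[d] - np.roll(model[d], shift)) ** 2) for d in obs)
        den = sum(np.sum(obs[d] ** 2) for d in obs)
        best = min(best, num / den)
    return best

def best_fitting_scenario(obs, scenarios, max_shift=60):
    """Index of the scenario in a corresponding family C_e^m minimizing the misfit."""
    return int(np.argmin([misfit(obs, s, max_shift) for s in scenarios]))
```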

Best value of the scenario goodness-of-fit statistic (eq. 4) for scenarios in |$C_{e}^{m}$|, by model type and event (see Table 1 for event IDs). Lower values of |$G_{e}^{s}$| correspond to a better fit. The inset shows the modelled and observed (black) time-series for event NH1, which has a relatively poor ‘best scenario’ goodness-of-fit and was recorded at only one DART buoy.
Fig. 10 also suggests that some tsunami events are easier to model well than others. The events with poorer goodness-of-fit often exhibit relatively short-period tsunami waveforms, including the 2009 Samoa event KT2 (Zhou et al. 2012; Hossen et al. 2018), the 2006 Kurils event KJ1 (Kowalik et al. 2008), and the 2009 New Hebrides event NH1 (Fig. 10). The use of finer unit-sources and/or a dispersive hydrodynamic model may improve the fit in these cases. Improvements may also be sought by treating earthquake-doublet-type scenarios, noting the 2009 Samoa event was generated by a doublet normal-thrust source (Lay et al. 2010), while the 2009 New Hebrides event was preceded by an Mw 7.6 thrust event 15 min earlier (according to the GCMT catalogue). Tests along these lines are beyond the scope of this study because they would be computationally expensive to support for the entire scenario database.
5.3 Model representation of tsunami ‘size’ variability
Section 4.3 showed that for an ‘ideal model’ m the ‘coverage-statistic’ |$F_{e,d}^{m}$| (giving the fraction of scenarios with stage-range below the observation) should behave like a uniformly distributed random variable at an individual DART, albeit with dependence between different DARTs d for the same event e. Graphically the HS model seems consistent with the ideal behaviour, with no obvious preference for high or low values of |$F_{e,d}^{\textrm{HS}}$| over the 18 tsunami events (top-left-hand panel of Fig. 11). If a random sample with one site d per event e is taken from |$F_{e,d}^{\textrm{HS}}$| (to remove the intra-event dependence), then on average it has a standard-deviation of 0.25 and mean of 0.49, which is quite consistent with expectations from a uniform distribution. The intra-event dependence is visually evident (Fig. 11), with some tsunamis appearing ‘large’ at most gauges relative to the HS scenarios (e.g. the 2013/02/06 Mw 7.9 New Hebrides event), while other events are ‘small’ at most gauges (e.g. the 2014/04/01 Mw 8.2 South America event). Importantly there is no obvious relationship between |$F_{e,d}^{\textrm{HS}}$| and the tsunami propagation time (Fig. 11), which suggests the HS model reasonably represents tsunami ‘size’ variability at both near-field and far-field sites. In terms of the median-coverage-statistic |$F_{e}^{\textrm{HS}}$| and the associated |$R_{e}^{\textrm{HS}}$| values (the latter being uniformly distributed under the null-hypothesis that HS is an ideal model), the Anderson–Darling test described in Section 4.3 cannot reject the null-hypothesis that the observed stage-ranges were generated by randomly sampling one scenario from each |$C_{e}^{\textrm{HS}}$| (p = 0.50). In addition the standard deviation of the |$R_{e}^{\textrm{HS}}$| values is within the expected 95 per cent range (0.228 ∈ [0.219, 0.349], Section 4.3), although close to the lower limit. There is no obvious magnitude-dependence to the coverage statistics, with |$F_{e}^{\textrm{HS}}$| and Mw not showing any clear relation (Fig. 12), and no evidence of strong correlation between Mw and |$R_{e}^{\textrm{HS}}$| (Spearman rank-correlation ρ = 0.1, p = 0.68; to treat ties the latter values reflect the mean of 1000 random tie-breaks). In summary, the tests do not identify obvious biases in the HS model’s representation of tsunami size.
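The uniformity test and the standard-deviation check can be reproduced with a short Monte-Carlo sketch such as the one below. This is our illustrative code, not the study’s implementation; it assumes the |$R_{e}^{m}$| values have already been computed as in Section 4.3, and the choice of ddof in the standard deviation is an assumption.

```python
import numpy as np

def ad_uniform_stat(u):
    """Anderson-Darling statistic for the null hypothesis u ~ Uniform(0, 1)."""
    u = np.sort(np.clip(np.asarray(u, float), 1e-12, 1 - 1e-12))
    n = len(u)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1])))

def ad_uniform_pvalue(u, n_sim=50000, seed=0):
    """Monte-Carlo p-value for the Anderson-Darling uniformity test."""
    rng = np.random.default_rng(seed)
    a_obs = ad_uniform_stat(u)
    a_sim = np.array([ad_uniform_stat(rng.uniform(size=len(u)))
                      for _ in range(n_sim)])
    return float(np.mean(a_sim >= a_obs))

def sd_95_range(n, n_sim=50000, seed=0):
    """95 per cent range of the sample standard deviation of n Uniform(0, 1) draws,
    for comparison with the observed standard deviation of the R_e values."""
    rng = np.random.default_rng(seed)
    sds = np.std(rng.uniform(size=(n_sim, n)), axis=1, ddof=1)
    return np.quantile(sds, [0.025, 0.975])
```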
‘Time for the tsunami to reach each DART buoy’ versus |$F_{e,d}^{m}$|, for all 18 tsunami events e in our test set, each model type m, and each DART buoy d. For an ‘ideal model’ the values should be uniformly distributed in [0, 1], with some dependence at different DARTs for the same event.

Median-coverage-statistic versus magnitude for the 18 test events.
The performance of the uniform-slip models is much poorer. Both the FAUS and VAUS models exhibit a clear majority of |$F_{e,d}^{m}$| values above 0.5, indicating these models generate small tsunamis too often (bottom panels of Fig. 11). If a random sample with one site d per event e is selected from |$F_{e,d}^{m}$| to remove the intra-event dependence, then on average it has a high mean value (0.66 FAUS; 0.64 VAUS). The biases are similar at both near-field and far-field sites (Fig. 11). Although it is often suggested that slip-heterogeneity is most significant in the near-field (e.g. Okal 1988; Geist 2002; Okal & Synolakis 2008), these results demonstrate that statistical biases in the FAUS and VAUS scenarios persist into the far field. This is consistent with previous suggestions by Li et al. (2016) based on comparison of uniform and HS models, but the current results considerably strengthen these conclusions because they are based on direct comparison of models with observations, over much larger spatial scales than the latter study.
These results are supported by the Anderson–Darling test (Section 4.3), which indicates that the median-coverage-statistic values would be unlikely if either uniform-slip model were correct (VAUS p = 0.037; FAUS p = 0.003). The biases in our uniform-slip models do not have any obvious magnitude dependence (Fig. 12), although the Spearman rank correlations between Mw and |$R_{e}^{m}$| are weakly suggestive of increased underprediction at high magnitudes (FAUS ρ = 0.43, p = 0.08; VAUS ρ = 0.43, p = 0.08).
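The tie-break averaging used for these Spearman correlations (required because several events share the same Mw) can be implemented as in the sketch below; the function name is ours and the approach is a straightforward reading of the ‘mean of 1000 random tie-breaks’ described above.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr

def spearman_random_ties(x, y, n_rep=1000, seed=0):
    """Spearman rank-correlation with ties broken at random, averaged over
    n_rep random tie-breaks."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    rhos, pvals = [], []
    for _ in range(n_rep):
        # Rank a random permutation of the data so tied values receive ranks
        # in random order, then undo the permutation.
        perm = rng.permutation(len(x))
        rx = np.empty(len(x)); rx[perm] = rankdata(x[perm], method='ordinal')
        ry = np.empty(len(y)); ry[perm] = rankdata(y[perm], method='ordinal')
        rho, p = pearsonr(rx, ry)  # Pearson correlation of ranks = Spearman rho
        rhos.append(rho); pvals.append(p)
    return float(np.mean(rhos)), float(np.mean(pvals))
```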
An important difference between the FAUS and VAUS models is that only the former regularly fails to envelop the observed stage-ranges (i.e. often |$F_{e,d}^{\textrm{FAUS}} = 1$|, Fig. 11). This behaviour was seen in the Tohoku example (Section 5.1), and also occurs for other events (Fig. 11). Although the VAUS model has uniform slip, it generates tsunamis with much greater stage-range variability than FAUS because of variation in rupture area and mean slip. This allows the VAUS tsunami stage-ranges to consistently envelop the data, despite having an overall downward bias (Fig. 11).
5.4 Earthquake properties of good-fitting vs random scenarios
To better understand the drivers of model bias, the earthquake properties of good-fitting and random scenarios in |$C_{e}^{m}$| are compared. Good-fitting scenarios are defined as those with the smallest few values of the scenario goodness-of-fit statistic |$G_{e}^{s}$| (eq. 4) among all scenarios |$s \in C_{e}^{m}$|. The idea is that if a model is biased but can still produce some scenarios that reasonably agree with observations, then good-fitting scenarios are likely to have earthquake properties which differ systematically from random scenarios over the 18 test events. In addition to illuminating model biases, such differences can be used to derive model-specific bias-adjustments, as illustrated below.
Focussing first on the VAUS model, significant biases are suggested by comparison of the rupture area and slip for good-fitting and random scenarios in |$C_{e}^{\textrm{VAUS}}$| for the 18 test events (top row of Fig. 13). Good-fitting VAUS scenarios are more likely to have small area and high slip compared to random scenarios in their associated |$C_{e}^{\textrm{VAUS}}$| (Fig. 13). This can be quantified using the percentiles of slip of the good-fitting scenarios for each event e, relative to random scenarios in their |$C_{e}^{\textrm{VAUS}}$| (top-right-hand panel of Fig. 13). For an ideal model the slip percentiles for the 18 best-fitting scenarios should behave like a random sample from a uniform distribution (a similar statement obviously holds for percentiles of any earthquake statistic, because for an ideal model the observations behave like randomly selected scenarios). However slip percentiles of the 18 best-fitting VAUS scenarios show a clear preference toward high values (Fig. 13), which is very unlikely to have occurred at random if they were truly drawn from a uniform distribution (p = 0.0006 using an Anderson–Darling test). A similar result holds using the ‘2nd best-fitting’ VAUS scenarios (p = 0.0012) and the ‘3rd best-fitting’ VAUS scenarios (p = 0.023, Fig. 13). Clearly the VAUS model generates too many large-area, low-slip earthquake scenarios, at least for the purpose of stochastic tsunami simulation. This is qualitatively consistent with recent findings by An et al. (2018) that compact-area uniform-slip sources better simulate the leading wave of near-field DART and tidal gauge data.
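The per-event slip-percentile comparison can be computed as sketched below (illustrative names only); the 18 resulting values can then be passed to a uniformity test such as the Anderson–Darling sketch given earlier.

```python
import numpy as np

def slip_percentile_of_best(scenario_slips, best_index):
    """Percentile of the best-fitting scenario's slip among all stochastic
    scenarios in C_e^m for one event. Under an ideal model, collecting this
    value over the 18 test events should yield a sample that looks Uniform(0, 1)."""
    scenario_slips = np.asarray(scenario_slips, float)
    return float(np.mean(scenario_slips < scenario_slips[best_index]))
```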

Properties of earthquakes for random and good-fitting scenarios in |$C_{e}^{m}$|. Top-left-hand panel: Mw versus rupture area for earthquake scenarios in |$C_{e}^{\textrm{VAUS}}$| for all 18 test events, with the best three scenarios for each test event highlighted. Top-middle panel: Mw versus slip for earthquake scenarios in |$C_{e}^{\textrm{VAUS}}$| for all 18 test events, with the best three scenarios for each test event highlighted. Top-right-hand panel: QQ-plot comparing the percentiles of slip for the best three scenarios in |$C_{e}^{\textrm{VAUS}}$| against uniformly distributed percentiles. Bottom-left-hand panel: Mw versus rupture area for earthquake scenarios in |$C_{e}^{\textrm{HS}}$| for all 18 test events, with the best three scenarios for each test event highlighted. Bottom-middle panel: Mw versus maximum-slip for earthquake scenarios in |$C_{e}^{\textrm{HS}}$| for all 18 test events, with the best three scenarios for each test event highlighted. Bottom-right-hand panel: QQ-plot comparing the percentiles of maximum-slip for the best three scenarios in |$C_{e}^{\textrm{HS}}$| against uniformly distributed percentiles.
Similar reasoning can be applied to the HS model scenarios (bottom row of Fig. 13). In contrast to the VAUS model, for the HS model the good-fitting scenarios show similar variability to random scenarios in terms of earthquake area and maximum-slip (Fig. 13). Graphically the good-fitting scenarios show a weak preference towards lower maximum-slip percentiles (bottom right-hand panel of Fig. 13), but for the best-fitting HS scenarios this does not represent a statistically significant deviation from a uniform distribution (p = 0.212 using an Anderson–Darling test). A similar result holds using the ‘2nd best-fitting’ HS scenarios (p = 0.635), and the ‘3rd best-fitting’ HS scenarios (p = 0.181). The standard-deviations were checked in all cases and are consistent with these results.
An implication is that for the purposes of generating realistic stochastic tsunamis, VAUS scenarios should be assigned a ‘relatively compact’ length and width distribution as compared with the scaling relation predictions used herein (eq. 2, Strasser et al. 2010). A bias-adjustment along these lines can be implemented by assigning non-uniform weights to scenarios, in a way that corrects for the non-uniformity of points in the top-right-hand panel of Fig. 13. For example, based on the bilinear fit in the top-right-hand panel of Fig. 13, only around 20 per cent of the best-fitting VAUS scenarios have slip percentile in the bottom 60 per cent of the stochastic scenarios from |$C_{e}^{\textrm{VAUS}}$|. To correct for this each VAUS scenario should be assigned a weight, with 20 per cent of the total weight assigned to scenarios with slip percentile <60 per cent, and the remaining 80 per cent of the weight assigned to scenarios with slip percentile ≥60 per cent. Mathematically the weights should be proportional to the inverse derivative of a curve describing the non-uniformity of the slip percentiles (e.g. either the bilinear or cubic fit in the top-right-hand panel of Fig. 13). By construction the good-fitting scenarios will appear uniformly distributed relative to such bias-adjusted stochastic scenarios. By reducing the dominance of ‘high-area, low-slip’ scenarios, the associated stochastic tsunami distribution should be more realistic.
To illustrate how bias-adjustment can improve the model’s performance, we resampled VAUS scenarios according to the weights above, and then reran the statistical tests associated with the DART coverage statistics (Section 5.3). In this case the null-hypothesis could not be rejected (Anderson–Darling p = 0.5), and similarly the standard deviation of the |$R_{e}^{\textrm{VAUS}}$| values was within the typical range (0.24 ∈ [0.219, 0.349]).
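A minimal sketch of this bias-adjustment and resampling is given below. It hard-codes the bilinear worked example above (20 per cent of the weight assigned below the 60th slip percentile) rather than a fitted curve, and all names and defaults are illustrative rather than the study’s implementation.

```python
import numpy as np

def bias_adjusted_weights(slip_percentile, breakpoint=0.6, mass_below=0.2):
    """Illustrative bilinear bias-adjustment (Section 5.4). Scenarios whose slip
    percentile lies below `breakpoint` collectively receive `mass_below` of the
    total weight; the remainder goes to the other scenarios. More generally the
    weights are proportional to the inverse derivative of the fitted curve in
    the top-right-hand panel of Fig. 13."""
    slip_percentile = np.asarray(slip_percentile, float)
    below = slip_percentile < breakpoint
    w = np.where(below,
                 mass_below / max(below.sum(), 1),
                 (1.0 - mass_below) / max((~below).sum(), 1))
    return w / w.sum()

# Example: resample a VAUS scenario family with these weights, then re-run the
# coverage-statistic tests of Section 5.3 on the resampled family.
# rng = np.random.default_rng(0)
# idx = rng.choice(len(weights), size=len(weights), replace=True, p=weights)
```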
The comparison of good-fitting and random VAUS and HS scenarios was informative because they have sufficient variability for biases to become apparent. It turns out that a similar approach is not informative for the FAUS scenarios, which by construction show little variation in area or slip. Thus there is no opportunity to distinguish good-fitting and random scenarios based on these properties. Instead the FAUS scenarios are completely unable to represent some observations well (e.g. the Tohoku tsunami, Section 5.1), and this bias cannot be corrected by simply adjusting the scenario weights. A fundamental advantage of more variable models is that even if they have bias, there remains some capacity to correct for this by reweighting the scenarios, so long as some scenarios are similar to observed tsunamis.
6 EFFECT OF DEPTH-VARYING RIGIDITY
The analyses above assumed a constant rigidity. Although this approach is commonly applied for stochastic tsunami generation (e.g. Goda et al. 2015; Ruiz et al. 2015; Murphy et al. 2016), there is evidence that depth-varying rigidity, with low values at shallow depths, is important for simulating some historical tsunamis (Geist & Bilek 2001; Hébert et al. 2012; Tanioka et al. 2017). To understand how this affects the models tested herein, a new ‘corresponding family of model scenarios’ was defined for each event and model type assuming a depth-varying rigidity profile. The depth-varying rigidity profile was derived from the depth-versus-rigidity data of Bilek & Lay (1999) using their constant-stress-drop end-member model (top-left-hand panel of Fig. 14) and was used to assign a rigidity to each unit-source based on its centroid depth. The magnitude for every model scenario in the database was then recomputed, without changing its slip or area. The resulting ‘depth-varying-rigidity magnitude’ allows redefining |$C_{e}^{m}$| for each model type m and event e. Scenarios were included in the redefined |$C_{e}^{m}$| if their ‘depth-varying rigidity magnitude’ was within ±0.15 of the GCMT magnitude (Table 1), and their parent FAUS scenario was ‘near’ the GCMT hypocentre (Section 4.1). The analyses of Section 5 were then repeated. For brevity the main results are summarized in Fig. 14, while other plots corresponding to Figs 5–13 are provided in the Supporting Information.
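The magnitude recomputation amounts to summing the seismic moment over unit-sources with depth-dependent rigidity, as in the sketch below. This is illustrative code only; it assumes the standard moment-magnitude relation Mw = (2/3)(log10 M0 − 9.05) with M0 in N m, which may differ in detail from the study’s implementation.

```python
import numpy as np

def depth_varying_rigidity_mw(slip_m, area_m2, rigidity_pa):
    """Recompute a scenario's moment magnitude from per-unit-source slip (m),
    area (m^2) and a rigidity (Pa) assigned from each unit-source centroid depth."""
    m0 = np.sum(np.asarray(rigidity_pa) * np.asarray(area_m2) * np.asarray(slip_m))
    return (2.0 / 3.0) * (np.log10(m0) - 9.05)

# A scenario enters the redefined C_e^m if abs(Mw - Mw_GCMT) <= 0.15 and its
# parent FAUS scenario is 'near' the GCMT hypocentre (Section 4.1).
```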
For individual events the properties of |$C_{e}^{m}$| can change significantly using the depth-varying rigidity model. For example, the 2010 Mw 7.9 Mentawai earthquake-tsunami (event ID Su3) appeared ‘large’ relative to the constant shear modulus HS scenarios in |$C_{\textrm{Su3}}^{\textrm{HS}}$| (Fig. 11), whereas it has a ‘typical’ size relative to the depth-varying rigidity HS scenarios (Fig. 14). This makes sense considering the event was a tsunami-earthquake that occurred over a shallow part of the Sunda-Arc subduction zone, and so corresponding depth-varying-rigidity scenarios tend to have low rigidity and higher slip, with the latter leading to larger tsunamis overall. Similarly, the higher-magnitude events (e.g. Tohoku, Maule) tend to have higher median-coverage-statistic |$F_{e}^{m}$| values when the rigidity varies with depth (compare Figs 11 and 14). This is because |$C_{e}^{m}$| includes a significant fraction of deeper earthquakes that have higher rigidity and lower slip, with the latter reducing the size of tsunamis overall.
Despite these differences, to first-order the statistical performance of all models is similar to the constant rigidity case (Fig. 14). The best-fitting scenarios show a similar degree of agreement with data, with the FAUS model performing more poorly than the VAUS and HS models (Fig. S6). The uniform-slip models (VAUS and FAUS) continue to produce small tsunamis too often (mean |$F_{e,d}^{m}$| of 0.69 and 0.66, respectively, based on the average of random samples containing a single location for each event, to remove intra-event dependence). The HS model does not show an obvious tendency to over- or under-predict tsunami size in the near or far field (Fig. 14), with the |$F_{e,d}^{m}$| values having a mean of 0.51 and standard-deviation of 0.23 (average of random samples with a single location per event). These results are supported by the Anderson–Darling test associated with the median-coverage-statistic values (FAUS p = 0.0016; VAUS p = 0.010; HS p = 0.24).
One difference is that for the variable-rigidity HS model, the standard-deviation test now identifies the |$R_{e}^{\textrm{HS}}$| values as underdispersed (standard deviation of 0.204, below the null-hypothesis 95 per cent interval [0.219, 0.349]). Recall that in the constant rigidity case the value was 10 per cent larger (0.228) and inside the 95 per cent interval. Although the use of multiple statistical tests does increase the chance of detecting a ‘significant’ deviation from the ideal model null-hypothesis even if none exists, the result nonetheless suggests the definition of |$C_{e}^{\textrm{HS}}$| may induce too much variance in the stage-range values, which is exacerbated through the use of depth-varying rigidity. If correct, this may reflect limitations in the criteria for scenario inclusion in |$C_{e}^{m}$|, the HS model itself, or the relatively coarse unit-sources and non-dispersive linear tsunami propagation model. While it has not yet been possible to identify a particular cause of this result, in modelling applications it could be pragmatically dealt with by slightly down-weighting scenarios with high or low stage-range quantiles at a location of interest.
Subject to the above qualifications, it appears reasonable to use the HS model with either constant or depth-varying rigidity. However for individual events the model scenarios can vary significantly. Considering that some shallow earthquake-tsunamis are well modelled with constant rigidity (e.g. Newman et al. 2011a) while in other cases low rigidity is necessary (e.g. Geist & Bilek 2001; Hébert et al. 2012), it seems likely that the best approach will vary from site-to-site.
7 SUMMARY AND CONCLUSIONS
Diverse models for stochastic tsunami scenario generation have been used in the literature. Such models will not necessarily give a good representation of the variability of natural tsunamis, because they require assumptions about imperfectly understood aspects of earthquakes (e.g. scaling relations and their predictive uncertainty; rigidity; slip heterogeneity), and usually make additional approximations for computational tractability (e.g. the use of pure thrust events; instantaneous rupture; non-dispersive hydrodynamics). Therefore it is important to directly test the resulting stochastic tsunamis for statistical biases. This study developed an approach to model testing based on the comparison of multiple observed events with objectively defined random samples of model scenarios (i.e. |$C_{e}^{m}$|), where the latter have ‘similar location and magnitude’ as their corresponding observed event e. The model scenarios were tested without calibration to the tsunami observations, to better represent the requirements of tsunami hazard assessment and pre-instrumental seismology. Techniques like this can be used to test stochastic tsunami simulators, understand and partially correct their biases, and provide better justification for their use in applications.
For the models tested herein the key results are:
The uniform-slip models (FAUS and VAUS) produced downward-biased tsunami stage-ranges at all magnitudes, at DART buoys both near and far from the earthquake source. This bias was not evident in the HS model. The results confirm that spatial slip heterogeneity significantly affects the statistical properties of tsunami size, both near and far from the earthquake source (Li et al. 2016). Furthermore, the difference in the FAUS and VAUS scenario stage-range variability confirms that even with uniform-slip, rupture area variations can substantially affect tsunami size in the near and far field (Gica et al. 2007; Butler et al. 2017).
The uniform-slip model with rupture-area variability (VAUS) was able to generate best-fitting scenarios with performance comparable with the best HS scenarios (despite the overall downward bias of random VAUS scenarios). In contrast, the best-fitting FAUS scenarios sometimes showed much poorer agreement to observations. For tsunami modelling applications that do not involve source-calibration, this suggests it may be acceptable to use either a HS approach, or a uniform-slip approach combined with a modified treatment of rupture area variability. The latter should promote relatively compact sources as compared with naive application of earthquake scaling relations, to avoid downward bias in the tsunami size. The bias adjustment of Section 5.4 provides one approach to generating ‘appropriately compact’ uniform-slip sources (see Jamelot & Reymond (2014) and An et al. (2018) for alternative approaches in the context of tsunami early warning). In applications it may be beneficial to use both HS and bias-adjusted VAUS approaches, to represent epistemic uncertainties in scenario generation. Furthermore, the VAUS approach may facilitate greater computational efficiency because it involves fewer degrees of freedom. However, for modelling applications that involve source calibration (e.g. finite-fault inversion) the benefits of HS are expected to become more pronounced with enough data.
The above results hold for both the constant rigidity and depth-varying rigidity models used in this study.
For the depth-varying rigidity HS model, the standard deviation of |$R_{e}^{\textrm{HS}}$| was low, suggesting the stage-range variability may be overestimated. Low values were also attained with both the constant rigidity HS model and the bias-adjusted VAUS model, although these cases were within the 95 per cent range expected for an ideal model. On the other hand, for all three models the Anderson–Darling test could not reject that the |$R_{e}^{m}$| values were drawn from a uniform distribution, which is the theoretically ideal behaviour. This result suggests potential for improving the models and/or the definition of |$C_{e}^{m}$|, but additional testing is required to clarify the issue.
The 18 test events used herein were sufficient to identify median biases in the uniform-slip models, but with this sample size it remains challenging to clearly detect magnitude-dependent biases, or biases in the representation of extremes for any given magnitude. Additional tests against earlier historical tsunamis would help further illuminate the models’ performance. It would also be valuable to extend the tests to non-DART data (e.g. tidal gauges and runup observations). The main challenge is that tsunami simulation close to the coast is much more demanding of both input data and computational resources. On the other hand, even if a heterogeneous earthquake source is calibrated to offshore tsunami observations, the resulting nearshore and onshore model predictions may exhibit substantial errors (e.g. MacInnes et al. 2013; An et al. 2018). Thus the onshore biases of stochastic tsunami models may not be analogous to their offshore biases, and tests using non-DART data represent an important avenue for future research.
SUPPORTING INFORMATION
Figure S1 De-tided stage time-series (m) observed at DART buoys during the 2011 Tohoku tsunami (Mw 9.1), with best-fitting model scenarios from |$C_{\textrm{KJ2}}^{\textrm{FAUS}}, C_{\textrm{KJ2}}^{\textrm{VAUS}}$| and |$C_{\textrm{KJ2}}^{\textrm{HS}}$|, defined using depth-varying rigidity.
Figure S2 Same as Fig. 5 for the 2011 Tohoku tsunami, except the per-station best-fitting scenario is plotted (so the model results do not correspond to a single scenario).
Figure S3 Distribution of modelled stage-ranges at each DART buoy site (boxplots) in each ‘corresponding family of model scenarios’ for the 2011 Tohoku tsunami, defined with depth-varying rigidity. The observed stage-range is depicted as a vertical line.
Figure S4 De-tided stage time-series (m) observed at DART buoys during the 2010 Maule tsunami (Mw 8.8), with best-fitting model scenarios from |$C_{\textrm{SA3}}^{\textrm{FAUS}}, C_{\textrm{SA3}}^{\textrm{VAUS}}$| and |$C_{\textrm{SA3}}^{\textrm{HS}}$|, defined using depth-varying rigidity.
Figure S5 Distribution of modelled stage-ranges at each DART buoy site (boxplots) in each ‘corresponding family of model scenarios’ for the 2010 Maule tsunami, defined with depth-varying rigidity.
Figure S6 Best value of the scenario goodness-of-fit statistic for scenarios in |$C_{e}^{m}$|, defined with depth-varying rigidity.
Figure S7 Properties of earthquakes for random and good-fitting scenarios in |$C_{e}^{m}$|, defined with depth-varying rigidity.
Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the paper.
ACKNOWLEDGEMENTS
This paper is published with the permission of the CEO of Geoscience Australia. This project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. We appreciate reviews by Jonathan Griffin, Trevor Allen, Jane Sexton, Stefano Lorito and an anonymous reviewer. The tsunami scenario database and source-code used to construct it are freely available, and may be accessed following instructions at https://github.com/GeoscienceAustralia/ptha/tree/master/ptha_access.