Using a custom metadata file¶

The record-selection routines in djura.record_selection never read a ground motion database directly. They operate on a single, in-memory metadata object: a plain Python dict of NumPy arrays in which the record-level fields and intensity-measure (IM) values are pre-populated in a common schema. The bundled NGA_W2_v2.pickle is just one realisation of that schema.

This means a new database can be incorporated by mapping its records onto the same schema. Once the metadata fields and IM values are populated in the common format, the selection routines run without modification. The remainder of this page documents that schema and shows how to build and use your own metadata file.

Note

You do not need to redistribute or modify the package. A custom metadata file is supplied at run time (see Pointing djura at your file), so the bundled dataset is simply replaced by yours.

The schema at a glance¶

The metadata object is a flat dictionary. Conceptually its keys fall into four groups:

Record-level scalar fields — one value per record, stored as a 1-D array of length N (the number of records). These are identifiers, source and site causal parameters, waveform metadata, and period-independent IMs.
Period vectors — 1-D arrays giving the periods at which the period-dependent IMs are tabulated.
Period-dependent IM arrays — 2-D arrays of shape (N, len(period_vector)), one row per record.
Global scalars — e.g. the damping ratio of the response spectra.

Every record-level array must share the same length N and the same row ordering, so that row i refers to the same record across all keys.

Required record-level fields¶

The following keys are read by the selection and reporting routines and must be present. Each is a 1-D array of length N.

Key	dtype	Meaning
`RSN`	int	Unique record identifier (the record sequence number). Used as the primary key throughout selection.
`EQID`	int	Event identifier. Records sharing an `EQID` come from the same earthquake (used to limit how many records are drawn from one event).
`Filename_1`	str	Filename of the first horizontal component.
`Filename_2`	str	Filename of the second horizontal component.
`Filename_vert`	str / object	Filename of the vertical component (may be empty strings if unused).
`EQ_name`	str	Earthquake name.
`EQ_year`	int	Year of the earthquake.
`Station_name`	str	Recording station name.
`magnitude`	float	Moment magnitude.
`mechanism`	int	Fault mechanism code (see Fault mechanism encoding).
`Rjb`	float	Joyner–Boore distance [km].
`Rrup`	float	Rupture distance [km].
`Vs30`	float	Time-averaged shear-wave velocity to 30 m [m/s].
`lowest_usable_freq`	float	Lowest usable frequency [Hz]; used to screen records against the required period range.
`dt`	float	Time step of the waveform [s].
`duration`	float	Record duration [s].
`npts`	int	Number of samples in the waveform.

Optional causal-context fields¶

Additional fields are read only if you pass limits on them via context_limits during selection; otherwise they are ignored. Provide them when you want to filter on them. Common ones include Z1, Z1pt5, Z2pt5, D_hyp, Ds575, Ds595, Tp, rake, dip, strike, Ztor, Rx, rup_width and soil_NEHRP. See djura.record_selection.constants.DB_CAUSAL_PARS for the full list the package recognises by name.

Fault mechanism encoding¶

mechanism is an integer code mapped by djura.record_selection.constants.MECHANISM_MAP:

Code	Mechanism
`0`	strike-slip fault
`1`	normal fault
`2`	reverse fault
`3`	reverse/oblique fault
`4`	normal/oblique fault

Intensity-measure fields¶

IM values follow a strict naming convention so the selection code can locate them automatically. For an IM named <IM>:

Period-independent IMs (PGA, PGV, IA, Ds575, Ds595) are stored directly under the IM name as a 1-D array of length N:

metadata["PGA"]      # shape (N,)
metadata["Ds595"]    # shape (N,)

Period-dependent IMs (SA, Sa_avg2, Sa_avg3, FIV3) require a period vector plus per-component 2-D arrays of shape (N, len(periods)):

Key pattern	Shape	Meaning
`<IM>_1`	`(N, P)`	First horizontal component
`<IM>_2`	`(N, P)`	Second horizontal component
`<IM>_RotD50`	`(N, P)`	RotD50 component (optional, see below)
`<IM>_RotD100`	`(N, P)`	RotD100 component (optional, see below)

Here P is the length of the corresponding period vector. The period vectors expected by the bundled IMs are:

IM family	Period vector key	Used by
`SA`	`Periods_SA`	`SA_1`, `SA_2`, `SA_RotD50`, `SA_RotD100`, `SA_vert`
`Sa_avg2` / `Sa_avg3`	`Periods_Sa_avg`	`Sa_avg2_1` … `Sa_avg3_RotD100`
`FIV3`	`Periods_FIV3`	`FIV3_1`, `FIV3_2`

Important

Periods are matched by value, and the package rounds them to 5 decimal places. Intermediate periods requested during selection are obtained by linear interpolation along the period axis, so the period vectors should span the range you intend to use. Each row of an IM array corresponds to the record in the same row of Filename_1 / RSN.

Component definitions¶

When two horizontal components are available, the selector can build the following component definitions:

geomean — \(\sqrt{IM_1 \cdot IM_2}\)
srss — \(\sqrt{IM_1^2 + IM_2^2}\)
arithmeticmean — \((IM_1 + IM_2)/2\)
rotd50 — read directly from <IM>_RotD50
rotd100 — read directly from <IM>_RotD100

geomean, srss and arithmeticmean are computed on the fly from <IM>_1 and <IM>_2, so they require no extra storage. rotd50 and rotd100 are read directly from precomputed arrays — supply <IM>_RotD50 / <IM>_RotD100 only if you intend to select on those components.

Global scalars¶

Key	Meaning
`damping`	Damping ratio of the tabulated response spectra (e.g. `0.05` for 5 %).

Minimal working example¶

The snippet below assembles a tiny, schema-conformant metadata file from your own data. Replace the placeholder arrays with values mapped from your database. Only SA is populated here; add other IM families the same way.

import pickle
import numpy as np

N = 100                       # number of records
periods_sa = np.array(        # period vector for SA
    [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0],
    dtype="float32")
P = len(periods_sa)

metadata = {
    # --- required record-level fields (length N) ---
    "RSN":                np.arange(1, N + 1, dtype="int32"),
    "EQID":               np.zeros(N, dtype="int32"),
    "Filename_1":         np.array([f"REC{i}_H1.txt" for i in range(N)]),
    "Filename_2":         np.array([f"REC{i}_H2.txt" for i in range(N)]),
    "Filename_vert":      np.array([""] * N, dtype=object),
    "EQ_name":            np.array(["MyEvent"] * N, dtype=object),
    "EQ_year":            np.full(N, 2020, dtype="int16"),
    "Station_name":       np.array([f"STA{i}" for i in range(N)]),
    "magnitude":          np.random.uniform(5.0, 7.5, N).astype("float32"),
    "mechanism":          np.zeros(N, dtype="int16"),       # 0 = strike-slip
    "Rjb":                np.random.uniform(1, 100, N).astype("float32"),
    "Rrup":               np.random.uniform(1, 100, N).astype("float32"),
    "Vs30":               np.random.uniform(180, 760, N).astype("float32"),
    "lowest_usable_freq": np.full(N, 0.1, dtype="float32"),
    "dt":                 np.full(N, 0.005, dtype="float32"),
    "duration":           np.full(N, 30.0, dtype="float32"),
    "npts":               np.full(N, 6000, dtype="int32"),

    # --- period vector + SA component arrays (shape N x P) ---
    "Periods_SA":         periods_sa,
    "SA_1":               np.random.uniform(0.01, 1.0, (N, P)).astype("float32"),
    "SA_2":               np.random.uniform(0.01, 1.0, (N, P)).astype("float32"),

    # --- global scalar ---
    "damping":            0.05,
}

with open("my_metadata.pickle", "wb") as f:
    pickle.dump(metadata, f)

Tip

Keep the row ordering identical across every array — row i must refer to the same physical record in RSN, Filename_1, magnitude, SA_1 and all other keys. This is the single most common source of error when mapping a new database.

Validating your file¶

A quick sanity check before using the file in a selection run:

import pickle
import numpy as np

with open("my_metadata.pickle", "rb") as f:
    m = pickle.load(f)

N = len(m["RSN"])
for key, val in m.items():
    if isinstance(val, np.ndarray) and val.ndim == 1 and key not in (
            "Periods_SA", "Periods_Sa_avg", "Periods_FIV3"):
        assert len(val) == N, f"{key} has length {len(val)}, expected {N}"

assert m["SA_1"].shape == (N, len(m["Periods_SA"]))
print(f"OK: {N} records, {len(m['Periods_SA'])} SA periods")

Pointing djura at your file¶

Custom metadata is supplied at run time through the DJURA_METADATA_PATH environment variable. When set, it fully bypasses the bundled download and the selection routines use your file instead:

# bash / macOS / Linux
export DJURA_METADATA_PATH=/path/to/my_metadata.pickle

# PowerShell
$env:DJURA_METADATA_PATH = "C:\path\to\my_metadata.pickle"

# CMD
set DJURA_METADATA_PATH=C:\path\to\my_metadata.pickle

After that, the API is unchanged:

from djura.record_selection import GCIM

gcim = GCIM()            # loads my_metadata.pickle via DJURA_METADATA_PATH
gcim.get_metadata_parameters()   # inspect the keys it loaded

Both .pickle/.pkl and .npz files are recognised by the lower-level reader; the pickled dict form shown above is the simplest and is what the bundled dataset uses.

Note

Set the environment variable before importing djura (or before the first call that loads the dataset), since the metadata is loaded at most once per process and cached.

What you can omit¶

You only need to populate the fields and IMs your selection actually uses:

IM families you never select on can be left out entirely (e.g. skip all FIV3_* keys and Periods_FIV3 if you do not use FIV3).
RotD50 / RotD100 arrays are needed only for those component definitions; geomean works from _1 / _2 alone.
Optional causal fields are read only when you pass matching context_limits.

Conversely, the required record-level fields in the table above should always be present, because they are used for identification, event grouping, usable period screening, and the selection report.