Using a custom metadata file¶
The record-selection routines in djura.record_selection never read a
ground motion database directly. They operate on a single, in-memory
metadata object: a plain Python dict of NumPy arrays in which the
record-level fields and intensity-measure (IM) values are pre-populated in a
common schema. The bundled NGA_W2_v2.pickle is just one realisation of
that schema.
This means a new database can be incorporated by mapping its records onto the same schema. Once the metadata fields and IM values are populated in the common format, the selection routines run without modification. The remainder of this page documents that schema and shows how to build and use your own metadata file.
Note
You do not need to redistribute or modify the package. A custom metadata file is supplied at run time (see Pointing djura at your file), so the bundled dataset is simply replaced by yours.
The schema at a glance¶
The metadata object is a flat dictionary. Conceptually its keys fall into four groups:
- Record-level scalar fields: one value per record, stored as a 1-D array of length N (the number of records). These are identifiers, source and site causal parameters, waveform metadata, and period-independent IMs.
- Period vectors: 1-D arrays giving the periods at which the period-dependent IMs are tabulated.
- Period-dependent IM arrays: 2-D arrays of shape (N, len(period_vector)), one row per record.
- Global scalars: e.g. the damping ratio of the response spectra.
Every record-level array must share the same length N and the same row
ordering, so that row i refers to the same record across all keys.
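As a rough illustration of the four groups, the sketch below builds a toy dictionary and sorts each key into a group by its type and shape. The key names are the ones used elsewhere on this page; the helper classify_key is hypothetical and is not part of djura.

```python
import numpy as np

N, P = 3, 4  # toy sizes: 3 records, 4 periods

toy = {
    "RSN": np.array([1, 2, 3]),                     # record-level field
    "Periods_SA": np.array([0.1, 0.5, 1.0, 2.0]),   # period vector
    "SA_1": np.ones((N, P)),                        # period-dependent IM array
    "damping": 0.05,                                # global scalar
}


def classify_key(key, value, n_records):
    """Hypothetical helper: assign a key to one of the four schema groups."""
    if not isinstance(value, np.ndarray):
        return "global scalar"
    if value.ndim == 2:
        return "period-dependent IM array"
    if key.startswith("Periods_"):
        return "period vector"
    if value.ndim == 1 and len(value) == n_records:
        return "record-level field"
    return "unknown"


groups = {key: classify_key(key, value, N) for key, value in toy.items()}
```

Any key that comes back as "unknown" (for example a 1-D array whose length is neither N nor that of a period vector) is worth inspecting before the file is used.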
Required record-level fields¶
The following keys are read by the selection and reporting routines and must
be present. Each is a 1-D array of length N.
| Key | dtype | Meaning |
|---|---|---|
| RSN | int | Unique record identifier (the record sequence number). Used as the primary key throughout selection. |
| EQID | int | Event identifier. Records sharing an EQID belong to the same event. |
| Filename_1 | str | Filename of the first horizontal component. |
| Filename_2 | str | Filename of the second horizontal component. |
| Filename_vert | str / object | Filename of the vertical component (may be empty strings if unused). |
| EQ_name | str | Earthquake name. |
| EQ_year | int | Year of the earthquake. |
| Station_name | str | Recording station name. |
| magnitude | float | Moment magnitude. |
| mechanism | int | Fault mechanism code (see Fault mechanism encoding). |
| Rjb | float | Joyner–Boore distance [km]. |
| Rrup | float | Rupture distance [km]. |
| Vs30 | float | Time-averaged shear-wave velocity to 30 m [m/s]. |
| lowest_usable_freq | float | Lowest usable frequency [Hz]; used to screen records against the required period range. |
| dt | float | Time step of the waveform [s]. |
| duration | float | Record duration [s]. |
| npts | int | Number of samples in the waveform. |
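To make the role of the lowest usable frequency concrete, here is a minimal sketch of a period-range screen. It assumes a record is kept when the reciprocal of its lowest usable frequency is at least the longest period of interest; the exact rule djura applies is not documented on this page, and t_max_required is a made-up value.

```python
import numpy as np

lowest_usable_freq = np.array([0.05, 0.1, 0.25, 0.5])  # Hz, one value per record
t_max_required = 4.0  # s, longest period of interest (assumed for illustration)

# The longest usable period of a record is the reciprocal of its
# lowest usable frequency.
longest_usable_period = 1.0 / lowest_usable_freq

# Keep only records whose usable range covers the required period.
keep = longest_usable_period >= t_max_required
```

With these values the first three records pass and the last one (usable only up to 2 s) is screened out.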
Optional causal-context fields¶
Additional fields are read only if you pass limits on them via
context_limits during selection; otherwise they are ignored. Provide them
when you want to filter on them. Common ones include Z1, Z1pt5,
Z2pt5, D_hyp, Ds575, Ds595, Tp, rake, dip,
strike, Ztor, Rx, rup_width and soil_NEHRP. See
djura.record_selection.constants.DB_CAUSAL_PARS for the full list the
package recognises by name.
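The idea of limiting on an optional field can be sketched with plain NumPy masks. The limits dictionary below, in the form {field: (lower, upper)}, is a hypothetical layout chosen for illustration; it is not necessarily the format context_limits expects, and the values are invented.

```python
import numpy as np

metadata = {
    "RSN": np.array([1, 2, 3, 4]),
    "Ztor": np.array([0.0, 2.5, 8.0, 15.0]),     # km
    "Ds595": np.array([12.0, 30.0, 9.0, 45.0]),  # s
}

# Hypothetical limits as {field: (lower, upper)}; not djura's actual format.
limits = {"Ztor": (0.0, 10.0), "Ds595": (10.0, 40.0)}

mask = np.ones(len(metadata["RSN"]), dtype=bool)
for field, (lower, upper) in limits.items():
    if field not in metadata:
        # A limit on a field that was never populated cannot be applied.
        raise KeyError(f"limit given on '{field}', but it is not in the metadata")
    values = metadata[field]
    mask &= (values >= lower) & (values <= upper)

selected_rsn = metadata["RSN"][mask]
```

The KeyError branch mirrors the point made above: a field only has to exist if a limit is actually placed on it.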
Fault mechanism encoding¶
mechanism is an integer code mapped by
djura.record_selection.constants.MECHANISM_MAP:
| Code | Mechanism |
|---|---|
| 0 | strike-slip fault |
| 1 | normal fault |
| 2 | reverse fault |
| 3 | reverse/oblique fault |
| 4 | normal/oblique fault |
Intensity-measure fields¶
IM values follow a strict naming convention so the selection code can locate
them automatically. For an IM named <IM>:
Period-independent IMs (PGA, PGV, IA, Ds575, Ds595)
are stored directly under the IM name as a 1-D array of length N:
metadata["PGA"] # shape (N,)
metadata["Ds595"] # shape (N,)
Period-dependent IMs (SA, Sa_avg2, Sa_avg3, FIV3) require
a period vector plus per-component 2-D arrays of shape (N, len(periods)):
| Key pattern | Shape | Meaning |
|---|---|---|
| <IM>_1 | (N, P) | First horizontal component |
| <IM>_2 | (N, P) | Second horizontal component |
| <IM>_RotD50 | (N, P) | RotD50 component (optional, see below) |
| <IM>_RotD100 | (N, P) | RotD100 component (optional, see below) |
Here P is the length of the corresponding period vector. The period
vectors expected by the bundled IMs are:
| IM family | Period vector key | Used by |
|---|---|---|
| SA | Periods_SA | SA |
| Sa_avg | Periods_Sa_avg | Sa_avg2, Sa_avg3 |
| FIV3 | Periods_FIV3 | FIV3 |
Important
Periods are matched by value, and the package rounds them to 5 decimal
places. Intermediate periods requested during selection are obtained by
linear interpolation along the period axis, so the period vectors should
span the range you intend to use. Each row of an IM array corresponds to the
record in the same row of Filename_1 / RSN.
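The interpolation described above can be sketched with np.interp, applied row by row along the period axis. The arrays are toy values, and the code only illustrates linear interpolation in period; it is not taken from the package.

```python
import numpy as np

# Period vector, rounded to 5 decimal places as the package does.
periods = np.round(np.array([0.1, 0.2, 0.5, 1.0]), 5)

# Toy IM array of shape (N, P): two records, four periods.
sa = np.array([[0.40, 0.60, 0.50, 0.30],
               [0.20, 0.35, 0.30, 0.10]])

target = 0.75  # s, an intermediate period not present in the vector

# np.interp clamps silently outside the tabulated range, so check first.
if not periods[0] <= target <= periods[-1]:
    raise ValueError("target period lies outside the tabulated range")

# Linear interpolation along the period axis, one record at a time.
sa_at_target = np.array([np.interp(target, periods, row) for row in sa])
```

The explicit range check is the practical reason the period vectors should span the range you intend to use: outside it, np.interp would simply return the end values.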
Component definitions¶
When two horizontal components are available, the selector can build the following component definitions:
- geomean: \(\sqrt{IM_1 \cdot IM_2}\)
- srss: \(\sqrt{IM_1^2 + IM_2^2}\)
- arithmeticmean: \((IM_1 + IM_2)/2\)
- rotd50: read directly from <IM>_RotD50
- rotd100: read directly from <IM>_RotD100
geomean, srss and arithmeticmean are computed on the fly from
<IM>_1 and <IM>_2, so they require no extra storage. rotd50 and
rotd100 are read directly from precomputed arrays — supply
<IM>_RotD50 / <IM>_RotD100 only if you intend to select on those
components.
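The three on-the-fly definitions are simple element-wise operations on the two component arrays. A minimal NumPy sketch with toy values (the variable names are illustrative, not the package's internals):

```python
import numpy as np

# Toy component arrays of shape (N, P) = (1, 2).
im_1 = np.array([[3.0, 1.0]])
im_2 = np.array([[4.0, 4.0]])

geomean = np.sqrt(im_1 * im_2)          # sqrt(IM_1 * IM_2)
srss = np.sqrt(im_1**2 + im_2**2)       # sqrt(IM_1^2 + IM_2^2)
arithmeticmean = 0.5 * (im_1 + im_2)    # (IM_1 + IM_2) / 2
```

Each result has the same shape as the inputs, so row i still refers to the same record after the combination.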
Global scalars¶
| Key | Meaning |
|---|---|
| damping | Damping ratio of the tabulated response spectra (e.g. 0.05). |
Minimal working example¶
The snippet below assembles a tiny, schema-conformant metadata file from your
own data. Replace the placeholder arrays with values mapped from your
database. Only SA is populated here; add other IM families the same way.
import pickle
import numpy as np

N = 100  # number of records
periods_sa = np.array(  # period vector for SA
    [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0],
    dtype="float32")
P = len(periods_sa)

metadata = {
    # --- required record-level fields (length N) ---
    "RSN": np.arange(1, N + 1, dtype="int32"),
    "EQID": np.zeros(N, dtype="int32"),
    "Filename_1": np.array([f"REC{i}_H1.txt" for i in range(N)]),
    "Filename_2": np.array([f"REC{i}_H2.txt" for i in range(N)]),
    "Filename_vert": np.array([""] * N, dtype=object),
    "EQ_name": np.array(["MyEvent"] * N, dtype=object),
    "EQ_year": np.full(N, 2020, dtype="int16"),
    "Station_name": np.array([f"STA{i}" for i in range(N)]),
    "magnitude": np.random.uniform(5.0, 7.5, N).astype("float32"),
    "mechanism": np.zeros(N, dtype="int16"),  # 0 = strike-slip
    "Rjb": np.random.uniform(1, 100, N).astype("float32"),
    "Rrup": np.random.uniform(1, 100, N).astype("float32"),
    "Vs30": np.random.uniform(180, 760, N).astype("float32"),
    "lowest_usable_freq": np.full(N, 0.1, dtype="float32"),
    "dt": np.full(N, 0.005, dtype="float32"),
    "duration": np.full(N, 30.0, dtype="float32"),
    "npts": np.full(N, 6000, dtype="int32"),
    # --- period vector + SA component arrays (shape N x P) ---
    "Periods_SA": periods_sa,
    "SA_1": np.random.uniform(0.01, 1.0, (N, P)).astype("float32"),
    "SA_2": np.random.uniform(0.01, 1.0, (N, P)).astype("float32"),
    # --- global scalar ---
    "damping": 0.05,
}

with open("my_metadata.pickle", "wb") as f:
    pickle.dump(metadata, f)
Tip
Keep the row ordering identical across every array — row i must refer to
the same physical record in RSN, Filename_1, magnitude,
SA_1 and all other keys. This is the single most common source of error
when mapping a new database.
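One way to avoid ordering mistakes is to compute a single permutation and apply it to every record-level array in one pass. The sketch below sorts a toy dictionary by RSN; it assumes, as on this page, that period-vector keys start with "Periods_" and that every other array indexed by record has N as its first dimension.

```python
import numpy as np

metadata = {
    "RSN": np.array([30, 10, 20]),
    "magnitude": np.array([7.0, 6.0, 6.5]),
    "Periods_SA": np.array([0.1, 1.0]),
    "SA_1": np.array([[0.3, 0.03], [0.1, 0.01], [0.2, 0.02]]),
    "damping": 0.05,
}

order = np.argsort(metadata["RSN"])  # one permutation for all arrays
n = len(order)

reordered = {}
for key, value in metadata.items():
    is_record_array = (
        isinstance(value, np.ndarray)
        and value.ndim >= 1
        and not key.startswith("Periods_")
        and value.shape[0] == n
    )
    # Period vectors and global scalars are copied over unchanged.
    reordered[key] = value[order] if is_record_array else value
```

Because the same index array is used for the 1-D fields and the 2-D IM arrays, row i keeps pointing at the same physical record everywhere.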
Validating your file¶
A quick sanity check before using the file in a selection run:
import pickle
import numpy as np

with open("my_metadata.pickle", "rb") as f:
    m = pickle.load(f)

N = len(m["RSN"])
for key, val in m.items():
    if isinstance(val, np.ndarray) and val.ndim == 1 and key not in (
            "Periods_SA", "Periods_Sa_avg", "Periods_FIV3"):
        assert len(val) == N, f"{key} has length {len(val)}, expected {N}"

assert m["SA_1"].shape == (N, len(m["Periods_SA"]))
print(f"OK: {N} records, {len(m['Periods_SA'])} SA periods")
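The check above could be extended with a few more tests. The sketch below is a hypothetical helper, not part of djura: it looks for missing required keys, duplicate RSN values, and period vectors that are not strictly increasing. The last test is a precaution assumed here because periods are interpolated; the package may or may not require it.

```python
import numpy as np

# Required record-level keys, as listed in the table on this page.
REQUIRED_KEYS = [
    "RSN", "EQID", "Filename_1", "Filename_2", "Filename_vert",
    "EQ_name", "EQ_year", "Station_name", "magnitude", "mechanism",
    "Rjb", "Rrup", "Vs30", "lowest_usable_freq", "dt", "duration", "npts",
]


def check_metadata(m):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    missing = [key for key in REQUIRED_KEYS if key not in m]
    if missing:
        problems.append(f"missing keys: {missing}")
    if "RSN" in m and len(np.unique(m["RSN"])) != len(m["RSN"]):
        problems.append("RSN values are not unique")
    for key, val in m.items():
        if key.startswith("Periods_") and np.any(np.diff(val) <= 0):
            problems.append(f"{key} is not strictly increasing")
    return problems


# Deliberately broken toy input to show what gets reported.
demo = {"RSN": np.array([1, 1, 2]), "Periods_SA": np.array([0.1, 0.05, 1.0])}
problems = check_metadata(demo)
```

Returning a list rather than raising on the first failure lets you see every problem in a freshly mapped database at once.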
Pointing djura at your file¶
Custom metadata is supplied at run time through the DJURA_METADATA_PATH
environment variable. When set, it fully bypasses the bundled download and the
selection routines use your file instead:
# bash / macOS / Linux
export DJURA_METADATA_PATH=/path/to/my_metadata.pickle
# PowerShell
$env:DJURA_METADATA_PATH = "C:\path\to\my_metadata.pickle"
# CMD
set DJURA_METADATA_PATH=C:\path\to\my_metadata.pickle
After that, the API is unchanged:
from djura.record_selection import GCIM
gcim = GCIM() # loads my_metadata.pickle via DJURA_METADATA_PATH
gcim.get_metadata_parameters() # inspect the keys it loaded
Both .pickle/.pkl and .npz files are recognised by the lower-level
reader; the pickled dict form shown above is the simplest and is what the
bundled dataset uses.
Note
Set the environment variable before importing djura (or before the
first call that loads the dataset), since the metadata is loaded at most
once per process and cached.
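If you prefer to set the variable from within Python, for example in a notebook, a minimal sketch is to assign it through os.environ before djura is imported. The file name is the one written in the example on this page; the djura lines are left commented so the snippet runs without the package installed.

```python
import os

# Must run before djura is imported (or before its first dataset load),
# because the metadata is loaded at most once per process and then cached.
metadata_path = os.path.abspath("my_metadata.pickle")
os.environ["DJURA_METADATA_PATH"] = metadata_path

# Import only after the variable is set:
# from djura.record_selection import GCIM
# gcim = GCIM()
```

Setting the variable after the dataset has already been loaded in the same process would have no effect, so restart the interpreter or kernel if you need to switch files.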
What you can omit¶
You only need to populate the fields and IMs your selection actually uses:
- IM families you never select on can be left out entirely (e.g. skip all FIV3_* keys and Periods_FIV3 if you do not use FIV3).
- RotD50/RotD100 arrays are needed only for those component definitions; geomean works from _1/_2 alone.
- Optional causal fields are read only when you pass matching context_limits.
Conversely, the required record-level fields in the table above should always be present, because they are used for identification, event grouping, usable period screening, and the selection report.