Principal Component Analysis
PCA is a useful exploratory tool. It compresses each spectrum into a few principal-component scores so you can inspect dominant spectral variation, outliers, and material separation before building a supervised model.
The downloadable .py files can be run without editing paths by setting
environment variables such as HSI_EXAMPLE_BASE_DIR. See
Running Downloaded Scripts
for the full list of supported overrides.
The usual workflow is:
- Sample spectra from the cube so fitting is fast.
- Fit a scikit-learn PCA model.
- Inspect the explained variance and loadings.
- Apply the fitted PCA model lazily to a crop or full cube with
hs.ml.pca_helper(). - Visualize the first components as images or scatter plots.
The SDK supports strided slicing, for example image[:, :, ::4], and slicing
always uses logical BIP order (lines, samples, bands). These examples sample
training spectra with strided slicing, then apply the fitted PCA model lazily
with pca_helper().
The first example uses an explicit sample_stop aligned to the sample step so
the sampled grid has predictable dimensions.
Apply PCA models only to cubes with the same preprocessing, band count, band order, and wavelength calibration used when fitting the model. PCA components are tied directly to the input feature order, so even a subtle band mismatch can make the scores meaningless.
Fit PCA and Preview Components
Fit PCA from sampled pixels, apply it to a preview crop, and display the first three component score images as an RGB preview.
- Creates:
pca_model.joblib, orHSI_EXAMPLE_PCA_MODELif set. - Used by: the loading, component-image, score-plot, and apply-to-another-cube PCA examples.
- Run first when testing the downloaded PCA scripts.
import joblib
import matplotlib.pyplot as plt
from qtec_hv_sdk.ml import pca_helper
from sklearn.decomposition import PCA
reflectance = open_reflectance_cube()
line_step = 50
sample_step = 20
n_components = 6
sample_stop = (reflectance.shape.samples // sample_step) * sample_step
sample_cube = reflectance[0:200:line_step, 0:sample_stop:sample_step, :].to_numpy_with_interleave(hs.bip)
sample_pixels = sample_cube.reshape(-1, sample_cube.shape[-1])
pca = PCA(n_components=n_components, random_state=42)
pca.fit(sample_pixels)
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")
# Save the PCA model
joblib.dump(pca, PCA_MODEL_PATH)
print(f"Saved PCA model to {PCA_MODEL_PATH}")
hs_pca = pca_helper(pca)
preview_source = reflectance[0:250, 0:400, :]
pca_image = hs_pca(preview_source)
pca_preview = pca_image.to_numpy_with_interleave(hs.bip)
rgb_preview = pca_preview[:, :, :3].copy()
for channel in range(3):
rgb_preview[:, :, channel] = contrast_stretch(rgb_preview[:, :, channel])
plt.imshow(rgb_preview)
plt.title("PCA RGB preview: PC1, PC2, PC3")
plt.axis("off")
plt.show()
Example output:
Explained variance: [0.72 ...]
Total explained variance: 0.914
Saved PCA model to pca_model.joblib
pca_helper() supports PCA-like models that expose components_ and mean_,
including scikit-learn PCA, FactorAnalysis, FastICA, and SparsePCA.
PCA is not a classifier. It is a way to inspect dominant spectral variation and spot structure before building a supervised model.
Save and load with joblib
Use joblib to save the fitted scikit-learn PCA object and load the same
transform in later scripts:
import joblib
def save_pca_model(pca, model_path):
joblib.dump(pca, model_path)
def load_pca_model(model_path):
if not model_path.exists():
raise SystemExit(f"PCA model not found at {model_path}. Run the first PCA example first.")
return joblib.load(model_path)
Advanced Memory-Efficient Sampling
The first PCA example uses strided slicing, which is simple and usually good
enough. If you need custom sampling logic, such as randomly selecting a small
number of pixels from each line, use ufunc() to sample inside the lazy SDK
pipeline.
This pattern is more advanced because the callback sees data in the chosen
interleave. The example converts to BIL first, so each callback receives one
line as (bands, samples).
Use the strided slicing from the first PCA example when fixed line and sample
steps are enough. Use this ufunc() pattern when the cube is large and you need
custom per-line sampling logic that cannot be expressed as a regular slice.
For supervised ROI workflows, prefer select_mask_from_descriptor() as shown in
the classification and regression examples. Use random sampling there only if
you need to reduce very large training sets.
def sample_pixels_with_ufunc(image, n_samples_per_line=10, random_seed=42):
rng = np.random.default_rng(random_seed)
# BIL gives the ufunc one full line at a time as [bands, samples], so we can
# sample pixels without materializing the whole crop first.
image = image.to_interleave(hs.bil)
def select_pixels(meta, line):
sample_indices = rng.choice(line.shape[1], size=n_samples_per_line, replace=False)
return line[:, sample_indices]
sampled = image.ufunc(select_pixels).to_numpy_with_interleave(hs.bip)
return sampled.reshape(-1, sampled.shape[-1])
reflectance = open_reflectance_cube()
training_crop = reflectance[0:250, :, :]
sample_pixels = sample_pixels_with_ufunc(
training_crop,
n_samples_per_line=10,
random_seed=42,
)
pca = PCA(n_components=6, random_state=42)
pca.fit(sample_pixels)
joblib.dump(pca, PCA_MODEL_PATH)
print(f"Sample pixels: {sample_pixels.shape[0]}")
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Saved PCA model to {PCA_MODEL_PATH}")
Plot PCA Loadings
PCA loadings show which wavelengths contribute most to each component. Peaks, sign changes, or noisy regions in the loading curves can help you understand what the component image is emphasizing.
reflectance = open_reflectance_cube()
wavelengths = wavelengths_for_bands(reflectance.shape.bands)
pca = load_pca_model()
fig, axes = plt.subplots(3, 2, figsize=(12, 10), sharex=True)
axes = axes.ravel()
for component_index, ax in enumerate(axes):
ax.plot(wavelengths, pca.components_[component_index])
ax.set_title(f"PC{component_index + 1} loading")
ax.set_ylabel("Loading")
ax.grid(True, alpha=0.3)
axes[-2].set_xlabel("Wavelength [nm]")
axes[-1].set_xlabel("Wavelength [nm]")
plt.tight_layout()
plt.show()
Large absolute loading values indicate wavelengths that strongly affect that
component. The sign is arbitrary; a component can be multiplied by -1 and
still represent the same axis of variation.
Show Individual PCA Component Images
The RGB preview is convenient, but individual component images are easier to interpret when you want to inspect one source of variation at a time.
reflectance = open_reflectance_cube()
pca = load_pca_model()
hs_pca = pca_helper(pca)
preview_source = reflectance[0:250, 0:400, :]
pca_image = hs_pca(preview_source)
pca_preview = pca_image.to_numpy_with_interleave(hs.bip)
fig, axes = plt.subplots(3, 2, figsize=(10, 12))
axes = axes.ravel()
for component_index, ax in enumerate(axes):
component = pca_preview[:, :, component_index]
im = ax.imshow(component, cmap="gray")
ax.set_title(f"PC{component_index + 1}")
ax.axis("off")
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()
Bright and dark areas show pixels with high and low scores for that component. Use these images to find structure, not to assign final class labels.
PCA Score Scatter Plot
A scatter plot of PCA scores shows whether pixels form clusters in component space. Start with sampled pixels so the plot stays responsive.
def sample_training_pixels(image, line_step=50, sample_step=20):
sample_lines = []
for line_index in range(0, min(200, image.shape.lines), line_step):
line = image.array_plane(line_index, hs.lines).T
sample_lines.append(line[::sample_step])
return np.concatenate(sample_lines, axis=0)
reflectance = open_reflectance_cube()
sample_pixels = sample_training_pixels(reflectance)
pca = load_pca_model()
sample_scores = pca.transform(sample_pixels)
plt.figure(figsize=(8, 6))
plt.scatter(sample_scores[:, 0], sample_scores[:, 1], s=4, alpha=0.25)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Sampled pixel scores")
plt.grid(True, alpha=0.3)
plt.show()
If the first two components do not separate the materials you care about, try
PC1 versus PC3, PC2 versus PC3, or inspect the component images before
moving to supervised classification.
PCA Scatter Plot With ROIs
If you have HV Explorer annotations, project the ROI spectra into PCA space and color them by label. This is a quick way to check whether annotated materials separate before training a classifier.
This example loads HV Explorer annotations with hs.annotations.open() and
uses each annotation descriptor with select_mask_from_descriptor().
For reusable annotation loading, plotting, and ROI extraction patterns, see Annotations and ROIs.
reflectance = open_reflectance_cube()
pca = load_pca_model()
ann_file = load_annotations()
class_colors = {annotation_value(annot.properties["type"]): annot.color for annot in ann_file.annotations}
component_x = 0
component_y = 1
plt.figure(figsize=(8, 6))
for annot in ann_file.annotations:
class_name = annotation_value(annot.properties["type"])
selected = reflectance.select_mask_from_descriptor(annot.descriptor)
spectra = selected.to_numpy_with_interleave(hs.bip)[:, 0, :]
scores = pca.transform(spectra)
plt.scatter(
scores[:, component_x],
scores[:, component_y],
s=8,
alpha=0.25,
color=class_colors[class_name],
edgecolors="none",
label=class_name,
)
handles, labels = plt.gca().get_legend_handles_labels()
unique = dict(zip(labels, handles))
plt.legend(unique.values(), unique.keys())
plt.xlabel(f"PC{component_x + 1}")
plt.ylabel(f"PC{component_y + 1}")
plt.title("ROI pixels in PCA space")
plt.grid(True, alpha=0.3)
plt.show()
Apply It to Another Cube
The saved PCA model from the first example can also be applied to another calibrated cube with the same preprocessing and wavelength order.
pca = load_pca_model()
test_reflectance = open_reflectance_cube(TEST_CUBE)
hs_pca = pca_helper(pca)
test_crop = test_reflectance[0:250, 0:400, :]
test_pca = hs_pca(test_crop)
test_preview = test_pca.to_numpy_with_interleave(hs.bip)
rgb_preview = test_preview[:, :, :3].copy()
for channel in range(3):
rgb_preview[:, :, channel] = contrast_stretch(rgb_preview[:, :, channel])
plt.imshow(rgb_preview)
plt.title("Test cube projected with saved PCA")
plt.axis("off")
plt.show()
Fit PCA From ROIs
Sometimes you want PCA to focus only on annotated materials instead of the whole image background. Fit PCA from ROI spectra, save the fitted model, then reuse that exact transform on ROI visualizations or another calibrated datacube.
- Creates:
roi_pca_model.joblib, orHSI_EXAMPLE_ROI_PCA_MODELif set. - Used by: the ROI PCA score-plot example.
- Run this ROI fitting example before the ROI PCA score-plot script.
Use the same preprocessing, band selection, and wavelength order for both cubes. In this example both cubes are reflectance cubes with the same 900-band setup.
For reusable annotation loading, plotting, and ROI extraction patterns, see Annotations and ROIs.
reflectance = open_reflectance_cube()
ann_file = load_annotations()
rng = np.random.default_rng(42)
max_pixels_per_roi = 1000
roi_pixels_list = []
for annot in ann_file.annotations:
selected = reflectance.select_mask_from_descriptor(annot.descriptor)
spectra = selected.to_numpy_with_interleave(hs.bip)[:, 0, :]
if spectra.shape[0] > max_pixels_per_roi:
selected = rng.choice(spectra.shape[0], size=max_pixels_per_roi, replace=False)
spectra = spectra[selected]
roi_pixels_list.append(spectra)
roi_pixels = np.concatenate(roi_pixels_list)
roi_pca = PCA(n_components=3, random_state=42)
roi_pca.fit(roi_pixels)
joblib.dump(roi_pca, ROI_PCA_MODEL_PATH)
print(f"ROI PCA explained variance: {roi_pca.explained_variance_ratio_}")
print(f"Saved ROI PCA model to {ROI_PCA_MODEL_PATH}")
You can inspect the fitted ROI PCA in score space:
Run the ROI fitting example first. The score plot loads roi_pca_model.joblib,
or the path set with HSI_EXAMPLE_ROI_PCA_MODEL, and stops with a clear message
if the model file does not exist.
reflectance = open_reflectance_cube()
ann_file = load_annotations()
rng = np.random.default_rng(42)
max_pixels_per_roi = 1000
roi_pixels_list = []
roi_labels_list = []
for annot in ann_file.annotations:
class_name = annotation_value(annot.properties["type"])
selected = reflectance.select_mask_from_descriptor(annot.descriptor)
spectra = selected.to_numpy_with_interleave(hs.bip)[:, 0, :]
if spectra.shape[0] > max_pixels_per_roi:
keep = rng.choice(spectra.shape[0], max_pixels_per_roi, replace=False)
spectra = spectra[keep]
roi_pixels_list.append(spectra)
roi_labels_list.append(np.full(spectra.shape[0], class_name))
roi_pixels = np.concatenate(roi_pixels_list)
roi_labels = np.concatenate(roi_labels_list)
roi_pca = load_roi_pca_model()
roi_scores = roi_pca.transform(roi_pixels)
class_colors = {annotation_value(annot.properties["type"]): annot.color for annot in ann_file.annotations}
plt.figure(figsize=(8, 6))
for class_name, color in class_colors.items():
class_mask = roi_labels == class_name
if not class_mask.any():
continue
plt.scatter(
roi_scores[class_mask, 0],
roi_scores[class_mask, 1],
s=8,
alpha=0.6,
color=color,
label=class_name,
)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("ROI-fitted PCA scores")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Separated clusters are a good sign, but PCA is still unsupervised. Use the classification examples when you need an actual pixel classifier.