Skip to main content

Principal Component Analysis

PCA is a useful exploratory tool. It compresses each spectrum into a few principal-component scores so you can inspect dominant spectral variation, outliers, and material separation before building a supervised model.

Downloaded scripts

The downloadable .py files can be run without editing paths by setting environment variables such as HSI_EXAMPLE_BASE_DIR. See Running Downloaded Scripts for the full list of supported overrides.

The usual workflow is:

  1. Sample spectra from the cube so fitting is fast.
  2. Fit a scikit-learn PCA model.
  3. Inspect the explained variance and loadings.
  4. Apply the fitted PCA model lazily to a crop or full cube with hs.ml.pca_helper().
  5. Visualize the first components as images or scatter plots.

The SDK supports strided slicing, for example image[:, :, ::4], and slicing always uses logical BIP order (lines, samples, bands). These examples sample training spectra with strided slicing, then apply the fitted PCA model lazily with pca_helper().

The first example uses an explicit sample_stop aligned to the sample step so the sampled grid has predictable dimensions.

Data requirements

Apply PCA models only to cubes with the same preprocessing, band count, band order, and wavelength calibration used when fitting the model. PCA components are tied directly to the input feature order, so even a subtle band mismatch can make the scores meaningless.

Fit PCA and Preview Components

Fit PCA from sampled pixels, apply it to a preview crop, and display the first three component score images as an RGB preview.

Saved PCA model
  • Creates: pca_model.joblib, or HSI_EXAMPLE_PCA_MODEL if set.
  • Used by: the loading, component-image, score-plot, and apply-to-another-cube PCA examples.
  • Run first when testing the downloaded PCA scripts.
import joblib
import matplotlib.pyplot as plt
from qtec_hv_sdk.ml import pca_helper
from sklearn.decomposition import PCA


reflectance = open_reflectance_cube()

line_step = 50
sample_step = 20
n_components = 6

sample_stop = (reflectance.shape.samples // sample_step) * sample_step
sample_cube = reflectance[0:200:line_step, 0:sample_stop:sample_step, :].to_numpy_with_interleave(hs.bip)
sample_pixels = sample_cube.reshape(-1, sample_cube.shape[-1])

pca = PCA(n_components=n_components, random_state=42)
pca.fit(sample_pixels)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")

# Save the PCA model
joblib.dump(pca, PCA_MODEL_PATH)
print(f"Saved PCA model to {PCA_MODEL_PATH}")

hs_pca = pca_helper(pca)

preview_source = reflectance[0:250, 0:400, :]
pca_image = hs_pca(preview_source)
pca_preview = pca_image.to_numpy_with_interleave(hs.bip)

rgb_preview = pca_preview[:, :, :3].copy()
for channel in range(3):
rgb_preview[:, :, channel] = contrast_stretch(rgb_preview[:, :, channel])

plt.imshow(rgb_preview)
plt.title("PCA RGB preview: PC1, PC2, PC3")
plt.axis("off")
plt.show()

Download this script

Example output:

Explained variance: [0.72 ...]
Total explained variance: 0.914
Saved PCA model to pca_model.joblib
info

pca_helper() supports PCA-like models that expose components_ and mean_, including scikit-learn PCA, FactorAnalysis, FastICA, and SparsePCA.

PCA is not a classifier. It is a way to inspect dominant spectral variation and spot structure before building a supervised model.

Save and load with joblib

Use joblib to save the fitted scikit-learn PCA object and load the same transform in later scripts:

import joblib


def save_pca_model(pca, model_path):
joblib.dump(pca, model_path)


def load_pca_model(model_path):
if not model_path.exists():
raise SystemExit(f"PCA model not found at {model_path}. Run the first PCA example first.")
return joblib.load(model_path)

Advanced Memory-Efficient Sampling

The first PCA example uses strided slicing, which is simple and usually good enough. If you need custom sampling logic, such as randomly selecting a small number of pixels from each line, use ufunc() to sample inside the lazy SDK pipeline.

This pattern is more advanced because the callback sees data in the chosen interleave. The example converts to BIL first, so each callback receives one line as (bands, samples).

When to use this

Use the strided slicing from the first PCA example when fixed line and sample steps are enough. Use this ufunc() pattern when the cube is large and you need custom per-line sampling logic that cannot be expressed as a regular slice.

For supervised ROI workflows, prefer select_mask_from_descriptor() as shown in the classification and regression examples. Use random sampling there only if you need to reduce very large training sets.

def sample_pixels_with_ufunc(image, n_samples_per_line=10, random_seed=42):
rng = np.random.default_rng(random_seed)
# BIL gives the ufunc one full line at a time as [bands, samples], so we can
# sample pixels without materializing the whole crop first.
image = image.to_interleave(hs.bil)

def select_pixels(meta, line):
sample_indices = rng.choice(line.shape[1], size=n_samples_per_line, replace=False)
return line[:, sample_indices]

sampled = image.ufunc(select_pixels).to_numpy_with_interleave(hs.bip)
return sampled.reshape(-1, sampled.shape[-1])


reflectance = open_reflectance_cube()
training_crop = reflectance[0:250, :, :]

sample_pixels = sample_pixels_with_ufunc(
training_crop,
n_samples_per_line=10,
random_seed=42,
)

pca = PCA(n_components=6, random_state=42)
pca.fit(sample_pixels)

joblib.dump(pca, PCA_MODEL_PATH)

print(f"Sample pixels: {sample_pixels.shape[0]}")
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Saved PCA model to {PCA_MODEL_PATH}")

Download this script

Plot PCA Loadings

PCA loadings show which wavelengths contribute most to each component. Peaks, sign changes, or noisy regions in the loading curves can help you understand what the component image is emphasizing.

reflectance = open_reflectance_cube()
wavelengths = wavelengths_for_bands(reflectance.shape.bands)
pca = load_pca_model()

fig, axes = plt.subplots(3, 2, figsize=(12, 10), sharex=True)
axes = axes.ravel()

for component_index, ax in enumerate(axes):
ax.plot(wavelengths, pca.components_[component_index])
ax.set_title(f"PC{component_index + 1} loading")
ax.set_ylabel("Loading")
ax.grid(True, alpha=0.3)

axes[-2].set_xlabel("Wavelength [nm]")
axes[-1].set_xlabel("Wavelength [nm]")
plt.tight_layout()
plt.show()

Download this script

tip

Large absolute loading values indicate wavelengths that strongly affect that component. The sign is arbitrary; a component can be multiplied by -1 and still represent the same axis of variation.

Show Individual PCA Component Images

The RGB preview is convenient, but individual component images are easier to interpret when you want to inspect one source of variation at a time.

reflectance = open_reflectance_cube()
pca = load_pca_model()

hs_pca = pca_helper(pca)
preview_source = reflectance[0:250, 0:400, :]
pca_image = hs_pca(preview_source)
pca_preview = pca_image.to_numpy_with_interleave(hs.bip)

fig, axes = plt.subplots(3, 2, figsize=(10, 12))
axes = axes.ravel()

for component_index, ax in enumerate(axes):
component = pca_preview[:, :, component_index]
im = ax.imshow(component, cmap="gray")
ax.set_title(f"PC{component_index + 1}")
ax.axis("off")
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

plt.tight_layout()
plt.show()

Download this script

tip

Bright and dark areas show pixels with high and low scores for that component. Use these images to find structure, not to assign final class labels.

PCA Score Scatter Plot

A scatter plot of PCA scores shows whether pixels form clusters in component space. Start with sampled pixels so the plot stays responsive.

def sample_training_pixels(image, line_step=50, sample_step=20):
sample_lines = []
for line_index in range(0, min(200, image.shape.lines), line_step):
line = image.array_plane(line_index, hs.lines).T
sample_lines.append(line[::sample_step])
return np.concatenate(sample_lines, axis=0)


reflectance = open_reflectance_cube()
sample_pixels = sample_training_pixels(reflectance)
pca = load_pca_model()

sample_scores = pca.transform(sample_pixels)

plt.figure(figsize=(8, 6))
plt.scatter(sample_scores[:, 0], sample_scores[:, 1], s=4, alpha=0.25)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Sampled pixel scores")
plt.grid(True, alpha=0.3)
plt.show()

Download this script

tip

If the first two components do not separate the materials you care about, try PC1 versus PC3, PC2 versus PC3, or inspect the component images before moving to supervised classification.

PCA Scatter Plot With ROIs

If you have HV Explorer annotations, project the ROI spectra into PCA space and color them by label. This is a quick way to check whether annotated materials separate before training a classifier.

This example loads HV Explorer annotations with hs.annotations.open() and uses each annotation descriptor with select_mask_from_descriptor().

For reusable annotation loading, plotting, and ROI extraction patterns, see Annotations and ROIs.

reflectance = open_reflectance_cube()
pca = load_pca_model()
ann_file = load_annotations()

class_colors = {annotation_value(annot.properties["type"]): annot.color for annot in ann_file.annotations}
component_x = 0
component_y = 1

plt.figure(figsize=(8, 6))

for annot in ann_file.annotations:
class_name = annotation_value(annot.properties["type"])
selected = reflectance.select_mask_from_descriptor(annot.descriptor)
spectra = selected.to_numpy_with_interleave(hs.bip)[:, 0, :]
scores = pca.transform(spectra)

plt.scatter(
scores[:, component_x],
scores[:, component_y],
s=8,
alpha=0.25,
color=class_colors[class_name],
edgecolors="none",
label=class_name,
)

handles, labels = plt.gca().get_legend_handles_labels()
unique = dict(zip(labels, handles))
plt.legend(unique.values(), unique.keys())
plt.xlabel(f"PC{component_x + 1}")
plt.ylabel(f"PC{component_y + 1}")
plt.title("ROI pixels in PCA space")
plt.grid(True, alpha=0.3)
plt.show()

Download this script

Apply It to Another Cube

The saved PCA model from the first example can also be applied to another calibrated cube with the same preprocessing and wavelength order.

pca = load_pca_model()

test_reflectance = open_reflectance_cube(TEST_CUBE)
hs_pca = pca_helper(pca)

test_crop = test_reflectance[0:250, 0:400, :]
test_pca = hs_pca(test_crop)
test_preview = test_pca.to_numpy_with_interleave(hs.bip)

rgb_preview = test_preview[:, :, :3].copy()
for channel in range(3):
rgb_preview[:, :, channel] = contrast_stretch(rgb_preview[:, :, channel])

plt.imshow(rgb_preview)
plt.title("Test cube projected with saved PCA")
plt.axis("off")
plt.show()

Download this script

Fit PCA From ROIs

Sometimes you want PCA to focus only on annotated materials instead of the whole image background. Fit PCA from ROI spectra, save the fitted model, then reuse that exact transform on ROI visualizations or another calibrated datacube.

Saved ROI PCA model
  • Creates: roi_pca_model.joblib, or HSI_EXAMPLE_ROI_PCA_MODEL if set.
  • Used by: the ROI PCA score-plot example.
  • Run this ROI fitting example before the ROI PCA score-plot script.

Use the same preprocessing, band selection, and wavelength order for both cubes. In this example both cubes are reflectance cubes with the same 900-band setup.

For reusable annotation loading, plotting, and ROI extraction patterns, see Annotations and ROIs.

reflectance = open_reflectance_cube()
ann_file = load_annotations()

rng = np.random.default_rng(42)
max_pixels_per_roi = 1000

roi_pixels_list = []

for annot in ann_file.annotations:
selected = reflectance.select_mask_from_descriptor(annot.descriptor)
spectra = selected.to_numpy_with_interleave(hs.bip)[:, 0, :]
if spectra.shape[0] > max_pixels_per_roi:
selected = rng.choice(spectra.shape[0], size=max_pixels_per_roi, replace=False)
spectra = spectra[selected]

roi_pixels_list.append(spectra)

roi_pixels = np.concatenate(roi_pixels_list)
roi_pca = PCA(n_components=3, random_state=42)
roi_pca.fit(roi_pixels)

joblib.dump(roi_pca, ROI_PCA_MODEL_PATH)

print(f"ROI PCA explained variance: {roi_pca.explained_variance_ratio_}")
print(f"Saved ROI PCA model to {ROI_PCA_MODEL_PATH}")

Download this script

You can inspect the fitted ROI PCA in score space:

info

Run the ROI fitting example first. The score plot loads roi_pca_model.joblib, or the path set with HSI_EXAMPLE_ROI_PCA_MODEL, and stops with a clear message if the model file does not exist.

reflectance = open_reflectance_cube()
ann_file = load_annotations()

rng = np.random.default_rng(42)
max_pixels_per_roi = 1000

roi_pixels_list = []
roi_labels_list = []

for annot in ann_file.annotations:
class_name = annotation_value(annot.properties["type"])
selected = reflectance.select_mask_from_descriptor(annot.descriptor)
spectra = selected.to_numpy_with_interleave(hs.bip)[:, 0, :]
if spectra.shape[0] > max_pixels_per_roi:
keep = rng.choice(spectra.shape[0], max_pixels_per_roi, replace=False)
spectra = spectra[keep]

roi_pixels_list.append(spectra)
roi_labels_list.append(np.full(spectra.shape[0], class_name))

roi_pixels = np.concatenate(roi_pixels_list)
roi_labels = np.concatenate(roi_labels_list)

roi_pca = load_roi_pca_model()
roi_scores = roi_pca.transform(roi_pixels)

class_colors = {annotation_value(annot.properties["type"]): annot.color for annot in ann_file.annotations}

plt.figure(figsize=(8, 6))
for class_name, color in class_colors.items():
class_mask = roi_labels == class_name
if not class_mask.any():
continue

plt.scatter(
roi_scores[class_mask, 0],
roi_scores[class_mask, 1],
s=8,
alpha=0.6,
color=color,
label=class_name,
)

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("ROI-fitted PCA scores")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Download this script

data grouping

Separated clusters are a good sign, but PCA is still unsupervised. Use the classification examples when you need an actual pixel classifier.