I’m working on a project using the PyTorch C++ frontend with the MNIST dataset. My goal is to work with only a portion of the full dataset instead of loading everything.
In the Python API I could simply wrap the dataset in torch.utils.data.Subset to get a smaller chunk of the data, but I can’t find equivalent functionality in the C++ frontend.
Currently my dataset loading looks like this:
auto dataset = torch::data::datasets::MNIST(data_path)
                   .map(torch::data::transforms::Normalize<>(0.1307, 0.3081))
                   .map(torch::data::transforms::Stack<>());
const size_t dataset_size = dataset.size().value();
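Whatever the right mechanism turns out to be, I will need a list of indices describing the subset. Generating that part is easy; this is roughly what I have in mind (plain C++, no libtorch required — the function name and the sizes in the comments are my own placeholders, not anything from the torch API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Pick `subset_size` distinct indices out of `full_size` samples by
// shuffling [0, full_size) and truncating. For MNIST, full_size would
// be 60000 for the training split.
std::vector<std::size_t> make_subset_indices(std::size_t full_size,
                                             std::size_t subset_size,
                                             unsigned seed = 0) {
  std::vector<std::size_t> indices(full_size);
  std::iota(indices.begin(), indices.end(), 0);
  std::mt19937 rng(seed);
  std::shuffle(indices.begin(), indices.end(), rng);
  indices.resize(subset_size);
  return indices;
}
```

For a contiguous slice (say, the first 1000 samples) I would just drop the shuffle.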
I attempted to build a custom sampler to handle this:
class PartialSampler : public torch::data::samplers::Sampler<> {
 public:
  explicit PartialSampler(std::vector<size_t> idx)
      : sample_indices(std::move(idx)) {}

  // The base class declares reset(optional<size_t>); a plain reset()
  // hides it instead of overriding it.
  void reset(c10::optional<size_t> new_size = c10::nullopt) override {
    position = 0;
  }

  c10::optional<std::vector<size_t>> next(size_t batch_count) override {
    std::vector<size_t> result;
    while (result.size() < batch_count && position < sample_indices.size()) {
      result.push_back(sample_indices[position++]);
    }
    if (result.empty()) {
      return c10::nullopt;
    }
    return result;
  }

  // save()/load() are pure virtual in Sampler<>, so they must be
  // overridden even when checkpointing is not needed.
  void save(torch::serialize::OutputArchive& archive) const override {}
  void load(torch::serialize::InputArchive& archive) override {}

 private:
  std::vector<size_t> sample_indices;
  size_t position = 0;
};
My first version declared reset() with no arguments and omitted save()/load(), which are pure virtual in Sampler<>, so the class could never be instantiated; when I passed it to the data loader I got compilation errors complaining about a missing BatchRequestType. The version above matches the base-class signatures. Even so, hand-rolling a sampler just to take a subset feels heavy: what’s the proper way to work with dataset subsets in PyTorch C++? Is a custom sampler the intended mechanism, or is there an equivalent of Python’s Subset that I’m missing?
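As a sanity check, I isolated the same next()/reset() batching logic in a torch-free mock (std::optional standing in for c10::optional; the class name is mine), and the index handling itself behaves the way I intend, so the remaining question is purely about wiring a sampler (or something better) into the C++ data-loading API:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Torch-free copy of the sampler's batching logic, used only to check
// the index handling; std::optional stands in for c10::optional.
class IndexBatcher {
 public:
  explicit IndexBatcher(std::vector<std::size_t> idx)
      : sample_indices(std::move(idx)) {}

  // Return up to batch_count indices, or nullopt when exhausted.
  std::optional<std::vector<std::size_t>> next(std::size_t batch_count) {
    std::vector<std::size_t> result;
    while (result.size() < batch_count && position < sample_indices.size()) {
      result.push_back(sample_indices[position++]);
    }
    if (result.empty()) return std::nullopt;
    return result;
  }

  void reset() { position = 0; }

 private:
  std::vector<std::size_t> sample_indices;
  std::size_t position = 0;
};
```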