The BTAA-GIN Geodata Collection Groundwork Phase (Q1 2024 through Q1 2025) established the technical and metadata infrastructure required for a scalable geospatial data repository. This phase focused on pilot collection development, metadata expansion, workflow refinement, and platform enhancements.
We curated a pilot collection of 35 datasets on our development site to test our processes for gathering, describing, and sharing geospatial information.
We added new metadata fields and entry guidelines to provide technical and administrative documentation for datasets.
We created new methods for programmatically generating metadata and derivative files from collected datasets.
We introduced new ways for users to evaluate datasets with visual previews and data dictionaries.
To accommodate the new metadata fields and improve overall consistency, we have augmented our Metadata Entry Guidelines. The revised guidelines offer instructions on populating the new fields and provide updated recommendations for existing fields, especially for internal assets.
The input guidelines for items in the BTAA Geodata Collection differ from the general registry collection.
All records for the Geodata Collection will have an ID prefixed with “btaa_”. This prefix identifies them and differentiates them from the IDs of the original source records.
The Subject field is reserved for Library of Congress authority terms only.
The Description field no longer needs to have the scale, spatial resolution, or provenance concatenated, as special fields are now dedicated to that.
Geometry will reflect the outline of the dataset instead of duplicating the bounding box.
Provenance Statement should indicate when the resource was obtained and from where.
Publisher (previously unused for datasets) will be the original distributor/provider.
Provider will be Big Ten Academic Alliance Geospatial Information Network (or BTAA-GIN during the pilot).
Internal assets should always include values for certain fields that are optional or difficult to fill for external assets.
As we expand our geodata collection and storage capabilities, creating derivative files has become an integral part of our workflow. Derivatives enhance the user experience by offering streamlined access and improved visualization of datasets. These files can be batch-generated using desktop scripts, making the process efficient and scalable.
Purpose: Thumbnails provide a quick visual preview of a dataset, displayed directly on the search results page. These images help users assess the relevance of a dataset at a glance, improving discovery and usability.
Although we have had thumbnail functionality available for scanned maps and some web services, we have generally not been able to provide them for external datasets.
PMTiles is a single-file archive format designed for storing pyramids of tiled geospatial data. The format supports datasets addressed by Z/X/Y coordinates, including vector tiles, imagery, and remote sensing data. PMTiles archives can be hosted on platforms like Amazon S3, where they are accessed via HTTP range requests. This enables low-cost, zero-maintenance web mapping applications by minimizing the overhead of traditional web servers.
Why Use PMTiles?
PMTiles allows us to create web-friendly datasets that can be embedded into web maps without requiring users to download the data. Unlike traditional geospatial web servers like GeoServer or ArcGIS, PMTiles simplifies deployment and management.
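To illustrate how a client reads only part of a PMTiles archive, here is a minimal sketch using Python's `requests` library to issue an HTTP range request; the URL is a placeholder rather than an actual collection asset, and the 127-byte length reflects the fixed-size header described in the PMTiles v3 specification.

```python
import requests

# Placeholder URL for a PMTiles archive hosted on S3 (not an actual collection asset)
URL = "https://example-bucket.s3.amazonaws.com/geodata/parcels.pmtiles"

# Request only a byte range instead of downloading the whole archive.
# The first 127 bytes of a PMTiles v3 archive hold its fixed-size header.
resp = requests.get(URL, headers={"Range": "bytes=0-126"})

print(resp.status_code)   # 206 Partial Content if the host supports range requests
print(len(resp.content))  # 127 bytes: just the header, not the tile data
```

Because clients fetch only the byte ranges they need, a static file host such as S3 can serve the archive with no map server in between.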
Current Limitations
While GeoBlacklight can display PMTiles as overlays, features such as styled layers or querying datasets are not yet supported. These enhancements are part of GeoBlacklight’s future development roadmap.
A Cloud Optimized GeoTIFF (COG) is an enhanced version of the standard GeoTIFF format. It is internally organized to optimize data access in cloud environments, leveraging features like tiling, overviews, and streaming capabilities. Clients can retrieve only the parts of a file they need using HTTP range requests, which improves performance and reduces bandwidth.
Why Use COGs?
COGs serve the same purpose as PMTiles for raster datasets, enabling browser-based visualization without the need for a dedicated web server. They are ideal for high-resolution imagery and other large raster files.
Key Features
Backward compatibility: All COG files are valid GeoTIFFs.
Open standard: Developed by the Open Geospatial Consortium (OGC), the COG specification is free to use and implement.
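As a sketch of what this means in practice, the snippet below uses `rasterio` to open a COG over HTTP and read a single small window, so only the internal tiles that overlap that window are fetched; the URL and window size are placeholders for illustration.

```python
import rasterio
from rasterio.windows import Window

# Placeholder URL for a Cloud Optimized GeoTIFF hosted on S3
URL = "https://example-bucket.s3.amazonaws.com/geodata/imagery_cog.tif"

with rasterio.open(URL) as src:
    # Read a 512 x 512 pixel window from band 1; GDAL issues HTTP range
    # requests for only the internal tiles that intersect this window.
    window = Window(col_off=0, row_off=0, width=512, height=512)
    data = src.read(1, window=window)
    print(data.shape, src.crs)
```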
To improve resource interpretation, we have added data dictionaries as supplemental tables. The dictionaries include field names, types, definitions, and definition sources. The tables also support nested values.
The Geodata Collection initiative has allowed us to address a long-standing concern: documenting data dictionaries. Often referred to as attribute table definitions or codebooks, data dictionaries describe a dataset’s internal structure and provide information for understanding its contents.
In the past, our efforts have focused on descriptive metadata (titles, authors, dates, subjects, etc.) to help users find resources. We have devoted less attention to helping users evaluate resources. However, user feedback has consistently shown that data dictionaries are highly desired. Many users even assume this information is already included in metadata, leading to confusion when it is not readily available.
The challenge of documenting data dictionaries has persisted for years. The earlier geospatial metadata standard, FGDC, provided a structured format for this information through its Entities and Attributes section. However, the subsequent ISO 191xx standards replaced this section with ISO 19110, a standalone format for data dictionaries. Despite this shift, ISO 19110 files remain rare, likely because tools like ArcGIS do not export field definitions in this format. One exception is Stanford Libraries, which include ISO 19110 files in their dataset downloads. However, these files are not available as previews in their geoportal and are encoded in XML, making them difficult for users to read.
Accessing data dictionaries is often a frustrating process for users, as the information is inconsistently stored across FGDC files, plain text files, or external websites. To address this, we have consolidated the information into a standardized table format. Users can now access data dictionaries directly from the item view page via a clearly labeled link—eliminating the need to search through XML files or external sources.
Our data dictionary format is modeled on the FGDC standard. Using Python scripts, we extract field information from datasets or existing FGDC files and document it in simple CSV files. These CSVs are designed to accommodate nested entries by including a parent field identifier.
While this approach has improved accessibility, many of our data dictionaries remain incomplete. In cases where field definitions are unavailable, we at least provide field names and types as extracted by our scripts. By storing this information in tables rather than static files, we retain the flexibility to update the dictionaries as new information becomes available. Even in their current state, these tables help users gain a clearer understanding of a dataset’s contents simply by browsing the field names.
We developed several Python scripts to enhance metadata and data processing capabilities. These tools support the collection, documentation, and curation of geospatial datasets, aligning with geospatial metadata standards and improving user experience.
Jupyter Notebook for Extracting Technical Metadata
This script extracts spatial metadata from geospatial datasets stored in a directory and exports the results as a CSV file. Modules Used: `geopandas`, `pandas`, `rasterio`, `shapely`
Supported Geospatial Formats:
Vector: Shapefiles, GeoJSON
Raster: GeoTIFF
Geodatabases and Geopackages (records file name and size only)
Metadata Extracted:
File folder and name
File size (in KB or MB)
Resource (geometry) type (e.g., “Point data”)
Coordinate reference system (CRS) as a URI
Bounding box (rectangular extent)
Geometry shape as WKT outline
Area (square kilometers, calculated from bounding box)
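The notebook itself is not reproduced here, but a simplified sketch of its vector branch might look like the following. The directory and output file names are placeholders, the WKT outline is approximated with a convex hull, and the production notebook handles more formats and edge cases than this sketch does.

```python
import os
import geopandas as gpd
import pandas as pd

DATA_DIR = "datasets"   # placeholder input directory
rows = []

for name in os.listdir(DATA_DIR):
    if not name.endswith((".shp", ".geojson")):
        continue
    path = os.path.join(DATA_DIR, name)
    gdf = gpd.read_file(path)

    # Reproject a copy to EPSG:4326 so the bounding box is in lat/long
    gdf_wgs84 = gdf.to_crs(epsg=4326)
    minx, miny, maxx, maxy = gdf_wgs84.total_bounds

    rows.append({
        "folder": DATA_DIR,
        "file_name": name,
        "file_size_kb": round(os.path.getsize(path) / 1024, 1),
        "geometry_type": gdf.geom_type.iloc[0],          # e.g. "Point"
        "crs": f"http://www.opengis.net/def/crs/EPSG/0/{gdf.crs.to_epsg()}",
        "bounding_box": f"{minx},{miny},{maxx},{maxy}",
        # Approximate WKT outline; the real script may compute a tighter shape
        "geometry_wkt": gdf_wgs84.unary_union.convex_hull.wkt,
    })

pd.DataFrame(rows).to_csv("technical_metadata.csv", index=False)
```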
Generate Thumbnails
This script generates thumbnail images for geospatial datasets. It processes files (Shapefiles, GeoJSONs, or GeoTIFFs) in a folder and outputs configurable images (width, height, and DPI). Modules Used: `geopandas`, `matplotlib.pyplot`, `rasterio`
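A minimal sketch of the vector case is shown below; the default dimensions and file names are hypothetical, and the actual script also renders GeoTIFFs via `rasterio`.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

def make_thumbnail(src_path, out_path, width=300, height=300, dpi=100):
    """Render a simple preview image for a vector dataset (placeholder defaults)."""
    gdf = gpd.read_file(src_path)
    fig, ax = plt.subplots(figsize=(width / dpi, height / dpi), dpi=dpi)
    gdf.plot(ax=ax)
    ax.set_axis_off()   # no axes or labels on the preview image
    fig.savefig(out_path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)

# Placeholder file names for illustration
make_thumbnail("roads.shp", "roads_thumbnail.png")
```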
Create PMTiles
This script creates PMTiles for vector datasets, with an intermediate step to reproject to EPSG:4326 (required for web visualization). The minimum and maximum zoom levels are configurable. Modules Used: `GDAL`
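As an illustration only (not the exact script), the two steps can be approximated by calling GDAL's command-line tools from Python. This assumes GDAL 3.8 or later, which ships a PMTiles vector driver; the file names and zoom levels are placeholders.

```python
import subprocess

SRC = "parcels.shp"             # placeholder input dataset
WGS84 = "parcels_4326.geojson"  # reprojected intermediate copy
OUT = "parcels.pmtiles"

# Step 1: reproject to EPSG:4326, which web map libraries expect for vector tiles
subprocess.run(["ogr2ogr", "-t_srs", "EPSG:4326", WGS84, SRC], check=True)

# Step 2: write a PMTiles archive with configurable zoom levels
# (MINZOOM/MAXZOOM are dataset creation options of GDAL's PMTiles driver)
subprocess.run(
    ["ogr2ogr", "-f", "PMTiles", OUT, WGS84,
     "-dsco", "MINZOOM=0", "-dsco", "MAXZOOM=10"],
    check=True,
)
```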
COG Generator
This script processes GeoTIFFs, reprojects them to EPSG:3857, and outputs Cloud Optimized GeoTIFFs (COGs), optimized for web browser compatibility. Modules Used: `GDAL`
Create COG with 3 Bands
This script processes GeoTIFFs, reprojects them to EPSG:3857, and outputs Cloud Optimized GeoTIFFs (COGs), optimized for web browser compatibility. It also adds 3 bands of color. Modules Used: `GDAL`
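A rough equivalent of these raster steps, again calling GDAL tools from Python with placeholder file names: `gdal_translate -of COG` requires GDAL 3.1 or later, and the `-expand rgb` step shown in the comment only applies to single-band rasters with a color table.

```python
import subprocess

SRC = "elevation.tif"         # placeholder input GeoTIFF
WEB = "elevation_3857.tif"    # reprojected intermediate
COG = "elevation_cog.tif"

# Reproject to Web Mercator (EPSG:3857) for browser-based display
subprocess.run(["gdalwarp", "-t_srs", "EPSG:3857", SRC, WEB], check=True)

# Write a Cloud Optimized GeoTIFF with internal tiling and overviews
subprocess.run(
    ["gdal_translate", "-of", "COG", "-co", "COMPRESS=DEFLATE", WEB, COG],
    check=True,
)

# For the 3-band variant: expand a paletted single-band raster to RGB first
# subprocess.run(["gdal_translate", "-expand", "rgb", SRC, "elevation_rgb.tif"], check=True)
```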
To ensure data consistency and compatibility, these scripts address common preprocessing needs.
Add Projection Files
This script inserts `.prj` (projection) files into shapefiles within a directory. The projection must be defined in the script and is essential for extracting CRS, bounding box, geometry, area, and spatial resolution values.
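A minimal sketch of this idea, using `pyproj` to generate the WKT and assuming the missing projection is known to be EPSG:4326; the directory path is a placeholder and the actual script may define the projection differently.

```python
import os
from pyproj import CRS

DATA_DIR = "datasets"   # placeholder directory of shapefiles
# ESRI-flavored WKT1 is the form conventionally stored in .prj files
WKT = CRS.from_epsg(4326).to_wkt(version="WKT1_ESRI")

for name in os.listdir(DATA_DIR):
    if name.endswith(".shp"):
        prj_path = os.path.join(DATA_DIR, os.path.splitext(name)[0] + ".prj")
        if not os.path.exists(prj_path):   # only add a .prj where one is missing
            with open(prj_path, "w") as f:
                f.write(WKT)
```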
Attribute Table Definitions
This script parses FGDC metadata fields to extract attribute labels, definitions, and sources into a CSV. These results can optionally be merged with tables generated by the Extract Technical Metadata script.
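A simplified sketch of this parsing step using Python's standard library: the element names (`attrlabl`, `attrdef`, `attrdefs` under `eainfo/detailed/attr`) come from the FGDC CSDGM standard, and the input and output file names are placeholders.

```python
import csv
import xml.etree.ElementTree as ET

tree = ET.parse("dataset_fgdc.xml")   # placeholder FGDC metadata file
rows = []

# FGDC CSDGM documents attributes under eainfo/detailed/attr
for attr in tree.getroot().findall(".//eainfo/detailed/attr"):
    rows.append({
        "label": attr.findtext("attrlabl", default=""),
        "definition": attr.findtext("attrdef", default=""),
        "definition_source": attr.findtext("attrdefs", default=""),
    })

with open("attribute_definitions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "definition", "definition_source"])
    writer.writeheader()
    writer.writerows(rows)
```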
To support the collection, storage, and management of geodata, we enhanced the GeoBlacklight (GBL) Admin tool. Originally designed for metadata workflows, GBL Admin now includes functionality to upload and manage assets, centralize access links, and document data dictionaries. These improvements address the expanded needs of our program and align with best practices for geospatial data management.
The enhancements fall into three key areas:
Asset Upload and Management: Integration with Amazon S3 enables storage and management of assets, including uploads, file-level documentation, and thumbnail harvesting.
Dedicated Distribution Table for Links: A new “Distributions” table consolidates all external and internal links. This improves organization and provides administrative fields for custom labels.
Data Dictionaries: A new table for data dictionaries provides users with immediate access to attribute table definitions. It also allows administrators to dynamically update the field-level metadata instead of relying upon static documents.
To streamline the management of access links (referred to as “References” in GeoBlacklight), we created a dedicated Distributions table. This unified approach replaces scattered interface views across:
Main record views for external links.
A secondary table for “Multiple Downloads.”
The new assets panel for uploading and attaching files.
Benefits
Alignment with Standards: Renaming this part of the metadata to “Distributions” aligns with terminology in the ISO metadata standard as well as the DCAT profile.
Improved Import/Export:
Links are now stored separately, facilitating batch updates to descriptive metadata without affecting links.
Simplified CSV imports/exports, with smaller metadata files and the ability to handle arrays and multiple values for more fields.
How It Works
Option 1: Manual Entry
From the item page, click “Distributions.”
Navigate to the External - Document Distributions section.
Enter the reference type and URL.
Click “Create Download URL.”
Option 2: Batch Upload
Prepare a CSV with the following columns:
friendlier_id: ID of the main record.
reference_type: One of the reference codes (see table below).
distribution_url: The asset’s URL.
label (optional): Custom label for the Download button.
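For example, a batch file for a single record might be assembled with pandas as sketched below; the ID, reference code, URL, and label are invented for illustration and do not correspond to actual records.

```python
import pandas as pd

distributions = pd.DataFrame([
    {
        "friendlier_id": "btaa_abc123",                               # hypothetical record ID
        "reference_type": "download",                                  # hypothetical code; see the reference table
        "distribution_url": "https://example.org/data/parcels.zip",    # placeholder URL
        "label": "Shapefile",                                          # optional custom button label
    },
])

distributions.to_csv("distributions_batch.csv", index=False)
```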
Data dictionaries are essential for documenting field names, types, and values. Previously, these were only stored as static files (e.g., XML, CSV). With the enhancements to GBL Admin, data dictionaries can now be managed in a relational database, offering flexibility and improved displays.
Benefits
Editable and Dynamic: Data dictionaries can be updated in the database as new information becomes available.
Enhanced Presentation: Information can be displayed dynamically on item pages.
Export Options: Users can export the data dictionary in a structured format.
How It Works
Prepare a CSV with the following headers:
friendlier_id: ID of the parent record.
label: Field label.
type: Field type.
values: Sample or defined values.
definition: Field definition.
definition_source: Source of the definition.
parent_field: The parent field’s name.
From the item page, click “Data Dictionary.”
Provide a title and optionally a description of the data dictionary.
Select “Upload a CSV” and upload your file.
Once the data dictionary has been created, administrators can add, edit, or delete the fields.
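As an illustration, a small data dictionary CSV with one top-level field and one nested domain value could be built as follows; every field name, value, and definition below is an invented example.

```python
import pandas as pd

entries = [
    {
        "friendlier_id": "btaa_abc123",        # hypothetical parent record ID
        "label": "LAND_USE",
        "type": "String",
        "values": "RES; COM; AGR",
        "definition": "Generalized land use category.",
        "definition_source": "County assessor codebook",
        "parent_field": "",                    # top-level field has no parent
    },
    {
        "friendlier_id": "btaa_abc123",
        "label": "RES",
        "type": "Domain value",
        "values": "",
        "definition": "Residential parcel.",
        "definition_source": "County assessor codebook",
        "parent_field": "LAND_USE",            # nested under LAND_USE
    },
]

pd.DataFrame(entries).to_csv("data_dictionary.csv", index=False)
```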
Workflow Steps for Processing Metadata and Derivatives
With the Geodata Pilot Collection Workgroup concluding, we are moving into the Foundation Phase (Phase 2) of the program. This stage marks a transition from internal experimentation to external collaboration and public-facing operations. We will focus on four key areas: partnerships, collections, technology, and the publication of our first formal Curation Plan.
The primary goal for this phase is to collaborate with at least two data providers to establish a sustainable workflow for dataset exchange and curation. These partnerships may inform the creation of sharing agreements that ensure clear expectations and communication. By working directly with data providers, we will also gain practical insights into their needs, which will help us refine our metadata guidelines, data ingestion workflows, and curation strategies.
We will continue to iteratively improve our technology. On the front end, this will include BTAA Geoportal enhancements to item view pages. On the backend, we will implement batch ingest functionality.
The Foundation Phase will result in the development and publication of Version 1 of the Curation Plan. This document will outline the collection’s scope, metadata framework, and workflows, serving as a reference for both internal and external stakeholders.