BTAA-GIN Geodata Pilot Workgroup Report

Summary: Final report of Geodata Pilot Collection Working Group

The BTAA-GIN Geodata Pilot Workgroup met from Q1 2024 through Q1 2025 to develop preliminary recommendations for building a geodata collection. The workgroup laid the groundwork for a geodata collection program, focusing on technical infrastructure, metadata standards, and workflows.

The Workgroup prioritized technical and metadata development over policy discussions, such as data provider agreements, which were assigned to a future group.

We realized that collecting the data provided an opportunity to expand our metadata documentation. Instead of relying solely on discovery and descriptive metadata fields, we introduced fields for technical and administrative metadata. By examining and running scripts on the downloaded data, we documented elements such as file size, resolution, and projection. As part of a processing workflow, we further added information about the provenance of the dataset.
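As an illustration, the administrative portion of that extraction (file size, checksum, modification date) can be handled with the Python standard library alone; reading projection or resolution would additionally require a geospatial library such as GDAL. The function below is a hypothetical sketch, not one of the workgroup's actual scripts:

```python
import hashlib
import os
from datetime import datetime, timezone

def basic_file_metadata(path: str) -> dict:
    """Collect administrative metadata derivable from the file itself."""
    stat = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file_size_bytes": stat.st_size,
        "sha256": digest,
        "last_modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
```

Recording a checksum at ingest also makes later fixity checks possible.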

This proactive approach will allow us to document as much as possible at the outset, minimizing the need to revisit datasets for additional information later. While our metadata profile does not currently conform to established standards such as ISO 191xx, we established a framework that captures the majority of information such standards require. This positions us well for future crosswalking to internationally recognized standards.

Over the course of the pilot, the workgroup accomplished the following:

  • Expanded and refined the metadata profile to support more detailed documentation.

  • Oversaw the implementation of asset storage and data ingest functionality into the technology stack.

  • Vetted Python scripts for automating metadata extraction, data processing, and technical validation.

  • Established a plan to use web-friendly data derivatives, such as Cloud Optimized GeoTIFFs (COGs), PMTiles, and visual thumbnails, to enhance accessibility.

  • Assembled a pilot collection of 35 curated items, showcasing diverse formats and the implementation of new metadata features. This collection will be used as a proof of concept when communicating with potential partners.

  • Developed a prototype model for documenting dataset fields and values in simple tables based on the “Entity and Attributes” section of the FGDC standard.
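To sketch how derivative generation might be scripted, the helpers below assemble GDAL command lines for a COG and a PNG thumbnail (the COG output driver requires GDAL 3.1 or newer). PMTiles generation typically relies on separate tooling and is omitted; the function names are illustrative, not part of the workgroup's vetted scripts:

```python
def cog_command(src: str, dst: str) -> list[str]:
    # gdal_translate's COG driver (GDAL >= 3.1) writes a Cloud Optimized GeoTIFF
    return ["gdal_translate", "-of", "COG", src, dst]

def thumbnail_command(src: str, dst: str, width: int = 200) -> list[str]:
    # A small PNG preview; an -outsize height of 0 preserves the aspect ratio
    return ["gdal_translate", "-of", "PNG", "-outsize", str(width), "0", src, dst]
```

Either list can be passed directly to `subprocess.run` on a machine with GDAL installed.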

The workgroup also identified several ongoing challenges:

  • Meeting user needs for metadata despite the varying quality of documentation from data providers.

  • Applying single-layer archiving principles to multipart items, such as Geodatabases.

  • Addressing the verbosity of certain metadata fields, such as lineage or contact information, which are difficult to reformat into lightweight Solr fields as key:value pairs.

  • Providing data previews for static downloads in the absence of a dedicated geospatial web service platform, such as GeoServer or ArcGIS.

  • Ensuring users have access to data dictionaries. Although few standards exist for structuring this information and it is rarely included in geodata metadata, these details are essential for evaluating the dataset’s contents. Only the legacy FGDC metadata standard provides a structured place for data dictionaries.
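The Solr-flattening challenge above can be illustrated with a small sketch: nested metadata such as contact information is reduced to single-level keys using a dotted-path convention. The convention and function name are assumptions for illustration, not the geoportal's actual indexing logic:

```python
def flatten_for_solr(record: dict, prefix: str = "") -> dict:
    """Flatten nested metadata into single-level key:value pairs.

    Nested dicts become dotted key paths; lists pass through unchanged
    and can be indexed as multivalued fields.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_for_solr(value, full_key))
        else:
            flat[full_key] = value
    return flat
```

Verbose fields such as lineage would still need truncation or summarization before indexing.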

Whenever possible, we built on existing models and tools to avoid reinventing the wheel, leveraging prior art to inform and streamline our work.

  • The BTAA Geoportal uses a local profile of OGM Aardvark metadata, which already provides a strong foundation for geospatial metadata. Following Aardvark’s established guidelines, we extended the profile with additional fields from DCAT and Dublin Core. In particular, we integrated elements from:

    • DCAT-3 US Profile: A schema tailored for the United States, designed for interoperability.
    • GeoDCAT: A geospatial adaptation of DCAT, incorporating several geospatial-specific fields.
  • Jaime Martindale’s work at UW-Madison’s Robinson Map Library provided valuable insights, including minimum metadata requirements and best practices for engaging with data providers.

  • For metadata processing, we adapted scripts based on existing tools:

    • GeoDCT by Melinda Kernik: For creating file inventories and documenting fields.
    • Geospatial Metadata Toolkit by Kim Durante: For extracting technical metadata values.
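As a sketch of how borrowed fields could be crosswalked back to their source vocabularies, the mapping below pairs a few local field names (drawn from the profile fields discussed later in this report) with their DCAT/GeoDCAT terms. The mapping is illustrative, not the profile's official crosswalk:

```python
# Hypothetical crosswalk from local profile fields to DCAT/GeoDCAT terms;
# the authoritative mappings would come from the published profile documentation.
CROSSWALK = {
    "dct_provenanceStatement_sm": "dct:provenance",
    "dct_conformsTo_sm": "dct:conformsTo",
    "dcat_spatialResolutionInMeters_s": "dcat:spatialResolutionInMeters",
}

def crosswalk_record(record: dict, mapping: dict = CROSSWALK) -> dict:
    # Keep only fields the mapping knows about, renamed to their target terms
    return {mapping[k]: v for k, v in record.items() if k in mapping}
```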

What datasets to include in the pilot collection

The workgroup considered what kinds of data to include in the pilot collection to test and represent our goals. They scanned multiple data clearinghouses and portals and chose resources with a range across the following characteristics:

  • Geographic coverage: Maryland, New Jersey, Indiana, Illinois, and Minnesota
  • Administrative levels: State, county, and municipal
  • Temporal coverage: Current (2024) and historical (1970s, 2000s)
  • Formats: Vector, raster, and databases
  • Theme: Foundational layers (such as roads, parcels, and zoning) and high-value layers (land cover/land use)

A Box account was established as the primary staging area for data processing and temporary storage. While this solution provided an immediate workspace, discussions highlighted the need for a scalable and sustainable staging infrastructure to support long-term data management.

To maintain data integrity, the workgroup decided to retain datasets in their original formats, avoiding reprojections or format conversions whenever possible. Additionally, new Python scripts were reviewed to facilitate the extraction of technical metadata, including file attributes, coordinate systems, and bounding box information, ensuring that essential metadata could be captured efficiently.
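The bounding-box step reduces to a simple computation once coordinates are in hand; in practice the coordinates would be read with a library such as GDAL/OGR. The `ENVELOPE` formatting below follows the Solr/GeoBlacklight convention of W, E, N, S; both function names are illustrative:

```python
def bounding_box(coords):
    """Compute (min_x, min_y, max_x, max_y) from an iterable of (x, y) pairs."""
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))

def to_envelope(bbox):
    # Solr/GeoBlacklight-style ENVELOPE(minX, maxX, maxY, minY), i.e. W, E, N, S
    min_x, min_y, max_x, max_y = bbox
    return f"ENVELOPE({min_x},{max_x},{max_y},{min_y})"
```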

What additional metadata to generate or collect

Metadata discussions emphasized the importance of incorporating technical and administrative details, such as projections and data sources. Several new fields were proposed to enhance metadata completeness:

  • Provenance (dct_provenanceStatement_sm)
  • Coordinate reference system (dct_conformsTo_sm)
  • Spatial resolution (dcat_spatialResolutionInMeters_s)
  • Scale (geocat_spatialResolutionInText_sm)
  • Geographic extent (b1g_dct_extent_sm)
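For illustration, a record carrying the proposed fields might look like the following; every value is a made-up example, and the `_s`/`_sm` suffixes follow the usual Solr single-valued/multivalued convention:

```python
# Hypothetical record illustrating the proposed fields; values are examples only
sample_record = {
    "dct_provenanceStatement_sm": ["Downloaded from a state GIS clearinghouse, 2024"],
    "dct_conformsTo_sm": ["EPSG:26915"],
    "dcat_spatialResolutionInMeters_s": "30",
    "geocat_spatialResolutionInText_sm": ["1:24,000"],
    "b1g_dct_extent_sm": ["Hennepin County, Minnesota"],
}
```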

The group noted that numeric spatial resolution might only be useful for raster formats, as vector data often lacks standardized measurement criteria. For vector data, they approved a plain text field to hold scale descriptions. The group also explored strategies for organizing records into dataset series based on geographic coverage or temporal range, but did not recommend a formal solution.
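One possible grouping strategy (a sketch only, since the group did not recommend a formal solution) is to key records by geographic extent and decade:

```python
from itertools import groupby

def group_into_series(records):
    """Group records into candidate series by (extent, decade).

    This is one illustrative strategy; `extent` and `year` are assumed
    record keys, not fields from the actual metadata profile.
    """
    def key(rec):
        decade = (rec["year"] // 10) * 10
        return (rec["extent"], decade)
    records = sorted(records, key=key)  # groupby requires sorted input
    return {k: [r["title"] for r in g] for k, g in groupby(records, key=key)}
```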

The workgroup evaluated improvements to GBL Admin that introduced new functionality for uploading, attaching, and managing files through Amazon S3. They also identified the elements of a user download package: the original dataset, its metadata, and an optional data dictionary.

How well the pilot collection represents our capabilities

The group conducted a final review of the pilot collection within the development geoportal to determine if it demonstrated our core capabilities. The workgroup assessed field structures and labels, making recommendations for refinements to improve clarity and usability.

Next steps: The Foundation Phase (2025–2026)

With the pilot phase concluding, the workgroup recommended sunsetting its activities and transitioning to a new phase. The “Foundations” phase, set to begin in early 2025, will focus on collaborating with data providers to establish exchange workflows and build an initial collection. The new workgroup will explore potential partnerships with other institutions to expand and strengthen the collection.

  • Partnerships: We will partner with data providers to develop a process for exchanging resources.
  • Technology: We will continue to enhance the BTAA Geoportal and backend management tools.
  • Collections: We will curate real-world data submissions to build our first official collections.
  • Curation Plan: We will release a formal plan that documents our goals, workflows, and guidelines.