Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ckan-php-manager tagging cli script gives 409 error #2413

Closed
FuhuXia opened this issue Nov 10, 2020 · 12 comments
Closed

ckan-php-manager tagging cli script gives 409 error #2413

FuhuXia opened this issue Nov 10, 2020 · 12 comments
Assignees
Labels
support Issues from agency requests or affecting users

Comments

@FuhuXia
Copy link
Member

FuhuXia commented Nov 10, 2020

Running ckan-php-manager tagging cli script on latest Climate spreadsheet, more than half datasets failed with error message 409: Conflict (e.g. name already exists).

How to reproduce

  1. Save this sample as assign_climate_test.csv in ./data folder in your local ckan-php-manager repo directory.
california-ocean-uses-atlas,Climate,"Ecosystem Vulnerability;Land Cover and Land Use"
cdc-national-environmental-public-health-tracking-network-tracking-network,Climate,"Human Health"
  1. Run command $ php cli/tagging/assign_groups_and_tags.php

Expected behavior

Two datasets updated successfully.

Actual behavior

Second dataset updated successfully, but first one gives error:

Screen Shot 2020-11-10 at 1 35 07 PM

The log generated in ./results did not capture the error. It considered both updates successful.

@mogul
Copy link
Contributor

mogul commented Nov 12, 2020

We want to check that this is still a problem with catalog 2.8+ before we put effort into fixing it.

@mogul
Copy link
Contributor

mogul commented Nov 12, 2020

(Iceboxing inside the catalog 2.8 epic for potential review after catalog 2.8+ is live)

@FuhuXia
Copy link
Member Author

FuhuXia commented Feb 24, 2021

Same issue on Catalog with CKAN 2.8.

@mogul mogul added the support Issues from agency requests or affecting users label Jul 22, 2021
@jbrown-xentity jbrown-xentity self-assigned this Oct 5, 2021
@jbrown-xentity
Copy link
Contributor

So by hitting the API directly I was able to get to the bottom of this. The test case has a tag, doc/noaa/nos/nms . CKAN only allows alphanumeric characters or symbols -_. in tags. Because the harvester is able to bypass CKAN validation logic, these datasets are created but cannot ever be updated.

Need to research

  1. Why tags of this format are being used (seems to denote hierarchy, is it widespread?)
  2. Why CKAN doesn't allow symbols in tags

@jbrown-xentity
Copy link
Contributor

jbrown-xentity commented Oct 5, 2021

Turns out there are ~2K tags with a / in the name (api call). Making these tags compliant with CKAN is non-trivial. Some examples:

  • biomass densities(kg/sq km) denotes a unit of measure
  • big bear/baby bear bays state marine park denotes multiple regions, but in conjunction (shouldn't split on /)
  • 2013/12 denoting a date
  • 1/3 arc-second dem denoting a geographic zoom level (in ideal world would be a legitimate tag)

Splitting on / as we do with ,;> would result in some strange behavior. Changing to a valid character .-_ would also result in some user confusion.

There are also a large number of other bad tag values that could be used that should be cleaned up as well. The api /api/action/tag_list takes too long to return, will need to use a DB query to find complete list.

@jbrown-xentity
Copy link
Contributor

jbrown-xentity commented Oct 6, 2021

Tracked down where all this logic occurs.
The spatial harvester overrides the tag schema validation during harvesting, so tags that are not valid by default are inserted. Spatial defined tag validation vs CKAN default tag validation.
This is made possible by this logic in ckan core for package_create (similar in package_update).
We have a few options:

  1. Create our own tag validation function that overrides the current default CKAN that allows for tags that currently exist, so that we can edit datasets (specific issue that this ticket addresses)
  2. Make a PR upstream to remove the tag customization on harvesting and add to our currently used fork. This would not solve the issue in the short term, and we would have to figure out how to handle tags that are invalid in the long term (should we remove/ignore them? Try to make them compliant? etc).

I propose we attempt option 1 now (use geodatagov to overwrite the create_package_schema, well documented how to do this), and make a new ticket for option 2 as the long term fix.

@jbrown-xentity
Copy link
Contributor

For future, found logic by tracing from the error message backwards:

tag_name_validator: https://github.com/ckan/ckan/blob/ckan-2.9.4/ckan/logic/validators.py#L453-L459
tag_string_convert: https://github.com/ckan/ckan/blob/6731c5a821a6a5f4bdaa20f4e793e0b6ba44f823/ckan/logic/validators.py#L468-L487
default_create_package_schema: https://github.com/ckan/ckan/blob/ckan-2.9.4/ckan/logic/schema.py#L134 (I believe this is where the validator is defined)
create_package_schema: https://github.com/ckan/ckan/blob/ckan-2.9.4/ckan/lib/plugins.py#L339-L340
package_create: https://github.com/ckan/ckan/blob/ckan-2.9.4/ckan/logic/action/create.py#L157

This last is the key point, either it takes the default or takes the schema defined in the context. Need to examine if geodatagov, spatial, or harvester override this variable.

@jbrown-xentity
Copy link
Contributor

jbrown-xentity commented Oct 6, 2021

For reference, closely related to #1827 and #1828

@FuhuXia
Copy link
Member Author

FuhuXia commented Oct 6, 2021

Option 1 sounds good, solving the issue without any side effect. Just document it well so that once option 2 is completed and all existing tag fixed, we can remove option1 code.

@nickumia-reisys
Copy link
Contributor

This is a messy topic. As a sort of summary, here's what I gather from above,

  • CKAN tagging only supports alphanumeric character and .-_
  • ckanext-spatial tagging overrode this functionality to allow uppercase characters, and (accidentally?) opened up additional characters, specifically the / in question.
  • Users have been (unknowingly?) utilizing the bug to create improper tags.
  • We have a couple of options,
    • (Create a new feature) Support the new bug, customize extensions to accommodate the extra characters.
    • (Fix the bug) Figure out how to convey the same information with less characters in the tags.

We already do custom validation in ckanext_usmetadata, I'm not opposed to creating the new feature. I think it has to be decided whether it helps our mission to restrict the tagging capability. It also is a consideration of how quick this bug needs to be fixed.

@jbrown-xentity All in all, great find!

@jbrown-xentity
Copy link
Contributor

The changes that were made to resolve this ticket:

  • Update ckan-php-manager script to log dataset update failures
  • Update spatial on py3 to use tag validation appropriately

Since this tagging process is not a heavily utilized feature, we will leave this to be fully live on Py3 on cloud.gov. Attaching to the epic to close once live.

@FuhuXia
Copy link
Member Author

FuhuXia commented Aug 30, 2022

Verified it is fixed on catalog on cloud.gov.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
support Issues from agency requests or affecting users
Projects
Archived in project
Development

No branches or pull requests

5 participants