[Yarn 3+] fallback to inspecting package.json if no git remote origin defined #855

Open
nickboldt opened this issue Mar 6, 2025 · 13 comments

Comments

@nickboldt

nickboldt commented Mar 6, 2025

In the case of Yarn 3+, cachi2 uses git remote show origin to determine the VCS URL of the main package (the one that is being prefetched) and of all the other packages that might exist in the repository.

This means that you can only run cachi2 locally for a yarn3+ resolution if you're in a git repo.

Without a remote repo configured, you might see:

InvalidGitRepositoryError for file: URL

The workaround (if you're not already in a git repo) is to:

git init; git remote add origin git@url

But as suggested in this thread it might be better if a fallback behaviour was implemented where the package.json could be introspected instead, if no git origin was defined.
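
To make the proposal concrete, here is a minimal sketch of what introspecting package.json could look like. It is not cachi2 code: the function name is hypothetical, and it only relies on the npm convention that the repository field is either a plain string or an object with a url key.

```python
import json
from pathlib import Path
from typing import Optional


def repository_url_from_package_json(project_dir: str) -> Optional[str]:
    """Return the repository URL declared in package.json, or None.

    Hypothetical helper for illustration only; not part of cachi2.
    """
    package_json = Path(project_dir) / "package.json"
    try:
        data = json.loads(package_json.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed package.json: nothing to fall back on.
        return None

    if not isinstance(data, dict):
        return None

    repository = data.get("repository")
    # npm accepts either a plain string or an object with a "url" key.
    if isinstance(repository, str):
        return repository or None
    if isinstance(repository, dict):
        url = repository.get("url")
        if isinstance(url, str) and url:
            return url
    return None
```

Note that the value comes from the project itself and is not verified against any real remote, so it would be less trustworthy than what git reports.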

@nickboldt
Author

See also this PR to update the docs with a known pitfall + workaround, until the above is implemented:

https://github.com/hermetoproject/cachi2/pull/856/files

@brunoapimentel
Contributor

brunoapimentel commented Mar 6, 2025

This behavior is consistent across all package managers, since there are different needs for the metadata that is tied to a git repo (such as tags). The suggestion I brought up was that we change this fundamental premise and allow users to prefetch content from a folder that is not a Git repository.

This would result in less accurate data in the SBOM (we wouldn't know the VCS URL for the repository packages, for instance), but that is probably not a big issue for local testing, and the fallback behavior is probably better than the strict failure Cachi2 produces today.
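
As a rough illustration of the difference between a strict failure and a fallback, the sketch below returns None instead of raising when no origin can be determined. It is not how cachi2 is implemented: the helper name is hypothetical and it shells out to git remote get-url origin purely for the sake of a self-contained example.

```python
import subprocess
from typing import Optional


def origin_url_or_none(project_dir: str) -> Optional[str]:
    """Return the URL of the "origin" remote, or None if it cannot be determined.

    Hypothetical helper for illustration only; not part of cachi2.
    """
    try:
        result = subprocess.run(
            ["git", "-C", project_dir, "remote", "get-url", "origin"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed or could not be executed.
        return None
    if result.returncode != 0:
        # Not a git repository, or no remote named "origin".
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    url = origin_url_or_none(".")
    if url is None:
        print("no git origin found; the VCS URL would be missing from the output")
    else:
        print(f"VCS URL: {url}")
```

The caller still has to decide what to do with a missing value, which is where the trade-off in SBOM accuracy shows up.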

@eskultety
Member

but that is probably not a big issue for local testing, and the fallback behavior is probably better than the strict failure Cachi2 produces today.

I disagree. The suggested fallback is only a ticking time bomb IMO, leading to more errors in the future when we actually try to query metadata which would simply not be present. IOW, we'd have to make sure on the global level that we first run a git fetch to download all git objects before doing anything else, because the suggested workaround of adding a git remote will only cover a very small portion of the invocations and metadata we query across all package manager backends.
I don't think the workaround here provides the level of UX justifying the change itself in the name of decreasing (if everything works) the quality of our SBOM.
Having a git repository on the input is a common requirement for any CI/CD pipeline solution, and in our case it's also a vital source of metadata which we would not be able to gather without asking the user for it (another trust problem). So if anything, we need to improve our docs to state this expectation of ours regarding input sources in big fat bold text.

@brunoapimentel
Contributor

I disagree. The suggested fallback is only a ticking time bomb IMO, leading to more errors in the future when we actually try to query metadata which would simply not be present. IOW, we'd have to make sure on the global level that we first run a git fetch to download all git objects before doing anything else, because the suggested workaround of adding a git remote will only cover a very small portion of the invocations and metadata we query across all package manager backends. I don't think the workaround here provides the level of UX justifying the change itself in the name of decreasing (if everything works) the quality of our SBOM. Having a git repository on the input is a common requirement for any CI/CD pipeline solution, and in our case it's also a vital source of metadata which we would not be able to gather without asking the user for it (another trust problem). So if anything, we need to improve our docs to state this expectation of ours regarding input sources in big fat bold text.

I think we should strive to make Cachi2 work with the least amount of hassle for the user. I don't think the argument of a lower-quality SBOM is strong here. Cachi2 should simply work with the data it can gather. If it is a git repo, it will potentially have more relevant data; otherwise, it has less relevant data. That is quite different from having arbitrary code execution and getting unknown components into the output folder. Local use cases are also valid, and one could be using a VCS other than git.

This is just one of many things I'd like to change regarding UX, but in my mind, the overall goal is to make Cachi2 easy and intuitive to use. I should be able to download Cachi2 and process my local project without having to go over documentation. We should strive to give a hard failure only on things that are really causing security issues.

@eskultety
Member

eskultety commented Mar 7, 2025

I think we should strive to make Cachi2 work with the least amount of hassle for the user

If the user is able to do git init; git remote add then I'm sure they should also be able to do git clone, so I'm failing to see what kind of hassle we're talking about in this particular case. The only argument I can see justifying this would be if the user were unable to access the repo, in which case though I'm asking how it is possible that they got the sources but were not able to access the source repo itself.

I should be able to download Cachi2 and process my local project without having to go over documentation.

This is an unachievable goal the more features you pack in. Sooner or later a user will be required to go over the documentation, and that is expected and encouraged; hence we need to invest much more time and effort into improving our docs so that many pitfalls/gotchas are covered and most of the issues can ideally be resolved via "self-service". Other than that though, I think this statement is very bold and dangerous at the same time, because you very much need to strike a balance between putting in code just for user convenience for the sake of UX and not turning it into a living maintenance nightmare. Docs are great and I don't think engineers are afraid of them; they just get frustrated with their lack of quality OR their complete absence, hence my point.

to make Cachi2 easy and intuitive to use

Intuitive to use doesn't necessarily have to mean going against our mission of providing highly accurate SBOMs which is our top selling point ATM and the user needs to understand that - again, if they don't -> we need to improve the docs to make it clear.

We should strive to give a hard failure only on things that are really causing security issues.

Security issues are not the only types of issues that may impede our ability to process an input. E.g. repository misconfigurations are typically not security issues, but they definitely do get in our way of processing inputs, requiring a hard failure so that the user can go and address them before re-running. We simply cannot have a workaround for every combination of output that we could not use/process in a straightforward manner; that itself is a practice which may result in injecting more bugs and security holes on our side, in the worst case scenario compromising the whole Secure Software Supply Chain pipeline (NB we are at the very beginning of it).

Local use cases are also valid, and one could be using another VCS than git

Yep, this is a valid point. However, the overall sentiment of the comment being quoted and the workaround discussed in this issue don't IMHO go in the same direction as this very statement alone. I agree that there may be other types of VCS (although not so common nowadays, I think), which means that we can simply add a tracking issue to explore VCS types other than git, keep it at low priority, and when the need comes, start investigating how we can go about it, because a lot of our current functionality would be impacted by it.

@brunoapimentel
Contributor

If the user is able to do git init; git remote add then I'm sure should also be able to do git clone

This is the hassle. I think the project should just work, not make the user bang their head, go over to the docs and try to figure out why it is giving an InvalidGitRepositoryError. You may argue it is not too much of a hassle, but I think these small things pile up to make the project a real pain to use.

Intuitive to use doesn't necessarily have to mean going against our mission of providing highly accurate SBOMs

I don't think dropping the requirement of a folder being a Git repo goes against it. Cachi2 is being as accurate as it can with the data it has access to.

I don't want to drag this into a huge discussion. I think the situation here speaks for itself: users are creating workarounds, potentially feeding fake data to generate a Git repo just to make the project work, and I think that is a clear sign of bad UX.

Let's hear what other folks have to say, and then we can decide to close or go ahead with this.

@nickboldt
Author

nickboldt commented Mar 7, 2025

The other option (as suggested in #856 (comment) ) would be to have in bold print somewhere in the statement of what Cachi2 is and what it's for... that you need to refer to a git repo because the whole reason it exists is to pull information from a cached copy of sources for a given repo and commit SHA.

If that's the case then the "initial setup / how to use" readme would just explain that you need to either:

  • git clone your-source-repo

or

  • git init && git add -A && git commit -m "initial commit" && git remote add origin https://github.com/someorg/somerepo

@eskultety
Member

eskultety commented Mar 7, 2025

NB the following

git init && git add -A && git commit -m "initial commit" && git remote add origin https://github.com/someorg/somerepo

would still not be enough unless fetch is called at the end, because ^this will still not give you the actual git objects from the origin which may get used during our git queries, still leading to an error. Once you combine it with fetch, the question immediately becomes: wouldn't git clone have been easier and more straightforward all along? :)
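
One way to see the point about missing objects is to check whether a repository has any remote-tracking refs for origin: after git remote add alone there are none, and they only appear after git fetch or git clone. The sketch below is illustrative only; the helper name is hypothetical, and which refs or objects cachi2 actually queries is not something this example claims to know.

```python
import subprocess


def has_fetched_from_origin(project_dir: str) -> bool:
    """Return True if at least one remote-tracking ref exists under refs/remotes/origin.

    Hypothetical helper for illustration only; not part of cachi2.
    """
    try:
        result = subprocess.run(
            [
                "git", "-C", project_dir,
                "for-each-ref", "--count=1", "refs/remotes/origin",
            ],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed or could not be executed.
        return False
    # Any output means a ref such as refs/remotes/origin/main is present.
    return result.returncode == 0 and bool(result.stdout.strip())
```

A directory prepared with only git init and git remote add would make this return False, while a fresh git clone would normally make it return True.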

@eskultety
Member

The other option (as suggested in #856 (comment) ) would be to have in bold print somewhere in the statement of what Cachi2 is and what it's for... that you need to refer to a git repo because the whole reason it exists is to pull information from a cached copy of sources for a given repo and commit SHA.

@nickboldt if you as our end user don't mind reading docs and would have appreciated ^this particular bit of info in the docs that would have prevented you from opening the issue in the first place, then I think we have the answer we needed :) .

@brunoapimentel
Contributor

wouldn't git clone have been easier and more straightforward all along? :)

If I understand correctly, the project that triggered this issue is not a Git repository at all, it is generated as part of a build process.

@gashcrumb

gashcrumb commented Mar 7, 2025

If I understand correctly, the project that triggered this issue is not a Git repository at all, it is generated as part of a build process.

Correct, in our case the build consists of the main application, for which we're currently building a cache, and then a number of additional packages that are artifacts that could be loaded at runtime. In this case the artifacts are currently part of the source tree here, and each of these wrappers contains a command like this which creates a yarn project (from there our tooling usually does a yarn install --immutable). For many of these wrappers the cache we create before going through the build seems to be fine, and this process all works when building from only the cache, with a couple of exceptions. I was hoping to figure out what dependencies are being hidden due to our tooling by creating a cache from one of these derived packages.

@eskultety
Member

@nickboldt I noticed you linked a private Slack thread in your issue description - this is a public OSS project so we can't reference private/unresolvable resources, please extract the relevant bits of information and put it in the description :), thanks.

@eskultety
Member

@gashcrumb @nickboldt I'm sorry for my ignorance, I'm not well versed in the Yarn ecosystem at all, so I'll need more help understanding the problem here. Based on the links provided, these plugins are still part of the main git tree, so what is the motivation to prefetch for them from extracted tarballs (i.e. with git metadata missing) instead of doing the prefetch over the main repo and specifying the subpath to each of these plugins individually, which would result in the prefetch working, wouldn't it?
I must be missing something vital here, because I'm not following the issue as filed.
