OPRUN-3692: Olmv1-catalogd tests for API endpoints #29580
Conversation
@anik120: This pull request references OPRUN-3692 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
test/extended/olm/olmv1.go
Outdated
	}
})

g.It("[OCPFeatureGate:NewOLMCatalogdAPIV1Metas] should serve the /v1/api/metas API endpoint", func(ctx g.SpecContext) {
Is this valid? Can [OCPFeatureGate:NewOLMCatalogdAPIV1Metas] be added to an It block?
No, you should put it in its own g.Describe(). Yes, you'll possibly duplicate a bunch of stuff, but that's OK.
You can put each of these g.It() blocks you've added in their own g.Describe() so they count as new tests, since you need to have 5 for each Feature Gate.
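For illustration only, a rough sketch of what that restructuring could look like (the suite/tag names are guesses based on the diff above, not the PR's actual code):

// Hypothetical sketch: move the feature-gate tag up into a dedicated Describe block
// so each new spec is registered under its own suite.
var _ = g.Describe("[sig-olmv1][OCPFeatureGate:NewOLMCatalogdAPIV1Metas][Skipped:Disconnected] OLMv1 Catalogs", func() {
	defer g.GinkgoRecover()

	g.It("should serve the /v1/api/metas API endpoint", func(ctx g.SpecContext) {
		// ...same body as the original g.It, e.g. calling verifyAPIEndpoint(...)
	})
})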
c09fb4d to 902037a
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: anik120. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
test/extended/olm/olmv1.go
Outdated
func verifyAPIEndpoint(ctx g.SpecContext, urlPath string) {
	var lastError error

	// Retry API call a few times to handle potential connection issues
I feel like we should not be allowing ourselves to retry. IIUC, this would mean that as long as our endpoint worked 1/15 times, this test would pass.
Maybe instead, we should implement a load test with 100 concurrent requests and verify:
- 100% response rate with 200 status code (for valid requests)
- Some reasonable 99th percentile response time threshold check (rough sketch below).
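A rough sketch of what such a check could look like (not from the PR; assumes the usual context, fmt, net/http, sort, sync, and time imports, a pre-built client and endpointURL, and a purely illustrative 2s threshold):

// loadTestEndpoint is a hypothetical helper: fire 100 concurrent requests, require
// every one to return 200, and check the 99th-percentile latency against a threshold.
func loadTestEndpoint(ctx context.Context, client *http.Client, endpointURL string) error {
	const n = 100
	durations := make([]time.Duration, 0, n)
	errs := make(chan error, n)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			start := time.Now()
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpointURL, nil)
			if err != nil {
				errs <- err
				return
			}
			resp, err := client.Do(req)
			if err != nil {
				errs <- err
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				errs <- fmt.Errorf("unexpected status %d", resp.StatusCode)
				return
			}
			mu.Lock()
			durations = append(durations, time.Since(start))
			mu.Unlock()
		}()
	}
	wg.Wait()
	close(errs)
	if err, ok := <-errs; ok {
		return err // any error or non-200 breaks the 100%-success requirement
	}

	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	p99 := durations[len(durations)*99/100-1] // 99th smallest of 100 samples
	if p99 > 2*time.Second {                  // threshold is a placeholder
		return fmt.Errorf("p99 latency %v exceeds threshold", p99)
	}
	return nil
}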
Looking into writing a load test now.
There's also no retry in the new Job; it's a one-and-done operation after which the result is checked with a timeout (exit 0/exit 1).
		verifyAPIEndpoint(ctx, "https://localhost:8443/catalogs/openshift-community-operators/api/v1/metas?schema=olm.package")
	})
})
Is there a place to implement upgrade interruption tests?
Another useful test to add (but not specific to the new metas feature gate):
- Start a loop that, on every iteration, gets a ClusterCatalog's status.URLs value and makes a query to a valid endpoint under that URL.
- Do the upgrade.
The expectation would be that the loop begins to get connection errors or 404 responses as the new pod starts up, but within a certain threshold (1m?) begins responding with 200 responses again.
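A rough sketch of the polling half of that idea (getCatalogBaseURL is a hypothetical helper that reads the ClusterCatalog's status URL; the interval and endpoint path are placeholders):

// measureOutage polls the catalog endpoint during an upgrade and reports the longest
// window in which it was unreachable or returned a non-200 response.
func measureOutage(ctx context.Context, client *http.Client, getCatalogBaseURL func(context.Context) (string, error)) time.Duration {
	var outageStart time.Time
	var longestOutage time.Duration

	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return longestOutage // stop polling once the upgrade test is finished
		case <-ticker.C:
			healthy := false
			if base, err := getCatalogBaseURL(ctx); err == nil {
				if resp, err := client.Get(base + "/api/v1/all"); err == nil {
					healthy = resp.StatusCode == http.StatusOK
					resp.Body.Close()
				}
			}
			if healthy {
				outageStart = time.Time{} // endpoint recovered; reset the window
				continue
			}
			if outageStart.IsZero() {
				outageStart = time.Now()
			}
			if d := time.Since(outageStart); d > longestOutage {
				longestOutage = d
			}
		}
	}
}

The upgrade test could then assert that the returned duration stays under the chosen threshold (e.g. 1m).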
By upgrade do you mean an openshift upgrade? A cursory look did not reveal anything of that nature in this repository; I'll have to research a little more. I know openshift does upgrade tests, I'm just not sure if those tests are written in this repository.
test/extended/olm/olmv1.go
Outdated
// Wait for port-forward to be established
time.Sleep(2 * time.Second)
The port-forward isn't ready by the time the oc.Run("port-forward") command returns?
I was just being extra cautious
test/extended/olm/olmv1.go
Outdated
// Start port-forward in background
_, _, _, err := oc.AsAdmin().Run("port-forward").Args("-n", "openshift-catalogd", "svc/catalogd-service", "8080:443").Background()
o.Expect(err).NotTo(o.HaveOccurred())
I could imagine port-forward hiding issues that in-cluster clients would encounter (e.g. in-cluster DNS).
Doing a port-forward based test is probably a good thing to do if that is functionality we expect to work, but I also think we should run a Job workload that can perform a similar test using cluster networking.
Yup, a Job is actually perfect for this use case, and that totally slipped my mind 😮💨. Updated the PR to use a Job instead of port forwarding.
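For reference, a rough sketch of a Job that exercises the endpoint over cluster networking (the image, namespace handling, URL, and TLS flags here are placeholders, not the PR's actual job definition; assumes the standard k8s.io/api batch/v1, core/v1, and apimachinery metav1 imports):

// newEndpointCheckJob builds a one-shot Job whose pod curls the given catalogd URL
// and fails on a non-2xx response. CA/TLS handling is omitted for brevity.
func newEndpointCheckJob(namespace, name, endpointURL string) *batchv1.Job {
	backoffLimit := int32(0)    // don't retry failed pods
	activeDeadline := int64(60) // fail the Job if it runs longer than 60s
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			BackoffLimit:          &backoffLimit,
			ActiveDeadlineSeconds: &activeDeadline,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "check-endpoint",
						Image:   "registry.access.redhat.com/ubi9/ubi:latest", // placeholder image that ships curl
						Command: []string{"/bin/sh", "-c"},
						Args:    []string{fmt.Sprintf(`curl -sSfk -o /dev/null "%s"`, endpointURL)},
					}},
				},
			},
		},
	}
}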
test/extended/olm/olmv1.go
Outdated
func(ctx context.Context) (bool, error) {
	// Create a custom HTTP client that skips TLS verification
	tr := &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
We should not skip TLS verification. We expect real clients to connect using a trusted CA chain, so our test should use the same CAs as we expect our clients to use.
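A minimal sketch of a client wired to a CA bundle instead (the bundle path is hypothetical; on OpenShift it would typically come from the service-serving CA):

// newClientWithCA builds an HTTP client that verifies the server against the given
// CA bundle rather than setting InsecureSkipVerify.
func newClientWithCA(caPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no certificates found in %s", caPath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}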
Using a job and curling the http endpoint directly now 👍🏽
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 902037a
6dbc1d4 to d02be06
})

var _ = g.Describe("[sig-olmv1][OCPFeatureGate:NewOLM][Skipped:Disconnected] OLMv1 Catalogs /v1/api/all endpoint load test", func() {
	defer g.GinkgoRecover()
Maybe instead, we should implement a load test with 100 concurrent requests and verify:
- 100% response rate with 200 status code (for valid requests)
- Some reasonable 99th percentile response time threshold check.

@joelanford added one load test each for both endpoints. I was able to just use verifyAPIEndpoint, which checks for a 200 OK status and otherwise throws an error, so that takes care of the 100% response rate clause.
For the "reasonable time threshold" check, I've reduced the timeout for each Job to report Completed here to 30s (it was previously 1 min). Which means we're making 100 requests to each endpoint concurrently, waiting for each request to complete within 30s and report a 200 OK status.
for range 100 {
	wg.Add(1)
	go func(catalogIdx int) {
		defer wg.Done()

		g.By(fmt.Sprintf("Testing api/v1/all endpoint for catalog %s", providedCatalogs[catalogIdx]))
		verifyAPIEndpoint(ctx, oc, oc.Namespace(), providedCatalogs[catalogIdx], "all")
	}(idx)
	idx = (idx + 1) % len(providedCatalogs)
}
I'm having a really hard time grokking the way the index is working here. For clarity, can we change this to something like:
var wg sync.WaitGroup
for _, catalog := range providedCatalogs {
	for range 100 {
		wg.Add(1)
		go func() {
			defer wg.Done()
			verifyAPIEndpoint(ctx, oc, oc.Namespace(), catalog, "all")
		}()
	}
}
wg.Wait()
Also, I don't think you need separate Jobs. Might take some experimentation, but you may be able to set
spec:
  activeDeadlineSeconds: 30    # how long the job is allowed to be active
  backoffLimit: 0              # don't retry failed pods
  completions: 100             # require 100 completions
  maxFailedIndexes: 0          # allow 0 failed indexes (requires completionMode: Indexed)
  parallelism: 100             # run 100 pods at a time
  ttlSecondsAfterFinished: 10  # make the job eligible for deletion 10 seconds after it finishes
// verifyAPIEndpoint runs a job to validate the given service endpoint of a ClusterCatalog
func verifyAPIEndpoint(ctx g.SpecContext, oc *exutil.CLI, namespace, catalogName, endpoint string) {
	jobName := fmt.Sprintf("test-catalog-%s-%s-%s", catalogName, endpoint, rand.String(5))
I see metas?schema=olm.package being passed as endpoint and then used in the job name. Seems like that's an invalid name? Maybe pass url.Values as a separate argument to the function?
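Something along these lines, purely illustrative (the function name, in-cluster service URL, and signature are made up, not the PR's actual code; rand is the apimachinery helper already used above):

// endpointJobNameAndURL keeps query parameters out of the Job name (which must be a
// valid DNS-1123 label) by accepting them as a separate url.Values argument.
func endpointJobNameAndURL(catalogName, endpoint string, params url.Values) (jobName, endpointURL string) {
	jobName = fmt.Sprintf("test-catalog-%s-%s-%s", catalogName, endpoint, rand.String(5))
	endpointURL = fmt.Sprintf("https://catalogd-service.openshift-catalogd.svc/catalogs/%s/api/v1/%s", catalogName, endpoint)
	if len(params) > 0 {
		endpointURL += "?" + params.Encode()
	}
	return jobName, endpointURL
}

Callers would then pass, e.g., url.Values{"schema": []string{"olm.package"}} for the metas endpoint.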
if job.Status.Succeeded > 0 {
	return true, nil
}
if job.Status.Failed > 0 {
	return false, fmt.Errorf("job failed")
}
Looking at the Job conditions, it seems like we can check for type: Complete (success) or type: Failed (failed). Maybe we should also consider looking at the difference between status.completionTime and status.startTime for reporting/checking duration.
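A rough sketch of checking the conditions plus duration (assumes the standard batch/v1 and core/v1 types; the logging is illustrative):

// jobFinished reports whether the Job has reached a terminal condition, returning an
// error for type: Failed and logging the run duration on type: Complete.
func jobFinished(job *batchv1.Job) (bool, error) {
	for _, cond := range job.Status.Conditions {
		if cond.Status != corev1.ConditionTrue {
			continue
		}
		switch cond.Type {
		case batchv1.JobComplete:
			if job.Status.StartTime != nil && job.Status.CompletionTime != nil {
				fmt.Printf("job %s completed in %s\n", job.Name, job.Status.CompletionTime.Sub(job.Status.StartTime.Time))
			}
			return true, nil
		case batchv1.JobFailed:
			return true, fmt.Errorf("job %s failed: %s", job.Name, cond.Message)
		}
	}
	return false, nil
}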
test/extended/olm/olmv1.go
Outdated
set -ex
response=$(curl -s -k "%s" || echo "ERROR: Failed to access endpoint")
if [[ "$response" == ERROR* ]]; then
  echo "$response"
  exit 1
fi
echo "$response" > /tmp/api-response

# check if response can be parsed as new line delimited JSON
if cat /tmp/api-response | jq -s . > /dev/null 2>&1; then
  echo "Valid JSON response detected"
  exit 0
fi
echo "ERROR: Invalid JSON response"
exit 1
This makes just a single request per pod, which probably isn't quite as efficient as if we could use an http load testing program to make a bunch of concurrent requests from a single process.
Not sure where to find an image with an http load tester available to the origin suite, but if we can find one, it would likely provide much more useful information (e.g. a histogram of request durations).
Job Failure Risk Analysis for sha: d02be06
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: d02be06
New tests seen in this PR at sha: d02be06
Introduces tests for the new `api/v1/metas` endpoint when the NewOLMCatalogdAPIV1Metas feature gate is enabled. Signed-off-by: Anik Bhattacharjee <[email protected]>
d02be06 to c93d38e
@anik120: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: c93d38e
New tests seen in this PR at sha: c93d38e
Introduces tests for the new api/v1/metas endpoint when the NewOLMCatalogdAPIV1Metas feature gate is enabled.