[BUG] OVN EIP, FIP External Gateway is not configured. #5050

Open
inyongma1 opened this issue Mar 4, 2025 · 18 comments
Labels
bug Something isn't working

Comments

@inyongma1

inyongma1 commented Mar 4, 2025

Kube-OVN Version

v1.13

Kubernetes Version

v1.30

Operation-system/Kernel Version

"Rocky Linux 8.10 (Green Obsidian)"

Description

vnode-103-177 and vnode-117-46 are the external gateway nodes (they are also worker nodes), and vnode-103-176 is the master node, from which I removed the taint so that pods can be scheduled on it.

NAME            STATUS   ROLES           AGE     VERSION
vnode-103-176   Ready    control-plane   3d20h   v1.30.10
vnode-103-177   Ready    <none>          3d20h   v1.30.10
vnode-117-46    Ready    <none>          3d20h   v1.30.10

vnode-103-177 ovn.kubernetes.io/external-gw=true
vnode-117-46 ovn.kubernetes.io/external-gw=true

vpc1 starter-backend-7ff5f85b46-8d9gh 1/1 Running 0 16m 192.168.0.4 vnode-103-176

starter-backend-7ff5f85b46-8d9gh is the pod running on the master node, which is not an external gateway node.

My question: I rebooted the two external gateway nodes (vnode-103-177, vnode-117-46), but starter-backend-7ff5f85b46-8d9gh can still be reached through its EIP. I also checked the OpenFlow flows on the master node (the non-gateway node vnode-103-176) and found a NAT rule there:

cookie=0x8f0de392, duration=1176.940s, table=15, n_packets=64, n_bytes=4994, idle_age=466, priority=100,ip,reg14=0x1,metadata=0x5,nw_dst=10.9.101.9 actions=ct(commit,table=16,zone=NXM_NX_REG11[0..15],nat(dst=192.168.0.4))
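
For reference, a dump like the one above can be produced roughly like this (a sketch, assuming the kubectl-ko plugin is installed; the grep filter on the EIP address is only illustrative):

kubectl ko ofctl vnode-103-176 dump-flows br-int | grep 10.9.101.9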

I expected that if all external gateway nodes are down, the pod should no longer be reachable via its OVN EIP.

This is my configuration:

apiVersion: kubeovn.io/v1
kind: ProviderNetwork
metadata:
  name: external0
spec:
  defaultInterface: eth0
---
apiVersion: kubeovn.io/v1
kind: Vlan
metadata:
  name: vlan0
spec:
  id: 0
  provider: external0
---
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: external0
spec:
  protocol: IPv4
  cidrBlock: 10.9.0.0/16
  gateway: 10.9.0.1
  vlan: vlan0
  excludeIps:
  - 10.9.0.1..10.9.101.1
---
apiVersion: v1
kind: Namespace
metadata:
  name: vpc1
---
kind: Vpc
apiVersion: kubeovn.io/v1
metadata:
  name: vpc1
spec:
  namespaces:
  - vpc1
  enableExternal: true
---
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: vpc1-subnet1
spec:
  cidrBlock: 192.168.0.0/24
  default: false
  disableGatewayCheck: false
  disableInterConnection: true
  enableEcmp: true
  gatewayNode: ""
  gatewayType: distributed
  #gatewayType: centralized
  natOutgoing: false
  private: false
  protocol: IPv4
  provider: ovn
  vpc: vpc1
  namespaces:
  - vpc1
---
kind: OvnEip
apiVersion: kubeovn.io/v1
metadata:
  name: eip-static
spec:
  externalSubnet: external0
  type: nat
---
kind: OvnFip
apiVersion: kubeovn.io/v1
metadata:
  name: eip-static
spec:
  ovnEip: eip-static
  ipName: starter-backend-7ff5f85b46-8d9gh.vpc1

Steps To Reproduce

  1. I followed this page; the configuration is the same as https://kubeovn.github.io/docs/stable/en/advance/ovn-eip-fip-snat/
  2. The OVN EIP and FIP are configured (see the check sketched below).
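
A quick way to confirm the resources exist might look like this (a sketch; oeip and ofip are assumed here to be the short names of the OvnEip and OvnFip CRDs):

kubectl get oeip eip-static
kubectl get ofip eip-static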

Current Behavior

Even after both external gateway nodes are rebooted, the pod can still be reached via the OVN EIP:

curl 10.9.101.9
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Expected Behavior

If both external gateway nodes are rebooted (down), the pod should not be reachable via its OVN EIP.

@inyongma1 inyongma1 added the bug Something isn't working label Mar 4, 2025

dosubot bot commented Mar 4, 2025

The continued communication via OEIP even after rebooting both external gateway nodes is likely due to a fix implemented in Kube-OVN. This fix ensures that IP and route information is properly transferred during a node reboot, allowing for uninterrupted communication. The specific fix is mentioned in the changelog entry:

This means that even if the external gateway nodes are rebooted, the system is designed to maintain communication by transferring necessary IP and route information to ensure continuity.


@zbb88888
Collaborator

zbb88888 commented Mar 4, 2025

Expected Behavior
If both external gateway nodes are rebooted (down), the pod should not be reachable via its OVN EIP.

Do you mean the opposite: if the two external gw nodes are rebooted, the EIP should still be reachable?

@inyongma1
Author

inyongma1 commented Mar 4, 2025

I mean that if both external gateway nodes are down, an external client should not be able to reach the pod via the OVN EIP,
because packets from outside should pass through an external gateway node.

@inyongma1
Author

There are 3 nodes in the cluster: one control plane node and two worker nodes. I removed the taint from the control plane node so that pods can run on it. However, if the two worker nodes are set up as external gateways and both of these external gateway nodes go down, then there should no longer be an external gateway in the cluster, meaning that external communication should fail. Yet, in this case, it was confirmed that communication still occurs via the control plane.

@zbb88888
Collaborator

zbb88888 commented Mar 5, 2025

You have 3 nodes, and only two of them are gw nodes. If both gw nodes are down, the EIP should fail to connect.

Do you mean the EIP is still working?

You can use k ko nbctl show to check the gw nodes before your test.
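
For example, something like this could show which chassis back the external router port (a sketch; vpc1-external0 is the router port name implied by the configuration above, and lrp-get-gateway-chassis lists its gateway chassis with priorities):

kubectl ko nbctl show vpc1
kubectl ko nbctl lrp-get-gateway-chassis vpc1-external0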

@inyongma1
Author

inyongma1 commented Mar 5, 2025

Do you mean the EIP is still working?
Answer: yes, it is still working.

Both gateway nodes are down, but the NAT rule still remains:
router 2d09e8e5-856f-4487-9cf5-d01d18a4aa0b (vpc1)
    port vpc1-vpc1-subnet1
        mac: "3a:aa:5f:38:fd:5f"
        networks: ["192.168.0.1/24"]
    port vpc1-external0
        mac: "36:98:75:25:01:a4"
        networks: ["10.9.101.8/16"]
        gateway chassis: [df13bc9c-f6bb-4162-ba49-8c8798746d31 b9e3320b-b025-4b2f-b7ba-0520725de523]
    nat a023aada-62d9-4df1-882a-191dc3fd4d6f
        external ip: "10.9.101.9"
        logical ip: "192.168.0.4"
        type: "dnat_and_snat"

@zbb88888
Collaborator

zbb88888 commented Mar 6, 2025

Where is the client that you use to ping the EIP located?

@inyongma1
Author

inyongma1 commented Mar 6, 2025

Another node that is not in the Kubernetes cluster.
That node's IP is in the same CIDR as the public (EIP) address.

@zbb88888
Collaborator

zbb88888 commented Mar 6, 2025

Does "node down" mean the node is powered off, not just in the middle of a restart?

Could you run tcpdump host 10.9.101.9 on the only remaining (non-gw) node and check the packets?
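
For example, something along these lines (a sketch; the interface name eth0 is an assumption based on the ProviderNetwork defaultInterface above):

tcpdump -i eth0 -nne host 10.9.101.9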

@inyongma1
Author

01:58:51.730161 3c:ec:ef:7f:49:23 > 8a:72:bb:b8:26:3f, ethertype IPv4 (0x0800), length 98: 10.9.20.130 > 10.9.101.9: ICMP echo request, id 10492, seq 8, length 64
01:58:51.730210 8a:72:bb:b8:26:3f > 3c:ec:ef:7f:49:23, ethertype IPv4 (0x0800), length 98: 10.9.101.9 > 10.9.20.130: ICMP echo reply, id 10492, seq 8, length 64
01:58:52.754429 3c:ec:ef:7f:49:23 > 8a:72:bb:b8:26:3f, ethertype IPv4 (0x0800), length 98: 10.9.20.130 > 10.9.101.9: ICMP echo request, id 10492, seq 9, length 64
01:58:52.754496 8a:72:bb:b8:26:3f > 3c:ec:ef:7f:49:23, ethertype IPv4 (0x0800), length 98: 10.9.101.9 > 10.9.20.130: ICMP echo reply, id 10492, seq 9, length 64
01:58:53.778219 3c:ec:ef:7f:49:23 > 8a:72:bb:b8:26:3f, ethertype IPv4 (0x0800), length 98: 10.9.20.130 > 10.9.101.9: ICMP echo request, id 10492, seq 10, length 64
01:58:53.778285 8a:72:bb:b8:26:3f > 3c:ec:ef:7f:49:23, ethertype IPv4 (0x0800), length 98: 10.9.101.9 > 10.9.20.130: ICMP echo reply, id 10492, seq 10, length 64
01:58:54.802328 3c:ec:ef:7f:49:23 > 8a:72:bb:b8:26:3f, ethertype IPv4 (0x0800), length 98: 10.9.20.130 > 10.9.101.9: ICMP echo request, id 10492, seq 11, length 64
01:58:54.802375 8a:72:bb:b8:26:3f > 3c:ec:ef:7f:49:23, ethertype IPv4 (0x0800), length 98: 10.9.101.9 > 10.9.20.130: ICMP echo reply, id 10492, seq 11, length 64
01:58:55.826239 3c:ec:ef:7f:49:23 > 8a:72:bb:b8:26:3f, ethertype IPv4 (0x0800), length 98: 10.9.20.130 > 10.9.101.9: ICMP echo request, id 10492, seq 12, length 64
01:58:55.826295 8a:72:bb:b8:26:3f > 3c:ec:ef:7f:49:23, ethertype IPv4 (0x0800), length 98: 10.9.101.9 > 10.9.20.130: ICMP echo reply, id 10492, seq 12, length 64
01:58:56.850282 3c:ec:ef:7f:49:23 > 8a:72:bb:b8:26:3f, ethertype IPv4 (0x0800), length 98: 10.9.20.130 > 10.9.101.9: ICMP echo request, id 10492, seq 13, length 64
01:58:56.850336 8a:72:bb:b8:26:3f > 3c:ec:ef:7f:49:23, ethertype IPv4 (0x0800), length 98: 10.9.101.9 > 10.9.20.130: ICMP echo reply, id 10492, seq 13, length 64

@zbb88888
Collaborator

zbb88888 commented Mar 6, 2025

router 2d09e8e5-856f-4487-9cf5-d01d18a4aa0b (vpc1)
    port vpc1-vpc1-subnet1
        mac: "3a:aa:5f:38:fd:5f"
        networks: ["192.168.0.1/24"]
    port vpc1-external0
        mac: "36:98:75:25:01:a4"
        networks: ["10.9.101.8/16"]
        gateway chassis: [df13bc9c-f6bb-4162-ba49-8c8798746d31 b9e3320b-b025-4b2f-b7ba-0520725de523]
    nat a023aada-62d9-4df1-882a-191dc3fd4d6f
        external ip: "10.9.101.9"
        logical ip: "192.168.0.4"
        type: "dnat_and_snat"

The NAT rule still existing is not a problem by itself.

Please run k ko sbctl show to check the gateway chassis: [df13bc9c-f6bb-4162-ba49-8c8798746d31 b9e3320b-b025-4b2f-b7ba-0520725de523]

Check each chassis's hostname (node name) and make sure that node really is shut down.

Please also show the output of kubectl get nodes. Is the shut-down node in NotReady status?

@inyongma1
Author

Yes, the nodes are NotReady:
[root@vnode-103-176 ~]# k get nodes
NAME            STATUS     ROLES           AGE    VERSION
vnode-103-176   Ready      control-plane   6d5h   v1.30.10
vnode-103-177   NotReady   <none>          6d5h   v1.30.10
vnode-117-46    NotReady   <none>          6d5h   v1.30.10

[root@vnode-103-176 ~]# k ko sbctl show
Chassis "df13bc9c-f6bb-4162-ba49-8c8798746d31"
hostname: vnode-103-177
Encap geneve
ip: "10.9.103.177"
options: {csum="true"}
Port_Binding kube-ovn-pinger-xqpfk.kube-system
Port_Binding node-vnode-103-177
Chassis "257fcb36-ca3d-4da8-a931-d9e73b2ff76e"
hostname: vnode-103-176
Encap geneve
ip: "10.9.103.176"
options: {csum="true"}
Port_Binding starter-backend-7ff5f85b46-9rhcp.vpc1
Port_Binding kube-ovn-pinger-5mnlm.kube-system
Port_Binding coredns-55cb58b774-tfwwl.kube-system
Port_Binding starter-backend-7ff5f85b46-8d9gh.vpc1
Port_Binding starter-backend-7ff5f85b46-6x8g2.vpc1
Port_Binding starter-backend-7ff5f85b46-tvsxr.vpc1
Port_Binding starter-backend-7ff5f85b46-khc7b.vpc1
Port_Binding starter-backend-7ff5f85b46-5qp27.vpc1
Port_Binding coredns-55cb58b774-m7bp5.kube-system
Port_Binding starter-backend-7ff5f85b46-r7gqg.vpc1
Port_Binding starter-backend-7ff5f85b46-rkz4f.vpc1
Port_Binding starter-backend-7ff5f85b46-jknq8.vpc1
Port_Binding starter-backend-7ff5f85b46-wmw4q.vpc1
Port_Binding node-vnode-103-176
Chassis "b9e3320b-b025-4b2f-b7ba-0520725de523"
hostname: vnode-117-46
Encap geneve
ip: "10.9.117.46"
options: {csum="true"}
Port_Binding cr-ovn-cluster-external0
Port_Binding kube-ovn-pinger-bkj4k.kube-system
Port_Binding cr-vpc1-external0
Port_Binding node-vnode-117-46

@zbb88888
Collaborator

zbb88888 commented Mar 7, 2025

ok, it is an issue.

[root@vnode-103-176 ~]# k get nodes
NAME            STATUS     ROLES           AGE    VERSION
vnode-103-176   Ready      control-plane   6d5h   v1.30.10
vnode-103-177   NotReady   <none>          6d5h   v1.30.10
vnode-117-46    NotReady   <none>          6d5h   v1.30.10

Chassis "257fcb36-ca3d-4da8-a931-d9e73b2ff76e"
hostname: vnode-103-176

Chassis "b9e3320b-b025-4b2f-b7ba-0520725de523"
hostname: vnode-117-46

router 2d09e8e5-856f-4487-9cf5-d01d18a4aa0b (vpc1)
    port vpc1-vpc1-subnet1
        mac: "3a:aa:5f:38:fd:5f"
        networks: ["192.168.0.1/24"]
    port vpc1-external0
        mac: "36:98:75:25:01:a4"
        networks: ["10.9.101.8/16"]
        gateway chassis: [df13bc9c-f6bb-4162-ba49-8c8798746d31 b9e3320b-b025-4b2f-b7ba-0520725de523]
    nat a023aada-62d9-4df1-882a-191dc3fd4d6f
        external ip: "10.9.101.9"
        logical ip: "192.168.0.4"
        type: "dnat_and_snat"

vnode-103-176 is not a gw node, but the EIP traffic goes through it.

@oilbeater could you please help take a look at this issue?

@inyongma1
Author

inyongma1 commented Mar 11, 2025

@oilbeater @zbb88888 please help me.

@oilbeater
Collaborator

Can you check which external IP address the Pod uses when it visits the external network? And it's better to run kubectl ko trace to see the logical flows for the external traffic.
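
A trace invocation might look roughly like this (a sketch; the pod name, destination IP, and protocol are taken from earlier in this thread and are only examples):

kubectl ko trace vpc1/starter-backend-7ff5f85b46-8d9gh 10.9.20.130 icmp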

@zbb88888
Collaborator

Oh, an OVN FIP can be bound to the node where the pod is located, which is the distributed mode.
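
One way to check this (a sketch: in OVN, a dnat_and_snat rule is handled on the pod's own chassis, i.e. distributed, when external_mac and logical_port are set on the NAT entry; otherwise it is centralized on the gateway chassis):

kubectl ko nbctl --columns=external_ip,logical_ip,type,external_mac,logical_port find nat type=dnat_and_snat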

@oilbeater
Collaborator

@zbb88888 can you find the related document about distributed dnat_and_snat? I also remember it's distributed, but I cannot find the document.

@zbb88888
Collaborator
