Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventually a removed flow is recreated when Kytos is under high usage (possible race condition) #12

Closed
ajoaoff opened this issue Jun 7, 2021 · 1 comment · Fixed by #18
Labels
2022.1 Kytos-ng 2022.1 bug Something isn't working

Comments

@ajoaoff
Copy link

ajoaoff commented Jun 7, 2021

Original issue opened by @italovalcy at kytos#128.

Hi folks,

The consistency check routine eventually recreates a flow when Kytos is under high usage (many requests in a short period of time), which indicates a possible race condition.

More specifically, the following end-to-end tests sometimes pass and sometimes fail:

_______________________________ TestE2EMefEline.test_070_repeat_creation_and_removal_of_ten_circuits_ten_times ________________________________

self = <tests.test_e2e_10_mef_eline.TestE2EMefEline instance at 0x7f757b5bd5f0>

    def test_070_repeat_creation_and_removal_of_ten_circuits_ten_times(self):
        for x in range(0, 10):
            evcs = {}
            for i in range(400, 410):
                payload = {
                    "name": "evc_%s" % i,
                    "enabled": True,
                    "dynamic_backup_path": True,
                    "uni_a": {
                        "interface_id": "00:00:00:00:00:00:00:01:1",
                        "tag": {"tag_type": 1, "value": i}
                    },
                    "uni_z": {
                        "interface_id": "00:00:00:00:00:00:00:02:1",
                        "tag": {"tag_type": 1, "value": i}
                    }
                }
                api_url = KYTOS_API + '/mef_eline/v2/evc/'
                response = requests.post(api_url, data=json.dumps(payload), headers={'Content-type': 'application/json'})
                assert response.status_code == 201
                data = response.json()
                assert 'circuit_id' in data
                evcs[i] = data['circuit_id']

            time.sleep(10)

            # make sure the evcs are active and the flows were created
            s1, s2 = self.net.net.get('s1', 's2')
            flows_s1 = s1.dpctl('dump-flows')
            flows_s2 = s2.dpctl('dump-flows')
            for vid in evcs:
                evc_id = evcs[vid]
                api_url = KYTOS_API + '/mef_eline/v2/evc/' + evc_id
                response = requests.get(api_url)
                assert response.status_code == 200
                evc = response.json()
                #should be active
                assert evc["active"] is True
                # search for the vlan id
                assert "dl_vlan=%s" % vid in flows_s1
                assert "dl_vlan=%s" % vid in flows_s2
                # search for the cookie, should have two flows
                assert len(re.findall(evc['id'], flows_s1, flags=re.IGNORECASE)) == 2
                assert len(re.findall(evc['id'], flows_s2, flags=re.IGNORECASE)) == 2

            # Delete the circuits
            for vid in evcs:
                evc_id = evcs[vid]
                api_url = KYTOS_API + '/mef_eline/v2/evc/' + evc_id
                response = requests.delete(api_url)
                assert response.status_code == 200

            time.sleep(10)

            # make sure the circuits were deleted
            api_url = KYTOS_API + '/mef_eline/v2/evc/'
            response = requests.get(api_url)
            assert response.status_code == 200
            assert response.json() == {}
            flows_s1 = s1.dpctl('dump-flows')
            flows_s2 = s2.dpctl('dump-flows')
>           assert len(flows_s1.split('\r\n ')) == 1, "round=%d - should have only 1 flow but had: \n%s" % (x, flows_s1)
E           AssertionError: round=8 - should have only 1 flow but had:
E              cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=strip_vlan,mod_vlan_vid:3461,output:"s1-eth3"
E              cookie=0xaa28bc57b4ba8f4a, duration=8.079s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth3",dl_vlan=3461 actions=strip_vlan,mod_vlan_vid:405,output:"s1-eth1"
E              cookie=0x0, duration=220.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
E
E           assert 3 == 1
E            +  where 3 = len([' cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=...20.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'])
E            +    where [' cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=...20.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'] = <built-in method split of str object at 0x564093f85fd0>('\r\n ')
E            +      where <built-in method split of str object at 0x564093f85fd0> = ' cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=s...220.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'.split

As noted, the test basically creates 10 EVCs, waits a couple of seconds, makes sure the flows were created, removes the 10 evcs, waits a couple of seconds, makes sure the flows were removed, and so on. We repeat this procedure 10 times.

This test was run using the AmLight Kytos image (https://hub.docker.com/r/amlight/kytos), which is basically a master branch daily build image, with some PRs applied (e.g., #117 and more: https://github.com/amlight/kytos-docker/tree/master/patches). Also, we change the default STATS_INTERVAL setting to 3s and leave the CONSISTENCY_INTERVAL=0 (which means the consistency_check will run every 3s). The LINK_UP_TIMER is also changed to 1s.

This problem seems to be a race-condition situation because while the consistency check is running, other requests to remove EVCs are received for different switches and each thread seems to change the self.stored_flows in a uncontrolled manner (the shared resource).

I've enabled some logging and this is how it looks like (logs for the above test, cookie 0xaa28bc57b4ba8f4a == 12261257070396477258 ==> evc ID 28bc57b4ba8f4a):

May 11 01:35:28 da6a340a8416 kytos.napps.kytos/mef_eline:INFO main:246:  Removing EVC(28bc57b4ba8f4a, evc_405)

May 11 01:35:28 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:01')] command=delete flows={'flows': [{'cookie': 12261257070396477258, 'cookie_mask': 18446744073709551615}]}

May 11 01:35:29 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:02')] command=delete flows={'flows': [{'cookie': 12261257070396477258, 'cookie_mask': 18446744073709551615}]}

May 11 01:35:31 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:01')] command=add flows={'flows': [{'match': {'in_port': 1, 'dl_vlan': 405}, 'cookie': 12261257070396477258, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'push_vlan', 'tag_type': 's'}, {'action_type': 'set_vlan', 'vlan_id': 3461}, {'action_type': 'output', 'port': 3}]}]}

May 11 01:35:31 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:01')] command=add flows={'flows': [{'match': {'in_port': 3, 'dl_vlan': 3461}, 'cookie': 12261257070396477258, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'push_vlan', 'tag_type': 'c'}, {'action_type': 'set_vlan', 'vlan_id': 405}, {'action_type': 'output', 'port': 1}]}]}

In between those events, I could see that the function _store_changed_flows() sometimes overwrite a content previously changed. Attached the full log.
syslog-2021-05-10.log.gz

@ajoaoff ajoaoff added the bug Something isn't working label Jun 7, 2021
@italovalcy
Copy link

Another end-to-end test specific for flow_manager which explicitly show this race condition error is:

@@ -897,3 +899,44 @@ class TestE2EFlowManager:
     def test_075_flow_table_0_restarting(self):
         self.flow_table_0(restart_kytos=True)
+
+    def create_flow(self, vlan_id):
+        payload = {
+            "flows": [
+                {
+                    "priority": 10,
+                    "match": {
+                        "in_port": 1,
+                        "dl_vlan": vlan_id
+                    },
+                    "actions": [
+                        {
+                            "action_type": "output",
+                            "port": 2
+                        }
+                    ]
+                }
+            ]
+        }
+        api_url = KYTOS_API + '/flow_manager/v2/flows/00:00:00:00:00:00:00:01'
+        response = requests.post(api_url, data=json.dumps(payload),
+                                 headers={'Content-type': 'application/json'})
+        assert response.status_code == 200
+        data = response.json()
+        assert 'FlowMod Messages Sent' in data['response']
+
+    def test_080_many_concurrent_flows(self):
+        threads = list()
+        for vlan_id in range(100, 200):
+            t = threading.Thread(target=self.create_flow, args=(vlan_id,))
+            threads.append(t)
+        for thread in threads:
+            thread.start()
+        for thread in threads:
+            thread.join()
+
+        time.sleep(10)
+
+        s1 = self.net.net.get('s1')
+        flows_s1 = s1.dpctl('dump-flows')
+        assert len(flows_s1.split('\r\n ')) == 101

I've run this test 10 times and none of them finished with all 100 flows created: most of the times around 80 flows were created. In fact, the flows are created in the first place, however the storehouse is overwritten due to the race condition and some of them are lost from the storehouse. Thus, in the next run of the consistency check routine, the flows are removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2022.1 Kytos-ng 2022.1 bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants