feat:reset default schema name to real schema name #525
This is not a very good approach. In the current design, schema.raw_name is part of the hash function for both Table and Column. Mutating schema.raw_name changes the hash value, which networkx relies on to locate the node.
The example below shows that after the schema name is changed, networkx can no longer find the existing nodes and edge. This may lead to tricky bugs in the future, so I'd like to avoid changes like this.
from networkx import DiGraph
from sqllineage.core.models import Table
g = DiGraph()
a = Table("a")
b = Table("b")
g.add_edge(a, b)
print(g.has_node(a))
print(g.has_node(b))
print(g.has_edge(a, b))
a.schema.raw_name = "main"
print(g.has_node(a))
print(g.has_node(b))
print(g.has_edge(a, b))
The output is:
True
True
True
False
False
False
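A possible reason why has_node(b) also turns False, even though only a was touched: the NewTable snippet later in this thread mirrors a `schema: Schema = Schema()` default, which suggests that every Table created without an explicit schema shares one Schema instance. That is an assumption about sqllineage internals, not something confirmed here. The sketch below uses stand-in Schema and Table classes (hypothetical, not the real sqllineage ones) to show how a shared default makes both hashes move at once.

```python
class Schema:
    def __init__(self, name="<default>"):
        self.raw_name = name

    def __str__(self):
        return self.raw_name


class Table:
    # Assumption: the default Schema() is created once, when the function is
    # defined, so every Table built without an explicit schema shares it.
    def __init__(self, name, schema=Schema()):
        self.raw_name = name
        self.schema = schema

    def __str__(self):
        return f"{self.schema}.{self.raw_name}"

    def __eq__(self, other):
        return isinstance(other, Table) and str(self) == str(other)

    def __hash__(self):
        # The schema name takes part in the hash, as described above.
        return hash(str(self))


a, b = Table("a"), Table("b")
hash_a_before, hash_b_before = hash(a), hash(b)

a.schema.raw_name = "main"  # only a is touched explicitly

shares_schema = a.schema is b.schema
a_hash_changed = hash(a) != hash_a_before
b_hash_changed = hash(b) != hash_b_before
print(shares_schema, a_hash_changed, b_hash_changed)
```

If the real classes behave like these stand-ins, any container that stored a or b under the old hash would now be probed at a different position for both of them.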
from networkx import DiGraph
from sqllineage.core.models import Table
g = DiGraph()
a = Table("a")
b = Table("b")
g.add_edge(a, b)
print(g.has_node(a))
print(g.has_node(b))
print(g.has_edge(a, b))
a.schema.raw_name = "main"
print(g.has_node(a))
for node in g.nodes():
    print(node.__class__, node)
    node: Table
    if node == a:
        print('equal===a')
    if node == b:
        print('equal===b')
for node in g.nodes():
    if node.raw_name == 'a':
        if node == a:
            print('----')
            print(g.has_node(a))
            print(g.has_node(node))
            print(g.nodes)
            print('----')
print(g.has_node(b))
print(g.has_edge(a, b))
Why is the result of print(g.has_node(node)) False?
Step through it in a debugger and you will see: one is a comparison of hash values, the other is a comparison of content.
from networkx import DiGraph
from sqllineage.core.models import Table
g = DiGraph()
a = Table("a")
b = Table("b")
g.add_edge(a, b)
for node in g.nodes():
    if node in g.nodes():
        print(f'find node {node.raw_name}')
    else:
        print(f"can't find node {node.raw_name}")
a.schema.raw_name = 'ods'
for node in g.nodes():
    if node in g.nodes():
        print(f'find node {node.raw_name}')
    else:
        print(f"can't find node {node.raw_name}")
What confuses me is that a node obtained from nodes with a for loop cannot be found with in nodes. However they are retrieved, shouldn't they be the same object?
Because
You can single-step through the second for loop and you will see.
Is it clear enough this time? 😜😜😜
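To make the "hash comparison versus content comparison" point concrete without a debugger, here is a minimal sketch. The Node class and toy_hash are hypothetical stand-ins (not sqllineage or NetworkX code), and the hash is a deterministic toy so the outcome does not depend on the interpreter's per-process string hashing. A plain dict plays the role of the node store, on the assumption that NetworkX keeps nodes as dictionary keys.

```python
def toy_hash(text):
    # Deterministic stand-in for hash(str): the sum of the code points.
    return sum(ord(ch) for ch in text)


class Node:
    def __init__(self, schema, name):
        self.schema = schema
        self.name = name

    def __str__(self):
        return f"{self.schema}.{self.name}"

    def __eq__(self, other):
        return isinstance(other, Node) and str(self) == str(other)

    def __hash__(self):
        return toy_hash(str(self))


a = Node("default", "a")
nodes = {a: {}}   # stored under the hash of "default.a"
a.schema = "ods"  # the key now hashes like "ods.a"

# Iteration walks the stored entries and never calls __hash__.
seen_by_iteration = [str(n) for n in nodes]        # ['ods.a']
# Content comparison still succeeds: it is the very same object.
equal_to_a = [n == a for n in nodes]               # [True]
# Membership recomputes the hash first and probes an empty slot.
found_by_membership = [n in nodes for n in nodes]  # [False]
print(seen_by_iteration, equal_to_a, found_by_membership)
```

So the loop and the in test really do deal with the same object. They differ only in how they reach it: iteration goes through the stored entries, membership goes through the current hash.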
a.schema.raw_name='ods'
for _ in range(10):
    x = y = 0
    for i in range(10):
        for node in g.nodes():
            if node in g.nodes():
                x = x + 1
            else:
                y = y + 1
    print(x, y)
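Over repeated runs, this result shows up sporadically, roughly 1 time in 10. In theory the values should be (0 20) no matter how many times it runs. conda env: python 3.10.13, networkX==3.2.1
That comes from the hash function itself. I noticed it too, but I don't know the internals of Python's built-in hash function well enough to explain it.
From what I have observed, once a process has started, the same object produces the same hash value until the process exits. After restarting the process, the value may be different.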
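That observation matches Python's documented string hash randomization: str hashes are salted per process unless PYTHONHASHSEED fixes the seed. Assuming the Table hash is ultimately derived from hashing a string that contains the schema name, this would explain why a given run is internally consistent while different runs disagree. A small sketch, where string_hash_in_new_process is a hypothetical helper written for this illustration:

```python
import os
import subprocess
import sys


def string_hash_in_new_process(text, seed):
    # Start a fresh interpreter with a fixed hash seed and read back hash(text).
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Within one process the hash of an equal string never changes.
stable_in_process = hash("main.a") == hash("main.a")

# Same seed, separate processes: identical hashes.
seed1_first = string_hash_in_new_process("main.a", "1")
seed1_second = string_hash_in_new_process("main.a", "1")

# Different seed: a different hash for the same text.
seed2 = string_hash_in_new_process("main.a", "2")

print(stable_in_process, seed1_first == seed1_second, seed1_first != seed2)
```

Running the earlier counting script with a fixed PYTHONHASHSEED should therefore make its result reproducible from run to run, which may help when trying to capture the rare case.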
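What confuses me is that every node comes from networkX, yet a node obtained from the for loop is not in nodes!
The for loop is just an iteration, so it always yields the node. The in operation does not necessarily succeed, because it calls the hash function. I ran it many times and also got the 10 10 result. My feeling is that two different objects produced colliding hash values. Unfortunately I had not printed the hash values when that result appeared, and after I added the hash printing I could not reproduce it.
To explain this properly, you would probably need to understand the actual implementation of Python's built-in hash function.
More precisely, it should be the same object: after its content (the schema) was modified, the newly generated hash value collided with the one from before the modification.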
g.add_node Notes
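NetworkX does rely on hashable nodes internally. When a node is mutated, the hash it was stored under is not updated automatically, so in nodes, which looks the node up by hash, fails to find it.
I lean toward the caching explanation, but I can't prove it. The hash has plenty of bits; if it collided this easily, it would be far too unreliable.
If it were a cache, wouldn't the cache be invalidated once the object's content changed? That doesn't seem very reliable either.
One more point: if it were a cache, I think the result should be 20 0 rather than 10 10.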
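A third explanation that needs neither a cache nor a full 64-bit collision: as far as I understand CPython's dict lookup, it only uses the low bits of the hash to pick a slot, and it accepts the entry in that slot immediately if it is the identical object, before comparing stored hashes. With a small table, a changed hash still lands on the old slot fairly often, and then the lookup succeeds by identity. The sketch below relies on that CPython implementation detail (treat it as an assumption on other interpreters) and uses a hypothetical Node class with a hand-set hash.

```python
class Node:
    def __init__(self, name, hash_value):
        self.name = name
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value


node = Node("a", 1)
index = {node: "stored under hash 1"}

# New hash with different low bits: the probe starts at another, empty slot.
node.hash_value = 2
found_in_other_slot = node in index  # False

# New hash with the same low bits as 1: the probe starts at the old slot,
# finds the identical object there, and reports a hit (on CPython).
node.hash_value = 1 + 2**20
found_in_same_slot = node in index   # True

print(found_in_other_slot, found_in_same_slot)
```

If that reading is right, a 10 10 result would mean that one of the two nodes happened to land on its old slot in that process while the other did not, which is plausible at roughly the observed frequency for a table with only a handful of slots.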
import string
from networkx import DiGraph
from sqllineage.core.models import Table, Schema
from secrets import choice


class NewTable(Table):
    def __init__(self, name: str, schema: Schema = Schema(), **kwargs):
        super().__init__(name, schema, **kwargs)
        # Random 8-digit hash, fixed at construction and independent of the content
        self._hash_value = int(''.join(choice(string.digits) for i in range(8)))
        # print(self._hash_value)

    def __hash__(self):
        return self._hash_value


g = DiGraph()
a = NewTable("a")
b = NewTable("b")
g.add_edge(a, b)
print('before change schema')
for node in g.nodes():
    if node in g.nodes():
        print(f'find node {node.raw_name}')
    else:
        print(f"can't find node {node.raw_name}")
a.schema.raw_name = 'ods'
print('after change schema')
for node in g.nodes():
    if node in g.nodes():
        print('ok')
    else:
        print('false')
for _ in range(10):
    x = y = 0
    for _ in range(10):
        for node in g.nodes():
            if node in g.nodes():
                x = x + 1
            else:
                y = y + 1
    print(x, y)
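With this change it seems OK.
With this change, when holder.node changes, the networkX node changes along with it and stays associated. The hashes of Schema() and Column are also based on raw_name. Is that reasonable?
This amounts to implementing your own hash function. Won't this implementation have a high hash collision rate?
Also, your hash implementation has nothing to do with the object's content. What happens when two different objects with the same content need to have the same hash value?
See https://docs.python.org/zh-cn/3.10/library/secrets.html
That is the issue I just mentioned: when an object is repeated in a SQL statement, should it be one node or several in the graph? Of course, as things stand, I have no idea how to do it, since I don't understand SQLLineage thoroughly.
Setting sqllineage aside, I feel that a hash value completely unrelated to the object's content is not best practice. At least I have never seen an implementation like that.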
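For comparison, the more common pattern is a content-based hash on an immutable object: equal content gives equal hash, and the hash can never drift because the content cannot change in place. A minimal sketch with frozen dataclasses follows. SchemaKey and TableKey are hypothetical names for illustration and are not sqllineage classes.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SchemaKey:
    raw_name: str = "<default>"


@dataclass(frozen=True)
class TableKey:
    raw_name: str
    schema: SchemaKey = SchemaKey()


first = TableKey("a")
second = TableKey("a")

# Two distinct objects with the same content compare equal and hash alike.
same_content_same_hash = first == second and hash(first) == hash(second)

# In-place mutation is rejected, so a stored key can never go stale.
try:
    first.schema.raw_name = "main"
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

# Changing the schema means building a new key and re-registering it.
renamed = replace(first, schema=SchemaKey("main"))
print(same_content_same_hash, mutation_blocked, renamed != first)
```

The cost is that resolving the default schema to the real one becomes a replace-the-node operation rather than an attribute assignment.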
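As I envision it,
I had the same idea but never put it into practice. Lineage parsing currently works from the outside in, and I have been wondering whether inside out (parsing the innermost query first) would be a better design.
The correct approach would be for a networkX node to be just an id, not an object, with all the business information kept in properties. I'm also worried about memory leaks here: each time a new Schema is added with the same hash, does the newly added table object's schema point to the earlier one or to the new one? And the extra Schema instances that are never used, where do they go? This is beyond my ability...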
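A rough sketch of the "node is only an id" idea, using plain dictionaries so it runs without extra packages. The ids t1 and t2, the add_table helper, and the attribute names are all made up for illustration. NetworkX supports the same shape through node attributes, for example g.add_node("t1", schema="main") and g.nodes["t1"]["schema"].

```python
# Node ids are immutable strings; everything that may change lives in attributes.
nodes: dict[str, dict] = {}
edges: set[tuple[str, str]] = set()


def add_table(node_id, schema, name):
    nodes[node_id] = {"schema": schema, "name": name}


add_table("t1", "<default>", "a")
add_table("t2", "<default>", "b")
edges.add(("t1", "t2"))

# Resolving the real schema only edits an attribute; the id, and therefore
# the hash used for lookups, is untouched.
nodes["t1"]["schema"] = "main"

node_still_found = "t1" in nodes
edge_still_found = ("t1", "t2") in edges
print(node_still_found, edge_still_found, nodes["t1"])
```

The open question from the comments above remains how ids are assigned, since two mentions of the same table in a SQL statement must map to the same id for the lineage edges to connect.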
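Yes, that makes the association between input and output easier to handle. My idea is to use breadth-first search to generate a queue and execute it in order.
Worth a try.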
dear reata:
Modify the default schema name to the real schema name