Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform minimal type checking for dictSet function, #5456

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

hummy123
Copy link
Contributor

@hummy123 hummy123 commented Feb 26, 2025

What is the problem/goal being addressed?

There's a performance problem with the dictSet function: the dictionary is a map, which we convert to a list, only for the dictionary type-checking function to convert it back to a map.

Suboptimal performance comes from two things:

  1. Converting a map to a list, only to convert the list back to a map (and adding one key).
  2. Type checking every single element in the dictionary, which is unnecessary because we accept that the map is type safe as a prerequisite for the dictSet function.

What is the solution to this problem?

I think we can perform fewer operations while upholding type safety.

  1. Get any key/value pair from the map (preferably without searching the whole map).
  2. Compare the types of the old key/value pairvwith the new key/value pair we wish to insert
  3. If the old pair and the new pair have the same types, insert them into the existing map (else, raise an error)

No need to check the type of every single element, or convert the map to a list, which are both O(n) operations.

What are the changes here? How do they solve the problem and what other product impacts do they cause?

Commit only contains changes to a bit of purely functional code, and all the Dict.set tests which passed before still pass now.

I also added two new tests, adding different key/value pairs to empty dictionaries.

Has this information been included in the comments?

The general idea is commented.

Other notes

I wanted to add a test to check if a type error is raised when we:

  1. Remove everything from a (string, int) dictionary
  2. Set a key/value pair of some different type like (string, string)

but I wasn't sure how to write multi-line tests as there didn't seem to be any in the files I looked at. I think my code won't introduce any regression in that regard because it passes the exact same arguments to the type-checking function, except the list to type-check is shorter.

As an alternative to this approach, if we want to type check N elements but not to covert the whole map to a list, we can call Maq.toSeq and iterate through a number of elements smaller than the length of the map, but I'm not sure what we gain with that approach.

@hummy123 hummy123 marked this pull request as ready for review February 27, 2025 12:59
@OceanOak
Copy link
Collaborator

I like the idea behind this change! I think we might be able to simplify it a bit while keeping the performance improvements you introduced.
For non-empty dictionaries: create a new dictionary with just the new key-value pair
For non-empty dictionaries:

  • Get the first existing key-value pair to use as a type reference(instead of a random one)
    Something like
       let firstKey = Map.keys o |> Seq.head
       let firstVal = Map.find firstKey o |> Option.get
    
  • Create a minimal test list with both the existing pair and the new pair (as you did)
  • Use this minimal list to verify the type
  • If type matches add the pair

What do you think?

(Stachu will be doing the actual review since he knows this best)

@hummy123
Copy link
Contributor Author

I think that's a good idea! I wasn't sure of how to clean the code up, but I like your suggestion. It gave the following idea for implementation:

if Map.isEmpty map then
  // create from [(key, value)] list as existing commit does
else
  let (oldKey, oldVal) = Map.minKeyValue map
  // use oldKey and oldVal same way existing commit does next

I think that would simplify it by getting rid of the pattern matching, and we would only need one line to find the (key, value) pair instead of making two separate queries to the map as the commit currently does.

If Stachu is fine with the general idea of this commit, I would be happy to update the commit to incorporate that suggestion.

@StachuDotNet
Copy link
Member

StachuDotNet commented Feb 28, 2025

I think this solution puts a lot of responsibility on the specific builtin function. Rather, I'd love to see a solution that could be easily re-used, with the majority of updates likely in TypeChecker.fs.

Right now, the DvalCreator module has:

  let dict
    (threadID : ThreadID)
    (typ : ValueType)
    (entries : List<string * Dval>)
    : Dval =

I think we could add another function

  let dictAddEntry // or ...dictAddEntries
    (threadID : ThreadID)
    (typ : ValueType)
    (entries : DvalMap)
    (newEntry: string * Dval)
    : Dval =

, refactoring the relevant functionality out of dict, into here, and simply call dictAddEntry from the Builtin. (I suppose this kinda reverts the recent PR)

If this solution works well, a similar solution will likely be really useful for adding entries to DLists

What do you think?


Generally speaking, we'd like:

  • as much type-checking stuff as possible to be in the TypeChecker module
  • builtins to be as simple as possible

hummy123 added 3 commits March 2, 2025 01:47
…, as long as it is not the minimum key in the dictionary; this conflicts with previous semantics of Dict.set)
…e (I changed it to True momentarily as the build command mentioned skipping tests, and I wanted to go through the tests)
@hummy123
Copy link
Contributor Author

hummy123 commented Mar 2, 2025

I think this solution puts a lot of responsibility on the specific builtin function. Rather, I'd love to see a solution that could be easily re-used, with the majority of updates likely in TypeChecker.fs.

I think this is a good idea and the latest commit follows it. (I have some tolerance to messy code 😅 but don't want to make things harder for others to maintain, so I appreciate the advice from you and Ocean on how to simplify the code further.)

There's a regression which I've added a test for and it's easily resolved, but I'm not sure what the maintainers' preferences are.

The first of these tests passes fine. The second one would have before, but now the second one doesn't. We're only checking the minimum key/value pair instead of every element which is why the first passes while the second doesn't.

  Stdlib.Dict.set_v0 (Dict { key1 = "val1before" }) "key1" "val1after" =
    Builtin.testDerrorMessage "Cannot add two dictionary entries with the same key `key1`"

  Stdlib.Dict.set_v0 (Dict { key1 = "val1"; key2 = "val2" }) "key2" "newVal2" =
    Builtin.testDerrorMessage "Cannot add two dictionary entries with the same key `key2`"

The fix is easy: we can move up (or duplicate) the call to Map.contains in TypeChecker.DvalCreator.dict, checking either in the builtin function or the dictAdd function if the key we wish to insert already exists in the map.

// exception-raising code in TypeChecker.DvalCreator.dict function
if Map.containsKey k entries then
  RTE.Dicts.Error.TriedToAddKeyAfterAlreadyPresent k
  |> RTE.Error.Dict
  |> raiseRTE threadID

I would personally prefer to put the if statement in the builtin (maybe put the exception-raising code in a new function so it doesn't need to be copy-pasted repeatedly).

Some functions disallow "reinsertion of the same key" while others like Dict.merge or Dict.fromListOverwritingDuplicates allow it, and I think that information should be available at a glance where the builtin is defined, without digging into the implementation of functions called by that builtin.

That's just my preference though. I'd appreciate a second opinion on where to put the duplicate-check since Chesterton's fence is a thing and I'm not as familiar with the code.

Edit: I accidentally clicked "close with comment" instead of the other button but hopefully it's not closed now.

@hummy123 hummy123 closed this Mar 2, 2025
@hummy123 hummy123 reopened this Mar 2, 2025
@StachuDotNet
Copy link
Member

Hi there, thanks for continuing with this. some thoughts:

  • I agree that it's a bit unclear which fns are merging builtins (where we just overwrite upon a dupe key) vs non-merging builtins. Maybe we split dictSet into dictSetMerging and dictSetNonMerging to be super clear.
  • Stdlib functions wrapping these should be updated appropriately, along with the tests (so, that test you mentioned should fail for one of the builtins, but not the other)
  • The changes in DvalCreator still aren't quite what I was expecting. I'd expect the new fn to house pretty much everything that was in the old .dict fn's List.fold's body. Basically:
let dictAddEntry // or ...dictAddEntries
    (threadID : ThreadID)
    (typ : ValueType)
    (entries : DvalMap)
    (newEntry: string * Dval)
    (shouldMergeUponDupeKey: bool) // new
    : Dval =
          let (k, v) = newEntry
          if Map.containsKey k entries && shouldMergeUponDupeKey then
            RTE.Dicts.Error.TriedToAddKeyAfterAlreadyPresent k
            |> RTE.Error.Dict
            |> raiseRTE threadID

          let vt = Dval.toValueType v

          match VT.merge typ vt with
          | Ok merged -> DDict(merged, Map.add k v entries)
          | Error() ->
            RTE.Dicts.Error.TriedToAddMismatchedData(k, typ, vt, v)
            |> RTE.Error.Dict
            |> raiseRTE threadID)

am I missing some reason why that won't work?

@hummy123
Copy link
Contributor Author

hummy123 commented Mar 3, 2025

That looks like it would be perfect actually, and cleaner than the way I previously thought of distinguishing duplicates. I think that kind of approach didn't come to mind before because of a misunderstanding I had of your original comment.

I'll try implementing your suggestions tomorrow and seeing if there's anything to improve. (I think the only thing that sticks out to me is the new bool parameter when the Darklang coding guidelines, if I remember, prefer two-case discriminated unions.)

@StachuDotNet
Copy link
Member

a 2-case DU would be fine. I was mostly lazy to come up with names :)
Alternatively, have 1 private function that takes in that bool, and 2 tiny public wrapper fns with explicit names (dictAddEntryMerging and dictAddEntryNonMerging for example)

hummy123 added 2 commits March 5, 2025 08:08
… when new key already exists in Dict, or whether to raise an exception instead.

2. Add another builtin: Dict.setOverridingDuplicates

3. Make stdlib function "setOverridingDuplicates" in file "packages/darklang/stdlib/dict.dark" use new builtin (it was previously implemented by removing the key from the dict and then setting the key/value pair, which is semantically equivalent)

4. Add a couple of additional tests for "Dict.setOverridingDuplicates", and amend regression (introduced in previous commit) in test for "Dict.set" when setting a key/value pair that (is not the minimum key in the set && key already exists in set)
@hummy123
Copy link
Contributor Author

hummy123 commented Mar 5, 2025

I didn't notice or understand before, but it looks like there was a Dict.setOverridingDuplicates function in packages/darklang/stdlib/dict.dark which was implemented by first calling the remove builtin to delete the key and then the set builtin to insert the key.

I only noticed after implementing the new dictSetOverridingDuplicates function in dict.fs, but I don't think it makes a big difference (probably) since the semantics are the same. I just rewrote the dict.dark function to rely on the builtin in dict.fs and added one new test (since it already had tests before).

I did also learn that the project uses different formatting settings for fantomas, which is nice to know (so I shouldn't just invoke fantomas . from the terminal but use the pre-commit hook instead).

I'm not entirely sure why CI is failing as local tests in the container are running fine, but it looks like I have more to learn about which is kind of a plus. 😅

@OceanOak
Copy link
Collaborator

OceanOak commented Mar 5, 2025

I didn't notice or understand before, but it looks like there was a Dict.setOverridingDuplicates function in packages/darklang/stdlib/dict.dark which was implemented by first calling the remove builtin to delete the key and then the set builtin to insert the key.

As far as I know, we want to keep most of the functionality written in Dark and move as many built-ins as possible to be written in Dark. But there are sometimes exceptions, so Stachu will follow up here :)

I did also learn that the project uses different formatting settings for fantomas, which is nice to know (so I shouldn't just invoke fantomas . from the terminal but use the pre-commit hook instead).

btw, you can run ./scripts/formatting/format format to format

I'm not entirely sure why CI is failing as local tests in the container are running fine, but it looks like I have more to learn about which is kind of a plus.

Sometimes some tests are flaky in CI, I'll rerun the tests and we'll see :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants