Perform minimal type checking for dictSet function, #5456

hummy123 · 2025-02-26T13:17:15Z

What is the problem/goal being addressed?

There's a performance problem with the dictSet function: the dictionary is a map, which we convert to a list, only for the dictionary type-checking function to convert it back to a map.

Suboptimal performance comes from two things:

Converting a map to a list, only to convert the list back to a map (and adding one key).
Type checking every single element in the dictionary, which is unnecessary because we accept that the map is type safe as a prerequisite for the dictSet function.

What is the solution to this problem?

I think we can perform fewer operations while upholding type safety.

Get any key/value pair from the map (preferably without searching the whole map).
Compare the types of the old key/value pairvwith the new key/value pair we wish to insert
If the old pair and the new pair have the same types, insert them into the existing map (else, raise an error)

No need to check the type of every single element, or convert the map to a list, which are both O(n) operations.

What are the changes here? How do they solve the problem and what other product impacts do they cause?

Commit only contains changes to a bit of purely functional code, and all the Dict.set tests which passed before still pass now.

I also added two new tests, adding different key/value pairs to empty dictionaries.

Has this information been included in the comments?

The general idea is commented.

Other notes

I wanted to add a test to check if a type error is raised when we:

Remove everything from a (string, int) dictionary
Set a key/value pair of some different type like (string, string)

but I wasn't sure how to write multi-line tests as there didn't seem to be any in the files I looked at. I think my code won't introduce any regression in that regard because it passes the exact same arguments to the type-checking function, except the list to type-check is shorter.

As an alternative to this approach, if we want to type check N elements but not to covert the whole map to a list, we can call Maq.toSeq and iterate through a number of elements smaller than the length of the map, but I'm not sure what we gain with that approach.

…/value pair to an existing map, instead of creating a new map from scratch

OceanOak · 2025-02-28T09:17:21Z

I like the idea behind this change! I think we might be able to simplify it a bit while keeping the performance improvements you introduced.
For non-empty dictionaries: create a new dictionary with just the new key-value pair
For non-empty dictionaries:

Get the first existing key-value pair to use as a type reference(instead of a random one)
Something like
```
   let firstKey = Map.keys o |> Seq.head
   let firstVal = Map.find firstKey o |> Option.get
```
Create a minimal test list with both the existing pair and the new pair (as you did)
Use this minimal list to verify the type
If type matches add the pair

What do you think?

(Stachu will be doing the actual review since he knows this best)

hummy123 · 2025-02-28T09:56:29Z

I think that's a good idea! I wasn't sure of how to clean the code up, but I like your suggestion. It gave the following idea for implementation:

if Map.isEmpty map then
  // create from [(key, value)] list as existing commit does
else
  let (oldKey, oldVal) = Map.minKeyValue map
  // use oldKey and oldVal same way existing commit does next

I think that would simplify it by getting rid of the pattern matching, and we would only need one line to find the (key, value) pair instead of making two separate queries to the map as the commit currently does.

If Stachu is fine with the general idea of this commit, I would be happy to update the commit to incorporate that suggestion.

StachuDotNet · 2025-02-28T16:29:52Z

I think this solution puts a lot of responsibility on the specific builtin function. Rather, I'd love to see a solution that could be easily re-used, with the majority of updates likely in TypeChecker.fs.

Right now, the DvalCreator module has:

  let dict
    (threadID : ThreadID)
    (typ : ValueType)
    (entries : List<string * Dval>)
    : Dval =

I think we could add another function

  let dictAddEntry // or ...dictAddEntries
    (threadID : ThreadID)
    (typ : ValueType)
    (entries : DvalMap)
    (newEntry: string * Dval)
    : Dval =

, refactoring the relevant functionality out of dict, into here, and simply call dictAddEntry from the Builtin. (I suppose this kinda reverts the recent PR)

If this solution works well, a similar solution will likely be really useful for adding entries to DLists

What do you think?

Generally speaking, we'd like:

as much type-checking stuff as possible to be in the TypeChecker module
builtins to be as simple as possible

…, as long as it is not the minimum key in the dictionary; this conflicts with previous semantics of Dict.set)

…e (I changed it to True momentarily as the build command mentioned skipping tests, and I wanted to go through the tests)

scripts/build/_build-server

hummy123 · 2025-03-02T06:10:02Z

I think this solution puts a lot of responsibility on the specific builtin function. Rather, I'd love to see a solution that could be easily re-used, with the majority of updates likely in TypeChecker.fs.

I think this is a good idea and the latest commit follows it. (I have some tolerance to messy code 😅 but don't want to make things harder for others to maintain, so I appreciate the advice from you and Ocean on how to simplify the code further.)

There's a regression which I've added a test for and it's easily resolved, but I'm not sure what the maintainers' preferences are.

The first of these tests passes fine. The second one would have before, but now the second one doesn't. We're only checking the minimum key/value pair instead of every element which is why the first passes while the second doesn't.

  Stdlib.Dict.set_v0 (Dict { key1 = "val1before" }) "key1" "val1after" =
    Builtin.testDerrorMessage "Cannot add two dictionary entries with the same key `key1`"

  Stdlib.Dict.set_v0 (Dict { key1 = "val1"; key2 = "val2" }) "key2" "newVal2" =
    Builtin.testDerrorMessage "Cannot add two dictionary entries with the same key `key2`"

The fix is easy: we can move up (or duplicate) the call to Map.contains in TypeChecker.DvalCreator.dict, checking either in the builtin function or the dictAdd function if the key we wish to insert already exists in the map.

// exception-raising code in TypeChecker.DvalCreator.dict function
if Map.containsKey k entries then
  RTE.Dicts.Error.TriedToAddKeyAfterAlreadyPresent k
  |> RTE.Error.Dict
  |> raiseRTE threadID

I would personally prefer to put the if statement in the builtin (maybe put the exception-raising code in a new function so it doesn't need to be copy-pasted repeatedly).

Some functions disallow "reinsertion of the same key" while others like Dict.merge or Dict.fromListOverwritingDuplicates allow it, and I think that information should be available at a glance where the builtin is defined, without digging into the implementation of functions called by that builtin.

That's just my preference though. I'd appreciate a second opinion on where to put the duplicate-check since Chesterton's fence is a thing and I'm not as familiar with the code.

Edit: I accidentally clicked "close with comment" instead of the other button but hopefully it's not closed now.

StachuDotNet · 2025-03-03T17:57:38Z

Hi there, thanks for continuing with this. some thoughts:

I agree that it's a bit unclear which fns are merging builtins (where we just overwrite upon a dupe key) vs non-merging builtins. Maybe we split dictSet into dictSetMerging and dictSetNonMerging to be super clear.
Stdlib functions wrapping these should be updated appropriately, along with the tests (so, that test you mentioned should fail for one of the builtins, but not the other)
The changes in DvalCreator still aren't quite what I was expecting. I'd expect the new fn to house pretty much everything that was in the old .dict fn's List.fold's body. Basically:

let dictAddEntry // or ...dictAddEntries
    (threadID : ThreadID)
    (typ : ValueType)
    (entries : DvalMap)
    (newEntry: string * Dval)
    (shouldMergeUponDupeKey: bool) // new
    : Dval =
          let (k, v) = newEntry
          if Map.containsKey k entries && shouldMergeUponDupeKey then
            RTE.Dicts.Error.TriedToAddKeyAfterAlreadyPresent k
            |> RTE.Error.Dict
            |> raiseRTE threadID

          let vt = Dval.toValueType v

          match VT.merge typ vt with
          | Ok merged -> DDict(merged, Map.add k v entries)
          | Error() ->
            RTE.Dicts.Error.TriedToAddMismatchedData(k, typ, vt, v)
            |> RTE.Error.Dict
            |> raiseRTE threadID)

am I missing some reason why that won't work?

hummy123 · 2025-03-03T18:49:09Z

That looks like it would be perfect actually, and cleaner than the way I previously thought of distinguishing duplicates. I think that kind of approach didn't come to mind before because of a misunderstanding I had of your original comment.

I'll try implementing your suggestions tomorrow and seeing if there's anything to improve. (I think the only thing that sticks out to me is the new bool parameter when the Darklang coding guidelines, if I remember, prefer two-case discriminated unions.)

StachuDotNet · 2025-03-03T19:00:10Z

a 2-case DU would be fine. I was mostly lazy to come up with names :)
Alternatively, have 1 private function that takes in that bool, and 2 tiny public wrapper fns with explicit names (dictAddEntryMerging and dictAddEntryNonMerging for example)

… when new key already exists in Dict, or whether to raise an exception instead. 2. Add another builtin: Dict.setOverridingDuplicates 3. Make stdlib function "setOverridingDuplicates" in file "packages/darklang/stdlib/dict.dark" use new builtin (it was previously implemented by removing the key from the dict and then setting the key/value pair, which is semantically equivalent) 4. Add a couple of additional tests for "Dict.setOverridingDuplicates", and amend regression (introduced in previous commit) in test for "Dict.set" when setting a key/value pair that (is not the minimum key in the set && key already exists in set)

hummy123 · 2025-03-05T08:51:17Z

I didn't notice or understand before, but it looks like there was a Dict.setOverridingDuplicates function in packages/darklang/stdlib/dict.dark which was implemented by first calling the remove builtin to delete the key and then the set builtin to insert the key.

I only noticed after implementing the new dictSetOverridingDuplicates function in dict.fs, but I don't think it makes a big difference (probably) since the semantics are the same. I just rewrote the dict.dark function to rely on the builtin in dict.fs and added one new test (since it already had tests before).

I did also learn that the project uses different formatting settings for fantomas, which is nice to know (so I shouldn't just invoke fantomas . from the terminal but use the pre-commit hook instead).

I'm not entirely sure why CI is failing as local tests in the container are running fine, but it looks like I have more to learn about which is kind of a plus. 😅

OceanOak · 2025-03-05T09:44:45Z

I didn't notice or understand before, but it looks like there was a Dict.setOverridingDuplicates function in packages/darklang/stdlib/dict.dark which was implemented by first calling the remove builtin to delete the key and then the set builtin to insert the key.

As far as I know, we want to keep most of the functionality written in Dark and move as many built-ins as possible to be written in Dark. But there are sometimes exceptions, so Stachu will follow up here :)

I did also learn that the project uses different formatting settings for fantomas, which is nice to know (so I shouldn't just invoke fantomas . from the terminal but use the pre-commit hook instead).

btw, you can run ./scripts/formatting/format format to format

I'm not entirely sure why CI is failing as local tests in the container are running fine, but it looks like I have more to learn about which is kind of a plus.

Sometimes some tests are flaky in CI, I'll rerun the tests and we'll see :)

hummy123 added 2 commits February 26, 2025 12:39

perform minimal type checking for dictSet function, by adding new key…

c267abc

…/value pair to an existing map, instead of creating a new map from scratch

add tests checking if we can set a value when a dictionary is empty

2499ab2

hummy123 marked this pull request as ready for review February 27, 2025 12:59

hummy123 added 3 commits March 2, 2025 01:47

move code relevant to type checking in Dict.fs to TypeChecker.fs

0f9ac5a

add failing test (we can add two dictionary entries with the same key…

d956038

…, as long as it is not the minimum key in the dictionary; this conflicts with previous semantics of Dict.set)

change run_tests variable in scripts/build/_build-server back to Fals…

63675d0

…e (I changed it to True momentarily as the build command mentioned skipping tests, and I wanted to go through the tests)

OceanOak reviewed Mar 2, 2025

View reviewed changes

scripts/build/_build-server Outdated Show resolved Hide resolved

hummy123 closed this Mar 2, 2025

hummy123 reopened this Mar 2, 2025

hummy123 added 2 commits March 5, 2025 08:08

format TypeChecker.fs

5004a37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perform minimal type checking for dictSet function, #5456

Perform minimal type checking for dictSet function, #5456

hummy123 commented Feb 26, 2025 •

edited

Loading

OceanOak commented Feb 28, 2025

hummy123 commented Feb 28, 2025

StachuDotNet commented Feb 28, 2025 •

edited

Loading

hummy123 commented Mar 2, 2025 •

edited

Loading

StachuDotNet commented Mar 3, 2025

hummy123 commented Mar 3, 2025

StachuDotNet commented Mar 3, 2025

hummy123 commented Mar 5, 2025

OceanOak commented Mar 5, 2025

Perform minimal type checking for dictSet function, #5456

Are you sure you want to change the base?

Perform minimal type checking for dictSet function, #5456

Conversation

hummy123 commented Feb 26, 2025 • edited Loading

What is the problem/goal being addressed?

What is the solution to this problem?

What are the changes here? How do they solve the problem and what other product impacts do they cause?

Has this information been included in the comments?

Other notes

OceanOak commented Feb 28, 2025

hummy123 commented Feb 28, 2025

StachuDotNet commented Feb 28, 2025 • edited Loading

hummy123 commented Mar 2, 2025 • edited Loading

StachuDotNet commented Mar 3, 2025

hummy123 commented Mar 3, 2025

StachuDotNet commented Mar 3, 2025

hummy123 commented Mar 5, 2025

OceanOak commented Mar 5, 2025

hummy123 commented Feb 26, 2025 •

edited

Loading

StachuDotNet commented Feb 28, 2025 •

edited

Loading

hummy123 commented Mar 2, 2025 •

edited

Loading