Repository

dotnet/csharp-notebooks

Get started learning C# with C# notebooks powered by .NET Interactive and VS Code.

ML Notebook consumes all the available memory, forcing Windows to close processes

The Training and AutoML notebook can consume a lot of memory, causing other processes to hang or crash.

Strangely enough, it usually works fine if you run the notebook only once. So to reproduce the problem, you should:

  1. Open Windows Task Manager, and check your memory usage.
  2. Open the Training and AutoML notebook.
  3. Run its snippets one by one, but stop at "Use AutoML to simplify trainer selection and hyper-parameter optimization."
  4. Run the "Use AutoML to simplify trainer selection and hyper-parameter optimization" code.
  5. Sometimes it works fine, but the last time I reached this point my system hung, terminated some VS processes, and closed my browser unexpectedly. Memory consumption dropped back to ~950 MB, and the notebook got into a seemingly endless loop of "Starting Kernel".
  6. When I tried to re-run the "Use AutoML to simplify trainer selection and hyper-parameter optimization" code snippet, I got the following exception, repeating over and over:
error: The JSON-RPC connection with the remote party was lost before the request could complete. 
    at StreamJsonRpc.JsonRpc.<InvokeCoreAsync>d__154.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at StreamJsonRpc.JsonRpc.<InvokeCoreAsync>d__143`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.VisualStudio.Notebook.Utils.DetectKernelStatusService.<ExecuteTaskAsync>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.VisualStudio.Notebook.Utils.RepeatedTimeTaskService.<>c__DisplayClass7_0.<<ExecuteAsync>b__1>d.MoveNext()
  7. If you can run the notebook without issues, try re-running the "Use AutoML to simplify trainer selection and hyper-parameter optimization" code many times. The problem is inconsistent on my machine as well.
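
As a possible complement to watching Task Manager in step 1, a small helper cell could log the kernel process's memory before and after each AutoML run, which may make the inconsistent behavior easier to compare across runs. This is only a sketch: PrintMemorySnapshot is a hypothetical name that does not exist in the notebook, and it assumes the cell executes inside the same .NET process that runs the training.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of the notebook): prints memory figures
// for the process this cell runs in.
void PrintMemorySnapshot(string label)
{
    using var process = Process.GetCurrentProcess();
    process.Refresh(); // make sure the cached counters are up to date

    double mb = 1024.0 * 1024.0;
    double workingSetMb = process.WorkingSet64 / mb;          // physical memory in use
    double privateMb = process.PrivateMemorySize64 / mb;      // memory not shared with other processes
    double managedMb = GC.GetTotalMemory(forceFullCollection: false) / mb; // managed heap only

    Console.WriteLine(
        $"{label}: working set {workingSetMb:F0} MB, private {privateMb:F0} MB, managed heap {managedMb:F0} MB");
}

// Call once before and once after the AutoML cell and compare the two lines.
PrintMemorySnapshot("before AutoML cell");
```

Running the same call again after step 4 gives a second line to compare against, so the growth per run is recorded in the notebook output rather than read off Task Manager.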

9 Comments

  1. I suspect it's because the trial is still running even after the AutoML cell has finished. Somehow AutoMLExperiment doesn't always succeed in cancelling the last running trial.

  2. We probably also need to clean up some things in our NotebookMonitor -

    https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.AutoML.Interactive/NotebookMonitor.cs

    It could be holding references to a lot of things.

    @andrasfuchs if you "restart kernel" does it free up the memory for you?

    I'll dig more to see if I can find anything.

  3. @andrasfuchs if you're using the latest notebook editor extension, there is a restart button in the notebook toolbar.

    image

  4. I tried it again today, but after a "Run All", it got crazy again, eating up the memory and closing other running processes.

    image

    The critical part got terminated with an exception.

    image

    The memory was not freed up after the exception, I had to close the Visual Studio process manually.
    I had no chance to test the kernel restart.

  5. I was thinking there might be some places where we forget to clear trial results and release memory (for example, holding all models in memory), but I didn't see memory go up while training. So now I suspect the crazy memory usage is caused by the LightGbm trainer, which can allocate memory badly, especially when the search space gets big.

    @andrasfuchs Can you try disabling the lgbm trainer by setting
    useLgbm: false next to useSdca: false

    in the following code snippet
    image

    and try the notebook again

  6. And @JakeRadMSFT, maybe it would be helpful to add a system monitor section alongside the trial monitor?

  7. I got the gray rectangles instead of the results, but the memory problem seems to be better if I use useLgbm: false.
    image

    10+ GB of RAM usage is still a lot, I think...
    image

    ...and this memory is not freed up after the notebook run was completed.
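
One way to narrow down where the retained memory lives might be to force a garbage collection in a new cell and compare the managed heap with the process's private bytes. This is a sketch under assumptions, not a diagnosis: it assumes the cell runs in the same process as the training, and the interpretation in the comments is a rule of thumb. If the managed heap stays large after collection, something is still holding references (the NotebookMonitor idea from comment 2); if the managed heap is small but private bytes stay high, the memory is more likely native (consistent with the LightGbm suspicion from comment 5).

```csharp
using System;
using System.Diagnostics;

double mb = 1024.0 * 1024.0;

// Force a full, blocking collection so that only reachable objects remain.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

// Managed heap size after collection.
long managedBytes = GC.GetTotalMemory(forceFullCollection: true);

// Total private memory of the process, managed and native together.
long privateBytes;
using (var process = Process.GetCurrentProcess())
{
    privateBytes = process.PrivateMemorySize64;
}

Console.WriteLine($"Managed heap after GC: {managedBytes / mb:F0} MB");
Console.WriteLine($"Private bytes of process: {privateBytes / mb:F0} MB");
// A large gap between the two numbers suggests memory held outside the managed heap.
```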