unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method #416

infinity0 · 2021-09-11T23:31:04Z

I was messing around with squeezing performance out of this library.

-- utility function used in the rest of the code
fillOrderPart :: (Monad m, Ord a, Num a) => a -> StateT a m a
fillOrderPart c = state $ \r -> let x = min c r in (c - x, r - x)

fillOrder1V :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1V = runState . Data.Vector.Generic.mapM fillOrderPart
{-
benchmarking static/vector-unboxed/fillOrder1
time                 53.16 μs   (53.02 μs .. 53.31 μs)
benchmarking static/vector-storable/fillOrder1
time                 52.89 μs   (52.59 μs .. 53.22 μs)
-}

fillOrder1VM :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VM book order = runST $ flip runStateT order $ do
  book' <- VG.thaw book
  VGM.mapM_ fillOrderPart book'
  VG.unsafeFreeze book'
{-
benchmarking static/vector-unboxed-copy-mut/fillOrder1
time                 11.20 μs   (11.19 μs .. 11.21 μs)
benchmarking static/vector-storable-copy-mut/fillOrder1
time                 11.51 μs   (11.46 μs .. 11.55 μs)
-}

fillOrder1VB2 :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB2 book order =
  -- FIXME: unstreamM performs horribly to list construction, sadly
  runState (VG.unstreamM $ VFB.mapM fillOrderPart (VG.stream book)) order
{-
benchmarking static/vector-unboxed-bundle-unstreamM/fillOrder1
time                 26.30 μs   (26.24 μs .. 26.35 μs)
benchmarking static/vector-storable-bundle-unstreamM/fillOrder1
time                 46.10 μs   (45.85 μs .. 46.26 μs)
-}

-- bug in vector; defined but not exported. we re-defined it for our use here
unstreamPrimM :: (PrimMonad m, VG.Vector v a) => VFB.MBundle m u a -> m (v a)
{-# INLINE [1] unstreamPrimM #-}
unstreamPrimM s = VGM.munstream s >>= VG.unsafeFreeze

fillOrder1VB :: (Data.Vector.Generic.Vector v i, Ord i, Num i) => v i -> i -> (v i, i)
fillOrder1VB book order =
  runST $ flip runStateT order $ unstreamPrimM $ VFB.mapM fillOrderPart $ VG.stream book
{-
benchmarking static/vector-unboxed-bundle/fillOrder1
time                 1.426 μs   (1.421 μs .. 1.432 μs)
benchmarking static/vector-storable-bundle/fillOrder1
time                 855.9 ns   (852.2 ns .. 859.3 ns)
-}

By contrast, here is the equivalent rust code:

fn fill_order(book: &mut Vec<u64>, order: u64) -> u64 {
  let mut r = order;
  for c in book.iter_mut() {
    let x = r.min(*c);
    r -= x;
    *c -= x;
  }
  r
}
// fill_order              time:   [868.87 ns 870.21 ns 871.76 ns]

As you can see, the properly-written fusion version performs as well as rust. This was a very pleasant surprise for me. HOWEVER - the correct way is not documented anywhere!!! In particular, the convenience function unstreamPrimM is for some reason defined in the source code of Vector.Generic, used no-where else, not exported nor advertised, yet is absolutely vital for reaching this nirvana of performance.

By contrast, unstreamM is what's exposed in the API and destroys the performance so it performs even worse than the manual imperative mutable version. Even worse, all the utility functions are written in terms of unstreamM, e.g. VG.mapM, etc etc. Yes this means they have a convenient Monad m => constraint, but any non-haskell-expert that cares about performance would benchmark it and write off the library as "Haskell is slow". Providing mirror utilities that have a PrimMonad m => constraint that use unstreamPrimM instead of unstreamM, as well as a few examples, would help this effort.

For reference, the above took me about half a day. Not everybody exploring Haskell has that sort of time or patience.

The text was updated successfully, but these errors were encountered:

gksato · 2021-09-12T01:19:16Z

We must face the fact that unstreamPrimM is an utterly unsafe function. Once non-deterministic monad sits in the place of m, immutable buffers can be easily rewritten and one of Haskell's nasal demons pops out everywhere, and I'm not sure it's the only demon living in this function. If we want to expose utilies with constraints PrimMonad, we would need to prefix them with unsafePrim or something and document them with expected undefined behaviors. That way, the situation will not get any easier to non-experts, I suppose. I guess that's the reason why only unstreamPrimM_IO and unstreamPrimM_ST are exported through rewrite rules (We anyway cannot expose unstreamPrimM through rules, though).

Once we are sure that my example is the only undefined behavior of the function, we could expose PrimMonad-related utilities with more strict constraints if we could write such constraint. but what's the constraint we want? With what method in a type class can we prohibit non-determinism?

gksato · 2021-09-12T01:25:34Z

Wait -- what if we employed freeze instead of unsafeFreeze in the definition of unstreamPrimM ? It seems to contain no undefined behavior, but doesn't it really have any? And does it work in the way we expect?

infinity0 · 2021-09-12T11:01:25Z

I see your point about safety. It looks like there's some understanding of this elsewhere e.g. https://www.tweag.io/blog/2021-02-10-linear-base/ "the [linear approach] even makes unsafeFreeze safe".

I didn't consider non-determinism before, but the behaviour would be safe for a linear-in-spirit monad (like State and all the other "common" monads) together with the condition that the caller should not themselves reuse the input of unstreamPrimM, which would be the case e.g. in a redefined mapM like mapPrimM f = unstreamPrimM . Bundle.mapM f . stream that performs the streaming and unstreaming internally already.

we could expose PrimMonad-related utilities with more strict constraints if we could write such constraint. but what's the constraint we want?

Simply PrimMonad m => instead of Monad m => works, if unstreamM is changed to unstreamPrimM - or I suppose waiting for linear-base's linear Monad to be production-ready would be a cleaner & more long-term solution.

Changing unsafeFreeze to freeze (which is just clone >=> unsafeFreeze) gives me the following benchmark numbers:

freeze
benchmarking static/vector-unboxed/fillOrder1VF
time                 1.220 μs   (1.214 μs .. 1.227 μs)
benchmarking static/vector-storable/fillOrder1VF
time                 1.362 μs   (1.358 μs .. 1.365 μs)

unsafeFreeze
benchmarking static/vector-unboxed/fillOrder1VF
time                 1.510 μs   (1.503 μs .. 1.516 μs)
benchmarking static/vector-storable/fillOrder1VF
time                 1.191 μs   (1.189 μs .. 1.192 μs)

(I refactored the code slightly from what I had in the OP; latest version available in https://github.com/infinity0/lang-benchmarks)

It looks like performance is not significantly affected at least in this simple case (perhaps GHC optimises away the clone?), though it's incredibly weird that unsafeFreeze/unboxed is slower than freeze/unboxed (I ran this several times and got the same weird result).

gksato · 2021-09-12T13:17:35Z

Simply PrimMonad m => instead of Monad m => works, if unstreamM is changed to unstreamPrimM -

ListT IO a and ListT (ST s) a (see ListT done right) or a monadic parser transformer applied to IO is a "common" non-deterministic PrimMonad. I wouldn't want to see non-experts puzzled by mysteriously transformed immutable vectors just because they used mapPrimM against ParsecT e s (ST t). I don't know what maintainers think about this, though.

or I suppose waiting for linear-base's linear Monad to be production-ready would be a cleaner & more long-term solution.

That would be a sensible solution if it works. I haven't looked deep into GHC's linear monad yet, but it sounds promising.

It looks like performance is not significantly affected at least in this simple case (perhaps GHC optimises away the clone?), though it's incredibly weird that unsafeFreeze/unboxed is slower than freeze/unboxed (I ran this several times and got the same weird result).

See #409 for another example of behavior you observed. There is also some discussion about that there.

Shimuuar · 2021-09-27T16:13:39Z

unstreamPrimM is more performant and is unsafe. Unsafety alone should not stop us. We have full complement of unsafe functions in the name of performance.

What's crucial is to understand how exactly they're unsafe and when it's safe to use them. unstreamPrimM is fast because it uses single mutable buffer in order to create vector. This of course causes problem in non-determinism monads. Here I'm using list-t package:

import qualified Data.Vector.Unboxed         as U
import qualified Data.Vector.Fusion.Bundle.Monadic as BM
import ListT

bundleList :: BM.Bundle (ListT IO) v Int
bundleList = BM.generateM 2 $ \i -> fromFoldable [i, 100+i]

goBad :: IO [U.Vector Int]
goBad = toList $ unstreamPrimM bundleList

>>> goBad
[[100,101],[100,101],[100,101],[100,101]]

Since all vector share same mutable buffer they all end up same! This causes problem since we continue to modify mutable vector after we called unsafeFreeze. @gksato proposed to use freeze and copy buffer. This does help for proper ListT. It doesn't work in case of ListT from old transformers. This is because we still have only single buffer and writes and freeze interleaved differently.

I suspect that this happens because ListT IO from transformers is not proper monad, it violate monadic laws. So variant of unstreamPrimM which uses freeze could be safe for lawful monads. I however can't prove this.

gksato · 2021-11-18T14:51:33Z

I've been thinking of this safety issues these days. Of course unstreamPrimM with freeze is safe (in the sense can be marked TrustWorthy), since it does nothing unsound. What we have to make sure is that unstreamPrimM and unstreamM are observably equivalent. For that, I think we need lots of laws, including
monad laws:

First of all, monad laws on the PrimMonad m.
The following PrimMonad laws on m, which is undocumented. Note that they resemble MonadIO laws.
1. primitive (\s -> (# s, x #)) = return x, or equivalently stToPrim (return x) = return x
2. primitive (\s0 -> let (# s1, x #) = m# s0 in f# x s1) = primitive m# >>= primitive . f#, or equivalently stToPrim (m >>= f) = stToPrim m >>= stToPrim . f
Undocumented MVector laws on the pair of v and a with MVector v a, which includes but is absolutely not limited to the following. I guess it's appropriate to assume the ST s case, I don't know whether we can prove the general PrimMonad case from that base case (ST s case).
1. (,) <$> read v i <*> read w j = flip (,) <$> read w j <*> read v i
2. write v i a *> read v j = read v j <* write v i a given i /= j.
3. write v i a *> read v i = a <$ write v i a
4. write v i a *> write v j b = write v j b *> write v i a given i /= j.
5. write v i a *> write v i b = write v i b
6. (,) <$> write v i a <*> m = flip (,) <$> m <*> write v i a where m is a "safe" operation that does not refer to a vector that overlaps with v.

gksato · 2021-11-18T21:55:39Z

And thanks to the explanation of @Shimuuar, I guess I got your point better, @infinity0. Obviously It's better if we have unstreamPrimM exposed. Hypothetical beginners are not the only users of this library.

Shimuuar · 2025-02-09T11:34:53Z

I though some more about this. There're two possible interpretation of unsafety here. Below is source of unstreamPrimM

unstreamPrimM :: (PrimMonad m, Vector v a) => MBundle m u a -> m (v a)
unstreamPrimM s = M.munstream s >>= unsafeFreeze

It's possible to break referential transparency. This would mean that some actions in M.munstream may happen after munstream.
All possible execution paths share same buffer and we just get vector with last write from each execution path. It may be useless but it's well defined.

If 2 is the case (as I suspect) then exporting unsafePrimM should be fine, maybe with renaming and documenting its behavior. But showing which scenario is true remains to be done

gksato · 2025-02-10T07:32:59Z

I think I've found an example that unstreamPrimM can break referential transparency. This uses ListT done right. There's a well-known list-t package, but it doesn't implement PrimMonad, so this code includes the definition of ListT. (Edit: Sorry, I've forgotten the details of previous conversation and become overly repetitive)

{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE TypeFamilies #-}
module Main where

import Control.Monad.Primitive
import Control.Monad.Trans.Class
import qualified Data.Vector.Fusion.Bundle as VFB
import qualified Data.Vector.Fusion.Bundle.Monadic as VFBM
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Generic.Mutable as VGM
import qualified Data.Vector.Unboxed as VU

main :: IO ()
main = do
  lst <- forceListT $ do
    vec <- badForM (VU.singleton ()) $ \() -> listToListT [True, False]
    ioToPrim $ print vec
    return vec
  print lst

badForM :: (PrimMonad m, VG.Vector v a, VG.Vector v b) => v a -> (a -> m b) -> m (v b)
badForM v f = unstreamPrimM $ VFB.mapM f (VG.stream v)

unstreamPrimM :: (PrimMonad m, VG.Vector v a) => VFBM.Bundle m v a -> m (v a)
unstreamPrimM bdl = VGM.munstream bdl >>= VG.unsafeFreeze

listToListT :: Monad m => [a] -> ListT m a
listToListT = foldr (\x xs -> ListT $ return $ Cons x xs) (ListT $ return Nil)

forceListT :: Monad m => ListT m a -> m [a]
forceListT (ListT m) = m >>= \case
  Nil -> return []
  Cons x xs -> (x :) <$> forceListT xs

newtype ListT m a = ListT { unListT :: m (ListTC m a) }
data ListTC m a = Nil | Cons a (ListT m a)

listtFmapImpl
  :: (Functor m)
  => (a -> b)
  -> (ListT m a -> ListT m b, ListTC m a -> ListTC m b)
listtFmapImpl f = (go, goC)
  where
    go (ListT m) = ListT $ fmap goC m
    goC Nil = Nil
    goC (Cons x xs) = Cons (f x) (go xs)

instance (Functor m) => Functor (ListT m) where
  fmap f = fst $ listtFmapImpl f

instance (Functor m) => Functor (ListTC m) where
  fmap f = snd $ listtFmapImpl f

appendListT :: Monad m => ListT m a -> ListT m a -> ListT m a
appendListT m1 (ListT m2) = go m1
  where
    go (ListT m) = ListT $ m >>= \case
      Nil -> m2
      Cons x xs -> return $ Cons x (go xs)

bindListT :: Monad m => ListT m a -> (a -> ListT m b) -> ListT m b
bindListT m0 f = go m0
  where
    go (ListT m) = ListT $ m >>= \case
      Nil -> return Nil
      Cons x xs -> unListT $ appendListT (f x) (go xs)

instance Monad m => Monad (ListT m) where
  (>>=) = bindListT

instance Monad m => Applicative (ListT m) where
  pure x = ListT $ return $ Cons x (ListT $ return Nil)
  mf <*> mx = mf >>= (<$> mx)


instance MonadTrans ListT where
  lift m = ListT $ fmap (\x -> Cons x (ListT $ return Nil)) m

instance PrimMonad m => PrimMonad (ListT m) where
  type PrimState (ListT m) = PrimState m
  primitive = lift . primitive

$ cabal run
[True]
[False]
[[False],[False]]

gksato · 2025-02-10T23:31:42Z

I'd like to summarize, since my code is a bit long. Let ListT mean any ListT done right.

Think in ListT IO monad, and say our execution see x <- listToListT [True, False]. Then the execution path first completes the case x = True until the end of the sequence of monad actions, and then executes the case x = False. If x gets written to a mutable vector after x <- listToListT [True, False], the execution path with x=True observes the written True, and after that the execution overwrites x=False. I think this easily break referential transparency if the vector written to is frozen without copy.

Shimuuar · 2025-02-12T20:38:38Z

Thank you. Looks like I forgot how list-t works. But then source of problems is unsafeFreeze. Streaming doesn't bring anything new to the table. Consider following completely innocuous code:

{-# LANGUAGE TypeFamilies #-}
module Q where

import Control.Applicative
import Control.Monad.Trans.Class
import Control.Monad.Primitive
import Data.Vector.Unboxed as VU
import Data.Vector.Unboxed.Mutable as VUM
import ListT

instance PrimMonad m => PrimMonad (ListT m) where
  type PrimState (ListT m) = PrimState m
  primitive = lift . primitive

horror :: IO [(Int, Vector Int)]
horror = ListT.toList $ do
  mv <- VUM.replicateM 1 $ pure 0 <|> pure 1 <|> pure 2
  v  <- VU.unsafeFreeze mv
  !i <- pure $ VU.head v
  pure (i,v)

terror :: IO [(Int, Vector Int)]
terror = ListT.toList $ do
  mv <- VUM.replicateM 1 $ pure 0 <|> pure 1 <|> pure 2
  v  <- VU.unsafeFreeze mv
  i  <- pure $ VU.head v
  pure (i,v)

Strict terror will produce [(0,[2]),(1,[2]),(2,[2])] and lazy horror: [(2,[2]),(2,[2]),(2,[2])]. So unstreamPrimM (probably it should be named unsafeUnstreamPrimM) doesn't cause any problems that unsafeFreeze doesn't.

Expose unstreamPrimM as unsafeUnstreamPrimM
Create unstreamPrimM which copies vector
Add warning about non-deterministic monads to unsafeFreeze.

gksato · 2025-02-13T23:47:27Z

Yep, I guess it's the correct thing to do. I kinda think like this: since the action f x in m >>= \x -> f x sees the unique value of x for each execution of f x, due to the very fact that f is a function. If the monad law is satisfied, then all the later execution sees only one value, say x0, of x. Then, even if x0 is written to a vector, how can it be overwritten by another value of x while this x0 is visible? So we should be able to prove that unstreamPrimM bdl = munstream bdl >>= freeze does work with ANY PrimMonad, which I haven't been able to.

WinstonHartnett mentioned this issue Jun 23, 2022

Not fusing unless monadic #438

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method #416

unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method #416

infinity0 commented Sep 11, 2021 •

edited

Loading

gksato commented Sep 12, 2021 •

edited

Loading

gksato commented Sep 12, 2021 •

edited

Loading

infinity0 commented Sep 12, 2021 •

edited

Loading

gksato commented Sep 12, 2021

Shimuuar commented Sep 27, 2021

gksato commented Nov 18, 2021 •

edited

Loading

gksato commented Nov 18, 2021 •

edited

Loading

Shimuuar commented Feb 9, 2025

gksato commented Feb 10, 2025 •

edited

Loading

gksato commented Feb 10, 2025 •

edited

Loading

Shimuuar commented Feb 12, 2025

gksato commented Feb 13, 2025

unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method #416

unstreamM etc destroys performance; expose unstreamPrimM & document most-performant method #416

Comments

infinity0 commented Sep 11, 2021 • edited Loading

gksato commented Sep 12, 2021 • edited Loading

gksato commented Sep 12, 2021 • edited Loading

infinity0 commented Sep 12, 2021 • edited Loading

gksato commented Sep 12, 2021

Shimuuar commented Sep 27, 2021

gksato commented Nov 18, 2021 • edited Loading

gksato commented Nov 18, 2021 • edited Loading

Shimuuar commented Feb 9, 2025

gksato commented Feb 10, 2025 • edited Loading

gksato commented Feb 10, 2025 • edited Loading

Shimuuar commented Feb 12, 2025

gksato commented Feb 13, 2025

infinity0 commented Sep 11, 2021 •

edited

Loading

gksato commented Sep 12, 2021 •

edited

Loading

gksato commented Sep 12, 2021 •

edited

Loading

infinity0 commented Sep 12, 2021 •

edited

Loading

gksato commented Nov 18, 2021 •

edited

Loading

gksato commented Nov 18, 2021 •

edited

Loading

gksato commented Feb 10, 2025 •

edited

Loading

gksato commented Feb 10, 2025 •

edited

Loading