Is it possible to compose parsers and state monad transformers with different input and output types, where a parser/state monad has the form:

`data M m i o a = M { uM :: i -> m (o, a) }`

such that:

```
given M m A B () and M m B C ()
create M m A C ()
```
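As a preview of where we're headed, here is a minimal sketch of that composition (using the `M` shape from the question; the step names and the choice of `Maybe` for `m` are illustrative): the intermediate index `B` is simply threaded through a monadic bind.

```haskell
-- A minimal sketch, assuming the `M` shape from the question.
newtype M m i o a = M { uM :: i -> m (o, a) }

-- given M m A B () and M m B C (), create M m A C ()
composeM :: Monad m => M m a b () -> M m b c () -> M m a c ()
composeM (M f) (M g) = M $ \i -> do
  (b, ()) <- f i   -- run the first step, keep its output index
  g b              -- feed it as the input index of the second step

-- illustrative steps: String -> Int, then Int -> Bool
stepAB :: M Maybe String Int ()
stepAB = M $ \s -> Just (length s, ())

stepBC :: M Maybe Int Bool ()
stepBC = M $ \n -> Just (n > 3, ())

demo :: Maybe (Bool, ())
demo = uM (composeM stepAB stepBC) "hello"  -- Just (True, ())
```

The rest of this article develops exactly this idea, with `do` notation and interleaved effects.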

If we look at indexed monads, especially indexed state, we notice that indexed state is isomorphic to the state monad when `i == o`. This is especially useful when we want to compose a transformation, `A -> Z`, through many smaller function applications!

Starting with our trusty old parser, let’s derive an indexed monad in broad strokes, with input/output types, where we can also run `IO` effects, if need be.

```
newtype P a = P { unP :: String -> Maybe (a, String) }  -- our trusty old parser
newtype P' m a s = P' { unP' :: s -> m (a, s) }         -- generalize the effect m and state s
newtype P' m a i o = P' { unP' :: i -> m (a, o) }       -- split state into input i and output o
```
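As a quick sanity check (a sketch, with `item` as an illustrative single-character parser), the original `P` is recovered from `P'` by fixing the effect to `Maybe` and both indices to `String`:

```haskell
newtype P' m a i o = P' { unP' :: i -> m (a, o) }

-- the classic parser is P' with m = Maybe and i = o = String
type P a = P' Maybe a String String

-- an illustrative parser: consume one character
item :: P Char
item = P' $ \s -> case s of
  (c:cs) -> Just (c, cs)
  []     -> Nothing
```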

Further, since `m` has kind `* -> *`, our type `P'` could be considered a monad transformer once we are able to define a suitable `lift` function.

To demonstrate indexed monads, I’m going to work through an example of how indexed monads can be used to compose a series of transforms through subsequent datatypes, mocking a compiler pipeline with a monad called `IxMonadT`, which is isomorphic to `P'` defined above. Full source code available here.

`newtype IxMonadT i o m a = IxMonadT { runIx :: i -> m (a, o) }`

Take note that our outer term, and the polymorphic target of our `Functor` instance, is going to be `a`. This could just as well be `o`! We also do a slight re-arrangement, putting `m` after the index types, `i` and `o`. The reason for this is so we can write a `MonadTrans` instance and be a transformer, although as we will see, we cannot write instance methods for the Haskell `Monad` typeclass.

To make our data type a monad, we need to define a `return` and a `bind`. If we try to add an instance of `Monad` from `Control.Monad` for `IxMonadT`, the polymorphic variable will be our `a` from the newtype definition, and we will have to define the following instance:

`(CM.>>=) :: IxMonadT i o m a -> (a -> IxMonadT i o m b) -> IxMonadT i o m b`

This is clearly not what we want!

(Note: we are going to use `CM` to represent qualified imports from `Control.Monad`…)

Instead, we will write the bind as follows:

```
(>>=) :: (CM.Monad m) => IxMonadT i c m a -> (a -> IxMonadT c o m b) -> IxMonadT i o m b
(>>=) v f = IxMonadT $ \i -> runIx v i CM.>>= \(a', o') -> runIx (f a') o'
```
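To see the index threading concretely, here is a small sketch (the bind is renamed `bindIx` to sidestep the Prelude clash, `Maybe` stands in for `m`, and the step definitions are illustrative) chaining a `String -> Int` step into an `Int -> Bool` step:

```haskell
newtype IxMonadT i o m a = IxMonadT { runIx :: i -> m (a, o) }

-- same bind as in the article, under a non-clashing name
bindIx :: Monad m => IxMonadT i c m a -> (a -> IxMonadT c o m b) -> IxMonadT i o m b
bindIx v f = IxMonadT $ \i -> runIx v i >>= \(a', o') -> runIx (f a') o'

-- index goes String -> Int; the value is ()
measure :: IxMonadT String Int Maybe ()
measure = IxMonadT $ \s -> Just ((), length s)

-- index goes Int -> Bool; the value reports the threshold used
threshold :: IxMonadT Int Bool Maybe Int
threshold = IxMonadT $ \n -> Just (3, n > 3)

-- composed: the index goes String -> Bool
demo :: Maybe (Int, Bool)
demo = runIx (bindIx measure (\_ -> threshold)) "hello"  -- Just (3, True)
```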

This works to compose an indexed monad from `i -> m (a, c)` and one of `c -> m (b, o)` into one of `i -> m (b, o)`! However, given this signature for bind, there is no way to shoehorn it into the `Monad` typeclass instance, which requires the signature `m a -> (a -> m b) -> m b`, so our indexed monad will never satisfy a monad constraint, and we’ll have to be satisfied knowing it’s an enriched category in the monoidal category of endofunctors :)

Given that we cannot use bind, or `>>=`, as defined in base to work on our `IxMonadT`, it is still possible to use `do` blocks via the language pragma `RebindableSyntax`. This enables `NoImplicitPrelude` as well, and ultimately means that `do` blocks use whichever bind and return functions are locally in scope. Thus, we need to bring in `>>=` from base as a qualified import, to prevent an ambiguous occurrence error when trying to resolve which `>>=` to use inside each `do` block.

Here are the rest of our `IxMonadT` functions, credit to Stephen Diehl’s WIWIKWLH article on the same subject. We are still able to get `MonadTrans` and `Functor` instances, although `MonadState` and `MonadIO` are not available due to our unique bind signature.

```
-- traditional monad functions
return :: (CM.Monad m) => a -> IxMonadT s s m a
return a = IxMonadT $ \s -> CM.return (a, s)

(>>=) :: (CM.Monad m) => IxMonadT i c m a -> (a -> IxMonadT c o m b) -> IxMonadT i o m b
(>>=) v f = IxMonadT $ \i -> runIx v i CM.>>= \(a', o') -> runIx (f a') o'

(>>) :: (CM.Monad m) => IxMonadT i c m a -> IxMonadT c o m b -> IxMonadT i o m b
v >> w = v >>= \_ -> w

instance MonadTrans (IxMonadT s s) where
  lift :: (CM.Monad m) => m a -> IxMonadT s s m a
  lift ma = IxMonadT $ \s -> ma CM.>>= (\a -> CM.return (a, s))

-- MonadIO
liftIO :: CM.MonadIO m => IO a -> IxMonadT s s m a
liftIO = lift . CM.liftIO

-- MonadState
put :: (CM.Monad m) => o -> IxMonadT i o m ()
put o = IxMonadT $ \_ -> CM.return ((), o)

modify :: (CM.Monad m) => (i -> o) -> IxMonadT i o m ()
modify f = IxMonadT $ \i -> CM.return ((), f i)

get :: CM.Monad m => IxMonadT s s m s
get = IxMonadT $ \x -> CM.return (x, x)

gets :: CM.Monad m => (a -> o) -> IxMonadT a o m a
gets f = IxMonadT $ \s -> CM.return (s, f s)

-- eval/exec the transformer
evalIxMonadT :: (CM.Functor m) => IxMonadT i o m a -> i -> m a
evalIxMonadT st i = fst <$> runIx st i

execIxMonadT :: (CM.Functor m) => IxMonadT i o m a -> i -> m o
execIxMonadT st i = snd <$> runIx st i

instance (CM.Monad m) => CM.Functor (IxMonadT i o m) where
  fmap :: (a -> b) -> IxMonadT i o m a -> IxMonadT i o m b
  fmap f v = IxMonadT $ \i ->
    runIx v i CM.>>= \(a', o') -> CM.return (f a', o')
```
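As a tiny sketch of these functions in action (re-declaring just what’s needed so the snippet stands alone, with `Maybe` standing in for `m`), a lone `modify` already moves the index from `String` to `Int`, and `execIxMonadT` extracts the final output:

```haskell
newtype IxMonadT i o m a = IxMonadT { runIx :: i -> m (a, o) }

modify :: Monad m => (i -> o) -> IxMonadT i o m ()
modify f = IxMonadT $ \i -> return ((), f i)

execIxMonadT :: Functor m => IxMonadT i o m a -> i -> m o
execIxMonadT st i = snd <$> runIx st i

-- the index goes String -> Int in a single step
demo :: Maybe Int
demo = execIxMonadT (modify length) "hello"  -- Just 5
```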

Thus, we have an indexed monad transformer that can compose state monads!

Back to our example computation of many small steps: let’s say we want to start with `SourceCode` and produce a `Core`. We have the following types, as part of a compilation pipeline for the programming language `Arbitrary`.

```
newtype SourceCode = SourceCode Text
newtype Tokenized = Tokenized [Text]
data Expr = EInt Int | EStr Text | EVar Text | EApp Expr Expr deriving (Show)
newtype Syntax = Syntax { unSyntax :: Expr } deriving (Show)
newtype Core = Core { unCore :: Expr } deriving (Show)
```

Here we can see the transformation functions. For the sake of this demonstration, I just used `coerce` or stubbed out a `const` function.

```
source2Toke :: SourceCode -> Tokenized
source2Toke (SourceCode txt) = Tokenized [txt]
toke2Syntax :: Tokenized -> Syntax
toke2Syntax _ = Syntax $ EApp (EVar "Fn") $ EInt . fromIntegral $ 42
syntax2Core :: Syntax -> Core
syntax2Core = coerce
```
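As a sanity check (a sketch duplicating the stubs above, with `String` in place of `Text` so it stands alone), the three transforms compose as plain functions into the `SourceCode -> Core` function that the indexed pipeline will reproduce, just with effects interleaved:

```haskell
import Data.Coerce (coerce)

-- String stands in for Text here so the sketch has no dependencies
newtype SourceCode = SourceCode String
newtype Tokenized  = Tokenized [String]
data Expr = EInt Int | EStr String | EVar String | EApp Expr Expr deriving (Show, Eq)
newtype Syntax = Syntax { unSyntax :: Expr } deriving (Show, Eq)
newtype Core   = Core   { unCore   :: Expr } deriving (Show, Eq)

source2Toke :: SourceCode -> Tokenized
source2Toke (SourceCode txt) = Tokenized [txt]

toke2Syntax :: Tokenized -> Syntax
toke2Syntax _ = Syntax $ EApp (EVar "Fn") (EInt 42)

syntax2Core :: Syntax -> Core
syntax2Core = coerce

-- plain composition: same data flow, no interleaved IO
pipeline :: SourceCode -> Core
pipeline = syntax2Core . toke2Syntax . source2Toke
```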

For our demonstration, we will take our `SourceCode` data type and generate some `Core`, using the previously defined transformations. Additionally, we’ll run and lift an `IO` action to show the results of running the pipeline.

```
run :: IxMonadT SourceCode Core IO Core
run = do
  toke <- gets source2Toke           -- :: IxMonadT SourceCode Tokenized IO ()
  liftIO $ putStrLn "inside IxMonad" -- :: IxMonadT Tokenized Tokenized IO ()
  syn <- gets toke2Syntax            -- :: IxMonadT Tokenized Syntax IO ()
  modify syntax2Core                 -- :: IxMonadT Syntax Core IO ()
  result <- get                      -- :: IxMonadT Syntax Core IO Core
  -- with get we can manipulate the value of the transformation
  liftIO $ print result              -- :: IxMonadT Syntax Core IO Core
  return result                      -- :: IxMonadT SourceCode Core IO Core -- (final type)

main :: IO ()
main = do
  let srcCode = SourceCode "here is my source code"
   in execIxMonadT run srcCode CM.>> print "done"
```

Great! The main property we want here is that our `IxMonadT` function `run` goes from `SourceCode` to `Core`, and we are able to do this through indexed monad composition! We also get interleaved `IO` effects, and the convenience of `do` notation!

The inspiration for this idea was to refactor ghc’s `Stream.hs` module in a type safe way, such that refactoring other parts of the compiler will not cause a major issue. The source code for Stream.hs is available here. However, the way Stream is written, our indexed monad type is not a drop-in replacement. (Side note: Stream.hs, as written, is not a true stream like Conduit or Pipes.)

As for proper uses of indexed monads, there are a few that I think are worth a mention:

- Squeal uses indexed session monads for tracking schema migrations. (Shout out to Eitan for help with this article!)
- Session Types, which use a phantom parameter as an index to “protect” resources like file handles.
- JSONTest is an indexed monad approach to testing, where your index type variables represent an abstract datatype to encode into JSON, the result of extracting a value, and what that value should be. Note: this is the only library here that uses rebindable syntax.
- Ian Malakhovski’s thesis, p. 108-109: an example of an indexed monad with error handling defined via `throw` and `catch`, which is a pretty interesting idea for the application developed in the beginning of this article, considering the propensity for compilation to fail!
- Control.Monad.Indexed, Kmett’s indexed monad library, a good reference for finding definitions and function signatures.

In this article, we’ve defined an indexed monad transformer and employed it to run a mock compiler transformation pipeline with interleaved IO effects. Although we can write all the needed methods for `Monad`, `MonadState`, and `MonadIO`, our definition of `>>=` prevents us from being an instance of these classes. This is unfortunate, as we can use our `IxMonadT` just like a monad, and its bind has a basis in category theory to give us lawfulness. The other consequence is that if your application is using “tagless final” or “ReaderT” patterns, you will not be able to integrate `IxMonadT` into your monad transformer stack, since we cannot define constraints like `MonadReader`.

On using `RebindableSyntax`: I’m not sure I would advocate using this extension outside of library code, and the ambiguity caused when trying to interleave a locally defined `>>=`, say the one for `IxMonadT`, with the one from your favorite prelude can be frustrating. That said, there are libraries mentioned above that fall on both sides of the issue: JSONTest uses the pragma, while Squeal does not; instead, Eitan defines a typeclass `IndexedMonadTrans` with a bind function, `pqBind`.

Due to these restrictions, I believe the best way to derive benefit from indexed monads is to use them in an encapsulated manner to model a specific problem. This gives the library author the benefit of indexed type variables for additional type safety and composition, and avoids exposing the end user to the clunky ergonomics of `RebindableSyntax`.

If you share small, single-module, self-contained Haskell examples, stack script gives us an easy way to get reproducible builds, by pinning the dependencies to a Stackage snapshot within a comment at the top of your Haskell code.

There are at least two additional motivations, besides reproducible builds, for using Stack’s scripting feature:

Lower the configuration barrier: write an independently compiling Haskell source code file with package dependencies without having to configure a new stack or cabal project. Personally, I find this helpful when exploring new libraries or writing small programs.

Using Haskell as a scripting language, or a replacement for Shell/Bash/Zsh. This use case pairs well with the `Turtle` library, although this approach does have downsides.

Stack is a build tool primarily designed for reproducible builds, done by specifying a resolver in a configuration file, usually your project’s `stack.yaml` and `package.yaml`.

With Stack’s scripting feature, we still get reproducible builds by specifying a resolver, but move this specification to the file we are compiling, or as a command line argument. Therefore, for the sake of simplicity, we’ll assume that these scripts are run outside of a stack project, and stack is invoked in the same directory as the script file.

*Note:* When running a stack script inside of a stack project, it’s important to consider that stack will read settings from your `package.yaml` and `stack.yaml`, which may cause issues.

This article contains the following examples of using scripting with stack:

- A basic example of the Scripting Interpreter
- A simple Servant server that statically serves your current working directory
- An example of stack as a bash replacement
- Using stack script to launch ghci

For our first example, we’ll use stack to run a single file of Haskell source code as a script.

Here’s the source code we want to run, in a file called `simple.hs`:

```
main :: IO ()
main = putStrLn "compiled & run"
```

To run this with the stack script interpreter, we can do the following:

`$ stack script simple.hs --resolver lts-14.18`

The resolver argument is mandatory, and Stack will compile and run the `simple.hs` file immediately after invocation using the `lts-14.18` Stackage snapshot.

Alternatively, we can put all of the configuration information into the script itself, like this:

```
{- stack script
--resolver lts-14.18
-}
main :: IO ()
main = putStrLn "compiled & run"
```

which can be compiled and run with `$ stack simple.hs`.

The “killer feature” for scripting with stack is probably the ability to pull in packages without having to set up a `stack.yaml` or cabal project.

This can probably be best seen with `stack ghci`, where the following command will drop you into a ghci repl with the `lens` and `text` packages available from the specified resolver:

`stack ghci --package text --package lens --resolver lts-14.18`

An example of this concept with the stack scripting engine, a quick and dirty file server `explore.hs`, would be as follows:

```
~/projects/stack-script$ cat explore.hs
#!/usr/bin/env stack
{- stack script
--resolver nightly-2019-12-22
--install-ghc
--package "servant-server warp"
--ghc-options -Wall
-}
{-# LANGUAGE DataKinds, TypeOperators, TypeApplications #-}
module FileServer where

import Network.Wai.Handler.Warp (defaultSettings, runSettings, setBeforeMainLoop, setPort)
import Servant (Proxy(Proxy), Raw, serve, serveDirectoryWebApp)

main :: IO ()
main = runSettings settings . serve (Proxy @Raw) $ serveDirectoryWebApp "."
  where
    port = 8080
    msg = "serving on http://localhost:" ++ show port ++ "/{pathToFile}"
    settings = setPort port $ setBeforeMainLoop (putStrLn msg) defaultSettings
```

Noting a couple of features:

- `--install-ghc` is the flag to install ghc, if it is not already available.
- The addition of the hash bang (line 1), `#!/usr/bin/env stack`, lets you run this as an executable: `$ ./explore.hs`
- If running, this script will let you see its source code at `localhost:8080/static/explore.hs`, along with any other files within the current working directory the script was run from.
- The snapshot here is a nightly from the day the script was written, nightly-2019-12-22, which ensures the most up-to-date versions of libraries are used when the script is written, while still pinning us to a specific snapshot.
- We pass `-Wall` to `--ghc-options`, and can give additional ghc options here.

On a fresh compilation, this will take a few minutes to run, as Stack needs to go and grab about 255Mb worth of source code in ~86 dependent packages, then compile and link it all in order for the above code to run. However, on subsequent runs, Stack can use a local cache of the packages, and we can reproduce our project build without downloading and building all the dependencies!

It’s possible to use Haskell and Stack’s scripting feature, along with the Turtle library, as a drop-in replacement for shell scripting!

To do this, we need the following at the top of our Haskell file:

```
#!/usr/bin/env stack
{- stack script
--compile
--copy-bins
--resolver lts-14.17
--install-ghc
--package "turtle text foldl async"
--ghc-options=-Wall
-}
```

This stack script header does a couple of things:

- `--compile` and `--copy-bins` create a binary executable based on the filename.
- `--install-ghc` installs ghc, if needed.
- The script is built with the set of packages from `lts-14.17`.

With Turtle, we get a portable way to run external shell commands, and I was able to create a nice Haskell program to replace the shell script I used to automate the server tasks needed to deploy this blog!

The basics of my deploy turtle script are as follows, and you can see the full example on GitHub here:

```
import qualified Turtle as Tu
import qualified Control.Foldl as L
import qualified Data.Text as T
import Control.Concurrent.Async
import System.IO

argParser :: Tu.Parser Tu.FilePath
argParser = Tu.argPath "html" "html destination directory"

main :: IO ()
main = do
  -- 53 files copied over into destinationDir
  hSetBuffering stdout NoBuffering
  destinationDir <- Tu.options "Build blog and copy to directory" argParser
  Tu.with (Tu.mktempdir "/tmp" "deploy") (mainLoop destinationDir)
```

One nice thing about turtle is the `Tu.with` function, which lets us run the main logic of our program with a tmp directory that is subsequently cleaned up after the `mainLoop` function returns.

Despite turtle being a handy library, I did find some downsides:

- Use of `FilePath`, a pretty clunky, `String`-based file representation.
- Often clunkier semantics than just writing bash: for instance, `cp -r SRC TRG` requires a fold over the result of `ls SRC` and construction of an explicit `cp` for each file; instead, you need to use `cptree`, which took me a while to figure out, so it would be nice if the semantics matched better!
- Turtle is a monolithic framework for interacting with the OS through a set of mirrored shell commands trying to match `coreutils`, and its tightly coupled parts make it not very easy to pick the parts you like and disregard the rest!

We’ve already seen a few examples of stack script, but there is one more that should be in every Haskeller’s toolkit: stack script can be used to launch a ghci repl. Let’s say we are working with a new ADT and want to write a new QuickCheck instance; how can stack script help us?

The following header will load the listed packages into a ghci repl:

```
{- stack
--resolver nightly
--install-ghc
exec ghci
--package "QuickCheck checkers"
-}
module XTest where
```

There is one note to make here about the order of the arguments:

- The file will compile, then drop you into ghci with module `XTest` loaded.
- If `exec ghci` does not immediately follow `stack`, then the `--package` arguments must come before `exec ghci`.

I often find myself coding up small Haskell snippets, whether it’s playing around with a new ADT, trying out a library, or reproducing an example from a paper or a book. In these cases, Stack’s scripting feature shines at giving me a self-contained file where I can specify the dependencies via a snapshot in the file header, and not have to worry about breaking changes, or setting up a project with all the correct dependencies. Thus, I would urge my fellow Haskellers to consider using stack’s scripting feature when they share code online, to help others run their code today, and keep it runnable far into the future!

- Stack Docs: Script Interpreter
- FPComplete: How to Script with Stack
- Hackage: Stack.Script, useful for figuring out what is going on underneath the hood!
- Richard Odone: Scripting in Haskell and PureScript

- I built a library of koans using the Julia programming language as part of a course project last semester! You can run them on Colab here
- The koans themselves are hosted on Jupyter Notebooks, and built from Julia source code with Literate.jl
- Flux.jl is the deep learning library I used
- The koans can also be run locally; the project GitHub repo is here
- Although I’ve finished my first run of chapters, I’m still in the process of exploring these concepts, and would appreciate feedback!

First, a **koan** is a programming language problem with three aspects:

- A text section that contains the introduction of a concept
- A short snippet of non-working code
- A “test” or proof that will work once the understanding of the above concept has been used to fix the code

And **deep learning** is, well, a buzzword, but I took it here to mean a library, or API, that is capable of building modular neural networks using GPU acceleration.

Literate programming is the idea that your source code is both human-readable and machine-runnable. Literate.jl takes this up a step, and lets the user programmatically build Jupyter notebooks from executable scripts. For instance,

```
# # This would be a text
# right here is a continuation of that text
x_str = "now we are in a source code cell"
y = 1
# The first comment brings us back to text!
```

With that set up, building koans is pretty simple as we can alternate explanations with koans, and have the user interactively test and change their code.

An example koan would be:

```
# # A Demo Koan
# array indexing in Julia is 1-based
xarray = ["a", "b", "c", "solution"]
ind = 0 # Fix me !
ind = 4 #src
@assert xarray[ind] == "solution"
```

And we can see a screen shot of the notebook.

Indeed, the `#src` tag will be filtered out by Literate.jl when compiling the notebook, allowing the koan writer to test all the koans by sourcing the script, and the user to look up the solution in the source code if necessary. My script for generating the notebooks from Julia is available here, and you can find more information in the Literate Julia Docs.

Julia is a great programming language, and probably the best option for building neural networks, for two reasons:

- Julia runs fast, is gradually typed so you can write it fast, and compiles down to LLVM, which means no calls to C/C++!
- For the above reasons, a neural network can be programmed entirely in Julia, making differential programming much simpler, and building neural networks easier!

Flux is still under active development, with notable improvements happening in the area of differential programming over the last year, and the next generation differentiation system, Zygote is being integrated into Flux now!

So how do we teach Flux? My approach was to write 7 chapters, first covering Julia, then covering what I believed to be the most important aspects of using a new DL library: working with data, building models, training models, and using the GPU.

My strategy was inspired by a project I did earlier this fall to implement a variational auto-encoder in Flux, and I tried to create the document I would have wanted to read given my knowledge base (know math, ML, R/Python), if I were to implement a similar project. If I get a chance, I’ll talk about that project in another post!

Therefore, I set up 7 chapters, as collections of koans, in the following way:

1. Introduction to Julia

2. Working With Data In Julia

3. Intro to Flux

4. Convolutional Neural Networks and Layers in Flux

5. Recurrent Neural Networks and Layers in Flux

6. Flux optimization

7. Putting it all together, and more examples!

For the content of the koans, I wrote many of them myself, and was heavily inspired by the tutorial examples in the Flux source code.

Whether through the use of “koans”, the chrome inspect tool, or the command line, if you are going to learn a new library, you need to play with it. Although I am not sure if koans via Jupyter notebook are here to stay, I think there is an acute need for easy ways to play around with code when you are trying to learn something new. Adding insight to this process should be the goal of any good koan writer!

Every day this year, between 1 and 4am, an email from arxiv.org appeared in my inbox, with the title and abstract of every paper submitted the previous day in Artificial Intelligence, Computation and Language, Computers and Society, Human-Computer Interaction, Information Retrieval, Learning, Other Computer Science, Programming Languages, Software Engineering, and Social and Information Networks.

It’s a lot. Today, 156 submissions. I don’t read all the abstracts, just the title, then read into the abstract until I have a reason not to. Maybe every other day I finish a paper, but those are the topics needed to cover my interests in data science, software engineering, and start-up technologies. For me, it’s a question of breadth and depth.

Curiosity. But that’s not a good enough answer! The first major reason is that I enjoy reading and scrutinizing papers of all levels, and to be honest, Arxiv.org is a bit of a mixed bag. The next is that I like searching for ideas that relate to what I’m doing at work, and that inspire me to develop the skills I need to do a ‘great’ side project currently beyond my skills. There’s been a lot of positive feedback while reading, and I find new information all the time that is tangentially related to work.

Over the course of the year I found a variety of interesting papers that have influenced my work, thinking, and that I otherwise just think are worth sharing. Here they are:

A Conceptual Introduction to Hamiltonian Monte Carlo STAN, a probabilistic programming language used for bayesian statistics, uses Hamiltonian Monte Carlo, and this is the guide for understanding the algorithm with a differential geometry primer included. Betancourt provides a good overview, covering the concepts, important metrics used for debugging STAN, and even the mathematics behind Hamiltonian Systems and phase space. This is a fascinating paper from the perspective of applying physics to solve numerical problems alone, but what makes it great is the geometric intuition it provides when you need to get a STAN model to converge. In my experience, the intersection between statistical model, STAN implementation, and convergence provides the solution space for possible bayesian models in STAN, and this guide really helps understand the later two concepts.

The Future of Ad Blocking: An Analytical Framework and New Techniques Ads are everywhere, and we are only becoming more conscious of the effect they have on our attention, web experience, and importantly, privacy. Blocking ads is popular, although existing solutions are technically rudimentary in their implementation. The authors discuss how ad blocking may likely evolve, what technical game states will be encountered, and propose an interesting end game that consists of user software that can both actively block ads while obfuscating ad block detection. There is incredible demand for ad-blocking software, and this paper really spells out an interesting solution to a problem many of us face!

Developing Bug-Free Machine Learning Systems With Formal Mathematics

The authors here are trying to bridge the gap between building a machine learning system and deploying that system in production. If you are unfamiliar with how difficult this is, it’s a problem of design opposites: in model development you are looking for a solution and you aren’t sure what exactly you’ll end up with, while in production you need a fast, efficient, and ultimately reliable algorithm that will work safely every time. What’s so fascinating is that they built a programming language with theorem proving expressive enough to run a variety of models needed for exploring model space, which is inherently “safe” enough to run in production. This idea is certainly far from finished, but it’s an early example of how programming language theory can help solve some of the more difficult problems in industrial data science and machine learning; 2018 will hopefully see more work being done on this, with possible integrations with the major deep learning libraries.

Sparsity information and regularization in the horseshoe and other shrinkage priors Regularization is important in machine learning, and lets us train models that efficiently use just the features needed for prediction. Known as sparsity in bayesian statistics, this problem is difficult in terms of both defining the proper theoretical distributions, and computationally estimating the distributions’ parameters with sampling techniques. This paper goes a long way by providing theoretically justified, and computationally converging, priors that allow sparsity constraints to easily be added to bayesian models in STAN. This is a huge breakthrough, and adds a significant new practical technique available for bayesian modeling in STAN. Where only strongly uninformative priors were available before, we now have the ability to select n out of k features to be used in the final bayesian model. This should make STAN a viable option for feature selection, potentially expanding its role in many data science projects.

Proxy Discrimination in Data-Driven Systems What is fairness? Is our process fair, are we? For regulated institutions and implementers with moral fiber, this is a vital question. This paper defines the use of proxy variables in discriminatory machine learning systems using information theory, then develops pseudocode algorithms for testing the presence of proxy variable discrimination in automated decisions. I like this paper for two reasons: one, it provides a good, testable mathematical basis for a concept many of us building algorithms are familiar with, proxy discrimination; and two, reading this paper was my first exposure to the field of fairness, and all its corresponding and contradictory measures. Issues of fairness are only becoming more of a priority for stakeholders, and this paper gives you a jumping off point to determine the fairness of an existing process and its inputs.

Disintegration and Bayesian Inversion, Both Abstractly and Concretely

A wonderful paper about manipulating probability distributions, including beautiful visualizations of probabilistic manipulation. The paper solidifies the notion of a probability distribution in a formal language, the basis for the EfProb library in Python. Overall, this is a really interesting application of formal semantics, mathematics, and statistics that I found extremely educational and a joy to read! For the mathematics alone, this paper is definitely worth a browse, especially if you are building a software system with probabilistic reasoning!

Stream Graphs and Link Streams for the Modeling of Interactions over Time This paper develops a formalism for dealing with graph interactions over time, which is both self-consistent and compatible with graph theory. This provides a coherent framework, and subsequently develops a set of graph theory measures, for dealing with datasets in many operational domains. What caught my attention was how well the formalism describes some of the operational and network data I see at work, and the elegance of the solution. It would be interesting to see some additional work done here with causality, as the formalism already describes temporal relationships so well.

A Tutorial on Canonical Correlation Methods Canonical Correlation Analysis is a multivariate statistical technique to compare paired sets of variables, where each set can contain many measures. This is a very well written tutorial, from explaining the motivation and history of the technique, to formulating CCA and giving a proof of its solution via Lagrange Multipliers. For a reader familiar with Principal Components Analysis, or even just linear algebra, this is a surprisingly effective tutorial, and very much worth the time!

The biggest trend I saw this year was “Deep Learning applied to problem X” (what really is Deep Learning?); there are numerous papers per day that just deal with neural networks, implemented in all of the major toolkits. There is a lot of noise here, but definitely some good work, and I’m especially looking forward to what comes out next year about the role of causality and information theory in neural network representations. See top answer for a list of Deep Learning pubs in 2017.

The next big thing I saw was exploratory data analysis on Twitter: select some tweets, run a suite of NLP feature extraction tools, then perform a statistical analysis. There are a lot of similar papers trying to predict ‘Fake News’: using paired data sets, using crowdsourcing, a lot of approaches. Something I wasn’t expecting to see was a lot of survey papers asking software developers opinion and work related questions, like “What are your favorite tools?”, “Why did you quit your last job?”, etc. These are usually on the lower end of the rigor spectrum, and are often hit or miss in methodology. Nonetheless, there are often some interesting insights, however obvious the subject matter may be.

Another interesting area was the application of formal methods to existing problems: two of these papers made the list, and I find myself constantly thinking about, developing, and refining notation while working on data science problem.

- What my areas of interest are, including programming languages, time series processing models, bayesian statistics, causality and information in deep learning, and formal methods applied to X.
- The most effective process for me for reading the daily digest of abstracts is: open the email, read the title and author, and continue through the abstract, onto the link. If interest is lost at any point, go to the next entry; else, bookmark the link. Next, I subsequently read all the bookmarks.
- Whatever your problem is, there is someone working on a similar problem, whose approach will benefit you. This held for the majority of the data science problems I encountered, even if all they had to offer was a perspective on how to optimize something I didn’t have time to do!

- It’s important to quickly distill the essence of an idea, how the authors are approaching a problem, and what they hope to achieve. There are a ton of great ideas on Arxiv.org, and you can quickly see a new perspective on something!

Just tried out Python Pandas for analysis work as an alternative to R; so far, so good! I’m really impressed with the interface, speed, and resources online. For me, trying Pandas is analogous to using R’s data.table package: some slight differences, but all the same functionality when it comes to data transformation. Pandas is built on top of NumPy arrays, and between those two packages, all of R’s data frame capability is available in the Python environment.

My experience using R and its wealth of well-developed statistics, machine learning, and visualization packages gives me quick access to tools not found in Python, and the community of statisticians using R ensures that cutting-edge packages are published regularly. But let’s be honest: R has some major weirdness that makes it hard to learn, difficult to run concurrently, and downright slow for some data structure access patterns. There are ways around a lot of these issues, but it’s hard to overcome the fact that few people know R, and fewer know it well, compared to Python. Writing critical code for a start-up in R is a risky proposition when it comes to maintainability. Most likely, a start-up is using more than just R (prove me wrong!), and if Python or an alternative language can handle the analysis task, it should be used. This is where Pandas can really shine: data transformations.

Getting used to Pandas has put my foot in the door of doing data science in the Python ecosystem, although transferring all my skills from R will require learning about a dozen packages. With all the folks out there using Python for projects and companies, the advantage of using Python for analysis can only grow as Python matures. If Python can woo Academia’s statisticians, R will eventually lose its superiority in package support, and the user environment where a lot of folks like me learned it. Until then, I’ll most likely use both languages for different tasks while I eagerly anticipate the day Julia becomes better than both!

- Check out this cheat sheet for Pandas basics
- A nice translation of R’s data frame functions to Pandas
- Comparison of R and Python for data science (post image is from them): link
