MLJ.jl: A Julia package for composable machine learning (alan-turing-institute.github.io)
teruakohatu 1656 days ago [-]
Model composability is one of Julia's and MLJ's highlights. Library designers tend to take this into account throughout the ecosystem. It helps that Julia does not need to drop down into a C or FORTRAN wrapper like NumPy.

That being said, MLJ tends to wrap a lot of models from scikit-learn with a composable wrapper, meaning you end up having to manage Python dependencies and you lose Julia features such as multi-threading (Edit: this applies when using Python models; multi-threading is supported when using Julia models).

Long term I am sure MLJ will slowly replace scikit models, but right now it can be a little painful to use if the model you want is not implemented in Julia.

snicker7 1656 days ago [-]
This is exacerbated by the fact that Julia's Pkg.jl does not yet support conditional/optional dependencies [0]. A lot of these meta packages tend to pull everything but the kitchen sink.

[0]: https://github.com/JuliaLang/Pkg.jl/issues/1285

beforeolives 1656 days ago [-]
So instead of dropping down into C, it has Python as a dependency and wraps sklearn? Way to solve the two-language problem.
ponow 1656 days ago [-]
Absa-bloody-lutely it's a way of solving the two language problem, because it provides a pathway from the current two-language setup (Python with C for the computationally intense portions) to a single language (Julia everywhere, because it can reach C speeds). In the short term, library writers can provide Julia wrappers around legacy code from other languages, but that needn't affect the library users.

I think you can only make the snarky remark because you don't truly get the 2-language problem. The problem has zero to do with the complexity of dealing with dependencies, as if the hard problem of scientific computation languages is like a Linux distribution with too much bloat. No, no, no. It's the cognitive load to solve scientific problems, when at the end of the day performance really matters. We need a language that is as terse as mathematics where we can customize all the way down to a single array entry, without performance loss. Python itself will never have the necessary performance. To do truly new numerical stuff you cannot avoid the computational kernel details (e.g., PDEs or image processing). So you're forced, as a researcher, to become good enough to be effective at two languages, and you often have a complex data exchange interface between the two, and it sits right near the critical conceptual problems that you're trying to solve.

newswasboring 1656 days ago [-]
That's a genuinely mean comment towards a new library which is using existing code to bootstrap itself. Julia does solve the two language issue, but not in all situations. Like I'll never write a driver in julia. But this mean spiritedness towards anything that's not perfect is disheartening.
kescobo 1656 days ago [-]
>Julia does solve the two language issue, but not in all situations

And not instantaneously. Developer time being finite, using what works from existing libraries while waiting for a pure-julia implementation seems like an excellent solution.

Certhas 1656 days ago [-]
Julia has always had the ability to call Python code. For example you can use PyPlot/matplotlib as a plotting backend for Julia.

There is no way to bootstrap a scientific language without leveraging existing ecosystems. But this is a starting point. Whenever someone finds a sklearn model too slow, or misses the automatic parallelism that Julia brings, they can just implement the model in Julia, and bit by bit the non-Julia bits get swapped out where it brings a real benefit.

dklend122 1656 days ago [-]
You hardly need python for models at this point. There are native julia implementations of linear regression, naive bayes, gradient boosting, random forest, PCA, T-SNE and more.

The python interop is just a stopgap

oscardssmith 1656 days ago [-]
The ability to call other languages is a strength. The need to do so is a weakness.
Jouvence 1656 days ago [-]
I disagree; consistency is the most important part of this. With Python, you know where you stand - performance comes from elsewhere. With Julia it might be internal, a C/Fortran library, or apparently other things now too.
DNF2 1656 days ago [-]
I beg your pardon? Knowing that Python is slow is a strength of Python? Then, if Python suddenly became fast, it would be a negative?

Did I misunderstand something?

Presumably, all fast languages are at a disadvantage, then.

Jouvence 1656 days ago [-]
I never called it a strength; I simply don't care how fast Python is because I don't need to. Reasoning about what is going on under the hood with Python is just easier than Julia - if it needs to be fast, it's a fast external library being used.

This is really a minor issue stemming from the relative maturity of the languages - if Julia becomes more established I would hope usage of external libraries which don't offer a performance advantage (ie everything besides C and Fortran) eventually gets replaced with native packages to preserve the sanity of the users.

adgjlsfhk1 1656 days ago [-]
One problem relying on fast libraries causes is it makes doing compiler tricks like automatic differentiation (AD) basically impossible. Also, it restricts the types of APIs that make sense. A simple example of this is to compare Scikit learn to Julia. In Scikit learn, most clustering methods don't allow the user to specify a distance function because doing so would require running python inside a tight C loop, tanking performance. In MLJ, on the other hand, basically anything that requires a distance function will allow you to pass one in rather than assuming euclidean distance. This is possible, because a distance function written in Julia can still be fast, so it can be used without slowing down the whole program for people who only want euclidean distances.
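
To make the API point concrete, here is a minimal sketch in plain Python. It is not MLJ or scikit-learn code, and `assign_clusters`, `euclidean`, and `manhattan` are hypothetical names invented for illustration. It shows a clustering assignment step that takes the distance function as a parameter. In CPython every call to `dist` runs in the interpreter, which is the overhead a C-backed library avoids by fixing the metric.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_clusters(points, centers, dist=euclidean):
    """Return, for each point, the index of the nearest center under `dist`."""
    return [
        min(range(len(centers)), key=lambda k: dist(p, centers[k]))
        for p in points
    ]


points = [(0.0, 0.0), (6.0, 0.0)]
centers = [(3.0, 3.0), (5.0, 0.0)]

by_euclidean = assign_clusters(points, centers)
by_manhattan = assign_clusters(points, centers, dist=manhattan)
```

With these toy values the first point changes cluster when the metric changes, which is why being able to pass the metric in matters.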
DNF2 1656 days ago [-]
So then that 'weakness' in python is not an issue to you personally, because you fluently drop down to a fast language anytime you need to, with no particular loss in productivity?

Keep in mind though, that this could be a hurdle to those who are less multilingual. So even though it's not a weakness to everyone, it is to many.

notagoodidea 1656 days ago [-]
On the other side, we observe more and more inter-languages's ecosystem crossing from/to Python to leverage its gigantic set of libraries. Julia interface with a lot of other languages but it is a big trend for the last years.

I think that, as we move forward, the two-language problem is turning into a network of multiple languages calling each other, instead of either having to rewrite everything from scratch or relying on a C/C++/Fortran/Rust lib and the C ABI to patch them together. We may end up with a set of "meta"-languages permitting the interaction of a large number of libraries and programming languages outside of their silos.

Moreover, Julia coupled with PyCall could "corrode" à la Rust for a smoother migration to pure Julia.

anothathrow975 1655 days ago [-]
Maybe this comment was too sarcastic but the core point stands. Julia’s main advantages for most ML and Data Science users are still theoretical. The package ecosystem is still way too immature. It’s not just MLJ. I was excited about Julia and made a good faith effort to port a work project to Flux and Zygote and it was a disaster. There’s nothing wrong with passionate open source tinkerers working to improve the ecosystem, but don’t advertise it as ready for production when it’s not.
amkkma 1655 days ago [-]
When was this and what were the issues?
DNF2 1655 days ago [-]
The two-language problem is not concerned with calling out to or wrapping pre-existing libraries in a second language. It means having to write parts of your code in a second language, due to, for example, performance reasons.
systems 1656 days ago [-]
I don't like Python, but the language has dominant libraries in many domains

Julia hides Python, Julia good

I understand that using Python feels wrong, because Python is not known for its performance, but still it seems like a practical decision; rebuilding everything from scratch would be too much effort

indeedmug 1656 days ago [-]
There are some very cool features with MLJ. You can query models() and find all of the machine learning models that your data works with. This is great for easily plugging into various models to see what works.

The problem I ran into is that it can be very unclear what went wrong when things don't plug in exactly as planned. You get a typing error that looks like C++ template errors. Or your data doesn't work quite right with a model and needs some transforming, but the documentation doesn't spell out how to do that in your situation. There aren't nearly as many StackOverflow questions on using Julia as there are Python ones, so you can't just look up a very specific question and get answers.

Certhas 1656 days ago [-]
Unfortunately this is quite a general issue with Julia at the moment. If things work it's magical; if they don't, it's hard to see what subtle assumption of the packages used was violated. Julia and C++ templates share a lot of properties, and some of the same pain points. It took C++ until C++20 to address this with Concepts. I hope it won't take quite as long for Julia.
doctoboggan 1656 days ago [-]
For those with experience with ML in Julia I’d love to get some advice. I have a little business that sells 3D printed jewelry (https://lulimjewelry.com). My biggest seller is customers engraving their own or their loved one's fingerprint on the ring. Most of those prints come in needing manual cleanup, which I can usually do in a few minutes.

I’d love to train a ML algorithm to do this, and I’ve been building up the before and after pictures over time using my manually cleaned up customer fingerprint images. Can anyone give me suggestions or pointers on the sort of algorithm that may be best suited for this task? Just something to get me started down the correct path would be very helpful.

syntaxing 1656 days ago [-]
Depends on what you are cleaning. I have a hunch you wouldn't even need anything ML related. Just the right computer vision pipeline using OpenCV should get you really close. If you really want to use ML/DL because it's cool and fun, I'm pretty sure autoencoders will do what you need [1]. If you're lucky, the denoising autoencoder should do the clean up automatically. An alternative is to use a VAE, which goes really far compared to a GAN since it's much more sample efficient and easier to train. I think(?) it's what was used in the original deepfake.

[1] https://www.mygreatlearning.com/blog/autoencoder/

celrod 1656 days ago [-]
doctoboggan 1656 days ago [-]
I can't really share any images with you as it's my customers' PII, but the cleanup stage involves a decent amount of artistic interpretation, which is why I didn't think a normal image processing pipeline would work. They often come in smudged, missing lines, incomplete, or with some other error. It's not just a matter of playing with curves, levels, brightness, contrast, etc. I have to trace over them, defining what I think the underlying print probably looked like.
notagoodidea 1656 days ago [-]
How far could you go with some automatic cleaning, something like `maptrace` [0] and some topology rules once you have your vector file, finishing with some manual artistic polish? Because I don't believe that you will easily achieve the "do-it-like-me" result expected from either GANs, VAEs, or autoencoders. In case you want to look, the classic examples of style transfer (photo to Van Gogh style, etc.) seem to be the way to go. Specific to Julia, you can look at Flux.jl or Knet.jl.

[0] https://github.com/mzucker/maptrace

syntaxing 1656 days ago [-]
I think it's a common misconception that OpenCV is Photoshop in coding form. There's a ton of powerful functionality that you can use for what you described, but it's hard to say without seeing the pics. A good example is noise removal in the PSD domain (power spectral density, not PSD Photoshop files) [1].

[1] https://docs.opencv.org/master/d2/d0b/tutorial_periodic_nois...

foerbert 1656 days ago [-]
Naively that doesn't seem like a major problem. I'd see it as largely setting up a bunch of constraints and methods to generate candidate solutions - after the initial image processing, obviously.
tomrod 1656 days ago [-]
You're right on the VAE. Depending on output quality, my experience has been fewer samples are needed for VAE training.
teruakohatu 1656 days ago [-]
I agree. For cleaning up a fingerprint I would think you could get good enough results using OpenCV. People were cleanly vectorizing raster images long before deep learning.

If you can automate it using a combination of Gimp filters + Inkscape, you can probably convert that process into code pretty easily.

doctoboggan 1656 days ago [-]
I can and do simply rasterize images for some of my clean and proper fingerprints. The problem I am having is with smudges, incomplete prints, or other artifacts. In those cases I trace over and fix the print sometimes taking a decent amount of artistic license depending on the underlying quality of the image.
bryanrasmussen 1656 days ago [-]
maybe a quick mvp would be to detect smudged images, and separate those out for manual handling, and the non smudged automatically rasterized.
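
A rough sketch of that triage step, in plain Python on grayscale images stored as nested lists. The variance of the Laplacian is a common sharpness heuristic that I am swapping in here; nobody in the thread proposed it. `laplacian_variance`, `triage`, and the threshold value are hypothetical, and the threshold would need tuning on real prints.

```python
from statistics import pvariance


def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values mean few sharp edges, which suggests a blurred or smudged image.
    """
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
        - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def triage(images, threshold):
    """Split named images into (automatic, manual) lists by sharpness score."""
    automatic, manual = [], []
    for name, img in images.items():
        if laplacian_variance(img) >= threshold:
            automatic.append(name)
        else:
            manual.append(name)
    return automatic, manual


# Toy data: crisp vertical ridges versus a flat gray smear.
crisp = [[255 if x % 2 == 0 else 0 for x in range(8)] for _ in range(8)]
smear = [[128] * 8 for _ in range(8)]

# threshold=100.0 is an arbitrary value chosen for this toy data only.
automatic, manual = triage({"crisp": crisp, "smear": smear}, threshold=100.0)
```

Real smudges are usually local rather than global, so scoring tiles of the image instead of the whole image might work better.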
tomrod 1656 days ago [-]
ML practitioner here.

Step 1: write out the steps you do to manually clean up.

Step 2: Depending on: (1) how many samples you have, (2) what is defined as "before" and "after", (3) what performance level is acceptable, there are many algorithms to choose from. Transform learning may be the target, reinforcement learning or CNN as well. It really depends on Step (1) and (2.1).

doctoboggan 1656 days ago [-]
For a given poor quality print I usually take it on my iPad and retrace either the whole print or just a portion. I usually just use the pencil tool in white and run it between the ridges of the print to give good separation and definition to the final engraving. Depending on the quality of the print (some of them are hardly more defined than a low res ink smudge) I may have to take a good amount of artistic liberty on where I think the pattern was going. Using my knowledge of what a print may look like I can usually do a decent job. That's why I thought this might be a good task for an ML algorithm, since I can show it what a bad print looks like and what a decent cleaned up version of that print should look like.
tomrod 1656 days ago [-]
doctoboggan 1656 days ago [-]
Wow, this is basically exactly what I needed, thanks! Seems like the key word I was missing in my searches was "inpainting"
tomrod 1655 days ago [-]
Glad to hear it.
doctoboggan 1655 days ago [-]
I got their model running with the included training data and it’s quite impressive. Some of my more extreme poor examples don’t produce reasonable results but many do.
tomrod 1654 days ago [-]
That's awesome! Happy scaling!
tomrod 1656 days ago [-]
So you're denoising by identifying where the patterns should be, but there is some smudge?

If so, you may be able to skip ML altogether and use a deterministic software program to classify light versus dark areas (what you're doing with your pencil). ML might help with mapping smudges to the right shape/sizing.

That said, if you have a ton of examples, then yeah an image recognition task might help.
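
One possible deterministic light-versus-dark classifier is Otsu's method, which picks the gray level that best separates the histogram into two classes. The sketch below is plain Python under the assumption of 8-bit grayscale input as nested lists; `otsu_threshold` and `binarize` are illustrative names, not taken from any library.

```python
def otsu_threshold(pixels):
    """Return the gray level t in 0..255 that maximizes between-class variance.

    Pixels with value <= t are treated as dark, the rest as light.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue  # no dark pixels yet, nothing to separate
        n_light = total - n_dark
        if n_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(img, t):
    """Map each pixel to 1 (light) if it is above t, else 0 (dark)."""
    return [[1 if p > t else 0 for p in row] for row in img]


img = [
    [20, 30, 200, 210],
    [25, 35, 205, 215],
]
t = otsu_threshold([p for row in img for p in row])
mask = binarize(img, t)
```

A single global threshold tends to struggle with uneven ink or lighting, so a per-tile threshold is a natural next thing to try.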

hantusk 1656 days ago [-]
I would go for decrappify: https://www.fast.ai/2019/05/03/decrappify/

GANs are hard to train for a new ML practitioner, and a VAE can have blurry results if you are not using skip-connections like in U-Net.

It sounds like you have training examples (images pre and post processing), but would benefit from an approach using transfer learning since you don't have that many.

Good luck!

ampdepolymerase 1655 days ago [-]
How do skip links help with blurriness? I thought they are mainly to prevent disappearing gradients?
neolog 1656 days ago [-]
Sounds like you want to convert raster images to vector. https://vectormagic.com/ is a commercial version.
doctoboggan 1656 days ago [-]
That's not actually my problem. For a good quality print I can do this with no issue (I use imagetracer.js in my web app), but the problem I have is the many prints that come through smudged or otherwise damaged. In those cases I have to retrace them, sometimes taking a decent amount of artistic license using what I know about how other prints look.
ramraj07 1656 days ago [-]
How many samples do you have of smudged prints that have been corrected? Can you also artificially smudge prints to create a training set? Then an interesting solution might be possible!
doctoboggan 1656 days ago [-]
I currently only have about 100 fixed up image sets. One of the other commenters pointed me to a blog that mentioned this software for generating realistic fingerprint images: https://dsl.cds.iisc.ac.in/projects/Anguli/ which could then be artificially smudged. There seems to be a dataset out there somewhere that has already done this (https://competitions.codalab.org/competitions/18426) but I can't find out where to download it.
ramraj07 1656 days ago [-]
I would advise against completely synthetic prints, you want to model your training data as close to reality as possible! Can't you just blur other finger prints you have?
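
A minimal sketch of that idea, assuming clean prints are available as grayscale nested lists: blur one randomly placed patch of each real print and keep the untouched print as the training target. The box blur, the patch size, and the names `box_blur`, `smudge`, and `make_pairs` are all my assumptions; a real smudge model would need to look much more like actual damage.

```python
import random


def box_blur(img, radius=1):
    """Mean filter; windows are clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) // len(window)
    return out


def smudge(img, rng, radius=1, size=4):
    """Blur one randomly placed size x size patch, leaving the rest untouched."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    blurred = box_blur(img, radius)
    out = [row[:] for row in img]  # copy so the clean print is not modified
    for y in range(top, top + size):
        out[y][left:left + size] = blurred[y][left:left + size]
    return out


def make_pairs(clean_prints, n_per_print, seed=0):
    """Return (degraded, clean) pairs, several random smudges per real print."""
    rng = random.Random(seed)
    return [(smudge(p, rng), p) for p in clean_prints for _ in range(n_per_print)]


# Toy stand-in for a real print: alternating light and dark ridges.
clean = [[255 if x % 2 == 0 else 0 for x in range(8)] for _ in range(8)]
pairs = make_pairs([clean], n_per_print=3)
```

Each real print yields several different degraded versions, which stretches a small set of about 100 examples without inventing synthetic ridge patterns.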
Skyy93 1656 days ago [-]
I do not understand why someone should use this. It is basically a wrapper around existing libs and frameworks. This framework does not solve any problem that scikit-learn and other existing frameworks have not already solved.
sgt101 1656 days ago [-]
A good thing to read is https://joss.theoj.org/papers/10.21105/joss.02704

The focus of this package is on the "plumbing" of auto-ml solutions. The work of designing solutions based on multiple discovered models is rather unsupported in the current state of the art, and this package looks quite supportive of it.

A big contribution (if it works in practice) is the idea of a scientific type describing what data is and how it should be mapped to algorithms - a systematic way of doing this is the underpinning of a process for model selection.
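
A loose illustration of the idea in Python. This is not MLJ's or ScientificTypes.jl's API, and the type names, model names, and functions below are all made up for the example. Each column is tagged by what it represents rather than how it is stored, each model declares which tags it accepts, and model discovery becomes a filter over a registry.

```python
from dataclasses import dataclass
from enum import Enum


class SciType(Enum):
    CONTINUOUS = "continuous"
    COUNT = "count"
    MULTICLASS = "multiclass"


def infer_scitype(column):
    """Crude guess at what a column represents, independent of storage details."""
    if all(isinstance(v, str) for v in column):
        return SciType.MULTICLASS
    if all(isinstance(v, int) and not isinstance(v, bool) for v in column):
        return SciType.COUNT
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column):
        return SciType.CONTINUOUS
    raise TypeError("column mixes types this sketch does not handle")


@dataclass(frozen=True)
class ModelSpec:
    name: str
    input_types: frozenset  # scientific types the model accepts as features
    target_type: SciType    # scientific type the model predicts


# Hypothetical registry; a real one would be populated by model packages.
REGISTRY = [
    ModelSpec("LinearRegressor", frozenset({SciType.CONTINUOUS}), SciType.CONTINUOUS),
    ModelSpec("TreeClassifier", frozenset({SciType.CONTINUOUS, SciType.COUNT}), SciType.MULTICLASS),
    ModelSpec("PoissonRegressor", frozenset({SciType.CONTINUOUS, SciType.COUNT}), SciType.COUNT),
]


def matching_models(features, target):
    """Names of registered models whose declared types cover this data."""
    feature_types = {infer_scitype(col) for col in features.values()}
    target_type = infer_scitype(target)
    return [
        m.name
        for m in REGISTRY
        if feature_types <= m.input_types and m.target_type == target_type
    ]


features = {"height_cm": [170.2, 181.0, 165.5], "visits": [3, 0, 7]}
labels = ["yes", "no", "yes"]
classifiers = matching_models(features, labels)
regressors = matching_models({"height_cm": [170.2, 181.0, 165.5]}, [1.5, 2.0, 0.3])
```

The same mismatch that produces an empty result here is presumably what surfaces as a type error when data and model do not line up.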

dklend122 1656 days ago [-]
Check out the "Model composability" section. Also multithreading, custom differentiable loss functions, support for any abstract table type, and the list goes on
ampdepolymerase 1656 days ago [-]
Tarrosion 1656 days ago [-]
I want to love MLJ and I _do_ love Julia, but holy wow is it hard to learn. I think three times now I've had a small dataset I wanted to do something simple like linear regression on, thought it'd be a good opportunity to learn MLJ, and ended up giving up when I was knee deep in inscrutable errors about scientific types and machines and unsupported models.

If anyone has a good introduction to recommend which is clearer than the official docs, I'd definitely appreciate it.

gugagore 1655 days ago [-]
Sorry, I don't have a good introduction to recommend, I just wanted to include a link if anyone is intrigued by "scientific types", which refers specifically to: https://github.com/alan-turing-institute/ScientificTypes.jl
indeedmug 1655 days ago [-]
The MLJ library has some example notebooks in the github repo somewhere. I used those notebooks to figure out the API of MLJ because there are unstated things in the documentation. But I agree, it's brutal to learn Julia because the documentation is so lacking for beginners and type errors can be very challenging to understand.
Tarrosion 1655 days ago [-]
FWIW I've found learning Julia (and using it daily) mostly overwhelmingly pleasant, and I perceive MLJ as an unfortunate outlier. Sounds like you had a different experience.
adgjlsfhk1 1654 days ago [-]
If you haven't tried out Julia recently, you might want to. 1.6 makes some pretty big improvements in stacktrace printing.
1656 days ago [-]