Producing a useful classification model from a heavily imbalanced training dataset can be challenging in machine learning. In this article, I will present a simple but effective method ML engineers can use to build a strong classifier from an imbalanced dataset. I will also outline some of the most common mistakes ML engineers make while handling heavily imbalanced datasets.
Machine learning engineers often encounter problems requiring classification of an instance into one of two categories based on available features. Common use cases include:

- Detecting spam emails
- Identifying fraudulent users
- Determining whether a tumor is present
These three examples share a positive class (spam, fraudulent, or tumor present) that is often heavily underrepresented in the population. For example, far less than one percent of an application's users will ever attempt fraudulent behavior; the vast majority engage in normal day-to-day usage.
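One simple and effective way to handle this kind of imbalance is to reweight the training loss so the rare positive class is not drowned out by the majority class. The sketch below uses scikit-learn; the synthetic dataset and the roughly 1% positive rate are illustrative, not the method the article goes on to describe in full.

```python
# Minimal sketch: training a classifier on a heavily imbalanced dataset
# using class weighting. The data here is synthetic and illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic dataset: ~1% positive class, mirroring rates seen in fraud
# or spam detection.
X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" scales each example's loss inversely to its
# class frequency, so the rare positives still influence the fit.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Accuracy is misleading at a 99/1 split (always predicting "negative"
# scores 99%); report per-class precision and recall instead.
print(classification_report(y_test, clf.predict(X_test), digits=3))
```

Note that the evaluation metric matters as much as the training trick: on data this skewed, raw accuracy is nearly meaningless.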
Until recently, the Weebly iPhone app only offered our customers the ability to upgrade their site with a non-renewable in-app purchase. After a subscription’s duration elapsed, we needed to notify the user to purchase another upgrade before their service expired.
As of October 2016, we switched to renewable subscriptions which resulted in reduced friction and improved revenue terms for renewals beyond a year.
Weebly’s business model is unique in that a single user may upgrade multiple sites concurrently. At the time of this writing, Apple’s subscription groups only allow for one active subscription per user.
Implementing Apple’s renewable subscription model is more challenging for products that diverge from the standard subscriber model. This article provides a high-level overview of our strategy for working within those constraints.
3 domain-driven steps you can take to get your codebase into a more manageable state
Let’s face it, writing code to be part of a monolith is easy. We can query the database directly whenever we want, call whatever functions we want in other parts of the application, and not have to think much about organization because we are plugging into an existing architecture. However, this type of development leads to a fragile, entangled codebase in which any change to one part of the application can alter or even break something in some other part without anyone knowing why. Not only that, it also creates an awful learning curve for new developers coming into the codebase. This is not desirable at all, and many who find themselves in this situation begin to read about and understand the upsides of a service-oriented architecture (SOA). The problem is — the larger the monolith, the harder it is to break up.
But just because something is hard does not make it impossible. That said, it can feel impossible because it is not very feasible to go from a messy monolith directly to an SOA. There has to be some kind of intermediary state we can get to that will make it easier to start breaking things up. This intermediary state is still a monolith, but one that is organized by domain without the entanglement or fragility of our original codebase. Once we reach that point, it is much easier to make decisions about the future of our application, specifically with regard to deciding what to break out into services and what parts should remain together.
In this post, we will introduce domain-driven design (DDD) and then go over three steps you can take to transform a messy monolith into the organized intermediary state just described.
Recently I found myself needing to write tests for a small class that read from a JSON file. The class needed to validate the file's existence and content, provide a method to inform the user whether a certain key exists, and provide a method to retrieve the value for a given key. The class looked something like:
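The original class is not reproduced here; as a rough sketch of what such a class might look like (the class and method names below are hypothetical, not the original implementation):

```python
# Hypothetical sketch of a class that loads a JSON file, validates its
# existence and content, and exposes key lookup and retrieval.
import json
import os

class JsonConfig:
    def __init__(self, path):
        # Validate existence and content up front, at construction time.
        if not os.path.isfile(path):
            raise FileNotFoundError(f"No such file: {path}")
        with open(path) as f:
            self._data = json.load(f)  # raises ValueError on invalid JSON
        if not isinstance(self._data, dict):
            raise ValueError("Expected a JSON object at the top level")

    def has_key(self, key):
        """Inform the caller whether a given key exists."""
        return key in self._data

    def get(self, key, default=None):
        """Retrieve the value for a given key, or a default if absent."""
        return self._data.get(key, default)
```

Testing a class like this is awkward precisely because its constructor touches the filesystem, which is what motivates the rest of the article.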
How test-driven development ensured our tax calculator always adheres to all of the Amazon Laws for online orders
When it comes to collecting tax on online orders, the rules that merchants have to abide by these days are complicated. Due to a set of laws known as the Amazon Laws, whether an online order is taxable depends on a number of factors, including destination, shipping origin, and tax nexus. And because these laws are imposed at the state level, each state has a different set of rules to follow. It gets even more complex when an order has to ship over state lines.
In the early days of our eCommerce platform at Weebly, we simply had a system for manually entering the tax rates you want to apply and where to apply them. This worked fine, but it left it up to our users to research for themselves how they should be charging tax, and help with setting up taxes became one of our largest drivers of support requests. So we wanted to do better, and that meant building a system that can automatically calculate taxes on orders.
This created a very interesting software design problem and, in this post, I will be doing a deep dive into the technical approach and implementation that went into creating the automatic tax calculator. In addition to covering the topic of online sales tax, it will serve as a case study for approaching a problem with a test-driven development (TDD) mindset.
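To give a flavor of the TDD mindset before the deep dive: each tax rule becomes a test written before the code that satisfies it. The sketch below is illustrative only; the states, rates, and nexus logic are made-up examples, not Weebly's actual tax tables or calculator.

```python
# Illustrative TDD-style sketch: rules encoded as tests, then a minimal
# implementation. Rates and nexus rules here are hypothetical.
def calculate_tax(subtotal, origin_state, destination_state, nexus_states):
    """Charge tax only when the merchant has nexus in the destination state."""
    RATES = {"CA": 0.0725, "NY": 0.04}  # hypothetical state rates
    if destination_state not in nexus_states:
        return 0.0
    return round(subtotal * RATES.get(destination_state, 0.0), 2)

# Tests first: each state rule is pinned down as an assertion before
# (conceptually) any implementation exists.
def test_no_nexus_means_no_tax():
    assert calculate_tax(100.0, "CA", "NY", {"CA"}) == 0.0

def test_destination_rate_applies_with_nexus():
    assert calculate_tax(100.0, "CA", "NY", {"CA", "NY"}) == 4.0
```

The appeal of TDD for a problem like this is that every state's rule set becomes an executable specification, so a regression in any one state's logic fails loudly.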
Did you notice that on Dropbox.com, you can select a folder and start downloading it while it’s being zipped? This on-the-fly zipping feature comes in handy for both the user and the server — as a user you don’t have to wait until the files are zipped on the back end before the download starts, and it saves the server from creating a temporary zip file and deleting it afterwards.
In this case the browser has no idea when the streamed zip file ends or how big the whole zip file will be, so what the user will see is something like:
In the Network tab, this is what such a request looks like:
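The core idea — producing zip bytes incrementally so the client can start downloading before the archive is finished — can be sketched as follows. This is an assumption-laden illustration in Python (Dropbox's actual implementation is not shown in this article); it relies on the standard library's `zipfile` support for writing to unseekable streams.

```python
# Sketch of on-the-fly zipping: zip bytes are yielded as they are
# produced, rather than buffering a complete archive. A server would
# send these chunks with Transfer-Encoding: chunked, which is why the
# browser cannot know the total size in advance.
import io
import zipfile

class StreamBuffer(io.RawIOBase):
    """Unseekable write target that hands written bytes to the consumer."""
    def __init__(self):
        self._chunk = b""
    def writable(self):
        return True
    def write(self, b):
        self._chunk += bytes(b)
        return len(b)
    def take(self):
        chunk, self._chunk = self._chunk, b""
        return chunk

def zip_stream(files):
    """Yield zip bytes incrementally for an iterable of (name, data) pairs."""
    buf = StreamBuffer()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files:
            zf.writestr(name, data)
            yield buf.take()   # flush the bytes produced so far
    yield buf.take()           # central directory, written on close

# Example: file names and contents are illustrative.
chunks = list(zip_stream([("a.txt", b"hello"), ("b.txt", b"world")]))
```

Because the output stream is unseekable, `zipfile` writes data descriptors after each entry instead of seeking back to patch headers — the same constraint a chunked HTTP response imposes.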
This post is an overview of how we use Docker as our development environment in combination with Laravel at Weebly. My goal was to write this in a way that people with little to no Docker experience could easily follow along, and people with Docker experience could get some insight into how we used Docker with Laravel. If you have lots of Docker experience, much of this post may reiterate concepts you are already very familiar with.
We’ve been using the popular PHP framework Laravel for a recent project at work. Laravel’s “out of the box” approach to development uses VMWare/Vagrant (Homestead), which works perfectly fine, but we were curious about a containerized approach with Docker for a few reasons:
1 - There’s lots of buzz around containers and containerizing applications, so we figured we should see what all the fuss was about.
2 - Our automation team has previously deployed apps using Docker, so we thought this might be an easy way to quickly spin up staging/integration environments for new services we build at Weebly. The old integration approach is very tailored to our monolith setup, and no standard was in place for new services going forward.
3 - The idea of every change to our dev environment (every change to a container) being tracked in git felt like a potentially much cleaner solution than what we were doing with Vagrant. Not that you can’t do something similar with Vagrant, but the Docker container approach lends itself to this naturally.
For more insights into the differences between Vagrant and Docker, see this Quora post.
There is a neat project online called LaraDock. It was a cool way to get the app running with Docker in a matter of minutes, and also a great reference for what the configuration for all of your different containers might look like. Unfortunately, it didn’t really help us truly understand what was happening under the hood, so we ended up starting from scratch while heavily borrowing from the skeleton that LaraDock provides. Docker itself has a pretty great tutorial, which was also an excellent resource (you can skip around to relevant sections without doing the whole tutorial).
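For a sense of what the container configuration for a Laravel development environment might look like, here is a minimal Compose sketch. The service names, image tags, ports, and credentials below are illustrative assumptions, not Weebly's actual setup or LaraDock's files.

```yaml
# Illustrative docker-compose.yml for a Laravel dev environment.
# Images, ports, and credentials are assumptions for demonstration.
version: "2"
services:
  app:
    image: php:7.0-fpm            # runs the Laravel application code
    volumes:
      - .:/var/www/html           # mount the project into the container
  web:
    image: nginx:stable
    ports:
      - "8080:80"                 # browse the app at localhost:8080
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    depends_on:
      - app
  db:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: secret # dev-only credential
      MYSQL_DATABASE: laravel
```

The appeal mentioned above follows directly from a file like this: every change to the environment is a diff to a tracked text file, so the whole dev stack evolves in git alongside the code.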