Analytical - Proof of Concepts


Predicting whether an online bid is made by a machine or a human

This PoC identifies human versus robot behavior on online auction sites. Human bidders on the site are becoming increasingly frustrated with their inability to win auctions against their software-controlled counterparts. As a result, usage from the site's core customer base is plummeting. To rebuild customer happiness, the online auction site would like to identify and eliminate computer-generated bidding from its auctions. Its earlier attempt at building a model to identify these bids using behavioral data, including bid frequency over short periods of time, has proven insufficient. The goal of this PoC is to identify online auction bids that are placed by "robots", helping the site owners easily flag these users for removal from the site to prevent unfair auction activity.
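
As an illustration of the "bid frequency over short periods of time" signal mentioned above, the following is a minimal sketch of one such behavioral feature: the largest number of bids a bidder places within any short time window. The record layout (bidder id, timestamp in seconds), the 60-second window, and the function name are assumptions made for this example, not details of the actual dataset or model.

```python
from collections import defaultdict


def max_bids_in_window(bids, window_seconds=60):
    """Return, per bidder, the largest number of bids placed within any
    span of `window_seconds` seconds.

    `bids` is an iterable of (bidder_id, timestamp_in_seconds) pairs.
    Both the layout and the default window are illustrative assumptions.
    """
    times_by_bidder = defaultdict(list)
    for bidder_id, timestamp in bids:
        times_by_bidder[bidder_id].append(timestamp)

    result = {}
    for bidder_id, times in times_by_bidder.items():
        times.sort()
        best = 0
        start = 0
        # Two-pointer sweep: keep times[start..end] within one window.
        for end, current in enumerate(times):
            while current - times[start] > window_seconds:
                start += 1
            best = max(best, end - start + 1)
        result[bidder_id] = best
    return result


# Made-up sample data: one slow bidder and one bursty bidder.
sample_bids = [
    ("human_1", 0), ("human_1", 400), ("human_1", 900),
    ("bot_1", 10), ("bot_1", 11), ("bot_1", 12), ("bot_1", 13), ("bot_1", 500),
]
features = max_bids_in_window(sample_bids, window_seconds=60)
```

Because the text notes that frequency-based features alone proved insufficient, a value like this would be only one input among several to a classifier.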

The objective of this project is to identify a terrorist network, characterized by tightly knit communities of 3 to 10 members, and to uncover calling patterns identified by criminal psychologists. There are about 2 billion call data records and 200 million unique contacts in the network. The approach is to initially eliminate the non-suspects (big data processing) and then identify the suspects (machine learning).
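
The two-stage idea (first discard unlikely contacts, then look for small, tightly knit groups) can be sketched on a toy call graph. This example makes several assumptions that the description does not state: "tightly knit" is approximated by repeated calls between a pair, and a "community" is approximated by a connected component. A real system would use a proper community-detection method and the calling patterns supplied by the psychologists.

```python
from collections import defaultdict, deque


def small_tight_groups(calls, min_size=3, max_size=10, min_calls=2):
    """Return groups of contacts linked by repeated calls.

    `calls` is an iterable of (caller, callee) pairs. Pairs with fewer than
    `min_calls` calls are dropped (the elimination step), and connected
    components with min_size..max_size members are kept. Connected
    components are an assumed stand-in for community detection.
    """
    pair_counts = defaultdict(int)
    for caller, callee in calls:
        if caller != callee:
            pair_counts[frozenset((caller, callee))] += 1

    # Keep only edges between contacts who call each other repeatedly.
    graph = defaultdict(set)
    for pair, count in pair_counts.items():
        if count >= min_calls:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)

    seen = set()
    groups = []
    for node in list(graph):
        if node in seen:
            continue
        # Breadth-first search collects one connected component.
        component = set()
        queue = deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if min_size <= len(component) <= max_size:
            groups.append(component)
    return groups


# Made-up call records: A, B, C call each other repeatedly; D and E form
# only a pair; F calls A once and is filtered out.
sample_calls = [
    ("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "C"), ("C", "A"),
    ("D", "E"), ("E", "D"), ("D", "E"),
    ("F", "A"),
]
groups = small_tight_groups(sample_calls)
```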

Predicting whether context ads will earn a user's click

Avito.ru is the largest general classifieds website in Russia, selling a wide range of products. Avito connects buyers and sellers across Russia. Sellers are highly motivated to place ads on Avito, hoping to gain attention from the site's 70 million unique monthly visitors. There are three different types of ads available to sellers on Avito: regular, highlighted, and context. Context ads are seen as the best way to target users with goods and services. Currently, Avito uses general statistics on ad performance to drive the placement of context ads. The existing model ignores individual user behavior, making it difficult to predict which ad will be the most relevant for (and earn the most clicks from) each potential buyer. In this PoC, we improve on that model by predicting whether individual users will click a given context ad, creating a world where both buyers and sellers win.

Classifying products into the correct category

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We are selling millions of products worldwide every day, with several thousand products being added to our product line. A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range. For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. The winning models will be open sourced.

Predicting Sponsored web pages served by StumbleUpon

Online media companies rely more and more on paid advertising to keep their lights on and their content engines humming. "Native advertising" is a popular alternative to the unsightly banner ads and infuriating pop-ups of Internet Advertising. Native ads mimic the core content of the site they're advertising on, ideally avoiding any interruption of the user's experience. When native advertising is done right, users aren't desperately scanning an ad for a hidden "x". In fact, they don't even know they're viewing one. To pull this off, native ads need to be just as interesting, fun, and informative as the unpaid content on a site. This PoC is to identify the paid content disguised as just another internet gem you've stumbled upon. If media companies can better identify poorly designed native ads, they can keep them off your feed and out of your user experience.

Forecasting city bike share system usage

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this PoC, we combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
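
To make "combining historical usage patterns with weather data" concrete, here is a deliberately simple baseline sketch: it averages past rental counts for each (hour of day, weather condition) pair and falls back to the overall average for unseen combinations. The field names (`hour`, `weather`, `count`) and the sample values are assumptions for illustration; the actual PoC would likely use richer features and a stronger model.

```python
from collections import defaultdict
from statistics import mean


def fit_baseline(history):
    """Build a lookup of average rentals per (hour, weather) pair.

    `history` is a list of dicts with the assumed keys "hour", "weather",
    and "count". Returns the lookup table and the overall mean count.
    """
    buckets = defaultdict(list)
    for row in history:
        buckets[(row["hour"], row["weather"])].append(row["count"])
    table = {key: mean(values) for key, values in buckets.items()}
    overall = mean([row["count"] for row in history])
    return table, overall


def predict(model, hour, weather):
    """Forecast rentals; unseen (hour, weather) pairs use the overall mean."""
    table, overall = model
    return table.get((hour, weather), overall)


# Made-up hourly records for illustration only.
history = [
    {"hour": 8, "weather": "clear", "count": 120},
    {"hour": 8, "weather": "clear", "count": 100},
    {"hour": 8, "weather": "rain", "count": 40},
    {"hour": 14, "weather": "clear", "count": 60},
]
model = fit_baseline(history)
morning_clear = predict(model, 8, "clear")
afternoon_rain = predict(model, 14, "rain")
```

A baseline like this is mainly useful as a reference point against which a learned model can be compared.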

Modeling quoted price for industrial tube assemblies

Caterpillar sells an enormous variety of larger-than-life construction and mining equipment to companies across the globe. Each machine relies on a complex set of tubes (yes, tubes!) to keep the forklift lifting, the loader loading, and the bulldozer from dozing off. Like snowflakes, it's difficult to find two tubes in Caterpillar's diverse catalogue of machinery that are exactly alike. Tubes can vary across a number of dimensions, including base materials, number of bends, bend radius, bolt patterns, and end types. Currently, Caterpillar relies on a variety of suppliers to manufacture these tube assemblies, each having their own unique pricing model. This PoC predicts the price a supplier will quote for a given tube assembly, using detailed tube, component, and annual volume datasets.

Predict store sales using historical markdown data

One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line. The data set has historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. Data on selected holiday markdown events are included in the dataset. These markdowns are known to affect sales. In this PoC, we will predict which departments are affected and the extent of the impact.

Predicting the category of crimes in San Francisco

San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. In this PoC, we predict the category of crime that occurred, given time and location, based on nearly 12 years of crime reports from across all of San Francisco's neighborhoods. We also explore the dataset visually, for example with a Top Crimes Map.

Predicting “purchased policy” based on transaction history

As a customer shops for an insurance policy, he/she will receive a number of quotes with different coverage options before purchasing a plan. This PoC is meant to predict the purchased coverage options using a limited subset of the total interaction history. If the eventual purchase can be predicted sooner in the shopping window, the quoting process is shortened and the issuer is less likely to lose the customer's business. Information about the customer, the quoted policy, and the cost is available. Using a customer’s shopping history, can you predict what policy they will end up choosing?

Predict an employee's access needs, given his/her job role

When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role. This access may allow an employee to access resources through various applications or web portals. It is assumed that employees fulfilling the functions of a given role will access the same or similar resources. It is often the case that employees figure out the access they need as they encounter roadblocks during their daily work (e.g. not able to log into a reporting portal). A knowledgeable supervisor then takes time to manually grant the needed access in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money. There is a considerable amount of data regarding an employee’s role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access. The objective of this PoC is to build a model, learned using historical data, that will determine an employee's access needs, such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time.
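
The assumption stated above, that employees in the same role access the same or similar resources, suggests a simple baseline, sketched here: grant a resource to a role when more than a chosen share of current employees in that role already hold it. The record layout, the 50% threshold, and the function name are illustrative assumptions, not part of the PoC's actual model.

```python
from collections import defaultdict


def role_access_profile(records, threshold=0.5):
    """Derive a default set of resources for each role.

    `records` is an iterable of (employee_id, role, resource) tuples
    describing currently provisioned access (an assumed layout). A resource
    is included for a role when the share of that role's employees holding
    it exceeds `threshold`.
    """
    employees_by_role = defaultdict(set)
    holders = defaultdict(set)  # (role, resource) -> employees holding it
    for employee_id, role, resource in records:
        employees_by_role[role].add(employee_id)
        holders[(role, resource)].add(employee_id)

    profile = defaultdict(set)
    for (role, resource), employees in holders.items():
        share = len(employees) / len(employees_by_role[role])
        if share > threshold:
            profile[role].add(resource)
    return dict(profile)


# Made-up provisioning records for three analysts.
records = [
    ("e1", "analyst", "reporting"), ("e1", "analyst", "crm"),
    ("e2", "analyst", "reporting"),
    ("e3", "analyst", "reporting"), ("e3", "analyst", "payroll"),
]
profile = role_access_profile(records)
```

A learned model would go further by using additional employee attributes, but a majority rule of this kind gives a reference point for how many manual grants and revokes remain.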

Detecting Insults in Social Commentary

Social media is replete with comments by various people in conversation streams such as news commenting sites, magazine comments, message boards, blogs, text messages, etc. The objective of this PoC is to detect when a comment from a conversation would be considered insulting to another participant in the conversation. The idea is to create a generalized single-class classifier which could operate in a near real-time mode, scrubbing the filth of the internet away in one pass.

Forecasting daily solar energy with an ensemble of weather models

Renewable energy sources, such as solar and wind, offer many environmental advantages over fossil fuels for electricity generation, but the energy produced by them fluctuates with changing weather conditions. Electric utility companies need accurate forecasts of energy production in order to have the right balance of renewable and fossil fuels available. Errors in the forecast could lead to large expenses for the utility from excess fuel consumption or emergency purchases of electricity from neighboring utilities. Power forecasts typically are derived from numerical weather prediction models, but statistical and machine learning techniques are increasingly being used in conjunction with the numerical models to produce more accurate forecasts. The goal of this PoC is to discover which statistical and machine learning techniques provide the best short term predictions of solar energy production.

Constructing an optimal portfolio of loans

This PoC is to determine whether a loan will default, as well as the loss incurred if it does default. Unlike traditional finance-based approaches to this problem, where one distinguishes between good or bad counterparties in a binary way, we seek to anticipate and incorporate both the default and the severity of the losses that result. In doing so, we are building a bridge between traditional banking, where we are looking at reducing the consumption of economic capital, to an asset-management perspective, where we optimize on the risk to the financial investor.
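
The difference from a binary good/bad split can be shown with a short sketch that combines the two quantities named above, the likelihood of default and the loss incurred if default happens, into an expected loss per loan. The probabilities and loss amounts below are made-up inputs standing in for the outputs of two hypothetical models; the PoC's real models and portfolio optimization are not described in the text.

```python
def expected_loss(p_default, loss_if_default):
    """Expected loss = probability of default x loss incurred on default."""
    return p_default * loss_if_default


def rank_loans(loans):
    """Sort loans from lowest to highest expected loss.

    `loans` is a list of (loan_id, p_default, loss_if_default) tuples,
    where both numeric fields are assumed to come from upstream models.
    """
    return sorted(loans, key=lambda loan: expected_loss(loan[1], loan[2]))


# Made-up example: loan B rarely defaults but is costly when it does.
loans = [
    ("A", 0.10, 1000.0),
    ("B", 0.02, 8000.0),
    ("C", 0.50, 100.0),
]
ranked_ids = [loan_id for loan_id, _, _ in rank_loans(loans)]
```

In this toy data, loan C has the highest default probability yet the lowest expected loss, which is the kind of distinction a purely binary classification would miss.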

Predicting whether a listener will love a new song

EMI Insight performs extensive market research about their artists by interviewing thousands of people around the world. This research has produced the EMI One Million Interview Dataset, one of the largest music preference datasets in the world today, which connects data about people (who they are, where they live, how they engage with music in their daily lives) with their opinions about artists. This PoC focuses on one key subset of this data: understanding what it is about people and artists that predicts how much people are going to like a particular track. The goal of this PoC is to design an algorithm that combines users’ (a) demographics, (b) artist and track ratings, (c) answers to questions about their preferences for music, and (d) words that they use to describe artists in order to predict how much they like tracks they have just heard.