What Do Data Scientists Do with All That Data?

 


Why do we rarely build models?

In your imagination, an algorithm engineer reads a paper today, implements it tomorrow, and ships it the day after. The company's revenue goes up, and so do our salaries and levels. In reality, that is not what most engineers in these positions do day to day. A well-known practitioner (I forget the name) once said that algorithm engineers spend 70% of their time on data and less than 20% on models and tuning. You have probably heard some version of this, but you may not really understand why. Why is that? Why not spend more time on models? The reason is simple: it is not that we don't want to, it is that we can't.

Framework restrictions:

There are many reasons why the model cannot be changed at will. The most common is the limitation of the framework, and it exists in both large and small companies. For example, when I was at a large company, the in-house framework was so mature that I rarely wrote code to implement a model myself; most of the work was connecting nodes and setting parameters in a visual interface. Here lies the problem: in that setup, the models available in the visual interface are fixed, all developed by the platform team. They had developed a certain number of models, and those were the only ones we could use, unless we broke away from the whole pipeline, which was obviously not going to happen. So for a long time we could only choose from a limited set of models. It was not until the company later built a new framework that let us write our own neural network code and implement deep models that we could, as the saying goes, trade the shotgun for a cannon and get a comprehensive upgrade.

Small companies do not have the mature, hard-to-change frameworks of large companies, but they generally still have their own set of processes. Suppose the pipeline left by your predecessors was built around open-source xgboost, and you want to train a neural network with TensorFlow to replace it. Generally speaking, this would work and would bring an improvement. The problem is that you may need to rebuild the entire link between offline training and online serving. Many algorithm engineers have weak engineering skills and are reluctant to do that kind of refactoring, and the amount of work is not small. So it is easy to end up in a situation where everyone knows what the better approach is, but because the investment is relatively large, nobody is willing to do it and the work keeps getting postponed.

Results are difficult to guarantee:

The second reason is that the results of the models and techniques described in papers are often hard to reproduce. If you read papers carefully, you will find that their conclusions usually rest on many preconditions: a specific dataset or scenario, a powerful upstream recall and filtering system, or perfectly prepared features. Papers rarely spell all of this out; they only describe the method and the results. As a consequence, many methods that look good on paper may not work well in real applications.

This is not the papers bragging; it is that you do not have the same conditions. For example, Alibaba's data is extremely precise: every action the user takes from opening the app to closing it is recorded, including how long each product or module was displayed and even how fast the user flipped through pages. An average small company simply cannot collect data like that. And if you cannot collect that data, you do not have the precise features used in the paper, so how can you expect Alibaba's model to deliver the same results for you?

Priority issues:

We all know that tasks can be divided into four categories along the axes of urgency and importance: important and urgent, important but not urgent, urgent but not important, and neither urgent nor important. Many people also know that the key is to do the important-but-not-urgent things well. Everyone says this, but not everyone actually chooses that way. Under the pressure of KPI reviews, front-line engineers may only be able to focus on the urgent tasks, because they need to show results quickly to meet their performance targets. The fastest way to do that is almost never to upgrade or replace the model; it is to find a few more features to build, or try some tricks to see whether they lift the metrics. Upgrading the model takes a lot of effort and may not pay off, whereas making a feature is cheap: if one does not work, you make another, and the iteration is fast.

In fact, this is not entirely the engineers' short-sightedness; it is also the result of the overall workplace atmosphere. When everyone optimizes for short-term performance, everyone settles into a local optimum while drifting further and further away from the global optimum. Avoiding this requires an architect or a leader with long-term vision and overall planning: someone who can absorb the risk of upgrading the model, who has thorough plans for what may happen and what needs to be done, and who has enough experience to handle whatever comes up. But as everyone knows, leaders with that ability are rare in the workplace. They are rare in large companies, and even rarer in small ones.

What do they do with all the data?

Having talked about models, let's talk about data. Since the model cannot be changed often, engineers can only spend more effort on data. So what data work are they doing that takes up so much time?

Training data:

Large companies have complete pipelines: once the process is designed, training data, test data, model training, and deployment run as a one-stop assembly line. In small and medium-sized companies this is often not the case. Raw data cannot be fed into the model directly; it has to go through a complicated process. First comes sampling. Take CTR estimation as an example: in a real scenario the click-through rate rarely exceeds 10%, but the ratio of positive to negative samples used for training is typically around 1:3, so we need to downsample the negatives, as the sketch below illustrates.
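Here is a minimal sketch of what that downsampling might look like, assuming the raw impression logs sit in a pandas DataFrame with a 0/1 `clicked` label column; the column name and the 1:3 target ratio are illustrative, not a fixed recipe.

```python
import pandas as pd

def downsample_negatives(df: pd.DataFrame, label_col: str = "clicked",
                         neg_per_pos: int = 3, seed: int = 42) -> pd.DataFrame:
    """Keep every positive sample and randomly keep roughly
    `neg_per_pos` negatives for each positive."""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0]

    n_neg_keep = min(len(negatives), neg_per_pos * len(positives))
    sampled_negatives = negatives.sample(n=n_neg_keep, random_state=seed)

    # Shuffle so positives and negatives are interleaved for training.
    return (pd.concat([positives, sampled_negatives])
              .sample(frac=1.0, random_state=seed)
              .reset_index(drop=True))
```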

You cannot simply sample the raw records as they are, because they may contain a lot of dirty or invalid data. We have to filter out the problematic records before sampling so that the data is clean. After sampling, we need to look up and fill in features and fields, because the data is usually stored separately: the user's basic information in one table, the user's behavior in another, the product information in a third, and so on. Once we have the samples, we still have to pull together a lot of data to collect every field we need. Only then can we produce the actual training samples, that is, turn the raw data we gathered into the input features of the model. Each feature may have its own generation logic, which is itself a sizeable project. And even that is not the end: the data still has to be converted into the format the model requires, such as TensorFlow tensors, JSON, and the like. A rough sketch of the joining and conversion steps follows.
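This sketch assumes the sampled rows, the user table, and the item table are pandas DataFrames keyed by `user_id` and `item_id`; all the table and column names here are hypothetical stand-ins for whatever the warehouse actually holds.

```python
import pandas as pd

def assemble_samples(samples: pd.DataFrame,
                     users: pd.DataFrame,
                     items: pd.DataFrame) -> pd.DataFrame:
    """Join the sampled (user_id, item_id, clicked) rows with the
    separately stored user and item tables to collect every field
    the model needs."""
    return (samples
            .merge(users, on="user_id", how="left")
            .merge(items, on="item_id", how="left"))

def to_model_format(enriched: pd.DataFrame, feature_cols, label_col="clicked"):
    """Convert the joined rows into the (features, labels) arrays a
    training framework expects; writing one JSON line per sample for
    an online service would follow the same pattern."""
    features = enriched[feature_cols].to_numpy(dtype="float32")
    labels = enriched[label_col].to_numpy(dtype="float32")
    return features, labels
```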

For this whole series of steps, large companies generally have automatic scheduling in place, and engineers just use it without worrying about the details. In small and medium-sized companies there may only be a handful of manual tools, and whenever you need data you have to run tasks or scripts by hand. Jobs fail and all kinds of problems crop up along the way. The work is mundane and does not look impressive, but it consumes a lot of effort.

New features:

How are features actually made? In a competition like Kaggle, it might be a couple of pandas functions or a few lines of processing logic. In practice it is nowhere near that simple. Take a simple min-max normalization of a feature such as user age as an example: after normalization the value is scaled into the range 0-1, but that requires two parameters, the maximum and the minimum. Where do they come from? You might think this is easy: just traverse the data once and you have them. The problem is that you do not use this data only once. You have to generate this feature every time you generate training data. Are you going to traverse the data by hand to find the maximum and minimum every single run? And the data keeps changing: the maximum and minimum user age may differ from day to day. What if you want to produce training data for several days at once? Designing a new feature is simple, but the parameters inside it make things more complicated, so we often need to design a fairly involved mechanism to plug a newly finished feature into the pipeline, along the lines of the sketch below.
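A minimal sketch of one such mechanism, assuming the scaling parameters are fitted once on a reference window and persisted to a small JSON file that every later run of the feature pipeline reloads; the column name, file path, and clipping behavior are illustrative choices.

```python
import json
import pandas as pd

def fit_min_max(df: pd.DataFrame, col: str, path: str) -> None:
    """Compute the scaling parameters once and persist them so later
    runs reuse the same values instead of re-traversing the data."""
    params = {"min": float(df[col].min()), "max": float(df[col].max())}
    with open(path, "w") as f:
        json.dump(params, f)

def apply_min_max(df: pd.DataFrame, col: str, path: str) -> pd.Series:
    """Scale the column to [0, 1] with the stored parameters; values
    outside the original range (e.g. from a newer day) are clipped."""
    with open(path) as f:
        params = json.load(f)
    span = max(params["max"] - params["min"], 1e-9)  # guard against a constant column
    return ((df[col] - params["min"]) / span).clip(0.0, 1.0)
```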

Effectiveness analysis:

Effect analysis is another major source of data work, and it comes in two kinds. The first is producing indicators and analyses that did not exist before, or business analyses requested by the boss: the most basic metrics such as CTR, CVR, and revenue, plus whatever the boss wants to see on short notice, for example the distribution of a certain feature, or the number of samples and the state of the data within a particular user group. A minimal version of that kind of reporting is sketched after this paragraph. The second kind is the effect analysis after our model ships. If the model performs well, fine. If it does not, the questions start: how do we figure out what went wrong? Is the model itself not powerful enough? Are our features insufficient, or is something wrong with them? Is our data quality low? Or is there simply a bug?
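A small sketch of the first kind of reporting, assuming per-impression logs in a DataFrame with a `date` column, 0/1 `click` and `conversion` flags, and a `revenue` column; these names are assumptions for illustration only.

```python
import pandas as pd

def daily_funnel_metrics(logs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate impression-level logs into the basic daily indicators:
    impressions, clicks, conversions, revenue, CTR, and CVR."""
    grouped = logs.groupby("date").agg(
        impressions=("click", "size"),      # one row per impression
        clicks=("click", "sum"),
        conversions=("conversion", "sum"),
        revenue=("revenue", "sum"),
    )
    grouped["ctr"] = grouped["clicks"] / grouped["impressions"]
    # Avoid division by zero on days with no clicks (yields NaN instead).
    grouped["cvr"] = grouped["conversions"] / grouped["clicks"].where(grouped["clicks"] > 0)
    return grouped.reset_index()
```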

Algorithm work is not like general engineering work. In engineering, most things are deterministic: a wrong result means there is a bug in the logic, and with careful testing and analysis you can almost always find and fix it; problems that cannot be reproduced or traced to a cause are rare. Algorithms are different. In most cases there is no absolute right or wrong, and often no single definitive cause. Our role is closer to a detective's: we guess at the cause of a problem from a few clues and then try to confirm it with experiments. That process involves a great deal of data processing and analysis.

For example, if you suspect that the distribution of certain features is causing the model to perform poorly, you need to analyze those distributions, for instance with a comparison like the one sketched below. If you suspect a bug in the data, you need to design a plan, filter the data, pinpoint the problems, and verify your hypothesis. If you think the amount of training data is insufficient, you need to add more data and design comparative experiments... In short, troubleshooting requires a lot of data analysis; you cannot reach a conclusion just by reading the code and thinking it over.
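One simple way to check for a distribution shift, assuming the feature values from a baseline period (say, last week's training data) and from the suspect period are available as pandas Series; the bin count and names are illustrative, and a large gap in any bucket is only a lead to investigate, not proof of the cause.

```python
import numpy as np
import pandas as pd

def compare_feature_distribution(baseline: pd.Series, suspect: pd.Series,
                                 bins: int = 10) -> pd.DataFrame:
    """Bucket both samples with the same bin edges (taken from baseline
    quantiles) and compare the share of rows in each bucket."""
    edges = np.unique(np.quantile(baseline.dropna(), np.linspace(0, 1, bins + 1)))
    base_share = (pd.cut(baseline, edges, include_lowest=True)
                    .value_counts(normalize=True).sort_index())
    susp_share = (pd.cut(suspect, edges, include_lowest=True)
                    .value_counts(normalize=True).sort_index())
    return pd.DataFrame({"baseline": base_share,
                         "suspect": susp_share,
                         "abs_diff": (susp_share - base_share).abs()})
```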
