Machine learning and data science are broad terms covering many fields and bodies of knowledge. What one data scientist does may differ greatly from what another does, and the same goes for machine learning engineers. Generally, though, the past (data) is used to understand or predict (by building models) the future.
To put what follows in context, I should explain my role. I worked on a small machine learning consulting team. We did everything from collecting and cleaning data to building models and deploying services, across just about every industry you can think of. Because the team was small, everyone wore many hats.
A machine learning engineer's day:
At 9 a.m. I walk into the office, say hello to my colleagues, put my food in the refrigerator, pour a cup of coffee, and head to my desk. I sit down, look over my notes from the day before, and open Slack to read unread messages and the papers or blog posts the team has shared. The field moves fast, so I need to keep up with the cutting edge.
After catching up on messages, I usually spend a little time browsing papers and blog posts, studying the hard-to-understand parts carefully. Some of them turn out to be very helpful for the work I am doing. Reading generally takes me about an hour or more, depending on the article. Some friends ask me why I spend so long on it.
In my view, reading is the ultimate meta-skill. If a better way exists to accomplish what I am currently doing, learning it immediately saves me time and energy later. There are exceptions: if a project deadline is approaching, I cut back on reading to push the project forward.
After I finish reading, I check the previous day's work in my notebook to see where to pick up. Why can I jump straight back in? Because my notebook is a running diary of the project.
For example: “The data is in the right format; now train the model on it.” If I hit difficulties, I write something like: “There is a data mismatch. Next, fix the mismatch and get a baseline before trying a new model.”
At about 4 p.m. I tidy up my code: clarifying messy sections, adding comments, consolidating. Why do this? I often ask myself: what if someone else can't understand this? If I had to read this code, what would I need most? With that mindset, I find it well worth spending some time cleaning up the code. By about 5 p.m., my code should be pushed to GitHub.
That's an ideal day, but not every day goes like that. Sometimes a great idea strikes at 4 p.m., you chase it, and you end up staying up all night.
Now that you have a general idea of a machine learning engineer's day-to-day life, let me share the lessons I have taken from it.
1. It's all about the data, morning to night
Too often, machine learning engineers focus on building better models rather than improving the data the models are built from. Throwing enough compute at a model can produce exciting short-term results, but that is not always the goal we want.
When you first start on a project, spend plenty of time getting familiar with the data. In the long run, that familiarity will save you far more time later.
That doesn't mean you should skip the details. For any new dataset, your goal should be to become an "expert" on it: check the distributions, find the different types of features, spot the outliers and figure out why they are outliers, and so on. If you can't tell the story of your data, how can you expect your model to handle it well?
(Figure: an example exploratory data analysis lifecycle, i.e. the steps performed each time a new dataset is encountered.)
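As a sketch of that first pass over a new dataset, here is what checking distributions, feature types, missing values, and outliers might look like with pandas. The dataset and the 1.5 × IQR outlier rule are illustrative choices, not a prescription:

```python
import pandas as pd

# A hypothetical toy dataset standing in for whatever new data arrives.
df = pd.DataFrame({
    "age": [23, 35, 31, 120, 28],                      # 120 looks suspicious
    "income": [40_000, 55_000, None, 48_000, 52_000],  # one missing value
    "segment": ["a", "b", "a", "b", "a"],              # categorical feature
})

# Separate numeric features from categorical ones.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Look at distributions and count missing values per column.
summary = df[numeric_cols].describe()
missing = df.isna().sum()

# Flag outliers with the 1.5 * IQR rule (one common choice among many).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df.loc[
    (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr), "age"
]
```

Each flagged value then deserves a question: is 120 a data-entry error, a unit mix-up, or a real (if rare) case? That question is the "story of the data" part, and no library answers it for you.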
2. Communication is harder than the technical problems
Most of the obstacles I have hit were not technical but communication-related. There are technical challenges too, of course, but solving those is what we engineers are here for.
Never underestimate the importance of internal and external communication. Nothing is worse than making the wrong call and pouring effort into solving the wrong technical challenge. How does that happen?
Externally, it happens because what the customer wants does not match what we can provide. Internally, because everyone wears multiple hats, it is hard to ensure each person can concentrate on one thing.
How do we solve these problems?
For external problems, the only answer is constant communication with customers. Does your client know what services you can provide? Do you understand the client's needs? Do they understand what machine learning can and cannot do? How can you communicate your ideas more effectively?
For internal problems, you can judge how hard internal communication is by the number of tools built to address it: Asana, Jira, Trello, Slack, Basecamp, Monday, Microsoft Teams. One of the most effective approaches I have found is simply posting an update in the relevant project channel at the end of the day.
Is it perfect? No, but it seems to work. It gives me a chance to reflect on what I have done, to tell everyone what support I need next, and even to get advice.
No matter how good an engineer you are, your ability to keep existing business and win new business depends on your communication skills.
3. Stability > state of the art
Here is a natural-language problem we faced: classifying text into categories. The goal was for users to send a piece of text to a service and have it automatically sorted into one of two categories; if the model is not confident in its prediction, the text is passed to a human classifier. The load was about 1,000-3,000 requests per day.
BERT has been all the rage over the past year. But without something like Google-scale compute, using BERT to solve our problem would still have been quite involved, since a lot would have needed reworking before it could go into production. Instead, we used another method, ULMFiT, which is not state of the art but still delivers satisfactory results and is much easier to work with.
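The confidence-threshold routing described above can be sketched in a few lines. This is a minimal illustration, not our production code: the `predict` function and the 0.85 threshold are made-up stand-ins for the real (ULMFiT-based) classifier and its tuned cutoff.

```python
# If the model is confident, it answers; otherwise the text is
# escalated to a human classifier, as described in the text.
CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff, tuned in practice

def predict(text):
    """Placeholder classifier: returns (label, confidence).

    A real model would return a label plus a softmax probability here.
    """
    if "refund" in text:
        return ("category_a", 0.92)
    return ("category_b", 0.55)

def route(text):
    """Route a request to the model or to a human reviewer."""
    label, confidence = predict(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"label": label, "handled_by": "model"}
    # Low confidence: don't guess, hand off to a person.
    return {"label": None, "handled_by": "human"}
```

At 1,000-3,000 requests per day, even a modest fraction of confident predictions removes a large chunk of manual work while keeping the hard cases with humans.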
4. The two most common pitfalls for machine learning beginners
There are two big gaps in applying machine learning in production: the gap between coursework and project work, and the gap between a model in a notebook and a model in production (model deployment).
I put together my own AI master's degree out of online machine learning courses. But even after completing many of the best courses, when I started working as a machine learning engineer I found that my skills were built on the structured backbone those courses provided, and real projects are nowhere near as well organized.
I lacked a lot of specific knowledge that courses cannot teach, such as how to question data, what to explore, and what to use.
How did I remedy this gap? I was lucky enough to work alongside some of Australia's best talent, and I was willing to learn and willing to make mistakes. Mistakes are not the goal, of course, but to get things right you first have to figure out what is wrong.
If you are learning machine learning through courses, keep taking them, but supplement them with your own projects that apply what you are learning. That is how you make up for what courses lack.
What about deployment? I am still not great at it. Fortunately, I have noticed a trend: machine learning engineering and software engineering are merging. With services like Seldon, Kubeflow and Kubernetes, machine learning will soon become just another part of the stack. Building a model in a Jupyter notebook is easy, but how do you get thousands or even millions of people to use it? That is what machine learning engineers should be thinking about, and it is a prerequisite for machine learning to create value. Judging by recent discussions at Cloud Native events, though, few people outside the big companies know how to do it.
5. 20% time
20% time means spending 20% of our time on learning. Learning here is a loose term: anything related to machine learning counts. And as a machine learning engineer, understanding the business can greatly improve how effective your work is.
If your business's advantage today is that you are doing the best work you can, then its future depends on you continuing to do your best work, which means continuing to learn.
6. One in ten papers is worth reading; even fewer get used
That is a rough figure, but once you start exploring datasets and models you will find it holds broadly. In other words, out of the thousands of submissions each year, you may get 10 genuinely groundbreaking papers, and 5 of those 10 may come from the same institute or individual.
You cannot keep up with every new breakthrough, but you can apply the ones that matter on top of a solid foundation of fundamental principles that have stood the test of time.
Which brings us to the question of exploration versus exploitation.
7. Be your own biggest skeptic
The exploration/exploitation dilemma is the tension between trying new things and sticking with what already works. You can handle it by being your own biggest skeptic: keep asking yourself what the benefit of a new alternative really is over the old approach.
It is easy to run the models you have used before, get a high accuracy number, and report it to the team as a new benchmark. But when you get a good result, remember to check your work, and ask your team to check it again. You are an engineer; that awareness should be second nature.
Spending 20% of your time exploring is a good rule, but 70/20/10 may be better: 70% of your time on the core product, 20% on extensions of the core product, and 10% on moonshots, the things that may not pay off immediately. I am ashamed to say I have never put this into practice in my role, but it is the direction I am heading.
8. "Toy problems" are very useful
Toy problems help you get a handle on a larger, complex problem. First, create a simpler version: a small slice of your data, or even an unrelated dataset. Find a solution to that, then extend it to the full dataset. On a small team, the trick is to abstract the problem away from its surroundings and figure out how to solve it in isolation.
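The workflow above can be sketched concretely. Everything here is illustrative: the "problem" is a trivial cleaning step standing in for whatever the real task is, and the point is only the shape of the process, i.e. prototype on a slice, then scale up unchanged.

```python
# Illustrative toy-problem workflow; the cleaning function is a made-up
# stand-in for the real, harder problem.

def clean(record):
    """The 'solution' being prototyped: normalize one record."""
    return record.strip().lower()

full_dataset = ["  Apple ", "BANANA", " cherry ", "  Date ", "ELDERBERRY"]

# Step 1: get the logic right on a tiny slice of the data.
toy = full_dataset[:2]
toy_result = [clean(r) for r in toy]

# Step 2: once the toy version works, run the same code on everything.
cleaned = [clean(r) for r in full_dataset]
```

The fast feedback loop on the toy slice is the whole point: you find the bugs where a failed run costs seconds, not hours.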
9. Rubber ducking
When you are stuck, sitting and staring at the code might solve the problem, or it might not. Talking it through with a colleague, treating them as your rubber duck, often resolves it easily.
“Ron, I’m looping over this array, and inside that loop I’m looping over another array to track state, and then I want to combine the states into a list of tuples.”
“A loop inside a loop? Why don’t you vectorize it?”
“Can I do that?”
“Let’s find out.”
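The vectorization Ron is suggesting can be sketched with NumPy. The arrays here are made up, but the pattern of replacing a loop-in-a-loop with broadcasting is the general one:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

# Loop-in-a-loop version: build (x, y, x + y) tuples pair by pair.
pairs_loop = []
for x in a:
    for y in b:
        pairs_loop.append((int(x), int(y), int(x + y)))

# Vectorized version: broadcasting computes every pairwise sum at once.
sums = a[:, None] + b[None, :]              # shape (3, 2)
xs, ys = np.meshgrid(a, b, indexing="ij")   # all (x, y) combinations
pairs_vec = [
    tuple(map(int, t)) for t in zip(xs.ravel(), ys.ravel(), sums.ravel())
]
```

On toy arrays the difference is invisible, but on large arrays the broadcast version runs in optimized C instead of the Python interpreter, which is where the speedup comes from.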
10. The number of models built from scratch is shrinking
This follows from the merging of machine learning engineering and software engineering.
Unless your data problem is very specific, most problems are quite similar: classification, regression, time-series prediction, recommendation.
AutoML and similar services are bringing world-class machine learning to anyone who can upload a dataset and select a target variable. For developers, libraries like fast.ai provide state-of-the-art models in a few lines of code, and model zoos (collections of pre-built models) such as PyTorch Hub and TensorFlow Hub offer the same kind of functionality.
This means we no longer all need to understand the deepest internals of data science and machine learning; knowing the fundamentals is enough, and we should care more about applying them to real problems to create value.
11. Mathematics or code?
For the customer problems I deal with, we are code-first, and all the machine learning and data science code is Python. Sometimes I dip into the math by reading papers and reproducing them, but the existing frameworks encapsulate most of it. That is not to say math is unnecessary; after all, machine learning and deep learning are forms of applied mathematics.
A grasp of basic matrix operations, some linear algebra, and calculus (especially the chain rule) is enough to get started as a machine learning practitioner.
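As a tiny illustration of why the chain rule matters, here is the gradient of a composed function computed both analytically (via the chain rule) and numerically. The functions are made up for the example, but checking an analytic gradient against finite differences is the same mechanics that autodiff frameworks automate at scale.

```python
# Gradient of f(g(x)) via the chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x).

def g(x):
    return 3 * x + 1          # g'(x) = 3

def f(u):
    return u ** 2             # f'(u) = 2u

def analytic_grad(x):
    # Chain rule: f'(g(x)) * g'(x) = 2 * (3x + 1) * 3
    return 2 * g(x) * 3

def numeric_grad(x, eps=1e-6):
    # Central finite difference: no calculus, just two function evaluations.
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
```

At `x = 2.0` the chain rule gives `2 * 7 * 3 = 42.0`, and the numeric estimate agrees to within floating-point noise, which is exactly the sanity check you want when implementing gradients by hand.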
Keep in mind that most of the time, most practitioners' goal is not to invent a new machine learning algorithm but to show customers whether machine learning could help their business.
12. What you did last year may not work next year.
This is a general trend, as the merging of software engineering and machine learning engineering becomes more and more apparent.
But that is part of why you are in the industry. Frameworks will change, and the practical libraries will change, but the underlying statistics, probability, and mathematics will remain the same. The biggest challenge is still how to apply them to create value.
There are surely many more pitfalls to explore as a machine learning engineer grows. If you are a novice, mastering these 12 is enough to start with!
Author of this article: The tiger says eight things
This article is original content from the Yunqi Community and may not be reproduced without permission.