What are the Python libraries for machine learning?
A library is a set of functions and routines that can easily be reused. Python is an open source programming language with a great many libraries, and as a beginner in machine learning it can be hard to find your way around. That's why this article presents our selection of useful libraries for becoming a machine learning pro.
NumPy's name means Numerical Python. It is an essential library in Python. Its specificity? It lets you do numerical calculation and manage arrays and matrices of data. It has most of the usual functions such as exponential, logarithm or arctan. Moreover, it is optimized for calculations: operations are vectorized and run in compiled code, and some linear algebra routines can use several processor cores through the underlying BLAS library. One limitation: when you load tabular data, from a CSV file for example, plain NumPy arrays do not let you label the rows and columns. NumPy also provides tools for integrating code written in C, C++ or Fortran.
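To make this concrete, here is a minimal sketch of the features just mentioned: usual functions applied to a whole array, a matrix product and a reduction along an axis. The values are invented for illustration.

```python
import numpy as np

# A small 2x3 matrix of made-up values (illustrative data only).
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# The usual functions apply element-wise to the whole array, without a Python loop.
exponentials = np.exp(matrix)
logarithms = np.log(matrix)
arctangents = np.arctan(matrix)

# Matrix operations: product of the matrix with its transpose, shape (2, 2).
gram = matrix @ matrix.T

# Reduction along an axis: the mean of each column.
column_means = matrix.mean(axis=0)

print(gram)
print(column_means)
```

Note that the array holds only numbers: nothing in it says what each column represents, which is the labeling limitation described above.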
A complement to NumPy is SciPy, a library built on NumPy that adds scientific routines such as optimization, statistics and signal processing.
The Pandas library is based on NumPy and makes it easy to manipulate structured data:
- Creating new columns
- Managing missing data
- Filtering data
- Aggregating information by column
- Calculating metrics such as the mean, median or sum.
It is based on two types of objects: Series, which behave much like labeled lists, and DataFrames, which are multi-column tables. Older versions also offered objects called Panels for data in 3 or 4 dimensions, but these have been removed from recent versions of Pandas; higher-dimensional data is now usually handled with a multi-level index.
Moreover, it makes it easy to read data from different sources: CSV, SQL or even plain text. In short, it is the essential tool for manipulating data in Python. It also helps you get a better overview of your data.
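The operations listed above can be sketched in a few lines. This is only an illustration: the table, its column names and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; the column names and values are invented.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Lyon", "Nice"],
    "units": [10, 4, np.nan, 6, 3],
    "unit_price": [2.0, 3.0, 2.0, 3.0, 5.0],
})

# Manage missing data: replace the missing unit count with 0.
df["units"] = df["units"].fillna(0)

# Create a new column from existing ones.
df["revenue"] = df["units"] * df["unit_price"]

# Filter rows with a boolean condition.
large_orders = df[df["revenue"] >= 15]

# Aggregate by city and calculate several metrics per group.
summary = df.groupby("city")["revenue"].agg(["mean", "median", "sum"])

print(large_orders)
print(summary)
```

Here `df` is a DataFrame and each of its columns, such as `df["revenue"]`, is a Series. In practice the table would usually come from a file, for example through `pd.read_csv`.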
When doing machine learning, and data work more generally, you need to know how to represent the data visually. Visualization gives a first level of information and helps make a first pass at separating useful data from the rest. Matplotlib is the standard Python library for this.
Those who have already used Matlab will find the visualization capabilities of Matplotlib very similar. However, because Python, and therefore Matplotlib, is open source, many data scientists turn to Python.
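As a minimal sketch of what a Matplotlib figure looks like in code, the example below draws two curves on the same axes. The output file name `curves.png` is an arbitrary choice, and the `Agg` backend is selected only so that the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed
import matplotlib.pyplot as plt
import numpy as np

# 100 evenly spaced points between 0 and 2*pi.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Two curves on one set of axes")
ax.legend()

# Write the figure to disk (hypothetical file name).
fig.savefig("curves.png")
```

In an interactive session or a notebook you would typically drop the backend line and call `plt.show()` instead of saving the figure.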
Another data visualization tool is Seaborn, which is built on top of Matplotlib, works well with Pandas and is often used as a complement to Matplotlib. It is popular because its default visualizations are more aesthetic. Another advantage: it makes multi-plot figures, such as grids of charts split by category, much easier to build than with Matplotlib alone.
Bokeh is another visualization library whose name comes up regularly.
Machine learning and Deep Learning
Now let's get into the libraries that let you build machine learning models. For all the most classical models, scikit-learn is the package you need to know. During your learning phase you will almost certainly come across it!
It offers many models with classic parameters, but also many variants that open up a multitude of possibilities! Moreover, this library has many metrics for evaluating the quality of your models. It lets you manage the creation of a machine learning algorithm from start to finish:
- Formatting the data
- Splitting datasets into training and test sets
- Preparing datasets
- Training models
- Testing models
- Evaluating models
For your first machine learning models, this is THE library to use. For basic models for prediction, classification or clustering, there is no better library. It can also be used to create simple neural network models, although it is not the best-known library for that. It will often be suitable for more complex models too. Moreover, scikit-learn is built on NumPy and SciPy and therefore works directly on data held in their arrays.
Among the existing models, you will find:
- K-nearest neighbors
- Linear regression with many variants
- Logistic regression
- Decision trees and random forests
- SVM (support vector machine)
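The start-to-finish workflow described above can be sketched with two of these models. This is a minimal example rather than a recommended setup: it uses the small iris dataset bundled with scikit-learn, and the split ratio, random seed and hyperparameters are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Split the dataset: 75% for training, 25% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two classic models with arbitrary, illustrative hyperparameters.
models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # train on the training set
    predictions = model.predict(X_test)    # test on unseen data
    scores[name] = accuracy_score(y_test, predictions)  # evaluate
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

Every scikit-learn model exposes the same `fit` and `predict` methods, which is why swapping one model for another only changes a single line.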
If you want to implement Deep Learning models, a library like Keras is very relevant since it was designed for this purpose. Developed by a Google engineer, François Chollet, this library makes it very easy to create neural networks. We advise you to implement your first neural networks yourself in order to learn and understand them; once this learning phase is over, a library like Keras will let you develop them more efficiently instead of redoing everything. It was designed to help developers work faster.
Keras can run on CPUs (Central Processing Units, commonly known as processors) and on GPUs (Graphics Processing Units), which accelerate calculations.
Moreover, you can reuse models already built by other people and adapt them to the subjects you want to work on. Another point: Keras is easy and intuitive to use, so we advise you to start with it for your first models that are not guided by tutorials or other exercises.
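To give an idea of how little code a Keras network needs, here is a minimal sketch. It assumes Keras is installed through the TensorFlow package, and the dataset is synthetic: random points labeled by whether their features sum to a positive number. The layer sizes and number of epochs are arbitrary.

```python
import numpy as np
from tensorflow import keras

# Synthetic data: 200 samples with 4 features, label 1 when the features sum to > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("int32")

# A small fully connected network: one hidden layer, one sigmoid output.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Choose the optimizer, the loss to minimize and the metric to report.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for a few epochs, silently.
history = model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# Predict probabilities for the first three samples.
probabilities = model.predict(X[:3], verbose=0)
print(probabilities.shape)
```

The same script runs on a CPU or a GPU without changes: the hardware is picked up by the underlying backend.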
TensorFlow is an alternative to Keras, also created at Google, this time by the Google Brain team. It lets you work on neural networks and is very well integrated with the entire Google Cloud ecosystem. Variants of the library also make it possible to run calculations directly in mobile applications or in a web browser. It is lower-level than Keras and therefore harder to pick up, but this is compensated by its great flexibility.
Like Keras, TensorFlow runs on CPUs and GPUs but also, and this is where it performs best, on TPUs (Tensor Processing Units), which Google developed for this specific type of task.
Moreover, the community around this tool is very active, and it is becoming increasingly easy to put such models into production and run them at scale.
Here is a second alternative to Keras: PyTorch. This library also lets you run machine learning models at scale. It is available on the major clouds: AWS (Amazon), Google Cloud, Azure (Microsoft) and Alibaba Cloud.
PyTorch is also well suited to textual data and to what is called NLP (natural language processing).
Two final alternatives for creating deep learning models are Theano and Microsoft Cognitive Toolkit, although neither is under active development any more.
Related use cases
To manipulate networks, also called graphs, NetworkX is the right Python library. It lets you manipulate complex networks, run calculations on them and draw them. Do you want to represent the French railway network as a graph, for example? Then NetworkX is the right Python library for the job.
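As a minimal sketch of the railway example, the graph below links a handful of French cities. The edges and the distances in kilometers are rough, invented figures, not official railway data.

```python
import networkx as nx

# A toy railway network: nodes are cities, edges are direct links.
# Distances are rough illustrative values, not official figures.
G = nx.Graph()
G.add_weighted_edges_from(
    [
        ("Lille", "Paris", 220),
        ("Paris", "Lyon", 460),
        ("Lyon", "Marseille", 310),
        ("Paris", "Bordeaux", 580),
        ("Bordeaux", "Marseille", 650),
    ],
    weight="km",
)

# Calculation on the network: shortest route by total distance.
route = nx.shortest_path(G, "Lille", "Marseille", weight="km")
distance = nx.shortest_path_length(G, "Lille", "Marseille", weight="km")

# Number of direct connections per city.
degrees = dict(G.degree())

print(route, distance)
print(degrees)
```

To display the network, `nx.draw(G, with_labels=True)` renders it through Matplotlib.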
Working on textual data? This is what NLTK (Natural Language Toolkit) lets you do! This library specializes in text mining and lets you extract and manipulate text in Python. If you are building a semantic application, NLTK will surely be useful at some point.
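Here is a minimal sketch of a typical first step in text mining with NLTK: splitting a text into words, reducing each word to its stem and counting the most frequent stems. The sentence is invented, and the example deliberately sticks to tools that work without downloading any extra NLTK corpus.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# An invented sample text.
text = "Machine learning models learn from data. Better data gives better models."

# Tokenize, lowercase, and keep only alphabetic tokens (drops punctuation).
tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

# Reduce each word to its stem, so "learning" and "learn" are counted together.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Count how often each stem appears.
frequencies = FreqDist(stems)
print(frequencies.most_common(3))
```

More advanced features, such as sentence splitting or stop word lists, rely on resources fetched separately with `nltk.download`.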
OpenCV stands for Open Source Computer Vision. It was created for computer vision applications and to speed up their deployment. It accepts many kinds of visual input, such as images and videos. For example, it can be used as part of an OCR (optical character recognition) pipeline to recognize characters in an image, such as a scanned page from a PDF. In Python, OpenCV represents images as NumPy arrays, so they are easy to reprocess with the rest of the ecosystem. Moreover, this library integrates well with Matplotlib.
We have presented our list of useful Python libraries for machine learning. This list is not exhaustive. Libraries for data collection such as Requests, Scrapy or Beautiful Soup could have had their place in this article, for example, but we preferred to focus on the core tools needed for machine learning:
- Read data
- Process data
- Visualize data
- Create mathematical models based on data