A Brief Summary for Popular Analytics Tools
· Excel
· SQL
· Tableau & Power BI
· R
· Python
· SAS
In recent years, the industry’s barrier to accessing big data dropped significantly. More and more professionals from marketing, product, human resource, health care, etc., rely on the fast-growing data to extract insights. However, so many analytics tools are available, including Excel, SQL, Tableau, Power BI, Python, R, SAS, and so on. What are the differences, then? I will briefly summarize the commonly-used tools in this post and share my thoughts.
Excel
Excel, or Microsoft Excel, is spreadsheet software. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications (VBA). It has been a very widely applied spreadsheet since the 1990s.
Pros
- Convenient for quick data exploration, summary, and visualization.
- A traditional and widely used analytics tool.
Cons
- Limited data size: Excel is suitable for small data set. It cannot wholly open a table with more than 1,048,576 rows. Also, computing is time-consuming when the data size is large.
- Inconvenient collaboration: it is hard to track the revisions made by different users.
Excel is a must-have analytics tool. All the analytics professionals should know it to some extent, even if they mostly use other tools because 1) some data may be in Excel format, and 2) collaborators may use this tool.
Usually, Excel is used by operating data cells and menu/toolbar. For example, we can select the data cells and click the menu Insert > Chart > Histogram to plot a histogram.
SQL
SQL, Structured Query Language, was first developed in the 1970s by IBM researchers to communicate with databases. As most industry data is stored in the database, SQL is critical for analytics professionals to access data.
Pros
- Easy to learn. It is more like a concise natural language to describe “what we want” instead of “how to do” as many other programming languages.
- Large amount of data can be retrieved quickly and efficiently.
- Good for data preparation and descriptive analytics.
Cons
- Inconvenient for advanced analytics such as prediction.
SQL is the most convenient tool for analytics professionals to fetch big-size data. It is also often used for quick data exploratory analysis (EDA), such as checking missing values, duplicate data points, and applying data aggregation.
The syntax of SQL is intuitive. To get a histogram, we can define the bin and count the data points by group using code like
select
-- need to tune the arbitrary 10 to get 5 to 20 bins
floor(my_variable/10)*10 as bin
count(*)
from my_data
group by 1
order by 1
Tableau & Power BI
Power BI and Tableau are two important Business Intelligence (BI) tools for collecting, integrating, analyzing, and presenting business information. Both of them are popular data visualization platforms in the industry. Analytics professionals often build dashboards for KPI (Key Performance Indicator) using them. Tableau was founded in 2003, and Power BI was initially designed in 2010.
Pros
- Easy to use. We can click the menu/toolbar and drag and drop components by mouse to finish most analytics jobs.
- High performance on big data.
Cons
- Lacks data modeling capabilities. It is mostly used for data visualization and descriptive analytics.
Although Tableau and Power PI are similar, Tableau is more efficient in handling massive data sets, while Power BI is more straightforward and friendly for beginners.
For example, to create a histogram in Tableau, we can create Bins, drag and drop variables to the Columns, and drag and drop the number of records in the Rows. See this video for a quick demo.
R
R was implemented in the early 1990s by statisticians. It is a popular choice in data science and analytics for statistical computing and graphics.
Pros
- A lot of libraries are available for data analytics and statistical modeling.
- Strongly supported by the open-source community.
- Exemplary support for data wrangling and data visualization.
Cons
- Although R is very productive in analytics studies, it is inefficient for production.
To plot a histogram in R, we can call its hist function:
hist(my_data$my_variable)
Python
Python is a general-purpose, object-oriented programming language that emphasizes code readability. Released in 1989, Python is easy to learn and a favorite of programmers and developers. Python is one of the most popular programming languages based on the 2021 TIOBE Index. The plenty of useful analytics libraries also makes this universal language a prevalent analytics tool in the industry.
Pros
- A lot of libraries are available for data analytics.
- Strongly supported by the open-source community.
- It is easier to be integrated into the production system.
Cons
- Although Python is an easy-to-learn language, it has a deeper learning curve than R for those with no programming background.
To plot a histogram in Python, we need to import the pandas
package and call the hist function:
import pandas as pd my_data.my_variable.hist()
SAS
SAS is a statistical software suite developed by SAS Institute in 1976. SAS has been an industry staple for a long time now, but it is being phased out and replaced by open-source tools in recent years.
Pros
- Tested algorithms. Every version of SAS is first tested in a controlled environment before being released.
- Data Security.
- Easy to use. SAS syntax is straightforward to learn. And it also has Graphical User Interface (GUI) to make statistical computing easier for non-programming users.
Cons
- SAS license is expensive.
- Not open source. R and Python are always quicker than SAS in implementing a paralytics algorithm.
SAS has many ways to plot a histogram. For example, we can use the PROC
step:
PROC sgplot data=mylib.my_data;
histogram my_variable;
RUN;
The SQL, R, Python, and SAS mentioned above are programming analytics tools. Also, both R and Python can work together with Tableau, Power BI, and Excel. Excel also includes the native VBA programming language. Not surprisingly, analytics programming is an unignorable skill. Although it may be challenging for those who do not have the related experience initially, it benefits reproducibility, flexibility, and collaboration. It makes analytics professionals more productive in the industry. With the libraries’ help, analytics programming is less technical than engineering programming. Many recent libraries, such as Bamboolib, can generate the code automatically to make the analytics programming even more convenient.