Basic Data Analysis with MySQL Shell Python mode

I recently watched a fantastic Python Pandas library tutorial
series on YouTube
. Without a doubt, Pandas is great for all
sorts of data stuff. On the same token, MySQL Shell in Python mode
is quite powerful in the sense that Python and the MySQL Shell
(version >= 8.0) are somewhat united in the same environment.
Although Pandas is in a league all its own when it comes to data
analysis, between the power of MySQL and Python, we can also
perform some basic analysis easily in MySQL Shell Python mode. In
this blog post, I will cover some basic data analysis using Python
mode in the MySQL Shell. Continue reading to see examples…

magnifying-glass-data-analysis-vectorBusiness
vector created by freepik – www.freepik.com

OS, Software, and DB used:

  • OpenSuse Leap 15.1
  • MySQL 8.0.21

Self-Promotion:

If you enjoy the content written here, by all means, share this
blog and your favorite post(s) with others who may benefit from or
like it as well. Since
coffee is my favorite drink, you can even buy me one if you would
like!

Data Set Used

You can download the Stack Overflow Developer
Survey Results
data set used in this post for your own
exploration if you would like.

Basic Data Analysis with MySQL Shell Python Mode: Connecting, table
row and column count, column names

For starters, we use the available global db object and get a
connection to an existing table in the database/schema by passing
in a name to the get_table() method:

1 MySQL  localhost:33060+ ssl  learning  Py > data =
db.get_table('so_data')

I store the ‘so_data’ table in a shell.Object variable named
‘data’. We can call the count() method against this object and
get a count of the total rows in the table:

1
2
 MySQL  localhost:33060+ ssl  learning  Py >
data.count()
88883

Related: Read the post, Dynamic
MySQL CREATE TABLE statement with pandas and pyodbc
, I wrote
and see how I dynamically created the ‘so_data’ table and
populated it with accompanying data.

Calling the select() method on the ‘data’ object, I can
essentially retrieve all rows and columns from the table. However,
at this time, I am only interested in the actual column names of
the table. I’ll store this result in a ‘rows’ object
variable:

1  MySQL  localhost:33060+ ssl  learning  Py > rows =
data.select().execute()

With the ‘rows’ variable, I call the get_column_names()
method on it and am returned a Python list of all the table’s
column names:

1
2
3
4
5
6
7
8
9
10
11
12
13
 MySQL  localhost:33060+ ssl  learning  Py > cols =
rows.get_column_names()
 MySQL  localhost:33060+ ssl  learning  Py > cols
[
    "Respondent", "MainBranch", "Hobbyist", "OpenSourcer",
"OpenSource", "Employment",
    "Country", "Student", "EdLevel", "UndergradMajor", "EduOther",
"OrgSize", "DevType", "YearsCode", "Age1stCode", "YearsCodePro",
"CareerSat", "JobSat", "MgrIdiot", "MgrMoney", "MgrWant",
"JobSeek", "LastHireDate", "LastInt",
    "FizzBuzz", "JobFactors", "ResumeUpdate", "CurrencySymbol",
"CurrencyDesc", "CompTotal", "CompFreq", "ConvertedComp",
    "WorkWeekHrs", "WorkPlan", "WorkChallenge", "WorkRemote",
"WorkLoc", "ImpSyn", "CodeRev", "CodeRevHrs", "UnitTests",
    "PurchaseHow", "PurchaseWhat", "LanguageWorkedWith",
"LanguageDesireNextYear",
"DatabaseWorkedWith","DatabaseDesireNextYear",
    "PlatformWorkedWith", "PlatformDesireNextYear",
"WebFrameWorkedWith", "WebFrameDesireNextYear",
"MiscTechWorkedWith",
    "MiscTechDesireNextYear", "DevEnviron", "OpSys", "Containers",
"BlockchainOrg", "BlockchainIs", "BetterLife", "ITperson",
    "OffOn", "SocialMedia", "Extraversion", "ScreenName",
"SOVisit1st", "SOVisitFreq", "SOVisitTo", "SOFindAnswer",
    "SOTimeSaved", "SOHowMuchTime", "SOAccount", "SOPartFreq",
"SOJobs", "EntTeams", "SOComm", "WelcomeChange", "SONewContent",
"Age", "Gender", "Trans", "Sexuality", "Ethnicity", "Dependents",
"SurveyLength", "SurveyEase"
]

As you can see, there are quite a lot of columns in this table.
Instead of counting them myself, I use the Python len() method and
get a count of the ‘cols’ list object:

1
2
 MySQL  localhost:33060+ ssl  learning  Py > len(cols)
85

Summary: We can use combinations of MySQL Shell
methods: get_table(), count(), select(), get_column_names() and the
Python len() method and determine pertinent table meta-data
information.

Basic Data Analysis with MySQL Shell Python Mode: Select specific
columns, distinct/unique column values, and constraining rows

We learned in the previous section that table ‘so_data’ has
over 80k rows of data, along with 85 total columns. That alone is
nothing to scoff at. Instead of loading up our screens with many
many rows and columns, I’ll utilize several of the available
MySQL Shell methods, choosing certain columns and constraining the
number of returned rows (if any).

The select() method can accept a list of columns names,
separated by commas. In this next query, I specify just the
‘SocialMedia’ column and limit the total number of rows to 10,
using the limit() method:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  MySQL  localhost:33060+ ssl  learning  Py >
data.select('SocialMedia').limit(10).execute()
+-------------+
| SocialMedia |
+-------------+
| Twitter     |
| Instagram   |
| Reddit      |
| Reddit      |
| Facebook    |
| YouTube     |
| YouTube     |
| YouTube     |
| Twitter     |
| YouTube     |
+-------------+
10 rows in set (0.0010 sec)

Pro Tip: The limit() method parameter is the
number of rows you want to be returned from theselect() method.

While the above query does provide good information, suppose we
need to know of all the unique values in the ‘SocialMedia’
column. We can easily include the DISTINCT keyword in the call to
select() with the desired column name:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
 MySQL  localhost:33060+ ssl  learning  Py >
data.select('DISTINCT (SocialMedia)').execute()
+--------------------------+
| SocialMedia              |
+--------------------------+
| Twitter                  |
| Instagram                |
| Reddit                   |
| Facebook                 |
| YouTube                  |
| NA                       |
| VK ВКонта́кте             |
| WhatsApp                 |
| I don''t use social media|
| WeChat                   |
| LinkedIn                 |
| Snapchat                 |
| Weibo                    |
| Hello                    |
| Youku Tudou              |
+--------------------------+
15 rows in set (0.2199 sec)

Based on the returned query results, we can see there are 15
unique values for the ‘SocialMedia’ column.

Summary: The select() method is capable of
choosing either all table columns or individual columns depending
on your needs and the column names supplied as parameters. If you
need specific columns in your query results, provide those columns
separated by commas as parameters to select(). select() also allows
MySQL keywords to be included with columns should you need any of
them as in the example query using DISTINCT.

On the other hand, constrain the number of output rows returned
from any query using the limit() method by specifying the number of
desired rows using limit()‘s number parameter.

Basic Data Analysis with MySQL Shell Python Mode: Counting, group
by, and other aggregate functions

In the previous section, we executed a query using select() and
DISTINCT, retrieving the unique values in the ‘SocialMedia’
column. We have these results from that query:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
 MySQL  localhost:33060+ ssl  learning  Py >
data.select('DISTINCT (SocialMedia)').execute()
+--------------------------+
| SocialMedia              |
+--------------------------+
| Twitter                  |
| Instagram                |
| Reddit                   |
| Facebook                 |
| YouTube                  |
| NA                       |
| VK ВКонта́кте             |
| WhatsApp                 |
| I don''t use social media|
| WeChat                   |
| LinkedIn                 |
| Snapchat                 |
| Weibo                    |
| Hello                    |
| Youku Tudou              |
+--------------------------+
15 rows in set (0.2199 sec)

What is the total number of non-NULL rows for each unique
‘SocialMedia’ column value in the ‘so_data’ table? Can we
find out with MySQL Shell Python mode? Absolutely.

We can use the COUNT() aggregate function on the
‘SocialMedia’ column right in the select() method. However, we
need a GROUP BY clause in the query as well. MySQL Shell has us
covered with a same-named method, group_by().

In this query, I’ll retrieve a count of the actual values in
the ‘SocialMedia’ column (ignoring NULL‘s in that count) and
group those counts by the ‘SocialMedia’ column:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
 MySQL  localhost:33060+ ssl  learning  Py >
data.select('SocialMedia', 'COUNT(SocialMedia) AS
num_medias').group_by('SocialMedia').execute()
+--------------------------+------------+
| SocialMedia              | num_medias |
+--------------------------+------------+
| Twitter                  |      11398 |
| Instagram                |       6261 |
| Reddit                   |      14374 |
| Facebook                 |      13178 |
| YouTube                  |      13830 |
| NA                       |       4446 |
| VK ВКонта́кте             |        603 |
| WhatsApp                 |      13347 |
| I don''t use social media|       5554 |
| WeChat                   |        667 |
| LinkedIn                 |       4501 |
| Snapchat                 |        628 |
| Weibo                    |         56 |
| Hello                    |         19 |
| Youku Tudou              |         21 |
+--------------------------+------------+
15 rows in set (0.2566 sec)

Summary: We can easily query using aggregate
functions in the select() method and group on appropriate columns
with the group_by() method.

Basic Data Analysis with MySQL Shell Python Mode: Filter rows with
where and bind query criteria values

The WHERE clause is responsible for filtering out which rows are
returned from a query by way of some predicate test. Without a
WHERE clause, all rows are returned from a SELECT query. Maybe you
want that. Maybe not. If not, use WHERE to filter rows according to
your needs.

The WHERE clause is not limited to only the SELECT statement, as
it is highly important in the DML commands UPDATE and DELETE.
Without a WHERE clause targeting a specific row or rows, all rows
are affected – in the case of DML (UPDATE and DELETE) – or
returned from a SELECT query. MySQL Shell has a where() method we
can use to filter the rows in a select() query just the same as in
regular MySQL (or any SQL dialect).

It is generally a good practice not to include potential user
input values into our query expressions. Most programming languages
have some sort of binding mechanism in place that imposes a sort of
parameterized query and/or a prepared statement. Using
parameterized queries and prepared statements, we can greatly
reduce the risk of SQL Injection attacks.

MySQL Shell has a bind() method we can use instead of directly
concatenating values into the query strings of the where()
predicate test(s). bind() accepts named parameters (which is what I
will use in the examples below) or the universal ‘?’ style of
parameter binding.

Readers may be interested in the ‘DatabaseWorkedWith’ column
of the ‘so_data’ table so let’s filter that column by rows
where the ‘SocialMedia’ column is ‘Hello’ using where() and
bind():

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
 MySQL  localhost:33060+ ssl  learning  Py > qry_cols =
'DatabaseWorkedWith'
 MySQL  localhost:33060+ ssl  learning  Py >
data.select(qry_cols).where('SocialMedia =
:soc_media').bind('soc_media', 'Hello').execute()
+-----------------------------------------------------------------------------------------------------------------------------------------+
| DatabaseWorkedWith                                              
                                                                   
   |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Microsoft SQL Server;MySQL;SQLite                                
                                                                   
  |
| Cassandra                                                        
                                                                   
  |
| MySQL;Redis                                                      
                                                                   
  |
| Microsoft SQL Server;MySQL                                      
                                                                   
   |
| NA                                                              
                                                                   
   |
| NA                                                              
                                                                   
   |
| Cassandra;Elasticsearch;Microsoft SQL Server;Oracle;SQLite      
                                                                   
   |
| PostgreSQL                                                      
                                                                   
   |
| Cassandra                                                        
                                                                   
  |
| NA                                                              
                                                                   
   |
| NA                                                              
                                                                   
   |
| DynamoDB                                                        
                                                                   
   |
| Cassandra                                                        
                                                                   
  |
| MongoDB;MySQL;Oracle;Other(s):                                  
                                                                   
   |
| SQLite                                                          
                                                                   
   |
| NA                                                              
                                                                   
   |
| NA                                                              
                                                                   
   |
| NA                                                              
                                                                   
   |
|
Cassandra;Couchbase;DynamoDB;Elasticsearch;Firebase;MariaDB;MongoDB;Microsoft
SQL Server;MySQL;Oracle;PostgreSQL;Redis;SQLite;Other(s): |
+-----------------------------------------------------------------------------------------------------------------------------------------+
19 rows in set (0.3925 sec)

In this query, I use bind() and include a matching named
placeholder (minus the colon : prefix) and corresponding value for
what is specified in the where() method predicate. For example, in
the where() method I used the ‘:soc_media’ named parameter and
represented it in bind() with ‘soc_media’ and the actual
accompanying value, ‘Hello’. Pretty straightforward.

Oftentimes, you need to filter a SELECT query by more than one
column or expression using multiple predicates. The MySQL Shell
where() method easily accepts multiple predicate conditions just as
a regular MySQL WHERE clause would using the AND and OR logical
operators.

But, with multiple where() predicates, that also means multiple
parameterized values right? Yes, it does.

However, multiple calls to bind() can be chained one after
another, for each of the needed bound values. See the following
query for a better understanding:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
 MySQL  localhost:33060+ ssl  learning  Py >
data.select(qry_cols).where('SocialMedia = :soc_media AND
DatabaseWorkedWith <> :na').bind('soc_media',
'Hello').bind('na', 'NA').execute()
+-----------------------------------------------------------------------------------------------------------------------------------------+
| DatabaseWorkedWith                                              
                                                                   
   |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Microsoft SQL Server;MySQL;SQLite                                
                                                                   
  |
| Cassandra                                                        
                                                                   
  |
| MySQL;Redis                                                      
                                                                   
  |
| Microsoft SQL Server;MySQL                                      
                                                                   
   |
| Cassandra;Elasticsearch;Microsoft SQL Server;Oracle;SQLite      
                                                                   
   |
| PostgreSQL                                                      
                                                                   
   |
| Cassandra                                                        
                                                                   
  |
| DynamoDB                                                        
                                                                   
   |
| Cassandra                                                        
                                                                   
  |
| MongoDB;MySQL;Oracle;Other(s):                                  
                                                                   
   |
| SQLite                                                          
                                                                   
   |
|
Cassandra;Couchbase;DynamoDB;Elasticsearch;Firebase;MariaDB;MongoDB;Microsoft
SQL Server;MySQL;Oracle;PostgreSQL;Redis;SQLite;Other(s):..