Categories
Python Study Notes

Python Type-ing for Help

I’ve been using Python Pandas to analyze the data I recently scraped from BoardGameGeek. I have used lists and dataframes, but this experience showed me how much I have to learn. The official documentation is a treasure trove of information, but can be difficult to understand. One of the first obstacles is figuring out where to find information relevant to the problem at hand. In this post I will share an example of a problem I encountered and how I used ‘dtypes’ and ‘type()’ to find the right documentation to solve it.

The Setup

import pandas
import sqlite3
database_connection = sqlite3.connect("my_db.sqlite")
connection_cursor = database_connection.cursor()
gen2_tags = pandas.read_sql_query \
 ("SELECT generation,tag,parent,child_count FROM node \
   WHERE generation = '2'", \
   database_connection)
database_connection.close()

This created a dataframe object from my database with these columns: ‘generation’, ‘tag’, ‘parent’ and ‘child_count.’ Since I won’t be loading any more data from the database, I close the database connection. From here on out, I will be working with data stored in the gen2_tags object. This bit of code just makes sure that all of the data in my columns is of the type that I expect (either string or int).

gen2_tags = gen2_tags.convert_dtypes()
gen2_tags = \
 gen2_tags.astype({'generation':'int64'})
gen2_tags.dtypes

The Problem

Using groupby() and count(), I have a partial list of unique tags along with the number of times they occur.

gen2_tags[['tag','generation']].groupby('tag').count()

This is the output:

                    generation
tag
age                     282212
bgaccessoryversion        4664
boardgame                   98
boardgameaccessory       10496
boardgameartist          81863
...                        ...
videogamepublisher       59333
videogameseries          15200
videogametheme           69690
videogameversion        108737
yearpublished           282212

[74 rows x 1 columns]

At the bottom of the output, I can see that I have 74 rows. Unfortunately, I can’t see every unique tag value. The problem, then, is how to display only those unique values.

Solving the Problem

First, I try using ‘unique’ which I came across in StackOverflow using a Google search. I expect this to display just a list of the unique ‘tag’ values.

pandas.unique(gen2_tags['tag'])
<StringArray>
['yearpublished',                  'minplayers',
 'maxplayers',                    'playingtime',
 'minplaytime',                   'maxplaytime',
 'age',                                  'name',
 'description',                     'thumbnail',
 'image',                  'boardgamepublisher',
 'boardgamehonor',          'boardgamemechanic',
 'boardgamepodcastepisode',  'boardgameversion',
 'boardgamefamily',         'boardgamecategory',
 'boardgameartist',         'boardgamedesigner',
 'boardgamesubdomain',                   'poll',
 'statistics',        'boardgameimplementation',
 'videogamebg',            'boardgameaccessory',
 'boardgameexpansion',   'boardgamecompilation',
 'boardgameintegration',            'boardgame',
 'boardgameissuearticle',  'bgaccessoryversion',
 'videogametheme',           'videogameversion',
 'videogamedeveloper',     'videogamepublisher',
 'videogamegenre',              'videogamemode',
 'videogameplatform',      'videogamecharacter',
 'vgcharacterversion',        'rpgissuearticle',
 'rpgartist',                    'rpgpublisher',
 'rpggenre',                      'rpgdesigner',
 'rpgproducer',                   'rpgcategory',
 'rpgitemversion',          'rpgpodcastepisode',
 'error',                     'commerceweblink',
 'rpg',                           'rpgmechanic',
 'rpgseries',                        'rpghonor',
 'rpgsetting',                'rpgissueversion',
 'rpgissue',               'videogameexpansion',
 'videogamehonor',            'videogameseries',
 'videogamecompilation', 'videogamehardwarecategory',
 'videogamehwversion',         'videogamehwmfg',
 'vgwalkthrough',          'videogamefranchise',
 'videogame',               'videogamehardware',
 'vgcharactercat',      'boardgameissueversion',
 'boardgameissue',            'boardgameauthor']
Length: 74, dtype: string

I can see all the unique values, but I don’t want to see ‘StringArray’ or ‘Length’ or even the brackets and quotes. My initial instinct is to follow the Python documentation on lists for splicing:

tag_list = pandas.unique(generation_tags['tag'])
print(tag_list[:])

But, it gives me the exact same output as before. Perhaps I am not dealing with a list in the basic Python sense at all? Looking over the output, the first line refers to a ‘StringArray.’ And this confirms it:

type(tag_list)
pandas.core.arrays.string_.StringArray

Not only that, but the object comes from Pandas. In the Pandas API reference for StringArray, I can confirm that the output is as expected, but it isn’t really helping me do what I want. The documentation refers me to a pandas.array object (in a ‘See Also’ box on the webpage).

The examples here are like the examples for the StringArray. Not too helpful, but I notice that the ‘See Also’ box sends me to something called a Pandas Series. It isn’t obvious at my first read through, but a bit of persistence gives me a new hint.

I realize that Series data is ‘iterable.’ Looking up what ‘iterable’ means in the Python Glossary suggests that I need a for loop. So I try:

for i in pandas.Series(tag_list):
  print(i)

This displays a list of the unique ‘tag’ values without any extra characters or meta information.

yearpublished
minplayers
maxplayers
playingtime
minplaytime
maxplaytime
age
name
description
thumbnail
image
boardgamepublisher
boardgamehonor
boardgamemechanic
boardgamepodcastepisode
boardgameversion
boardgamefamily
boardgamecategory
boardgameartist
boardgamedesigner
boardgamesubdomain
poll
statistics
boardgameimplementation
videogamebg
boardgameaccessory
boardgameexpansion
boardgamecompilation
boardgameintegration
boardgame
boardgameissuearticle
bgaccessoryversion
videogametheme
videogameversion
videogamedeveloper
videogamepublisher
videogamegenre
videogamemode
videogameplatform
videogamecharacter
vgcharacterversion
rpgissuearticle
rpgartist
rpgpublisher
rpggenre
rpgdesigner
rpgproducer
rpgcategory
rpgitemversion
rpgpodcastepisode
error
commerceweblink
rpg
rpgmechanic
rpgseries
rpghonor
rpgsetting
rpgissueversion
rpgissue
videogameexpansion
videogamehonor
videogameseries
videogamecompilation
videogamehardwarecategory
videogamehwversion
videogamehwmfg
vgwalkthrough
videogamefranchise
videogame
videogamehardware
vgcharactercat
boardgameissueversion
boardgameissue
boardgameauthor

Problem solved!

Of course, I could have just copied some code from StackOverflow, but trying to read the APIs helps me understand Python on a whole different level. For example, I won’t soon forget to check the type of my Python objects when they don’t behave as I expect. One of the really nice aspects of Python is how much documentation is available. I should definitely use it!

Leave a Reply

Your email address will not be published. Required fields are marked *