
· 5 min read

P.S. This turned out to be a rant...just so you know...

Motivation

I am currently involved in a small web scraping project. My job is to retrieve information about local charity organizations from a website that only offers a web-based search interface. The complexity of the project is manageable, but I learned a thing or two in the process of building a runnable Python script that captures the data and transforms it into a readable Excel spreadsheet.


1. Web Scraping Can Be Fun

Within the bounds of laws and regulations, I would count any programmatic way of gathering information from the web as web scraping. I believe the concept is familiar to many, possibly because of the popularity of Python and how easy it makes simple web scraping tasks. After working on web development projects for a while, I gained a better understanding of how websites work behind the scenes, and in my most recent web scraping project I was able to use those insights to explore ways of gathering the required data.

When I took over the above-mentioned project, there was already a working Python script written by someone else. It used basic Selenium selectors to crawl the information field by field. The issue with this approach is that the Chrome web driver involved has to be kept in sync with the version of the user's Chrome browser. The script often failed to run after a month or two, and it was annoying to have to download the driver again every time. The problem is made worse when the script has to be run by someone else.

Change is the only constant. I got the chance to update the script because the website it was scraping had a major upgrade and the selector logic no longer worked. With the web development experience that I have now, I decided to do some preliminary checks and see if I could find an easier way to get the data.

The first thing that came to mind was to check the network calls that the site was making and see if I could make them directly. Cutting out the middleman is always a strategy worth trying. With the browser's inspector tool open, I was able to observe the requests and responses whenever the page refreshed.

Whipping out an API testing tool, I was able to replicate the network calls directly. In fact, I could now gather the entire list of organizations as JSON. For the details of an individual organization, I had to find the corresponding query string that the site used to identify it. This was interesting, as the query string looked like this: M2E5M2Q1N2YtNzk2NS1lMzExLTgyZGItMDA1MDU2YjMwNDg0.

I was pretty clueless at first, but as someone who has now been through the entire journey of front-end and back-end development, I know that people don't write perfect software and that there are always clues hidden in the source code. Given that we can easily inspect the HTML of a website, I decided to look for hidden treasures there. After some inspection, I found the piece of code that builds the query string: btoa(charityID). After googling btoa, I found out that it encodes a string into Base64. With that, I was able to simplify the scraping process by encoding the ID programmatically and using the requests package to simply make POST requests to get what I wanted.
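The encoding step can be sketched in Python as follows. This is a minimal sketch rather than the original script: the function name encode_charity_id and the payload field name are hypothetical, and the sample ID is simply what the token shown earlier decodes to.

```python
import base64


def encode_charity_id(charity_id: str) -> str:
    """Python counterpart of the browser's btoa() for ASCII input."""
    return base64.b64encode(charity_id.encode("ascii")).decode("ascii")


# The ID below is what the sample token from earlier decodes to.
sample_id = "3a93d57f-7965-e311-82db-005056b30484"
token = encode_charity_id(sample_id)
print(token)

# Hypothetical payload shape; the real field name depends on the site.
payload = {"id": token}
```

A payload like this could then be handed to requests.post together with whatever endpoint the inspector tool revealed.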


2. It's Not a Bug but a Feature?

I thought the above experience was interesting, but the following point is what triggered me to write this article. After I finished the script, I was informed that the resulting files had a few issues. Looking at the code again, I realized that I had made a mistake.

To understand the problem, let me briefly introduce the background. The information organized in JSON format contains primary categories and sub-categories. Thus, one combination could be

  • Primary category: Personal
    • sub-category: Expenditure
  • Primary category: Business
    • sub-category: Expenditure

In the example given here, it is clear that both categories contain a sub-category called "Expenditure". This does not seem like a problem unless the JSON format is something like the following:

[
  { "key": "someOtherValue", "value": 123 },
  { "key": "expenditure", "value": 123 },
  { "key": "expenditure ", "value": 456 },
  { "key": "someMoreValue", "value": 123 }
]

It is simply an array of key-value pairs. So, how do the developers who created this schema tell whether an expenditure amount belongs to the "personal" or the "business" category?

Initially, I was unaware that the same identifier was being used twice. What I did notice was that some identifiers had a trailing space. I thought these were careless mistakes and added some code to strip trailing spaces while processing the data. Later, I found out that the trailing spaces were intentional: that was how the developers differentiated one value from the other. The best part is that, because trailing spaces are practically invisible, the developers could simply loop through the values in the array and display them as a normal table on the website. When I inspected the HTML, there were indeed trailing spaces in some of the identifiers. I was rather speechless to find out that a trailing space was used as part of a unique identifier. This is worse than having a bad name...
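To make the mistake concrete, here is a minimal sketch using the sample array from above (the variable names are my own, not taken from the real script). Stripping the keys collapses the two expenditure entries into one, while keeping the raw keys preserves both.

```python
entries = [
    {"key": "someOtherValue", "value": 123},
    {"key": "expenditure", "value": 123},
    {"key": "expenditure ", "value": 456},  # trailing space is significant
    {"key": "someMoreValue", "value": 123},
]

# Stripping whitespace makes the two "expenditure" keys collide,
# so the later value silently overwrites the earlier one.
stripped = {entry["key"].strip(): entry["value"] for entry in entries}

# Keeping the raw keys preserves both entries.
raw = {entry["key"]: entry["value"] for entry in entries}

print(len(stripped), stripped["expenditure"])
print(len(raw), raw["expenditure"], raw["expenditure "])
```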

Conclusion

We all tend to take the shortest, most efficient path to make something work. This could mean copy-pasting code and making the slightest change to satisfy a new requirement. If the software is important and used by many, we ought to stop in our tracks sometimes and plan proper refactoring to make it right. Or else...

· 4 min read

Motivation

I try to watch coding-related conference talks once in a while, and my recent pick, Design Strategies for JavaScript API by Ariya, resonated with me. Here's a summary and discussion of code quality based on ideas from the talk.


Code Quality

While the talk focuses on API design, it speaks to all programmers, as writing functions that are used across classes, modules, and files is a common task. Worse than inconveniencing others is the fact that some functions are misleading even to their author. When we write functions, we should strive to make them:

  • Readable
  • Consistent

Readability

Read Out Loud

If you can't pronounce or easily spell out the function name, it deserves a better name.

Avoid Boolean Traps

Often the first tool we reach for when modifying a function to meet a new requirement is a Boolean parameter. We add a true/false value at the end of the existing parameter list. It won't be long before the parameter list grows out of control and we can no longer tell which parameter is responsible for what.

One potential fix is to use an option object:

person.turn("left", true) // turn left and take one step forward
person.turn("left", false) // turn left and stay at the same place
// change to
person.turn("left", {"stepForward": true})
person.turn("left", {"stepForward": false})

Another refactoring idea is to extract a commonly used combination into a separate function, so perhaps:

person.turn("left", true) // turn left and take one step forward
person.turn("left", false) // turn left and stay at the same place
// change to
person.turnAndStepForward("left") // if this combination is often used

Do not jump into abstractions too quickly though.
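For readers working in Python, a similar effect to the option object can be had with keyword-only arguments. This is a minimal sketch; the Person class is hypothetical and only mirrors the example above.

```python
class Person:
    """Hypothetical class mirroring the person.turn() example."""

    def __init__(self):
        self.last_turn = None
        self.steps = 0

    def turn(self, direction, *, step_forward=False):
        # The bare * makes step_forward keyword-only, so callers
        # must spell out what the Boolean means.
        self.last_turn = direction
        if step_forward:
            self.steps += 1


person = Person()
person.turn("left", step_forward=True)  # intent is clear at the call site

try:
    person.turn("left", True)  # a positional Boolean is rejected
except TypeError as error:
    print("rejected:", error)
```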

Use a Positive Tone

This might look like a subjective, glass-half-full versus glass-half-empty matter. However, the talk given by Ariya suggests that we should avoid double negatives such as x.setDisabled(false) and use x.setEnabled(true) instead. This makes statements easier to understand intuitively. It is also important to use one form over the other consistently.

Explicit Immutability

I think this is one of the main takeaways I gathered from the talk. While I try my best to write immutable functions, some level of mutability is hard to avoid. When we do have functions that can either be mutable or immutable, it might be beneficial to indicate that in the function name. For example:

aString.trim() // modifies the existing string in place
aString.trimmed() // returns a trimmed copy and leaves the original untouched
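Python's built-ins happen to follow a similar naming pattern, which may make the idea more concrete: list.sort() changes the list in place, while sorted() returns a new list. A small sketch:

```python
numbers = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched.
ordered_copy = sorted(numbers)
unchanged = list(numbers)

# list.sort() reorders the list in place and returns None.
result = numbers.sort()

print(ordered_copy, unchanged, numbers, result)
```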

Consistency

Naming

To be consistent is to be predictable. This relies on making smart observations about existing norms and agreed-upon conventions. By drawing on what we believe all programmers should know, such as familiar patterns and structures, best practices, and conventions that have stood the test of time, we can write functions that turn out to be unsurprising to potential readers.

On a smaller scale, if two functions do similar things, they ought to be named similarly. This is an extension of the idea of polymorphism. For example:

person.turn("left")
car.steer("left")

Perhaps a better way to name the functions would be to use turn for both.

person.turn("left")
car.turn("left")

Parameters

In the same vein, having consistent parameters will help to reduce mistakes. For example:

person.rotate(1, 2) // first horizontally, second vertically
rectangle.rotate(1, 2) // first vertically, second horizontally

Suppose that both objects have a method called rotate, but the two methods take the same pair of values in opposite orders. That is a disaster in the making.


Conclusion

With the help of powerful IDEs, we now enjoy the convenience of having function documentation available as we write code. This may make it easier to recognize what a function does or what each parameter means, but it should not encourage us to write bad functions. Also, if someone is already making a mess of the code, it may not be wise to trust their documentation, if there is any...

· 4 min read

Motivation

I am working on fixing some of the issues raised by flake8 (a Python linter) in a Python-based backend repository and thought it would be nice to discuss some of the common issues and the solutions that I gathered from the web (well, mostly Stack Overflow). Using an auto formatter such as black helps resolve some of these common issues automatically. flake8rules is also an excellent resource for a complete list of issues.


line too long (92 > 79 characters) flake8(E501)

Line too long issues mainly happen for the following cases:

  • string
  • if-statement
  • method chaining
  • parameter list

... I was going to explain with examples how to use Python's implied line continuation inside parentheses, brackets and braces but decided not to. Nowadays I choose to leave that job to my auto formatter.
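For reference, implied line continuation in the four cases listed above might look like the following minimal sketch. All names and values are made up for illustration.

```python
# A long string: adjacent literals inside parentheses are joined.
message = (
    "This sentence is split across two source lines "
    "but is still a single string."
)

# A long condition: wrap it in parentheses and break at operators.
width, height, depth = 10, 20, 30
if (
    width > 0
    and height > 0
    and depth > 0
):
    volume = width * height * depth

# Method chaining: parentheses let each call sit on its own line.
cleaned = (
    "  Hello, World  "
    .strip()
    .lower()
    .replace(",", "")
)


# A long parameter list: one parameter per line inside the brackets.
def describe(
    name,
    category,
    amount,
):
    return f"{name} ({category}): {amount}"
```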

For those who insist on writing code without any helpful plugins or IDE support, I would like to share that practice does make perfect. For a period, I wrote Java code in Vim without autocomplete or format-on-save. I ran style checks and manually fixed the issues raised, such as missing spaces around operators. After a month or two, these things became second nature and I was pretty happy with my ability to write well-formatted code without any help. It was an interesting experience, so go ahead and try it yourself.


do not use bare 'except' flake8(E722)

Example:

def create(item):
    try:
        item("creating")
    except:
        pass

Fix:

def create(item):
    try:
        item("creating")
    except Exception:
        pass

Explanation:

A bare except will catch exceptions you almost certainly don't want to catch, including KeyboardInterrupt (the user hitting Ctrl+C) and SystemExit (raised when the program asks to exit).
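The difference comes from the exception hierarchy, which the following minimal sketch illustrates. The two function names are hypothetical and exist only for the demonstration.

```python
# KeyboardInterrupt and SystemExit derive from BaseException,
# not from Exception, so `except Exception` lets them propagate.
print(issubclass(KeyboardInterrupt, Exception))  # False
print(issubclass(SystemExit, Exception))  # False


def swallow_with_bare_except():
    try:
        raise SystemExit(1)
    except:  # noqa: E722 - deliberately bare for the demonstration
        return "swallowed"


def let_it_through():
    try:
        raise SystemExit(1)
    except Exception:
        return "swallowed"
```

Calling swallow_with_bare_except() returns normally, whereas let_it_through() lets the SystemExit escape to the caller.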

A better fix:

  • Think about what exact exception to catch and specify that instead of just catching any exception.
  • Think about whether this exception handling is necessary at all, and whether you are unintentionally using it for control flow.
  • When catching an exception, use either logging or other proper resolutions to handle the anticipated error.
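As a rough illustration of these points, here is one way the create function might look. This is a sketch under an assumption of my own: that a non-callable item is the failure we anticipate. Unlike the original, it also returns the result so the behaviour is easy to check.

```python
import logging

logger = logging.getLogger(__name__)


def create(item):
    """Call item, handling only the failure we anticipate."""
    try:
        return item("creating")
    except TypeError:
        # Assumption for this sketch: a non-callable item is the
        # anticipated failure; anything else should propagate.
        logger.exception("item is not callable: %r", item)
        return None
```

Note that TypeError can also be raised inside item itself, so in real code an explicit callable(item) check before the call may be the clearer option.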

'from some_package_name_here import *' used; unable to detect undefined names flake8(F403)

I think this is an interesting topic for discussion. Earlier in my coding journey, I was amused by the many lines of import statements found in some scripts. Sometimes the import statements outnumber the lines of actual code in the same file.

Nowadays I know better than to fear abstractions (I still fear BAD abstractions, and rightfully so). With the help of powerful IDEs, tracing a variable or function back to the package it was imported from is easier than before. The problem with 'from xxx import *' is that it is unclear what has been imported. As a result, IDEs and linters cannot decide whether an apparently undefined name comes from the package that you imported or is a genuine error.

Example

from package_a import *
from package_b import *
# suppose both packages include a function named pretty_print
# the later import shadows the earlier one, so it is not obvious which function is invoked below
pretty_print("abc")

Fix

from package_a import pretty_print
from package_b import other_function_required
pretty_print("abc")
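The name clash can be reproduced without real packages. In the sketch below, package_a and package_b are throwaway in-memory modules created just for the demonstration, and register_fake_package is a hypothetical helper.

```python
import sys
import types


def register_fake_package(name):
    """Create a throwaway in-memory module standing in for a package."""
    module = types.ModuleType(name)
    module.pretty_print = lambda text: f"{name}: {text}"
    sys.modules[name] = module


register_fake_package("package_a")
register_fake_package("package_b")

# Run the wildcard imports in a fresh namespace, as a module would.
namespace = {}
exec("from package_a import *\nfrom package_b import *", namespace)

# The later import has silently replaced the earlier one.
print(namespace["pretty_print"]("abc"))
```

The call resolves to the version from package_b, and nothing at the call site hints at that.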

Conclusion

When browsing an unfamiliar code repository, we tend to be less sentimentally attached, and that fresh perspective allows us to see the good, the bad, and the evil. Besides learning from the good practices, the hacks and the coding standard violations (and things like commented-out code) are excellent places to start reviewing concepts of coding style and coding standards, and to find out why code misbehaves.

That's all for now.


External resources: