I'm Sam Solis

I'm a student of life and a passionate worker. I love learning new things in tech, and it excites me whenever I can share and apply my knowledge to help small enterprises, organizations, or institutions meet their business requirements.

If there is something that makes me feel most accomplished, it's knowing that I am helping out local businesses or contributing to a bigger cause that would help my fellow Filipino people.


University of the Philippines Diliman

MS IE (current)

University of the Philippines Diliman

IS Analyst

Isabela State University

Lead Developer

University of the Philippines

Systems Developer

University of the Philippines

Software Engineer

Nokia Philippines Inc.


Programming Languages

Python, R, HTML, JavaScript, C++, Linux Bash

Web Frameworks

Django, Dash, React

Database Management

MySQL, PostgreSQL, MongoDB

Web Scraping

Selenium, Scrapy

Statistical Analysis and Machine Learning

Excel, Weka, Minitab, QGIS




Web Development

I do web development based on discussed user requirements. Most of my work involves information systems and business intelligence.

Database Design

I design databases and schemas that best fit a company's business requirements.


Automation

I automate traditional practices that need a more efficient workflow. I help digitize routine tasks that computers can manage so that labor can be utilized elsewhere.

Business Analytics

I provide insights from organizational data, and I use statistical analysis and machine learning models to deliver predictive analytics to clients.

Data Cleansing

I do data cleansing as part of the business intelligence pipeline, since it is a key step in transforming raw data into valuable information.

Data Mining

I can mine data from different websites to build data sources that answer research or business problems.

Matsing Learning

Windows as a Development Environment

Estimated reading time: 3 minutes, 13 seconds. Contains 646 words

In terms of development, I have always been a fan of simply using Linux.

I always installed a Linux distribution alongside my Windows OS. Although I had a habit of distro hopping before, I always found myself going back to an Ubuntu-based distribution (Ubuntu or Pop!_OS) because of the resourceful forums and active community. I also much prefer a GNOME-style desktop, although I am aware that it is resource hungry.

I dual-boot Linux for the following reasons:

  • I didn't want to go through the pain of adding extra apps on top of a terminal (I didn't want so many apps eating my memory)
  • Most servers are Linux-based and I want to familiarize myself with the environment
  • I'm an advocate of open-source products (a.k.a. I'm really cheap and I always try to find free alternative software first)
  • I never upgraded my laptop's HDD to an SSD, so I always had to endure slow boot times whenever I wanted to use Windows (again, I'm very frugal 😂)

As a result, I use Linux about 95% of the time on my laptop and rarely boot into Windows, usually only when I need to open a file or application that is not compatible with Linux, or when I play games.

But eventually I was gifted an SSD, and I saw how significantly it improved boot time and application launches. Since then, I had considered migrating my workspace and adjusting my workflow, but I never really committed to it until I recognized that some of my responsibilities would be accomplished more conveniently and quickly if I found a sweet spot inside Windows. Some of my roles require exploratory data analysis; although this is still possible in Linux, intuitive software tools are more widely available and supported on Windows, whether free or proprietary.

My setup is shown in the figure (pink apps are paid):

Essentially, I wanted to highlight that my drives are partitioned to separate system files and data files. I do this for the following reasons:

  • If the system partition encounters an error, the data partition is not affected
  • Data files stay accessible from any OS that I boot (more applicable when I still dual-booted)
  • It saves space on the SSD for app installation (for faster application startup)

Subsequently, I kept all database servers in Windows and did not install them in my WSL. I did this for the following reasons:

  • Avoid running duplicate servers (in both Windows and WSL)
  • Consistent DB server placement (MongoDB isn't officially supported in WSL, yet)
  • Stored data is accessible to both WSL scripts and Windows apps
  • Centralized configuration (to avoid confusion over whether data is stored in the Windows or WSL DB servers)
  • Last and most significant reason: I ran into far more issues configuring the data path in WSL than in Windows (chmod, chown, symlinks, systemctl orchestration obstacles)

In case I do need to access my DB servers from WSL, all I need to do is:

  • Identify the Windows host IP and place it in a .env file for script access. In WSL2, the host's IP appears as the DNS nameserver:

# in the WSL terminal
grep nameserver /etc/resolv.conf | awk '{print $2}'  # gives you the Windows host IP

  • Ensure that the ports are accessible from WSL by editing the firewall rules for each database port, as mentioned here
    • For MongoDB, an additional step is needed, as shown here
    • For PostgreSQL, an additional step is needed, as shown here
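To make the .env step concrete, here is a minimal Python sketch that reads the Windows host IP from a .env file and composes connection URIs for the Windows-hosted servers. The variable name WINDOWS_HOST, the credentials, and the database names are my own assumptions for illustration; only the default ports (PostgreSQL 5432, MongoDB 27017) come from the actual servers.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def build_uris(host: str) -> dict:
    """Build URIs for DB servers listening on the Windows host.

    app_user/app_db are placeholder names, not real credentials.
    """
    return {
        "postgres": f"postgresql://app_user@{host}:5432/app_db",
        "mongo": f"mongodb://{host}:27017/",
    }


if __name__ == "__main__":
    # The WSL2 host IP can change per boot, so the .env is refreshed each time.
    sample = "# refreshed on boot\nWINDOWS_HOST=172.22.0.1\n"
    env = parse_env(sample)
    print(build_uris(env["WINDOWS_HOST"])["postgres"])
    # prints postgresql://app_user@172.22.0.1:5432/app_db
```

A script inside WSL can then hand these URIs to its usual client library (psycopg2, pymongo, etc.) instead of pointing at localhost.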

Honestly, this setup is more of a workaround. I would prefer the DB servers to live inside WSL, but the setup was really a pain and I don't have the time at the moment to set it up meticulously to my liking. I'll update this post once I'm able to fix it.

I've also read in some forums that it's recommended to install DB servers in WSL instead of on the host because of access speeds.

If you have any suggestions or comments on this setup, let me know. I'd like to know what you think, because I'm pretty sure there is a better configuration out there 😊


I would also like to provide the following resources for finally setting up a WSL instance that also provides a GUI for Linux apps. A GUI isn't strictly needed, but I do need to simulate and debug web crawlers for data mining tasks and observe how they traverse a website.

Here are some sources:

  • How to install WSL2 (pretty straightforward)
  • How to enable GUI apps for WSL (unfortunately, Microsoft's setup documentation didn't work out of the box for me; it should be explicitly stated that VcXsrv or an alternative is needed on your device)

If there are issues, always check whether your firewall rules need some tweaking.
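With an X server such as VcXsrv running on the Windows side, GUI apps in WSL need the DISPLAY variable pointed at the Windows host. A small sketch of deriving that value from /etc/resolv.conf, assuming the default WSL2 DNS setup and VcXsrv's default display number:

```python
import re


def display_from_resolv(resolv_text: str, screen: str = "0.0") -> str:
    """Derive a DISPLAY value for an X server on the Windows host.

    In WSL2's default configuration, the host's IP is the DNS
    nameserver listed in /etc/resolv.conf.
    """
    match = re.search(r"^nameserver\s+(\S+)", resolv_text, re.MULTILINE)
    if match is None:
        raise ValueError("no nameserver entry found in resolv.conf")
    return f"{match.group(1)}:{screen}"


if __name__ == "__main__":
    # Sample resolv.conf content; in WSL you would read /etc/resolv.conf.
    sample = "nameserver 172.22.0.1\n"
    print(display_from_resolv(sample))  # prints 172.22.0.1:0.0
```

The equivalent one-liner usually lives in .bashrc as an export of DISPLAY, so every shell session picks up the (possibly new) host IP.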

How to get data for research?


Estimated reading time: 3 minutes, 40 seconds. Contains 736 words


From my experience in R&D, data is a prime factor to consider before finalizing a plan of action for research. Without data, how can we empirically and justifiably present valuable information to other people or stakeholders?


There have been many times when I had ambitious research topics in mind, only to be pulled back to reality by the need to find a way to collect the data, whether it already exists or not.

Taking on various projects opened my eyes to the endless possibilities of data sources. These are the methods that have really helped me in the past:

  • Utilizing a search engine. The first thing I do is type in keywords related to the topic at hand. You'd be surprised by the vast availability of sources out there if you dig hard enough.
  • Scouring the internet for downloadable reports (published reports can usually be found on the websites of different private and public companies). You shouldn't just settle for what search engines immediately return. It pays to check these sites out, because when you view these reports you might notice patterns that help you extract information repeatedly.
  • Personally contacting and requesting data from organizations and institutions. Both public and private companies may not publish raw data; however, it's worth the effort to approach them about the articles they publish on their sites or publicly. You never know: they might just write back and help you with the data you need.
  • Scraping data from their websites. This isn't common knowledge for some people, but you can actually 'save' information from websites, like a screenshot of a page that is already organized into tables. Web scraping is the term for it, and there are existing software services that offer to do this for you!
  • Experimentation. Collect data through empirical methods, especially for scenarios where the data really is non-existent. Personally, this has always been my last option unless my research requires it, because experimentation extends the time needed before data processing and analysis.
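To make the scraping idea concrete, here is a minimal sketch using only Python's standard library to pull an HTML table into rows. Real jobs would typically use Scrapy or Selenium, and the sample HTML here is made up for illustration rather than taken from a real site.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell of a row.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


# Made-up sample; in practice the HTML would come from an HTTP response.
sample = """
<table>
  <tr><th>Region</th><th>Population</th></tr>
  <tr><td>NCR</td><td>13484462</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(sample)
print(parser.rows)  # [['Region', 'Population'], ['NCR', '13484462']]
```

From there the rows drop straight into a CSV writer or a DataFrame for analysis.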


I feel like some of the bullets above should be common knowledge. But I observe that some people decide up front that certain methods are unfavorable, especially if they mean re-encoding or re-organizing the data again. As a result, they turn away from these feasible data sources and go back to searching for alternatives in the hope of finding one that is more convenient to work with.

But the reality is that technology has advanced so far that we really need to appreciate how it can help us in cases like this.


Just as people should be aware of the different sources of data, they should also be aware of the different tools available to help them extract and systematize data from more difficult providers. Most of the time, data from the internet is published in PDF documents or as image content. These are the most difficult sources to manage, since people immediately accept defeat at the prospect of encoding a mountain of image data by hand.


But if we are resourceful, we can be creative about the ways we extract information from these formats. There are many image-to-text and PDF-to-text converters available online. Check them out and see if they can help you! Some that I normally use are the following:

  • PDF Tables
  • Online OCR
  • Python. Given that some sites have paid services, when I handle a large number of files I code my own text extractors. There are packages available to help you out:
    • tabula-py
    • pytesseract
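As a sketch of how those two packages divide the work: tabula-py (via tabula.read_pdf) handles tables in PDFs, while pytesseract (via pytesseract.image_to_string) handles OCR on images. The routing helper below is my own illustration, not part of either library; the actual extraction calls are noted in comments since they need Java and Tesseract installed to run.

```python
from pathlib import Path

# PDFs with tables -> tabula-py; scanned images -> pytesseract.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def pick_extractor(filename: str) -> str:
    """Name the extraction tool to use for a given file, by extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_SUFFIXES:
        # tables = tabula.read_pdf(filename, pages="all")
        return "tabula-py"
    if suffix in IMAGE_SUFFIXES:
        # text = pytesseract.image_to_string(Image.open(filename))
        return "pytesseract"
    raise ValueError(f"no extractor configured for {filename!r}")


print(pick_extractor("report_2020.pdf"))   # prints tabula-py
print(pick_extractor("scanned_page.png"))  # prints pytesseract
```

In a batch job, this dispatch runs over a whole folder so that mixed deliveries of PDFs and scans all end up as machine-readable text.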

With all of that said, I would also like to point out that data owners should minimize inconvenience when hiring analysts to extract information from their data on a time-constrained schedule. Providing the raw files shortens the data processing period. It regularly happens that a client provides tabular reports in PDF format, only for us to discover later that they had the raw files in document, Excel, or CSV format all along. The process would have been more efficient if the raw files had been submitted early on; it significantly reduces the time needed to encode the characters accurately and lets more effort go into the analysis itself.




Start Work With Me

Contact Me
Quezon City, Philippines