Saturday, June 5, 2021

How to get data for research?




From my experience in R&D, data is the prime factor to consider before finalizing a plan of action for any research. Without data, how can we empirically and justifiably present valuable information to other people or stakeholders?


There have been many times when I had an ambitious research topic in mind, only to be pulled back to reality by the need to find a way to collect the data, whether it already exists or not.

Taking on various projects has opened my eyes to the endless possibilities of data sources. These are the methods that have really helped me in the past:

  • Utilizing a search engine. The first thing I do is type in keywords related to the topic at hand. You'd be surprised by the vast availability of sources out there if you dig hard enough.
  • Scouring the internet for downloadable reports (published reports can usually be found on the websites of different private and public companies). Don't just settle for what search engines immediately return. It pays to check these sites out, because as you view these reports you might notice patterns that let you extract information repeatedly.
  • Personally contacting and requesting data from organizations and institutions. Both public and private companies may not publish raw data, but it's worth approaching them about the articles they publish on their sites or elsewhere. You never know; they might just write back and help you with the data you need.
  • Scraping data from their websites. This isn't common knowledge for some people, but you can actually 'save' information from websites, like a screenshot of the page that is already organized into tables. Web scraping is the term for it, and there are existing software services that will do it for you.
  • Experimentation. Collect data through empirical methods, especially for scenarios where the data truly doesn't exist yet. Personally, this has always been my last option unless the research requires it, because experimentation extends the time needed before data processing and analysis.
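
To make the scraping idea concrete, here is a minimal sketch using only Python's standard library. The HTML snippet is a hard-coded stand-in for a downloaded page (a real project would fetch the page first, e.g. with urllib.request), and the table contents are made-up sample values:

```python
from html.parser import HTMLParser

# Stand-in for a page you downloaded; the figures are illustrative only.
SAMPLE_PAGE = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2019</td><td>1.2M</td></tr>
  <tr><td>2020</td><td>0.9M</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects every <tr> in the page as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(SAMPLE_PAGE)
print(parser.rows)  # the page's table, already organized into rows
```

This is the 'screenshot, but already organized through tables' idea from the bullet above: the page comes back as structured rows you can write straight into a CSV.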


I feel like some of the bullets above should be common knowledge. But I've observed that some people decide outright that certain methods are unfavorable, especially if they mean re-encoding or re-organizing the data. As a result, they turn away from these feasible data sources and go back to searching for alternatives, hoping to find one that is more convenient to work with.

But the reality is that technology has advanced so far that we really should appreciate how it can help us in cases like this.


Just as people should be aware of the different sources of data, they should also be aware of the different tools available to help them extract and systematize data from the more difficult providers. Most of the time, data from the internet is published as PDF documents or images, and these are the hardest sources to manage: people immediately accept defeat at the thought of encoding a mountain of image data by hand.


But if we are resourceful, we can be creative about the ways we extract information from these formats. There are many image-to-text and PDF-to-text converters available online. Check them out and see if they can help you! Some that I normally use are the following:

  • PDF Tables
  • Online OCR
  • Python - given that some of these sites have paid tiers, when I handle a large number of files I code my own text extractors. There are packages out there to help you:
    • tabula-py
    • pytesseract
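
A rough sketch of how the two packages above are typically wired together. The file paths are hypothetical placeholders; tabula-py needs a Java runtime and pytesseract needs the Tesseract binary installed, so the third-party imports are deferred until each helper is actually called:

```python
def pdf_tables_to_csv(pdf_path, csv_path):
    """Pull every table out of a PDF into one CSV file via tabula-py."""
    import tabula  # third-party: pip install tabula-py (requires Java)
    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")

def image_to_text(image_path):
    """Run OCR over a scanned report page via pytesseract + Pillow."""
    from PIL import Image  # third-party: pip install pillow
    import pytesseract     # third-party: pip install pytesseract (requires Tesseract)
    return pytesseract.image_to_string(Image.open(image_path))

# Usage (paths are examples, not real files):
# pdf_tables_to_csv("annual_report.pdf", "annual_report.csv")
# print(image_to_text("scanned_page.png"))
```

With helpers like these, a folder of PDFs or scans can be looped over in a few lines instead of being encoded by hand.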

With these pointed out, I would also like to raise that data owners should minimize inconvenience when they hire analysts to extract information from their data on a time-constrained schedule. Providing raw files shortens the data processing period. It regularly happens that a client provides tabular reports in PDF format, only for us to discover later that they had the raw files in document, Excel, or CSV format all along. The process would have been far more efficient had the raw files been submitted early on; it significantly reduces the time needed to encode the characters accurately, leaving more effort for the analysis itself.




I'm a student of life and a passionate worker. I love learning new things in tech and it excites me whenever I can share and apply my knowledge to help small enterprises, organizations or institutions in meeting their business requirements.





Contact Me
Quezon City, Philippines