Posts

Showing posts from May, 2022

Hadoop Ecosystem for Big Data

Introduction: Every day the internet generates billions of bytes of data. Every time you put on a dog filter, watch cat videos, or order food from your favourite restaurant, you generate data. Imagine how much data millions of other people doing the same things every hour of every day must be generating: terabytes and petabytes of it. This is called big data. Collecting and analysing this user-generated data is often essential to providing better services. But handling this amount of data is not easy; we need highly efficient technologies to manage and analyse it. This article briefly discusses tools that help store, maintain, and analyse big data. Any big data handling process can roughly be divided into four layers, each with its own tools. These layers are: Data Ingestion, Data Storage, Resource Management, Data...

4 Ways to Transform Data in Power BI Desktop

When making transformations in Power BI Desktop, a question often arises: which is the best method or place to do them? The transformation method you choose becomes critical when you are dealing with large data sets (a million rows or more), because it can affect the performance of your model and the overall user experience. Below are some of the methods you can adopt. 1. Relational Database: If you are connected to a relational database such as Microsoft SQL Server or Oracle DB and you have access to make transformations within the database, then that is the best place to do them, before loading the data into the Power BI environment. 2. Native Query (within Power Query): Another, less talked-about method is to use a native query. • With this method you can make the necessary transformations in a SQL script. • The native query fetches the transformed data directly from the database (such as SQL Server or Oracle), and so avoids or minimizes your transformations in Power Query. 3. Po...

Why does DATA need a massage?

A bird's-eye overview of the basic principles of data transformation: This post is part of the “Analytics for startup” series, so it may be helpful to those who have just started figuring out how to build their own BI solution. I try to answer the most common question I hear at the beginning of the BI implementation stage: why do we need to do any data transformation? It is a very fair question, given that data transformation can be the most challenging, time-consuming, and expensive part of the implementation. This is not a tutorial but a bird's-eye overview of the basic principles. Why do we need to transform the data? We don’t need to, actually. If the source is just one simple system or a couple of Google Sheets, then we don’t. In reality, though, we normally have many sources from the very beginning, e.g. a CRM, Google Analytics, Google Sheets, a database of our product, a website, etc. The main ideas of the transformation are...

Why is statistics important in Data Science, Machine learning, and Analytics

Benefits of having a strong background in "statistics" as a "Data Scientist": Statistics, in its broadest sense, refers to a collection of tools and methods for evaluating, interpreting, displaying, and making decisions based on data. Some individuals refer to statistics as the mathematical analysis of technical data. “A significant constraint on realizing value from Big Data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from Big Data” - McKinsey. In this article, I will attempt to explain why I believe it is essential for data science and machine learning enthusiasts to possess a deeper understanding of statistics. Looking deeper, statistics is a form of mathematical analysis that employs multiple quantitative models to produce experimental data o...

10 Pandas Functions for Faster Data Science

1. read_html() Web scraping is one of the key processes that brings people to Python. Many people don’t know that pandas has a web scraping function. With read_html(), all you have to do is pass in the URL and you can access the tables on that web page. Here is the full documentation.
    pd.read_html("URL")
2. corr() This function returns a correlation matrix for pairs of numeric columns.
    df.corr()
3. drop_duplicates() Getting rid of duplicate values is an important step in data analysis. There are lots of convoluted ways to do it, but pandas has a super simple function you can use. Here is the full documentation for the function. ...
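
As a rough illustration of two of the functions in this excerpt, corr() and drop_duplicates(), here is a minimal sketch. The DataFrame, its column names, and its values are made up for the example and do not come from the original post. Numeric columns are selected explicitly before calling corr() so the text column does not get in the way.

```python
import pandas as pd

# Hypothetical sample data: the fourth row is an exact duplicate of the third.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 170, 180],
    "weight_kg": [50, 60, 70, 70, 80],
    "city": ["Pune", "Delhi", "Mumbai", "Mumbai", "Pune"],
})

# drop_duplicates() removes rows that repeat an earlier row in every column.
deduped = df.drop_duplicates()

# corr() returns the pairwise correlation matrix; keep only numeric columns first.
corr_matrix = deduped.select_dtypes(include="number").corr()

print(deduped)
print(corr_matrix)
```

Here drop_duplicates() keeps the first occurrence of each repeated row by default, and corr() computes Pearson correlation unless another method is requested.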