Intern As Data Analyst Resume
SUMMARY:
- Overall 5+ years of experience in data engineering, data pipeline design, development, and implementation as a Data Engineer/Data Developer and Data Modeler. Experience as a Data Engineer and Python Developer with expertise in Spark/Hadoop, Python/Scala, and the AWS and Azure cloud computing platforms.
- Worked with Azure Databricks notebooks to validate inbound/outbound data from external sources such as Amperity.
- Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as data processing tasks such as collecting, aggregating, and moving data using Apache Flume, Power BI, Microsoft SSIS, and Databricks.
- Worked with Azure cloud services (PaaS & IaaS): Azure Databricks, Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure HDInsight, Key Vault, Azure Data Lake, and ADLS Gen2 for data ingestion, ETL, data integration, data migration, and AI solutions, with Azure DevOps (VSTS).
- Strong development experience in the design and development of web-based, client-server applications, with a solid understanding of object-oriented programming using Java and J2EE-related technologies.
- Involved in various phases of the Software Development Life Cycle (SDLC), including requirement gathering, design, analysis, and code development.
- Data Engineer with experience implementing Big Data/cloud engineering, Snowflake, data warehouse, data modelling, data mart, data visualization, reporting, data quality, data virtualization, and data science solutions. Good understanding of architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.
- Strong experience in writing scripts using the Python, PySpark, and Spark APIs to analyze data (see the PySpark sketch after this list).
- Strong experience leading multiple Azure Big Data and data transformation implementations across various domains.
- Detailed exposure to Azure tools such as Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, Azure SQL Server, and Azure DevOps.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Developed and migrated on-premises databases to Azure Data Lake stores using Azure Data Factory.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, and with creating and handling DataFrames in Spark with Scala.
- Developed Hive scripts to meet end-user/analyst requirements for ad hoc analysis; used EMR with Hive to handle lower-priority bulk ETL jobs.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns. Expertise in OLTP/OLAP system study, analysis, and E-R modelling, and in developing database schemas such as star and snowflake schemas used in relational, dimensional, and multidimensional modelling.
- Experience creating separate virtual warehouses with different size classes in Snowflake on AWS.
- Hands-on experience bulk loading and unloading data into Snowflake tables using the COPY command (see the Snowflake sketch after this list).
- Hands-on experience with BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Experience with data transformations utilizing SnowSQL in Snowflake.
- Skilled in System Analysis, E-R/Dimensional Data Modelling, Database Design and implementing RDBMS specific features.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data. Hands-on experience with Big Data tools such as Hive, Pig, Impala, PySpark, and Spark SQL.
- Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
- Extensive experience in data visualization, including producing tables, graphs, and listings using various procedures and tools such as Power BI.
- Well versed in major Hadoop distributions, Cloudera and Hortonworks. Experience with the Eclipse and NetBeans IDEs.
- Exposure to working with Agile methodologies. Designed and developed data pipeline processes for various modules within AWS.
- Designed ETL processes using Informatica Designer to load data from various source databases into a target data warehouse in Vertica.
- Experience analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining and data cleansing.
- Extensive experience in data visualization, including producing tables, graphs, and listings using various procedures and tools such as Tableau.
- Excellent work experience writing highly complex SQL/PL-SQL queries against major relational databases: MS Access, Oracle, MySQL, Teradata, and MS SQL Server.
- Good experience working with the AWS Big Data/Hadoop ecosystem in the implementation of data lakes.
- Experience with AWS cloud services such as EC2, S3, EBS, VPC, ELB, Route 53, CloudWatch, Security Groups, CloudTrail, IAM, CloudFront, Snowball, RDS, and Glacier.
- Experience reading continuous JSON data from different source systems via Kafka into Databricks Delta, processing it with Spark Structured Streaming and PySpark, and writing the output in Parquet format (see the streaming sketch after this list).
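A minimal PySpark sketch of the kind of scripted analysis mentioned above (Python/PySpark bullet); the file path, column names, and aggregation are illustrative assumptions, not details from an actual engagement.

```python
# Hypothetical example: aggregate usage events by customer with PySpark.
# The path and column names (events.parquet, customer_id, event_ts, amount) are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-analysis").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/events.parquet")  # assumed location

daily_usage = (
    events
    .withColumn("event_date", F.to_date("event_ts"))          # derive a date column
    .groupBy("customer_id", "event_date")
    .agg(F.count("*").alias("event_count"),
         F.sum("amount").alias("total_amount"))
    .orderBy("event_date")
)

daily_usage.show(20, truncate=False)
```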
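The Snowflake bulk load/unload bullet above refers to the COPY command; below is a hedged sketch issuing COPY statements through the snowflake-connector-python package. The account, stage, and table names are placeholders, not real project objects.

```python
# Hypothetical sketch: bulk load/unload with Snowflake's COPY command
# via snowflake-connector-python. All identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # assumed account identifier
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Bulk load CSV files from an external stage into a table.
    cur.execute("""
        COPY INTO sales_raw
        FROM @my_stage/sales/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    # Unload query results back to the stage as compressed files.
    cur.execute("""
        COPY INTO @my_stage/exports/sales_summary
        FROM (SELECT region, SUM(amount) FROM sales_raw GROUP BY region)
        OVERWRITE = TRUE
    """)
finally:
    conn.close()
```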
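The Kafka-to-Databricks-Delta bullet above describes the sort of job sketched below with Spark Structured Streaming; the broker, topic, schema, and output paths are assumptions, and the sink could equally be the "delta" format on Databricks.

```python
# Hypothetical sketch: read JSON events from Kafka with Structured Streaming
# and write them out as Parquet. Broker, topic, schema, and paths are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
       .option("subscribe", "orders")                      # assumed topic
       .load())

parsed = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")                                # or "delta" on Databricks
         .option("path", "/mnt/datalake/orders")           # assumed output path
         .option("checkpointLocation", "/mnt/checkpoints/orders")
         .outputMode("append")
         .start())
```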
TECHNICAL SKILLS:
Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS
Hadoop Ecosystem: Hadoop, MapReduce, Yarn, HDFS, Pig, Oozie, Zookeeper
Big Data Ecosystem: Spark, Spark SQL, Spark Streaming, Spark MLlib, Hive, Impala, Hue, Airflow
Cloud Ecosystem: Azure, AWS, Snowflake cloud data warehouse
Data Ingestion: Sqoop, Flume, NiFi, Kafka
NOSQL Databases: HBase, Cassandra, MongoDB, CouchDB
Programming Languages: Python, C, C++, Scala, Core Java, J2EE
Scripting Languages: UNIX, Python, R Language
Databases: Oracle 10g/11g/12c, PostgreSQL 9.3, MySQL, SQL-Server, Teradata, HANA
IDE: IntelliJ, Eclipse, Visual Studio, IDLE
Tools: SBT, PuTTY, WinSCP, Maven, Git, JasperReports, Jenkins, Tableau, Mahout, UC4, Pentaho Data Integration, Toad
Methodologies: SDLC, Agile, Scrum, Iterative Development, Waterfall Model
PROFESSIONAL EXPERIENCE:
Confidential
Intern as Data Analyst
Responsibilities:
- Experienced in data modelling: performed business area analysis and logical and physical data modelling using Erwin for data warehouse/data mart applications as well as for operational application enhancements and new development. Data warehouse/data mart designs were implemented using the Ralph Kimball methodology.
- Maintained the stage and production conceptual, logical, and physical data models along with related documentation for a large data warehouse project. This included confirming the migration of data models from Oracle Designer to Erwin and updating the data models to correspond to the existing database structures.
- Applied excellent SQL programming skills to develop stored procedures, triggers, functions, and packages using SQL/PL-SQL, along with performance tuning and query optimization in transactional and data warehouse environments (see the sketch after this section).
- Worked with the DBA group to create a best-fit physical data model from the logical data model using forward engineering in Erwin.
- Enforced referential integrity in the OLTP data model for consistent relationships between tables and efficient database design.
- Conducted design walkthrough sessions with the business intelligence team to ensure that business reporting requirements were met.
- Developed data mapping, data governance, transformation, and cleansing rules for the Master Data Management architecture involving OLTP and ODS.
- Served as a member of a development team to provide business data requirements analysis services, producing logical and physical data models using Erwin.
- Prepared in-depth data analysis reports weekly, biweekly, and monthly using MS Excel, SQL, and UNIX.
Environment: MS Excel, SQL, UNIX, Data Mapping, Data Model, OLTP
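Since Python is the single code language used for sketches in this resume, here is a hedged illustration of invoking a PL/SQL stored procedure of the kind described above from Python via cx_Oracle; the connection details and the procedure name refresh_sales_summary are hypothetical.

```python
# Hypothetical sketch: calling a PL/SQL stored procedure from Python with cx_Oracle.
# The connection details and procedure name (refresh_sales_summary) are placeholders.
import cx_Oracle

conn = cx_Oracle.connect(user="dw_user", password="secret",
                         dsn="dbhost:1521/ORCLPDB1")  # assumed DSN
try:
    cur = conn.cursor()
    # Run a warehouse refresh procedure for a given load date.
    cur.callproc("refresh_sales_summary", ["2020-01-31"])
    conn.commit()
finally:
    conn.close()
```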
Confidential
Data Engineer
Responsibilities:
- Implemented the ETL design for the Data Lake project to capture information across products and modules and land it in an S3 bucket.
- Designed and implemented the process for the new structural design of the Payroll Integration for the cloud reporting environment.
- Addressed ad hoc analytics requests and facilitated data acquisitions to support internal projects, special projects, and investigations.
- Generated billing reports per client and counts of pay slips generated for the given positions/employees.
- Used Databricks to create Delta tables that support building a traditional data warehouse with incremental loads (see the Delta merge sketch after this section).
- Used the AI capabilities provided by the Databricks Lakehouse to process huge amounts of complex financial and alternative data and create data and insights for our clients.
- Created reports for the business analysts to review, covering pay groups, number of clients per state, and worked-in/lived-in positional reports.
- Implemented the data flow design for capturing information and sending it to private party vendors.
- Created standard and custom reports for HR and payroll processing.
- Developed polished visualizations to share results of data analyses.
- Created and maintained internal documentation using Python tools.
- Developed Tableau dashboards and ETL jobs to get the data refreshed daily.
- Experience with Snowflake Multi-Cluster Warehouses.
- In-depth knowledge of Data Sharing in Snowflake.
- In-depth knowledge of Snowflake Database, Schema and Table structures.
- Used temporary and transient tables on different datasets.
- Developed several ETL jobs using Informatica to load data into Snowflake from various sources.
- Created Hive tables for loading and analysing data.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
- Designed and implemented an ETL framework using Scala and Python to load data from multiple sources into Hive and from Hive to Vertica
- Used HBase on top of HDFS as a non-relational database.
- Loaded data into Spark RDDs and performed advanced procedures such as text analytics and processing using Spark's in-memory computation capabilities with Scala to generate the output response.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Handled large datasets during the ingestion process itself using partitions, Spark in-memory capabilities, broadcasts, and effective, efficient joins and transformations.
- Implemented partitions and buckets and developed Hive queries to process the data and generate data cubes for visualization.
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and Pair RDDs.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it to Cassandra.
- Extracted fingerprint image data stored on the local network to conduct exploratory data analysis (EDA), cleaning, and organizing. Ran the NFIQ algorithm to ensure data quality by keeping high-scoring images, and created histograms to compare distributions across datasets.
- Loaded the data onto GPUs and achieved half-precision (FP16) performance on Nvidia Titan RTX and Titan V GPUs with TensorFlow 1.14.
- Set up alerting and monitoring using Stackdriver in GCP.
- Optimized the TFRecord data ingestion pipeline using the tf.data API and made it scalable by streaming over the network, enabling models to train on datasets larger than CPU memory (see the tf.data sketch after this section).
- Worked extensively on AWS components such as Elastic MapReduce (EMR).
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Loaded data using AWS Glue
- Used Athena for data analytics.
- Worked with the data science team to automate and productionalize models such as logistic regression and k-means using Spark MLlib (see the MLlib sketch after this section).
- Created various reports using Tableau based on requirements with the BI team.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyse data from Cassandra tables for quick searching, sorting, and grouping.
- Working experience with the Azure Databricks cloud, organizing data into notebooks and making it easy to visualize data using dashboards.
- Architected and implemented ETL and data movement solutions using Azure Data Factory and SSIS.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Created Azure Data Factory (ADF) pipelines using Azure Blob storage.
- Performed ETL using Azure Databricks; migrated an on-premises Oracle ETL process to Azure Synapse Analytics.
- Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.
- Worked with Azure Databricks, PySpark, HDInsight, Azure DW, and Hive to load and transform data.
- Performed monitoring and management of the Hadoop cluster by using Azure HDInsight.
- Developed Spark RDD transformations, actions, DataFrames, case classes, and Datasets for the required input data, and performed the data transformations using Spark Core.
- Worked on Apache Spark, utilizing the Spark SQL and Streaming components to support intraday and real-time data processing.
- Experience in Snowflake administration and in managing the Snowflake system.
- Created data pipelines for the Kafka cluster, processed the data using Spark Streaming, and consumed streaming data from Kafka topics to load into the landing area for reporting in near real time.
- Documented logical data integration (ETL) strategies for data flows between disparate source/target systems, bringing structured and unstructured data into a common data lake and the enterprise information repositories. Experience with various technology platforms, application architecture, design, and delivery, including architecting large enterprise big data lake projects.
- Enabled and configured Hadoop services such as HDFS, YARN, Hive, HBase, Kafka, Sqoop, Zeppelin Notebook, and Spark/Spark2, and analysed log data to predict errors using Apache Spark.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion.
- Managed the OpenShift cluster, including scaling the AWS app nodes up and down.
- Virtualized servers using Docker for test and dev environment needs, and automated configuration using Docker containers.
- Good experience in Cloudera platform installation, administration, and tuning.
- Migrated an in-house database to the AWS Cloud and designed, built, and deployed a multitude of applications utilizing the AWS stack (including S3, EC2, RDS, Redshift, and Athena), focusing on high availability and auto-scaling.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis, and configured Spark Streaming with Kafka Streams to retrieve the data and store it in HDFS.
- Involved in designing data warehouses and data lakes on regular (Oracle, SQL Server), high-performance (Netezza and Teradata), and big data (Hadoop: MongoDB, Hive, Cassandra, and HBase) databases.
- Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which run independently based on time and data availability, and performed exploratory data analysis and data visualizations using Python and Tableau (see the Airflow sketch after this section).
Environment: Workforce Now (Payroll Integration on AWS), Snowflake, Python, PL/SQL, Redshift, AWS S3, AWS Step Functions, Tableau, Shell Scripting, Hadoop, HDFS, Hive, Oozie, Sqoop, Kafka, Elasticsearch, HBase, Oracle, MySQL, Teradata, AWS, Airflow, ETL.
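A minimal sketch of the incremental Delta load pattern mentioned in the Databricks bullet above, using the Delta Lake MERGE API; the table paths and join key are placeholders.

```python
# Hypothetical sketch: incremental load into a Databricks Delta table with MERGE.
# Table paths and the join key (order_id) are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-incremental").getOrCreate()

updates = spark.read.parquet("/mnt/raw/orders_increment")     # assumed staging data
target = DeltaTable.forPath(spark, "/mnt/delta/orders")       # assumed Delta table

(target.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()     # update rows that already exist
 .whenNotMatchedInsertAll()  # insert new rows
 .execute())
```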
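A hedged sketch of productionalizing logistic regression and k-means with Spark MLlib, as referenced above; the input table, feature columns, and model path are assumptions.

```python
# Hypothetical sketch: logistic regression and k-means with Spark MLlib (pyspark.ml).
# The input table and feature/label column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-models").getOrCreate()
df = spark.table("analytics.customer_features")  # assumed Hive table

assembler = VectorAssembler(inputCols=["tenure", "monthly_spend", "num_orders"],
                            outputCol="features")

# Supervised model: predict a churn label from the assembled features.
lr = LogisticRegression(featuresCol="features", labelCol="churned")
lr_model = Pipeline(stages=[assembler, lr]).fit(df)

# Unsupervised model: segment customers into k clusters on the same features.
km = KMeans(featuresCol="features", k=4, seed=42)
km_model = Pipeline(stages=[assembler, km]).fit(df)

lr_model.write().overwrite().save("/mnt/models/churn_lr")   # persist for scoring jobs
```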
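A hedged sketch of a scalable TFRecord ingestion pipeline with the tf.data API (TF 1.x style, matching the TensorFlow 1.14 work above); the file pattern and feature spec are assumptions.

```python
# Hypothetical sketch: a TFRecord input pipeline with the tf.data API (TF 1.x style).
# File pattern and feature spec are assumptions.
import tensorflow as tf

FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(record):
    parsed = tf.io.parse_single_example(record, FEATURES)
    image = tf.io.decode_raw(parsed["image"], tf.uint8)   # assumed raw-byte encoding
    return image, parsed["label"]

files = tf.data.Dataset.list_files("gs://example-bucket/fingerprints/*.tfrecord")
dataset = (files
           .interleave(tf.data.TFRecordDataset, cycle_length=4)  # stream shards in parallel
           .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(10000)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))   # keep the GPU fed
```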
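A minimal Airflow DAG sketch (1.x style imports) for scheduling daily Hive and Pig jobs as described in the Airflow bullet above; the script paths and schedule are placeholders.

```python
# Hypothetical sketch: an Airflow DAG that runs a Hive load followed by a Pig aggregation
# via the CLI. Script paths and the schedule are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "data-eng", "retries": 1}

with DAG(dag_id="daily_hive_pig_jobs",
         default_args=default_args,
         start_date=datetime(2020, 1, 1),
         schedule_interval="0 2 * * *",   # run at 02:00 daily
         catchup=False) as dag:

    load_hive = BashOperator(
        task_id="load_hive_tables",
        bash_command="hive -f /opt/etl/load_orders.hql",      # assumed script path
    )

    aggregate_pig = BashOperator(
        task_id="aggregate_with_pig",
        bash_command="pig -f /opt/etl/aggregate_orders.pig",  # assumed script path
    )

    load_hive >> aggregate_pig   # run the Pig aggregation after the Hive load
```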