Big Data Engineer/Cloud Data Engineer Resume
Dallas, Texas
SUMMARY
- Professional software developer with experience in the manufacturing, finance, and supply chain industries. Expertise as a lead developer, architect, and software designer across all aspects of the software development lifecycle, with database and business intelligence development expertise.
- Able to balance client expectations with technical considerations.
- Around 7 years of experience as a Big Data Engineer with skills in Spark/Hadoop, Python/Scala, SQL, the AWS and Azure cloud platforms, and Machine Learning and AI.
- Experience in configuring, installing, benchmarking and managing Apache Hadoop in various distributions like Cloudera and Hortonworks.
- Experience building big data pipelines and data sets and incorporating user needs into data architecture.
- Experience in Data Warehousing and Database Management. Areas of specialization include Data Architecture, Data Modelling (Both Dimensional and Relational Models), Data Analysis, Database Design, Data Federation, Data Integration (ETL), Metadata/Semantic/Universe Design, Static and OLAP/Cube Reporting and Testing.
- Excellent knowledge of OLTP/OLAP system study with a focus on the Oracle Hyperion suite of technologies; developed database schemas such as Star and Snowflake schemas (fact tables, dimension tables) used in relational, dimensional, and multidimensional modeling, and performed physical and logical data modeling using Erwin 9.6 and ER Studio.
- Experienced in data warehousing fundamentals, as well as recommendation engines, Spark, and Kafka.
- Implemented data science solutions and production-ready systems on big data platforms using technologies such as Snowflake, Spark, and Hadoop.
- Developed production-ready Spark applications using Spark RDD APIs, DataFrames, Spark SQL, and the Spark Streaming APIs (a minimal sketch follows this summary).
- Experience with MapReduce, Spark (Scala), PySpark, Spark SQL, and Pig for constructing ETL systems over massive amounts of data.
- Adept at using SAS Enterprise Suite, R, Python, and big data technologies such as Hadoop, Hive, Pig, Sqoop, Cassandra, Spark, Oozie, Flume, MapReduce, and Cloudera Manager to design business intelligence systems.
- Worked with APIs and data management.
- Developed Hive UDFs and worked extensively with Hive DDLs and the Hive Query Language (HQL).
- Hands-on expertise with Hadoop ecosystem products such as HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Flume, Kafka, Impala, Oozie, Zookeeper, and HBase, as well as creating and executing data engineering pipelines and analyzing data.
- Proficient with Python libraries and packages including NumPy, Pandas, Matplotlib, SciPy, and scikit-learn.
- Skilled in both predictive and descriptive analytics.
- Metadata modeling for business intelligence with tools such as Tableau, Power BI, and Cognos.
- Worked with big data services on the AWS and Azure cloud platforms.
- Structured Query Language (SQL) and stored procedures expertise.
- Advanced SAS programming skills, including PROC SQL (JOIN/ UNION), PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.
- Business Intelligence, development, and administration certifications.
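As an illustration of the Spark DataFrame and Spark SQL work listed above, the following is a minimal PySpark sketch. The input path, column names, and output location are hypothetical placeholders, and a configured Spark environment with S3 access is assumed.

```python
# Minimal PySpark sketch: load data, register a temp view, aggregate with Spark SQL.
# Paths and column names (sales.json, region, amount) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Read semi-structured input into a DataFrame
sales = spark.read.json("s3://example-bucket/raw/sales.json")

# DataFrame API: total amount per region
totals_df = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

# Equivalent aggregation through Spark SQL
sales.createOrReplaceTempView("sales")
totals_sql = spark.sql(
    "SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region"
)

totals_df.write.mode("overwrite").parquet("s3://example-bucket/curated/sales_totals")
spark.stop()
```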
TECHNICAL SKILLS
Databases: MySQL, PostgreSQL, MongoDB, Hive, HBase, Oracle, AWS Aurora, AWS DynamoDB
Programming Languages: SQL, PL/SQL, Python, R, Java, Scala, T-SQL, SAS
Big data: HDFS, MapReduce, Apache Spark, Apache Hive, YARN, Apache Pig, Apache HBase, Spark Streaming, Spark SQL, Spark ML, Oozie, Hue, Sqoop, Flume, Kafka, Zookeeper, Apache Airflow, NiFi
OS: UNIX, LINUX, Windows
Cloud Platforms: AWS, Microsoft Azure
Data Integration & Visualization Tools: SSIS, Pentaho Data Integration, Informatica, Power BI, Tableau
PROFESSIONAL EXPERIENCE
Confidential, Dallas, Texas
Big Data Engineer/Cloud Data Engineer
Environment: Hortonworks, Hadoop, HDFS, AWS Glue, AWS Athena, Azure, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL, AWS, SQL Server, Power BI, Databricks.
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM) with a focus on high availability, fault tolerance, and auto-scaling, provisioned through AWS CloudFormation.
- Supported persistent storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Installed and configured various components of the Hadoop ecosystem and maintained their integrity on Cloudera.
- Used Hortonworks distribution of Hadoop to store and process huge data generated from different enterprises.
- Monitored systems and services through Cloudera Manager dashboards.
- Used the DataFrame API in Scala to convert distributed collections of data into named columns, and developed predictive analytics using the Apache Spark Scala APIs.
- Developed Scala scripts using DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data aggregation and queries, and wrote data back to the OLTP system through Sqoop.
- Developed Hive queries to pre-process the data required for running the business process.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Implemented a generalized solution model using AWS SageMaker.
- Extensive expertise using the core Spark APIs and processing data on EMR clusters.
- Worked on ETL migration services by developing and deploying AWS Lambda functions for a serverless data pipeline whose output is registered in the Glue Catalog and can be queried from Athena (see the first sketch after this list).
- Created ETL pipelines, data flows, and complex data transformations with manipulations in Azure Data Factory; also used PySpark with Databricks.
- Worked with different data storage options including Azure Blob Storage and ADLS Gen1/Gen2.
- Built Azure Data Factory (ADF) pipelines to extract data from relational sources such as Teradata, Oracle, SQL Server, and DB2, and from non-relational sources such as flat files, JSON files, XML files, and shared folders. Designed and developed a streaming pipeline using Apache Spark and Python.
- Built Azure pipelines for end-to-end (E2E) solutions.
- Experience in bulk importing CSV, XML, and Flat file data using Azure Data Factory.
- Automated jobs using scheduled, event-based, and tumbling-window triggers in ADF.
- Created Azure Databricks notebooks to apply business transformations and perform data cleansing (see the second sketch after this list).
- Joined, filtered, pre-aggregated, and processed files stored in Azure Data Lake Storage using Databricks Python notebooks.
- Ingested a large amount and variety of data from multiple source systems into Azure Data Lake Storage Gen2 using Azure Data Factory V2.
- Migrated data from the SAS server to AWS using the AWS CLI and worked with AWS DevOps to control and manage the code.
- Extracted, transformed, and loaded data into Azure SQL DB and SQL Data warehouse using reusable pipelines created in Data Factory.
- Created external tables in Azure SQL Database for data visualization and reporting.
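First, a minimal sketch of the kind of Lambda function used in the serverless Glue/Athena pipeline mentioned above. The crawler name, database, table, and S3 locations are hypothetical placeholders, and the exact orchestration is not specified in this resume.

```python
# Hypothetical AWS Lambda handler: refresh the Glue Catalog via a crawler, then
# run an Athena query against the cataloged table. All names are placeholders.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

def lambda_handler(event, context):
    # Re-crawl the landing zone so new files/partitions are registered in the Glue Catalog.
    # In practice the query step would run only after the crawler completes
    # (e.g., orchestrated via Step Functions or an EventBridge rule).
    glue.start_crawler(Name="sales-landing-crawler")

    # Query the cataloged table from Athena; results land in the given S3 prefix
    response = athena.start_query_execution(
        QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    return {"queryExecutionId": response["QueryExecutionId"]}
```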
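Second, a minimal Databricks-style PySpark sketch of the notebook transformations described above. The abfss paths, container names, and column names are hypothetical, and storage access is assumed to already be configured (for example via a mount point or service principal).

```python
# Read from ADLS Gen2, apply business transformations, and write a curated table.
# Paths and column names are placeholders; `spark` is provided by the Databricks runtime.
from pyspark.sql import functions as F

raw_path = "abfss://raw@examplelake.dfs.core.windows.net/orders/"
curated_path = "abfss://curated@examplelake.dfs.core.windows.net/orders_daily/"

orders = spark.read.parquet(raw_path)

daily = (
    orders
    .filter(F.col("status") == "COMPLETED")           # data-cleansing step
    .withColumn("order_date", F.to_date("order_ts"))  # derive the business date
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("daily_revenue"))      # pre-aggregate
)

daily.write.mode("overwrite").partitionBy("order_date").parquet(curated_path)
```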
Confidential, Florida
Data Engineer
Environment: PySpark, Python, R, AWS, MySQL, Kibana, Power BI, Amazon SageMaker, AWS Athena, SAS, AWS CLI
Responsibilities:
- Organized daily sprint meetings to keep users informed and to discuss issues that needed to be handled.
- Used Databricks to create scripts in PySpark, Python, SQL, and R, and integrated Databricks with AWS.
- Recreated machine learning models from SAS in Python to verify the correctness of business models and gain a better understanding of client growth.
- Worked one-on-one with management as a lead developer, providing updates from the development team and the client.
- Reduced lines of code by rewriting the logic from multiple SAS scripts in PySpark, eliminating the extra scripts in the process.
- Used shell commands to migrate data from SAS servers; used DevOps to control and manage the code.
- Assisted the development team in using PySpark as an ETL platform.
- Developed PySpark applications for data extraction, transformation, and aggregation across numerous file formats, using Spark SQL in Databricks to analyze and convert the data and reveal insights into customer usage patterns.
- Wrote automation scripts in Python to transfer data from on-premises clusters to Azure.
- Using industry-leading Data Modeling tools, designed Data Marts using the Star Schema and Snowflake Schema Methodologies.
- Performed data cleaning, feature scaling, and feature engineering using Python's Pandas and NumPy libraries (see the second sketch after this list).
- Designed and developed architecture for data services ecosystem using Relational, NoSQL and Big Data technologies.
- Achieved performance tuning using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrames API and Spark SQL to improve performance.
- Created scripts for importing data into HDFS/Hive using Sqoop from DB2.
- Implemented real-time data ingestion using Kafka (see the first sketch after this list).
- Worked on sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
- Built and analyzed datasets using R, SAS, Matlab, and Python (in decreasing order of usage).
- Developed Pig scripts to parse the raw data, populate staging tables and store the refined data in partitioned DB2 tables for Business analysis.
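First, a minimal Spark Structured Streaming sketch of the Kafka ingestion mentioned above. The broker address, topic, schema, and output paths are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Ingest JSON events from a Kafka topic and land them as Parquet files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/landing/events/")
    .option("checkpointLocation", "/data/checkpoints/events/")
    .start()
)
query.awaitTermination()
```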
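Second, a minimal pandas/NumPy sketch of the data cleaning and feature-scaling work mentioned above. The input file and column names are hypothetical placeholders.

```python
# Basic cleaning, feature engineering, and feature scaling with pandas and NumPy.
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")  # placeholder input

# Cleaning: drop duplicates and fill missing numeric values with the median
df = df.drop_duplicates()
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Feature engineering: log-transform a skewed feature
df["log_spend"] = np.log1p(df["monthly_spend"])

# Feature scaling: z-score standardization
df["spend_scaled"] = (df["log_spend"] - df["log_spend"].mean()) / df["log_spend"].std()
```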
Confidential, New Jersey
Data Engineer (Spark Developer)
Environment: PySpark, Spark, Kafka, MySQL, Power BI, GCP, Kibana
Responsibilities:
- Transformed the data using StreamSets; built multiple pipelines from a variety of sources (S3, GCS).
- Gathered streaming data from Pub/Sub (Kafka) and saved it to Google Cloud Storage using StreamSets.
- Used Spark for batch processing in Databricks; worked with CSV, JSON, Avro, and other data storage formats.
- Loaded transformed (certified) data from S3 or Google Cloud Storage into BigQuery for additional analysis; data from S3 is first staged in GCS via Databricks and then loaded into BigQuery, into tables named Version and Data Proc (see the first sketch after this list).
- Used SQL in BigQuery to perform table joins and count records; BigQuery table joins are also performed in Looker using LookML code, depending on the requirement.
- Created accounts and managed access for Looker.
- Imported Sankey, bubble chart, and other custom visualizations into Looker to represent the information.
- Used Looker to count records and error records coming from StreamSets.
- The dashboard is refreshed every 24 hours.
- For storing the data collected from pub/sub and multiple vendors, we used Google Cloud Storage and Amazon S3.
- Designed and implemented an end-to-end search service solution using Elasticsearch; used various aggregations such as metrics and average aggregations.
- Loaded data from S3 into Elasticsearch using Logstash.
- Created reporting dashboards using Kibana, which analyzes the large volumes of data stored in Elasticsearch and provides reports and dashboards.
- Worked on Python scripting for web scraping (a web crawler) using Beautiful Soup; collected data from websites such as Nordstrom and Neiman Marcus for internal analytics (see the second sketch after this list).
- Converted scripts stored in Google Cloud Storage using Spark with Scala; Scala and Spark are used to process both schema-oriented and non-schema-oriented data.
- Used the Jira tool for ticket creation and followed the Agile Scrum process.
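First, a minimal sketch of loading certified data from GCS into BigQuery using the google-cloud-bigquery client. The project, dataset, table, and bucket names are hypothetical placeholders, and authenticated GCP credentials are assumed.

```python
# Load Parquet files from a GCS prefix into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/certified/orders/*.parquet",   # placeholder source
    "example-project.analytics.orders",                  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

table = client.get_table("example-project.analytics.orders")
print(f"Loaded table now has {table.num_rows} rows")
```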
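Second, a minimal Beautiful Soup sketch of the web-scraping work mentioned above. The URL and CSS selectors are hypothetical placeholders, not the actual sites or page structures used.

```python
# Fetch a catalog page and extract product names and prices with Beautiful Soup.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/catalog"  # placeholder URL
response = requests.get(url, headers={"User-Agent": "internal-analytics-bot"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

products = []
for card in soup.select("div.product-card"):          # placeholder selector
    name = card.select_one("h2.product-name")
    price = card.select_one("span.price")
    if name and price:
        products.append({"name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})

print(f"Scraped {len(products)} products")
```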
Confidential
ETL/PowerBI Developer
Responsibilities:
- Involved in writing heavy T-SQL stored procedures and complex joins, and used newer T-SQL features such as the MERGE and EXCEPT commands (see the sketch after this list).
- Developed data standards, data exchange formats, and XML data standards/data-sharing models.
- Developed SSIS packages using Lookup transformations, Merge Joins, Fuzzy Lookups, and Derived Columns with multiple Data Flow tasks.
- Created XML package configurations and implemented error handling using event handlers for the OnError, OnWarning, and OnTaskFailed event types.
- Worked with redirecting outputs into different formats, including XML files and various other data sources.
- Involved in automating SSIS Packages using SQL Server Agent Jobs, Windows Scheduler, and third-party tools.
- Worked on several transformations in Data Flow including Derived Column, Script, Slowly Changing Dimension, Lookup, Data Conversion, Conditional Split, and many more.
- Worked on XML Web Services Control Flow to read from XML Source.
- Designed and developed various SSIS packages (ETL) to extract and transform data and was involved in Scheduling SSIS packages.
- Developed ETL solutions for integrating data from multiple sources like Flat File (delimited, fixed-width), Excel, SQL Server, Raw File, and DB2 into the data warehouse.
- Interpreted data, analyzed results using statistical techniques, and provided ongoing SSRS and PowerBI reports.
- Created, maintained and updated ETL packages loading from/to both OLTP and OLAP database tables.
- Developed and implemented data collection systems and other strategies which optimize statistical efficiency and data quality.
- Worked on the Reports module of the projects as a developer on MS SQL Server 2008 (T-SQL, Scripts, stored procedures, and views).
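A minimal sketch of the MERGE-style upsert referenced above. The MERGE itself is T-SQL; it is shown here executed from Python via pyodbc for consistency with the other sketches, and the connection string, tables, and columns are hypothetical placeholders.

```python
# Run a T-SQL MERGE (upsert) against SQL Server through pyodbc.
import pyodbc

MERGE_SQL = """
MERGE dbo.DimCustomer AS target
USING staging.Customer AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
    UPDATE SET target.Name = source.Name, target.City = source.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, City)
    VALUES (source.CustomerID, source.Name, source.City);
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=example-server;"
    "DATABASE=DataWarehouse;Trusted_Connection=yes;"
)
with conn:
    conn.execute(MERGE_SQL)  # the context manager commits on a clean exit
```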
Confidential
Data Analyst
Environment: PostgreSQL, Pentaho Data Integration, Power BI, MySQL, SSIS, SSRS
Responsibilities:
- Involved in analyzing unstructured data and its formats.
- Involved in designing the Data warehouse and data loading to the target tables.
- Hands-on experience with data warehouse Star Schema modeling, Snowflake modeling, fact and dimension tables, and physical and logical data modeling.
- Designed and debugged complex code using PostgreSQL and MySQL.
- Involved in developing custom ETL mappings to load data from Excel files into the data warehouse (see the sketch after this list).
- Monitored and scheduled ETL jobs using Pentaho Data Integration.
- Developed reports and dashboards using Power BI.
- Created Technical Design and Data Lineage documents.
- Proficient in writing Confidential - SQL Statements, Complex Stored Procedures, Dynamic SQL queries, Batches, Scripts, Functions, Triggers, Views, Cursors, and Query Optimization.
- Interacted and coordinated with the development team, testing team, and users on a regular basis to understand the system and ensure the functionality remained intact.
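The Excel-to-warehouse mappings referenced above were built in Pentaho Data Integration; as a rough Python analogue of the same load, the sketch below uses pandas and SQLAlchemy. The file name, table name, and connection string are hypothetical placeholders.

```python
# Illustrative analogue of an Excel-to-warehouse load (the real mappings used Pentaho).
import pandas as pd
from sqlalchemy import create_engine

# Placeholder PostgreSQL connection for the warehouse
engine = create_engine("postgresql+psycopg2://etl_user:password@localhost:5432/dwh")

# Read the source workbook and do light cleanup before loading
sales = pd.read_excel("monthly_sales.xlsx", sheet_name="Sheet1")
sales.columns = [c.strip().lower().replace(" ", "_") for c in sales.columns]
sales = sales.dropna(subset=["order_id"])

# Append into a staging table in the warehouse
sales.to_sql("stg_monthly_sales", engine, schema="staging",
             if_exists="append", index=False)
```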
Confidential
Java Developer
Environment: Java, Spring MVC, MS SQL, Hibernate, HTML, JavaScript, CSS, Maven, Git, Postman, Eclipse
Responsibilities:
- Developed web applications using Java, Spring MVC & Microservices.
- Good experience with and exposure to most J2EE technologies, including Spring 3.0, Spring MVC, RESTful web services, Hibernate, JDBC, JSP, XML, and JavaScript.
- Experience in SOA and developing Web Services using SOAP, REST, XML, WSDL, and JAXP XML Beans.
- Extensive experience in the development and implementation of Model View Controller frameworks using Spring MVC.
- Extensive experience in the development and implementation of the Hibernate ORM framework with Spring Data Access.
- Good Knowledge on Object-Oriented Programming (OOP), Analysis and Design (OOAD) concepts, and designing.
- Experience in designing and developing J2EE/JEE-compliant systems using IDE tools such as Eclipse and RAD, and in deploying these applications in Windows-based local/development/integration domain configurations with application server instances configured on top of WebLogic or WebSphere.
- Performed a fit-gap analysis of the customer requirements and customized existing data models and the data warehouse.
- Debugged complex errors that caused the business to change the existing functionality.
- Extensive production support, monitoring the application, debugging & defect fixing.
- Conducted knowledge transfer sessions for fellow team members.
- Participated in and contributed to Agile ceremonies: Daily Scrum, Backlog Refinement, Sprint Planning, Sprint Retrospectives, Sprint Review, and Demo.
- Worked with the technology team to understand the requirements and then translated them into technical design and detailed design documents.