Data Engineer / Analyst Resume

Richmond, VA

SUMMARY

  • 7+ years of technical experience as a Data Engineer / Analyst, analyzing clients' business needs, developing effective and efficient solutions, and ensuring client deliverables within committed timelines.
  • Deep knowledge and strong deployment experience in the Hadoop and Big Data ecosystems - HDFS, MapReduce, Spark, Sqoop, Hive, Kafka, ZooKeeper, and HBase.
  • Hands-on experience with Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (Airflow as a service).
  • Extensively worked on Spark with Scala on the cluster for computational analytics, installed Spark on top of Hadoop, and built advanced analytical applications using Spark with Hive and SQL/Oracle.
  • Experienced Data Modeler with strong Conceptual, Logical, and Physical Data Modelling skills, Data Profiling skills, and Data Quality maintenance; experienced with JAD sessions for requirements gathering, creating data mapping documents, and writing functional specifications and queries, as well as Dimensional Data Modelling with FACT and Dimension tables.
  • Expertise in AWS resources such as EC2, S3, EBS, VPC, ELB, SNS, RDS, IAM, Route 53, Auto Scaling, CloudFormation, CloudWatch, and Security Groups.
  • Experience in using Snowflake Clone and Time Travel.
  • Experience using the Stackdriver service and Dataproc clusters in GCP to access logs for debugging.
  • Skilful in Data Analysis using SQL on Oracle, MS SQL Server, DB2, Teradata, and AWS.
  • Experienced in troubleshooting ETL jobs and Data Warehouse and Data Mart data store models.
  • Assisted in creating communication materials based on data for key internal/external audiences.
  • Wrote PySpark jobs in AWS Glue to merge data from multiple tables and utilized Crawlers to populate the AWS Glue Data Catalog with metadata table definitions (a sketch follows this list).
  • Strong experience in migrating other databases to Snowflake.
  • Expert in documenting the Business Requirements Document (BRD), generating the UAT Test Plans, maintaining the Traceability Matrix, and assisting in Post Implementation activities.
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery, as well as Azure Data Factory and Databricks.
  • Experience in building efficient pipelines for moving data between GCP and Azure using Azure Data Factory.
  • Explored data in a variety of ways and across multiple visualizations using Power BI.
  • Enterprise Data Modeler with a deep understanding of developing Enterprise Data Models that strictly meet Normalization Rules, as well as Enterprise Data Warehouses using Kimball and Inmon Data Warehouse methodologies.
  • Knowledgeable in Best Practices and Design Patterns, Cube design, BI Strategy and Design, and 3NF Modeling.
  • Delivered zero defect code for three large projects which involved changes to both the front end (web services) and back end (Oracle, Teradata).
  • Experience in building and architecting multiple Data pipelines, end-to-end ETL, and ELT processes for Data ingestion and transformation in GCP and coordinating tasks among the team.
  • Constructing and manipulating large datasets of structured, semi-structured, and unstructured data and supporting systems application architecture using tools like SAS, SQL, Python, R, Minitab, Power BI, and more to extract multi-factor interactions and drive change.
  • Generated scripts in AWS Glue to transfer data and utilized AWS Glue to run ETL jobs and aggregations in PySpark code.
  • Experience in publishing Power BI Desktop reports created in Report view to the Power BI service.
  • Created scripts using the S3 CLI tools for creating new snapshots and deleting existing snapshots in S3.
  • Expertise with Hive data warehouse architecture, including table creation, data distribution using Partitioning and Bucketing, and query development and tuning in Hive Query Language (illustrated after this list).
  • Experienced in leading the enhancement, architecture, and ongoing evolution of solutions using a wide array of technologies (Spark, Python, Apigee, Delta Lake, Databricks, Kafka, as well as more traditional technologies such as MuleSoft and SQL) across the Amazon cloud environment.
  • Knowledge of job workflow management and monitoring tools such as Oozie and ZooKeeper.
  • Proficient in writing Bash, Perl, and Python scripts to automate and provide control flow.
  • Handled performance tuning, conducted backups, and ensured the integrity and security of Postgres databases in the AWS environment, including Aurora PostgreSQL.
  • Familiar with Data Stage production job scheduling.
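
The AWS Glue work noted above (merging data from multiple Crawler-cataloged tables with PySpark) can be illustrated with a minimal sketch. The database, table, key, and bucket names below are hypothetical placeholders, not actual client objects.

    # Minimal AWS Glue (PySpark) job sketch: join two cataloged tables and write Parquet to S3.
    # Database, table, and bucket names are hypothetical placeholders.
    import sys
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Tables discovered by a Glue Crawler and registered in the Data Catalog
    orders = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
    customers = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="customers")

    # Merge on the shared key and land the result in S3 as Parquet
    merged = Join.apply(orders, customers, "customer_id", "customer_id")
    glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/merged/"},
        format="parquet",
    )
    job.commit()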
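
Likewise, the Hive partitioning experience described above can be sketched through Spark SQL; the table, columns, and staging table are assumptions for illustration only, and bucketing would typically be added when running the DDL directly in Hive, since Spark's support for bucketed Hive-serde tables varies by version.

    # Illustrative sketch of a partitioned Hive table created and loaded via Spark SQL.
    # Table, column, and staging-table names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partitioning-demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Partition by load date so queries prune to only the days they need
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_events (
            customer_id BIGINT,
            amount      DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
    """)

    # Dynamic partition insert from a staging table; the partition column goes last in the SELECT
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE sales_events PARTITION (load_date)
        SELECT customer_id, amount, load_date
        FROM staging_sales_events
    """)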

TECHNICAL SKILLS

PROGRAMMING: Python, R, SAS, SQL, Scala, Shell Scripting

DATABASE DESIGN TOOLS: MS Visio, Fact and Dimension tables, Normalization and Denormalization techniques, Kimball and Inmon Methodologies

DATA MODELLING TOOLS: Erwin Data Modeler and Manager, ER Studio v17, physical and logical data modeling

ETL/DATA WAREHOUSE TOOLS: Informatica PowerCenter, Talend, Tableau, Pentaho, SSIS, DataStage

QUERYING LANGUAGES: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL, Spark SQL, Sqoop 1.4.4

DATABASES: AWS RDS, Teradata, Hadoop FS, SQL Server, Oracle, Netezza, Microsoft SQL, DB2.

NOSQL DATABASES: MongoDB, Hadoop HBase, Apache Cassandra

CLOUD TECHNOLOGIES: AWS, Azure

HADOOP ECOSYSTEM: Hadoop, MapReduce, Yarn, HDFS, Kafka, Storm, Pig, Oozie

BIG DATA ECOSYSTEM: Spark, Spark SQL, Spark Streaming, Hive

INTEGRATION TOOLS: Git, Gerrit, Jenkins, Maven

STREAMING: Flume 1.6, Spark Streaming, Streaming Analytics

METHODOLOGIES: Agile, Scrum, Waterfall, UML

FAMILIAR: Microsoft Office, GitHub, Bitbucket, Slack

PROFESSIONAL EXPERIENCE

Confidential, Richmond, VA

Data Engineer / Analyst

Responsibilities:

  • Extensive hands-on experience with the Big Data Engineering stack, including HDFS, MapReduce, Sqoop, Hive, Pig, HBase, Oozie, Flume, Kafka, ZooKeeper, and Spark.
  • Experience with NoSQL databases like HBase as well as other ecosystem components such as ZooKeeper, Oozie, Impala, Storm, Spark Streaming/SQL, Kafka, and Flume.
  • Unit tested the data between Redshift and Snowflake.
  • Developed and deployed outcomes using Spark and Scala code on the Hadoop cluster running on GCP.
  • Developed Hive, and Bash scripts for source data validation and transformation.
  • Experience in Converting Hive/SQL Queries into Spark transformations using Java and experience in ETL development using Kafka, Flume, and Sqoop.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators, both legacy and newer ones (a sample DAG follows this list).
  • Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
  • Developed Spark Code and Spark-SQL/Streaming for faster testing and processing of data.
  • Good experience with Hive data warehousing concepts such as static/dynamic partitioning, bucketing, managed and external tables, and join operations on tables.
  • Migrated previously written cron jobs to Airflow/Composer in GCP.
  • Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Worked on loading CSV/TXT/AVRO/PARQUET files using Scala/Java in the Spark framework, processed the data by creating Spark DataFrames and RDDs, and saved the files in Parquet format in HDFS.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL.
  • Involved in setting up the Apache Airflow service in GCP.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/Text files) into AWS Redshift.
  • Utilized Power BI to create various analytical dashboards that help business users to get a quick insight into the data.
  • Published and maintained workspaces in Power BI Service, allotted the time refresh for the data, and maintained the apps and workbooks.
  • Developed PySpark code for AWS Glue jobs and for EMR.
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
  • Well versed in major Hadoop distributions, Cloudera and Hortonworks; experienced with the Eclipse and NetBeans IDEs.
  • Used Python and R scripting to implement machine learning algorithms for data prediction and forecasting with better results.
  • Experience in developing packages in RStudio with a Shiny interface.
  • Developed Spark jobs to process all the information and determine the passion points and email promotions for each user; used Sqoop, FTP, APIs, SQS, and S3 copy operations to pull data into HDFS.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed a custom Python program, including CI/CD rules, for GCP Data Catalog metadata management.
  • Wrote Data Pipeline that fetches Adobe Omniture data which is routed to S3 using SQS every hour.
  • Generated and injected vehicle IoT data into the AWS IoT platform using Python.
  • Implemented an AWS Lambda architectural model for handling end-to-end real-time and batch analytic loads.
  • Published IoT data to a Kafka stream, consumed by a Spark module to perform predictive analysis using the Random Forest machine learning algorithm (see the streaming sketch after this list).
  • Utilized Redshift to store the processed records and implemented batch scripts for continuous learning.
  • Built an Elasticsearch cluster integrated with Kibana for publishing real-time dashboards of maintenance data.
  • Handled real-time data using Kafka.
  • Transferred all data from historical jobs to Azure with HDInsight installed.
  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM), focusing on high availability, fault tolerance, and auto scaling with AWS CloudFormation.
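
The Airflow pipelines on GCP mentioned in the list above can be sketched as a small Cloud Composer DAG. The operators shown are standard Google provider operators, while the bucket, project, dataset, and table names are hypothetical placeholders.

    # Minimal Cloud Composer (Airflow) DAG sketch: stage a GCS file into BigQuery, then run a transform.
    # Bucket, project, dataset, and table names are hypothetical placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="daily_sales_load",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Load the day's raw CSV files from GCS into a staging table
        load_raw = GCSToBigQueryOperator(
            task_id="load_raw",
            bucket="example-landing-bucket",
            source_objects=["sales/{{ ds }}/*.csv"],
            destination_project_dataset_table="analytics.raw_sales",
            source_format="CSV",
            write_disposition="WRITE_TRUNCATE",
        )

        # Aggregate the staging table into a daily reporting table
        transform = BigQueryInsertJobOperator(
            task_id="transform",
            configuration={
                "query": {
                    "query": "SELECT customer_id, SUM(amount) AS total FROM analytics.raw_sales GROUP BY customer_id",
                    "destinationTable": {
                        "projectId": "example-project",
                        "datasetId": "analytics",
                        "tableId": "sales_daily",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                }
            },
        )

        load_raw >> transform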
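
The Kafka-to-Spark predictive path noted above can be illustrated with a hedged sketch using Spark Structured Streaming and a pre-trained Random Forest pipeline. The topic, broker address, schema, model path, and output locations are all assumptions, not the actual project configuration.

    # Hedged sketch: score streaming Kafka events with a previously trained Random Forest pipeline.
    # Requires the Spark-Kafka connector package on the cluster; all names and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.appName("iot-rf-scoring").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("engine_temp", DoubleType()),
        StructField("vibration", DoubleType()),
    ])

    # Previously trained pipeline (feature assembler + Random Forest classifier)
    model = PipelineModel.load("s3://example-bucket/models/maintenance_rf")

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "vehicle-iot")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    scored = model.transform(events)

    # Persist predictions; a downstream job or dashboard reads this location
    query = (
        scored.select("device_id", "prediction")
        .writeStream
        .format("parquet")
        .option("path", "s3://example-bucket/scored/")
        .option("checkpointLocation", "s3://example-bucket/checkpoints/scored/")
        .start()
    )
    query.awaitTermination()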

Confidential, Chicago, IL

Data Engineer

Responsibilities:

  • Encoded and decoded JSON objects using PySpark framework to create and modify the data frames in Apache Spark.
  • Experience working with project managers to understand and design workflow architectures per requirements, and with data scientists to assist with feature engineering.
  • Created Source Target Mappings (STM) for the required tables by understanding the business requirements for the reports.
  • Hands-on experience with Big Data application phases like data ingestion, data analytics, and data visualization.
  • Developed dashboards for visualization using the cloud dashboard tools Looker and Amazon QuickSight for continuous monitoring of real-time data.
  • Ran a POC to explore AWS Glue capabilities for data cataloging and data integration.
  • Experience in writing Python code to schedule jobs dynamically in Apache Airflow.
  • Worked extensively on Airflow for orchestration and scheduling of the ingestion scripts.
  • Experience in developing APIs for Data Catalog maintenance using Alation.
  • Created coherent Logical Data Models that helped guide important client business decisions.
  • Experience improving overall Data Quality with the use of tools like Infogix.
  • Extensively worked on various data analysis and transformation BI tools like Dremio and Looker.
  • Experience working with different file formats like Parquet and fixed file formats.
  • Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed (a sketch follows this list).
  • Worked on various POCs to adopt new technologies like Apache Airflow, Snowflake, and Terraform for infrastructure management.
  • Responsible for managing BAU workflows and L2 support team on a 24/7 schedule.
  • Experience in performing ETL operations and in debugging and fixing issues such as memory-exceeded errors on EMR clusters.
  • Experience in reducing the latency of spark jobs for faster data processing by tweaking spark configurations and other optimization techniques.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Experience working with tools like Airflow for scheduling jobs and ad-hoc manual jobs.
  • Experience with the Continuous Integration process and automated deployments using Git, Jenkins, and Docker, developing scripts in Python and Bash.
  • Used AWS Redshift to query and combine exabytes of structured and semi-structured data across the data warehouse.
  • Developed python scripts for data ingestion process in Apache Spark using PySpark.
  • Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
  • Experience in containerization technology like Docker for the runtime environment of the system to build, test & deploy.
  • Practiced in clarifying business requirements and performing gap analysis between goals and existing procedures/skills.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark for faster processing of data.
  • Experience in automating ETL processes, making it easier to wrangle data and reducing processing time.
  • Experience working in Agile methodology by actively participating in grooming and planning sessions on a sprint basis.
  • Built scalable databases capable of ETL processes using SQL and PySpark.
  • Experience in building scalable distributed data solutions using an EMR cluster environment with Amazon EMR.
  • Experience implementing industry-level coding standards using tools like Pylint and pep8.
  • Used AWS services like EC2 and S3 for small data set processing and storage; experienced in maintaining the Hadoop cluster on Amazon EMR.
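
The STM-driven PySpark transformations on Amazon EMR referenced in the list above can be sketched as follows; the mapping, column names, and S3 paths are hypothetical stand-ins for the real STM documents.

    # Hedged sketch of an STM-driven PySpark transform on EMR; paths, columns, and the mapping are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("stm-transform").getOrCreate()

    # Source-to-target mapping: source column -> (target column, target type)
    stm = {
        "cust_no": ("customer_id", "bigint"),
        "ord_amt": ("order_amount", "double"),
        "ord_dt":  ("order_date", "date"),
    }

    source = spark.read.parquet("s3://example-raw-bucket/orders/")

    # Rename and cast every mapped column according to the STM
    target = source.select(
        [col(src).cast(dtype).alias(tgt) for src, (tgt, dtype) in stm.items()]
    )

    target.write.mode("overwrite").parquet("s3://example-curated-bucket/orders/")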

Confidential, Richmond, VA

Data Engineer / Analyst

Responsibilities:

  • Worked with the Data Vault methodology; developed normalized Logical and Physical database models.
  • Developed Data Mapping, Data Governance, Transformation, and cleansing rules for the Master Data Management Architecture involving OLTP, ODS, and OLAP.
  • Worked on performance tuning of the database, including indexes and optimizing SQL statements.
  • Created tables, views, sequences, triggers, tablespaces, constraints, and generated DDL scripts for physical implementation.
  • Developed mapping spreadsheets for the ETL team with source-to-target data mapping, including physical naming standards, datatypes, volumetrics, domain definitions, and corporate metadata definitions.
  • Created reports in Looker based on Snowflake connections.
  • Established and maintained comprehensive data model documentation including detailed descriptions of business entities, attributes, and data relationships.
  • Implemented Data Vault Modelling concepts, which solved the problem of dealing with change in the environment by separating the business keys and the associations between those business keys from the descriptive attributes of those keys, using HUB, LINK, and SATELLITE tables.
  • Used AWS Glue for data transformation, validation, and cleansing (a sketch follows this list).
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Maintained and worked with the data pipeline that transfers and processes several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive, and Impala.
  • Applied data analysis, data mining, and data engineering to present data clearly, documenting detailed production-level processes using Workflow Diagrams, Sequence Diagrams, Activity Diagrams, and Use Case Modelling.
  • Worked with AWS cloud services (VPC, EC2, S3, RDS, Redshift, Data Pipeline, EMR, DynamoDB, WorkSpaces, Lambda, Kinesis, SNS, SQS).
  • Worked on Power BI reports using multiple types of visualizations including line charts, doughnut charts, tables, matrix, KPI, scatter plots, box plots, etc.
  • Created and administered workspaces for each project in the Power BI service and published reports from Power BI Desktop to the Power BI Service workspace.
  • Utilized Power BI to create various analytical dashboards that help business users gain quick insight into the data.
  • Involved in creating Physical and Logical models using Erwin.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Involved with data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
  • Expert in data analysis, design, development, implementation, and testing using data conversions and Extraction, Transformation, and Loading (ETL) with Oracle, SQL Server, and other relational and non-relational databases.
  • Document all data mapping and transformation processes in the Functional Design documents based on the business requirements.
  • Generated ad-hoc SQL queries using joins, database connections, and transformation rules to fetch data from legacy DB2 and SQL Server database systems.
  • Highly proficient in Data Modelling, retaining concepts of RDBMS, Logical and Physical Data Modelling up to Third Normal Form (3NF), and Multidimensional Data Modelling schemas (Star schema, Snowflake modelling, Facts, and Dimensions).
  • Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small data set processing and storage, and maintained Hadoop on AWS EMR.
  • Generated ad-hoc SQL queries using joins, database connections, and transformation rules to profile data from DB2 and SQL Server database systems.
  • Worked with data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
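
The AWS Glue transformation, validation, and cleansing work noted above can be illustrated with a minimal sketch; the database, table, columns, and validation rule are assumptions for illustration only.

    # Minimal AWS Glue cleansing sketch: pin an ambiguous column type, drop invalid rows, write Parquet.
    # Database, table, column, and path names are hypothetical placeholders.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.context import SparkContext
    from pyspark.sql.functions import col

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    raw = glue_context.create_dynamic_frame.from_catalog(database="mdm_db", table_name="customer_raw")

    # Resolve ambiguous column types to a single type
    typed = raw.resolveChoice(specs=[("customer_id", "cast:long")])

    # Basic validation/cleansing in Spark: drop rows missing the key and keep only positive ids
    df = typed.toDF()
    clean_df = df.dropna(subset=["customer_id"]).filter(col("customer_id") > 0)
    cleaned = DynamicFrame.fromDF(clean_df, glue_context, "cleaned")

    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/customer_clean/"},
        format="parquet",
    )
    job.commit()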

Confidential

Data Engineer

Responsibilities:

  • Involved in data mapping specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from an internal data warehouse, transformed, and sent to an external entity.
  • Documented logical, physical, relational, and dimensional data models. Designed the Data Marts in dimensional data modelling using star and snowflake schemas.
  • Involved in building a scalable distributed data lake system for Confidential's real-time and batch analytical needs.
  • Involved in designing, reviewing, optimizing data transformation processes using Apache Storm.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark (see the sketch after this list).
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Imported data from Kafka consumers into HBase using Spark Streaming.
  • Used the AWS Glue Catalog with crawlers to get data from S3 and perform SQL query operations.
  • Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop as well as system specific jobs.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, and transformations during the ingestion process itself.
  • Prepared documentation for all entities, attributes, data relationships, primary and foreign key structures, allowed values, codes, business rules, and glossary terms as they evolved and changed during the project.
  • Coordinated with the DBA on database builds and table normalizations and de-normalizations.
  • Identified the entities and the relationships between them to develop the Conceptual Model using ERWIN.
  • Developed Logical Model from the conceptual model.
  • Responsible for different data mapping activities from source systems.
  • Involved with Data Profiling activities for new sources before creating new subject areas in the warehouse.
  • Extensively worked on Data Governance, i.e., Metadata Management, Master Data Management, Data Quality, and Data Security.
  • Performed complex data analysis in support of ad-hoc and standing customer requests.
  • Enforced referential integrity in the OLTP data model for consistent relationships between tables and efficient database design.
  • Experience in creating UNIX scripts for file transfer and file manipulation.
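
The on-demand table creation over S3 files via Lambda and AWS Glue mentioned above can be sketched with a short boto3 handler; the database, table-naming convention, columns, and file format are hypothetical placeholders.

    # Hedged sketch of a Lambda handler that registers an on-demand Glue table over newly landed S3 files.
    # Database, table name pattern, columns, and format are hypothetical placeholders.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # Triggered by an S3 put event; derive the prefix of the newly landed files
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        prefix = record["object"]["key"].rsplit("/", 1)[0]

        glue.create_table(
            DatabaseName="ondemand_db",
            TableInput={
                "Name": "events_" + prefix.replace("/", "_"),
                "TableType": "EXTERNAL_TABLE",
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "event_id", "Type": "string"},
                        {"Name": "event_ts", "Type": "timestamp"},
                        {"Name": "payload", "Type": "string"},
                    ],
                    "Location": f"s3://{bucket}/{prefix}/",
                    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                    },
                },
            },
        )
        return {"status": "table created", "prefix": prefix}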

Confidential

Data Analyst

Responsibilities:

  • Analysed business requirements, system requirements, and data mapping requirement specifications, and was responsible for documenting functional requirements and supplementary requirements in Quality Centre.
  • Setting up of environments to be used for testing and the range of functionalities to be tested as per technical specifications.
  • Tested Complex ETL Mapping and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
  • Responsible for different data mapping activities from source systems to the EDW, ODS, and data marts.
  • Delivered files in various file formats (e.g., Excel files, tab-delimited text, comma-separated text, pipe-delimited text, etc.).
  • Performed ad hoc analyses as needed, with the ability to interpret and explain the analysis.
  • Involved in Teradata SQL development, unit testing, and performance tuning to ensure testing issues were resolved using defect reports.
  • Tested the database to check field size validation, check constraints, and stored procedures, cross-verifying the field sizes defined within the application against the metadata (a sketch follows this list).
  • Installed, designed, and developed the SQL Server database.
  • Created a logical design of the central relational database using Erwin.
  • Configured the DTS packages to run at periodic intervals.
  • Extensively worked with DTS to load the data from source systems and run it at periodic intervals.
  • Worked with data transformations in both normalized and de-normalized data environments.
  • Involved in data manipulation using stored procedures and Integration Services.
  • Worked on query optimization, stored procedures, views, and triggers.
  • Assisted in OLAP and Data Warehouse environment when assigned.
  • Created tables, views, triggers, stored procedures, and indexes.
  • Designed and implemented database replication strategies for both internal and Disaster Recovery.
  • Created FTP connections and database connections for the sources and targets.
  • Maintained security and data integrity of the database.
  • Developed several forms & reports using Crystal Reports.
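
The field-size cross-verification against database metadata described above can be illustrated with a small Python sketch; the connection string, table, and expected sizes come from an assumed application specification, not the actual system.

    # Hedged sketch of a field-size cross-check: compare application-defined field lengths
    # against SQL Server metadata. Connection string, table, and expected sizes are hypothetical.
    import pyodbc

    # Expected maximum lengths as defined in the application specification
    expected_sizes = {"customer_name": 100, "email": 255, "phone": 20}

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=example-host;DATABASE=edw;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()
    cursor.execute(
        """
        SELECT COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_NAME = 'customer'
        """
    )

    # Flag any column whose metadata length differs from the specification
    for column_name, max_length in cursor.fetchall():
        if column_name in expected_sizes and max_length != expected_sizes[column_name]:
            print(f"Mismatch on {column_name}: metadata={max_length}, spec={expected_sizes[column_name]}")

    conn.close()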
