
Data Engineer Resume


Kentucky

SUMMARY

  • 6+ years of experience as a Data Engineer, with expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.
  • Experience in Big Data/Hadoop, data analysis, and data modeling with applied information technology.
  • Good exposure to function point analysis, including estimation, planning, and design on the DataStage platform.
  • Experience in developing Apache Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL (illustrated in the sketch at the end of this summary).
  • Experience in developing various Spark applications using spark-shell (Scala).
  • Experience in developing ETL applications on large volumes of data using tools such as Spark (Scala), PySpark, and Spark SQL.
  • Experience in using Hadoop ecosystem components such as HDFS, MapReduce, Spark, Sqoop, Hive, HBase, Kafka, and Airflow.
  • Extensively worked with Spark (Scala) on clusters for analytics: installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
  • Expertise in writing MapReduce jobs in Python to process large structured, semi-structured, and unstructured data sets and store them in HDFS.
  • Experience in designing, developing, documenting, and testing ETL jobs; executed mappings in DataStage Server and Parallel jobs to populate tables in data warehouses and data marts.
  • Experience in text analytics, generating data visualizations using Python, and creating dashboards with tools like Tableau and QuickSight.
  • Experience writing bash/shell scripts.
  • Experience in designing Parallel jobs using stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as required. Practical understanding of data modeling (dimensional and relational) concepts and implementation of schemas such as snowflake.
  • Experience in creating configuration files to deploy the SSIS packages across all the environments.
  • Experience with Amazon Web Services (AWS) concepts such as EMR and EC2, which provide fast and efficient processing for Teradata big data analytics.
  • Hands-on experience with ETL tools such as AWS Glue, and with AWS Data Pipeline to move data into Amazon Redshift.
  • Work experience with Amazon Elastic Compute Cloud (EC2) for computational tasks and Simple Storage Service (S3) as the storage mechanism.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Experienced in ingesting data into HDFS from relational databases such as Teradata using Sqoop and exporting data back to Teradata for storage.
  • Experience with SSIS tools such as the Import/Export Wizard, package installation, and SSIS Package Designer.
  • Knowledge of Azure Storage accounts, containers, Blob Storage, Azure Data Lake, Azure Data Factory, Azure SQL Data Warehouse, and Stretch Database.
  • Experience working with Docker Hub, creating Docker images, and handling multiple images, primarily for middleware installations and domain configuration.
  • Strong experience in CI/CD (Continuous Integration/Continuous Delivery) software development pipeline stages such as Commit, Build, Automated Tests, and Deploy using Jenkins.
  • Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
  • Hands-on experience with GCP, BigQuery, Cloud Dataflow, Pub/Sub, and Cloud Shell.
  • Involved in all phases of the software development life cycle using Agile, Scrum, and Waterfall processes.
  • A self-motivated, enthusiastic learner who is comfortable with challenging projects and ambiguity, solving complex problems independently or in a collaborative team.
  • Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
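
The Spark experience summarized above can be illustrated with a minimal PySpark sketch; the paths, view, and column names below are hypothetical placeholders rather than details from any specific engagement.

```python
# Minimal PySpark sketch: an RDD transformation followed by a Spark SQL
# aggregation. Paths, view, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sparksql-demo").getOrCreate()

# RDD transformations: split raw CSV lines and drop malformed records.
raw = spark.sparkContext.textFile("hdfs:///data/raw/events/*.csv")
parsed = (raw.map(lambda line: line.split(","))
             .filter(lambda cols: len(cols) == 3))

# Promote the RDD to a DataFrame and query it with Spark SQL.
events = parsed.toDF(["user_id", "event_type", "amount"])
events.createOrReplaceTempView("events")

summary = spark.sql("""
    SELECT event_type,
           COUNT(*)                    AS event_count,
           SUM(CAST(amount AS DOUBLE)) AS total_amount
    FROM events
    GROUP BY event_type
""")
summary.write.mode("overwrite").parquet("hdfs:///data/curated/event_summary")
```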

TECHNICAL SKILLS

Databases: Snowflake, AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL.

NoSQL Databases: MongoDB, Hadoop HBase and Apache Cassandra.

Programming Languages: Python, Scala, MATLAB.

Cloud Technologies: AWS, Azure, Cloudera, GCP

Data Formats: CSV, JSON, Parquet.

Querying Languages: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server

Integration Tools: Ansible, Docker.

Scalable Data Tools: Hadoop, HDFS, Hive, Apache Spark, Pig, MapReduce, Sqoop.

Operating Systems: Red Hat Linux, Unix, Windows, macOS.

Reporting & Visualization: Tableau, Matplotlib.

Version Control Tools: GitHub, Bitbucket, code-cloud.

PROFESSIONAL EXPERIENCE

Confidential, Kentucky

Data Engineer

Responsibilities:

  • Interacted with clients to gather business and system requirements which involved documentation of processes based on the user requirements.
  • Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, writing results back to an S3 bucket.
  • Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
  • Developed MapReduce programs for pre-processing and cleansing data in HDFS obtained from heterogeneous data sources to make it suitable for ingestion into the Hive schema for analysis.
  • Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
  • Visualized reports using Tableau Desktop and QuickSight.
  • Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful, valuable information for better decision-making.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Designed and developed ETL jobs to extract data from Salesforce replica and load it in DataMart in Redshift.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to data scientists for further analysis.
  • Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Automated AWS infrastructure as code by writing Terraform modules and scripts covering AWS IAM users and groups, AWS Glue, and Redshift clusters.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse.
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
  • Implemented a Continuous Delivery pipeline with Docker and GitHub.
  • Created Hive queries to process large sets of structured, semi-structured, and unstructured data and store them in managed and external tables.
  • Created scripts in Python to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
  • Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway (see the sketch after this list).
  • Migrated data from AWS S3 buckets to Snowflake by writing a custom read/write Snowflake utility function in Scala.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake AWS S3 bucket.
  • Profiled structured, unstructured, and semi-structured data across various sources to identify patterns and implemented data quality metrics using the necessary queries or Python scripts depending on the source.
  • Worked on designing, building, deploying, and maintaining MongoDB.
  • Used SQL queries and other tools to perform data analysis and profiling.
  • Used the Agile (Scrum) methodology for all work performed.
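
A minimal sketch of the Lambda-based event handling referenced above; the bucket, table, and attribute names are hypothetical, and it assumes the boto3 SDK bundled with the Lambda Python runtime plus an existing DynamoDB table.

```python
# Hypothetical sketch of an AWS Lambda handler for S3 "ObjectCreated" events.
# Table and attribute names are illustrative placeholders; assumes boto3 and
# an existing DynamoDB table used to audit ingested files.
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingestion_audit")  # hypothetical audit table

def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # Record basic file metadata so downstream loads can be reconciled.
        table.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": size,
            "event_time": record["eventTime"],
        })
    return {"status": "ok", "processed": len(records)}
```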

Environment: Spark, Scala, AWS, ETL, Hadoop, Python, Snowflake, HDFS, Hive, MapReduce, PySpark, Pig, Docker, GitHub, Apache Spark, Teradata, JSON, PostgreSQL, MongoDB, SQL, Agile and Windows.

Confidential, OH

Data Engineer

Responsibilities:

  • Participated in requirement gathering session with business users and sponsors to understand and document the business requirements.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
  • Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
  • Designed and implemented a configurable data delivery pipeline for scheduled updates to customer-facing data stores, built with Python.
  • Involved in the data ingestion process, loading data from mainframes into HDFS through DataStage.
  • Performed analysis on unused user navigation data by loading it into HDFS and writing MapReduce jobs.
  • Involved in loading data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loading it into Hive tables.
  • Created HBase tables to load large sets of structured data.
  • Involved in importing real-time data into Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Performed real-time event processing of data from multiple servers in the organization using Apache Storm by integrating with Apache Kafka.
  • Set up continuous integration using Jenkins for nightly builds and sent automatic emails to the team.
  • Managed and reviewed Hadoop log files.
  • Used a NiFi data pipeline to process large sets of data and configured lookups for data validation and integrity.
  • Used Pig UDFs to implement business logic in Hadoop.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
  • Created Python script to monitor server load performance in production environment and horizontally scale the servers by deploying new instances.
  • Worked on PySpark APIs for data transformations.
  • Deployed a pipeline in Azure Data Factory (ADF) that processes data using a SQL activity, developed via JSON scripts.
  • Experience in Shell scripting using sh, ksh and bash.
  • Ingested data into Hadoop via Sqoop imports and performed validations and consolidations on the imported data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates (see the streaming sketch after this list).
  • Maintained and developed complex SQL queries, views, functions and reports that qualify customer requirements on Snowflake.
  • Performed data lineage analysis of the existing Informatica and SnapLogic pipelines to generate lineage documentation for the existing ETL data flows.
  • Used Jenkins plugins for code coverage and ran all tests before generating the WAR file.
  • Used Spark-SQL to load JSON data and create schema RDD and loaded it into Hive Tables and handled structured data using Spark SQL.
  • Designed data warehouses and data marts with star schemas, snowflake architecture, and Kimball dimensional modeling.
  • Data ingestion to one or more Azure Services - Azure Data Lake, Azure Storage and processing data in Azure Databricks.
  • Designed and Implemented Sharding and Indexing Strategies for MongoDB servers.
  • Wrote complex SQL queries and PL/SQL stored procedures and converted them into ETL tasks.
  • Data ingestion to one or more Azure Services like Azure Data Storage, Azure Data Lake and Azure SQL.
  • Involved in Agile methodologies, daily scrum meetings, sprint planning.
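
An illustrative PySpark Structured Streaming sketch for the Kafka ingestion referenced above; the broker, topic, and storage paths are hypothetical, the XML parsing applied on the project is omitted, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Illustrative PySpark Structured Streaming sketch: consume messages from a
# Kafka topic and land the raw payloads as Parquet. Broker, topic, and paths
# are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest-demo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "ui-updates")
          .option("startingOffsets", "latest")
          .load())

# Kafka keys and values arrive as binary; cast to string before writing out.
payloads = stream.select(col("key").cast("string"), col("value").cast("string"))

query = (payloads.writeStream
         .format("parquet")
         .option("path", "abfss://datalake@account.dfs.core.windows.net/raw/ui_updates")
         .option("checkpointLocation", "/checkpoints/ui_updates")
         .outputMode("append")
         .start())
query.awaitTermination()
```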

Environment: Spark, Scala, Azure, ETL, Hadoop, Python, Snowflake, HDFS, Hive, MapReduce, PySpark, Pig, Tableau, Teradata, JSON, XML, Apache Kafka, PostgreSQL, SQL, PL/SQL, Agile and Windows.

Confidential, Texas

Data Engineer

Responsibilities:

  • Worked with the business users to gather, define business requirements and analyze the possible technical solutions.
  • Developed Spark scripts using Python and Scala shell commands as per requirements.
  • Developed an ETL framework using Spark and Hive (including daily runs, error handling, and logging) to produce useful data.
  • Developed Pig scripts for the analysis of semi-structured data.
  • Used Pig as ETL tool to do transformations, event joins, filters and some pre-aggregations before storing the data onto HDFS.
  • Involved in converting Hive/SQL queries into Spark Transformations using Spark RDDs and Scala.
  • Used Hive to analyze the Partitioned and Bucketed data and compute various metrics for reporting.
  • Used Kafka to load data into HDFS and move data back to S3 after data processing.
  • Worked on migrating MapReduce programs into Spark transformations using Scala.
  • Used ETL (SSIS) to develop jobs for extracting, cleaning, transforming, and loading data into the data warehouse.
  • Prepared the complete data mapping for all migrated jobs using SSIS.
  • Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
  • Designed and implemented data warehouses and data marts using components of the Kimball methodology, such as the Data Warehouse Bus, conformed facts and dimensions, slowly changing dimensions, surrogate keys, star schema, and snowflake schema.
  • Built pipelines using Apache Airflow in the GCP environment (see the DAG sketch after this list).
  • Used Google BigQuery to process and analyze large volumes of data.
  • Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Worked on Ingestion, Parsing and loading the data from CSV and JSON files using Hive and Spark.
  • Wrote Pig scripts to load processed data from HDFS into MongoDB.
  • Extensively involved in writing SQL queries (sub queries and join conditions) for building and testing ETL processes.
  • Actively participated in code reviews and meetings and helped resolve technical issues.
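
A hypothetical sketch of an Airflow DAG in the spirit of the GCP pipelines referenced above; the DAG id, dataset, and table names are placeholders, and Airflow 2.x with the google-cloud-bigquery client is assumed.

```python
# Hypothetical Airflow DAG sketch: a daily pipeline that runs a BigQuery
# aggregation via the google-cloud-bigquery client. DAG id, dataset, and
# table names are placeholders; assumes GCP credentials on the worker.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def aggregate_daily_sales(**_):
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(
        """
        CREATE OR REPLACE TABLE analytics.daily_sales AS
        SELECT order_date, SUM(amount) AS total_amount
        FROM raw.orders
        GROUP BY order_date
        """
    ).result()  # block until the BigQuery job finishes


with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="aggregate_daily_sales",
        python_callable=aggregate_daily_sales,
    )
```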

Environment: Spark, Scala, ETL, Python, GCP, HDFS, Hive, Kafka, Pig, CSV, JSON, PySpark, SQL, Agile and Windows.

Confidential

Data Engineer

Responsibilities:

  • Involved in requirement analysis, design, development, testing, and documentation.
  • Designed and deployed a Spark cluster and various big data analytics tools, including Spark, Kafka streaming, AWS, and HBase, with the Cloudera distribution.
  • Involved in developing custom UDFs in Python to extend Hive and Pig Latin functionality.
  • Worked on reading multiple data formats on HDFS using Python.
  • Generated workflows through Apache Airflow.
  • Leveraged ETL methods for ETL solutions and data warehouse tools for reporting and analysis.
  • Used CSVExcelStorage to parse files with different delimiters in Pig.
  • Performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools.
  • Involved in extracting source data from Sequential files, XML files, CSV files, transforming and loading it into the target Data warehouse.
  • Managed and monitored Hadoop clusters using Cloudera Manager.
  • Translated business concepts into XML vocabularies by designing XML schemas with UML.
  • Developed code to write canonical model JSON records from numerous input sources to Kafka queues (see the sketch after this list).
  • Developed generic SQL procedures and complex T-SQL statements for report generation.
  • Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
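
An illustrative sketch of publishing canonical JSON records to Kafka, as referenced above; it assumes the kafka-python client library, and the broker address, topic, and record fields are placeholders.

```python
# Illustrative sketch of publishing canonical JSON records to a Kafka topic.
# Assumes the kafka-python library; broker address, topic, and record fields
# are hypothetical placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish_canonical(source_name, payload):
    """Wrap a source payload in the canonical envelope and send it to Kafka."""
    record = {
        "source": source_name,
        "schema_version": 1,
        "data": payload,
    }
    producer.send("canonical-records", value=record)

publish_canonical("orders_csv", {"order_id": "A-1001", "amount": 25.0})
producer.flush()  # ensure buffered messages are delivered before exit
```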

Environment: Scala, Spark, Python, PySpark, ETL, HDFS, Pig, Cloudera, MapReduce, Hive, XML, CSV, JSON, Kafka, SQL.
