Sr Data Engineer Resume
Columbia, SC
SUMMARY
- Over 9 years of experience with emphasis on Big Data technologies / the Hadoop ecosystem, SQL, Python, Java, and J2EE technologies.
- Skilled in programming with the MapReduce framework and the Hadoop ecosystem.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver (a brief BigQuery load sketch in Python follows this summary).
- Good experience with AWS databases such as RDS (Aurora), Redshift, DynamoDB, Glue Catalog, and ElastiCache (Memcached & Redis).
- Experienced in collecting error logs and data logs across the cluster using Flume.
- Excellent working knowledge of the HDFS filesystem and Hadoop daemons such as ResourceManager, NodeManager, NameNode, DataNode, and Secondary NameNode.
- Good experience with MapReduce, HDFS, YARN, Python, Sqoop, HBase, Oozie, Hadoop Streaming, and Hive.
- Experienced in coordinating job flows in the cluster through Oozie and Zookeeper.
- Good experience in implementing advanced procedures such as text analytics and in-memory processing with Apache Impala and Scala.
- Extensively used HBase to perform real-time analytics on data in HDFS.
- Experience with the NumPy, Matplotlib, pandas, Seaborn, and Plotly Python libraries.
- Performed map-side joins on RDDs and imported data from sources such as HDFS and HBase into Spark RDDs.
- Worked on large datasets using PySpark, NumPy, and pandas.
- Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
- Experience in using Apache Kafka for log aggregation.
- Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS and performed real-time analytics on the incoming data.
- Vast experience with Scala and Python; in-depth understanding of MapReduce and the Hadoop infrastructure; focuses on the big picture when problem-solving.
- Parsed data from S3 through Python API calls via Amazon API Gateway, generating batch sources for processing.
- Good understanding of AWS SageMaker.
- Experience with Hortonworks and Cloudera Hadoop environments.
- Set up data in AWS using S3 buckets and configured instance backups to S3; exported and imported data into S3.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats such as text, ZIP, XML, and JSON.
- Experience in supporting data analysis projects using Elastic MapReduce (EMR) on the Amazon Web Services (AWS) cloud.
- Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/data marts from heterogeneous sources.
- Good understanding of software development methodologies, including Agile (Scrum).
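A minimal sketch of the GCS-to-BigQuery loading pattern referenced above, using the google-cloud-bigquery Python client. The project, bucket, dataset, and table names are placeholders, not taken from any actual engagement.

# Hypothetical example: load Parquet files from GCS into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

table_id = "my-project.analytics.events"          # placeholder project.dataset.table
uri = "gs://my-bucket/exports/events/*.parquet"   # placeholder GCS path

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Kick off the load job and wait for it to finish.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")

The same load can also be run from the bq command-line utility; the client library is shown here only because the rest of the examples in this resume are in Python.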
TECHNICAL SKILLS
Hadoop Core Services: HDFS, MapReduce, Spark, YARN, Hive, Pig, Scala, Kafka, Flume, Impala, Oozie, Zookeeper
Big Data Distributions: Hortonworks, Cloudera, Amazon EMR, Azure
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud Computing Tools: AWS, GCP
Languages: Java/J2EE, Python, SQL, Pig Latin, HiveQL, Unix Shell Scripting
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB
Application Servers: WebLogic, WebSphere, JBoss, Tomcat, Jetty
Databases: Oracle, MySQL, SQL Server
Operating Systems: Windows, Linux
Web Technologies: HTML5, CSS3, XML, JavaScript, JSON, Servlets, JSP
IDEs / Tools: MS Visual Studio, Eclipse
PROFESSIONAL EXPERIENCE
Confidential, Columbia, SC
Sr. Data Engineer
Responsibilities:
- Delivered business- and application-specific consulting services included in the integrated Teradata solutions.
- Developed Terraform scripts and deployed them via Cloud Deployment Manager to spin up resources such as cloud virtual networks.
- Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
- Made extensive use of the Cloud Shell SDK in GCP to configure and deploy Dataproc, Cloud Storage, and BigQuery services.
- Created tables, views, stored procedures, and other database objects on Teradata; wrote SQL scripts; and loaded data using Teradata load utilities (BTEQ, FastLoad, MultiLoad, TPT, and FastExport).
- Designed and implemented highly distributed, scalable, and performant ETL processes (staging, data enrichment, and robust data delivery) using Teradata, BTEQ, and key Teradata utilities such as FastLoad, MultiLoad, and TPT.
- Owned PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse (a representative sketch follows this role's environment line).
- Used the Stackdriver service and Dataproc clusters in GCP to access logs for debugging.
- Developed routines leveraging Teradata utilities, custom SQL, and workflows to collect data from multiple data sources based on mapping specifications provided to the developer.
- Developed and deployed the solution using Spark and Scala code on a Hadoop cluster running on GCP.
- Independently coded, tested, and debugged advanced programs.
- Performed data warehouse tasks for large projects, including database design, business analysis, collecting business requirements and translating them into technical solutions, and ETL development.
- Provided support and handover to the production team, along with post-production support and maintenance.
- Wrote Linux scripts, run via PuTTY, to stage, move, copy, and delete files.
- Debugged Linux script and job failures.
- Migrated previously written cron jobs to Airflow/Cloud Composer in GCP.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS and NoSQL databases such as HBase and Cassandra using Python.
- Extensively used SQL, NumPy, pandas, scikit-learn, Spark, and Hive for data analysis and model building.
- Involved in converting MapReduce programs into Spark transformations on Spark RDDs using Scala and Python.
- Expertise in developing multi-tier enterprise-level web applications using various J2EE technologies including JSP, Servlets, Struts, Spring, Hibernate, JTA, JDBC, JNDI, JMS, and the Java multithreading API.
- Developed microservices/APIs using Spring Boot.
Environment: Python, PySpark, Hadoop, GCP, Dataproc, BigQuery, Cloud Storage, Airflow, Spark, Kafka, HBase, Pandas, Java, Spring Boot, Scala, Teradata, SQL
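A representative PySpark sketch of the data-service-layer-to-Hive pattern described in this role, assuming Hive support is enabled on the cluster. The database, table, and column names (dsl.orders, dwh.orders_curated, order_ts) are hypothetical.

# Minimal PySpark sketch: read a data-service-layer table, transform, write to the Hive warehouse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("dsl_to_hive")          # illustrative app name
    .enableHiveSupport()             # required to read/write Hive tables
    .getOrCreate()
)

# Read a table from the data service layer (hypothetical database/table).
orders = spark.table("dsl.orders")

# Example transformation: de-duplicate and derive a partition column.
curated = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
)

# Write to the Hive data warehouse as ORC, partitioned by date (assumed layout).
(
    curated.write
    .mode("overwrite")
    .format("orc")
    .partitionBy("order_date")
    .saveAsTable("dwh.orders_curated")
)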
Confidential, Branchburg, NJ
Sr. Data Engineer
Responsibilities:
- Experienced with the complete SDLC process, including staging, code reviews, source code management, and the build process.
- Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
- Implemented Big Data platforms using Cloudera CDH4 for data storage, retrieval, and processing.
- Built data pipelines in Airflow on GCP for ETL-related jobs using a variety of Airflow operators, both old and new (a representative DAG sketch follows this role's environment line).
- Worked with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Moved data between GCP and Azure using Azure Data Factory.
- Built Power BI reports on Azure Analysis Services for better performance.
- Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
- Developed data pipelines using Flume, Sqoop, Pig, and Map Reduce to ingest data into HDFS for analysis.
- Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
- Implemented Scala scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark for data aggregation, queries, and writing data into HDFS through Sqoop.
- Developed a pipeline for continuous data ingestion using Kafka and Spark Streaming.
- Wrote Sqoop scripts for importing large data sets from Teradata into HDFS.
- Performed Data Ingestion from multiple internal clients using Apache Kafka.
- Involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
- Developed Flume configurations to extract log data from different sources and transfer data in different file formats (JSON, XML, Parquet) to Hive tables using different SerDes.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Pig.
- Worked on Pig to perform transformations, event joins, filtering, and some pre-aggregations before storing the data in HDFS.
- Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python, and built models using deep learning frameworks.
- Used the Spring Kafka API to process messages smoothly on the Kafka cluster setup.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Executed Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and query data in HBase.
- Developed and executed Hive queries to denormalize the data.
- Developed an Apache Storm, Kafka, and HDFS integration project for real-time data analysis.
- Loaded and transformed structured and unstructured data into HBase, with exposure to handling automatic failover in HBase.
Environment: Cloudera, GCP, Java, Scala, Hadoop, Spark, HDFS, Python, MapReduce, YARN, Hive, Pig, Zookeeper, Impala, Oozie, Sqoop, Flume, API, Airflow, Kafka, Teradata, SQL, GitHub, Phabricator.
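A representative Airflow DAG sketch for the GCP ETL pattern in this role (Dataproc transformation followed by a BigQuery load). It assumes the apache-airflow-providers-google package is installed; the project, region, cluster, bucket, dataset, and SQL are placeholders.

# Hypothetical Airflow DAG: submit a PySpark job to Dataproc, then load curated data into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "us-east1"             # placeholder

with DAG(
    dag_id="daily_ingest_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a PySpark transformation on an existing Dataproc cluster (placeholder names).
    transform = DataprocSubmitJobOperator(
        task_id="dataproc_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/etl/transform.py"},
        },
    )

    # Load the curated output into a reporting table in BigQuery (placeholder SQL and table).
    load_to_bq = BigQueryInsertJobOperator(
        task_id="bq_load_curated",
        configuration={
            "query": {
                "query": "SELECT * FROM staging.curated_events",
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": "reporting",
                    "tableId": "curated_events",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    transform >> load_to_bq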
Confidential, Miami, FL
Data Engineer
Responsibilities:
- Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.
- Used Spark for interactive queries and processing of streaming data, and integrated it with popular NoSQL databases.
- Worked with AWS IAM, Data Pipeline, EMR, S3, and EC2.
- Supported persistent storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations.
- Developed Spark code using Scala and Spark SQL for faster data processing.
- Created Oozie workflows to run multiple Spark jobs.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Wrote Terraform scripts that automate step execution in EMR to load data into ScyllaDB.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Implemented scheduled downtime for non-prod servers for optimizing AWS pricing.
- Denormalized data coming from Netezza as part of the transformation and loaded it into NoSQL databases and MySQL.
- Developed various machine learning models such as logistic regression, KNN, and gradient boosting with pandas, NumPy, Seaborn, Matplotlib, and scikit-learn in Python.
- Developed Kafka consumer API in Scala for consuming data from Kafka topics.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system, programming in Scala (a simplified PySpark streaming sketch follows this role's environment line).
- Implemented data quality checks using Spark Streaming and flagged bad and passable records in the data.
- Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
- Owned PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
- Good knowledge of setting up batch intervals, slide intervals, and window intervals in Spark Streaming using the Scala programming language.
- Implemented Spark SQL with various data sources such as JSON, Parquet, ORC, and Hive.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala.
- Developed Spark scripts using Scala shell commands per requirements.
Environment: HDFS, Spark, Scala, Tomcat, Netezza, EMR, Oracle, Sqoop, AWS, Terraform, ScyllaDB, Cassandra, MySQL, Oozie
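A simplified sketch of the Kafka-to-Spark streaming flow in this role. The original jobs were written in Scala with Spark Streaming; the version below uses PySpark Structured Streaming purely for illustration, writes Parquet to HDFS as a stand-in sink, and uses placeholder broker, topic, and path names.

# Hypothetical PySpark Structured Streaming job: consume a Kafka topic and persist it to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

# Subscribe to a Kafka topic (placeholder brokers/topic).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "learner-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string and add an ingest timestamp.
events = (
    raw.selectExpr("CAST(value AS STRING) AS payload")
    .withColumn("ingest_ts", F.current_timestamp())
)

# Persist the stream as Parquet with checkpointing for fault tolerance (placeholder paths).
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/learner_events")
    .option("checkpointLocation", "hdfs:///checkpoints/learner")
    .outputMode("append")
    .start()
)

query.awaitTermination()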
Confidential, Livermore, CA
Big Data Engineer / Hadoop Developer
Responsibilities:
- Interacted with business partners, Business Analysts and product owner to understand requirements and build scalable distributed data solutions using Hadoop ecosystem.
- Developed Spark Streaming programs to process near real-time data from Kafka, with both stateless and stateful transformations.
- Worked with the Hive data warehouse infrastructure: creating tables, distributing data via partitioning and bucketing, and writing and optimizing HQL queries.
- Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer, which reduced execution time by 60%.
- Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
- Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, ORC, and other compressed formats.
- Strong understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Involved in converting MapReduce programs into Spark transformations on Spark RDDs using Scala and Python.
- Developed Pig UDFs for manipulating data according to business requirements and also developed custom Pig loaders.
- Experience in report writing using SQL Server Reporting Services (SSRS) and creating various report types such as drill-down, parameterized, cascading, conditional, table, matrix, chart, and sub-reports.
- Used the DataStax Spark connector to store data into and retrieve data from the Cassandra database.
- Wrote Oozie scripts and set up workflows using the Apache Oozie workflow engine for managing and scheduling Hadoop jobs.
- Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
- Owned PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
- Implemented a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Transformed data using AWS Glue dynamic frames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the workflow feature (a simplified Glue script sketch follows this role's environment line).
- Worked on installing the cluster, commissioning and decommissioning data nodes, name node recovery, capacity planning, and slot configuration.
- Developed data pipeline programs with Spark Scala APIs, data aggregations with Hive, and formatted data (JSON) for visualization.
Environment: AWS, Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, Flume, Apache Oozie, Zookeeper, ETL, UDF, MapReduce, Snowflake, Apache Pig, Python, Java, SSRS.
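A simplified AWS Glue script sketch for the dynamic-frame transformation described in this role. The catalog database, table, column mappings, and S3 path are placeholders, not the project's actual catalog entries.

# Hypothetical Glue ETL job: read from the Glue Catalog, remap columns, write Parquet back to S3.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a cataloged source table as a DynamicFrame (placeholder database/table).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# Rename and cast a few columns as a simple transformation example (placeholder mappings).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("event_ts", "string", "event_time", "timestamp"),
    ],
)

# Write the transformed data back to S3 as Parquet; a crawler can then catalog the output.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/events/"},  # placeholder bucket
    format="parquet",
)

job.commit()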
Confidential
Big data/Java Developer
Responsibilities:
- Standardized practices for data acquisition & analysis to deliver new products using Big Data technologies.
- Directed data streaming in Kafka and applied Scrum methodology for data management and analytics work tracked in Jira.
- Involved in the complete project life cycle, i.e., design, implementation, and unit testing.
- Extensively used Agile development methodology and was involved in sprint planning.
- Designed and modified user interfaces using JSP, JavaScript, HTML5, AngularJS, and jQuery with the help of design patterns such as Singleton, Factory, and MVC.
- Involved in migrating legacy projects to the latest versions of Spring and Hibernate.
- Used DAOs to handle connections and retrieve data from data storage.
- Wrote microservices for data export/import and task scheduling using Spring Boot, Spring, and Hibernate; also used Swagger API tools while developing the microservices.
- Implemented Hibernate to persist data into the database and wrote HQL-based queries to implement CRUD operations on the data.
- Created annotated POJOs using Hibernate annotations; familiar with named queries and parameterized queries in Hibernate.
- Worked on SQL and PL/SQL using SQL Developer for the Oracle database.
- Deployed the application under Apache Tomcat, maintained application logs using Log4j, and performed unit testing using JUnit.
- Used Maven to define dependencies/plugins and build the application.
- Used Java 8 features such as lambda expressions and the Stream API for bulk data operations on collections, which improved application performance.
- Used SVN for version control.
- Used Jenkins for deploying the application to test and production environments.
- Designed and developed SOAP web services to make submissions.
- Created and maintained various message queues and message brokers that were part of the application; JMS was used extensively in the application for sending budget-related alerts via SMS, email, etc.
- Extensively used Spark to read data from S3 and process it and write it to final Hive tables in HDFS.
- Worked with AWS IAM, Data Pipeline, EMR, S3, and EC2.
- Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations.
Environment: Hadoop, Kafka, AWS, Java 7, Spring 4, Spring MVC, Spring AOP, Spring Data, JPA, Hibernate 3, SQL, Microservices, Spring Boot, RESTful web services, JSON, JUnit 4, SVN, JavaScript, Log4j, Jenkins, Tomcat, Jira.
Confidential
Java/J2EE Developer
Responsibilities:
- Used JSP pages through a servlet controller for the client-side view.
- Created jQuery and JavaScript plug-ins for the UI.
- Followed Java/J2EE best practices to minimize unnecessary object creation.
- Implemented RESTful web services with the Struts framework.
- Verified them with the JUnit testing framework.
- Working experience with an Oracle 10g backend database.
- Used JMS Queues to develop Internal Messaging System.
- Developed the UML Use Cases, Activity, Sequence and Class diagrams using Rational Rose.
- Developed Java, JDBC, and Java Beans using JBuilder IDE.
- Developed JSP pages and Servlets for customer maintenance.
- Apache Tomcat Server was used to deploy the application.
- Involved in building the modules in a Linux environment with Ant scripts.
- Used Resource Manager to schedule jobs on the Unix server.
- Performed unit testing and integration testing for all modules of the system.
- Developed JavaBean components utilizing AWT and Swing classes.
Environment: Java, JDK, Servlets, JSP, HTML, JBuilder, JavaScript, CSS, Tomcat, Apache HTTP Server, XML, JUnit, EJB, RESTful, Oracle.