Senior Big Data Engineer Resume
West Lake, TX
SUMMARY
- 8+ years of IT industry experience across all phases of the Software Development Life Cycle (SDLC), with skills in Hadoop development, Big Data/Data Engineering, and the design, development, testing and deployment of software systems.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
- Strong knowledge of distributed-systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Strong background in data modeling tools such as Erwin, ER/Studio and PowerDesigner.
- Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform required validations on the data.
- Expertise in data reporting, ad-hoc reporting, graphs, scales, pivot tables and OLAP reporting.
- Solid experience using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, the MapReduce framework, YARN, Scala and Hue.
- Good knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
- Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
- Experience using Kafka and Kafka brokers to initiate Spark contexts and process live streaming data.
- Ingested data into the Snowflake cloud data warehouse using Snowpipe.
- Good knowledge of technologies for systems that handle massive amounts of data in highly distributed mode on Cloudera and Hortonworks Hadoop distributions and Amazon AWS.
- Extensive experience working with micro-batching to ingest millions of files into the Snowflake cloud as files arrive in the staging area.
- Knowledge of job workflow scheduling and locking tools/services such as Oozie, Zookeeper, Airflow and Apache NiFi.
- Designed UNIX shell scripts for automating deployments and other routine tasks.
- Experience working with Amazon AWS services such as EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, IAM, etc.
- Wrote complex HiveQL queries for data extraction from Hive tables and developed Hive User-Defined Functions (UDFs) as required.
- Strong knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
- Experience configuring Zookeeper to coordinate servers in clusters and maintain the data consistency that downstream processing depends on.
- Hands-on experience with visualization tools such as Tableau and Power BI.
- Experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS (an illustrative sketch follows this list), and expertise in using Spark SQL with various data sources such as JSON, Parquet and Hive.
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
- Experience importing and exporting data with Sqoop from HDFS to relational database systems and from relational database systems to HDFS.
- Proficient in relational databases such as Oracle, MySQL and SQL Server. Extensive experience working with NoSQL databases and their integration: DynamoDB, Cosmos DB, MongoDB, Cassandra and HBase.
- Knowledge of integrated development environments such as Eclipse, NetBeans, IntelliJ and STS.
- Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
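Illustrative only, not drawn from any engagement below: a minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern referenced above, assuming a JSON payload; the broker, topic, schema and paths are hypothetical placeholders.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# land the stream on HDFS as Parquet. Requires the Spark Kafka connector
# package at submit time. All names below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Assumed JSON schema of the incoming events
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "events")                        # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers the payload as bytes; cast to string and parse the JSON body
events = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(from_json(col("json"), event_schema).alias("e"))
             .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")              # placeholder HDFS path
         .option("checkpointLocation", "hdfs:///chk/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```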
TECHNICAL SKILLS
Big Data/Hadoop Environment: Cloudera Distribution, HDFS, Yarn, Data Node, Name Node, Resource Manager, Node Manager, MapReduce, Pig, Sqoop, Kafka, HBase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala
Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL, PL/SQL
Cloud Technologies: AWS, Microsoft Azure, Snowflake
Databases: Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL
Frameworks: Django REST framework, MVC, Hortonworks
Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP and Clustering.
ETL/Reporting: Ab Initio, Informatica, Tableau
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistance, Postman
Database Modeling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
Visualization/ Reporting: Tableau, ggplot2, Matplotlib, SSRS and Power BI
Web/App Server: UNIX server, Apache Tomcat
Operating System: UNIX, Windows, Linux, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, West Lake, TX
Senior Big Data Engineer
Responsibilities:
- Worked in a fast-paced Agile development environment to quickly analyze, develop and test potential use cases for the business.
- Developed an application to refresh Power BI reports using an automated trigger API.
- Imported, cleaned, filtered and analyzed data using tools such as SQL, Hive and Pig.
- Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Created various Parser programs to extract data from Autosys, Tibco Business Objects, XML, Informatica, Java, and database views using Scala.
- Integrated and automated data workloads to Snowflake Warehouse.
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Implemented machine learning back-end pipelines with Pandas and NumPy.
- Worked in creating POCs for multiple business user stories using Hadoop ecosystem.
- Developed Spark programs for faster data processing than standard MapReduce programs.
- Developed a near-real-time data pipeline using Spark.
- Imported documents into HDFS and HBase and created HAR files.
- Utilized Apache Spark with Python to develop and execute Big Data Analytics.
- Hands-on coding: wrote and tested code for the ingestion automation process (full and incremental loads); designed the solution and developed the programs for data ingestion using Sqoop, MapReduce, shell scripts and Python.
- Created and maintained an optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
- Implemented a one-time migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL (a migration sketch follows this list).
- Day-to-day responsibilities included developing ETL pipelines into and out of the data warehouse and building major regulatory and financial reports using advanced SQL queries in Snowflake.
- Worked on developing Oozie jobs to create HAR files
- Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
- Developed various automated scripts for data ingestion (DI) and data loading (DL) using Python MapReduce.
- Used Python and SAS to extract, transform and load source data from transaction systems and generated reports, insights and key conclusions.
- Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to transform and serialize raw data.
- Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
- Took proof-of-concept project ideas from the business and led, developed and created production pipelines that deliver business value using Azure Data Factory.
- Exposed transformed data in the Azure Databricks platform in Parquet format for efficient data storage (a Parquet-write sketch follows this list).
- Designed and implemented large-scale pub-sub message queues using Apache Kafka.
- Implemented Apache Airflow for authoring, scheduling and monitoring data pipelines.
- Worked on UDFs using Python for data cleansing.
- Performed data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and cloud migration, processing the data in Azure Databricks.
- Worked with hundreds of terabytes of data collections from different loan applications into HDFS.
- Developed and designed system to collect data from multiple portal using Kafka and then process it using spark.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
- Developed a near-real-time data pipeline using Flume, Kafka and Spark Streaming to ingest client data from their web log servers and apply transformations.
- Responsible for analyzing large data sets and deriving customer usage patterns by developing new MapReduce programs using Java.
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
- Wrote an Impala metastore sync-up in Scala for two separate clusters sharing the same metadata.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly with quick filters for on-demand information.
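A hedged sketch of the Hive-read, DataFrame-validation and Parquet-write pattern described in the bullets above; the table, columns and abfss storage path are hypothetical, not project values.

```python
# Illustrative PySpark job: read a Hive table, apply DataFrame-level
# validations, and write the curated result as Parquet to Azure storage.
# Table, storage account and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("hive-validate-parquet")
         .enableHiveSupport()
         .getOrCreate())

loans = spark.table("raw_db.loan_applications")  # hypothetical Hive table

# Example validations: drop duplicates, require key fields, bound numeric values
validated = (loans
             .dropDuplicates(["application_id"])
             .filter(col("application_id").isNotNull())
             .filter(col("loan_amount") > 0))

# Write curated data as partitioned Parquet (abfss path is a placeholder)
(validated.write
    .mode("overwrite")
    .partitionBy("application_date")
    .parquet("abfss://curated@storageaccount.dfs.core.windows.net/loans"))
```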
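A minimal sketch of the one-time SQL Server-to-Snowflake migration approach, assuming pyodbc, pandas and the Snowflake Python connector; connection details, table names and chunk size are placeholders, and the target table is assumed to already exist.

```python
# Hedged sketch of a one-time SQL Server -> Snowflake migration in Python.
# Connection strings, credentials and table names are placeholders; the
# target Snowflake table is assumed to exist with matching columns.
import pandas as pd
import pyodbc
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

src = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlsrv;DATABASE=sales;"
    "UID=etl_user;PWD=***"  # placeholder credentials
)
snow = snowflake.connector.connect(
    account="xy12345", user="etl_user", password="***",
    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
)

# Stream the source table in chunks to keep memory bounded
for chunk in pd.read_sql("SELECT * FROM dbo.orders", src, chunksize=100_000):
    # write_pandas appends each chunk into the target Snowflake table
    write_pandas(snow, chunk, table_name="ORDERS")

src.close()
snow.close()
```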
Environment: Agile, Power BI, Azure, Azure Databricks, Azure Data Factory, Azure Data Lake, Hadoop, Hortonworks, Snowflake, HDFS, Solr, HAR, HBase, Oozie, Scala, Python, SOAP API web services, Java, WebLogic, Tableau, Apache Airflow, Jira
Confidential, Atlanta, GA
Big Data Engineer
Responsibilities:
- Developed and programmed an ETL pipeline in Python to collect data from the Redshift data warehouse.
- Developed custom multi-threaded Java based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
- Installed Docker Registry for local upload and download of Docker images and from Docker Hub and created Docker files to automate the process of capturing and using the images.
- Programmatically created CICD Pipelines in Jenkins using Groovy scripts, Jenkins file, integrating a variety of Enterprise tools and Testing Frameworks into Jenkins for fully automated pipelines to move code from Dev Workstations to all the way to Prod environment.
- Experience with AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront and IAM to build, configure and troubleshoot server migrations from physical environments to the cloud using various Amazon Machine Images (AMIs).
- Responsible for ingesting large volumes of IoT data into Kafka.
- Migrate data into RV Data Pipeline using Databricks, Spark SQL and Scala.
- Used Databricks for encrypting data using server-side encryption.
- Used Reporting tools like Tableau to connect with Impala for generating daily reports of data.
- Involved in troubleshooting, performance tuning of reports and resolving issues within Tableau Server and reports.
- Worked on Docker Hub, Docker Swarm and Docker container networking, creating image files primarily for middleware installations and domain configurations. Evaluated Kubernetes for Docker container orchestration.
- Developed Oozie workflows for scheduling and orchestrating the ETL process. Involved in writing Python scripts to automate the extraction of weblogs using Airflow DAGs.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a producer sketch follows this list).
- Developed Scala based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
- Experience with Snowflake multi-cluster warehouses.
- Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
- Streamed real time data by integrating Kafka with Spark for dynamic price surging using machine learning algorithm.
- Experience integrating Jenkins with tools such as Maven (build), Git (repository), SonarQube (code verification) and Nexus (artifact repository), and implementing CI/CD automation by creating Jenkins pipelines programmatically, architecting Jenkins clusters, and scheduling day and overnight builds to support development needs.
- Designed and documented operational problems by following standards and procedures using JIRA.
- Experience with Snowflake virtual warehouses.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Worked on SSIS, creating all the interfaces between the front-end application and the SQL Server database, then from the legacy database to the SQL Server database and vice versa.
- Developed a Python script to transfer and extract data from on-premises systems to AWS S3 via REST APIs. Implemented a microservices-based cloud architecture using Spring Boot.
- Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
- Worked extensively with Sqoop for importing data from Oracle.
- Involved in creating Hive tables and loading and analyzing data using Hive scripts. Implemented partitioning, dynamic partitions and buckets in Hive.
- Leveraged AWS cloud services such as EC2, auto-scaling and VPC to build secure, highly scalable and flexible systems that handled expected and unexpected load bursts.
- Responsible for ingesting large volumes of user behavioral data and customer profile data to Analytics Data store.
- Used control flow tasks and container as well as Transformations in a complex design to build an algorithm to cleanse and consolidate data.
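A minimal sketch of a Kafka producer that polls an external REST API and publishes to a topic, as referenced above; the endpoint, brokers and topic name are hypothetical, and kafka-python is assumed as the client library.

```python
# Illustrative producer: poll an external REST API and publish each record
# to a Kafka topic. Endpoint, brokers, topic and polling interval are
# placeholders, not values from any actual engagement.
import json
import time
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],                  # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get("https://api.example.com/events")  # hypothetical endpoint
    resp.raise_for_status()
    for record in resp.json():
        producer.send("user-events", value=record)          # placeholder topic
    producer.flush()
    time.sleep(30)  # simple polling; a real job would track offsets or cursors
```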
Environment: AWS, EC2, S3, Lambda, CloudWatch, Auto Scaling, EMR, Redshift, Kafka, HBase, Docker, Kubernetes, Jenkins, ETL, Spark, Microservices, Snowflake, Hive, Athena, Sqoop, Pig, Oozie, Spark Streaming, Hue, Scala, Python, Databricks, Apache NiFi, Git
Confidential, California, CA
Hadoop Developer/Data Engineer
Responsibilities:
- Involved in architecture design, development and implementation of Hadoop deployment, backup and recovery systems.
- Used Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology for each job.
- Reviewed the HDFS usage and system design for future scalability and fault-tolerance. Installed and configured Hadoop HDFS, MapReduce, Pig, Hive, Sqoop.
- Worked on publishing interactive data visualization dashboards, reports and workbooks in Tableau and SAS Visual Analytics.
- Ran Hadoop Streaming jobs to process terabytes of XML-format data (a mapper sketch follows this list).
- Developed SSRS reports and SSIS packages to extract, transform and load data from various source systems.
- Implemented and managed ETL solutions and automated operational processes.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Managed and reviewed Hadoop log files.
- Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
- Created various complex SSIS/ETL packages to Extract, Transform and Load data
- Advanced knowledge of Confidential Redshift and MPP database concepts.
- Strong understanding of AWS components such as EC2 and S3
- Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
- Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler and Database Engine Tuning Advisor to enhance performance.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
- Analyzed the system for new enhancements/functionality and performed impact analysis of the application for implementing ETL changes.
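A hedged sketch of a Hadoop Streaming mapper for the XML processing mentioned above, assuming one XML record per input line and a hypothetical `type` element; it emits tab-separated counts for a summing reducer.

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming mapper. Assumes one XML record per input
# line (e.g. pre-flattened events); element names are hypothetical. Emits
# "event_type <TAB> 1" so a summing reducer can count events per type.
import sys
import xml.etree.ElementTree as ET

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = ET.fromstring(line)
    except ET.ParseError:
        continue  # skip malformed records rather than failing the task
    event_type = record.findtext("type", default="unknown")
    print(f"{event_type}\t1")
```

In a real run this mapper would be submitted with the hadoop-streaming jar alongside a matching summing reducer.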
Environment: Hadoop, AWS, EC2, S3, SQL Server, Erwin, Oracle, Redshift, Informatica, RDS, NoSQL, MySQL, DynamoDB, PostgreSQL, Tableau, GitHub.
Confidential
Hadoop Developer
Responsibilities:
- Installed, configured and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, Flume, Oozie, Zookeeper and Sqoop.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (a sketch follows this list).
- Designed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig.
- Involved in HDFS maintenance and administering it through Hadoop-Java API.
- Involved in creating Hive tables, working on them using HiveQL and performing data analysis using Hive and Pig.
- Wrote MapReduce jobs using Java API and Pig Latin.
- Used QlikView and D3 for visualization of query results required by the BI team.
- Developed multiple POCs using Scala and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata.
- Loaded load-ready files from mainframes to Hadoop and converted the files to ASCII format.
- Configured HiveServer2 (HS2) to enable analytical tools such as Tableau, QlikView and SAS to interact with Hive tables.
- Developed UDFs in Java for Hive and Pig and worked on reading multiple data formats on HDFS using Scala.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that perform data extraction.
- Analyzed SQL scripts and designed the solution to implement them using Scala.
- Developed analytical component using Scala, Spark and Spark Stream.
- Created POC to store Server Log data in MongoDB to identify System Alert Metrics.
- Monitored Hadoop cluster job performance, performed capacity planning and managed nodes on Hadoop cluster.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
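The Hive/SQL-to-Spark conversion above was done with Scala and RDDs; the sketch below shows the same general pattern with PySpark DataFrames for illustration only, with hypothetical table and column names.

```python
# Illustrative PySpark version of rewriting a HiveQL aggregation as DataFrame
# transformations (the original work used Scala and RDDs; this only shows the
# general pattern). Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL equivalent:
#   SELECT region, COUNT(*) AS txn_count, SUM(amount) AS total_amount
#   FROM sales.transactions
#   WHERE txn_date >= '2016-01-01'
#   GROUP BY region;
result = (spark.table("sales.transactions")
          .filter(F.col("txn_date") >= "2016-01-01")
          .groupBy("region")
          .agg(F.count("*").alias("txn_count"),
               F.sum("amount").alias("total_amount")))

result.show()
```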
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, ZooKeeper, Teradata, PL/SQL, MySQL, HBase, DataStage, ETL (Informatica/SSIS).
Confidential
ETL Developer
Responsibilities:
- Gathered requirements from Business and documented for project development.
- Involved in understanding the legacy applications & data relationships.
- Created data maps in Informatica to extract data from Sequential files.
- Scheduled processes in ESP Job Scheduler.
- Performed Unit, Integration and System testing of various jobs.
- Coordinated design reviews, ETL code reviews with teammates.
- Interacted with key users and assisted them with various data issues, understood data needs and assisted them with Data analysis.
- Prepared and maintained documentation for on-going projects.
- Worked with Informatica Power Center for data processing and loading files.
- Extensively worked with Informatica transformations.
- Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
- Attended user design sessions, studied user requirements, completed detail design analysis and wrote design specs.
- Extensively worked on UNIX Shell Scripting for file transfer and error logging.
Environment: Informatica Power Center, Oracle 10g, SQL Server, UNIX Shell Scripting, ESP Job Scheduler